1.Seurat包的安裝及數(shù)據(jù)的準備
1.1 加載需要的包和數(shù)據(jù)
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Seurat")
library(Seurat)
# devtools::install_github('satijalab/seurat-data')
library(SeuratData)
library(ggplot2)
library(patchwork)
library(dplyr)
##如果有未安裝的包影所,運行如下命令安裝:install.packages("包的名字")
1.2 創(chuàng)建Seurat對象
對于之前從CellRanger得到的比對結果纵装,讀取sample/outs/filtered_feature_bc_matrix文件夾下的三個文件:barcodes.tsv(1列钧汹,為barcode名)相满;genes.tsv(2列层亿,第1列為ENS編號,第2列為基因名)立美;matrix.mtx(3列匿又,第1列為基因編號,第2列為細胞編號建蹄,第3列為對應的reads數(shù))
pbmc.data <- Read10X(data.dir = "sample/outs/filtered_feature_bc_matrix")
##使用未經(jīng)標準化的數(shù)據(jù)創(chuàng)建Seurat對象
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
##project:項目名碌更;過濾條件:min.cells = 3 一個基因最少在3個細胞中檢測到,min.features = 200一個細胞中最少檢測到200個基因
1.3 認識Seurat對象的結構
Seurat對象中儲存了關于這個單細胞項目的幾乎所有信息洞慎,包括每個細胞的barcodes痛单,原始表達矩陣以及運行過哪些分析等,后續(xù)對單細胞的分群注釋等信息都是保存在Seurat對象中的拢蛋。在R中可以使用str函數(shù)查看數(shù)據(jù)結構桦他。
str(pbmc)
Formal class 'Seurat' [package "SeuratObject"] with 13 slots
..@ assays :List of 1
.. ..$ RNA:Formal class 'Assay' [package "SeuratObject"] with 8 slots
.. .. .. ..@ counts :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. .. .. .. .. ..@ i : int [1:2282976] 29 73 80 148 163 184 186 227 229 230 ...
.. .. .. .. .. ..@ p : int [1:2701] 0 779 2131 3260 4220 4741 5522 6304 7094 7626 ...
.. .. .. .. .. ..@ Dim : int [1:2] 13714 2700
.. .. .. .. .. ..@ Dimnames:List of 2
.. .. .. .. .. .. ..$ : chr [1:13714] "AL627309.1" "AP006222.2" "RP11-206L10.2" "RP11-206L10.9" ...
.. .. .. .. .. .. ..$ : chr [1:2700] "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1" "AAACCGTGCTTCCG-1" ...
.. .. .. .. .. ..@ x : num [1:2282976] 1 1 2 1 1 1 1 41 1 1 ...
.. .. .. .. .. ..@ factors : list()
.. .. .. ..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. .. .. .. .. ..@ i : int [1:2282976] 29 73 80 148 163 184 186 227 229 230 ...
.. .. .. .. .. ..@ p : int [1:2701] 0 779 2131 3260 4220 4741 5522 6304 7094 7626 ...
.. .. .. .. .. ..@ Dim : int [1:2] 13714 2700
.. .. .. .. .. ..@ Dimnames:List of 2
.. .. .. .. .. .. ..$ : chr [1:13714] "AL627309.1" "AP006222.2" "RP11-206L10.2" "RP11-206L10.9" ...
.. .. .. .. .. .. ..$ : chr [1:2700] "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1" "AAACCGTGCTTCCG-1" ...
.. .. .. .. .. ..@ x : num [1:2282976] 1 1 2 1 1 1 1 41 1 1 ...
.. .. .. .. .. ..@ factors : list()
.. .. .. ..@ scale.data : num[0 , 0 ]
.. .. .. ..@ key : chr "rna_"
.. .. .. ..@ assay.orig : NULL
.. .. .. ..@ var.features : logi(0)
.. .. .. ..@ meta.features:'data.frame': 13714 obs. of 0 variables
.. .. .. ..@ misc : list()
..@ meta.data :'data.frame': 2700 obs. of 4 variables:
.. ..$ orig.ident : Factor w/ 1 level "pbmc3k": 1 1 1 1 1 1 1 1 1 1 ...
.. ..$ nCount_RNA : num [1:2700] 2419 4903 3147 2639 980 ...
.. ..$ nFeature_RNA: int [1:2700] 779 1352 1129 960 521 781 782 790 532 550 ...
.. ..$ percent.mt : num [1:2700] 3.02 3.79 0.89 1.74 1.22 ...
..@ active.assay: chr "RNA"
..@ active.ident: Factor w/ 1 level "pbmc3k": 1 1 1 1 1 1 1 1 1 1 ...
.. ..- attr(*, "names")= chr [1:2700] "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1" "AAACCGTGCTTCCG-1" ...
..@ graphs : list()
..@ neighbors : list()
..@ reductions : list()
..@ images : list()
..@ project.name: chr "pbmc3k"
..@ misc : list()
..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
.. ..$ : int [1:3] 4 1 0
..@ commands : list()
..@ tools : list()
以PBMC為例,訪問Seurat對象中的數(shù)據(jù)
輸入pbmc@會自動彈出Seurat對象的二級結構
我們訪問最多的就是assays和meta.data
pbmc@assays
$RNA
Assay data with 13714 features for 2700 cells
First 10 features:
AL627309.1, AP006222.2, RP11-206L10.2, RP11-206L10.9, LINC00115,
NOC2L, KLHL17, PLEKHN1, RP11-54O7.17, HES4
##pbmc@assays$RNA中包含著2700個細胞的13714個基因的信息谆棱,包括原始counts(pbmc@assays$RNA@counts)
head(pbmc@meta.data)
orig.ident nCount_RNA nFeature_RNA percent.mt
AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759
AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958
AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363
AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845
AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898
AAACGCACTGGTAC-1 pbmc3k 2163 781 1.6643551
#pbmc[[]] 可進入meta.data 相當于pbmc@meta.data快压,,對meta.data下級數(shù)據(jù)調用方式跟數(shù)據(jù)框取列類似垃瞧,pbmc@meta.data$percent.mt蔫劣,另外pbmc[["percent.mt"]]也可以表示選擇percent.mt這一列
2.數(shù)據(jù)的過濾
2.1 確定數(shù)據(jù)過濾條件
#計算線粒體基因占總基因數(shù)目的百分比,pattern后可以是正則表達式个从。
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
#小提琴圖展示meta.data中的數(shù)據(jù)脉幢,ncol 表示顯示多個圖時的列數(shù)
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
可以看到基本上線粒體基因都在5%以下,nFeature基本在200-2000之間嗦锐,我們可以根據(jù)圖示結果篩選去除線粒體基因和選擇出基因的表達量區(qū)間嫌松。
2.2 檢查數(shù)據(jù)的可用性
FeatureScatter 可視化meta.data中數(shù)據(jù)的關系,其中細胞的顏色由其identity class決定奕污,圖上方數(shù)字表示其皮爾遜相關系數(shù)
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
可以看到nCount_RNA和線粒體基因間無相關萎羔,表明測序得到的Count基本都是細胞的功能基因,nCount_RNA和nFeature_RNA強相關碳默,符合邏輯贾陷。
2.3 數(shù)據(jù)過濾
subset函數(shù)用于取子集缘眶,subset(x, subset, select, drop = FALSE, ...),x表示操作對象髓废,subset表示所取子集的邏輯值
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
3.數(shù)據(jù)的標準化
Seurat包中自帶了標準化函數(shù)巷懈,NormalizeData。
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 1e4)
#normalization.method有LogNormalize慌洪,CLR顶燕,RC三種。
#LogNormalize:每個細胞的基因數(shù)數(shù)除以該細胞的總基因數(shù)蒋譬,再乘以scale.factor割岛。然后使用log(x+1)進行自然對數(shù)轉換。
#CLR:應用居中的對數(shù)比率變換
#RC: 相對計數(shù)犯助。每個細胞的功能計數(shù)除以該細胞的總計數(shù),再乘以scale.factor维咸。沒有應用log轉換剂买。對于每百萬計數(shù)(CPM)設置規(guī)模。系數(shù)= 1 e6
4.識別差異基因
#選取2000個差異基因
pbmc <- FindVariableFeatures(pbmc, selection.method = 'vst', nfeatures = 2000)
#selection.method:vst癌蓖,mean.var.plot (mvp)瞬哼,dispersion (disp)。
#nfeatures:選擇差異基因的數(shù)量;僅在selection.method設置為“dispersion”或“vst”時使用租副。
# 選擇前10的差異基因
top10 <- head(VariableFeatures(pbmc), 10)
# 帶標簽與否的展示差異基因
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
根據(jù)項目背景需要坐慰,決定是否用ScaleData 函數(shù)去除細胞周期或線粒體基因的影響,若差異基因中出現(xiàn)線粒體基因或者細胞周期相關基因且與項目背景無關用僧,可以選擇使用此函數(shù)结胀。
#消除線粒體基因的影響
pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")
5.降維聚類分群
5.1根據(jù)差異基因進行主成分分析
pbmc <- RunPCA(pbmc, features = VariableFeatures( pbmc))
5.2計算給定數(shù)據(jù)的近鄰
FindNeighbors函數(shù)計算給定數(shù)據(jù)集的k.param近鄰。也可以選擇(compute.SNN)责循,通過計算每個cell與其k.param近鄰之間的鄰域重疊(Jaccard index)來構造一個共享近鄰圖糟港。參考鏈接:Seurat識別細胞類群的原理(FindNeighbors和FindClusters),FindNeighbors {Seurat}
#dim輸入降維的維度院仿,resolution分辨率秸抚,判斷近鄰距離大小,值越低歹垫,聚類越少剥汤。
pbmc <- FindNeighbors(pbmc, dims = 1:10)
?FindNeighbors
Description:
Constructs a Shared Nearest Neighbor (SNN) Graph for a given
dataset. We first determine the k-nearest neighbors of each cell.
We use this knn graph to construct the SNN graph by calculating
the neighborhood overlap (Jaccard index) between every cell and
its k.param nearest neighbors.
這個地方說明,這個函數(shù)首先是計算每個細胞的KNN排惨,也就是計算每個細胞之間的相互距離吭敢,依據(jù)細胞之間距離的graph來構建snn graph(依據(jù)細胞之間“鄰居”的overlop)
這里有三個問題:1、knn是什么若贮,2省有、Jaccard index又是什么 3痒留、鄰居的判定
我們來看看參數(shù)(主要參數(shù)):
distance.matrix: Boolean value of whether the provided matrix is a
distance matrix; note, for objects of class ‘dist’, this
parameter will be set automatically
這個參數(shù)我們通常不會設置,但是默認是TRUE蠢沿。
k.param: Defines k for the k-nearest neighbor algorithm
這個參數(shù)就是用來定義最相近的幾個細胞作為鄰居伸头,默認是20
compute.SNN: also compute the shared nearest neighbor graph
計算共享鄰居的數(shù)量,一般不設置
prune.SNN: Sets the cutoff for acceptable Jaccard index when computing
the neighborhood overlap for the SNN construction. Any edges
with values less than or equal to this will be set to 0 and
removed from the SNN graph. Essentially sets the strigency of
pruning (0 - no pruning, 1 - prune everything).
在計算SNN構造的鄰域重疊時舷蟀,為可接受的Jaccard index設置截止值恤磷。 值小于或等于此值的任何邊將被設置為0并從SNN圖中刪除。 本質上設置修剪的嚴格程度(0-不修剪野宜,1-修剪所有內容)扫步。
nn.method: Method for nearest neighbor finding. Options include: rann,
annoy
這個參數(shù)提供了如何判斷鄰居的方法,提供的可選是rann和annoy匈子,**這個我們在后面討論**河胎。
annoy.metric: Distance metric for annoy. Options include: euclidean,
cosine, manhattan, and hamming
(annoy距離的方式)
force.recalc: Force recalculation of SNN
SNN強制重新計算,一般不設置
5.3依據(jù)FindNeighbors函數(shù)的結果虎敦,計算分群聚類
pbmc <- FindClusters(pbmc, resolution = 0.5)
?FindClusters
Description:
Identify clusters of cells by a shared nearest neighbor (SNN)
modularity optimization based clustering algorithm. First
calculate k-nearest neighbors and construct the SNN graph. Then
optimize the modularity function to determine clusters. For a full
description of the algorithms, see Waltman and van Eck (2013) _The
European Physical Journal B_. Thanks to Nigel Delaney
(evolvedmicrobe@github) for the rewrite of the Java modularity
optimizer code in Rcpp!
這個說明游岳,依據(jù)SNN來識別類群,當然算法很復雜其徙,我們可以參考給的網(wǎng)址胚迫。
我們來看看主要參數(shù)
resolution: Value of the resolution parameter, use a value above
(below) 1.0 if you want to obtain a larger (smaller) number
of communities.
這個參數(shù)可以理解為清晰度,值越低唾那,可以容納更少的共享鄰居數(shù)量访锻,聚類數(shù)也會變少。
modularity.fxn: 計算模塊系數(shù)函數(shù)闹获,1為標準函數(shù)期犬;2為備選函數(shù),這里沒有具體說明是什么函數(shù)昌罩,我認為1是上面提到的Kronecker delta函數(shù)哭懈。
method: Method for running leiden (defaults to matrix which is fast
for small datasets). Enable method = "igraph" to avoid
casting large data to a dense matrix.
這個參數(shù)表示leiden算法的計算方式,(我對算法是小白~茎用,求大神告知)
algorithm: 模塊系數(shù)優(yōu)化算法遣总,1使用原始Louvain算法;2使用Louvain algorithm with multilevel refinement轨功;3使用SLM算法旭斥;4使用Leiden算法(注:4需要額外安裝插件)
n.start: 隨機開始的數(shù)量,默認是10
random.seed: 隨機數(shù)種子古涧,默認是0
5.4查看前5個細胞的聚類ID
head(Idents(pbmc), 5)
AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
0 3 2
AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
1 6
Levels: 0 1 2 3 4 5 6 7 8
5.5計算每個聚類包含的細胞數(shù)
table(pbmc$seurat_clusters)
0 1 2 3 4 5 6 7 8
709 480 429 342 316 162 154 32 14
5.6降維
pbmc <- RunUMAP(pbmc, dims = 1:10)
#note that you can set `label = TRUE` or use the LabelClusters function to help label individual clusters
DimPlot(pbmc, reduction = 'umap',label = TRUE)
降維過后會得到如下所示的分群情況:
6.單細胞亞群注釋
此時需要根據(jù)生物學相關知識垂券,可視化各單細胞亞群的標記基因,并根據(jù)其marker確定細胞分群」阶Γ可根據(jù)參考文獻或者這個數(shù)據(jù)庫算芯。
#FindMarkers查找亞群的標記基因
# 發(fā)現(xiàn)聚類1的所有biomarkers
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)
# 查找將聚類1與聚類2和3區(qū)分的所有標記基因
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(2, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
# 與所有其他亞群相比,找到每個亞群的標記凳宙,僅報告陽性細胞
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
top10markers<-pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
DoHeatmap(pbmc , features = top10markers$gene, size = 3)
根據(jù)熱圖結果熙揍,在CellMarker網(wǎng)站可查詢,熱圖中分不清楚的氏涩,還可以單獨對該亞群運行FindMarkers函數(shù)后作圖查詢届囚。結果如下
Cluster ID | Markers | Cell Type |
---|---|---|
0 | IL7R, CCR7 | Naive CD4+ T |
1 | IL7R, S100A4 | Memory CD4+ |
2 | CD14, LYZ | CD14+ Mono |
3 | MS4A1 | B |
4 | CD8A | CD8+ T |
5 | FCGR3A, MS4A7 | FCGR3A+ Mono |
6 | GNLY, NKG7 | NK |
7 | FCER1A, CST3 | DC |
8 | PPBP | Platelet |
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A",
"LYZ", "PPBP", "CD8A"))
6.1更改亞群注釋
#新id與之前id(0-8)一一對應
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T","B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
#更改seuratd對象中的Idents
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = 'umap',
label = TRUE, pt.size = 0.5) + NoLegend()
6.2差異基因在不同亞群中可視化
##以下features可換成任意基因
##小提琴圖
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
##坐標映射圖
FeaturePlot(pbmc, features = c("MS4A1", "CD79A"))
##峰巒圖
RidgePlot(pbmc, features = c("MS4A1", "CD79A"), ncol = 1)
features= c('IL7R', 'CCR7','CD14', 'LYZ', 'IL7R', 'S100A4',"MS4A1", "CD8A",'FOXP3',
'FCGR3A', 'MS4A7', 'GNLY', 'NKG7',
'FCER1A', 'CST3','PPBP')
##氣泡圖
DotPlot(pbmc, features = unique(features)) + RotatedAxis()
#downsample 在每個細胞亞群中均抽樣100個細胞做熱圖
DoHeatmap(subset(pbmc, downsample = 100),
features = features, size = 3)
參考來源
Seurat 包圖文詳解 | 單細胞轉錄組(scRNA-seq)分析02_白墨石的博客-CSDN博客_seurat單細胞測序
Seurat識別細胞類群的原理(FindNeighbors和FindClusters) - 簡書 (jianshu.com)
致謝
I thank Dr.Jianming Zeng(University of Macau), and all the members of his bioinformatics team, biotrainee, for generously sharing their experience and codes.
THE END