一、創(chuàng)建 Seurat 對象
使用的示例數(shù)據(jù)集來自10X Genome 測序的 Peripheral Blood Mononuclear Cells (PBMC)。
library(dplyr)
library(Seurat)
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
二蹂空、標(biāo)準(zhǔn)預(yù)處理流程
流程包括:
- 基于質(zhì)控指標(biāo)(QC metric)來篩選細(xì)胞
- 數(shù)據(jù)歸一化和縮放
- 高異質(zhì)性基因檢測
1.基因質(zhì)控指標(biāo)來篩選細(xì)胞
質(zhì)控指標(biāo):
-
每個(gè)細(xì)胞中檢測到的基因數(shù)
- 低質(zhì)量的細(xì)胞和空油滴(droplet)只有少量基因
- 兩個(gè)及以上的細(xì)胞會有異常的高基因數(shù)
每個(gè)細(xì)胞中的UMI總數(shù)(與上類似)
-
線粒體基因組的reads比例
低質(zhì)量或死細(xì)胞會有大百分比的線粒體基因組
使用
PercentageFeatureSet
函數(shù)來計(jì)數(shù)線粒體質(zhì)控指標(biāo)MT-
是線粒體基因
# 計(jì)算線粒體read的百分比
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# 顯示前5個(gè)細(xì)胞的質(zhì)控指標(biāo)
head(pbmc@meta.data, 5)
通過上圖,過濾標(biāo)準(zhǔn)設(shè)定為:
- 過濾UMI數(shù)大于2500,小于200的細(xì)胞
- 過濾線粒體百分比大于5%的細(xì)胞
查看特征與特征間的相關(guān)性
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
過濾
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
看看相關(guān)性
p1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
p2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
CombinePlots(plots = list(p1, p2))
2.歸一化數(shù)據(jù)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
LogNormalize that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in
pbmc[["RNA"]]@data
.上述代碼可以替換為:pbmc <- NormalizeData(pbmc)
3.識別高異質(zhì)性特征
高異質(zhì)性:這些特征在有的細(xì)胞中高表達(dá)献丑,有的細(xì)胞中低表達(dá)。在下游分析中關(guān)注這些基因有助于找到單細(xì)胞數(shù)據(jù)集中的生物信號[https://www.nature.com/articles/nmeth.2645 ]
# 識別前2000個(gè)特征
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# 識別前10的高異質(zhì)性基因
top10 <- head(VariableFeatures(pbmc), 10)
# 繪圖看看
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
CombinePlots(plots = list(plot1, plot2))
4.縮放數(shù)據(jù)
這是在PCA等降維操作前的一個(gè)步驟侠姑,ScaleData
函數(shù):
- 轉(zhuǎn)換每個(gè)基因的表達(dá)值创橄,使每個(gè)細(xì)胞的平均表達(dá)值為0
- 轉(zhuǎn)換每個(gè)基因的表達(dá)值,使細(xì)胞間方差為1
- 此步驟在下游分析中具有相同的權(quán)重莽红,因此高表達(dá)的基因不會占主導(dǎo)地位
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
head(pbmc[["RNA"]]@scale.data,5)
5.線性維度約化 PCA
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
可視化細(xì)胞與特征間的PCA有三種方式:
VizDimLoadings
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
# 繪圖
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
DimPlot
DimPlot(pbmc, reduction = "pca")
DimHeatmap
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
主要用來查看數(shù)據(jù)集中的異質(zhì)性的主要來源妥畏,并且可以確定哪些PC維度可以用于下一步的下游分析。
細(xì)胞和特征根據(jù)PCA分?jǐn)?shù)來排序
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
5.確定數(shù)據(jù)集的維度
為了克服在單細(xì)胞數(shù)據(jù)中在單個(gè)特征中的技術(shù)噪音安吁,Seurat 聚類細(xì)胞是基于PCA分?jǐn)?shù)的醉蚁。每個(gè)PC代表著一個(gè)‘元特征’(帶有跨相關(guān)特征集的信息)。因此鬼店,最主要的主成分代表了壓縮的數(shù)據(jù)集馍管。問題是要選多少PC呢?
方法一:
作者受JackStraw procedure 啟發(fā)薪韩。隨機(jī)置換數(shù)據(jù)的一部分子集(默認(rèn)1%)再運(yùn)行PCA确沸,構(gòu)建了一個(gè)'null distribution'的特征分?jǐn)?shù)捌锭,重復(fù)這一步。最終會識別出低P-value特征的顯著PCs
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
# 繪圖看看
JackStrawPlot(pbmc, dims = 1:15)
In this case it appears that there is a sharp drop-off in significance after the first 10-12 PCs
在上圖中展示出在前10到12臺PC之后罗捎,重要性顯著下降
方法二:
“ElbowPlot”:基于每個(gè)分量所解釋的方差百分比對主要成分進(jìn)行排名观谦。 在此示例中,我們可以在PC9-10周圍觀察到“elbow ”桨菜,這表明大多數(shù)真實(shí)信號是在前10臺PC中捕獲的豁状。
為了識別出數(shù)據(jù)的真實(shí)維度,有三種方法:
- 用更加受監(jiān)督的方法來確定PCs的異質(zhì)性倒得,比如可以結(jié)合GSEA來分析( The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example )
- The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff.
- The third is a heuristic that is commonly used, and can be calculated instantly.
在這個(gè)例子中三種方法均產(chǎn)生了相似的結(jié)果泻红,以PC 7-12作為閾值。
這個(gè)例子中霞掺,作者選擇10谊路,但是實(shí)際過程中還要考慮:
- 樹突狀細(xì)胞和NK細(xì)胞可能在PCs12和13中識別,這可能定義了罕見的免疫亞群(比如菩彬,MZB1是漿細(xì)胞樣的er)缠劝。但是除非有一定的知識量,否則很難從背景噪音中發(fā)現(xiàn)骗灶。
- 用戶可以選擇不同的PCs再進(jìn)行下游分析惨恭,比如選10,15耙旦,50等脱羡。結(jié)果常常有很多的不同。
- 建議在選擇該參數(shù)時(shí)候免都,盡量偏高一點(diǎn)轻黑。如果僅僅使用5PCs會對下游分析產(chǎn)生不利影響
6.聚類細(xì)胞
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
# 查看前5聚類
head(Idents(pbmc), 5)
7.非線性維度約化(UMAP/TSNE)
# 使用UMAP聚類
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")
# 顯示在聚類標(biāo)簽
DimPlot(pbmc, reduction = "umap", label = TRUE)
# 使用TSNE聚類
pbmc <- RunTSNE(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "tsne")
# 顯示在聚類標(biāo)簽
DimPlot(pbmc, reduction = "tsne", label = TRUE)
8.發(fā)現(xiàn)差異表達(dá)特征(cluster bioers)
# find all ers of cluster 1
cluster1.ers <- Finders(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.ers, n = 5)
# find all ers distinguishing cluster 5 from clusters 0 and 3
cluster5.ers <- Finders(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.ers, n = 5)
# find ers for every cluster compared to all remaining cells, report only the positive ones
pbmc.ers <- FindAllers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.ers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
# the ROC test returns the ‘classification power’ for any individual er (ranging from 0 - random, to 1 - perfect)
cluster1.ers <- Finders(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
默認(rèn)情況下,識別單個(gè)分簇的 positive and negative ers
FindAllers
會識別每個(gè)分簇的ers琴昆,可以測試該分簇與其他分簇或全部細(xì)胞的差異
-
min.pct
參數(shù)氓鄙,在兩組細(xì)胞中的任何一組中檢測到的最小百分 -
thresh.test
參數(shù),在兩組細(xì)胞間以一定數(shù)量的差異表達(dá)(平均) -
max.cells.per.ident
參數(shù)业舍,通過降低每個(gè)類的采樣值抖拦,提高計(jì)算速度
# 繪圖看看
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
[外鏈圖片轉(zhuǎn)存失敗,源站可能有防盜鏈機(jī)制,建議將圖片保存下來直接上傳(img-4zDmBzyp-1573047498978)(C:\Users\baimo\Pictures\1知乎\jvHG5OC0MkXF.png)]
RidgePlot(pbmc, features = c("MS4A1", "CD79A"))
DotPlot(pbmc, features = c("MS4A1", "CD79A"))
top10 <- pbmc.ers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
9.識別細(xì)胞類型
幸運(yùn)的是,在這個(gè)數(shù)據(jù)集的情況下舷暮,我們可以使用規(guī)范標(biāo)記輕松地將無偏聚類與已知的單元類型相匹配态罪。
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()