時(shí)代的洪流奔涌而至,單細(xì)胞技術(shù)也從舊時(shí)王謝堂前燕包帚,飛入尋常百姓家。雪崩的時(shí)候运吓,沒有一片雪花是無辜的渴邦,你我也從素不相識(shí),到被一起卷入單細(xì)胞天地拘哨。R語言和Seurat已以勢(shì)如破竹之勢(shì)進(jìn)入4.0時(shí)代谋梭,天問一號(hào)探測(cè)器已進(jìn)入火星烏托邦平原了,你還不會(huì)單細(xì)胞嗎倦青?那么為了不被時(shí)代拋棄的太遠(yuǎn)瓮床,跟著我們一起通過學(xué)習(xí)seurat系列教程入門單細(xì)胞吧。
首先我們學(xué)習(xí)經(jīng)典的標(biāo)準(zhǔn)流程产镐,這個(gè)教程小編當(dāng)初入門時(shí)候曾經(jīng)花1000購買過纤垂,也曾花6000購買過,不同的單細(xì)胞培訓(xùn)班磷账,買的是一樣的教程。現(xiàn)在免費(fèi)送給你贾虽,別說話逃糟,開始學(xué)吧!
首先設(shè)置Seurat對(duì)象
我們將從 分析10X 外周血單核細(xì)胞 (PBMC) 數(shù)據(jù)集蓬豁。原始數(shù)據(jù)可以在這里找到绰咽。
讀取數(shù)據(jù)。該函數(shù)從 10X 讀取地粪,返回獨(dú)特的分子識(shí)別 (UMI) 計(jì)數(shù)矩陣取募。此矩陣中的值表示每個(gè)功能(即基因;行)在每個(gè)細(xì)胞(列)中檢測(cè)到的分子數(shù)量。
我們接下來使用計(jì)數(shù)矩陣創(chuàng)建一個(gè)對(duì)象蟆技。
library(dplyr)
library(Seurat)
library(patchwork)
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
標(biāo)準(zhǔn)預(yù)處理工作流程
以下步驟包括 Seurat 中 scRNA-seq 數(shù)據(jù)的標(biāo)準(zhǔn)預(yù)處理工作流程玩敏。包括基于 QC 指標(biāo)的過濾斗忌、數(shù)據(jù)標(biāo)準(zhǔn)化和歸一化,以及檢測(cè)高變異基因的功能旺聚。
QC 和選擇細(xì)胞以供進(jìn)一步分析
Seurat 允許您輕松地探索 QC 指標(biāo)织阳,并根據(jù)任何用戶定義的標(biāo)準(zhǔn)過濾細(xì)胞。常用的一些 QC 指標(biāo)包括
- 在每個(gè)細(xì)胞中檢測(cè)到的基因數(shù)量砰粹。
- 低質(zhì)量細(xì)胞或空液滴通常只有很少的基因
- 細(xì)胞雙倍或多胞可能表現(xiàn)出異常高的基因計(jì)數(shù)
- 同樣唧躲,細(xì)胞內(nèi)檢測(cè)到的分子總數(shù)(與獨(dú)特的基因密切相關(guān))
- 讀取該細(xì)胞中的線粒體基因組的百分比
- 低質(zhì)量/死細(xì)胞經(jīng)常表現(xiàn)出廣泛的線粒體污染
- 我們使用PercentageFeatureSet()函數(shù)計(jì)算線粒體 QC 指標(biāo),該函數(shù)計(jì)算來自一組功能的計(jì)數(shù)百分比
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
在下面的示例中碱璃,我們可視化 QC 指標(biāo)弄痹,并使用這些指標(biāo)來過濾細(xì)胞。
- 過濾具有UMI計(jì)數(shù)超過 2嵌器,500 或小于 200的細(xì)胞
- 過濾具有>5%線粒體的細(xì)胞
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
標(biāo)準(zhǔn)化數(shù)據(jù)
從數(shù)據(jù)集中刪除不需要的細(xì)胞后肛真,下一步是使數(shù)據(jù)標(biāo)準(zhǔn)化。默認(rèn)情況下嘴秸,我們采用全局標(biāo)準(zhǔn)化毁欣。
pbmc <- NormalizeData(pbmc)
高變異基因的選擇
接下來,我們計(jì)算數(shù)據(jù)集中顯示高變異的特征子集(即岳掐,它們?cè)谀承┘?xì)胞中表達(dá)強(qiáng)烈凭疮,在另一些單元格中表達(dá)得很低)。在下游分析中關(guān)注這些基因有助于突出單細(xì)胞數(shù)據(jù)集中的生物信號(hào)串述。
默認(rèn)情況下执解,我們使用每個(gè)數(shù)據(jù)集的 2,000 個(gè)基因纲酗。這些將用于下游分析衰腌,如 PCA。
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
歸一化數(shù)據(jù)
接下來觅赊,我們應(yīng)用線性轉(zhuǎn)換("歸一化")ScaleData()繼續(xù)處理數(shù)據(jù)集右蕊。目的是
- 改變每個(gè)基因的表達(dá),使細(xì)胞的平均表達(dá)為0
- 縮放每個(gè)基因的表達(dá)吮螺,使細(xì)胞之間的方差為 1
- 這一步驟在下游分析中具有同等的權(quán)重饶囚,因此高度表達(dá)的基因不會(huì)占主導(dǎo)地位
- 結(jié)果存儲(chǔ)在
pbmc[["RNA"]]@scale.data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
PCA分析
接下來,我們將在縮放數(shù)據(jù)上執(zhí)行PCA鸠补。默認(rèn)情況下萝风,只有先前確定的可變功能用作輸入,但如果您希望選擇不同的子集紫岩,則可以使用參數(shù)進(jìn)行定義规惰。
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
Seurat 提供了幾個(gè)有用的方法來可視化定義 PCA 的單元格和功能,包括 泉蝌,[DimPlot()]歇万,[DimHeatmap()]等
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL
## Negative: MALAT1, LTB, IL32, IL7R, CD2
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
## Negative: NKG7, PRF1, CST7, GZMB, GZMA
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
## Negative: PPBP, PF4, SDPR, SPARC, GNG11
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY
## Negative: LTB, IL7R, CKB, VIM, MS4A7
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
DimPlot(pbmc, reduction = "pca")
特別是揩晴,它允許在數(shù)據(jù)集中輕松探索異質(zhì)性的主要來源,并且在嘗試決定將哪些 PC 用于進(jìn)一步下游分析
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
確定數(shù)據(jù)集的"主成分個(gè)數(shù)"
Seurat 根據(jù) PCA 分?jǐn)?shù)對(duì)單元單元進(jìn)行聚類堕花,每臺(tái) PC 基本上代表一個(gè)"元結(jié)構(gòu)"文狱,該"元結(jié)構(gòu)"將信息組合在相關(guān)功能集中。我們隨機(jī)排列數(shù)據(jù)的子集(默認(rèn)情況下為 1%)并重新運(yùn)行 PCA缘挽,構(gòu)建功能分?jǐn)?shù)的"空分布"瞄崇,并重復(fù)此過程。我們確定"重要"PC壕曼。
# NOTE: This process can take a long time for big datasets, comment out for expediency. More
# approximate techniques such as those implemented in ElbowPlot() can be used to reduce
# computation time
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
JackStrawPlot()功能提供了一個(gè)可視化工具苏研,用于將每個(gè) PC 的 p 值分布與統(tǒng)一分布(虛線)進(jìn)行比較。"重要"PC 將顯示在虛線上方腮郊。
JackStrawPlot(pbmc, dims = 1:15)
另一種啟發(fā)式方法生成"肘部圖":[ElbowPlot()]根據(jù)每個(gè)(函數(shù))解釋的方差百分比對(duì)原則組件進(jìn)行排名摹蘑。在此示例中,我們可以觀察到 PC9-10 周圍的"肘部"轧飞,表明大多數(shù)真實(shí)信號(hào)在前 10 個(gè) PC 中被捕獲衅鹿。
ElbowPlot(pbmc)
我們?cè)谶@里選擇了 10 個(gè),但這不是確定的过咬。
將細(xì)胞聚類
Seurat 采用基于圖形的聚類方法大渤,簡言之,這些方法將細(xì)胞嵌入到圖形結(jié)構(gòu)中掸绞,例如 K 最近的鄰鄰 (KNN) 圖泵三,在具有類似特征表達(dá)模式的單元之間繪制邊緣,然后嘗試將此圖劃分為高度互連的"集團(tuán)"或"社區(qū)"衔掸。
與表象一樣烫幕,我們首先根據(jù) PCA 空間中的歐幾里德距離構(gòu)建 KNN 圖,并根據(jù)當(dāng)?shù)厣鐓^(qū)的共享重疊(Jaccard 相似性)優(yōu)化任意兩個(gè)細(xì)胞之間的邊緣權(quán)重敞映。
接下來將 Louvain 算法(默認(rèn)值)或 SLM 等模塊化優(yōu)化技術(shù)應(yīng)用于迭代組細(xì)胞较曼,以優(yōu)化標(biāo)準(zhǔn)模塊化功能。該函數(shù)實(shí)現(xiàn)此過程振愿,并包含一個(gè)分辨率參數(shù)诗芜,該參數(shù)設(shè)置下游聚類的"數(shù)量",增加值導(dǎo)致更多群集埃疫。我們發(fā)現(xiàn),將此參數(shù)設(shè)置在 0.4-1.2 之間通常會(huì)為大約 3K 細(xì)胞的單細(xì)胞數(shù)據(jù)集提供良好的結(jié)果孩哑。對(duì)于較大的數(shù)據(jù)集栓霜,最佳分辨率通常會(huì)增加。
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 95965
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8723
## Number of communities: 9
## Elapsed time: 0 seconds
# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## 2 3 2 1
## AAACCGTGTATGCG-1
## 6
## Levels: 0 1 2 3 4 5 6 7 8
運(yùn)行非線性降維(UMAP/tSNE)
Seurat 提供了多種非線性降維技術(shù)横蜒,如 tSNE 和 UMAP胳蛮,以可視化和探索這些數(shù)據(jù)集销凑。。
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
pbmc <- RunUMAP(pbmc, dims = 1:10)
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(pbmc, reduction = "umap")
此時(shí)仅炊,您可以保存對(duì)象斗幼,以便輕松加載,而無需重新運(yùn)行上面執(zhí)行的計(jì)算密集級(jí)步驟抚垄,或輕松與協(xié)作者共享蜕窿。
saveRDS(pbmc, file = "../output/pbmc_tutorial.rds")
查找不同表達(dá)的marker
默認(rèn)情況下,F(xiàn)indAllMarkers()與所有其他細(xì)胞相比呆馁,可識(shí)別單個(gè)群集的marker桐经。 但您也可以測(cè)試組群相互對(duì)立,或針對(duì)所有亞群浙滤。
# find all markers of cluster 2
cluster2.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
head(cluster2.markers, n = 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## IL32 2.593535e-91 1.2154360 0.949 0.466 3.556774e-87
## LTB 7.994465e-87 1.2828597 0.981 0.644 1.096361e-82
## CD3D 3.922451e-70 0.9359210 0.922 0.433 5.379250e-66
## IL7R 1.130870e-66 1.1776027 0.748 0.327 1.550876e-62
## LDHB 4.082189e-65 0.8837324 0.953 0.614 5.598314e-61
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## FCGR3A 2.150929e-209 4.267579 0.975 0.039 2.949784e-205
## IFITM3 6.103366e-199 3.877105 0.975 0.048 8.370156e-195
## CFD 8.891428e-198 3.411039 0.938 0.037 1.219370e-193
## CD68 2.374425e-194 3.014535 0.926 0.035 3.256286e-190
## RP11-290F20.3 9.308287e-191 2.722684 0.840 0.016 1.276538e-186
# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
## # A tibble: 18 x 7
## # Groups: cluster [9]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 1.74e-109 1.07 0.897 0.593 2.39e-105 0 LDHB
## 2 1.17e- 83 1.33 0.435 0.108 1.60e- 79 0 CCR7
## 3 0\. 5.57 0.996 0.215 0\. 1 S100A9
## 4 0\. 5.48 0.975 0.121 0\. 1 S100A8
## 5 7.99e- 87 1.28 0.981 0.644 1.10e- 82 2 LTB
## 6 2.61e- 59 1.24 0.424 0.111 3.58e- 55 2 AQP3
## 7 0\. 4.31 0.936 0.041 0\. 3 CD79A
## 8 9.48e-271 3.59 0.622 0.022 1.30e-266 3 TCL1A
## 9 1.17e-178 2.97 0.957 0.241 1.60e-174 4 CCL5
## 10 4.93e-169 3.01 0.595 0.056 6.76e-165 4 GZMK
## 11 3.51e-184 3.31 0.975 0.134 4.82e-180 5 FCGR3A
## 12 2.03e-125 3.09 1 0.315 2.78e-121 5 LST1
## 13 1.05e-265 4.89 0.986 0.071 1.44e-261 6 GZMB
## 14 6.82e-175 4.92 0.958 0.135 9.36e-171 6 GNLY
## 15 1.48e-220 3.87 0.812 0.011 2.03e-216 7 FCER1A
## 16 1.67e- 21 2.87 1 0.513 2.28e- 17 7 HLA-DPB1
## 17 7.73e-200 7.24 1 0.01 1.06e-195 8 PF4
## 18 3.68e-110 8.58 1 0.024 5.05e-106 8 PPBP
幾個(gè)可視化marker表達(dá)的工具阴挣。
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
[DoHeatmap()](https://satijalab.org/seurat/reference/DoHeatmap.html)
generates an expression heatmap for given cells and features. In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
亞群重命名
我們可以使用marker來輕松地將聚類與已知的單元類型進(jìn)行匹配:
Cluster ID Markers Cell Type
0 IL7R, CCR7 Naive CD4+ T
1 CD14, LYZ CD14+ Mono
2 IL7R, S100A4 Memory CD4+
3 MS4A1 B
4 CD8A CD8+ T
5 FCGR3A, MS4A7 FCGR3A+ Mono
6 GNLY, NKG7 NK
7 FCER1A, CST3 DC
8 PPBP Platelet
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
最后保存數(shù)據(jù)。
saveRDS(pbmc, file = "../output/pbmc3k_final.rds")