2018年8月份的時候脆霎,我也使用過seurat來分析單細(xì)胞測序數(shù)據(jù)涡驮,然后最近也需要使用seurat包來分析實(shí)驗(yàn)室的單細(xì)胞測序數(shù)據(jù)固耘,在R中安裝完seurat的包后眷射,我到網(wǎng)站上下載了pbmc3k_tutorial.Rmd的指導(dǎo)文件,下載下來按著這個markdown文件執(zhí)行雕沿,發(fā)現(xiàn)好像跟我暑期的代碼不大一樣啊练湿,然后我又返回到網(wǎng)站上重新看了下,發(fā)現(xiàn)原來是“On April 16, 2019 - we officially updated the Seurat CRAN repository to release 3.0!” 在2019年4月16的時候更新版本啦审轮。好吧肥哎,R包更新的話,那么之前的代碼就是以前的版本了断国,如果使用新的版本贤姆,那么代碼就不能用,但是呢稳衬,原理是一樣的霞捡,而且發(fā)現(xiàn)新版本更好用了,自己敲代碼都少了哈哈哈薄疚。
一碧信、總的分析流程
Note:自己的一個理解但是可能并不準(zhǔn)確
二、具體的步驟
使用的數(shù)據(jù)是Seurat提供的pbmc_3k的10x單細(xì)胞數(shù)據(jù)街夭,共有32738基因和2700x細(xì)胞
1.Setup the Seurat Object(創(chuàng)建Seurat對象)
library(dplyr)
library(Seurat)
rm(list = ls())
pbmc.data <- Read10X(data.dir = "~/seurat/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200) # gene:13714,cells:2700
#比較稀疏矩陣和普通矩陣使用的內(nèi)存大小
dense.size <- object.size(x = as.matrix(x = pbmc.data))
dense.size
sparse.size <- object.size(x = pbmc.data)
sparse.size
dense.size / sparse.size
1)先是用Read10x讀入數(shù)據(jù)砰碴,10xGenomics得到的數(shù)據(jù)是用稀疏矩陣存放的,因?yàn)閱渭?xì)胞數(shù)據(jù)又很多0板丽,也就是很多未檢測到的數(shù)據(jù)呈枉,稀疏矩陣存放的話可以節(jié)省內(nèi)存
2)接著用CreateSeuratObject創(chuàng)建Seurat對象趁尼,min.cells = 3, min.features = 200篩選的條件是min.cells = 3基因至少在3個細(xì)胞中表達(dá), min.features=200猖辫,每個細(xì)胞中至少表達(dá)200個基因
2.Standard pre-processing workflow(數(shù)據(jù)預(yù)處理流程)
可以分為四個部分:
-
Selection and filtration of the cells based on the QC metrics 基于QC的結(jié)果選擇和過濾細(xì)胞
-
Data normalization 數(shù)據(jù)標(biāo)準(zhǔn)化
-
scaling 量化(中心化)
-
Detection of the highly variable feature 高變異基因的檢測
(1)Selection and filtration of the cells based on the QC metrics 基于QC的結(jié)果選擇和過濾細(xì)胞
1)QC的標(biāo)準(zhǔn)
- The number of unique genes detected in each cell 每個細(xì)胞中的唯一基因的數(shù)目
- Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes) 每個細(xì)胞中的UMI數(shù)目
- The percentage of reads that map to the mitochondrial genome 比對到線粒體基因中的read數(shù)的百分比(因?yàn)榈唾|(zhì)量/將死的細(xì)胞通常表現(xiàn)出廣泛的線粒體污染)
2)代碼實(shí)現(xiàn)
pbmc[["percent.mt"]] <- PercentageFeatureSet(object = pbmc, pattern = "^MT-") #線粒體基因以MT開頭的選擇出來
##nFeature_RNA就是每個細(xì)胞檢測到的基因數(shù)目酥泞;nCount_RNA就是這些基因數(shù)目一共測到的count數(shù)目,也就是總的UMI數(shù)目;percent.mt(使用'PercentageFeatureSet`函數(shù)計(jì)算線粒體QC指標(biāo)啃憎,該函數(shù)計(jì)算源自一組特征的計(jì)數(shù)百分比)
# Show QC metrics for the first 5 cells
head(x = pbmc@meta.data, 5)
#Visualize QC metrics as a violin plot 用小提琴圖可視化QC特征
VlnPlot(object = pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
# FeatureScatter通常用于可視化特征 - 特征關(guān)系
plot1 <- FeatureScatter(object = pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(object = pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
CombinePlots(plots = list(plot1,plot2))
#選擇gene數(shù)目小于2500 以及大于200 以及 線粒體數(shù)目小于5%的細(xì)胞 #可以基于violin圖來選擇你要卡的閾值
#過濾完之后還剩余 genes:13714,cells:2638
pbmc <- subset(x = pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
3)結(jié)果
4)解釋
對于小提琴圖的幾個參數(shù)的解釋:nFeature_RNA就是每個細(xì)胞檢測到的基因數(shù)目芝囤,就是以前版本的nGene;nCount_RNA就是這些基因數(shù)目一共測到的count數(shù)目,也就是以前版本的UMI數(shù)目辛萍;percent.mt(使用'PercentageFeatureSet`函數(shù)計(jì)算線粒體QC指標(biāo)悯姊,該函數(shù)計(jì)算源自一組特征的計(jì)數(shù)百分比)。
對于QC標(biāo)準(zhǔn)的選擇:根據(jù)小提琴圖中的總的基因數(shù)目的分布以及線粒體所占的百分比贩毕,比如這邊大多數(shù)基因落在200-2500之間悯许,線粒體所占百分比小于5%這些是要保留下來的。
(2)數(shù)據(jù)標(biāo)準(zhǔn)化
1)對于數(shù)據(jù)標(biāo)準(zhǔn)化的理解
其實(shí)原先我已經(jīng)忘了數(shù)據(jù)標(biāo)準(zhǔn)化是干啥的了耳幢,我搞不明白為什么要用Log轉(zhuǎn)化岸晦,與我們的FPKM有什么區(qū)別欧啤,之后到網(wǎng)上去看了下睛藻,發(fā)現(xiàn)數(shù)據(jù)標(biāo)準(zhǔn)化消除數(shù)據(jù)量綱之間的不同;消除數(shù)量級之間差別很大邢隧;避免數(shù)值問題:太大的數(shù)會引發(fā)數(shù)值問題店印;平衡各特征的貢獻(xiàn);一些模型求解的需要:加快了梯度下降求最優(yōu)解的速度等需要
Seurat: 從數(shù)據(jù)集中刪除不需要的cell后倒慧,下一步是標(biāo)準(zhǔn)化數(shù)據(jù)按摘。 默認(rèn)情況下,Seurat采用全局標(biāo)準(zhǔn)化方法“LogNormalize”纫谅,通過總表達(dá)式對每個cell的表達(dá)值進(jìn)行標(biāo)準(zhǔn)化炫贤,將其乘以比例因子(默認(rèn)為10,000),并對結(jié)果進(jìn)行對數(shù)轉(zhuǎn)換付秕。 歸一化值存儲在pbmc [[“RNA”]] @ data
中兰珍。
2)代碼實(shí)現(xiàn)
#使用log轉(zhuǎn)化,度量單位是10000
pbmc <- NormalizeData(object = pbmc, normalization.method = "LogNormalize", scale.factor = 1e4)
(3)Identification of highly variable features (feature selection) 識別高度變異基因(特征選擇的過程)
1)解釋
seurat: a subset of features that exhibit high cell-to-cell variation in the dataset 在數(shù)據(jù)集中表現(xiàn)出細(xì)胞間高變異的特征子集
2)代碼實(shí)現(xiàn)
pbmc <- FindVariableFeatures(object = pbmc,selection.method = 'vst', nfeatures = 2000)
# Identify the 10 most highly variable genes 取前十個變化最大的基因
top10 <- head(x = VariableFeatures(object = pbmc), 10)
# plot variable features with and without labels 畫出2000個高變異基因
plot1 <- VariableFeaturePlot(object = pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
CombinePlots(plots = list(plot1, plot2))
3) 結(jié)果展示
4)解釋
q其實(shí)我不是很理解這邊的高變異基因和差異基因的區(qū)別询吴,變異基因是說在每個細(xì)胞之間變異比較大的基因嗎掠河?其實(shí)不是很理解,這邊是什么意思猛计,而且我覺得變異最大的基因會不會是系統(tǒng)誤差導(dǎo)致的呢唠摹,并不是真正的生物學(xué)變異。我對這步選擇高變異基因認(rèn)為是在做特征選擇的過程奉瘤,選出變異比較大的基因用于后續(xù)分析勾拉。
(4)Scaling 中心化
1)理解
應(yīng)用線性變換('縮放'),這是在PCA等降維技術(shù)之前的標(biāo)準(zhǔn)預(yù)處理步驟(目的:所謂去中心化,就是將樣本X中的每個觀測值都減掉樣本均值藕赞,這樣做的好處是能夠使得求解協(xié)方差矩陣變得更容易苛秕。)。
ScaleData
函數(shù)
該步驟在下游分析中給予相同的權(quán)重找默,因此高表達(dá)的基因不占優(yōu)勢
2)代碼
all.genes <- rownames(x = pbmc)
pbmc <- ScaleData(object = pbmc, features = all.genes)
#- `ScaleData` function to remove unwanted sources of variation from a single-cell dataset, 比如mitochondrial contamination線粒體污染
pbmc <- ScaleData(object = pbmc, vars.to.regress = 'percent.mt')
3)結(jié)果展示
3.Perform linear dimensional reduction線性降維 + Cluster the cells 對細(xì)胞進(jìn)行聚類分析
(1) 線性降維:主成分分析
1)代碼實(shí)現(xiàn)
#接下來艇劫,我們對中心化的數(shù)據(jù)執(zhí)行PCA。 默認(rèn)情況下惩激,只有先前確定的變量要素用作輸入店煞,但如果要選擇其他子集,則可以使用`features`參數(shù)定義风钻。
pbmc <- RunPCA(object = pbmc, features = VariableFeatures(object = pbmc))
# Examine and visualize PCA results a few different ways
print(x = pbmc[['pca']], dims = 1:5, nfeatures = 5)
VizDimLoadings(object = pbmc, dims = 1:2, reduction = 'pca')
DimPlot(object = pbmc, reduction = 'pca') #每個細(xì)胞在PC1 和 PC2中的位置圖
# Doheatmap 畫熱圖
DimHeatmap(object = pbmc, dims = 1, cells = 500, balanced = TRUE) #可以選擇所要呈現(xiàn)的細(xì)胞數(shù)目
DimHeatmap(object = pbmc, dims = 1:3, cells = 500, balanced = TRUE)
# Determine the 'dimensionality' of the dataset 決定選取多少主成分
# 使用jackStraw方法 來估計(jì)多少主成分比較合適
# We identify 'significant' PCs as those who have a strong enrichment of low p-value features. 顯著的主成分是強(qiáng)烈富集 低P值的特征
pbmc <- JackStraw(object = pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(object = pbmc, dims = 1:20)
# JackStrawplot
JackStrawPlot(object = pbmc, dims = 1:15) #JackStrawPlot功能提供了一個可視化工具顷蟀,用于比較每個PC的p值分布和均勻分布
# 碎石圖
ElbowPlot(object = pbmc) #基于每一個解釋的方差百分比(“ElbowPlot”函數(shù))對主成分進(jìn)行排序
2)結(jié)果展示
3)PCA中RunPCA時遇到的坑
pbmc <- RunPCA(object = pbmc, features = VariableFeatures(object = pbmc))
Error in irlba(A = t(x = object), nv = npcs, ...) :
max(nu, nv) must be strictly less than min(nrow(A), ncol(A))
心中納悶啊,為啥報(bào)錯啊,然后打開RunPCA的幫助文檔
發(fā)現(xiàn)是npcs的選擇是默認(rèn)為50的骡技。而我的矩陣是2000基因*24細(xì)胞的矩陣鸣个。我心中納悶,我不是有2000個基因可以供選擇嗎布朦?怎么我選取個主成分等于50就不行嘛囤萤?后來明白了PC的選擇是小于你給的矩陣的行數(shù)或列數(shù)中的最小值的。因?yàn)镻C計(jì)算過程是要計(jì)算奇異矩陣的是趴,然后奇異矩陣是個對稱矩陣哦涛舍。
(2)Cluster the cells 對細(xì)胞進(jìn)行聚類分析
1)理解
- 在這里是對數(shù)據(jù)降維后在聚類,這里使用的是KNN聚類方法
- 與PhenoGraph一樣唆途,我們首先根據(jù)PCA空間中的歐氏距離構(gòu)建KNN圖富雅,并根據(jù)其局部鄰域中的共享重疊(Jaccard相似性)細(xì)化任意兩個單元之間的邊權(quán)重。 此步驟使用
FindNeighbors
函數(shù)執(zhí)行肛搬,并將先前定義的數(shù)據(jù)集維度(前10個PC)作為輸入没佑。FindClusters
函數(shù)實(shí)現(xiàn)了此過程,并包含一個分辨率參數(shù)温赔,用于設(shè)置下游群集的“簇?cái)?shù)目”蛤奢,增加的值會導(dǎo)致更多的簇。- 對于大約3000數(shù)目的單細(xì)胞數(shù)據(jù)集让腹,我們發(fā)現(xiàn)將此參數(shù)設(shè)置在0.4-1.2之間結(jié)果較為良好远剩。 較大的數(shù)據(jù)集通常會增加最佳分辨率。 可以使用
Idents
函數(shù)找到簇骇窍。
2)代碼
pbmc <- FindNeighbors(object = pbmc, dims = 1:10)
pbmc <- FindClusters(object = pbmc, resolution = 0.5)
# Look at cluster IDs of the first 5 cells
head(x = Idents(object = pbmc), 5)
3)結(jié)果
4.Run non-linear dimensional reduction (UMAP/tSNE) 非線性降維: tSNE and UMAP
1)理解
- The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space.
-算法的目的:了解數(shù)據(jù)的多樣性瓜晤,以便在低維空間中將相似的細(xì)胞放在一起
- 作為UMAP和tSNE的輸入,我們建議使用相同的PC作為聚類分析的輸入腹纳。
- 降維有助于數(shù)據(jù)可視化
- UMAP(Uniform Manifold Approximation and Projection)是一種用于降維的新型流形學(xué)習(xí)技術(shù)痢掠。 UMAP由基于黎曼幾何和代數(shù)拓?fù)涞睦碚摽蚣軜?gòu)成驱犹。 結(jié)果是適用于現(xiàn)實(shí)世界數(shù)據(jù)的實(shí)用可擴(kuò)展算法。 UMAP算法與t-SNE競爭可視化質(zhì)量足画,并且可以保留更多的全局結(jié)構(gòu)雄驹,具有出色的運(yùn)行時性能。 此外淹辞,UMAP對嵌入維度沒有計(jì)算限制医舆,使其可用作機(jī)器學(xué)習(xí)的通用降維技術(shù)
UMAP(https://github.com/lmcinnes/umap)
2)代碼實(shí)現(xiàn)
reticulate::py_install(packages = "umap-learn") #安裝UMAP
# UMAP
pbmc <- RunUMAP(object = pbmc, dims = 1:10)
DimPlot(object = pbmc, reduction = 'umap',label = TRUE )
# TSNE
pbmc <- RunTSNE(object = pbmc, dims = 1:10)
DimPlot(object = pbmc, reduction = 'tsne',label = TRUE )
# 您可以在此時保存對象,以便可以輕松地將其重新加載象缀,而無需重新運(yùn)行上面執(zhí)行的計(jì)算密集型步驟蔬将,或者可以輕松地與協(xié)作者共享
saveRDS(pbmc, file = "G:/Ajitongjiuniversity/Bioinformatics/Single cell sequencing/scRNA_seq workflow/seurat/pbmc_tutorial.rds")
3)結(jié)果
4)收獲
哈哈無意中指導(dǎo)了UMAP方法開心
5. Finding differentially expressed features 尋找差異表達(dá)的特征(聚類生物標(biāo)志物)
1)理解
- Seurat可以幫助您找到通過差異表達(dá)定義聚類的標(biāo)記兜挨。
- 默認(rèn)情況下喧兄,它標(biāo)識單個cluster的正marker和負(fù)marker(在
ident.1
中指定)并鸵,與所有其他cluster進(jìn)行比較伴找。FindAllMarkers
為所有cluster自動執(zhí)行此過程,但您也可以測試集群彼此之間或所有cluster楷怒。- FindAllMarkers參數(shù):min.pct參數(shù)在兩組cluster中的任意一組中以最小百分比檢測到特征炼蹦;ident.1:用于定義標(biāo)記的標(biāo)識類逮壁;ident.2:用于和ident.1比較的標(biāo)識類颓遏;logfc.threshold: Log FC的倍數(shù)值
- FindAllMarkers(參數(shù)是test.use)中用于找兩個cluster之間的差異基因的方法有:wilcox(默認(rèn))徐矩;bimod;roc州泊;t丧蘸;negbinom;poisson遥皂;LR;MAST刽漂;DESeq2
- %>% R中的管道傳輸函數(shù)
2)代碼實(shí)現(xiàn)
#- (1)找出所有cluster中的marker
# find all markers of cluster 1 找出cluster1 的所有marker
cluster1.markers <- FindMarkers(object = pbmc, ident.1 = 1, min.pct = 0.25)
head(x = cluster1.markers, n = 5)
# find all markers distinguishing cluster 5 from clusters 0 and 3 找到cluster 5 和cluster(0,3)比較的所有marker
cluster5.markers <- FindMarkers(object = pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(x = cluster5.markers, n = 5)
# find markers for every cluster compared to all remaining cells, report only the positive ones 找所有cluster的marker
pbmc.markers <- FindAllMarkers(object = pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC) #將pbmc.markers傳給group_by演训,group_by按cluster 排序,再將結(jié)果傳給top_n贝咙,top_n按avg_logFC排序样悟,顯示每個類中的前兩個
#- (2)可視化這些marker在這些集群中的表達(dá)情況
#--1)用VlnPlot(顯示跨集群的表達(dá)概率分布)
#--2)FeaturePlot(在tSNE或PCA圖上可視化特征表達(dá))這兩種是比較常用的
#--3)還可以用`RidgePlot`,`CellScatter`和`DotPlot`這幾種方法查看數(shù)據(jù)集情況
# VlnPlot
VlnPlot(object = pbmc, features = c("MS4A1", "CD79A"))
# you can plot raw counts as well
VlnPlot(object = pbmc, features = c("NKG7", "PF4"), slot = 'counts', log = TRUE)
# FeaturePlot
FeaturePlot(object = pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
#RidgePlot
RidgePlot(object = pbmc, features = c("MS4A1", "CD79A"))
#CellScatter
CellScatter(object = pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"),cell1 = "CATTACACGGAGTG", cell2 = "CATTACACCAACTG")
# DotPlot
DotPlot(object = pbmc, features = c("MS4A1", "CD79A"))
# DoHeatmap 熱圖
# 觀察每個cluster中的TOP10的基因在每個cluster中的表達(dá)情況
pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC) -> top10
DoHeatmap(object = pbmc, features = top10$gene) + NoLegend()
3)結(jié)果
6.Assigning cell type identity to clusters 鑒定每個cluster的身份
(1)使用經(jīng)典的marker
1)代碼實(shí)現(xiàn)
#我們可以使用經(jīng)典的標(biāo)記輕松地將無偏聚類與已知細(xì)胞類型相匹配
new.cluster.ids <- c("Memory CD4 T", "CD14+ Mono", "Naive CD4 T", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Mk")
names(x = new.cluster.ids) <- levels(x = pbmc)
pbmc <- RenameIdents(object = pbmc, new.cluster.ids)
DimPlot(object = pbmc, reduction = 'umap', label = TRUE, pt.size = 0.5) + NoLegend()
library(ggplot2)
plot <- DimPlot(object = pbmc, reduction = "umap", label = TRUE, label.size = 4.5) + xlab("UMAP 1") + ylab("UMAP 2") +
theme(axis.title = element_text(size = 18), legend.text = element_text(size = 18)) +
guides(colour = guide_legend(override.aes = list(size = 10)))
ggsave(filename = "G:/Ajitongjiuniversity/Bioinformatics/Single cell sequencing/scRNA_seq workflow/seurat/pbmc3k_umap.png", height = 7, width = 12, plot = plot)
saveRDS(pbmc, file = "G:/Ajitongjiuniversity/Bioinformatics/Single cell sequencing/scRNA_seq workflow/seurat/pbmc_tutorial.rds")
2)結(jié)果展示
(2)全自動細(xì)胞注釋-SingleR(暫未跑通)
7.總結(jié)
- 今天下午13.30到晚上20.26的收獲吧庭猩,算是把seurat搞清楚了窟她,比之前剛用得時候清楚。
- 收獲的點(diǎn)在于:再寫一遍筆記的時候會比較著急蔼水,我自己是用notepaid寫了一遍了震糖,然后想著用簡書也整理一遍,才發(fā)現(xiàn)原來自己很著急哦趴腋,很多筆記都是直接copy自己寫在notepaid中的吊说,希望自己下次能夠把簡書和Notepaid同步起來论咏,開心
- 不理解的地方:高度變異基因和差異基因的區(qū)別?還有SingleR是如何注釋的還是需要搞清楚颁井,這個還是很有用的