?在看完seurat官方分群教程后
https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html
感覺又收獲很多東西搪柑,決定寫下筆記,更好地去了解這個包索烹,和單細(xì)胞測序數(shù)據(jù)的分析原理工碾。圖碼較多~
數(shù)據(jù)為10x genomics官方測序數(shù)據(jù)(Illumina NextSeq 500),人外周血單核細(xì)胞百姓,經(jīng)過cell ranger過濾后渊额,載入seurat 包進(jìn)行分析,最終為13714genes*2638cells,分為9個亞群結(jié)合數(shù)據(jù)庫分別鑒定為"Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet"垒拢。下面將根據(jù)官方教程和網(wǎng)上資料進(jìn)行解析旬迹。
一、創(chuàng)建seurat對象求类、數(shù)據(jù)過濾出圖
.矩陣中的值表示0(沒有檢測到分子)奔垦。由于scRNA-seq矩陣中的大多數(shù)值為0,因此Seurat盡可能使用稀疏矩陣表示尸疆。這樣可以為Drop-seq / inDrop / 10x數(shù)據(jù)節(jié)省大量內(nèi)存并節(jié)省速度拷恨。
```
# 指定(你的)數(shù)據(jù)所在目錄;
setwd("")
data_dir <- ""
#載入seurat包九杂;
library(dplyr)
library(Seurat)
#讀入pbmc數(shù)據(jù)祷嘶;
pbmc.data <- Read10X(data.dir = data_dir)
#查看稀疏矩陣的維度,即基因數(shù)和細(xì)胞數(shù)脖捻;
dim(pbmc.data)
#預(yù)覽稀疏矩陣(1~10行阔逼,1~6列),. 表示0地沮;
pbmc.data[1:10,1:6]
#2.創(chuàng)建Seurat對象與數(shù)據(jù)過濾
# 在使用CreateSeuratObject()創(chuàng)建對象的同時嗜浮,過濾數(shù)據(jù)質(zhì)量差的細(xì)胞。保留在>=3個細(xì)胞中表達(dá)的基因摩疑;保留能檢測到>=200個基因的細(xì)胞危融。
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc2700", min.cells = 3, min.features = 200)
pbmc
# Lets examine a few genes in the first thirty cells
#pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
#看一般矩陣和稀疏矩陣的大小
dense.size <- object.size(as.matrix(pbmc.data))
dense.size
sparse.size <- object.size(pbmc.data)
sparse.size
# 計算每個細(xì)胞的線粒體基因轉(zhuǎn)錄本數(shù)的百分比(%),使用[[ ]] 操作符存放到metadata中雷袋;
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# Show QC metrics for the first 5 cells
head(pbmc@meta.data, 5)
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot1
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot2
CombinePlots(plots = list(plot1, plot2))
```
過濾前基因吉殃、轉(zhuǎn)錄本和線粒體比例分布
過濾前轉(zhuǎn)錄本和線粒體基因、基因數(shù)散點圖(可以發(fā)現(xiàn)基因數(shù)和轉(zhuǎn)錄本數(shù)成正相關(guān)楷怒,系數(shù)為0.95)
過濾后分布蛋勺,可以發(fā)現(xiàn)基因數(shù)在200-2500,線粒體基因數(shù)在5%以下
二鸠删、數(shù)據(jù)標(biāo)準(zhǔn)化抱完,pca降維,爪圖肘圖確定維度
```
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
# 對過濾后的 QC metrics進(jìn)行可視化(繪制散點圖)刃泡;
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")+ NoLegend()
plot1
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")+ NoLegend()
plot2
CombinePlots(plots = list(plot1, plot2))
#表達(dá)量數(shù)據(jù)標(biāo)準(zhǔn)化:LogNormalize的算法:A = log( 1 + ( UMIA ÷ UMITotal ) × 10000 )
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
#pbmc <- NormalizeData(pbmc)
#鑒定表達(dá)高變基因(2000個)巧娱,用于下游分析碉怔,如PCA;
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# 提取表達(dá)量變變化最高的10個基因禁添;
top10 <- head(VariableFeatures(pbmc), 10)
top10
plot3 <- VariableFeaturePlot(pbmc)
plot4 <- LabelPoints(plot = plot3, points = top10, repel = TRUE,xnudge=0,ynudge=0)
plot4
#3. PCA分析眨层;
#PCA分析數(shù)據(jù)準(zhǔn)備,使用ScaleData()進(jìn)行數(shù)據(jù)歸一化上荡;
#默認(rèn)只是標(biāo)準(zhǔn)化高變基因(2000個)趴樱,速度更快,不影響PCA和分群酪捡,但影響熱圖的繪制叁征;
pbmc <- ScaleData(pbmc,vars.to.regress = "percent.mt")
#而對所有基因進(jìn)行標(biāo)準(zhǔn)化的方法如下:
#all.genes <- rownames(pbmc)
#pbmc <- ScaleData(pbmc,features = all.genes,vars.to.regress = "percent.mt")
#線性降維(PCA),默認(rèn)用高變基因集,但也可通過features參數(shù)自己指定逛薇;
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# 檢查PCA 分群結(jié)果捺疼, 這里只展示前12個PC,每個PC只顯示3個基因;
print(pbmc[["pca"]], dims = 1:12, nfeatures = 3)
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
#繪制pca散點圖永罚;
DimPlot(pbmc, reduction = "pca")
#畫前2個主成分的熱圖啤呼;
DimHeatmap(pbmc, dims = 1:2, cells = 500, balanced = TRUE)
#4.確定數(shù)據(jù)集的維度個數(shù);
#方法1:Jackstraw置換檢驗算法呢袱;重復(fù)取樣(原數(shù)據(jù)的1%)官扣,重跑PCA,鑒定p-value較小的PC;計算‘null distribution’(即零假設(shè)成立時)時的基因scores;
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
JackStrawPlot(pbmc, dims = 1:15)
#方法2:肘部圖(碎石圖)羞福,基于每個主成分對方差解釋率的排名惕蹄;
ElbowPlot(pbmc)
```
高可變基因集,橫坐標(biāo)為基因平均表達(dá)水平治专,縱坐標(biāo)為基因表達(dá)標(biāo)準(zhǔn)差
前兩個主成分所含的基因集
前兩個主成分基因集熱圖
爪圖以確定分多少個亞群
肘狀圖(到10時平緩卖陵,說明再分主成分意義不大)
三、非線性降維张峰、差異基因熱圖
```
#基于PCA空間中的歐氏距離計算nearest neighbor graph泪蔫,優(yōu)化任意兩個細(xì)胞間的距離權(quán)重(輸入上一步得到的PC維數(shù));
pbmc <- FindNeighbors(pbmc, dims = 1:10)
#接著優(yōu)化模型喘批,resolution參數(shù)決定下游聚類分析得到的分群數(shù)撩荣,對于3K左右的細(xì)胞,設(shè)為0.4-1.2 能得到較好的結(jié)果谤祖;如果數(shù)據(jù)量增大婿滓,該參數(shù)也應(yīng)該適當(dāng)增大老速;
pbmc <- FindClusters(pbmc, resolution = 0.5)
#使用Idents()函數(shù)可查看不同細(xì)胞的分群粥喜;
head(Idents(pbmc), 7)
#Seurat提供了幾種非線性降維的方法進(jìn)行數(shù)據(jù)可視化(在低維空間把相似的細(xì)胞聚在一起),比如UMAP和t-SNE橘券,额湘;
#運行UMAP需要先安裝'umap-learn'包,這里不做介紹卿吐;
#umap聚類
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")
saveRDS(pbmc, file = "ump.rds")
###tsne聚類
pbmc <- RunTSNE(pbmc, dims = 1:10)
# 用DimPlot()函數(shù)繪制散點圖, reduction = "tsne"锋华,指定繪制類型嗡官;如果不指定,默認(rèn)先從搜索 umap,
#然后 tsne, 再然后 pca毯焕;也可以直接使用這3個函數(shù)PCAPlot()衍腥、TSNEPlot()、UMAPPlot()
# cols纳猫,pt.size 分別調(diào)整分組顏色和點的大衅畔獭;
tsneplot<-TSNEPlot(pbmc,label = TRUE, pt.size = 0.5)+ NoLegend()
tsneplot
#繪制Marker 基因的基因映射圖芜辕;
FeaturePlot(pbmc, features = c("MS4A1", "CD14"))
# 查找差異基因(biomarkers)尚骄,可以使用FindAllMarkers自動查找差異基因,也可以逐一查找:方法是將當(dāng)前cluster的細(xì)胞作為一組侵续,其他cluster的細(xì)胞作為另一組倔丈,然后進(jìn)行差異分析;
library(ggplot2)
# 19s計算1個cluster状蜗;min.pct=0.25 (minimum percentage)兩個組中至少25%的細(xì)胞能檢測到這個基因需五;用來過濾基因,加快運算速度轧坎;
###ident.1參數(shù)為比較的亞群如警儒,亞群1和所有亞群
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 7)
#查找cluster 5 的差異基因(與clusters 0 和 3相比);
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
#差異分析的方法設(shè)置:默認(rèn)秩和檢驗眶根,也可設(shè)為ROC蜀铲、t、DESeq2等多種方法属百;
cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
###查找全部marker
markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25,return.thresh = 0.01)
markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
# find markers for every cluster compared to all remaining cells, report only the positive ones
#pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
#pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
#取top10的基因记劝,用差異倍數(shù)排序; 備注:過濾過程用的是dplyr包的命令;%>%管道函數(shù)傳輸執(zhí)行族扰,類似于linux|
top10 <- markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
top10[c("cluster","gene")]
write.table(top10, file ="top10.txt", sep =" ", row.names =TRUE, col.names =TRUE, quote =TRUE)
# 用top10基因繪制熱圖厌丑,不畫圖例
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
#繪制基因小提琴圖
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = T)
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? "CD8A"))
#5.為分群重新指定細(xì)胞類型;
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono",
? ? ? ? ? ? ? ? ? ? "NK", "DC", "Platelet")
#將pbmc的水平屬性負(fù)值給new.cluster.ids的names屬性渔呵;
names(new.cluster.ids)
#NULL
levels(pbmc)
#[1] "0" "1" "2" "3" "4" "5" "6" "7" "8"
names(new.cluster.ids) <- levels(pbmc)
names(new.cluster.ids)
#[1] "0" "1" "2" "3" "4" "5" "6" "7" "8"
new.cluster.ids
pbmc <- RenameIdents(pbmc, new.cluster.ids)
#繪制tsne圖(修改標(biāo)簽后的)怒竿;
tsneplot2<-TSNEPlot(pbmc,label = TRUE, pt.size = 0.5)+ NoLegend()
tsneplot2
#畫umap圖
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
DimPlot
saveRDS(pbmc, file = "ump_end.rds")
```
umap降維結(jié)果圖(圖放大后點太小了,沒調(diào)了扩氢,這里跟官方有點差異耕驰,因為坐標(biāo)軸的原因,但是問題不大~)
t-sne降維聚類圖
標(biāo)記基因映射圖
標(biāo)記基因小提琴圖
top10差異基因熱圖(這里繪圖時有些基因沒顯示,The following features were omitted as they were not found in the scale.data slot for the RNA assay: CD8A, VPREB3, CD40LG, PIK3IP1, PRKCQ-AS1, NOSIP, LEF1, CD3E, CD3D, CCR7, LDHB, RPS3A,它們在scale.data插槽里沒被發(fā)現(xiàn)录豺,因此被忽略了)
通過基因熱圖朦肘,結(jié)合cellmarker數(shù)據(jù)庫和相關(guān)文獻(xiàn)饭弓,給分群后的圖加上細(xì)胞名稱(此為seurat官方圖片)
亞群加名
下次繪圖點要調(diào)大些~