聚類分析&maker define cluster - Seurat

########################################################

在完成filter工作以后,我們進行下一步的聚類、marker鑒定

PART1. 聚類

step0. count normalization

normalize:把count值轉(zhuǎn)化為log值

0.1 normalization主要考慮的因素有兩個:

(1)測序深度(sequencing depth):

我的理解是把一個細(xì)胞的reads數(shù)目normalize成10000(default),實際上比較的基因百分?jǐn)?shù)的變化

(2)基因長度(Gene length)

0.2 normalizing the data

# normalizing the data
# default setting: normalization.method = "LogNormalize" and scale.factor = 10000  

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

pbmc <- NormalizeData(pbmc)

step1. feature selection - 找clusters之間表達水平差異化最大的基因feature

  • Identification of highly variable features
    這里指的是颤介,單一樣本,不同cluster之間相比次伶,表達水平差異化最大的基因feature
# 加載Seurat對象/加載保存為.Rdata的Seurat對象
Seurat
load("data/pbmc_seurat_filtered.RData")

# 用 FindVariableFeatures 找variable gene米间;default 值如下
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# 列出 cluster 之間變化最大的gene - Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# 畫圖 plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc) # without labels
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) # with labels
plot1 + plot2

step2. scaling the data

  • 為 PCA 做準(zhǔn)備

如果是為了PCA做準(zhǔn)備,那么對 all.genes scaling data 則耗時太長扶平;一個解決辦法是渣聚,使用 scaledata 默認(rèn)值(依照上一步鑒定的 2000 variable features)隔显,這樣耗時短也能達到分群的目的

# perform scale data on the previously identified variable features(2000 by default)
pbmc <- ScaleData(pbmc)

  • DoHeatmap

To make sure we don’t leave any genes out of the heatmap later, we are scaling all genes in this tutorial.

# DoHeatmap by scaling all genes
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

step3. PCA—主成分分析

用tSNE較多,因為tSNE分的比較開饵逐,uMAP的優(yōu)點在于能看到譜系來源括眠,但是分的不開,所以如果要做分cluster的話倍权,用tSNE更多掷豺。

先用默認(rèn)的 previously determined variable features 作為input,but can be defined using features argument if you wish to choose a different subset.

# runPCA
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

3.1 識別重要主成分 PC (該部分需要加強理解 - option)

為了確保分群所依照的主成分基因薄声,它不是由于單個基因表達的技術(shù)性噪聲帶來的干擾当船;所以要在PCA將維之前,識別重要的主成分基因默辨,也就是德频,用 PC 去代表聚合的變異最大的基因。

然后用 PC得分缩幸,衡量該 cluster 在每個PC 上的權(quán)重壹置,也就是,PC 代表的“元基因”(變異最大的那些基因)能解釋該cluster 百分之幾的差異度表谊。

所以钞护,在確定下游聚類分析之前,有一個問題爆办,我們該如何識別PC 呢难咕?

目前Seurat提供有以下三種方法:VizDimReduction, DimPlot, and DimHeatmap

3.1.1 找合適重要主成分的方法
  • 熱圖可視化 DimHeatmap

cells參數(shù)指定了用于繪圖的最大正負(fù)PCA分?jǐn)?shù)的細(xì)胞數(shù)。我們要尋找的是,在熱圖開始看起來更 "模糊 "的PC余佃,也就是說暮刃,各組基因之間的區(qū)分不是那么明顯的PC。

# 研究PC熱圖
DimHeatmap(seurat_integrated, 
           dims = 1:9, 
           cells = 500, 
           balanced = TRUE)

如果我們想探索大量的PCA爆土,這種方法可能會很慢沾歪,而且很難將單個基因可視化。同樣的雾消,為了探索大量的PC,我們可以通過PCA分?jǐn)?shù)驅(qū)動PC挫望,打印出前10個(或更多的)正基因和負(fù)基因立润。

The third is a heuristic that is commonly used, and can be calculated instantly. In this example, all three approaches yielded similar results, but we might have been justified in choosing anything between PC 7-12 as a cutoff.

  • VizDimReduction

The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example.

# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)

## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7


# 可視化 VizDimReduction PCA results
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

  • DimPlot

The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff.

DimPlot(pbmc, reduction = "pca")

3.1.2 Determine the ‘dimensionality’ of the dataset

彎頭圖(elbow plot)直觀地顯示了每個PC的標(biāo)準(zhǔn)差,我們要尋找標(biāo)準(zhǔn)差開始趨于平穩(wěn)的地方媳板。從本質(zhì)上說桑腮,彎頭出現(xiàn)的地方通常是識別大多數(shù)變異的閾值。然而蛉幸,這種方法可能是相當(dāng)主觀的破讨。

根據(jù)這個圖,我們可以通過PC8-PC10附近的肘部發(fā)生的位置來大致確定大部分的變化奕纫,或者可以認(rèn)為應(yīng)該是數(shù)據(jù)點開始接近X軸的時候提陶,PC30左右。

ElbowPlot(pbmc)

step4. 聚類細(xì)胞 cluster the cell

Seurat使用了一種基于圖的聚類方法匹层,它將細(xì)胞嵌入到一個圖結(jié)構(gòu)中隙笆,使用K-近鄰(KNN)圖(默認(rèn)情況下),并在具有相似基因表達模式的細(xì)胞之間畫出邊緣升筏。然后撑柔,它試圖將這個圖分割成高度相互關(guān)聯(lián)的 "準(zhǔn)聚類 "或 "群落"[Seurat-Guided-Clustering-Tutorial]。

我們將使用FindClusters()函數(shù)來執(zhí)行基于圖的聚類您访。resolution是一個重要的參數(shù)铅忿,它設(shè)置了下行聚類的 "粒度 (granularity)",需要對每個單獨的實驗進行優(yōu)化灵汪。對于3,000-5,000個細(xì)胞的數(shù)據(jù)集檀训,resolution設(shè)置在0.4-1.4之間,一般可以獲得良好的聚類效果享言。分辨率的增加會導(dǎo)致更多的聚類肢扯,這通常是較大的數(shù)據(jù)集所需要的。

FindClusters()函數(shù)允許我們輸入一系列的分辨率担锤,并將計算出聚類的 "粒度"蔚晨。這便于測試哪種分辨率對于分群更有幫助,而不需要為每個分辨率分別運行函數(shù)。

# 確定k-近鄰圖
seurat_integrated <- FindNeighbors(object = seurat_integrated, 
                                dims = 1:40)
                                
# 確定聚類的不同分辨率                               
seurat_integrated <- FindClusters(object = seurat_integrated,
                               resolution = c(0.4, 0.6, 0.8, 1.0, 1.4))

# 如果我們看一下Seurat對象的元數(shù)據(jù)(seurat_integrated@metadata)铭腕,每一個不同的分辨率都有一個單獨的列來計算银择。
# 探索分辨率
seurat_integrated@meta.data %>% 
        View()

# 我們通常會選擇一個中間范圍的分辨率開始,比如0.6或0.8累舷。我們將通過使用Idents()函數(shù)賦予聚類身份浩考,從0.8的分辨率開始。
# 賦予聚類的身份
Idents(object = seurat_integrated) <- "integrated_snn_res.0.8"

# 舉例
# ----------------
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 96033
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds

# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)

## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##                1                3                1                2 
## AAACCGTGTATGCG-1 
##                6 
## Levels: 0 1 2 3 4 5 6 7 8

step5.Run non-linear dimensional reduction (UMAP/tSNE)

為了使細(xì)胞類群可視化被盈,有一些不同的降維技術(shù)可以使用析孽。最流行的方法包括t分布隨機鄰域嵌入(t-SNE)和統(tǒng)一模態(tài)近似與投影(UMAP)技術(shù)。

這兩種方法的目的是將高維空間中具有相似局部鄰域的細(xì)胞一起放置在低維空間中只怎。這兩種方法都需要輸入PCA維度的數(shù)量來進行可視化袜瞬,我們建議使用相同數(shù)量的PC作為聚類分析的輸入。這里身堡,我們將用UMAP方法來進行可視化聚類分析邓尤。

# umap
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, 
        reduction = "umap", 
        label = TRUE,
        label.size = 6)

# tsne
pbmc <- RunTSNE(pbmc, features = VariableFeatures(object = pbmc))
DimPlot(pbmc, reduction = "tsne")

這對于研究其他分辨率也是很有用的。它可以讓你快速了解基于分辨率參數(shù)的類群會有怎樣的變化贴谎。例如汞扎,讓我們換成0.4的分辨率。

# 分配類群的身份
Idents(object = seurat_integrated) <- "integrated_snn_res.0.4"

# 繪制UMAP圖
DimPlot(seurat_integrated,
        reduction = "umap",
        label = TRUE,
        label.size = 6)

# 保存為.RData
saveRDS(pbmc, file = "../output/pbmc_tutorial.rds")

PARTII marker define cluster

step0. Find clusters vs. each other, or against all cells.

By default, it identifes positive and negative markers of a single cluster (specified in ident.1), compared to all other cells.

FindAllMarkers automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.

The min.pct argument requires a feature to be detected at a minimum percentage in either of the two groups of cells.

# find all markers of cluster 1 (vs. all other cluster by default)
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)

# p_val avg_logFC   pct.1   pct.2   p_val_adj
# IL32  0   0.8373872   0.948   0.464   0
# LTB   0   0.8921170   0.981   0.642   0
# CD3D  0   0.6436286   0.919   0.431   0
# IL7R  0   0.8147082   0.747   0.325   0
# LDHB  0   0.6253110   0.950   0.613   0

# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)

# p_val avg_logFC   pct.1   pct.2   p_val_adj
# FCGR3A    0   2.963144    0.975   0.037   0
# IFITM3    0   2.698187    0.975   0.046   0
# CFD       0   2.362381    0.938   0.037   0
# CD68  0   2.087366    0.926   0.036   0
# RP11-290F20.3 0   1.886288    0.840   0.016   0

# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)

step1. visualizing marker expression

VlnPlot (shows expression probability distributions across clusters), and FeaturePlot (visualizes feature expression on a tSNE or PCA plot) are our most commonly used visualizations. We also suggest exploring RidgePlot, CellScatter, and DotPlot as additional methods to view your dataset.

1.1 VinPlot the expression level

# you can plot avg_logFC
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

1.2 FeaturePlot the distribution

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",  "CD8A"))

1.3 DoHeatmap for given cells and features

# In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

step2. 根據(jù)找到的marker/canonical markers 導(dǎo)出csv

##找到每個cluster的marker
markers <- FindMarkers(pbmc, ident.1 = 0,  only.pos = TRUE,min.pct = 0.25)
write.table(markers, file="/Users/apple/Desktop/Result/markers_0.tsv", sep="\t", quote=FALSE, col.names=NA)

step3. 根據(jù)marker.csv 畫圖

##直接根據(jù)marker畫圖

table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[1:4,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_1.png", a)

table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[8:14,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_2.png", a)

table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[15:21,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_3.png", a)
table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[22:28,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_4.png", a)

a <- FeaturePlot(pbmc, reduction = "tsne",features = "Cidec", cols = c("grey", "red"))
ggsave("genemap_1.png", a)


step4. 根據(jù)不同的 cluster 指定細(xì)胞類型 Assigning cell type identity to clusters

##標(biāo)記
table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[1:8,1])
new.cluster.ids <- genes
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
a<-DimPlot(pbmc, reduction = "tsne", label = TRUE, pt.size = 0.5) + NoLegend()
ggsave("tsne.png", a)

# or 直接按順序給cluster0 -> 8 指定細(xì)胞類型
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", 
    "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

# 存儲為RDS
saveRDS(pbmc, file = "../output/pbmc3k_final.rds")

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末擅这,一起剝皮案震驚了整個濱河市澈魄,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌仲翎,老刑警劉巖一忱,帶你破解...
    沈念sama閱讀 216,372評論 6 498
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場離奇詭異谭确,居然都是意外死亡帘营,警方通過查閱死者的電腦和手機,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 92,368評論 3 392
  • 文/潘曉璐 我一進店門逐哈,熙熙樓的掌柜王于貴愁眉苦臉地迎上來芬迄,“玉大人,你說我怎么就攤上這事昂秃≠魇幔” “怎么了?”我有些...
    開封第一講書人閱讀 162,415評論 0 353
  • 文/不壞的土叔 我叫張陵肠骆,是天一觀的道長算途。 經(jīng)常有香客問我,道長蚀腿,這世上最難降的妖魔是什么嘴瓤? 我笑而不...
    開封第一講書人閱讀 58,157評論 1 292
  • 正文 為了忘掉前任扫外,我火速辦了婚禮,結(jié)果婚禮上廓脆,老公的妹妹穿的比我還像新娘筛谚。我一直安慰自己,他們只是感情好停忿,可當(dāng)我...
    茶點故事閱讀 67,171評論 6 388
  • 文/花漫 我一把揭開白布驾讲。 她就那樣靜靜地躺著,像睡著了一般席赂。 火紅的嫁衣襯著肌膚如雪吮铭。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 51,125評論 1 297
  • 那天颅停,我揣著相機與錄音谓晌,去河邊找鬼。 笑死便监,一個胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的碳想。 我是一名探鬼主播烧董,決...
    沈念sama閱讀 40,028評論 3 417
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢啊……” “哼胧奔!你這毒婦竟也來了逊移?” 一聲冷哼從身側(cè)響起,我...
    開封第一講書人閱讀 38,887評論 0 274
  • 序言:老撾萬榮一對情侶失蹤龙填,失蹤者是張志新(化名)和其女友劉穎胳泉,沒想到半個月后,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體岩遗,經(jīng)...
    沈念sama閱讀 45,310評論 1 310
  • 正文 獨居荒郊野嶺守林人離奇死亡扇商,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 37,533評論 2 332
  • 正文 我和宋清朗相戀三年,在試婚紗的時候發(fā)現(xiàn)自己被綠了宿礁。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片案铺。...
    茶點故事閱讀 39,690評論 1 348
  • 序言:一個原本活蹦亂跳的男人離奇死亡,死狀恐怖梆靖,靈堂內(nèi)的尸體忽然破棺而出控汉,到底是詐尸還是另有隱情,我是刑警寧澤返吻,帶...
    沈念sama閱讀 35,411評論 5 343
  • 正文 年R本政府宣布姑子,位于F島的核電站,受9級特大地震影響测僵,放射性物質(zhì)發(fā)生泄漏街佑。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點故事閱讀 41,004評論 3 325
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望舆乔。 院中可真熱鬧岳服,春花似錦、人聲如沸希俩。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,659評論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽颜武。三九已至璃搜,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間鳞上,已是汗流浹背这吻。 一陣腳步聲響...
    開封第一講書人閱讀 32,812評論 1 268
  • 我被黑心中介騙來泰國打工, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留篙议,地道東北人唾糯。 一個月前我還...
    沈念sama閱讀 47,693評論 2 368
  • 正文 我出身青樓,卻偏偏與公主長得像鬼贱,于是被迫代替她去往敵國和親移怯。 傳聞我的和親對象是個殘疾皇子,可洞房花燭夜當(dāng)晚...
    茶點故事閱讀 44,577評論 2 353