聚類分析&maker define cluster - Seurat

########################################################

題目：1.聚類分析&maker define cluster - Seurat
語言：R
官方教程：https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html
日期：April 17, 2020
Author: YQ
########################################################

在完成filter工作以后，我們進行下一步的聚類、marker鑒定

PART1. 聚類

step0. count normalization

normalize：把count值轉(zhuǎn)化為log值

0.1 normalization主要考慮的因素有兩個：

(1)測序深度（sequencing depth）:

我的理解是把一個細(xì)胞的reads數(shù)目normalize成10000(default),實際上比較的基因百分?jǐn)?shù)的變化

(2)基因長度（Gene length）

0.2 normalizing the data

# normalizing the data
# default setting: normalization.method = "LogNormalize" and scale.factor = 10000  

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

pbmc <- NormalizeData(pbmc)

step1. feature selection - 找clusters之間表達水平差異化最大的基因feature

Identification of highly variable features
這里指的是颤介，單一樣本，不同cluster之間相比次伶，表達水平差異化最大的基因feature

# 加載Seurat對象/加載保存為.Rdata的Seurat對象
Seurat
load("data/pbmc_seurat_filtered.RData")

# 用 FindVariableFeatures 找variable gene米间；default 值如下
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# 列出 cluster 之間變化最大的gene - Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# 畫圖 plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc) # without labels
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) # with labels
plot1 + plot2

step2. scaling the data

為 PCA 做準(zhǔn)備

如果是為了PCA做準(zhǔn)備，那么對 all.genes scaling data 則耗時太長扶平；一個解決辦法是渣聚，使用 scaledata 默認(rèn)值（依照上一步鑒定的 2000 variable features）隔显，這樣耗時短也能達到分群的目的

# perform scale data on the previously identified variable features(2000 by default)
pbmc <- ScaleData(pbmc)

DoHeatmap

To make sure we don’t leave any genes out of the heatmap later, we are scaling all genes in this tutorial.

# DoHeatmap by scaling all genes
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

step3. PCA—主成分分析

用tSNE較多，因為tSNE分的比較開饵逐，uMAP的優(yōu)點在于能看到譜系來源括眠，但是分的不開，所以如果要做分cluster的話倍权，用tSNE更多掷豺。

先用默認(rèn)的 previously determined variable features 作為input，but can be defined using features argument if you wish to choose a different subset.

# runPCA
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

3.1 識別重要主成分 PC (該部分需要加強理解 - option)

為了確保分群所依照的主成分基因薄声，它不是由于單個基因表達的技術(shù)性噪聲帶來的干擾当船；所以要在PCA將維之前，識別重要的主成分基因默辨，也就是德频，用 PC 去代表聚合的變異最大的基因。

然后用 PC得分缩幸，衡量該 cluster 在每個PC 上的權(quán)重壹置，也就是，PC 代表的“元基因”（變異最大的那些基因）能解釋該cluster 百分之幾的差異度表谊。

所以钞护，在確定下游聚類分析之前，有一個問題爆办，我們該如何識別PC 呢难咕？

目前Seurat提供有以下三種方法：VizDimReduction, DimPlot, and DimHeatmap

3.1.1 找合適重要主成分的方法

熱圖可視化 DimHeatmap

cells參數(shù)指定了用于繪圖的最大正負(fù)PCA分?jǐn)?shù)的細(xì)胞數(shù)。我們要尋找的是，在熱圖開始看起來更 "模糊 "的PC余佃，也就是說暮刃，各組基因之間的區(qū)分不是那么明顯的PC。

# 研究PC熱圖
DimHeatmap(seurat_integrated, 
           dims = 1:9, 
           cells = 500, 
           balanced = TRUE)

如果我們想探索大量的PCA爆土，這種方法可能會很慢沾歪，而且很難將單個基因可視化。同樣的雾消，為了探索大量的PC，我們可以通過PCA分?jǐn)?shù)驅(qū)動PC挫望，打印出前10個（或更多的）正基因和負(fù)基因立润。

The third is a heuristic that is commonly used, and can be calculated instantly. In this example, all three approaches yielded similar results, but we might have been justified in choosing anything between PC 7-12 as a cutoff.

VizDimReduction

The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example.

# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)

## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7


# 可視化 VizDimReduction PCA results
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

DimPlot

The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff.

DimPlot(pbmc, reduction = "pca")

3.1.2 Determine the ‘dimensionality’ of the dataset

彎頭圖（elbow plot）直觀地顯示了每個PC的標(biāo)準(zhǔn)差，我們要尋找標(biāo)準(zhǔn)差開始趨于平穩(wěn)的地方媳板。從本質(zhì)上說桑腮，彎頭出現(xiàn)的地方通常是識別大多數(shù)變異的閾值。然而蛉幸，這種方法可能是相當(dāng)主觀的破讨。

根據(jù)這個圖，我們可以通過PC8-PC10附近的肘部發(fā)生的位置來大致確定大部分的變化奕纫，或者可以認(rèn)為應(yīng)該是數(shù)據(jù)點開始接近X軸的時候提陶，PC30左右。

ElbowPlot(pbmc)

step4. 聚類細(xì)胞 cluster the cell

Seurat使用了一種基于圖的聚類方法匹层，它將細(xì)胞嵌入到一個圖結(jié)構(gòu)中隙笆，使用K-近鄰（KNN）圖（默認(rèn)情況下），并在具有相似基因表達模式的細(xì)胞之間畫出邊緣升筏。然后撑柔，它試圖將這個圖分割成高度相互關(guān)聯(lián)的 "準(zhǔn)聚類 "或 "群落"[Seurat-Guided-Clustering-Tutorial]。

我們將使用FindClusters()函數(shù)來執(zhí)行基于圖的聚類您访。resolution是一個重要的參數(shù)铅忿，它設(shè)置了下行聚類的 "粒度 (granularity)"，需要對每個單獨的實驗進行優(yōu)化灵汪。對于3,000-5,000個細(xì)胞的數(shù)據(jù)集檀训，resolution設(shè)置在0.4-1.4之間，一般可以獲得良好的聚類效果享言。分辨率的增加會導(dǎo)致更多的聚類肢扯，這通常是較大的數(shù)據(jù)集所需要的。

FindClusters()函數(shù)允許我們輸入一系列的分辨率担锤，并將計算出聚類的 "粒度"蔚晨。這便于測試哪種分辨率對于分群更有幫助，而不需要為每個分辨率分別運行函數(shù)。

# 確定k-近鄰圖
seurat_integrated <- FindNeighbors(object = seurat_integrated, 
                                dims = 1:40)
                                
# 確定聚類的不同分辨率                               
seurat_integrated <- FindClusters(object = seurat_integrated,
                               resolution = c(0.4, 0.6, 0.8, 1.0, 1.4))

# 如果我們看一下Seurat對象的元數(shù)據(jù)(seurat_integrated@metadata)铭腕，每一個不同的分辨率都有一個單獨的列來計算银择。
# 探索分辨率
seurat_integrated@meta.data %>% 
        View()

# 我們通常會選擇一個中間范圍的分辨率開始，比如0.6或0.8累舷。我們將通過使用Idents()函數(shù)賦予聚類身份浩考，從0.8的分辨率開始。
# 賦予聚類的身份
Idents(object = seurat_integrated) <- "integrated_snn_res.0.8"

# 舉例
# ----------------
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 96033
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds

# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)

## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##                1                3                1                2 
## AAACCGTGTATGCG-1 
##                6 
## Levels: 0 1 2 3 4 5 6 7 8

step5.Run non-linear dimensional reduction (UMAP/tSNE)

為了使細(xì)胞類群可視化被盈，有一些不同的降維技術(shù)可以使用析孽。最流行的方法包括t分布隨機鄰域嵌入（t-SNE）和統(tǒng)一模態(tài)近似與投影（UMAP）技術(shù)。

這兩種方法的目的是將高維空間中具有相似局部鄰域的細(xì)胞一起放置在低維空間中只怎。這兩種方法都需要輸入PCA維度的數(shù)量來進行可視化袜瞬，我們建議使用相同數(shù)量的PC作為聚類分析的輸入。這里身堡，我們將用UMAP方法來進行可視化聚類分析邓尤。

# umap
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, 
        reduction = "umap", 
        label = TRUE,
        label.size = 6)

# tsne
pbmc <- RunTSNE(pbmc, features = VariableFeatures(object = pbmc))
DimPlot(pbmc, reduction = "tsne")

這對于研究其他分辨率也是很有用的。它可以讓你快速了解基于分辨率參數(shù)的類群會有怎樣的變化贴谎。例如汞扎，讓我們換成0.4的分辨率。

# 分配類群的身份
Idents(object = seurat_integrated) <- "integrated_snn_res.0.4"

# 繪制UMAP圖
DimPlot(seurat_integrated,
        reduction = "umap",
        label = TRUE,
        label.size = 6)

# 保存為.RData
saveRDS(pbmc, file = "../output/pbmc_tutorial.rds")

PARTII marker define cluster

step0. Find clusters vs. each other, or against all cells.

By default, it identifes positive and negative markers of a single cluster (specified in ident.1), compared to all other cells.

FindAllMarkers automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.

The min.pct argument requires a feature to be detected at a minimum percentage in either of the two groups of cells.

# find all markers of cluster 1 (vs. all other cluster by default)
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)

# p_val avg_logFC   pct.1   pct.2   p_val_adj
# IL32  0   0.8373872   0.948   0.464   0
# LTB   0   0.8921170   0.981   0.642   0
# CD3D  0   0.6436286   0.919   0.431   0
# IL7R  0   0.8147082   0.747   0.325   0
# LDHB  0   0.6253110   0.950   0.613   0

# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)

# p_val avg_logFC   pct.1   pct.2   p_val_adj
# FCGR3A    0   2.963144    0.975   0.037   0
# IFITM3    0   2.698187    0.975   0.046   0
# CFD       0   2.362381    0.938   0.037   0
# CD68  0   2.087366    0.926   0.036   0
# RP11-290F20.3 0   1.886288    0.840   0.016   0

# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)

step1. visualizing marker expression

VlnPlot (shows expression probability distributions across clusters), and FeaturePlot (visualizes feature expression on a tSNE or PCA plot) are our most commonly used visualizations. We also suggest exploring RidgePlot, CellScatter, and DotPlot as additional methods to view your dataset.

1.1 VinPlot the expression level

# you can plot avg_logFC
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

1.2 FeaturePlot the distribution

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",  "CD8A"))

1.3 DoHeatmap for given cells and features

# In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

step2. 根據(jù)找到的marker/canonical markers 導(dǎo)出csv

##找到每個cluster的marker
markers <- FindMarkers(pbmc, ident.1 = 0,  only.pos = TRUE,min.pct = 0.25)
write.table(markers, file="/Users/apple/Desktop/Result/markers_0.tsv", sep="\t", quote=FALSE, col.names=NA)

step3. 根據(jù)marker.csv 畫圖

##直接根據(jù)marker畫圖

table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[1:4,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_1.png", a)

table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[8:14,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_2.png", a)

table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[15:21,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_3.png", a)
table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[22:28,1])
plots <- VlnPlot(pbmc, features = genes, group.by = "seurat_clusters", pt.size = 0, combine = FALSE)
a <- CombinePlots(plots = plots, ncol = 1)
ggsave("violin_test_4.png", a)

a <- FeaturePlot(pbmc, reduction = "tsne",features = "Cidec", cols = c("grey", "red"))
ggsave("genemap_1.png", a)

step4. 根據(jù)不同的 cluster 指定細(xì)胞類型 Assigning cell type identity to clusters

##標(biāo)記
table <- read.table ( '/Users/apple/Desktop/Rwork/test1.csv', header = FALSE)
genes <- as.vector(table[1:8,1])
new.cluster.ids <- genes
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
a<-DimPlot(pbmc, reduction = "tsne", label = TRUE, pt.size = 0.5) + NoLegend()
ggsave("tsne.png", a)

# or 直接按順序給cluster0 -> 8 指定細(xì)胞類型
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", 
    "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

# 存儲為RDS
saveRDS(pbmc, file = "../output/pbmc3k_final.rds")