Seurat安裝
想要安裝舊版本(目前是3.X门驾,與2.X有較大差別)
install.packages('devtools')
# 替換成2.3.0版本
>devtools::install_version(package = 'Seurat', version = package_version('2.3.0'))
> library(Seurat)
Seurat2.X與Seurat3.X妄圖切換(湊字?jǐn)?shù)警告旭贬,并沒(méi)有成功佃延,我安裝了個(gè)寂寞)
原路徑
.libPaths()
[1] "C:/Users/kongrui/Documents/R/win-library/4.0"
[2] "C:/Program Files/R/R-4.0.0/library"
#安裝Seurat2在D盤(pán)
.libPaths("D:/Program Files/R/R-4.0.0/library2")
install.packages("devtools")
require(devtools)
.libPaths("D:/Program Files/R/R-4.0.0/library2")
require(devtools)
devtools::install_version(package = 'Seurat', version = package_version('2.3.0'))
幾個(gè)概念
- UMI (Unique Molecular Identifier)是一段固定長(zhǎng)度的隨機(jī)小片段用含,用以區(qū)別不同的PCR擴(kuò)增模板
- barcode標(biāo)記不同的樣品或者細(xì)胞
- sparse-matrix representation:稀疏矩陣的概念羊始,簡(jiǎn)單理解在Seurat中基因表達(dá)量為0者以
.
表示,可節(jié)省空間
Seurat2.X
-
counts
矩陣進(jìn)來(lái)后被包裝為對(duì)象笙纤,方便操作耗溜。 - 然后一定要經(jīng)過(guò)
NormalizeData
和ScaleData
的操作 - 函數(shù)
FindVariableGenes
可以挑選適合進(jìn)行下游分析的基因集。 - 函數(shù)
RunPCA
和RunTSNE
進(jìn)行降維 - 函數(shù)
FindClusters
直接就分群了省容,非常方便 函數(shù)FindAllMarkers
可以對(duì)分群后各個(gè)亞群找標(biāo)志基因抖拴。 - 函數(shù)
FeaturePlot
可以展示不同基因在所有細(xì)胞的表達(dá)量 - 函數(shù)
VlnPlot
可以展示不同基因在不同分群的表達(dá)量差異情況 函數(shù)DoHeatmap
可以選定基因集后繪制熱圖
Seurat3.X
- 與V2版本區(qū)別初了解
- 函數(shù)名稱的變化
- genes —>features
- v2的nUMI和nGene分別改為了nCount_RNA、nFeature_RNA
- SubsetData函數(shù) —> subset函數(shù)
- Seurat采用的是graph-based聚類方法腥椒,k-means方法在V3中不存在了
- 可視化降維 - UMAP(V3新出)
非線性降維——這個(gè)目的是為了可視化阿宅,而不是特征提取(PCA)笼蛛,雖然它也可以用來(lái)做特征提燃叶帷;UMAP方法更加緊湊——在降維圖上伐弹,同一cluster離得更近拉馋,不同cluster離得更遠(yuǎn),作為一種后來(lái)的算法有一定的優(yōu)點(diǎn)惨好,但是t-SNE結(jié)構(gòu)也能很好地反映cluster的空間結(jié)構(gòu)
V3.X Workflow
pbmc.counts <- Read10X(data.dir = "~/Downloads/pbmc3k/filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc.counts)
pbmc <- NormalizeData(object = pbmc)
pbmc <- FindVariableFeatures(object = pbmc)
pbmc <- ScaleData(object = pbmc)
pbmc <- RunPCA(object = pbmc)
pbmc <- FindNeighbors(object = pbmc)
pbmc <- FindClusters(object = pbmc)
pbmc <- RunTSNE(object = pbmc)
DimPlot(object = pbmc, reduction = "tsne")
3.X版本
-
CreateSeuratObject() Setup the Seurat Object
Arguments
counts:Unnormalized data such as raw counts or TPMs未歸一化的原始數(shù)據(jù)
project:Sets the project name for the Seurat object
assay:Name of the assay corresponding to the initial input data
min.cells:Include features detected in at least this many cells. Will subset the counts matrix as well. To reintroduce excluded features, create a new object with a lower cutoff. (設(shè)定閾值/門(mén)檻標(biāo)準(zhǔn))每個(gè)基因至少要在三個(gè)細(xì)胞中有表達(dá)
min.features:Include cells where at least this many features are detected.每個(gè)細(xì)胞中至少可檢測(cè)到200個(gè)基因(示例數(shù)據(jù)中的過(guò)濾條件煌茴,具體可見(jiàn)下文或官網(wǎng)教程)
meta.data:Additional cell-level metadata to add to the Seurat object. Should be a data frame where the rows are cell names and the columns are additional metadata fields.(可為臨床數(shù)據(jù)或其他數(shù)據(jù),信息的信息日川,數(shù)據(jù)的數(shù)據(jù)---屬性等蔓腐,要求行為細(xì)胞、列為信息的數(shù)據(jù)框龄句,示例數(shù)據(jù)中為默認(rèn)meta.data = NULL) -
Standard pre-processing workflow:selection and filtration of cells data—— normalization and scaling—— detection of highly variable features
2.1 QC and selecting cells for further analysis
2.11PercentageFeatureSet():
Calculate the percentage of all counts that belong to a given set of features
相當(dāng)于增加一個(gè)亞組回论,示例數(shù)據(jù)中為線粒體基因(scRNA-seq常用指標(biāo)),即計(jì)算每個(gè)細(xì)胞中線粒體基因所占的百分比
PercentageFeatureSet(object, pattern = NULL, features = NULL, col.name = NULL, assay = NULL)
eg:pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
注意MT為大寫(xiě)分歇,通配符“^”表示查找行首符合指定內(nèi)容的行傀蓉,這里為MT。
2.12可視化相關(guān)待過(guò)濾的指標(biāo)
VlnPlot()
nCount_RNA
列:每個(gè)細(xì)胞的UMI數(shù)量职抡,詳細(xì)數(shù)據(jù)存貯于metadata中
nFeature_RNA
列:每個(gè)細(xì)胞的基因數(shù)葬燎,詳細(xì)數(shù)據(jù)存貯于metadata中
FeatureScatter()
展示相關(guān)特征之間的關(guān)系,如UMI和基因數(shù)的關(guān)系缚甩,UMI和線粒體百分比
subset() (過(guò)濾函數(shù)谱净,上文說(shuō)過(guò)不同于2.X版本的)
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
將基因count數(shù)超過(guò)2500 & 小于200 & 線粒體占比5%以上的細(xì)胞過(guò)濾掉
參數(shù)選擇,由上圖 QC1可見(jiàn)擅威,nFeature_RNA基本在1000-2000之間壕探,2500及以上可能是測(cè)序過(guò)程中將2個(gè)細(xì)胞算為1個(gè)細(xì)胞,所以Feature_RNA偏離主體郊丛,所以選用2500(可根據(jù)自己數(shù)據(jù)真實(shí)情況調(diào)整摸索)李请。同樣的派继,MT基本集中在5%以下,將偏離主體的刪除掉捻艳。
***以上操作均為標(biāo)準(zhǔn)化前的簡(jiǎn)單準(zhǔn)備工作,去掉偏離的細(xì)胞庆猫。
2.2 Normalizing the data
默認(rèn)方法: “LogNormalize” ,結(jié)果存儲(chǔ)在pbmc[["RNA"]]@data
log(表達(dá)量 X “scale factor (10,000 by default)”)
2.3 Identification of highly variable features (feature selection)
尋找cell to cell varaition 基因(Features that are highly expressed in some cells, and lowly expressed in others)
FindVariableFeatures()函數(shù)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
By default, we return 2,000 features per dataset.默認(rèn)每個(gè)dataset返回2000個(gè)基因
2.4 Scaling the data
The results of this are stored in pbmc[["RNA"]]@scale.data
線性轉(zhuǎn)化使得基因在細(xì)胞的平均表達(dá)值為0认轨,細(xì)胞間的差異值為1,主要為了降低高表達(dá)基因在下游分析中所占權(quán)重過(guò)大月培,掩蓋其他基因間的差異(例如熱圖嘁字,有一個(gè)基因表達(dá)值極高,導(dǎo)致繪制的熱圖出現(xiàn)除這個(gè)基因?yàn)榧t色外其他全部為低表達(dá)顏色藍(lán)色杉畜,掩蓋差異纪蜒,無(wú)法進(jìn)行進(jìn)一步分析)
pbmc <- ScaleData(pbmc, features = all.genes)
features = Vector of features names to scale/center. Default is variable features.
pbmc <- ScaleData(pbmc, vars.to.regress = c("percent.mt",nCount_RNA))
刪除不必要技術(shù)噪音影響:vars.to.regress
:Variables to regress out (previously latent.vars in RegressOut). For example, nUMI, or percent.mito. -
Perform linear dimensional reduction線性降維
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
3.1 PCA后可視化
VizDimLoadings()
Visualize top genes associated with reduction components
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
VizDimLoadings(object, dims = 1:5, nfeatures = 30, col = "blue", reduction = "pca", projected = FALSE, balanced = FALSE, ncol = NULL, combine = TRUE)
dims:Number of dimensions to display 展示維度的數(shù)量
nfeatures:Number of genes to display
VizDimLoadings(pbmc, dims = 1:3, nfeatures = 10, col = "pink", reduction = "pca")
改變參數(shù)后繪圖
DimPlot(pbmc, reduction = "pca")
DimHeatmap()
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
dims = 1 只畫(huà)第一個(gè)主成分
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
DimHeatmap's Arguments
object Seurat object
dims Dimensions to plot
nfeatures Number of genes to plot(default 30)
cells A list of cells to plot. If numeric, just plots the top cells.
3.2 Determine the ‘dimensionality’ of the dataset如何根據(jù)PCA值確定維度對(duì)細(xì)胞進(jìn)行分類
方法1:JackStraw()函數(shù)
In this case it appears that there is a sharp drop-off in significance after the first 10-12 PCs.(PC13較之前的PC,p-value值有明顯波動(dòng))
方法2:ElbowPlot()函數(shù)
此處一般可選10此叠,即到PC10基本可以涵蓋數(shù)據(jù)全貌纯续,類似于WGCNA的β值 -
Cluster the cells
細(xì)胞聚類
Seurat v3應(yīng)用graph-based clustering方法,官方稱其“diatance metric”距離度量方法較之前有較大進(jìn)步灭袁,被很多人應(yīng)用巴拉猬错。。茸歧。(https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html)官網(wǎng)曰這種方法將細(xì)胞嵌入到一種圖形結(jié)構(gòu)中(例如 K-nearest neighbor (KNN) graph)倦炒,這種圖形會(huì)將具有相似的基因表達(dá)屬性的細(xì)胞們繪制到一個(gè)社區(qū)內(nèi)。
graph的繪制其實(shí)都是基于一定的統(tǒng)計(jì)學(xué)算法软瞎,例如KNNgraph是基于euclidean distance算法逢唤,并且重新定義細(xì)胞與細(xì)胞直接的連線(想象一下如果可視化的話聯(lián)系的緊密程度可以簡(jiǎn)單畫(huà)個(gè)線,用線的粗細(xì)表示相應(yīng)距離的遠(yuǎn)近涤浇,或者社區(qū)內(nèi)的細(xì)胞之間的重疊度大小表示緊密程度)鳖藕,PS:各種算法解釋其實(shí)一定程度上可以參考dist()函數(shù),如果統(tǒng)計(jì)及數(shù)學(xué)足夠好的話可以深入研究下只锭;上述工作可用FindNeighbors()函數(shù)實(shí)現(xiàn)
FindClusters()函數(shù):完成細(xì)胞聚類
應(yīng)用模塊優(yōu)化方法更好的完成細(xì)胞聚類吊奢,其中一個(gè)重要的參數(shù)resolution一般3K左右細(xì)胞resolution參數(shù)值可選擇0.4-1.2之間(根據(jù)數(shù)據(jù)調(diào)整,不斷調(diào)試)纹烹。應(yīng)用Idents函數(shù)可以查看細(xì)胞被分到哪個(gè)組页滚。 -
Run non-linear dimensional reduction (UMAP/tSNE)
結(jié)果好像比tSNE聚的更緊湊 - 下游分析:尋找差異表達(dá)基因
6.1 FindMarkers()函數(shù)
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
ident.1第一群細(xì)胞的marker基因,并沒(méi)有給出ident.2參數(shù)铺呵,說(shuō)明是與其他所有組相比
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
ident.2 = c(0,3)說(shuō)明是將第五群與第3及0群相比
將已分類好的群再次進(jìn)行分類裹驰,要靈活掌握該函數(shù)的參數(shù)了或者強(qiáng)行進(jìn)行亞分群后再走一遍Seurat流程。參考大神講解https://cloud.tencent.com/developer/article/1621989
6.2 FindAllMarkers()
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
差異檢驗(yàn):
cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE) ; head(cluster1.markers)
應(yīng)用ROC方法檢驗(yàn)片挂,返回值“classification power”(范圍0~1之間幻林,越接近于1效果越好)
可視化
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
整體代碼
rm(list = ls())
packageVersion("Seurat")
library(Seurat)
library(dplyr)
library(patchwork)
library(ggsci)###改變一下配色
#導(dǎo)入PBMC數(shù)據(jù)贞盯,用Read10×這個(gè)參數(shù)
pbmc.data <- Read10X(data.dir = "D:pbmc3k_filtered_gene_bc_matrices/filtered_gene_bc_matrices/hg19")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
?CreateSeuratObject
pbmc
#查看pbmc.data某個(gè)基因在前三十個(gè)細(xì)胞中的表達(dá)
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
# 把表達(dá)矩陣?yán)锏木€粒體QC單獨(dú)存放在Seurat對(duì)象里,等于添加一列到metadata的對(duì)象里
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
head(pbmc@meta.data)
#展示數(shù)據(jù)中每個(gè)細(xì)胞UMI沪饺,基因數(shù)即線粒體比例
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
#展示相關(guān)特征之間的關(guān)系
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
#將unique基因count數(shù)超過(guò)2500躏敢,或者小于200的細(xì)胞過(guò)濾掉,把線粒體count數(shù)占5%以上的細(xì)胞過(guò)濾掉
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
#標(biāo)準(zhǔn)化:表達(dá)量*10000后取log整葡,結(jié)果存儲(chǔ)在pbmc[["RNA"]]@data
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
#選取高變異基因
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
top10 <- head(VariableFeatures(pbmc), 10)
?FindVariableFeatures
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
#scale線性轉(zhuǎn)換
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")
?ScaleData
#降維 scale后PCA
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
?VizDimLoadings()
DimPlot(pbmc, reduction = "pca")
?DimPlot
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
?DimHeatmap
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
JackStrawPlot(pbmc, dims = 1:15)
?JackStraw
?FindClusters
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
head(Idents(pbmc), 5)
#降維可視化
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")
saveRDS(pbmc, file = "pbmc_tutorial.rds")
#差異表達(dá)基因
# find all markers of cluster 1
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)
?FindMarkers
markers <- FindMarkers(pbmc_small, ident.1 = "g1", group.by = 'groups', subset.ident = "2")
head(x = markers)
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
head(cluster1.markers)
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
saveRDS(pbmc, file = "pbmc3k_final.rds")