Single cell RNA-seq analysis workshop
該系列課程由瑞典國家生物信息學基礎(chǔ)設(shè)施(NBIS) 負責設(shè)置,為瑞典Phd學生蜕青,博士后,研究人員和瑞典學術(shù)界的其他工作人員開設(shè)的全國性研討會糊渊。
本課程將通過演講和實戰(zhàn)練習涵蓋了單細胞轉(zhuǎn)錄組(scRNA-seq)數(shù)據(jù)分析的基本步驟右核。涵蓋的主題包括:
- An overview of the current scRNAseq technologies(當前常用scRNA-seq技術(shù)概述)
- Basic overview of pipelines for processing raw reads into expression values(原始測序數(shù)據(jù)比對轉(zhuǎn)換為基因表達矩陣的分析流程概述)
- Quality control of scRNAseq data(scRNA-seq數(shù)據(jù)的質(zhì)量控制)
- Dimensionality reduction and clustering techniques(數(shù)據(jù)降維和聚類分析技術(shù))
- Data normalization(數(shù)據(jù)歸一化處理)
- Differential gene expression for scRNAseq data(scRNA-seq數(shù)據(jù)的差異基因表達分析)
- Celltype prediction(細胞類型預測)
- Trajectory analysis(細胞軌跡擬時分析)
- Comparison of different analysis pipelines such as Seurat, Scran and Scanpy(不同分析工具流程的比較,如Seurat渺绒,Scran和Scanpy)
實戰(zhàn)練習
在該實戰(zhàn)練習中贺喝,我們使用了3種常用的scRNA-seq分析工具流程(Seurat菱鸥,Scran和Scanpy),分別提供了不同分析工具的簡短教程躏鱼,我們可以任選自己熟悉的工具進行scRNA-seq數(shù)據(jù)分析氮采。原則上,我們對這三種不同的分析工具執(zhí)行相同的步驟染苛,但是仍會存在一些細微的差異鹊漠,因為并非在所有的分析工具中實現(xiàn)了所有不同的方法。
在本實戰(zhàn)中茶行,我選用了scRNA-seq數(shù)據(jù)分析中最常用的Seurat包進行練習演示躯概。
第一節(jié):數(shù)據(jù)質(zhì)控實戰(zhàn)
下載示例數(shù)據(jù)
在本教程中,我們將使用來自3個covid-19患者和3個健康對照的6個PBMC的10x數(shù)據(jù)集運行所有的教程畔师,每個樣本已被二次采樣為1500個細胞娶靡。
# 新建數(shù)據(jù)存放的文件夾
mkdir -p data/raw
# first check if the files are there
count=$(ls -l data/raw/*.h5 | grep -v ^d | wc -l )
echo $count
# if not 4 files, fetch the files from github.
# 下載示例數(shù)據(jù)
if (("$count" < 6)); then
cd data/raw
curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/Normal_PBMC_13.h5
curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/Normal_PBMC_14.h5
curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/Normal_PBMC_5.h5
curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/nCoV_PBMC_15.h5
curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/nCoV_PBMC_17.h5
curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/nCoV_PBMC_1.h5
cd ../..
fi
# 查看文件大小
ls -lGa data/raw
安裝并加載所需的R包
# 安裝所需的R包
install.packages('Seurat')
install.packages('Matrix')
remotes::install_github("chris-mcginnis-ucsf/DoubletFinder", upgrade = F)
# 加載所需的R包
suppressMessages(require(Seurat))
suppressMessages(require(Matrix))
suppressMessages(require(DoubletFinder))
讀取示例數(shù)據(jù)并構(gòu)建seuart對象
# 使用Read10X_h5函數(shù)讀取h5文件
cov.15 <- Seurat::Read10X_h5(filename = "data/raw/nCoV_PBMC_15.h5", use.names = T)
cov.1 <- Seurat::Read10X_h5(filename = "data/raw/nCoV_PBMC_1.h5", use.names = T)
cov.17 <- Seurat::Read10X_h5(filename = "data/raw/nCoV_PBMC_17.h5", use.names = T)
ctrl.5 <- Seurat::Read10X_h5(filename = "data/raw/Normal_PBMC_5.h5", use.names = T)
ctrl.13 <- Seurat::Read10X_h5(filename = "data/raw/Normal_PBMC_13.h5", use.names = T)
ctrl.14 <- Seurat::Read10X_h5(filename = "data/raw/Normal_PBMC_14.h5", use.names = T)
# 查看數(shù)據(jù)信息
dim(cov.1)
#[1] 33538 1500
dim(cov.15)
#[1] 33538 1500
dim(cov.17)
#[1] 33538 1500
dim(ctrl.5)
#[1] 33538 1500
dim(ctrl.13)
#[1] 33538 1500
dim(ctrl.14)
#[1] 33538 1500
# 使用CreateSeuratObject函數(shù)構(gòu)建seurat對象
sdata.cov15 <- CreateSeuratObject(cov.15, project = "covid_15")
sdata.cov1 <- CreateSeuratObject(cov.1, project = "covid_1")
sdata.cov17 <- CreateSeuratObject(cov.17, project = "covid_17")
sdata.ctrl5 <- CreateSeuratObject(ctrl.5, project = "ctrl_5")
sdata.ctrl13 <- CreateSeuratObject(ctrl.13, project = "ctrl_13")
sdata.ctrl14 <- CreateSeuratObject(ctrl.14, project = "ctrl_14")
# 添加樣本分組信息
sdata.cov1$type = "Covid"
sdata.cov15$type = "Covid"
sdata.cov17$type = "Covid"
sdata.ctrl5$type = "Ctrl"
sdata.ctrl13$type = "Ctrl"
sdata.ctrl14$type = "Ctrl"
# 查看seurat對象
sdata.cov1
# An object of class Seurat
# 33538 features across 1500 samples within 1 assay
# Active assay: RNA (33538 features, 0 variable features)
sdata.cov15
# An object of class Seurat
# 33538 features across 1500 samples within 1 assay
# Active assay: RNA (33538 features, 0 variable features)
sdata.cov17
# An object of class Seurat
# 33538 features across 1500 samples within 1 assay
# Active assay: RNA (33538 features, 0 variable features)
sdata.ctrl5
# An object of class Seurat
# 33538 features across 1500 samples within 1 assay
# Active assay: RNA (33538 features, 0 variable features)
sdata.ctrl13
# An object of class Seurat
# 33538 features across 1500 samples within 1 assay
# Active assay: RNA (33538 features, 0 variable features)
sdata.ctrl14
# An object of class Seurat
# 33538 features across 1500 samples within 1 assay
# Active assay: RNA (33538 features, 0 variable features)
合并所有樣本的數(shù)據(jù)
# Merge datasets into one single seurat object
# 合并所有樣本
alldata <- merge(x = sdata.cov15,
y = c(sdata.cov1, sdata.cov17, sdata.ctrl5,
sdata.ctrl13, sdata.ctrl14),
add.cell.ids = c("covid_15", "covid_1", "covid_17",
"ctrl_5", "ctrl_13", "ctrl_14"))
# remove all objects that will not be used.
# 刪除不用的數(shù)據(jù)對象
rm(cov.15, cov.1, cov.17,
ctrl.5, ctrl.13, ctrl.14,
sdata.cov15, sdata.cov1, sdata.cov17,
sdata.ctrl5, sdata.ctrl13, sdata.ctrl14)
# run garbage collect to free up memory
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 11991984 640.5 22255031 1188.6 22255031 1188.6
# Vcells 96687060 737.7 169043189 1289.7 165738512 1264.5
# 查看數(shù)據(jù)
alldata
# An object of class Seurat
# 33538 features across 9000 samples within 1 assay
# Active assay: RNA (33538 features, 0 variable features)
as.data.frame(alldata@assays$RNA@counts[1:10, 1:2])
# covid_15_CTCCATGTCAACGTGT-15 covid_15_CATAAGCAGGAACGAA-15
#MIR1302-2HG 0 0
#FAM138A 0 0
#OR4F5 0 0
#AL627309.1 0 0
#AL627309.3 0 0
#AL627309.2 0 0
#AL627309.4 0 0
#AL732372.1 0 0
#OR4F29 0 0
#AC114498.1 0 0
head(alldata@meta.data, 10)
# orig.ident nCount_RNA nFeature_RNA type
#covid_15_CTCCATGTCAACGTGT-15 covid_15 14911 3526 Covid
#covid_15_CATAAGCAGGAACGAA-15 covid_15 338 203 Covid
#covid_15_TTCACCGTCAGGAAGC-15 covid_15 28486 4542 Covid
#covid_15_CGTCCATGTCCGGACT-15 covid_15 1318 539 Covid
#covid_15_GTCCACTAGTCGCCCA-15 covid_15 4805 1493 Covid
#covid_15_ATCCATTGTTGATGTC-15 covid_15 5386 1617 Covid
#covid_15_AGAAGCGAGGGCCTCT-15 covid_15 686 407 Covid
#covid_15_GAGGGTAGTAGGTTTC-15 covid_15 2155 1116 Covid
#covid_15_CAAGACTTCTGCTTTA-15 covid_15 1216 128 Covid
#covid_15_GCCAACGAGCTCTATG-15 covid_15 729 356 Covid
計算數(shù)據(jù)質(zhì)量控制指標
- 計算線粒體相關(guān)基因含量
# 計算線粒體相關(guān)基因百分比
# Way1: Doing it using Seurat function
# 使用PercentageFeatureSet函數(shù)計算線粒體基因含量
alldata <- PercentageFeatureSet(alldata, "^MT-", col.name = "percent_mito")
# Way2: Doing it manually
# 手動計算線粒體基因含量
total_counts_per_cell <- colSums(alldata@assays$RNA@counts)
head(total_counts_per_cell)
#covid_15_CTCCATGTCAACGTGT-15 covid_15_CATAAGCAGGAACGAA-15
# 14911 338
#covid_15_TTCACCGTCAGGAAGC-15 covid_15_CGTCCATGTCCGGACT-15
# 28486 1318
#covid_15_GTCCACTAGTCGCCCA-15 covid_15_ATCCATTGTTGATGTC-15
# 4805 5386
# 獲取線粒體相關(guān)基因
mito_genes <- rownames(alldata)[grep("^MT-", rownames(alldata))]
head(mito_genes)
# [1] "MT-ND1" "MT-ND2" "MT-CO1" "MT-CO2" "MT-ATP8" "MT-ATP6"
# 計算線粒體基因含量
alldata$percent_mito <- colSums(alldata@assays$RNA@counts[mito_genes, ])/total_counts_per_cell
- 計算核糖體相關(guān)基因含量
# Way1: Doing it using Seurat function
# 使用PercentageFeatureSet函數(shù)計算核糖體基因含量
alldata <- PercentageFeatureSet(alldata, "^RP[SL]", col.name = "percent_ribo")
# Way2: Doing it manually
# 手動計算核糖體相關(guān)基因含量
ribo_genes <- rownames(alldata)[grep("^RP[SL]", rownames(alldata))]
head(ribo_genes, 10)
#[1] "RPL22" "RPL11" "RPS6KA1" "RPS8" "RPL5" "RPS27" "RPS6KC1"
#[8] "RPS7" "RPS27A" "RPL31"
alldata$percent_ribo <- colSums(alldata@assays$RNA@counts[ribo_genes, ])/total_counts_per_cell
- 計算血紅蛋白相關(guān)基因含量
# Percentage hemoglobin genes - includes all genes starting with HB except HBP.
alldata <- PercentageFeatureSet(alldata, "^HB[^(P)]", col.name = "percent_hb")
數(shù)據(jù)質(zhì)量控制指標的可視化
feats <- c("nFeature_RNA", "nCount_RNA", "percent_mito", "percent_ribo", "percent_hb")
# 使用VlnPlot函數(shù)繪制小提琴圖可視化相關(guān)質(zhì)控指標
VlnPlot(alldata, group.by = "orig.ident", features = feats, pt.size = 0.1, ncol = 3) +
NoLegend()
# 使用FeatureScatter函數(shù)繪制散點圖
FeatureScatter(alldata, "nCount_RNA", "nFeature_RNA", group.by = "orig.ident", pt.size = 0.5)
數(shù)據(jù)過濾篩選
- 基于檢測的過濾
在這里,我們將僅考慮至少檢測到200個基因的細胞看锉,并且至少需要在3個細胞中表達這些基因固蛾。請注意,這些值在很大程度上取決于所使用的單細胞文庫制備方法度陆。
# 過濾至少檢測到200個基因的細胞
selected_c <- WhichCells(alldata, expression = nFeature_RNA > 200)
# 過濾至少在3個細胞中能檢測到的基因
selected_f <- rownames(alldata)[Matrix::rowSums(alldata) > 3]
data.filt <- subset(alldata, features = selected_f, cells = selected_c)
dim(data.filt)
## [1] 18147 7973
注意:檢測到極高數(shù)量基因的細胞可能一些doublet。但是献幔,根據(jù)樣本中細胞類型的組成懂傀,也可能會存在一種具有更多基因(以及更高計數(shù))的細胞。在這種情況下蜡感,我們將需要進一步進行doublet預測蹬蚁。
此外,我們還可以看看哪些基因?qū)@類reads貢獻最大郑兴。例如犀斋,我們可以繪制每個基因計數(shù)的百分比。
# Compute the relative expression of each gene per cell Use sparse matrix
# operations, if your dataset is large, doing matrix devisions the regular way
# will take a very long time.
par(mar = c(4, 8, 2, 1))
C <- data.filt@assays$RNA@counts
C <- Matrix::t(Matrix::t(C)/Matrix::colSums(C)) * 100
most_expressed <- order(apply(C, 1, median), decreasing = T)[20:1]
boxplot(as.matrix(t(C[most_expressed, ])), cex = 0.1, las = 1, xlab = "% total count per cell",
col = (scales::hue_pal())(20)[20:1], horizontal = TRUE)
如圖所示情连,我們可以看到MALAT1基因占單個細胞UMI的30%叽粹,其他最重要的基因是線粒體和核糖體相關(guān)基因。核內(nèi)lincRNA與數(shù)據(jù)質(zhì)量和線粒體基因reads相關(guān)聯(lián)是非常普遍的却舀,因此對MALAT1的高檢測可能是一個技術(shù)問題虫几。我們可以收集有關(guān)此類基因的一些信息,這些信息對于質(zhì)量控制和下游過濾是非常重要挽拔。
- 基與線粒體/核糖體基因含量進行過濾
我們可以通過數(shù)據(jù)質(zhì)控小提琴圖辆脸,選擇過濾線粒體或核糖體基因含量相關(guān)的閾值。在此例中螃诅,我們選擇過濾掉線粒體讀數(shù)低于20%啡氢,核糖體讀數(shù)少于5%的細胞状囱。
selected_mito <- WhichCells(data.filt, expression = percent_mito < 0.2)
selected_ribo <- WhichCells(data.filt, expression = percent_ribo > 0.05)
# and subset the object to only keep those cells
data.filt <- subset(data.filt, cells = selected_mito)
data.filt <- subset(data.filt, cells = selected_ribo)
dim(data.filt)
## [1] 18147 5762
table(data.filt$orig.ident)
## covid_1 covid_15 covid_17 ctrl_13 ctrl_14 ctrl_5
## 878 585 1042 1154 1063 1040
繪制過濾后的質(zhì)量控制圖
讓我們再次繪制相同的QC統(tǒng)計信息。
feats <- c("nFeature_RNA", "nCount_RNA", "percent_mito", "percent_ribo", "percent_hb")
VlnPlot(data.filt, group.by = "orig.ident", features = feats, pt.size = 0.1, ncol = 3) +
NoLegend()
過濾篩選相關(guān)基因
由于線粒體和MALAT1等基因的高表達被認為主要是技術(shù)性的倘是,因此在進行任何下一步分析之前亭枷,將其從數(shù)據(jù)集中刪除是明智的。
dim(data.filt)
## [1] 18147 5762
# Filter MALAT1
# 過濾MALAT1基因
data.filt <- data.filt[!grepl("MALAT1", rownames(data.filt)), ]
# Filter Mitocondrial
# 過濾線粒體基因
data.filt <- data.filt[!grepl("^MT-", rownames(data.filt)), ]
# Filter Ribossomal gene (optional if that is a problem on your data) data.filt
# 過濾核糖體基因
# <- data.filt[ ! grepl('^RP[SL]', rownames(data.filt)), ]
# Filter Hemoglobin gene (optional if that is a problem on your data)
# 過濾血紅蛋白基因
data.filt <- data.filt[!grepl("^HB[^(P)]", rownames(data.filt)), ]
dim(data.filt)
## [1] 18121 5762
查看樣本的性別信息
在處理人類或動物樣本數(shù)據(jù)時辨绊,理想情況下奶栖,應將實驗設(shè)置為單性別,以避免在結(jié)論中出現(xiàn)性別偏見门坷。但是宣鄙,這并不總是可能的。
通過查看來自Y染色體(雄性)和XIST基因表達(主要是雌性)的讀數(shù)默蚌,我們可以很容易地確定每個樣本的性別冻晤。如果樣本元數(shù)據(jù)性別與計算預測不一致,它也是一種檢測樣本是否存在混合的好方法绸吸。
在這里鼻弧,我們下載了公共數(shù)據(jù)的基因注釋文件信息,請確保將其放置在正確的位置以使路徑genes.file
起作用锦茁。
genes.file = "data/results/genes.table.csv"
if (!file.exists(genes.file)) {
suppressMessages(require(biomaRt))
# initialize connection to mart, may take some time if the sites are
# unresponsive.
mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
# fetch chromosome info plus some other annotations
genes.table <- try(biomaRt::getBM(attributes = c("ensembl_gene_id", "external_gene_name",
"description", "gene_biotype", "chromosome_name", "start_position"), mart = mart,
useCache = F))
if (!dir.exists("data/results")) {
dir.create("data/results")
}
if (is.data.frame(genes.table)) {
write.csv(genes.table, file = genes.file)
}
if (!file.exists(genes.file)) {
download.file("https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/master/labs/misc/genes.table.csv",
destfile = "data/results/genes.table.csv")
genes.table = read.csv(genes.file)
}
} else {
genes.table = read.csv(genes.file)
}
genes.table <- genes.table[genes.table$external_gene_name %in% rownames(data.filt), ]
head(genes.table)
# X ensembl_gene_id external_gene_name
#1 1 ENSG00000281486 SNTG2
#2 2 ENSG00000262826 INTS3
#3 3 ENSG00000274891 TRAPPC12
#4 4 ENSG00000280830 ADI1
#5 5 ENSG00000263163 SLC27A3
#6 6 ENSG00000261992 GATAD2B
# description
#1 syntrophin gamma 2 [Source:HGNC Symbol;Acc:HGNC:13741]
#2 integrator complex subunit 3 [Source:HGNC Symbol;Acc:HGNC:26153]
#3 trafficking protein particle complex 12 [Source:HGNC Symbol;Acc:HGNC:24284]
#4 acireductone dioxygenase 1 [Source:HGNC Symbol;Acc:HGNC:30576]
#5 solute carrier family 27 member 3 [Source:HGNC Symbol;Acc:HGNC:10997]
#6 GATA zinc finger domain containing 2B [Source:HGNC Symbol;Acc:HGNC:30778]
# gene_biotype chromosome_name start_position
#1 protein_coding CHR_HSCHR2_3_CTG1 1209102
#2 protein_coding CHR_HSCHR1_1_CTG31 153745298
#3 protein_coding CHR_HSCHR2_1_CTG1 3401777
#4 protein_coding CHR_HSCHR2_1_CTG1 3498728
#5 protein_coding CHR_HSCHR1_1_CTG31 153791585
#6 protein_coding CHR_HSCHR1_1_CTG31 153821956
現(xiàn)在我們有了染色體信息攘轩,可以計算每個細胞中來自Y染色體讀數(shù)的比例。
chrY.gene = genes.table$external_gene_name[genes.table$chromosome_name == "Y"]
chrY.gene
#[1] "RPS4Y1" "AC006157.1" "ZFY" "ZFY-AS1" "LINC00278"
#[6] "PCDH11Y" "USP9Y" "DDX3Y" "UTY" "TMSB4Y"
#[11] "TTTY14" "AC010889.1" "KDM5D" "EIF1AY" "RPS4Y2"
data.filt$pct_chrY = colSums(data.filt@assays$RNA@counts[chrY.gene, ])/colSums(data.filt@assays$RNA@counts)
繪制XIST表達與chrY比例的關(guān)系散點圖码俩。
FeatureScatter(data.filt, feature1 = "XIST", feature2 = "pct_chrY")
繪制小提琴圖查看樣本性別度帮。
VlnPlot(data.filt, features = c("XIST", "pct_chrY"))
如圖所示,我們可以清楚地看到這里有兩個男性和四個女性稿存,您可以看到它們是哪些樣本嗎笨篷?您認為這會對下游分析造成任何問題嗎?與您的小組討論:應對這種性別bias的最佳方法是什么瓣履?
計算細胞周期評分
在這里率翅,我們計算細胞周期評分。為了對特定基因列表進行評分袖迎,該算法計算給定基因列表的平均表達與參考基因集的平均表達之差冕臭。為了建立參考基因集,該函數(shù)隨機選擇一堆與給定基因列表的表達分布相匹配的基因燕锥。計算完細胞周期評分后浴韭,將會在數(shù)據(jù)中添加三個metadata信息,即S期評分脯宿,G2M期評分和預測的細胞周期念颈。
# Before running CellCycleScoring the data need to be normalized and logtransformed.
# 進行數(shù)據(jù)標準化處理
data.filt = NormalizeData(data.filt)
# 使用CellCycleScoring函數(shù)計算細胞周期評分
data.filt <- CellCycleScoring(object = data.filt, g2m.features = cc.genes$g2m.genes,
s.features = cc.genes$s.genes)
## Warning: The following features are not present in the object: MLF1IP, not
## searching for symbol synonyms
## Warning: The following features are not present in the object: FAM64A, HN1, not
## searching for symbol synonyms
現(xiàn)在,我們還可以為細胞周期得分繪制小提琴圖连霉。
VlnPlot(data.filt, features = c("S.Score", "G2M.Score"), group.by = "orig.ident",
ncol = 4, pt.size = 0.1)
預測雙細胞doublet
在同一小孔/液滴中榴芳,細胞的Doublets/Mulitples是scRNA-seq文庫構(gòu)建中的常見問題嗡靡。尤其是在基于液滴的方法中,會導致細胞超負荷窟感。在10x實驗中讨彼,雙細胞的比例與加載細胞的數(shù)量呈線性相關(guān)。如Chromium用戶指南所述柿祈,雙峰速率大約如下:
大多數(shù)雙細胞檢測工具通過合并細胞的計數(shù)來模擬雙細胞哈误,并將預測到的與模擬雙細胞相似的細胞歸為doublets。大多數(shù)此類軟件包都需要對數(shù)據(jù)集中預期雙細胞數(shù)的數(shù)量/比例進行假設(shè)躏嚎。我們這里使用的數(shù)據(jù)是經(jīng)過二次抽樣的蜜自,但是原始數(shù)據(jù)集中每個樣本包含了大約5000個細胞,因此我們可以假設(shè)它們加載了大約9000個細胞卢佣,并且其雙細胞率約為4%重荠。
OBS!理想情況下虚茶,我們應該對每個樣本分別進行doublet預測戈鲁,尤其是在您不同的樣本具有不同比例的細胞類型的情況下。在這里嘹叫,我們對數(shù)據(jù)進行了二次采樣婆殿,因此每個樣本中只有很少的細胞,并且所有樣本都是經(jīng)過sorted的PBMC罩扇,因此可以將它們一起運行婆芦。
在這里,我們將使用DoubletFinder
包來預測雙細胞暮蹂。
# remotes::install_github('chris-mcginnis-ucsf/DoubletFinder')
suppressMessages(require(DoubletFinder))
# 篩選高變基因
data.filt = FindVariableFeatures(data.filt, verbose = F)
# 進行數(shù)據(jù)歸一化
data.filt = ScaleData(data.filt, vars.to.regress = c("nFeature_RNA", "percent_mito"),
verbose = F)
# PCA降維
data.filt = RunPCA(data.filt, verbose = F, npcs = 20)
# UMAP可視化
data.filt = RunUMAP(data.filt, dims = 1:10, verbose = F)
然后,我們運行doubletFinder癌压,選擇前10個PC仰泻,pK值設(shè)為0.9。
# Can run parameter optimization with paramSweep, but skip for now.
# sweep.res <- paramSweep_v3(data.filt)
# sweep.stats <- summarizeSweep(sweep.res, GT = FALSE)
# bcmvn <- find.pK(sweep.stats)
# barplot(bcmvn$BCmetric, names.arg = bcmvn$pK, las=2)
# define the expected number of doublet cellscells.
nExp <- round(ncol(data.filt) * 0.04) # expect 4% doublets
# 使用doubletFinder_v3函數(shù)預測雙細胞
data.filt <- doubletFinder_v3(data.filt, pN = 0.25, pK = 0.09, nExp = nExp, PCs = 1:10)
## [1] "Creating 1921 artificial doublets..."
## [1] "Creating Seurat object..."
## [1] "Normalizing Seurat object..."
## [1] "Finding variable genes..."
## [1] "Scaling data..."
## [1] "Running PCA..."
## [1] "Calculating PC distance matrix..."
## [1] "Computing pANN..."
## [1] "Classifying doublets.."
# name of the DF prediction can change, so extract the correct column name.
DF.name = colnames(data.filt@meta.data)[grepl("DF.classification", colnames(data.filt@meta.data))]
cowplot::plot_grid(ncol = 2, DimPlot(data.filt, group.by = "orig.ident") + NoAxes(),
DimPlot(data.filt, group.by = DF.name) + NoAxes())
理想情況下滩届,雙細胞中比單細胞檢查到更多的基因集侯,讓我們檢查一下我們預測的雙細胞中是否也具有更多的檢測到的基因。
VlnPlot(data.filt, features = "nFeature_RNA", group.by = DF.name, pt.size = 0.1)
現(xiàn)在帜消,我們可以將預測的雙細胞從數(shù)據(jù)中刪除棠枉。
data.filt = data.filt[, data.filt@meta.data[, DF.name] == "Singlet"]
dim(data.filt)
## [1] 18121 5532
保存質(zhì)控過濾后的數(shù)據(jù)
最后,讓我們保存經(jīng)過QC過濾的數(shù)據(jù)以進行后續(xù)的分析泡挺。創(chuàng)建輸出目錄results
并將數(shù)據(jù)保存到該文件夾中辈讶。
dir.create("data/results", showWarnings = F)
saveRDS(data.filt, "data/results/seurat_covid_qc.rds")
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Catalina 10.15.5
##
## Matrix products: default
## BLAS/LAPACK: /Users/paulo.czarnewski/.conda/envs/scRNAseq2021/lib/libopenblasp-r0.3.12.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] KernSmooth_2.23-18 fields_11.6 spam_2.6-0
## [4] dotCall64_1.0-0 DoubletFinder_2.0.3 Matrix_1.3-2
## [7] Seurat_3.2.3 RJSONIO_1.3-1.4 optparse_1.6.6
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.15 colorspace_2.0-0 deldir_0.2-9
## [4] ellipsis_0.3.1 ggridges_0.5.3 spatstat.data_1.7-0
## [7] farver_2.0.3 leiden_0.3.6 listenv_0.8.0
## [10] remotes_2.2.0 bit64_4.0.5 getopt_1.20.3
## [13] ggrepel_0.9.1 RSpectra_0.16-0 codetools_0.2-18
## [16] splines_4.0.3 knitr_1.30 polyclip_1.10-0
## [19] jsonlite_1.7.2 ica_1.0-2 cluster_2.1.0
## [22] png_0.1-7 uwot_0.1.10 shiny_1.5.0
## [25] sctransform_0.3.2 compiler_4.0.3 httr_1.4.2
## [28] assertthat_0.2.1 fastmap_1.0.1 lazyeval_0.2.2
## [31] later_1.1.0.1 formatR_1.7 htmltools_0.5.1
## [34] tools_4.0.3 rsvd_1.0.3 igraph_1.2.6
## [37] gtable_0.3.0 glue_1.4.2 RANN_2.6.1
## [40] reshape2_1.4.4 dplyr_1.0.3 maps_3.3.0
## [43] Rcpp_1.0.6 spatstat_1.64-1 scattermore_0.7
## [46] vctrs_0.3.6 nlme_3.1-151 lmtest_0.9-38
## [49] xfun_0.20 stringr_1.4.0 globals_0.14.0
## [52] mime_0.9 miniUI_0.1.1.1 lifecycle_0.2.0
## [55] irlba_2.3.3 goftest_1.2-2 future_1.21.0
## [58] MASS_7.3-53 zoo_1.8-8 scales_1.1.1
## [61] promises_1.1.1 spatstat.utils_1.20-2 parallel_4.0.3
## [64] RColorBrewer_1.1-2 yaml_2.2.1 curl_4.3
## [67] reticulate_1.18 pbapply_1.4-3 gridExtra_2.3
## [70] ggplot2_3.3.3 rpart_4.1-15 stringi_1.5.3
## [73] rlang_0.4.10 pkgconfig_2.0.3 matrixStats_0.57.0
## [76] evaluate_0.14 lattice_0.20-41 ROCR_1.0-11
## [79] purrr_0.3.4 tensor_1.5 labeling_0.4.2
## [82] patchwork_1.1.1 htmlwidgets_1.5.3 bit_4.0.4
## [85] cowplot_1.1.1 tidyselect_1.1.0 parallelly_1.23.0
## [88] RcppAnnoy_0.0.18 plyr_1.8.6 magrittr_2.0.1
## [91] R6_2.5.0 generics_0.1.0 DBI_1.1.1
## [94] withr_2.4.0 pillar_1.4.7 mgcv_1.8-33
## [97] fitdistrplus_1.1-3 survival_3.2-7 abind_1.4-5
## [100] tibble_3.0.5 future.apply_1.7.0 crayon_1.3.4
## [103] hdf5r_1.3.3 plotly_4.9.3 rmarkdown_2.6
## [106] data.table_1.13.6 digest_0.6.27 xtable_1.8-4
## [109] tidyr_1.1.2 httpuv_1.5.5 munsell_0.5.0
## [112] viridisLite_0.3.0
參考來源:https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/seurat/seurat_01_qc.html