NBIS系列單細胞轉(zhuǎn)錄組數(shù)據(jù)分析實戰(zhàn)(一):數(shù)據(jù)質(zhì)控

Single cell RNA-seq analysis workshop

該系列課程由瑞典國家生物信息學基礎(chǔ)設(shè)施(NBIS) 負責設(shè)置,為瑞典Phd學生蜕青,博士后,研究人員和瑞典學術(shù)界的其他工作人員開設(shè)的全國性研討會糊渊。

本課程將通過演講和實戰(zhàn)練習涵蓋了單細胞轉(zhuǎn)錄組(scRNA-seq)數(shù)據(jù)分析的基本步驟右核。涵蓋的主題包括:

  • An overview of the current scRNAseq technologies(當前常用scRNA-seq技術(shù)概述)
  • Basic overview of pipelines for processing raw reads into expression values(原始測序數(shù)據(jù)比對轉(zhuǎn)換為基因表達矩陣的分析流程概述)
  • Quality control of scRNAseq data(scRNA-seq數(shù)據(jù)的質(zhì)量控制)
  • Dimensionality reduction and clustering techniques(數(shù)據(jù)降維和聚類分析技術(shù))
  • Data normalization(數(shù)據(jù)歸一化處理)
  • Differential gene expression for scRNAseq data(scRNA-seq數(shù)據(jù)的差異基因表達分析)
  • Celltype prediction(細胞類型預測)
  • Trajectory analysis(細胞軌跡擬時分析)
  • Comparison of different analysis pipelines such as Seurat, Scran and Scanpy(不同分析工具流程的比較,如Seurat渺绒,Scran和Scanpy)

實戰(zhàn)練習

在該實戰(zhàn)練習中贺喝,我們使用了3種常用的scRNA-seq分析工具流程(Seurat菱鸥,ScranScanpy),分別提供了不同分析工具的簡短教程躏鱼,我們可以任選自己熟悉的工具進行scRNA-seq數(shù)據(jù)分析氮采。原則上,我們對這三種不同的分析工具執(zhí)行相同的步驟染苛,但是仍會存在一些細微的差異鹊漠,因為并非在所有的分析工具中實現(xiàn)了所有不同的方法。

image.png

在本實戰(zhàn)中茶行,我選用了scRNA-seq數(shù)據(jù)分析中最常用的Seurat包進行練習演示躯概。

第一節(jié):數(shù)據(jù)質(zhì)控實戰(zhàn)

下載示例數(shù)據(jù)

在本教程中,我們將使用來自3個covid-19患者和3個健康對照的6個PBMC的10x數(shù)據(jù)集運行所有的教程畔师,每個樣本已被二次采樣為1500個細胞娶靡。

# 新建數(shù)據(jù)存放的文件夾
mkdir -p data/raw

# first check if the files are there
count=$(ls -l data/raw/*.h5 | grep -v ^d | wc -l )
echo $count

# if not 4 files, fetch the files from github.
# 下載示例數(shù)據(jù)
if (("$count" <  6)); then
  cd data/raw
  curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/Normal_PBMC_13.h5
  curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/Normal_PBMC_14.h5
  curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/Normal_PBMC_5.h5
  curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/nCoV_PBMC_15.h5
  curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/nCoV_PBMC_17.h5
  curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/nCoV_PBMC_1.h5
  cd ../..
fi

# 查看文件大小
ls -lGa data/raw

安裝并加載所需的R包

# 安裝所需的R包
install.packages('Seurat')
install.packages('Matrix')
remotes::install_github("chris-mcginnis-ucsf/DoubletFinder", upgrade = F)

# 加載所需的R包
suppressMessages(require(Seurat))
suppressMessages(require(Matrix))
suppressMessages(require(DoubletFinder))

讀取示例數(shù)據(jù)并構(gòu)建seuart對象

# 使用Read10X_h5函數(shù)讀取h5文件
cov.15 <- Seurat::Read10X_h5(filename = "data/raw/nCoV_PBMC_15.h5", use.names = T)
cov.1 <- Seurat::Read10X_h5(filename = "data/raw/nCoV_PBMC_1.h5", use.names = T)
cov.17 <- Seurat::Read10X_h5(filename = "data/raw/nCoV_PBMC_17.h5", use.names = T)
ctrl.5 <- Seurat::Read10X_h5(filename = "data/raw/Normal_PBMC_5.h5", use.names = T)
ctrl.13 <- Seurat::Read10X_h5(filename = "data/raw/Normal_PBMC_13.h5", use.names = T)
ctrl.14 <- Seurat::Read10X_h5(filename = "data/raw/Normal_PBMC_14.h5", use.names = T)

# 查看數(shù)據(jù)信息
dim(cov.1)
#[1] 33538  1500
dim(cov.15)
#[1] 33538  1500
dim(cov.17)
#[1] 33538  1500
dim(ctrl.5)
#[1] 33538  1500
dim(ctrl.13)
#[1] 33538  1500
dim(ctrl.14)
#[1] 33538  1500

# 使用CreateSeuratObject函數(shù)構(gòu)建seurat對象
sdata.cov15 <- CreateSeuratObject(cov.15, project = "covid_15")
sdata.cov1 <- CreateSeuratObject(cov.1, project = "covid_1")
sdata.cov17 <- CreateSeuratObject(cov.17, project = "covid_17")
sdata.ctrl5 <- CreateSeuratObject(ctrl.5, project = "ctrl_5")
sdata.ctrl13 <- CreateSeuratObject(ctrl.13, project = "ctrl_13")
sdata.ctrl14 <- CreateSeuratObject(ctrl.14, project = "ctrl_14")

# 添加樣本分組信息
sdata.cov1$type = "Covid"
sdata.cov15$type = "Covid"
sdata.cov17$type = "Covid"
sdata.ctrl5$type = "Ctrl"
sdata.ctrl13$type = "Ctrl"
sdata.ctrl14$type = "Ctrl"

# 查看seurat對象
sdata.cov1
# An object of class Seurat 
# 33538 features across 1500 samples within 1 assay 
# Active assay: RNA (33538 features, 0 variable features)
sdata.cov15
# An object of class Seurat 
# 33538 features across 1500 samples within 1 assay 
# Active assay: RNA (33538 features, 0 variable features)
sdata.cov17
# An object of class Seurat 
# 33538 features across 1500 samples within 1 assay 
# Active assay: RNA (33538 features, 0 variable features)
sdata.ctrl5
# An object of class Seurat 
# 33538 features across 1500 samples within 1 assay 
# Active assay: RNA (33538 features, 0 variable features)
sdata.ctrl13
# An object of class Seurat 
# 33538 features across 1500 samples within 1 assay 
# Active assay: RNA (33538 features, 0 variable features)
sdata.ctrl14
# An object of class Seurat 
# 33538 features across 1500 samples within 1 assay 
# Active assay: RNA (33538 features, 0 variable features)

合并所有樣本的數(shù)據(jù)

# Merge datasets into one single seurat object
# 合并所有樣本
alldata <- merge(x = sdata.cov15, 
                 y = c(sdata.cov1, sdata.cov17, sdata.ctrl5, 
                       sdata.ctrl13, sdata.ctrl14), 
                 add.cell.ids = c("covid_15", "covid_1", "covid_17",
                                  "ctrl_5", "ctrl_13", "ctrl_14"))

# remove all objects that will not be used.
# 刪除不用的數(shù)據(jù)對象
rm(cov.15, cov.1, cov.17, 
   ctrl.5, ctrl.13, ctrl.14, 
   sdata.cov15, sdata.cov1, sdata.cov17, 
   sdata.ctrl5, sdata.ctrl13, sdata.ctrl14)

# run garbage collect to free up memory
gc()
#          used  (Mb) gc trigger   (Mb)  max used   (Mb)
# Ncells 11991984 640.5   22255031 1188.6  22255031 1188.6
# Vcells 96687060 737.7  169043189 1289.7 165738512 1264.5

# 查看數(shù)據(jù)
alldata
# An object of class Seurat 
# 33538 features across 9000 samples within 1 assay 
# Active assay: RNA (33538 features, 0 variable features)

as.data.frame(alldata@assays$RNA@counts[1:10, 1:2])
#            covid_15_CTCCATGTCAACGTGT-15 covid_15_CATAAGCAGGAACGAA-15
#MIR1302-2HG                            0                            0
#FAM138A                                0                            0
#OR4F5                                  0                            0
#AL627309.1                             0                            0
#AL627309.3                             0                            0
#AL627309.2                             0                            0
#AL627309.4                             0                            0
#AL732372.1                             0                            0
#OR4F29                                 0                            0
#AC114498.1                             0                            0

head(alldata@meta.data, 10)
#                            orig.ident nCount_RNA nFeature_RNA  type
#covid_15_CTCCATGTCAACGTGT-15   covid_15      14911         3526 Covid
#covid_15_CATAAGCAGGAACGAA-15   covid_15        338          203 Covid
#covid_15_TTCACCGTCAGGAAGC-15   covid_15      28486         4542 Covid
#covid_15_CGTCCATGTCCGGACT-15   covid_15       1318          539 Covid
#covid_15_GTCCACTAGTCGCCCA-15   covid_15       4805         1493 Covid
#covid_15_ATCCATTGTTGATGTC-15   covid_15       5386         1617 Covid
#covid_15_AGAAGCGAGGGCCTCT-15   covid_15        686          407 Covid
#covid_15_GAGGGTAGTAGGTTTC-15   covid_15       2155         1116 Covid
#covid_15_CAAGACTTCTGCTTTA-15   covid_15       1216          128 Covid
#covid_15_GCCAACGAGCTCTATG-15   covid_15        729          356 Covid

計算數(shù)據(jù)質(zhì)量控制指標

  • 計算線粒體相關(guān)基因含量
# 計算線粒體相關(guān)基因百分比
# Way1: Doing it using Seurat function
# 使用PercentageFeatureSet函數(shù)計算線粒體基因含量
alldata <- PercentageFeatureSet(alldata, "^MT-", col.name = "percent_mito")

# Way2: Doing it manually
# 手動計算線粒體基因含量
total_counts_per_cell <- colSums(alldata@assays$RNA@counts)
head(total_counts_per_cell)
#covid_15_CTCCATGTCAACGTGT-15 covid_15_CATAAGCAGGAACGAA-15 
#                      14911                          338 
#covid_15_TTCACCGTCAGGAAGC-15 covid_15_CGTCCATGTCCGGACT-15 
#                       28486                         1318 
#covid_15_GTCCACTAGTCGCCCA-15 covid_15_ATCCATTGTTGATGTC-15 
#                        4805                         5386 
# 獲取線粒體相關(guān)基因
mito_genes <- rownames(alldata)[grep("^MT-", rownames(alldata))]
head(mito_genes)
# [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6"
# 計算線粒體基因含量
alldata$percent_mito <- colSums(alldata@assays$RNA@counts[mito_genes, ])/total_counts_per_cell
  • 計算核糖體相關(guān)基因含量
# Way1: Doing it using Seurat function
# 使用PercentageFeatureSet函數(shù)計算核糖體基因含量
alldata <- PercentageFeatureSet(alldata, "^RP[SL]", col.name = "percent_ribo")

# Way2: Doing it manually
# 手動計算核糖體相關(guān)基因含量 
ribo_genes <- rownames(alldata)[grep("^RP[SL]", rownames(alldata))]
head(ribo_genes, 10)
#[1] "RPL22"   "RPL11"   "RPS6KA1" "RPS8"    "RPL5"    "RPS27"   "RPS6KC1"
#[8] "RPS7"    "RPS27A"  "RPL31" 
alldata$percent_ribo <- colSums(alldata@assays$RNA@counts[ribo_genes, ])/total_counts_per_cell
  • 計算血紅蛋白相關(guān)基因含量
# Percentage hemoglobin genes - includes all genes starting with HB except HBP.
alldata <- PercentageFeatureSet(alldata, "^HB[^(P)]", col.name = "percent_hb")

數(shù)據(jù)質(zhì)量控制指標的可視化

feats <- c("nFeature_RNA", "nCount_RNA", "percent_mito", "percent_ribo", "percent_hb")
# 使用VlnPlot函數(shù)繪制小提琴圖可視化相關(guān)質(zhì)控指標
VlnPlot(alldata, group.by = "orig.ident", features = feats, pt.size = 0.1, ncol = 3) + 
    NoLegend()
image.png
# 使用FeatureScatter函數(shù)繪制散點圖
FeatureScatter(alldata, "nCount_RNA", "nFeature_RNA", group.by = "orig.ident", pt.size = 0.5)
image.png

數(shù)據(jù)過濾篩選

  • 基于檢測的過濾
    在這里,我們將僅考慮至少檢測到200個基因的細胞看锉,并且至少需要在3個細胞中表達這些基因固蛾。請注意,這些值在很大程度上取決于所使用的單細胞文庫制備方法度陆。
# 過濾至少檢測到200個基因的細胞
selected_c <- WhichCells(alldata, expression = nFeature_RNA > 200)
# 過濾至少在3個細胞中能檢測到的基因
selected_f <- rownames(alldata)[Matrix::rowSums(alldata) > 3]

data.filt <- subset(alldata, features = selected_f, cells = selected_c)
dim(data.filt)
## [1] 18147  7973

注意:檢測到極高數(shù)量基因的細胞可能一些doublet。但是献幔,根據(jù)樣本中細胞類型的組成懂傀,也可能會存在一種具有更多基因(以及更高計數(shù))的細胞。在這種情況下蜡感,我們將需要進一步進行doublet預測蹬蚁。

此外,我們還可以看看哪些基因?qū)@類reads貢獻最大郑兴。例如犀斋,我們可以繪制每個基因計數(shù)的百分比。

# Compute the relative expression of each gene per cell Use sparse matrix
# operations, if your dataset is large, doing matrix devisions the regular way
# will take a very long time.
par(mar = c(4, 8, 2, 1))
C <- data.filt@assays$RNA@counts
C <- Matrix::t(Matrix::t(C)/Matrix::colSums(C)) * 100
most_expressed <- order(apply(C, 1, median), decreasing = T)[20:1]

boxplot(as.matrix(t(C[most_expressed, ])), cex = 0.1, las = 1, xlab = "% total count per cell", 
    col = (scales::hue_pal())(20)[20:1], horizontal = TRUE)
image.png

如圖所示情连,我們可以看到MALAT1基因占單個細胞UMI的30%叽粹,其他最重要的基因是線粒體和核糖體相關(guān)基因。核內(nèi)lincRNA與數(shù)據(jù)質(zhì)量和線粒體基因reads相關(guān)聯(lián)是非常普遍的却舀,因此對MALAT1的高檢測可能是一個技術(shù)問題虫几。我們可以收集有關(guān)此類基因的一些信息,這些信息對于質(zhì)量控制和下游過濾是非常重要挽拔。

  • 基與線粒體/核糖體基因含量進行過濾
    我們可以通過數(shù)據(jù)質(zhì)控小提琴圖辆脸,選擇過濾線粒體或核糖體基因含量相關(guān)的閾值。在此例中螃诅,我們選擇過濾掉線粒體讀數(shù)低于20%啡氢,核糖體讀數(shù)少于5%的細胞状囱。
selected_mito <- WhichCells(data.filt, expression = percent_mito < 0.2)
selected_ribo <- WhichCells(data.filt, expression = percent_ribo > 0.05)

# and subset the object to only keep those cells
data.filt <- subset(data.filt, cells = selected_mito)
data.filt <- subset(data.filt, cells = selected_ribo)

dim(data.filt)
## [1] 18147  5762

table(data.filt$orig.ident)
##  covid_1 covid_15 covid_17  ctrl_13  ctrl_14   ctrl_5 
##      878      585     1042     1154     1063     1040

繪制過濾后的質(zhì)量控制圖

讓我們再次繪制相同的QC統(tǒng)計信息。

feats <- c("nFeature_RNA", "nCount_RNA", "percent_mito", "percent_ribo", "percent_hb")

VlnPlot(data.filt, group.by = "orig.ident", features = feats, pt.size = 0.1, ncol = 3) + 
    NoLegend()
image.png

過濾篩選相關(guān)基因

由于線粒體和MALAT1等基因的高表達被認為主要是技術(shù)性的倘是,因此在進行任何下一步分析之前亭枷,將其從數(shù)據(jù)集中刪除是明智的。

dim(data.filt)
## [1] 18147  5762

# Filter MALAT1
# 過濾MALAT1基因
data.filt <- data.filt[!grepl("MALAT1", rownames(data.filt)), ]

# Filter Mitocondrial
# 過濾線粒體基因
data.filt <- data.filt[!grepl("^MT-", rownames(data.filt)), ]

# Filter Ribossomal gene (optional if that is a problem on your data) data.filt
# 過濾核糖體基因
# <- data.filt[ ! grepl('^RP[SL]', rownames(data.filt)), ]

# Filter Hemoglobin gene (optional if that is a problem on your data)
# 過濾血紅蛋白基因
data.filt <- data.filt[!grepl("^HB[^(P)]", rownames(data.filt)), ]

dim(data.filt)
## [1] 18121  5762

查看樣本的性別信息

在處理人類或動物樣本數(shù)據(jù)時辨绊,理想情況下奶栖,應將實驗設(shè)置為單性別,以避免在結(jié)論中出現(xiàn)性別偏見门坷。但是宣鄙,這并不總是可能的。

通過查看來自Y染色體(雄性)和XIST基因表達(主要是雌性)的讀數(shù)默蚌,我們可以很容易地確定每個樣本的性別冻晤。如果樣本元數(shù)據(jù)性別與計算預測不一致,它也是一種檢測樣本是否存在混合的好方法绸吸。

在這里鼻弧,我們下載了公共數(shù)據(jù)的基因注釋文件信息,請確保將其放置在正確的位置以使路徑genes.file起作用锦茁。

genes.file = "data/results/genes.table.csv"

if (!file.exists(genes.file)) {
    suppressMessages(require(biomaRt))

    # initialize connection to mart, may take some time if the sites are
    # unresponsive.
    mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")

    # fetch chromosome info plus some other annotations
    genes.table <- try(biomaRt::getBM(attributes = c("ensembl_gene_id", "external_gene_name", 
        "description", "gene_biotype", "chromosome_name", "start_position"), mart = mart, 
        useCache = F))

    if (!dir.exists("data/results")) {
        dir.create("data/results")
    }
    if (is.data.frame(genes.table)) {
        write.csv(genes.table, file = genes.file)
    }

    if (!file.exists(genes.file)) {
        download.file("https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/master/labs/misc/genes.table.csv", 
            destfile = "data/results/genes.table.csv")
        genes.table = read.csv(genes.file)
    }

} else {
    genes.table = read.csv(genes.file)
}

genes.table <- genes.table[genes.table$external_gene_name %in% rownames(data.filt), ]
head(genes.table)
#  X ensembl_gene_id external_gene_name
#1 1 ENSG00000281486              SNTG2
#2 2 ENSG00000262826              INTS3
#3 3 ENSG00000274891           TRAPPC12
#4 4 ENSG00000280830               ADI1
#5 5 ENSG00000263163            SLC27A3
#6 6 ENSG00000261992            GATAD2B
#                                                                  description
#1                      syntrophin gamma 2 [Source:HGNC Symbol;Acc:HGNC:13741]
#2            integrator complex subunit 3 [Source:HGNC Symbol;Acc:HGNC:26153]
#3 trafficking protein particle complex 12 [Source:HGNC Symbol;Acc:HGNC:24284]
#4              acireductone dioxygenase 1 [Source:HGNC Symbol;Acc:HGNC:30576]
#5       solute carrier family 27 member 3 [Source:HGNC Symbol;Acc:HGNC:10997]
#6   GATA zinc finger domain containing 2B [Source:HGNC Symbol;Acc:HGNC:30778]
#    gene_biotype    chromosome_name start_position
#1 protein_coding  CHR_HSCHR2_3_CTG1        1209102
#2 protein_coding CHR_HSCHR1_1_CTG31      153745298
#3 protein_coding  CHR_HSCHR2_1_CTG1        3401777
#4 protein_coding  CHR_HSCHR2_1_CTG1        3498728
#5 protein_coding CHR_HSCHR1_1_CTG31      153791585
#6 protein_coding CHR_HSCHR1_1_CTG31      153821956

現(xiàn)在我們有了染色體信息攘轩,可以計算每個細胞中來自Y染色體讀數(shù)的比例。

chrY.gene = genes.table$external_gene_name[genes.table$chromosome_name == "Y"]
chrY.gene
#[1] "RPS4Y1"     "AC006157.1" "ZFY"        "ZFY-AS1"    "LINC00278" 
#[6] "PCDH11Y"    "USP9Y"      "DDX3Y"      "UTY"        "TMSB4Y"    
#[11] "TTTY14"     "AC010889.1" "KDM5D"      "EIF1AY"     "RPS4Y2"

data.filt$pct_chrY = colSums(data.filt@assays$RNA@counts[chrY.gene, ])/colSums(data.filt@assays$RNA@counts)

繪制XIST表達與chrY比例的關(guān)系散點圖码俩。

FeatureScatter(data.filt, feature1 = "XIST", feature2 = "pct_chrY")
image.png

繪制小提琴圖查看樣本性別度帮。

VlnPlot(data.filt, features = c("XIST", "pct_chrY"))
image.png

如圖所示,我們可以清楚地看到這里有兩個男性和四個女性稿存,您可以看到它們是哪些樣本嗎笨篷?您認為這會對下游分析造成任何問題嗎?與您的小組討論:應對這種性別bias的最佳方法是什么瓣履?

計算細胞周期評分

在這里率翅,我們計算細胞周期評分。為了對特定基因列表進行評分袖迎,該算法計算給定基因列表的平均表達與參考基因集的平均表達之差冕臭。為了建立參考基因集,該函數(shù)隨機選擇一堆與給定基因列表的表達分布相匹配的基因燕锥。計算完細胞周期評分后浴韭,將會在數(shù)據(jù)中添加三個metadata信息,即S期評分脯宿,G2M期評分和預測的細胞周期念颈。

# Before running CellCycleScoring the data need to be normalized and logtransformed.
# 進行數(shù)據(jù)標準化處理
data.filt = NormalizeData(data.filt)

# 使用CellCycleScoring函數(shù)計算細胞周期評分
data.filt <- CellCycleScoring(object = data.filt, g2m.features = cc.genes$g2m.genes, 
    s.features = cc.genes$s.genes)
## Warning: The following features are not present in the object: MLF1IP, not
## searching for symbol synonyms

## Warning: The following features are not present in the object: FAM64A, HN1, not
## searching for symbol synonyms

現(xiàn)在,我們還可以為細胞周期得分繪制小提琴圖连霉。

VlnPlot(data.filt, features = c("S.Score", "G2M.Score"), group.by = "orig.ident", 
    ncol = 4, pt.size = 0.1)
image.png

預測雙細胞doublet

在同一小孔/液滴中榴芳,細胞的Doublets/Mulitples是scRNA-seq文庫構(gòu)建中的常見問題嗡靡。尤其是在基于液滴的方法中,會導致細胞超負荷窟感。在10x實驗中讨彼,雙細胞的比例與加載細胞的數(shù)量呈線性相關(guān)。如Chromium用戶指南所述柿祈,雙峰速率大約如下:

image.png

大多數(shù)雙細胞檢測工具通過合并細胞的計數(shù)來模擬雙細胞哈误,并將預測到的與模擬雙細胞相似的細胞歸為doublets。大多數(shù)此類軟件包都需要對數(shù)據(jù)集中預期雙細胞數(shù)的數(shù)量/比例進行假設(shè)躏嚎。我們這里使用的數(shù)據(jù)是經(jīng)過二次抽樣的蜜自,但是原始數(shù)據(jù)集中每個樣本包含了大約5000個細胞,因此我們可以假設(shè)它們加載了大約9000個細胞卢佣,并且其雙細胞率約為4%重荠。

OBS!理想情況下虚茶,我們應該對每個樣本分別進行doublet預測戈鲁,尤其是在您不同的樣本具有不同比例的細胞類型的情況下。在這里嘹叫,我們對數(shù)據(jù)進行了二次采樣婆殿,因此每個樣本中只有很少的細胞,并且所有樣本都是經(jīng)過sorted的PBMC罩扇,因此可以將它們一起運行婆芦。

在這里,我們將使用DoubletFinder包來預測雙細胞暮蹂。

# remotes::install_github('chris-mcginnis-ucsf/DoubletFinder')
suppressMessages(require(DoubletFinder))

# 篩選高變基因
data.filt = FindVariableFeatures(data.filt, verbose = F)
# 進行數(shù)據(jù)歸一化
data.filt = ScaleData(data.filt, vars.to.regress = c("nFeature_RNA", "percent_mito"), 
    verbose = F)

# PCA降維
data.filt = RunPCA(data.filt, verbose = F, npcs = 20)
# UMAP可視化
data.filt = RunUMAP(data.filt, dims = 1:10, verbose = F)

然后,我們運行doubletFinder癌压,選擇前10個PC仰泻,pK值設(shè)為0.9。

# Can run parameter optimization with paramSweep, but skip for now.
# sweep.res <- paramSweep_v3(data.filt) 
# sweep.stats <- summarizeSweep(sweep.res, GT = FALSE) 
# bcmvn <- find.pK(sweep.stats) 
# barplot(bcmvn$BCmetric, names.arg = bcmvn$pK, las=2)

# define the expected number of doublet cellscells.
nExp <- round(ncol(data.filt) * 0.04)  # expect 4% doublets
# 使用doubletFinder_v3函數(shù)預測雙細胞
data.filt <- doubletFinder_v3(data.filt, pN = 0.25, pK = 0.09, nExp = nExp, PCs = 1:10)
## [1] "Creating 1921 artificial doublets..."
## [1] "Creating Seurat object..."
## [1] "Normalizing Seurat object..."
## [1] "Finding variable genes..."
## [1] "Scaling data..."
## [1] "Running PCA..."
## [1] "Calculating PC distance matrix..."
## [1] "Computing pANN..."
## [1] "Classifying doublets.."

# name of the DF prediction can change, so extract the correct column name.
DF.name = colnames(data.filt@meta.data)[grepl("DF.classification", colnames(data.filt@meta.data))]

cowplot::plot_grid(ncol = 2, DimPlot(data.filt, group.by = "orig.ident") + NoAxes(), 
    DimPlot(data.filt, group.by = DF.name) + NoAxes())
image.png

理想情況下滩届,雙細胞中比單細胞檢查到更多的基因集侯,讓我們檢查一下我們預測的雙細胞中是否也具有更多的檢測到的基因。

VlnPlot(data.filt, features = "nFeature_RNA", group.by = DF.name, pt.size = 0.1)
image.png

現(xiàn)在帜消,我們可以將預測的雙細胞從數(shù)據(jù)中刪除棠枉。

data.filt = data.filt[, data.filt@meta.data[, DF.name] == "Singlet"]
dim(data.filt)
## [1] 18121  5532

保存質(zhì)控過濾后的數(shù)據(jù)

最后,讓我們保存經(jīng)過QC過濾的數(shù)據(jù)以進行后續(xù)的分析泡挺。創(chuàng)建輸出目錄results并將數(shù)據(jù)保存到該文件夾中辈讶。

dir.create("data/results", showWarnings = F)

saveRDS(data.filt, "data/results/seurat_covid_qc.rds")
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Catalina 10.15.5
## 
## Matrix products: default
## BLAS/LAPACK: /Users/paulo.czarnewski/.conda/envs/scRNAseq2021/lib/libopenblasp-r0.3.12.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] KernSmooth_2.23-18  fields_11.6         spam_2.6-0         
## [4] dotCall64_1.0-0     DoubletFinder_2.0.3 Matrix_1.3-2       
## [7] Seurat_3.2.3        RJSONIO_1.3-1.4     optparse_1.6.6     
## 
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.15            colorspace_2.0-0      deldir_0.2-9         
##   [4] ellipsis_0.3.1        ggridges_0.5.3        spatstat.data_1.7-0  
##   [7] farver_2.0.3          leiden_0.3.6          listenv_0.8.0        
##  [10] remotes_2.2.0         bit64_4.0.5           getopt_1.20.3        
##  [13] ggrepel_0.9.1         RSpectra_0.16-0       codetools_0.2-18     
##  [16] splines_4.0.3         knitr_1.30            polyclip_1.10-0      
##  [19] jsonlite_1.7.2        ica_1.0-2             cluster_2.1.0        
##  [22] png_0.1-7             uwot_0.1.10           shiny_1.5.0          
##  [25] sctransform_0.3.2     compiler_4.0.3        httr_1.4.2           
##  [28] assertthat_0.2.1      fastmap_1.0.1         lazyeval_0.2.2       
##  [31] later_1.1.0.1         formatR_1.7           htmltools_0.5.1      
##  [34] tools_4.0.3           rsvd_1.0.3            igraph_1.2.6         
##  [37] gtable_0.3.0          glue_1.4.2            RANN_2.6.1           
##  [40] reshape2_1.4.4        dplyr_1.0.3           maps_3.3.0           
##  [43] Rcpp_1.0.6            spatstat_1.64-1       scattermore_0.7      
##  [46] vctrs_0.3.6           nlme_3.1-151          lmtest_0.9-38        
##  [49] xfun_0.20             stringr_1.4.0         globals_0.14.0       
##  [52] mime_0.9              miniUI_0.1.1.1        lifecycle_0.2.0      
##  [55] irlba_2.3.3           goftest_1.2-2         future_1.21.0        
##  [58] MASS_7.3-53           zoo_1.8-8             scales_1.1.1         
##  [61] promises_1.1.1        spatstat.utils_1.20-2 parallel_4.0.3       
##  [64] RColorBrewer_1.1-2    yaml_2.2.1            curl_4.3             
##  [67] reticulate_1.18       pbapply_1.4-3         gridExtra_2.3        
##  [70] ggplot2_3.3.3         rpart_4.1-15          stringi_1.5.3        
##  [73] rlang_0.4.10          pkgconfig_2.0.3       matrixStats_0.57.0   
##  [76] evaluate_0.14         lattice_0.20-41       ROCR_1.0-11          
##  [79] purrr_0.3.4           tensor_1.5            labeling_0.4.2       
##  [82] patchwork_1.1.1       htmlwidgets_1.5.3     bit_4.0.4            
##  [85] cowplot_1.1.1         tidyselect_1.1.0      parallelly_1.23.0    
##  [88] RcppAnnoy_0.0.18      plyr_1.8.6            magrittr_2.0.1       
##  [91] R6_2.5.0              generics_0.1.0        DBI_1.1.1            
##  [94] withr_2.4.0           pillar_1.4.7          mgcv_1.8-33          
##  [97] fitdistrplus_1.1-3    survival_3.2-7        abind_1.4-5          
## [100] tibble_3.0.5          future.apply_1.7.0    crayon_1.3.4         
## [103] hdf5r_1.3.3           plotly_4.9.3          rmarkdown_2.6        
## [106] data.table_1.13.6     digest_0.6.27         xtable_1.8-4         
## [109] tidyr_1.1.2           httpuv_1.5.5          munsell_0.5.0        
## [112] viridisLite_0.3.0

參考來源:https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/seurat/seurat_01_qc.html

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個濱河市娄猫,隨后出現(xiàn)的幾起案子贱除,更是在濱河造成了極大的恐慌生闲,老刑警劉巖,帶你破解...
    沈念sama閱讀 206,013評論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件月幌,死亡現(xiàn)場離奇詭異碍讯,居然都是意外死亡,警方通過查閱死者的電腦和手機扯躺,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,205評論 2 382
  • 文/潘曉璐 我一進店門捉兴,熙熙樓的掌柜王于貴愁眉苦臉地迎上來录语,“玉大人倍啥,你說我怎么就攤上這事∏瘴蓿” “怎么了?”我有些...
    開封第一講書人閱讀 152,370評論 0 342
  • 文/不壞的土叔 我叫張陵失暂,是天一觀的道長彼宠。 經(jīng)常有香客問我,道長弟塞,這世上最難降的妖魔是什么凭峡? 我笑而不...
    開封第一講書人閱讀 55,168評論 1 278
  • 正文 為了忘掉前任决记,我火速辦了婚禮摧冀,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘索昂。我一直安慰自己,他們只是感情好扩借,可當我...
    茶點故事閱讀 64,153評論 5 371
  • 文/花漫 我一把揭開白布椒惨。 她就那樣靜靜地躺著,像睡著了一般潮罪。 火紅的嫁衣襯著肌膚如雪康谆。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 48,954評論 1 283
  • 那天嫉到,我揣著相機與錄音沃暗,去河邊找鬼。 笑死何恶,一個胖子當著我的面吹牛孽锥,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播细层,決...
    沈念sama閱讀 38,271評論 3 399
  • 文/蒼蘭香墨 我猛地睜開眼忱叭,長吁一口氣:“原來是場噩夢啊……” “哼隔崎!你這毒婦竟也來了?” 一聲冷哼從身側(cè)響起韵丑,我...
    開封第一講書人閱讀 36,916評論 0 259
  • 序言:老撾萬榮一對情侶失蹤爵卒,失蹤者是張志新(化名)和其女友劉穎,沒想到半個月后撵彻,有當?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體钓株,經(jīng)...
    沈念sama閱讀 43,382評論 1 300
  • 正文 獨居荒郊野嶺守林人離奇死亡,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 35,877評論 2 323
  • 正文 我和宋清朗相戀三年陌僵,在試婚紗的時候發(fā)現(xiàn)自己被綠了轴合。 大學時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點故事閱讀 37,989評論 1 333
  • 序言:一個原本活蹦亂跳的男人離奇死亡碗短,死狀恐怖受葛,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情偎谁,我是刑警寧澤总滩,帶...
    沈念sama閱讀 33,624評論 4 322
  • 正文 年R本政府宣布,位于F島的核電站巡雨,受9級特大地震影響闰渔,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜铐望,卻給世界環(huán)境...
    茶點故事閱讀 39,209評論 3 307
  • 文/蒙蒙 一冈涧、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧正蛙,春花似錦督弓、人聲如沸乒验。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,199評論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽奸攻。三九已至蒜危,卻和暖如春虱痕,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背辐赞。 一陣腳步聲響...
    開封第一講書人閱讀 31,418評論 1 260
  • 我被黑心中介騙來泰國打工部翘, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留响委,地道東北人新思。 一個月前我還...
    沈念sama閱讀 45,401評論 2 352
  • 正文 我出身青樓,卻偏偏與公主長得像夹囚,于是被迫代替她去往敵國和親纵刘。 傳聞我的和親對象是個殘疾皇子,可洞房花燭夜當晚...
    茶點故事閱讀 42,700評論 2 345

推薦閱讀更多精彩內(nèi)容