使用scater包進(jìn)行單細(xì)胞測序分析(二):數(shù)據(jù)質(zhì)量控制

scater包提供了一系列的數(shù)據(jù)質(zhì)量控制方法剃斧,可以對(duì)單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)進(jìn)行嚴(yán)格的質(zhì)量控制,它主要從以下三個(gè)方面進(jìn)行質(zhì)量控制QC:

  • QC and filtering of cells(細(xì)胞水平的QC和過濾)
  • QC and filtering of features (genes)(基因水平的QC和過濾)
  • QC of experimental variables(實(shí)驗(yàn)變量的QC)

加載所需的R包和數(shù)據(jù)集

library(scater)
data("sc_example_counts")
data("sc_example_cell_info")

example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts), 
    colData = sc_example_cell_info
)

example_sce
## class: SingleCellExperiment 
## dim: 2000 40 
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(0):
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(4): Cell Mutation_Status Cell_Cycle Treatment
## reducedDimNames(0):
## spikeNames(0):

計(jì)算QC metrics

scater使用calculateQCMetrics函數(shù)計(jì)算QC metrics籍嘹,它可以對(duì)細(xì)胞和基因進(jìn)行一系列的數(shù)據(jù)質(zhì)量控制,其結(jié)果分別存儲(chǔ)在colData和rowData中量蕊。默認(rèn)情況下谒臼,calculateQCMetrics函數(shù)使用原始的count值計(jì)算這些QC metrics,也可以通過exprs_values參數(shù)進(jìn)行修改酬屉。

# 使用calculateQCMetrics函數(shù)計(jì)算QC metrics
example_sce <- calculateQCMetrics(example_sce)

# 查看細(xì)胞水平的QC metrics
colnames(colData(example_sce))
 [1] "Cell"                           "Mutation_Status"               
 [3] "Cell_Cycle"                     "Treatment"                     
 [5] "is_cell_control"                "total_features_by_counts"      
 [7] "log10_total_features_by_counts" "total_counts"                  
 [9] "log10_total_counts"             "pct_counts_in_top_50_features" 
[11] "pct_counts_in_top_100_features" "pct_counts_in_top_200_features"
[13] "pct_counts_in_top_500_features"

head(colData(example_sce))
DataFrame with 6 rows and 13 columns
                Cell Mutation_Status  Cell_Cycle   Treatment is_cell_control
         <character>     <character> <character> <character>       <logical>
Cell_001    Cell_001        positive           S      treat1           FALSE
Cell_002    Cell_002        positive          G0      treat1           FALSE
Cell_003    Cell_003        negative          G1      treat1           FALSE
Cell_004    Cell_004        negative           S      treat1           FALSE
Cell_005    Cell_005        negative          G1      treat2           FALSE
Cell_006    Cell_006        negative          G0      treat1           FALSE
         total_features_by_counts log10_total_features_by_counts
                        <integer>                      <numeric>
Cell_001                      881               2.94546858513182
Cell_002                      624               2.79588001734408
Cell_003                      730               2.86391737695786
Cell_004                      728               2.86272752831797
Cell_005                      667               2.82477646247555
Cell_006                      646                2.8109042806687

# 查看基因水平的QC metrics
colnames(rowData(example_sce))
[1] "is_feature_control"    "mean_counts"           "log10_mean_counts"    
[4] "n_cells_by_counts"     "pct_dropout_by_counts" "total_counts"         
[7] "log10_total_counts" 

head(rowData(example_sce))
DataFrame with 6 rows and 7 columns
          is_feature_control mean_counts log10_mean_counts n_cells_by_counts
                   <logical>   <numeric>         <numeric>         <integer>
Gene_0001              FALSE      252.25  2.40354945403232                17
Gene_0002              FALSE      366.05  2.56472522840747                27
Gene_0003              FALSE      191.65  2.28476901334902                13
Gene_0004              FALSE      178.35  2.25370138101199                21
Gene_0005              FALSE       0.975 0.295567099962479                13
Gene_0006              FALSE     185.225  2.27003798294626                16
          pct_dropout_by_counts total_counts log10_total_counts
                      <numeric>    <integer>          <numeric>
Gene_0001                  57.5        10090   4.00393420617371
Gene_0002                  32.5        14642   4.16563006237618
Gene_0003                  67.5         7666   3.88462546325623
Gene_0004                  47.5         7134   3.85339397745067
Gene_0005                  67.5           39   1.60205999132796
Gene_0006                    60         7409   3.86981820797933

當(dāng)然半等,我們也可以設(shè)置一些參照(如ERCC spike-in,線粒體基因呐萨,死亡的細(xì)胞等)杀饵,計(jì)算其相應(yīng)的QC metrics進(jìn)行質(zhì)量控制。

example_sce <- calculateQCMetrics(example_sce, 
    feature_controls = list(ERCC = 1:20, mito = 500:1000),
    cell_controls = list(empty = 1:5, damaged = 31:40))

all_col_qc <- colnames(colData(example_sce))
all_col_qc <- all_col_qc[grep("ERCC", all_col_qc)]

all_col_qc
[1] "total_features_by_counts_ERCC"      
[2] "log10_total_features_by_counts_ERCC"
[3] "total_counts_ERCC"                  
[4] "log10_total_counts_ERCC"            
[5] "pct_counts_ERCC"                    
[6] "pct_counts_in_top_50_features_ERCC" 
[7] "pct_counts_in_top_100_features_ERCC"
[8] "pct_counts_in_top_200_features_ERCC"
[9] "pct_counts_in_top_500_features_ERCC"

細(xì)胞水平的QC metrics

  • total_counts: total number of counts for the cell (i.e., the library size).
  • total_features_by_counts: the number of features for the cell that have counts above the detection limit (default of zero).
  • pct_counts_X: percentage of all counts that come from the feature control set named X.

基因水平的QC metrics

  • mean_counts: the mean count of the gene/feature.
  • pct_dropout_by_counts: the percentage of cells with counts of zero for each gene.
  • pct_counts_Y: percentage of all counts that come from the cell control set named Y.

QC結(jié)果的可視化

Examining the most expressed features

使用plotHighestExprs函數(shù)可視化那些高表達(dá)基因(默認(rèn)查看50個(gè)基因)的表達(dá)情況谬擦。下圖中行表示每個(gè)基因切距,橙色的線(bar)代表該基因在每一個(gè)細(xì)胞中的表達(dá)量,圓圈代表這個(gè)基因在所有細(xì)胞中表達(dá)量的中位數(shù)惨远。默認(rèn)情況下谜悟,使用基因的count值計(jì)算表達(dá)情況话肖,也可以使用exprs_values參數(shù)進(jìn)行修改。

plotHighestExprs(example_sce, exprs_values = "counts")
image

Frequency of expression as a function of the mean

使用plotExprsFreqVsMean函數(shù)進(jìn)行可視化

plotExprsFreqVsMean(example_sce)
image

上圖趨勢中的異常值可能需要進(jìn)一步的調(diào)查葡幸。例如最筒,高表達(dá)基因的pseudo-genes的比對(duì)錯(cuò)誤將導(dǎo)致均值低的基因在所有的細(xì)胞中表達(dá)。相反蔚叨,PCR的擴(kuò)增偏差(或稀有種群的存在)可能會(huì)導(dǎo)致在極少數(shù)細(xì)胞中表達(dá)具有很高均值的基因床蜘。

Percentage of counts assigned to feature controls

對(duì)于細(xì)胞水平上的質(zhì)控,我們可以查看參照基因(feature controls)的表達(dá)量比上總基因表達(dá)量的百分比蔑水,如果一個(gè)基因在總基因表達(dá)量上的比例多邢锯,而在參照基因(如ERCC)里少,就是正常的細(xì)胞肤粱,反之則不正常弹囚。

plotColData(example_sce, x = "total_features_by_counts",
    y = "pct_counts_feature_control", colour = "Mutation_Status") +
    theme(legend.position = "top") +
    stat_smooth(method = "lm", se = FALSE, size = 2, fullrange = TRUE)
image

Cumulative expression plot

plotScater函數(shù)會(huì)從表達(dá)量最高的基因(默認(rèn)為500個(gè))中選一部分厨相,然后從高到低累加领曼,看看它們對(duì)每個(gè)細(xì)胞文庫的貢獻(xiàn)值大小。這種類型的圖類似于對(duì)芯片數(shù)據(jù)或bulk RNA-seq數(shù)據(jù)中按樣本繪制箱線圖可視化不同樣本的表達(dá)分布差異蛮穿。累積表達(dá)圖更適用于單細(xì)胞數(shù)據(jù)庶骄,因?yàn)閱渭?xì)胞數(shù)據(jù)難以一次性查看所有細(xì)胞的表達(dá)分布的箱形圖。
為了查看不同細(xì)胞的表達(dá)分布差異践磅,我們可以利用colData中的變量將細(xì)胞進(jìn)行分類单刁。默認(rèn)使用counts值進(jìn)行繪圖,我們也可以通過exprs_values參數(shù)指定其他的數(shù)據(jù)府适。

plotScater(example_sce, block1 = "Mutation_Status", block2 = "Treatment",
     colour_by = "Cell_Cycle", nfeatures = 300, exprs_values = "counts")
image

Plate position plot

For plate-based experiments, it is useful to see how expression or factors vary with the position of cell on the plate. This can be visualized using the plotPlatePosition function:

example_sce2 <- example_sce
example_sce2$plate_position <- paste0(
     rep(LETTERS[1:5], each = 8), 
     rep(formatC(1:8, width = 2, flag = "0"), 5)
)
plotPlatePosition(example_sce2, colour_by = "Gene_0001",
    by_exprs_values = "counts") 
image

Other quality control plots

可以使用plotFeatureData函數(shù)輕松地查看任意兩個(gè)元數(shù)據(jù)變量之間的關(guān)系:

plotRowData(example_sce, x = "n_cells_by_counts", y = "mean_counts")
image

The multiplot function also allows multiple plots to be generated on the same page, as demonstrated below.

p1 <- plotColData(example_sce, x = "total_counts", 
    y = "total_features_by_counts")
p2 <- plotColData(example_sce, x = "pct_counts_feature_control",
    y = "total_features_by_counts")
p3 <- plotColData(example_sce, x = "pct_counts_feature_control",
    y = "pct_counts_in_top_50_features")
multiplot(p1, p2, p3, cols = 3)
image

This is especially useful for side-by-side comparisons between control sets, as demonstrated below for the plot of highest-expressing features. A plot for non-control cells is shown on the left while the plot for the controls is shown on the right.

p1 <- plotHighestExprs(example_sce[, !example_sce$is_cell_control])
p2 <- plotHighestExprs(example_sce[, example_sce$is_cell_control])

multiplot(p1, p2, cols = 2)
image

QC結(jié)果的過濾

細(xì)胞水平的過濾

直接通過列數(shù)選取想要的細(xì)胞

# 選取前40個(gè)細(xì)胞
example_sce <- example_sce[,1:40]

使用filter函數(shù)根據(jù)指定條件選取想要的細(xì)胞

filter(example_sce, Treatment == "treat1")
## class: SingleCellExperiment 
## dim: 2000 27 
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(37): is_feature_control is_feature_control_ERCC ...
##   log10_total_counts_damaged pct_counts_damaged
## colnames(27): Cell_001 Cell_002 ... Cell_037 Cell_039
## colData names(51): Cell Mutation_Status ...
##   pct_counts_in_top_200_features_mito
##   pct_counts_in_top_500_features_mito
## reducedDimNames(0):
## spikeNames(0):

根據(jù)QC metrics設(shè)定閾值篩選高質(zhì)量的細(xì)胞羔飞,這里我們選取那些總counts數(shù)大于100,000,表達(dá)的基因數(shù)大于500的細(xì)胞檐春。

# 選取總counts數(shù)大于100,000的
keep.total <- example_sce$total_counts > 1e5
# 選取表達(dá)的基因數(shù)大于500的
keep.n <- example_sce$total_features_by_counts > 500
# 根據(jù)設(shè)定的條件進(jìn)行過濾
filtered <- example_sce[,keep.total & keep.n]
dim(filtered)
## [1] 2000   37

我們還可以通過isOutlier函數(shù)計(jì)算篩選的閾值逻淌,它將閾值定義為距離中位數(shù)一定數(shù)量的“中位數(shù)絕對(duì)偏差(MAD)”。超出此閾值的值被認(rèn)為是異常值疟暖,可以假定它們是一些低質(zhì)量的細(xì)胞卡儒,而將其過濾掉。這里我們選取那些log(total counts)值小于3倍MAD值的細(xì)胞作為outliers俐巴。

keep.total <- isOutlier(example_sce$total_counts, nmads=3, 
    type="lower", log=TRUE)
filtered <- example_sce[,keep.total]

基因水平的過濾

直接通過基因的表達(dá)量過濾掉那些低表達(dá)的基因骨望,這里我們選取那些至少在4個(gè)細(xì)胞中表達(dá)的基因。

keep_feature <- nexprs(example_sce, byrow=TRUE) >= 4
example_sce <- example_sce[keep_feature,]
dim(example_sce)
## [1] 1753   40

當(dāng)然欣舵,我們也可以通過一些其他的條件(如核糖體蛋白基因擎鸠,線粒體基因等)進(jìn)行基因的過濾。

Relationships between experimental factors and expression

我們可以使用plotExplanatoryVariables函數(shù)查看不同解釋因素的相對(duì)重要性缘圈。當(dāng)對(duì)每個(gè)基因的不同因子進(jìn)行表達(dá)量的線性回歸模型擬合時(shí)糠亩,我們會(huì)對(duì)colData(example_sce)中的每個(gè)因子計(jì)算其對(duì)應(yīng)的R2值虐骑。最好在表達(dá)量的對(duì)數(shù)值上執(zhí)行此操作,以減少平均值對(duì)方差的影響赎线。因此廷没,我們首先對(duì)基因的表達(dá)量進(jìn)行歸一化處理。

# 先對(duì)基因的表達(dá)進(jìn)行歸一化處理
example_sce <- normalize(example_sce)
plotExplanatoryVariables(example_sce)
image

上圖中每條線對(duì)應(yīng)一個(gè)因子垂寥,代表所有基因中R2值的分布颠黎。當(dāng)然,我們也可以通過variables參數(shù)選擇特定的因子進(jìn)行計(jì)算可視化滞项。

plotExplanatoryVariables(example_sce,
    variables = c("total_features_by_counts", "total_counts",
        "Mutation_Status", "Treatment", "Cell_Cycle"))
image

在這個(gè)小數(shù)據(jù)集中狭归,total_counts和total_features_by_counts解釋了基因表達(dá)中很大一部分的方差,它們?cè)谡鎸?shí)數(shù)據(jù)集中能解釋的方差比例應(yīng)該小得多(例如1-5%)文判。

Removing technical biases 去除技術(shù)偏差

Scaling normalization 數(shù)據(jù)歸一化處理

縮放歸一化(Scaling normalization)可以消除細(xì)胞特異性偏差过椎,其使特定細(xì)胞中所有基因的表達(dá)增加或減少,例如測序的覆蓋率或捕獲效率戏仓。
進(jìn)行縮放歸一化的最簡便方法是根據(jù)所有細(xì)胞的縮放文庫大小定義size factors疚宇,使得平均size factor等于1,確保歸一化后的值與原始count值的范圍相同赏殃。

# 使用librarySizeFactors函數(shù)計(jì)算細(xì)胞文庫size factors
sizeFactors(example_sce) <- librarySizeFactors(example_sce)
summary(sizeFactors(example_sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1463  0.6609  0.8112  1.0000  1.2533  2.7356

然后再使用normalize函數(shù)計(jì)算log轉(zhuǎn)換后的歸一化值敷待,并將其存儲(chǔ)在“l(fā)ogcounts” Assay中

example_sce <- normalize(example_sce)

雖然這種歸一化的方式很簡單,但細(xì)胞文庫大小歸一化并不能解決高通量測序數(shù)據(jù)中經(jīng)常出現(xiàn)的成分偏差仁热,它也不能解釋影響spike-in轉(zhuǎn)錄本產(chǎn)生的差異榜揖。我們強(qiáng)烈建議使用來自scran包的computeSumFactorscomputeSpikeFactors函數(shù)來進(jìn)行計(jì)算。

Batch correction 校正批次效應(yīng)

批次效應(yīng)的校正可以解決不同批次中細(xì)胞之間表達(dá)的系統(tǒng)差異抗蠢,與比例偏差不同举哟,這些偏差通常在給定批次的所有細(xì)胞中都是恒定的,但對(duì)于每個(gè)基因而言都是不同的迅矛。
我們可以使用limma軟件包中的removeBatchEffect函數(shù)來消除批次效應(yīng)妨猩。

library(limma)
batch <- rep(1:2, each=20)
# 使用removeBatchEffect函數(shù)去除批次效應(yīng)
corrected <- removeBatchEffect(logcounts(example_sce), block=batch)
assay(example_sce, "corrected_logcounts") <- corrected

參考來源:http://www.bioconductor.org/packages/release/bioc/vignettes/scater/inst/doc/overview.html

image
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個(gè)濱河市诬乞,隨后出現(xiàn)的幾起案子册赛,更是在濱河造成了極大的恐慌,老刑警劉巖震嫉,帶你破解...
    沈念sama閱讀 216,372評(píng)論 6 498
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件森瘪,死亡現(xiàn)場離奇詭異,居然都是意外死亡票堵,警方通過查閱死者的電腦和手機(jī)扼睬,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 92,368評(píng)論 3 392
  • 文/潘曉璐 我一進(jìn)店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人窗宇,你說我怎么就攤上這事措伐。” “怎么了军俊?”我有些...
    開封第一講書人閱讀 162,415評(píng)論 0 353
  • 文/不壞的土叔 我叫張陵侥加,是天一觀的道長。 經(jīng)常有香客問我粪躬,道長担败,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 58,157評(píng)論 1 292
  • 正文 為了忘掉前任镰官,我火速辦了婚禮提前,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘泳唠。我一直安慰自己狈网,他們只是感情好,可當(dāng)我...
    茶點(diǎn)故事閱讀 67,171評(píng)論 6 388
  • 文/花漫 我一把揭開白布笨腥。 她就那樣靜靜地躺著拓哺,像睡著了一般。 火紅的嫁衣襯著肌膚如雪扇雕。 梳的紋絲不亂的頭發(fā)上拓售,一...
    開封第一講書人閱讀 51,125評(píng)論 1 297
  • 那天窥摄,我揣著相機(jī)與錄音镶奉,去河邊找鬼。 笑死崭放,一個(gè)胖子當(dāng)著我的面吹牛哨苛,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播币砂,決...
    沈念sama閱讀 40,028評(píng)論 3 417
  • 文/蒼蘭香墨 我猛地睜開眼建峭,長吁一口氣:“原來是場噩夢啊……” “哼!你這毒婦竟也來了决摧?” 一聲冷哼從身側(cè)響起亿蒸,我...
    開封第一講書人閱讀 38,887評(píng)論 0 274
  • 序言:老撾萬榮一對(duì)情侶失蹤,失蹤者是張志新(化名)和其女友劉穎掌桩,沒想到半個(gè)月后边锁,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 45,310評(píng)論 1 310
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡波岛,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 37,533評(píng)論 2 332
  • 正文 我和宋清朗相戀三年茅坛,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片则拷。...
    茶點(diǎn)故事閱讀 39,690評(píng)論 1 348
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡贡蓖,死狀恐怖曹鸠,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情斥铺,我是刑警寧澤彻桃,帶...
    沈念sama閱讀 35,411評(píng)論 5 343
  • 正文 年R本政府宣布,位于F島的核電站晾蜘,受9級(jí)特大地震影響叛薯,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜笙纤,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,004評(píng)論 3 325
  • 文/蒙蒙 一耗溜、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧省容,春花似錦抖拴、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,659評(píng)論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽。三九已至笼蛛,卻和暖如春洒放,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背滨砍。 一陣腳步聲響...
    開封第一講書人閱讀 32,812評(píng)論 1 268
  • 我被黑心中介騙來泰國打工往湿, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留,地道東北人惋戏。 一個(gè)月前我還...
    沈念sama閱讀 47,693評(píng)論 2 368
  • 正文 我出身青樓领追,卻偏偏與公主長得像,于是被迫代替她去往敵國和親响逢。 傳聞我的和親對(duì)象是個(gè)殘疾皇子绒窑,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 44,577評(píng)論 2 353

推薦閱讀更多精彩內(nèi)容