使用scater包進(jìn)行單細(xì)胞測序分析（二）：數(shù)據(jù)質(zhì)量控制

scater包提供了一系列的數(shù)據(jù)質(zhì)量控制方法剃斧，可以對(duì)單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)進(jìn)行嚴(yán)格的質(zhì)量控制，它主要從以下三個(gè)方面進(jìn)行質(zhì)量控制QC：

QC and filtering of cells（細(xì)胞水平的QC和過濾）

QC and filtering of features (genes)（基因水平的QC和過濾）

QC of experimental variables（實(shí)驗(yàn)變量的QC）

加載所需的R包和數(shù)據(jù)集

library(scater)
data("sc_example_counts")
data("sc_example_cell_info")

example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts), 
    colData = sc_example_cell_info
)

example_sce
## class: SingleCellExperiment 
## dim: 2000 40 
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(0):
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(4): Cell Mutation_Status Cell_Cycle Treatment
## reducedDimNames(0):
## spikeNames(0):

計(jì)算QC metrics

scater使用calculateQCMetrics函數(shù)計(jì)算QC metrics籍嘹，它可以對(duì)細(xì)胞和基因進(jìn)行一系列的數(shù)據(jù)質(zhì)量控制，其結(jié)果分別存儲(chǔ)在colData和rowData中量蕊。默認(rèn)情況下谒臼，calculateQCMetrics函數(shù)使用原始的count值計(jì)算這些QC metrics，也可以通過exprs_values參數(shù)進(jìn)行修改酬屉。

# 使用calculateQCMetrics函數(shù)計(jì)算QC metrics
example_sce <- calculateQCMetrics(example_sce)

# 查看細(xì)胞水平的QC metrics
colnames(colData(example_sce))
 [1] "Cell"                           "Mutation_Status"               
 [3] "Cell_Cycle"                     "Treatment"                     
 [5] "is_cell_control"                "total_features_by_counts"      
 [7] "log10_total_features_by_counts" "total_counts"                  
 [9] "log10_total_counts"             "pct_counts_in_top_50_features" 
[11] "pct_counts_in_top_100_features" "pct_counts_in_top_200_features"
[13] "pct_counts_in_top_500_features"

head(colData(example_sce))
DataFrame with 6 rows and 13 columns
                Cell Mutation_Status  Cell_Cycle   Treatment is_cell_control
         <character>     <character> <character> <character>       <logical>
Cell_001    Cell_001        positive           S      treat1           FALSE
Cell_002    Cell_002        positive          G0      treat1           FALSE
Cell_003    Cell_003        negative          G1      treat1           FALSE
Cell_004    Cell_004        negative           S      treat1           FALSE
Cell_005    Cell_005        negative          G1      treat2           FALSE
Cell_006    Cell_006        negative          G0      treat1           FALSE
         total_features_by_counts log10_total_features_by_counts
                        <integer>                      <numeric>
Cell_001                      881               2.94546858513182
Cell_002                      624               2.79588001734408
Cell_003                      730               2.86391737695786
Cell_004                      728               2.86272752831797
Cell_005                      667               2.82477646247555
Cell_006                      646                2.8109042806687

# 查看基因水平的QC metrics
colnames(rowData(example_sce))
[1] "is_feature_control"    "mean_counts"           "log10_mean_counts"    
[4] "n_cells_by_counts"     "pct_dropout_by_counts" "total_counts"         
[7] "log10_total_counts" 

head(rowData(example_sce))
DataFrame with 6 rows and 7 columns
          is_feature_control mean_counts log10_mean_counts n_cells_by_counts
                   <logical>   <numeric>         <numeric>         <integer>
Gene_0001              FALSE      252.25  2.40354945403232                17
Gene_0002              FALSE      366.05  2.56472522840747                27
Gene_0003              FALSE      191.65  2.28476901334902                13
Gene_0004              FALSE      178.35  2.25370138101199                21
Gene_0005              FALSE       0.975 0.295567099962479                13
Gene_0006              FALSE     185.225  2.27003798294626                16
          pct_dropout_by_counts total_counts log10_total_counts
                      <numeric>    <integer>          <numeric>
Gene_0001                  57.5        10090   4.00393420617371
Gene_0002                  32.5        14642   4.16563006237618
Gene_0003                  67.5         7666   3.88462546325623
Gene_0004                  47.5         7134   3.85339397745067
Gene_0005                  67.5           39   1.60205999132796
Gene_0006                    60         7409   3.86981820797933

當(dāng)然半等，我們也可以設(shè)置一些參照（如ERCC spike-in，線粒體基因呐萨，死亡的細(xì)胞等）杀饵，計(jì)算其相應(yīng)的QC metrics進(jìn)行質(zhì)量控制。

example_sce <- calculateQCMetrics(example_sce, 
    feature_controls = list(ERCC = 1:20, mito = 500:1000),
    cell_controls = list(empty = 1:5, damaged = 31:40))

all_col_qc <- colnames(colData(example_sce))
all_col_qc <- all_col_qc[grep("ERCC", all_col_qc)]

all_col_qc
[1] "total_features_by_counts_ERCC"      
[2] "log10_total_features_by_counts_ERCC"
[3] "total_counts_ERCC"                  
[4] "log10_total_counts_ERCC"            
[5] "pct_counts_ERCC"                    
[6] "pct_counts_in_top_50_features_ERCC" 
[7] "pct_counts_in_top_100_features_ERCC"
[8] "pct_counts_in_top_200_features_ERCC"
[9] "pct_counts_in_top_500_features_ERCC"

細(xì)胞水平的QC metrics

total_counts: total number of counts for the cell (i.e., the library size).

total_features_by_counts: the number of features for the cell that have counts above the detection limit (default of zero).

pct_counts_X: percentage of all counts that come from the feature control set named X.

基因水平的QC metrics

mean_counts: the mean count of the gene/feature.

pct_dropout_by_counts: the percentage of cells with counts of zero for each gene.

pct_counts_Y: percentage of all counts that come from the cell control set named Y.

QC結(jié)果的可視化

Examining the most expressed features

使用plotHighestExprs函數(shù)可視化那些高表達(dá)基因（默認(rèn)查看50個(gè)基因）的表達(dá)情況谬擦。下圖中行表示每個(gè)基因切距，橙色的線(bar)代表該基因在每一個(gè)細(xì)胞中的表達(dá)量，圓圈代表這個(gè)基因在所有細(xì)胞中表達(dá)量的中位數(shù)惨远。默認(rèn)情況下谜悟，使用基因的count值計(jì)算表達(dá)情況话肖，也可以使用exprs_values參數(shù)進(jìn)行修改。

plotHighestExprs(example_sce, exprs_values = "counts")

image

Frequency of expression as a function of the mean

使用plotExprsFreqVsMean函數(shù)進(jìn)行可視化

plotExprsFreqVsMean(example_sce)

image

上圖趨勢中的異常值可能需要進(jìn)一步的調(diào)查葡幸。例如最筒，高表達(dá)基因的pseudo-genes的比對(duì)錯(cuò)誤將導(dǎo)致均值低的基因在所有的細(xì)胞中表達(dá)。相反蔚叨，PCR的擴(kuò)增偏差（或稀有種群的存在）可能會(huì)導(dǎo)致在極少數(shù)細(xì)胞中表達(dá)具有很高均值的基因床蜘。

Percentage of counts assigned to feature controls

對(duì)于細(xì)胞水平上的質(zhì)控，我們可以查看參照基因（feature controls）的表達(dá)量比上總基因表達(dá)量的百分比蔑水，如果一個(gè)基因在總基因表達(dá)量上的比例多邢锯，而在參照基因（如ERCC）里少，就是正常的細(xì)胞肤粱，反之則不正常弹囚。

plotColData(example_sce, x = "total_features_by_counts",
    y = "pct_counts_feature_control", colour = "Mutation_Status") +
    theme(legend.position = "top") +
    stat_smooth(method = "lm", se = FALSE, size = 2, fullrange = TRUE)

image

Cumulative expression plot

plotScater函數(shù)會(huì)從表達(dá)量最高的基因（默認(rèn)為500個(gè)）中選一部分厨相，然后從高到低累加领曼，看看它們對(duì)每個(gè)細(xì)胞文庫的貢獻(xiàn)值大小。這種類型的圖類似于對(duì)芯片數(shù)據(jù)或bulk RNA-seq數(shù)據(jù)中按樣本繪制箱線圖可視化不同樣本的表達(dá)分布差異蛮穿。累積表達(dá)圖更適用于單細(xì)胞數(shù)據(jù)庶骄，因?yàn)閱渭?xì)胞數(shù)據(jù)難以一次性查看所有細(xì)胞的表達(dá)分布的箱形圖。
為了查看不同細(xì)胞的表達(dá)分布差異践磅，我們可以利用colData中的變量將細(xì)胞進(jìn)行分類单刁。默認(rèn)使用counts值進(jìn)行繪圖，我們也可以通過exprs_values參數(shù)指定其他的數(shù)據(jù)府适。

plotScater(example_sce, block1 = "Mutation_Status", block2 = "Treatment",
     colour_by = "Cell_Cycle", nfeatures = 300, exprs_values = "counts")

image

Plate position plot

For plate-based experiments, it is useful to see how expression or factors vary with the position of cell on the plate. This can be visualized using the plotPlatePosition function:

example_sce2 <- example_sce
example_sce2$plate_position <- paste0(
     rep(LETTERS[1:5], each = 8), 
     rep(formatC(1:8, width = 2, flag = "0"), 5)
)
plotPlatePosition(example_sce2, colour_by = "Gene_0001",
    by_exprs_values = "counts")

image

Other quality control plots

可以使用plotFeatureData函數(shù)輕松地查看任意兩個(gè)元數(shù)據(jù)變量之間的關(guān)系：

plotRowData(example_sce, x = "n_cells_by_counts", y = "mean_counts")

image

The multiplot function also allows multiple plots to be generated on the same page, as demonstrated below.

p1 <- plotColData(example_sce, x = "total_counts", 
    y = "total_features_by_counts")
p2 <- plotColData(example_sce, x = "pct_counts_feature_control",
    y = "total_features_by_counts")
p3 <- plotColData(example_sce, x = "pct_counts_feature_control",
    y = "pct_counts_in_top_50_features")
multiplot(p1, p2, p3, cols = 3)

image

This is especially useful for side-by-side comparisons between control sets, as demonstrated below for the plot of highest-expressing features. A plot for non-control cells is shown on the left while the plot for the controls is shown on the right.

p1 <- plotHighestExprs(example_sce[, !example_sce$is_cell_control])
p2 <- plotHighestExprs(example_sce[, example_sce$is_cell_control])

multiplot(p1, p2, cols = 2)

image

QC結(jié)果的過濾

細(xì)胞水平的過濾

直接通過列數(shù)選取想要的細(xì)胞

# 選取前40個(gè)細(xì)胞
example_sce <- example_sce[,1:40]

使用filter函數(shù)根據(jù)指定條件選取想要的細(xì)胞

filter(example_sce, Treatment == "treat1")
## class: SingleCellExperiment 
## dim: 2000 27 
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(37): is_feature_control is_feature_control_ERCC ...
##   log10_total_counts_damaged pct_counts_damaged
## colnames(27): Cell_001 Cell_002 ... Cell_037 Cell_039
## colData names(51): Cell Mutation_Status ...
##   pct_counts_in_top_200_features_mito
##   pct_counts_in_top_500_features_mito
## reducedDimNames(0):
## spikeNames(0):

根據(jù)QC metrics設(shè)定閾值篩選高質(zhì)量的細(xì)胞羔飞，這里我們選取那些總counts數(shù)大于100,000，表達(dá)的基因數(shù)大于500的細(xì)胞檐春。

# 選取總counts數(shù)大于100,000的
keep.total <- example_sce$total_counts > 1e5
# 選取表達(dá)的基因數(shù)大于500的
keep.n <- example_sce$total_features_by_counts > 500
# 根據(jù)設(shè)定的條件進(jìn)行過濾
filtered <- example_sce[,keep.total & keep.n]
dim(filtered)
## [1] 2000   37

我們還可以通過isOutlier函數(shù)計(jì)算篩選的閾值逻淌，它將閾值定義為距離中位數(shù)一定數(shù)量的“中位數(shù)絕對(duì)偏差（MAD）”。超出此閾值的值被認(rèn)為是異常值疟暖，可以假定它們是一些低質(zhì)量的細(xì)胞卡儒，而將其過濾掉。這里我們選取那些log(total counts)值小于3倍MAD值的細(xì)胞作為outliers俐巴。

keep.total <- isOutlier(example_sce$total_counts, nmads=3, 
    type="lower", log=TRUE)
filtered <- example_sce[,keep.total]

基因水平的過濾

直接通過基因的表達(dá)量過濾掉那些低表達(dá)的基因骨望，這里我們選取那些至少在4個(gè)細(xì)胞中表達(dá)的基因。

keep_feature <- nexprs(example_sce, byrow=TRUE) >= 4
example_sce <- example_sce[keep_feature,]
dim(example_sce)
## [1] 1753   40

當(dāng)然欣舵，我們也可以通過一些其他的條件（如核糖體蛋白基因擎鸠，線粒體基因等）進(jìn)行基因的過濾。

Relationships between experimental factors and expression

我們可以使用plotExplanatoryVariables函數(shù)查看不同解釋因素的相對(duì)重要性缘圈。當(dāng)對(duì)每個(gè)基因的不同因子進(jìn)行表達(dá)量的線性回歸模型擬合時(shí)糠亩，我們會(huì)對(duì)colData（example_sce）中的每個(gè)因子計(jì)算其對(duì)應(yīng)的R2值虐骑。最好在表達(dá)量的對(duì)數(shù)值上執(zhí)行此操作，以減少平均值對(duì)方差的影響赎线。因此廷没，我們首先對(duì)基因的表達(dá)量進(jìn)行歸一化處理。

# 先對(duì)基因的表達(dá)進(jìn)行歸一化處理
example_sce <- normalize(example_sce)
plotExplanatoryVariables(example_sce)

image

上圖中每條線對(duì)應(yīng)一個(gè)因子垂寥，代表所有基因中R2值的分布颠黎。當(dāng)然，我們也可以通過variables參數(shù)選擇特定的因子進(jìn)行計(jì)算可視化滞项。

plotExplanatoryVariables(example_sce,
    variables = c("total_features_by_counts", "total_counts",
        "Mutation_Status", "Treatment", "Cell_Cycle"))

image

在這個(gè)小數(shù)據(jù)集中狭归，total_counts和total_features_by_counts解釋了基因表達(dá)中很大一部分的方差，它們?cè)谡鎸?shí)數(shù)據(jù)集中能解釋的方差比例應(yīng)該小得多（例如1-5％）文判。

Removing technical biases 去除技術(shù)偏差

Scaling normalization 數(shù)據(jù)歸一化處理

縮放歸一化（Scaling normalization）可以消除細(xì)胞特異性偏差过椎，其使特定細(xì)胞中所有基因的表達(dá)增加或減少，例如測序的覆蓋率或捕獲效率戏仓。
進(jìn)行縮放歸一化的最簡便方法是根據(jù)所有細(xì)胞的縮放文庫大小定義size factors疚宇，使得平均size factor等于1，確保歸一化后的值與原始count值的范圍相同赏殃。

# 使用librarySizeFactors函數(shù)計(jì)算細(xì)胞文庫size factors
sizeFactors(example_sce) <- librarySizeFactors(example_sce)
summary(sizeFactors(example_sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1463  0.6609  0.8112  1.0000  1.2533  2.7356

然后再使用normalize函數(shù)計(jì)算log轉(zhuǎn)換后的歸一化值敷待，并將其存儲(chǔ)在“l(fā)ogcounts” Assay中

example_sce <- normalize(example_sce)

雖然這種歸一化的方式很簡單，但細(xì)胞文庫大小歸一化并不能解決高通量測序數(shù)據(jù)中經(jīng)常出現(xiàn)的成分偏差仁热，它也不能解釋影響spike-in轉(zhuǎn)錄本產(chǎn)生的差異榜揖。我們強(qiáng)烈建議使用來自scran包的computeSumFactors和computeSpikeFactors函數(shù)來進(jìn)行計(jì)算。

Batch correction 校正批次效應(yīng)

批次效應(yīng)的校正可以解決不同批次中細(xì)胞之間表達(dá)的系統(tǒng)差異抗蠢，與比例偏差不同举哟，這些偏差通常在給定批次的所有細(xì)胞中都是恒定的，但對(duì)于每個(gè)基因而言都是不同的迅矛。
我們可以使用limma軟件包中的removeBatchEffect函數(shù)來消除批次效應(yīng)妨猩。

library(limma)
batch <- rep(1:2, each=20)
# 使用removeBatchEffect函數(shù)去除批次效應(yīng)
corrected <- removeBatchEffect(logcounts(example_sce), block=batch)
assay(example_sce, "corrected_logcounts") <- corrected

參考來源：http://www.bioconductor.org/packages/release/bioc/vignettes/scater/inst/doc/overview.html

image

最后編輯于：2020.09.17 12:32:38

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

人面猴
序言：七十年代末，一起剝皮案震驚了整個(gè)濱河市诬乞，隨后出現(xiàn)的幾起案子册赛，更是在濱河造成了極大的恐慌，老刑警劉巖震嫉，帶你破解...
沈念sama閱讀 216,372評(píng)論 6贊 498
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件森瘪，死亡現(xiàn)場離奇詭異，居然都是意外死亡票堵，警方通過查閱死者的電腦和手機(jī)扼睬，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 92,368評(píng)論 3贊 392
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門，熙熙樓的掌柜王于貴愁眉苦臉地迎上來，“玉大人窗宇，你說我怎么就攤上這事措伐。” “怎么了军俊？”我有些...
開封第一講書人閱讀 162,415評(píng)論 0贊 353
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵侥加，是天一觀的道長。經(jīng)常有香客問我粪躬，道長担败，這世上最難降的妖魔是什么？我笑而不...
開封第一講書人閱讀 58,157評(píng)論 1贊 292
?港島之戀（遺憾婚禮）
正文為了忘掉前任镰官，我火速辦了婚禮提前，結(jié)果婚禮上，老公的妹妹穿的比我還像新娘泳唠。我一直安慰自己狈网，他們只是感情好，可當(dāng)我...
茶點(diǎn)故事閱讀 67,171評(píng)論 6贊 388
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布笨腥。她就那樣靜靜地躺著拓哺，像睡著了一般。火紅的嫁衣襯著肌膚如雪扇雕。梳的紋絲不亂的頭發(fā)上拓售，一...
開封第一講書人閱讀 51,125評(píng)論 1贊 297
城市分裂傳說
那天窥摄，我揣著相機(jī)與錄音镶奉，去河邊找鬼。笑死崭放，一個(gè)胖子當(dāng)著我的面吹牛哨苛，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播币砂，決...
沈念sama閱讀 40,028評(píng)論 3贊 417
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼建峭，長吁一口氣：“原來是場噩夢啊……” “哼！你這毒婦竟也來了决摧？” 一聲冷哼從身側(cè)響起亿蒸，我...
開封第一講書人閱讀 38,887評(píng)論 0贊 274
萬榮殺人案實(shí)錄
序言：老撾萬榮一對(duì)情侶失蹤，失蹤者是張志新（化名）和其女友劉穎掌桩，沒想到半個(gè)月后边锁，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體，經(jīng)...
沈念sama閱讀 45,310評(píng)論 1贊 310
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡波岛，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 37,533評(píng)論 2贊 332
?白月光啟示錄
正文我和宋清朗相戀三年茅坛，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片则拷。...
茶點(diǎn)故事閱讀 39,690評(píng)論 1贊 348
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡贡蓖，死狀恐怖曹鸠，靈堂內(nèi)的尸體忽然破棺而出，到底是詐尸還是另有隱情斥铺，我是刑警寧澤彻桃，帶...
沈念sama閱讀 35,411評(píng)論 5贊 343
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布，位于F島的核電站晾蜘，受9級(jí)特大地震影響叛薯，放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜笙纤，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 41,004評(píng)論 3贊 325
男人毒藥：我在死后第九天來索命
文/蒙蒙一耗溜、第九天我趴在偏房一處隱蔽的房頂上張望。院中可真熱鬧省容，春花似錦抖拴、人聲如沸。這莊子的主人今日做“春日...
開封第一講書人閱讀 31,659評(píng)論 0贊 22
一樁弒父案阿宅，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽。三九已至笼蛛，卻和暖如春洒放，著一層夾襖步出監(jiān)牢的瞬間，已是汗流浹背滨砍。一陣腳步聲響...
開封第一講書人閱讀 32,812評(píng)論 1贊 268
情欲美人皮
我被黑心中介騙來泰國打工往湿，沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留，地道東北人惋戏。一個(gè)月前我還...
沈念sama閱讀 47,693評(píng)論 2贊 368
代替公主和親
正文我出身青樓领追，卻偏偏與公主長得像，于是被迫代替她去往敵國和親响逢。傳聞我的和親對(duì)象是個(gè)殘疾皇子绒窑，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 44,577評(píng)論 2贊 353

使用scater包進(jìn)行單細(xì)胞測序分析（二）：數(shù)據(jù)質(zhì)量控制

加載所需的R包和數(shù)據(jù)集

計(jì)算QC metrics

細(xì)胞水平的QC metrics

基因水平的QC metrics

QC結(jié)果的可視化

Examining the most expressed features

Frequency of expression as a function of the mean

Percentage of counts assigned to feature controls

Cumulative expression plot

Plate position plot

Other quality control plots

QC結(jié)果的過濾

細(xì)胞水平的過濾

基因水平的過濾

Relationships between experimental factors and expression

Removing technical biases 去除技術(shù)偏差

Scaling normalization 數(shù)據(jù)歸一化處理

Batch correction 校正批次效應(yīng)

推薦閱讀更多精彩內(nèi)容