scater
包提供了一系列的數(shù)據(jù)質(zhì)量控制方法剃斧,可以對(duì)單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)進(jìn)行嚴(yán)格的質(zhì)量控制,它主要從以下三個(gè)方面進(jìn)行質(zhì)量控制QC:
- QC and filtering of cells(細(xì)胞水平的QC和過濾)
- QC and filtering of features (genes)(基因水平的QC和過濾)
- QC of experimental variables(實(shí)驗(yàn)變量的QC)
加載所需的R包和數(shù)據(jù)集
library(scater)
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts),
colData = sc_example_cell_info
)
example_sce
## class: SingleCellExperiment
## dim: 2000 40
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(0):
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(4): Cell Mutation_Status Cell_Cycle Treatment
## reducedDimNames(0):
## spikeNames(0):
計(jì)算QC metrics
scater使用calculateQCMetrics
函數(shù)計(jì)算QC metrics籍嘹,它可以對(duì)細(xì)胞和基因進(jìn)行一系列的數(shù)據(jù)質(zhì)量控制,其結(jié)果分別存儲(chǔ)在colData和rowData中量蕊。默認(rèn)情況下谒臼,calculateQCMetrics函數(shù)使用原始的count值計(jì)算這些QC metrics,也可以通過exprs_values參數(shù)進(jìn)行修改酬屉。
# 使用calculateQCMetrics函數(shù)計(jì)算QC metrics
example_sce <- calculateQCMetrics(example_sce)
# 查看細(xì)胞水平的QC metrics
colnames(colData(example_sce))
[1] "Cell" "Mutation_Status"
[3] "Cell_Cycle" "Treatment"
[5] "is_cell_control" "total_features_by_counts"
[7] "log10_total_features_by_counts" "total_counts"
[9] "log10_total_counts" "pct_counts_in_top_50_features"
[11] "pct_counts_in_top_100_features" "pct_counts_in_top_200_features"
[13] "pct_counts_in_top_500_features"
head(colData(example_sce))
DataFrame with 6 rows and 13 columns
Cell Mutation_Status Cell_Cycle Treatment is_cell_control
<character> <character> <character> <character> <logical>
Cell_001 Cell_001 positive S treat1 FALSE
Cell_002 Cell_002 positive G0 treat1 FALSE
Cell_003 Cell_003 negative G1 treat1 FALSE
Cell_004 Cell_004 negative S treat1 FALSE
Cell_005 Cell_005 negative G1 treat2 FALSE
Cell_006 Cell_006 negative G0 treat1 FALSE
total_features_by_counts log10_total_features_by_counts
<integer> <numeric>
Cell_001 881 2.94546858513182
Cell_002 624 2.79588001734408
Cell_003 730 2.86391737695786
Cell_004 728 2.86272752831797
Cell_005 667 2.82477646247555
Cell_006 646 2.8109042806687
# 查看基因水平的QC metrics
colnames(rowData(example_sce))
[1] "is_feature_control" "mean_counts" "log10_mean_counts"
[4] "n_cells_by_counts" "pct_dropout_by_counts" "total_counts"
[7] "log10_total_counts"
head(rowData(example_sce))
DataFrame with 6 rows and 7 columns
is_feature_control mean_counts log10_mean_counts n_cells_by_counts
<logical> <numeric> <numeric> <integer>
Gene_0001 FALSE 252.25 2.40354945403232 17
Gene_0002 FALSE 366.05 2.56472522840747 27
Gene_0003 FALSE 191.65 2.28476901334902 13
Gene_0004 FALSE 178.35 2.25370138101199 21
Gene_0005 FALSE 0.975 0.295567099962479 13
Gene_0006 FALSE 185.225 2.27003798294626 16
pct_dropout_by_counts total_counts log10_total_counts
<numeric> <integer> <numeric>
Gene_0001 57.5 10090 4.00393420617371
Gene_0002 32.5 14642 4.16563006237618
Gene_0003 67.5 7666 3.88462546325623
Gene_0004 47.5 7134 3.85339397745067
Gene_0005 67.5 39 1.60205999132796
Gene_0006 60 7409 3.86981820797933
當(dāng)然半等,我們也可以設(shè)置一些參照(如ERCC spike-in,線粒體基因呐萨,死亡的細(xì)胞等)杀饵,計(jì)算其相應(yīng)的QC metrics進(jìn)行質(zhì)量控制。
example_sce <- calculateQCMetrics(example_sce,
feature_controls = list(ERCC = 1:20, mito = 500:1000),
cell_controls = list(empty = 1:5, damaged = 31:40))
all_col_qc <- colnames(colData(example_sce))
all_col_qc <- all_col_qc[grep("ERCC", all_col_qc)]
all_col_qc
[1] "total_features_by_counts_ERCC"
[2] "log10_total_features_by_counts_ERCC"
[3] "total_counts_ERCC"
[4] "log10_total_counts_ERCC"
[5] "pct_counts_ERCC"
[6] "pct_counts_in_top_50_features_ERCC"
[7] "pct_counts_in_top_100_features_ERCC"
[8] "pct_counts_in_top_200_features_ERCC"
[9] "pct_counts_in_top_500_features_ERCC"
細(xì)胞水平的QC metrics
- total_counts: total number of counts for the cell (i.e., the library size).
- total_features_by_counts: the number of features for the cell that have counts above the detection limit (default of zero).
- pct_counts_X: percentage of all counts that come from the feature control set named X.
基因水平的QC metrics
- mean_counts: the mean count of the gene/feature.
- pct_dropout_by_counts: the percentage of cells with counts of zero for each gene.
- pct_counts_Y: percentage of all counts that come from the cell control set named Y.
QC結(jié)果的可視化
Examining the most expressed features
使用plotHighestExprs
函數(shù)可視化那些高表達(dá)基因(默認(rèn)查看50個(gè)基因)的表達(dá)情況谬擦。下圖中行表示每個(gè)基因切距,橙色的線(bar)代表該基因在每一個(gè)細(xì)胞中的表達(dá)量,圓圈代表這個(gè)基因在所有細(xì)胞中表達(dá)量的中位數(shù)惨远。默認(rèn)情況下谜悟,使用基因的count值計(jì)算表達(dá)情況话肖,也可以使用exprs_values參數(shù)進(jìn)行修改。
plotHighestExprs(example_sce, exprs_values = "counts")
Frequency of expression as a function of the mean
使用plotExprsFreqVsMean
函數(shù)進(jìn)行可視化
plotExprsFreqVsMean(example_sce)
上圖趨勢中的異常值可能需要進(jìn)一步的調(diào)查葡幸。例如最筒,高表達(dá)基因的pseudo-genes的比對(duì)錯(cuò)誤將導(dǎo)致均值低的基因在所有的細(xì)胞中表達(dá)。相反蔚叨,PCR的擴(kuò)增偏差(或稀有種群的存在)可能會(huì)導(dǎo)致在極少數(shù)細(xì)胞中表達(dá)具有很高均值的基因床蜘。
Percentage of counts assigned to feature controls
對(duì)于細(xì)胞水平上的質(zhì)控,我們可以查看參照基因(feature controls)的表達(dá)量比上總基因表達(dá)量的百分比蔑水,如果一個(gè)基因在總基因表達(dá)量上的比例多邢锯,而在參照基因(如ERCC)里少,就是正常的細(xì)胞肤粱,反之則不正常弹囚。
plotColData(example_sce, x = "total_features_by_counts",
y = "pct_counts_feature_control", colour = "Mutation_Status") +
theme(legend.position = "top") +
stat_smooth(method = "lm", se = FALSE, size = 2, fullrange = TRUE)
Cumulative expression plot
plotScater
函數(shù)會(huì)從表達(dá)量最高的基因(默認(rèn)為500個(gè))中選一部分厨相,然后從高到低累加领曼,看看它們對(duì)每個(gè)細(xì)胞文庫的貢獻(xiàn)值大小。這種類型的圖類似于對(duì)芯片數(shù)據(jù)或bulk RNA-seq數(shù)據(jù)中按樣本繪制箱線圖可視化不同樣本的表達(dá)分布差異蛮穿。累積表達(dá)圖更適用于單細(xì)胞數(shù)據(jù)庶骄,因?yàn)閱渭?xì)胞數(shù)據(jù)難以一次性查看所有細(xì)胞的表達(dá)分布的箱形圖。
為了查看不同細(xì)胞的表達(dá)分布差異践磅,我們可以利用colData中的變量將細(xì)胞進(jìn)行分類单刁。默認(rèn)使用counts值進(jìn)行繪圖,我們也可以通過exprs_values參數(shù)指定其他的數(shù)據(jù)府适。
plotScater(example_sce, block1 = "Mutation_Status", block2 = "Treatment",
colour_by = "Cell_Cycle", nfeatures = 300, exprs_values = "counts")
Plate position plot
For plate-based experiments, it is useful to see how expression or factors vary with the position of cell on the plate. This can be visualized using the plotPlatePosition
function:
example_sce2 <- example_sce
example_sce2$plate_position <- paste0(
rep(LETTERS[1:5], each = 8),
rep(formatC(1:8, width = 2, flag = "0"), 5)
)
plotPlatePosition(example_sce2, colour_by = "Gene_0001",
by_exprs_values = "counts")
Other quality control plots
可以使用plotFeatureData
函數(shù)輕松地查看任意兩個(gè)元數(shù)據(jù)變量之間的關(guān)系:
plotRowData(example_sce, x = "n_cells_by_counts", y = "mean_counts")
The multiplot function also allows multiple plots to be generated on the same page, as demonstrated below.
p1 <- plotColData(example_sce, x = "total_counts",
y = "total_features_by_counts")
p2 <- plotColData(example_sce, x = "pct_counts_feature_control",
y = "total_features_by_counts")
p3 <- plotColData(example_sce, x = "pct_counts_feature_control",
y = "pct_counts_in_top_50_features")
multiplot(p1, p2, p3, cols = 3)
This is especially useful for side-by-side comparisons
between control sets, as demonstrated below for the plot of highest-expressing features. A plot for non-control cells is shown on the left while the plot for the controls is shown on the right.
p1 <- plotHighestExprs(example_sce[, !example_sce$is_cell_control])
p2 <- plotHighestExprs(example_sce[, example_sce$is_cell_control])
multiplot(p1, p2, cols = 2)
QC結(jié)果的過濾
細(xì)胞水平的過濾
直接通過列數(shù)選取想要的細(xì)胞
# 選取前40個(gè)細(xì)胞
example_sce <- example_sce[,1:40]
使用filter
函數(shù)根據(jù)指定條件選取想要的細(xì)胞
filter(example_sce, Treatment == "treat1")
## class: SingleCellExperiment
## dim: 2000 27
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(37): is_feature_control is_feature_control_ERCC ...
## log10_total_counts_damaged pct_counts_damaged
## colnames(27): Cell_001 Cell_002 ... Cell_037 Cell_039
## colData names(51): Cell Mutation_Status ...
## pct_counts_in_top_200_features_mito
## pct_counts_in_top_500_features_mito
## reducedDimNames(0):
## spikeNames(0):
根據(jù)QC metrics設(shè)定閾值篩選高質(zhì)量的細(xì)胞羔飞,這里我們選取那些總counts數(shù)大于100,000,表達(dá)的基因數(shù)大于500的細(xì)胞檐春。
# 選取總counts數(shù)大于100,000的
keep.total <- example_sce$total_counts > 1e5
# 選取表達(dá)的基因數(shù)大于500的
keep.n <- example_sce$total_features_by_counts > 500
# 根據(jù)設(shè)定的條件進(jìn)行過濾
filtered <- example_sce[,keep.total & keep.n]
dim(filtered)
## [1] 2000 37
我們還可以通過isOutlier
函數(shù)計(jì)算篩選的閾值逻淌,它將閾值定義為距離中位數(shù)一定數(shù)量的“中位數(shù)絕對(duì)偏差(MAD)”。超出此閾值的值被認(rèn)為是異常值疟暖,可以假定它們是一些低質(zhì)量的細(xì)胞卡儒,而將其過濾掉。這里我們選取那些log(total counts)值小于3倍MAD值的細(xì)胞作為outliers俐巴。
keep.total <- isOutlier(example_sce$total_counts, nmads=3,
type="lower", log=TRUE)
filtered <- example_sce[,keep.total]
基因水平的過濾
直接通過基因的表達(dá)量過濾掉那些低表達(dá)的基因骨望,這里我們選取那些至少在4個(gè)細(xì)胞中表達(dá)的基因。
keep_feature <- nexprs(example_sce, byrow=TRUE) >= 4
example_sce <- example_sce[keep_feature,]
dim(example_sce)
## [1] 1753 40
當(dāng)然欣舵,我們也可以通過一些其他的條件(如核糖體蛋白基因擎鸠,線粒體基因等)進(jìn)行基因的過濾。
Relationships between experimental factors and expression
我們可以使用plotExplanatoryVariables
函數(shù)查看不同解釋因素的相對(duì)重要性缘圈。當(dāng)對(duì)每個(gè)基因的不同因子進(jìn)行表達(dá)量的線性回歸模型擬合時(shí)糠亩,我們會(huì)對(duì)colData(example_sce)中的每個(gè)因子計(jì)算其對(duì)應(yīng)的R2值虐骑。最好在表達(dá)量的對(duì)數(shù)值上執(zhí)行此操作,以減少平均值對(duì)方差的影響赎线。因此廷没,我們首先對(duì)基因的表達(dá)量進(jìn)行歸一化處理。
# 先對(duì)基因的表達(dá)進(jìn)行歸一化處理
example_sce <- normalize(example_sce)
plotExplanatoryVariables(example_sce)
上圖中每條線對(duì)應(yīng)一個(gè)因子垂寥,代表所有基因中R2值的分布颠黎。當(dāng)然,我們也可以通過variables參數(shù)選擇特定的因子進(jìn)行計(jì)算可視化滞项。
plotExplanatoryVariables(example_sce,
variables = c("total_features_by_counts", "total_counts",
"Mutation_Status", "Treatment", "Cell_Cycle"))
在這個(gè)小數(shù)據(jù)集中狭归,total_counts和total_features_by_counts解釋了基因表達(dá)中很大一部分的方差,它們?cè)谡鎸?shí)數(shù)據(jù)集中能解釋的方差比例應(yīng)該小得多(例如1-5%)文判。
Removing technical biases 去除技術(shù)偏差
Scaling normalization 數(shù)據(jù)歸一化處理
縮放歸一化(Scaling normalization)可以消除細(xì)胞特異性偏差过椎,其使特定細(xì)胞中所有基因的表達(dá)增加或減少,例如測序的覆蓋率或捕獲效率戏仓。
進(jìn)行縮放歸一化的最簡便方法是根據(jù)所有細(xì)胞的縮放文庫大小定義size factors疚宇,使得平均size factor等于1,確保歸一化后的值與原始count值的范圍相同赏殃。
# 使用librarySizeFactors函數(shù)計(jì)算細(xì)胞文庫size factors
sizeFactors(example_sce) <- librarySizeFactors(example_sce)
summary(sizeFactors(example_sce))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1463 0.6609 0.8112 1.0000 1.2533 2.7356
然后再使用normalize
函數(shù)計(jì)算log轉(zhuǎn)換后的歸一化值敷待,并將其存儲(chǔ)在“l(fā)ogcounts” Assay中
example_sce <- normalize(example_sce)
雖然這種歸一化的方式很簡單,但細(xì)胞文庫大小歸一化并不能解決高通量測序數(shù)據(jù)中經(jīng)常出現(xiàn)的成分偏差仁热,它也不能解釋影響spike-in轉(zhuǎn)錄本產(chǎn)生的差異榜揖。我們強(qiáng)烈建議使用來自scran包的computeSumFactors
和computeSpikeFactors
函數(shù)來進(jìn)行計(jì)算。
Batch correction 校正批次效應(yīng)
批次效應(yīng)的校正可以解決不同批次中細(xì)胞之間表達(dá)的系統(tǒng)差異抗蠢,與比例偏差不同举哟,這些偏差通常在給定批次的所有細(xì)胞中都是恒定的,但對(duì)于每個(gè)基因而言都是不同的迅矛。
我們可以使用limma軟件包中的removeBatchEffect
函數(shù)來消除批次效應(yīng)妨猩。
library(limma)
batch <- rep(1:2, each=20)
# 使用removeBatchEffect函數(shù)去除批次效應(yīng)
corrected <- removeBatchEffect(logcounts(example_sce), block=batch)
assay(example_sce, "corrected_logcounts") <- corrected
參考來源:http://www.bioconductor.org/packages/release/bioc/vignettes/scater/inst/doc/overview.html