10X單細(xì)胞(10X空間轉(zhuǎn)錄組)細(xì)胞組成變化分析之scCODA

hello蹂窖,又是周五了,比較忙恩敌,沒(méi)有更瞬测,但是29周歲而立之年的焦慮還一直在,不知道怎么做才能緩解,好了月趟,這一篇我們要分享一個(gè)新的方法灯蝴,a Bayesian model for compositional single-cell data analysis,分享的文章在scCODA is a Bayesian model for compositional single-cell data analysis孝宗,2021年12月發(fā)表于NC穷躁,方法還是很不錯(cuò)的

研究背景

細(xì)胞類(lèi)型的組成變化是生物過(guò)程的主要驅(qū)動(dòng)力。 由于數(shù)據(jù)的組成性和樣本量較小因妇,很難通過(guò)單細(xì)胞實(shí)驗(yàn)檢測(cè)它們问潭。

單細(xì)胞 RNA 測(cè)序 (scRNA-seq) 的最新進(jìn)展允許在廣泛的組織中對(duì)單個(gè)細(xì)胞進(jìn)行大規(guī)模定量轉(zhuǎn)錄分析,從而能夠監(jiān)測(cè)條件或發(fā)育階段之間的轉(zhuǎn)錄變化以及數(shù)據(jù)驅(qū)動(dòng)的識(shí)別不同的細(xì)胞類(lèi)型婚被。

盡管是疾病狡忙、發(fā)育、衰老和免疫等生物過(guò)程的重要驅(qū)動(dòng)因素址芯,但使用 scRNA-seq 檢測(cè)細(xì)胞類(lèi)型組成的變化并非易事灾茁。統(tǒng)計(jì)測(cè)試需要考慮技術(shù)和方法限制的多種來(lái)源,包括實(shí)驗(yàn)重復(fù)次數(shù)少谷炸。在大多數(shù)單細(xì)胞技術(shù)中北专,每個(gè)樣本的細(xì)胞總數(shù)受到限制,這意味著細(xì)胞類(lèi)型計(jì)數(shù)本質(zhì)上是成比例的旬陡。反過(guò)來(lái)逗余,這會(huì)導(dǎo)致細(xì)胞類(lèi)型相關(guān)性估計(jì)出現(xiàn)負(fù)偏差。例如季惩,如果只有一種特定的細(xì)胞類(lèi)型在擾動(dòng)后被耗盡录粱,其他細(xì)胞的相對(duì)頻率就會(huì)上升。如果從表面上看画拾,這將導(dǎo)致不同細(xì)胞類(lèi)型的膨脹啥繁。因此,獨(dú)立測(cè)試每種細(xì)胞類(lèi)型的組成變化的標(biāo)準(zhǔn)單變量統(tǒng)計(jì)模型可能錯(cuò)誤地將某些群體變化視為真實(shí)效應(yīng)青抛,即使它們僅由細(xì)胞類(lèi)型比例的固有負(fù)相關(guān)性引起旗闽。然而,目前應(yīng)用于組成細(xì)胞類(lèi)型分析的常見(jiàn)統(tǒng)計(jì)方法忽略了這種影響蜜另。

為了解釋細(xì)胞類(lèi)型組成中存在的固有偏差适室,從微生物組數(shù)據(jù)的組成分析方法中汲取靈感,并提出了一種用于細(xì)胞類(lèi)型組成差異豐度分析的貝葉斯方法举瑰,以進(jìn)一步解決低復(fù)制問(wèn)題捣辆。單細(xì)胞成分?jǐn)?shù)據(jù)分析 (scCODA) 框架使用分層 Dirichlet-Multinomial 分布對(duì)細(xì)胞類(lèi)型計(jì)數(shù)進(jìn)行建模,該分布通過(guò)對(duì)所有測(cè)量的細(xì)胞類(lèi)型比例而不是通過(guò)聯(lián)合建模來(lái)解釋細(xì)胞類(lèi)型比例的不確定性和負(fù)相關(guān)偏差個(gè)別的此迅。該模型使用帶有對(duì)數(shù)鏈接函數(shù)的 Logit 正態(tài)尖峰和平板先驗(yàn)汽畴,以簡(jiǎn)約的方式估計(jì)二元(或連續(xù))協(xié)變量對(duì)細(xì)胞類(lèi)型比例的影響旧巾。由于成分分析始終需要能夠識(shí)別成分變化的參考,因此 scCODA 可以自動(dòng)選擇適當(dāng)?shù)募?xì)胞類(lèi)型作為參考或使用預(yù)先指定的參考細(xì)胞類(lèi)型忍些。這意味著必須根據(jù)所選參考來(lái)解釋 scCODA 檢測(cè)到的可信變化鲁猩。最重要的是,該框架提供了對(duì)其他完善的組合測(cè)試統(tǒng)計(jì)數(shù)據(jù)的訪問(wèn)罢坝,并完全集成到 Scanpy pipeline中廓握。

圖片.png

代碼示例

單細(xì)胞數(shù)據(jù)分析細(xì)胞比例的缺點(diǎn)
  • scRNA-seq population data is compositional. This must be considered to avoid an inflation of false-positive results.
  • Most datasets consist only of very few samples, making frequentist tests inaccurate.
  • A condition usually only effects a fraction of cell types. Therefore, sparse effects are preferable.
    The scCODA model overcomes all these limitations in a fully Bayesian model, that outperforms other compositional and non-compositional methods.(軟件是python版本)

scCODA - Compositional analysis of single-cell data

# Setup
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import pickle as pkl
import matplotlib.pyplot as plt

from sccoda.util import comp_ana as mod
from sccoda.util import cell_composition_data as dat
from sccoda.util import data_visualization as viz

import sccoda.datasets as scd
load data
# Load data

cell_counts = scd.haber()

print(cell_counts)

Mouse Endocrine Enterocyte Enterocyte.Progenitor Goblet Stem
0 Control_1 36 59 136 36 239
1 Control_2 5 46 23 20 50
2 Control_3 45 98 188 124 250
3 Control_4 26 221 198 36 131
4 H.poly.Day10_1 42 71 203 147 271
5 H.poly.Day10_2 40 57 383 170 321
6 H.poly.Day3_1 52 75 347 66 323
7 H.poly.Day3_2 65 126 115 33 65
8 Salm_1 37 332 113 59 90
9 Salm_2 32 373 116 67 117

預(yù)處理
# Convert data to anndata object
data_all = dat.from_pandas(cell_counts, covariate_columns=["Mouse"])

# Extract condition from mouse name and add it as an extra column to the covariates
data_all.obs["Condition"] = data_all.obs["Mouse"].str.replace(r"_[0-9]", "")
For our first example, we want to look at how the Salmonella infection influences the cell composition. Therefore, we subset our data.
# Select control and salmonella data
data_salm = data_all[data_all.obs["Condition"].isin(["Control", "Salm"])]
viz.boxplots(data_salm, feature_name="Condition")
plt.show()
圖片.png

Model setup and inference

We can now create the model and run inference on it. Creating a sccoda.util.comp_ana.CompositionalAnalysis class object sets up the compositional model and prepares everxthing for parameter inference. It needs these informations:

  • The data object from above.

  • The formula parameter. It specifies how the covariates are used in the model. It can process R-style formulas via the patsy package, e.g. formula="Cov1 + Cov2 + Cov3". Here, we simply use the “Condition” covariate of our dataset

  • The reference_cell_type parameter is used to specify a cell type that is believed to be unchanged by the covariates in formula. This is necessary, because compositional analysis must always be performed relative to a reference (See Büttner, Ostner et al., 2021 for a more thorough explanation). If no knowledge about such a cell type exists prior to the analysis, taking a cell type that has a nearly constant relative abundance over all samples is often a good choice. It is also possible to let scCODA find a suited reference cell type by using reference_cell_type="automatic". Here, we take Goblet cells as the reference.

model_salm = mod.CompositionalAnalysis(data_salm, formula="Condition", reference_cell_type="Goblet")
sim_results = model_salm.sample_hmc()
Result interpretation
sim_results.summary()
Compositional Analysis summary:

Data: 6 samples, 8 cell types
Reference index: 3
Formula: Condition

Intercepts:
                       Final Parameter  Expected Sample
Cell Type
Endocrine                        1.102        34.068199
Enterocyte                       2.328       116.089840
Enterocyte.Progenitor            2.523       141.085258
Goblet                           1.753        65.324318
Stem                             2.705       169.247878
TA                               2.113        93.631267
TA.Early                         2.861       197.821355
Tuft                             0.449        17.731884


Effects:
                                         Final Parameter  Expected Sample  \
Covariate         Cell Type
Condition[T.Salm] Endocrine                       0.0000        24.315528
                  Enterocyte                      1.3571       321.891569
                  Enterocyte.Progenitor           0.0000       100.696915
                  Goblet                          0.0000        46.623988
                  Stem                            0.0000       120.797449
                  TA                              0.0000        66.827533
                  TA.Early                        0.0000       141.191224
                  Tuft                            0.0000        12.655794

                                         log2-fold change
Covariate         Cell Type
Condition[T.Salm] Endocrine                     -0.486548
                  Enterocyte                     1.471333
                  Enterocyte.Progenitor         -0.486548
                  Goblet                        -0.486548
                  Stem                          -0.486548
                  TA                            -0.486548
                  TA.Early                      -0.486548
                  Tuft                          -0.486548
Intercepts

The first column of the intercept summary shows the parameters determined by the MCMC inference.

The “Expected sample” column gives some context to the numerical values. If we had a new sample (with no active covariates) with a total number of cells equal to the mean sampling depth of the dataset, then this distribution over the cell types would be most likely.

Effects

For the effect summary, the first column again shows the inferred parameters for all combinations of covariates and cell types. Most important is the distinctions between zero and non-zero entries A value of zero means that no statistically credible effect was detected. For a value other than zero, a credible change was detected. A positive sign indicates an increase, a negative sign a decrease in abundance.

Since the numerical values of the “Final parameter” columns are not straightforward to interpret, the “Expected sample” and “l(fā)og2-fold change” columns give us an idea on the magnitude of the change. The expected sample is calculated for each covariate separately (covariate value = 1, all other covariates = 0), with the same method as for the intercepts. The log-fold change is then calculated between this expected sample and the expected sample with no active covariates from the intercept section. Since the data is compositional, cell types for which no credible change was detected, are will change in abundance as well, as soon as a credible effect is detected on another cell type due to the sum-to-one constraint. If there are no credible effects for a covariate, its expected sample will be identical to the intercept sample, therefore the log2-fold change is 0.

Interpretation

In the salmonella case, we see only a credible increase of Enterocytes, while all other cell types are unaffected by the disease. The log-fold change of Enterocytes between control and infected samples with the same total cell count lies at about 1.54.

Adjusting the False discovery rate

scCODA selects credible effects based on their inclusion probability. The cutoff between credible and non-credible effects depends on the desired false discovery rate (FDR). A smaller FDR value will produce more conservative results, but might miss some effects, while a larger FDR value selects more effects at the cost of a larger number of false discoveries.

The desired FDR level can be easily set after inference via sim_results.set_fdr(). Per default, the value is 0.05, but we recommend to increase it if no effects are found at a more conservative level.

In our example, setting a desired FDR of 0.4 reveals effects on Endocrine and Enterocyte cells.

sim_results.set_fdr(est_fdr=0.4)
sim_results.summary()
Compositional Analysis summary (extended):

Data: 6 samples, 8 cell types
Reference index: 3
Formula: Condition
Spike-and-slab threshold: 0.434

MCMC Sampling: Sampled 20000 chain states (5000 burnin samples) in 79.348 sec. Acceptance rate: 51.9%

Intercepts:
                       Final Parameter  HDI 3%  HDI 97%     SD  \
Cell Type
Endocrine                        1.102   0.363    1.740  0.369
Enterocyte                       2.328   1.694    2.871  0.314
Enterocyte.Progenitor            2.523   1.904    3.088  0.320
Goblet                           1.753   1.130    2.346  0.330
Stem                             2.705   2.109    3.285  0.318
TA                               2.113   1.459    2.689  0.332
TA.Early                         2.861   2.225    3.378  0.307
Tuft                             0.449  -0.248    1.207  0.394

                       Expected Sample
Cell Type
Endocrine                    34.068199
Enterocyte                  116.089840
Enterocyte.Progenitor       141.085258
Goblet                       65.324318
Stem                        169.247878
TA                           93.631267
TA.Early                    197.821355
Tuft                         17.731884


Effects:
                                         Final Parameter  HDI 3%  HDI 97%  \
Covariate         Cell Type
Condition[T.Salm] Endocrine                     0.327533  -0.506    1.087
                  Enterocyte                    1.357100   0.886    1.872
                  Enterocyte.Progenitor         0.000000  -0.395    0.612
                  Goblet                        0.000000   0.000    0.000
                  Stem                         -0.240268  -0.827    0.168
                  TA                            0.000000  -0.873    0.252
                  TA.Early                      0.000000  -0.464    0.486
                  Tuft                          0.000000  -1.003    0.961

                                            SD  Inclusion probability  \
Covariate         Cell Type
Condition[T.Salm] Endocrine              0.338               0.457133
                  Enterocyte             0.276               0.998400
                  Enterocyte.Progenitor  0.163               0.338200
                  Goblet                 0.000               0.000000
                  Stem                   0.219               0.434800
                  TA                     0.220               0.364000
                  TA.Early               0.128               0.284733
                  Tuft                   0.319               0.392533

                                         Expected Sample  log2-fold change
Covariate         Cell Type
Condition[T.Salm] Endocrine                    34.413767          0.014560
                  Enterocyte                  328.331183          1.499910
                  Enterocyte.Progenitor       102.711411         -0.457971
                  Goblet                       47.556726         -0.457971
                  Stem                         96.897648         -0.804604
                  TA                           68.164454         -0.457971
                  TA.Early                    144.015830         -0.457971
                  Tuft                         12.908980         -0.457971

數(shù)據(jù)可視化

# Stacked barplot for each sample
viz.stacked_barplot(data_mouse, feature_name="samples")
plt.show()

# Stacked barplot for the levels of "Condition"
viz.stacked_barplot(data_mouse, feature_name="Condition")
plt.show()
圖片.png
# Grouped boxplots. No facets, relative abundance, no dots.
viz.boxplots(
    data_mouse,
    feature_name="Condition",
    plot_facets=False,
    y_scale="relative",
    add_dots=False,
)
plt.show()

# Grouped boxplots. Facets, log scale, added dots and custom color palette.
viz.boxplots(
    data_mouse,
    feature_name="Condition",
    plot_facets=True,
    y_scale="log",
    add_dots=True,
    cmap="Reds",
)
plt.show()
圖片.png

Finding a reference cell type

The scCODA model requires a cell type to be set as the reference category. However, choosing this cell type is often difficult. A good first choice is a referenece cell type that closely preserves the changes in relative abundance during the compositional analysis.

For this, it is important that the reference cell type is not rare, to avoid large relative changes being caused by small absolute changes. Also, the relative abundance of the reference should vary as little as possible across all samples.

The visualization viz.rel_abundance_dispersion_plot shows the presence (share of non-zero samples) over all samples for each cell type versus its dispersion in relative abundance. Cell types that have a higher presence than a certain threshold (default 0.9) are suitable candidates for the reference and thus colored.

viz.rel_abundance_dispersion_plot(
    data=data_mouse,
    abundant_threshold=0.9
)
plt.show()
圖片.png

Diagnostics and plotting

Similarly to the summary dataframes being compatible with arviz, the result class itself is an extension of arviz’s Inference Data class. This means that we can use all its MCMC diagnostic and plotting functionality. As an example, looking at the MCMC trace plots and kernel density estimates, we see that they are indicative of a well sampled MCMC chain:

Note: Due to the spike-and-slab priors, the beta parameters have many values at 0, which looks like a convergence issue, but is actually not.

Caution: Trying to plot a kernel density estimate for an effect on the reference cell type results in an error, since it is constant at 0 for the entire chain. To avoid this, add coords={"cell_type": salm_results.posterior.coords["cell_type_nb"]} as an argument to az.plot_trace, which causes the plots for the reference cell type to be skipped.

az.plot_trace(
    salm_results,
    divergences=False,
    var_names=["alpha", "beta"],
    coords={"cell_type": salm_results.posterior.coords["cell_type_nb"]},
)
plt.show()
圖片.png

示例代碼的網(wǎng)址在scCODA

圖片.png

最后,看一看算法

圖片.png

圖片.png

生活很好嘁酿,有你更好

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
禁止轉(zhuǎn)載隙券,如需轉(zhuǎn)載請(qǐng)通過(guò)簡(jiǎn)信或評(píng)論聯(lián)系作者。
  • 序言:七十年代末痹仙,一起剝皮案震驚了整個(gè)濱河市是尔,隨后出現(xiàn)的幾起案子殉了,更是在濱河造成了極大的恐慌开仰,老刑警劉巖,帶你破解...
    沈念sama閱讀 206,013評(píng)論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件薪铜,死亡現(xiàn)場(chǎng)離奇詭異众弓,居然都是意外死亡,警方通過(guò)查閱死者的電腦和手機(jī)隔箍,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,205評(píng)論 2 382
  • 文/潘曉璐 我一進(jìn)店門(mén)谓娃,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái),“玉大人蜒滩,你說(shuō)我怎么就攤上這事滨达。” “怎么了俯艰?”我有些...
    開(kāi)封第一講書(shū)人閱讀 152,370評(píng)論 0 342
  • 文/不壞的土叔 我叫張陵捡遍,是天一觀的道長(zhǎng)。 經(jīng)常有香客問(wèn)我竹握,道長(zhǎng)画株,這世上最難降的妖魔是什么? 我笑而不...
    開(kāi)封第一講書(shū)人閱讀 55,168評(píng)論 1 278
  • 正文 為了忘掉前任啦辐,我火速辦了婚禮谓传,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘芹关。我一直安慰自己续挟,他們只是感情好,可當(dāng)我...
    茶點(diǎn)故事閱讀 64,153評(píng)論 5 371
  • 文/花漫 我一把揭開(kāi)白布侥衬。 她就那樣靜靜地躺著庸推,像睡著了一般常侦。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上贬媒,一...
    開(kāi)封第一講書(shū)人閱讀 48,954評(píng)論 1 283
  • 那天聋亡,我揣著相機(jī)與錄音,去河邊找鬼际乘。 笑死坡倔,一個(gè)胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的脖含。 我是一名探鬼主播罪塔,決...
    沈念sama閱讀 38,271評(píng)論 3 399
  • 文/蒼蘭香墨 我猛地睜開(kāi)眼,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼养葵!你這毒婦竟也來(lái)了征堪?” 一聲冷哼從身側(cè)響起,我...
    開(kāi)封第一講書(shū)人閱讀 36,916評(píng)論 0 259
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤关拒,失蹤者是張志新(化名)和其女友劉穎佃蚜,沒(méi)想到半個(gè)月后,有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體着绊,經(jīng)...
    沈念sama閱讀 43,382評(píng)論 1 300
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡谐算,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 35,877評(píng)論 2 323
  • 正文 我和宋清朗相戀三年,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了归露。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片洲脂。...
    茶點(diǎn)故事閱讀 37,989評(píng)論 1 333
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡,死狀恐怖剧包,靈堂內(nèi)的尸體忽然破棺而出恐锦,到底是詐尸還是另有隱情,我是刑警寧澤疆液,帶...
    沈念sama閱讀 33,624評(píng)論 4 322
  • 正文 年R本政府宣布一铅,位于F島的核電站,受9級(jí)特大地震影響枚粘,放射性物質(zhì)發(fā)生泄漏馅闽。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,209評(píng)論 3 307
  • 文/蒙蒙 一馍迄、第九天 我趴在偏房一處隱蔽的房頂上張望福也。 院中可真熱鬧,春花似錦攀圈、人聲如沸暴凑。這莊子的主人今日做“春日...
    開(kāi)封第一講書(shū)人閱讀 30,199評(píng)論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)现喳。三九已至凯傲,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間嗦篱,已是汗流浹背冰单。 一陣腳步聲響...
    開(kāi)封第一講書(shū)人閱讀 31,418評(píng)論 1 260
  • 我被黑心中介騙來(lái)泰國(guó)打工, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留灸促,地道東北人诫欠。 一個(gè)月前我還...
    沈念sama閱讀 45,401評(píng)論 2 352
  • 正文 我出身青樓,卻偏偏與公主長(zhǎng)得像浴栽,于是被迫代替她去往敵國(guó)和親荒叼。 傳聞我的和親對(duì)象是個(gè)殘疾皇子,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 42,700評(píng)論 2 345

推薦閱讀更多精彩內(nèi)容