Related Knowledge
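Heterogeneity
- Tumor heterogeneity is one of the hallmarks of malignant tumors. As a tumor grows and its cells go through many rounds of division and proliferation, the daughter cells acquire molecular or genetic changes, so that growth rate, invasiveness, drug sensitivity, prognosis, and other properties come to differ. Put simply, a single tumor can contain cells of many different genotypes or subtypes. As a result, the same type of tumor can show different treatment responses and prognoses in different individuals, and even the tumor cells within one individual can differ in their properties.
- Tumor heterogeneity also means that different tumor cells or subpopulations within the tumor tissue do not carry exactly the same somatic mutations.
Tumor purity
- The cancer cells in a tumor sample are always mixed with some unknown proportion of normal cells. The fraction of cells in a tumor sample that are cancer cells is called the tumor purity.
SNV
- An SNV (single-nucleotide variant) is a site in the genome where a single base has changed. Such sites are widely distributed across the genome.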
Abstract
PurBayes is a tool that estimates tumor purity and detects intratumor heterogeneity from next-generation sequencing (NGS) data of paired tumor-normal tissue samples, using finite mixture modeling methods.
Introduction
With advances in high-throughput next-generation sequencing (NGS) technologies, sequencing of tumor-normal tissue pairs is becoming commonplace in cancer studies. Often, the sampled tumor tissue is contaminated with stromal cells, resulting in a mixture of tumor and normal sequence data in the tumor sample. There has been a recent interest in accurate estimation of tumor purity levels in tumor data analysis, including methods specific to NGS data such as PurityEst.
However, a subset of the observed somatic mutations may be subclonal because of intratumor heterogeneity. Unlike clonal mutations, which are observed tumor-wide, subclonal mutations will be observed at cellularities below the tumor purity level and will therefore bias purity estimates made under an assumption of tumor tissue homogeneity. By modeling this heterogeneity, it may also be possible to make inferences about tumor evolution and founder events. To date there are no methods that aim to both quantify tumor purity and detect intratumor heterogeneity using NGS data.
In this article, we present a Bayesian mixture modeling approach, PurBayes, toward estimating tumor purity and subclonality using NGS data, resulting in posterior distributions of tumor cellularities from which credible intervals (CI) can be derived. To illustrate its implementation, we conduct a simulation study under a variety of conditions and discuss the performance of PurBayes on synthetic data.
Methods
- Consider a set of S observed heterozygous loci arising from somatically acquired single-nucleotide variants (SNVs) in a given tumor sequencing sample. Each SNV i can be represented by its normal and mutant allele read counts, Xi and Yi. The total number of sample reads, Ni = Xi + Yi, can in turn be decomposed into tumor and normal tissue read counts Nti and Nwi, such that Ni = Nwi + Nti. Because it cannot be directly determined which cell type each individual read was derived from, Nti and Nwi are latent variables. If we assume that Nti is binomially distributed, Nti ~ Bin(Ni, λ), where λ is the tumor sample purity, and that Yi | Nti ~ Bin(Nti, 0.50), then Yi follows a binomial-binomial hierarchical mixture model with marginal distribution Yi ~ Bin(Ni, λ/2): each read comes from a tumor cell with probability λ and, given that, carries the mutant allele with probability 0.50.
- Consider a tumor that exhibits intratumor heterogeneity. If we assume that subclonal mutations cluster into an a priori finite number J-1 of subclonal populations, Y can be modeled under a Bayesian finite mixture model. Let Kj denote the probability that a mutation belongs to variant population j with cellularity λj, for j = 1, ..., J, such that ΣKj = 1 (summing over j), λ1 < ... < λJ, and λJ = λ (the largest cellularity, that of the clonal population, is the tumor purity), with uniform priors on the λj. To obtain a data-driven value for J, PurBayes generates model fits iteratively, initially assuming tumor homogeneity and then increasing the subclonal population count by one until an optimal model fit is achieved under a penalized expected deviance (PED) criterion.
- Mapping bias can cause non-reference alleles at heterozygous loci to be mapped at rates below 0.50, which would affect tumor purity estimation. PurBayes can accommodate this bias by estimating it from additional reference and alternate allele counts in heterozygous normal tissue variant calls.
- PurBayes is implemented in the statistical programming language R and uses the MCMC software JAGS. The only inputs required for PurBayes are the tumor tissue read counts (N and Y) for a set of high-confidence SNVs, which can easily be derived from most variant calling software output file formats on NGS data.
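The binomial-binomial model described above can be checked with a small simulation. The following is a minimal Python sketch, not the PurBayes implementation (which is written in R and fits the model by MCMC in JAGS). The function names, the pooled moment estimator, and the simple ratio used for the mapping-bias rate are all assumptions made for illustration. It draws read counts from the two-stage model and recovers the purity from E[Yi] = Ni * λ * r, where r is the mutant-allele mapping rate (0.50 without bias).

```python
import random


def simulate_snv(depth, purity, het_rate, rng):
    """Draw (N, Y) for one clonal heterozygous somatic SNV.

    N_t ~ Bin(N, purity) reads come from tumor cells, and
    Y | N_t ~ Bin(N_t, het_rate) of those reads carry the mutant allele.
    """
    n_tumor = sum(rng.random() < purity for _ in range(depth))
    y = sum(rng.random() < het_rate for _ in range(n_tumor))
    return depth, y


def estimate_het_rate(germline_counts):
    """Pooled alternate-allele fraction over heterozygous germline calls.

    Illustrative stand-in for the mapping-bias estimate; each entry is
    (total reads, alternate-allele reads).
    """
    total = sum(n for n, _ in germline_counts)
    alt = sum(a for _, a in germline_counts)
    return alt / total


def estimate_purity(counts, het_rate=0.5):
    """Pooled moment estimate of purity from E[Y] = N * purity * het_rate."""
    total_n = sum(n for n, _ in counts)
    total_y = sum(y for _, y in counts)
    return min(1.0, total_y / (total_n * het_rate))


rng = random.Random(0)
true_purity, true_rate = 0.6, 0.48  # hypothetical values for the demo

# 100 germline heterozygous calls at 100x, used only to estimate the bias.
germline = [(100, sum(rng.random() < true_rate for _ in range(100)))
            for _ in range(100)]
# 100 somatic SNVs at 100x from a homogeneous tumor.
somatic = [simulate_snv(100, true_purity, true_rate, rng) for _ in range(100)]

het_rate_hat = estimate_het_rate(germline)
purity_hat = estimate_purity(somatic, het_rate_hat)
print(f"estimated mapping rate {het_rate_hat:.3f}, purity {purity_hat:.3f}")
```

A point estimate like this carries no measure of uncertainty and assumes a homogeneous tumor. The Bayesian mixture formulation is meant to address both limitations.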
Simulation
To illustrate the performance of PurBayes under a variety of conditions, we conducted simulation studies based on real sequencing data from the 1000 Genomes Project (details in Supplementary Materials). We first simulated read count data for homogenous tumors ranging in purity from 20–80%, with S = 100 and average sequencing depth at 50x and 100x. We ran 100 replications of each unique set of conditions and examined the PurBayes posterior median estimates. We ran similar simulations for heterogeneous tumor data with J = 2 at 100x for various values of Kj and λj to determine how well PurBayes can detect intratumor heterogeneity and estimate tumor purity. For each application, we also simulated read count data from 100 additional germ line variant calls to account for mapping bias. For purposes of comparison, we also applied the PurityEst algorithm to each simulation replicate.
For each application of PurBayes, the first 50,000 iterations of the optimal MCMC model fit were discarded as burn-in before 10,000 iterations of posterior sampling. Mean per-sample execution time was 2 min on a workstation equipped with an Intel Core i5 3.10 GHz processor and 4 GB of random access memory.
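To make the heterogeneous case concrete, here is a hedged Python sketch of a J = 2 simulation at 100x. It does not reproduce the study's procedure: the data are drawn from idealized binomials rather than from 1000 Genomes reads, J is fixed at 2 instead of being chosen by the PED criterion, and the mixture is fitted by expectation-maximization (EM) rather than by MCMC. All function names and parameter values are illustrative assumptions.

```python
import math
import random


def simulate_tumor(cellularities, weights, n_snvs, depth, rng):
    """Each SNV joins population j with probability weights[j]; its mutant
    read count is then drawn from Bin(depth, cellularities[j] / 2)."""
    data = []
    for _ in range(n_snvs):
        j = rng.choices(range(len(weights)), weights=weights)[0]
        y = sum(rng.random() < cellularities[j] / 2 for _ in range(depth))
        data.append((depth, y))
    return data


def fit_mixture_em(data, n_components, n_iter=200):
    """EM for a binomial mixture with success probabilities lambda_k / 2.

    Returns (weights, cellularities), sorted by increasing cellularity.
    """
    clamp = lambda v: min(0.99, max(0.01, v))
    # Initialise the cellularities at quantiles of the per-SNV values 2Y/N.
    fracs = sorted(2 * y / n for n, y in data)
    lam = [clamp(fracs[int((k + 0.5) * len(fracs) / n_components)])
           for k in range(n_components)]
    w = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        # E-step: responsibilities (the binomial coefficient cancels out).
        resp = []
        for n, y in data:
            logp = [math.log(w[k]) + y * math.log(lam[k] / 2)
                    + (n - y) * math.log(1 - lam[k] / 2)
                    for k in range(n_components)]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: update mixing weights and cellularities.
        for k in range(n_components):
            rk = sum(r[k] for r in resp)
            yk = sum(r[k] * y for r, (_, y) in zip(resp, data))
            nk = sum(r[k] * n for r, (n, _) in zip(resp, data))
            w[k] = max(rk / len(data), 1e-12)
            if nk > 0:
                lam[k] = clamp(2 * yk / nk)
    order = sorted(range(n_components), key=lambda k: lam[k])
    return [w[k] for k in order], [lam[k] for k in order]


rng = random.Random(1)
# Hypothetical tumor: 40% of SNVs subclonal at cellularity 0.3,
# 60% clonal at cellularity 0.7 (the tumor purity).
data = simulate_tumor([0.3, 0.7], [0.4, 0.6], n_snvs=100, depth=100, rng=rng)

# Estimate that wrongly assumes a homogeneous tumor.
naive = 2 * sum(y for _, y in data) / sum(n for n, _ in data)
weights_hat, lam_hat = fit_mixture_em(data, n_components=2)
print(f"homogeneous-model estimate {naive:.3f}")
print(f"mixture cellularities {[round(v, 3) for v in lam_hat]}")
```

In this setup the homogeneous-model estimate is pulled below the true purity by the subclonal SNVs, while the largest fitted cellularity plays the role of the purity estimate.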
Results and Discussion
- For the homogenous tumor simulations, PurBayes correctly identified tumor homogeneity in all replications. Distributions of the posterior median estimates of tumor purity for each value of λ and method are displayed in Figure 1. Estimates from PurBayes and PurityEst were nearly identical, with a Pearson correlation of 0.9997. Both methods were accurate, tending toward overestimation at lower values of λ. When we applied PurBayes to heterogeneous data, the ability to detect heterogeneity was highly dependent on the disparity between cellularities. The proportion of clonal variants also affected detection, with larger values of K1 leading to higher mean absolute error (MAE) of the posterior median purity estimates. Although PurityEst performed comparably under certain conditions, the ability for PurBayes to detect heterogeneity generally resulted in greater estimate accuracy.
- Our simulation results highlight the potential bias of tumor purity estimates in the presence of unaccounted-for intratumor heterogeneity. By simultaneously estimating tumor purity and subclonality, PurBayes may also provide additional advantages, such as facilitating inference about tumor composition and evolution and isolating potential founder events. Because PurBayes is a Bayesian approach, measures of uncertainty are derived directly from the posterior distributions of the tumor cellularities in the form of CIs.
- One possible issue in applying PurBayes is that it may estimate J to be larger than the true value because of outlier observations, which leads to a positively biased tumor purity estimate. This can be especially problematic in the presence of copy number variation (CNV) and structural rearrangements. Regions of CNV have a multiplicative impact on the number of mapped reads, and SNVs within such regions do not truly reflect heterozygosity at a proportion of 0.50, so such SNVs would strongly influence the estimation of J. We therefore anticipate that PurityEst, because of its robust estimation procedures, will perform better when CNVs are present and unaccounted for in purity estimation. It is thus highly recommended that regions identified as CNVs by parallel analyses be filtered out before estimation.
- We foresee a variety of extensions to the concepts in PurBayes. For example, the mixture model could be alternatively formulated to characterize tumor cellularity as a continuous distribution using semi-parametric approaches. Integration of CNV and ploidy information will also make PurBayes a more effective estimator.
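The CNV concern raised above can be illustrated with a short calculation. The Python sketch below rests on a standard relationship that is not stated in these notes and is used here as an assumption: with diploid normal cells, a tumor copy number C at the locus, and m mutated copies, the expected mutant allele fraction is λm / (λC + 2(1 - λ)). A model that assumes the fraction is λ/2 would then read off an apparent cellularity of twice that value. The function names are hypothetical.

```python
def expected_mutant_fraction(purity, tumor_copies=2, mutant_copies=1):
    """Expected mutant allele fraction at a somatic SNV, assuming diploid
    normal cells and a tumor with the given total and mutated copy numbers."""
    mutant = purity * mutant_copies
    total = purity * tumor_copies + 2 * (1 - purity)
    return mutant / total


def apparent_cellularity(purity, tumor_copies=2, mutant_copies=1):
    """Cellularity implied when the fraction is wrongly read as lambda / 2."""
    return 2 * expected_mutant_fraction(purity, tumor_copies, mutant_copies)


purity = 0.6  # hypothetical true purity
for copies, mutated in [(2, 1), (3, 1), (3, 2), (1, 1)]:
    value = apparent_cellularity(purity, copies, mutated)
    print(f"copies={copies} mutated={mutated} apparent cellularity={value:.3f}")
```

In the copy-neutral heterozygous case the apparent cellularity equals the true purity. A gain of the unmutated allele makes a clonal SNV look subclonal, and a gain of the mutated allele pushes it above the true purity. This is consistent with the recommendation to filter CNV regions before estimation.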