Published in Nature Communications (NC) in 2018, the paper is: Multi-omics profiling of younger Asian breast cancers reveals distinctive molecular signatures. Its distinguishing feature is that the authors sequenced the transcriptomes and exomes of 187 primary tumors from a Korean BC cohort (SMC). The data are deposited under EGAS0000100262 and GSE113184.
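Background
Understand the current state of breast cancer research in the TCGA project.
One fact to know: in Western populations, 15-30% of breast cancer patients are premenopausal, whereas in Asian populations the proportion is close to 50%. This ethnic difference needs explaining, and it affects the choice of treatment strategy, because breast cancers arising in younger patients (YBC) are usually more aggressive. In particular, ER-positive YBC tends to resist endocrine therapy. Yet why YBC patients have worse survival has not been studied enough.
Thanks to advances in multi-omics technology and the TCGA project, the molecular subtypes and genetic characteristics of breast cancer in Western populations are fairly well understood. Some studies have also compared younger breast cancer patients with older ones (OBC):
- ESR1 transcript and protein expression are both higher in YBC than in OBC, and YBC tends to be hypermethylated.
- GATA3 mutations are more frequent in YBC, whereas CDH1 mutations predominate in OBC.
- ER-positive YBC shows activation of the integrin/laminin, EGFR signaling, and TGF-β pathways; the proliferation, stem cell, and endocrine resistance pathways are also over-activated in YBC.
Studies of Asian populations, however, have been scarce or have involved small sample sizes:
- One study enrolled 113 Middle Eastern women and identified 63 genes specific to tumors in young women.
- Another study compared gene expression and miRNA expression between Chinese and Italian breast cancers.
Multi-omics data are lacking, and so are data on young Asian breast cancer patients, which is why the authors carried out this study.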
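Age and subtype differences between the two cohorts
TCGA includes 1116 breast cancer patients from the United States, while the authors' SMC cohort has 186 patients, all with WES plus RNA-seq data.
TCGA is first split into 3 age groups:
- YBC (age ≤ 40, n = 181)
- IBC (40 < age ≤ 60, n = 562)
- OBC (age > 60, n = 535)
The Korean cohort is then split into 2 groups:
- YBC (age ≤ 40, n = 125)
- IBC (age > 40, n = 62)
The age distributions of the two cohorts differ markedly (figure not reproduced here).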
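The age cutoffs listed above can be written as a small helper. This is only a sketch in Python: the function name `age_group` and the toy ages are illustrative assumptions, and only the cutoffs themselves come from the notes.

```python
from collections import Counter

def age_group(age, cohort):
    """Assign the age group described in the notes.

    TCGA: YBC (age <= 40), IBC (40 < age <= 60), OBC (age > 60).
    SMC:  YBC (age <= 40), IBC (age > 40).
    """
    if cohort not in ("TCGA", "SMC"):
        raise ValueError(f"unknown cohort: {cohort}")
    if age <= 40:
        return "YBC"
    if cohort == "SMC" or age <= 60:
        return "IBC"
    return "OBC"

# Toy ages, purely illustrative (not taken from the paper).
toy_ages = [29, 38, 41, 55, 60, 61, 73]
print(Counter(age_group(a, "TCGA") for a in toy_ages))
print(Counter(age_group(a, "SMC") for a in toy_ages))
```

Tabulating the groups per cohort this way gives the counts that a cohort-comparison figure would be built from.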
Three approaches to molecular subtyping
They are:
- ER and HER2 immunohistochemistry (IHC) analyses
- gene expression classifier (PAM50)
- a naive molecular classifier (NMC) based on ESR1, PGR, ERBB2 gene expression and ERBB2 copy number data.
The proportions of these subtypes in the SMC and TCGA datasets are also shown in the figure mentioned above.
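To make the idea of a marker-based classifier concrete, here is a rule-based sketch in Python that uses the same four inputs as the NMC (ESR1, PGR and ERBB2 expression plus ERBB2 copy number). The notes do not give the actual rules, so every cutoff, the OR logic, and the class labels below are hypothetical and should not be read as the authors' NMC.

```python
def naive_molecular_class(esr1, pgr, erbb2_expr, erbb2_cn,
                          hr_cutoff=1.0, her2_expr_cutoff=1.0, her2_cn_cutoff=6):
    """Toy classifier from ESR1/PGR/ERBB2 expression and ERBB2 copy number.

    All cutoffs are hypothetical placeholders; expression values are assumed
    to be on some standardized scale (for example z-scores).
    """
    # Assumption: hormone-receptor positive if either ESR1 or PGR is high.
    hr_positive = esr1 > hr_cutoff or pgr > hr_cutoff
    # Assumption: HER2 positive if ERBB2 is highly expressed or amplified.
    her2_positive = erbb2_expr > her2_expr_cutoff or erbb2_cn >= her2_cn_cutoff

    if hr_positive and her2_positive:
        return "HR+/HER2+"
    if hr_positive:
        return "HR+/HER2-"
    if her2_positive:
        return "HR-/HER2+"
    return "TN"

print(naive_molecular_class(esr1=2.3, pgr=0.4, erbb2_expr=-0.2, erbb2_cn=2))
```

In practice the cutoffs would have to be calibrated against IHC calls before the labels mean anything.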
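Clinical baseline table (three-line table) for the public TCGA data
Existing R packages can do this: download the clinical information directly from the TCGA website, filter it, and generate the table. The authors' own SMC dataset can be summarized the same way. Such a table is essential for any cohort study.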
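Germline pathogenic mutations
BRCA1/BRCA2 are the best-known breast cancer susceptibility genes. Previous data put the carrier rate at 1-5% among Western Caucasian breast cancer patients and as high as 3-7% among Asian patients. The authors therefore examined germline pathogenic mutations in their Korean cohort, looking mainly for:
- variants that truncate the protein reading frame or are reported as disease-causing in ClinVar
- 13 genes known to increase breast cancer susceptibility with high to moderate penetrance
The carrier rate was 18.8% (35/186). BRCA1/2 mutations were clearly more common in SMC than in TCGA: 10.8% of SMC but only 4.7% of TCGA.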
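The carrier rate quoted above is simple arithmetic, and a difference in carrier frequency between two cohorts is the kind of 2x2 comparison usually tested with Fisher's exact test (the supplementary table descriptions mention this test). Below is a minimal pure-Python sketch; the function name is an assumption, and the 2x2 table in the usage line is a textbook toy table, not the paper's counts.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows could be cohorts and columns carrier / non-carrier. The p-value sums
    the probabilities of all tables that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Carrier rate from the notes: 35 carriers among 186 patients.
print(round(35 / 186 * 100, 1))  # 18.8

# Toy table only; real counts would come from the supplementary tables.
print(fisher_exact_two_sided(3, 1, 1, 3))
```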
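Somatic mutations
The analysis yielded 6885 protein-altering mutations in total, affecting 4949 genes, with an average TMB of 0.6. TCGA (1.4 ± 4.5) is clearly higher than SMC (0.90 ± 0.97), because TMB correlates with age and the SMC cohort is predominantly young.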
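As a reminder of what a TMB figure summarizes, here is a minimal Python sketch. It assumes TMB is expressed as mutations per megabase of callable sequence, which the notes do not state explicitly; the per-sample counts and the 30 Mb callable size are hypothetical.

```python
from statistics import mean, stdev

def tumor_mutation_burden(n_mutations, callable_mb):
    """Mutations per megabase (assumed unit) for one sample."""
    if callable_mb <= 0:
        raise ValueError("callable_mb must be positive")
    return n_mutations / callable_mb

# Hypothetical per-sample mutation counts and callable exome size.
toy_counts = [30, 60, 15]
toy_tmb = [tumor_mutation_burden(n, 30.0) for n in toy_counts]
print(toy_tmb)                       # [1.0, 2.0, 0.5]
print(mean(toy_tmb), stdev(toy_tmb))  # reported in the "mean ± SD" style
```

Because a few hypermutated samples can dominate the mean, the SD can exceed the mean, as in the TCGA figure quoted above.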
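The cohort is large enough to run MutSigCV for significantly mutated genes, which yields six: TP53, PIK3CA, GATA3, CBFB, PTEN, and CDH1. The paper then goes through each gene and describes how it differs between the TCGA and SMC cohorts.
The mutation landscapes of the two cohorts can also be compared.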
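Mutational signatures
This is another standard element of cancer studies. One can take the 30 signatures from the COSMIC database (2012) as the reference set, use a mature R package to analyze the mutations in one's own dataset, and then compare the result with TCGA.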
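Signature analysis starts from a catalog of single-base substitutions in their trinucleotide context (96 channels), with each mutation reported relative to the pyrimidine of the reference base pair. The Python sketch below shows only that first step; the function names and the toy mutation list are assumptions, and fitting the catalog to the reference signatures is left to the dedicated packages.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def sbs96_channel(context, alt):
    """Map a substitution to its pyrimidine-normalized channel.

    context: reference trinucleotide centered on the mutated base.
    alt:     the mutant base on the same strand as `context`.
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or any(b not in COMPLEMENT for b in context + alt):
        raise ValueError("expected an A/C/G/T trinucleotide and an A/C/G/T alt base")
    if alt == context[1]:
        raise ValueError("alt base equals the reference base")
    if context[1] in "AG":
        # Purine reference: describe the event on the opposite strand.
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"

# Toy mutations (context, alt), purely illustrative.
toy_mutations = [("TCA", "T"), ("TGA", "A"), ("AAG", "C")]
catalog = Counter(sbs96_channel(ctx, alt) for ctx, alt in toy_mutations)
print(catalog)
```

The first two toy mutations are the same event seen from opposite strands, so they fall into the same channel.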
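NMF virtual microdissection
First, a total of 1678 samples from 4 datasets were included (figure not reproduced here).
With this technique, 13 NMF factors were attributed to 4 tissue compartments:
- tumor intrinsic
- stroma
- tumor infiltrating leukocyte (TIL)
- normal tissue.
Each dataset also has an NMF factor unique to itself, which can be interpreted as a batch effect. The original figure shows how the different factors behave across the datasets.
This algorithm has also been packaged for R, so the analysis workflow is easy to reproduce.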
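For readers who want to see what NMF does, the following pure-Python sketch factorizes a non-negative genes-by-samples matrix V into W (gene weights per factor) and H (factor activity per sample) with the classic multiplicative updates for squared error. It is a generic illustration, not the authors' implementation; the toy matrix, the rank of 2, and all function names are assumptions.

```python
import random

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (genes x samples) into non-negative W and H."""
    rng = random.Random(seed)
    n_genes, n_samples = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_genes)]
    h = [[rng.random() + 0.1 for _ in range(n_samples)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps)
              for j in range(n_samples)] for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps)
              for k in range(rank)] for i in range(n_genes)]
    return w, h

# Toy matrix built from two hidden "compartments" (4 genes x 4 samples).
toy_v = [[1, 2, 0, 1],
         [2, 4, 0, 2],
         [0, 1, 3, 1],
         [0, 3, 9, 3]]
w, h = nmf(toy_v, rank=2)
print(frobenius_error(toy_v, w, h))
```

Each row of H can then be inspected per dataset, which is how a factor that is active in only one dataset would show up as a batch effect.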
One can also correlate the Bindea immune expression signature scores (GSVA) with the NMF factors, which helps explain the biological meaning of the factors that NMF produces.
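The correlation step can be sketched as follows in Python: compute the Pearson correlation between each signature's per-sample score and each factor's per-sample activity, then look at which factor each signature tracks most closely. The signature names, factor names, and all numbers are hypothetical toy values, and Pearson (rather than, say, Spearman) is an assumption.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-sample values for 5 samples.
signature_scores = {
    "T cells": [0.1, 0.4, 0.9, 0.3, 0.7],
    "Macrophages": [0.8, 0.2, 0.1, 0.6, 0.3],
}
factor_activity = {
    "factor_A": [1.0, 2.1, 4.2, 1.8, 3.5],
    "factor_B": [3.9, 1.2, 0.5, 3.1, 1.4],
}

for sig, scores in signature_scores.items():
    # Pick the factor with the largest absolute correlation.
    best = max(factor_activity,
               key=lambda f: abs(pearson(scores, factor_activity[f])))
    print(sig, best, round(pearson(scores, factor_activity[best]), 3))
```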
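Differential analysis and functional annotation
The authors used voom to normalize the expression matrix to logCPM and then ran the differential analysis with limma, defining genes with FDR < 0.01 and |log2FC| > 0.2 as significantly differentially expressed.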
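The significance rule above can be sketched in Python as a Benjamini-Hochberg adjustment followed by the two cutoffs. This assumes the FDR is a Benjamini-Hochberg adjusted p-value; the gene names, p-values, and fold changes are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def significant_genes(genes, pvalues, log2fc, fdr_cutoff=0.01, lfc_cutoff=0.2):
    """Genes with FDR < 0.01 and |log2FC| > 0.2, the cutoffs quoted in the notes."""
    fdr = benjamini_hochberg(pvalues)
    return [g for g, q, lfc in zip(genes, fdr, log2fc)
            if q < fdr_cutoff and abs(lfc) > lfc_cutoff]

# Hypothetical results for four genes.
toy_genes = ["geneA", "geneB", "geneC", "geneD"]
toy_p = [0.0001, 0.0002, 0.5, 0.001]
toy_lfc = [1.2, 0.05, -0.8, -0.4]
print(significant_genes(toy_genes, toy_p, toy_lfc))
```

With a log2FC cutoff as low as 0.2, the FDR threshold does most of the filtering; geneB in the toy data shows the case where a tiny p-value still fails the effect-size cutoff.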
Notably, the authors filtered out the 25% of genes with the lowest expression and the 25% of genes with the lowest SD.
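A rough Python sketch of the normalization and filtering follows. The logCPM formula is the one voom uses, log2((count + 0.5) / (library size + 1) * 1e6). How the two 25% filters were combined is not stated in the notes, so applying them independently and keeping the intersection is an assumption, as are the function names and the toy values.

```python
from math import log2
from statistics import mean, stdev

def log_cpm(counts, lib_size):
    """voom-style log2 counts per million for one sample."""
    return [log2((c + 0.5) / (lib_size + 1.0) * 1e6) for c in counts]

def drop_lowest_quarter(score_by_gene):
    """Return the genes left after removing the lowest-scoring 25%."""
    ranked = sorted(score_by_gene, key=score_by_gene.get)
    return set(ranked[len(ranked) // 4:])

def filter_genes(logcpm_by_gene):
    """Keep genes outside the bottom quarter for both mean and SD (assumed rule)."""
    keep_expr = drop_lowest_quarter({g: mean(v) for g, v in logcpm_by_gene.items()})
    keep_sd = drop_lowest_quarter({g: stdev(v) for g, v in logcpm_by_gene.items()})
    return keep_expr & keep_sd

# Toy logCPM values for 8 genes across 3 samples (hypothetical).
toy_logcpm = {
    "g1": [0, 1, 2], "g2": [1, 2, 3], "g3": [5, 5, 5], "g4": [6, 6.1, 6.2],
    "g5": [7, 9, 11], "g6": [8, 10, 12], "g7": [3, 6, 9], "g8": [4, 8, 12],
}
print(sorted(filter_genes(toy_logcpm)))  # ['g5', 'g6', 'g7', 'g8']
```

In the toy data g1 and g2 are removed for low expression and g3 and g4 for low variability, leaving four of the eight genes.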
Using the GSVA algorithm, the authors scored 6475 gene sets from the MSigDB v5.1 database:
- H (hallmark gene sets)
- C2 (curated gene sets)
- C5 (GO gene sets)
- C6 (oncogenic gene sets)
Only the expression matrix is easy to download.
Rich supplementary information
Although the whole-exome sequencing data cannot be downloaded, the authors' analysis results are all provided in the supplementary files.
Supplementary Data 1: SMC and TCGA sample annotation.
Supplementary Data 2: Clinical data summary of SMC and TCGA.
Supplementary Data 3: Molecular subtype distribution in SMC and TCGA. This table contains proportions of molecular subtypes (Consensus, IHC and PAM50) in different age groups and cohorts. Also included are statistical significances of differential distribution between different sample groups.
Supplementary Data 4: BRCA1/BRCA2 germline pathogenic mutations.
Supplementary Data 5: Mutation prevalence of cancer driver genes in SMC and TCGA. This table contains mutation frequencies (% samples) in different cohorts and age groups for significantly mutated genes. Also included are statistical significances of differential distribution and frequencies of germline pathogenic mutations for BRCA1 and BRCA2.
Supplementary Data 6: Somatic mutation predictions in SMC. Detailed annotations for 6,885 somatic protein-altering mutations.
Supplementary Data 7: Significantly mutated genes. Significantly mutated genes identified by MutSigCV based on combined mutations from SMC and TCGA (“Combined”), SMC mutations alone (“SMC”) and TCGA mutations alone (“TCGA”). n_nonsilent: number of protein-altering mutations. n_silent: number of silent mutations.
Supplementary Data 8: Somatic alteration prevalence of cancer driver genes in SMC and TCGA. This table contains the prevalence (% samples) of somatic alterations, including protein-altering substitutions, insertions, deletions, copy number amplifications and deletions, for frequently altered genes in different age groups and cohorts. Amplification is defined as absolute copy number ≥ 6 and deletion as copy number ≤ 1. Also included are statistical significances of different group comparisons. Both direct comparisons with Fisher’s exact test and comparisons adjusting for tumor stage with logistic regression were performed. P-value was calculated using the Fisher’s exact test and FDR corrected using the Benjamini-Hochberg method.
Supplementary Data 9: Somatic alteration prevalence of oncogenic pathways. This table contains alteration prevalence of five BC related oncogenic pathways in different age groups and cohorts. A pathway was altered in a sample if one or more genes in that pathway harbor alteration in the sample. Pairwise comparison of alteration prevalence was performed between different groups using the Fisher’s exact test.
Supplementary Data 10: Somatic mutation signatures. (a) Mutation signature distribution across subtypes and cohorts. (b) Mutation signature correlation with mutation burden, number of protein-altering somatic mutations in each sample.
Supplementary Data 11: Pathway enrichment of NMF factors. Statistical significance for pairwise associations between DE pathways and NMF factors.
Supplementary Data 12: Differentially expressed genes and pathways. Differentially expressed genes (a) and pathways (b) identified from two comparisons - SMC pre-menopausal vs. TCGA post-menopausal and SMC pre-menopausal vs. TCGA pre-menopausal.
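Two of the definitions in the supplementary table descriptions translate directly into code: the copy number calls (amplification at absolute copy number ≥ 6, deletion at ≤ 1) and the rule that a pathway counts as altered when at least one of its genes is altered. The Python sketch below uses hypothetical function names and an illustrative gene set, not the paper's pathway definitions.

```python
def copy_number_call(absolute_cn):
    """Classify an absolute copy number using the thresholds quoted above."""
    if absolute_cn >= 6:
        return "amplification"
    if absolute_cn <= 1:
        return "deletion"
    return "neutral"

def pathway_altered(altered_genes, pathway_genes):
    """A pathway is altered if one or more of its genes is altered in the sample."""
    return not set(altered_genes).isdisjoint(pathway_genes)

# Illustrative gene set only; the actual pathway membership is in the paper.
toy_pathway = {"PIK3CA", "PTEN"}
sample_alterations = {"TP53", "PTEN"}

print(copy_number_call(8), copy_number_call(2), copy_number_call(0))
print(pathway_altered(sample_alterations, toy_pathway))
```

Per-sample booleans from `pathway_altered` can be counted per age group and cohort to obtain prevalence values of the kind listed in Supplementary Data 9.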
(Reposted from jimmy's 2018 literature reading notes)