這兩天“王院長(zhǎng)”幫別人分析snv數(shù)據(jù)签舞,順便給我推薦了一個(gè)R包:maftools贷帮。我看了一下官網(wǎng)戚揭,感覺(jué)是個(gè)非常強(qiáng)大的R包。那必須要學(xué)習(xí)起來(lái)~實(shí)際上網(wǎng)上已經(jīng)有很多教程了撵枢,但都是用maftools自帶的示例數(shù)據(jù)來(lái)演示的民晒。這里我自己去TCGA網(wǎng)站上隨便挑了一個(gè)maf文件來(lái)做練習(xí)(用示例數(shù)據(jù)來(lái)練習(xí)就不知道哪里會(huì)報(bào)錯(cuò)精居、哪一步需要注意了,事實(shí)證明在練習(xí)過(guò)程中潜必,有許多步驟都需要根據(jù)你的MAF里包含的信息進(jìn)行參數(shù)修改靴姿,否則是要報(bào)錯(cuò)的。如果只是使用示例數(shù)據(jù)磁滚,這些需要注意的地方都不會(huì)被發(fā)現(xiàn)的)佛吓。maftools的官方網(wǎng)站:here
從TCGA網(wǎng)站上隨便選了一個(gè)SNV的MAF文件,并且把對(duì)應(yīng)的臨床數(shù)據(jù)下載下來(lái)(這第一步就是需要注意的地方垂攘,但是如果你用的是maftools自帶的數(shù)據(jù)跑流程维雇,你不會(huì)知道第一步就是一個(gè)坑):
讀取MAF文件
這一步構(gòu)建maf對(duì)象同樣非常重要晒他,也是一個(gè)需要注意的點(diǎn)谆沃。
#這里需要注意的是:下面read.maf里必須加上isTCGA = TRUE
#否則這一步會(huì)報(bào)錯(cuò),有興趣的同學(xué)可以去掉這一個(gè)參數(shù)設(shè)置運(yùn)行一下仪芒,看看是什么報(bào)錯(cuò)唁影。
#另外,如果你看其他教程里使用的是maftools自帶的數(shù)據(jù)掂名,這一步的參數(shù)也是沒(méi)有isTCGA=TRUE的据沈,這就是為什么我沒(méi)有使用自帶數(shù)據(jù)來(lái)練習(xí),因?yàn)槟悴恢涝诜治瞿阕约簲?shù)據(jù)的時(shí)候會(huì)出現(xiàn)什么報(bào)錯(cuò)饺蔑。
> maf <- read.maf("TCGA.HNSC.mutect.somatic.maf.gz",clinicalData = "clinical.tsv",isTCGA = TRUE)
-Reading
|--------------------------------------------------|
|==================================================|
-Validating
-Silent variants: 35007
-Summarizing
--Possible FLAGS among top ten genes:
TTN
MUC16
SYNE1
-Processing clinical data
-Finished in 10.3s elapsed (9.030s cpu)
現(xiàn)在讀取文件后锌介,可以看看maf這個(gè)對(duì)象的簡(jiǎn)單的總結(jié):
> maf
An object of class MAF
ID summary Mean Median
1: NCBI_Build GRCh38 NA NA
2: Center BI NA NA
3: Samples 508 NA NA
4: nGenes 15436 NA NA
5: Frame_Shift_Del 2055 4.045 2
6: Frame_Shift_Ins 1382 2.720 1
7: In_Frame_Del 328 0.646 0
8: In_Frame_Ins 251 0.494 0
9: Missense_Mutation 56861 111.931 78
10: Nonsense_Mutation 4796 9.441 6
11: Nonstop_Mutation 63 0.124 0
12: Splice_Site 1482 2.917 2
13: Translation_Start_Site 84 0.165 0
14: total 67302 132.484 92
查看每個(gè)樣品的總結(jié)情況:
> getSampleSummary(maf)
Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins Missense_Mutation Nonsense_Mutation
1: TCGA-F7-A624 248 57 7 1 2174 63
2: TCGA-CV-7568 4 1 1 1 1645 105
3: TCGA-D6-6516 5 1 2 0 1331 78
4: TCGA-CR-7402 28 9 1 1 696 59
5: TCGA-CN-4723 8 4 0 0 564 73
---
504: TCGA-CR-7392 0 0 0 0 11 0
505: TCGA-D6-6827 1 1 2 0 7 0
506: TCGA-CR-7391 0 0 0 0 4 0
507: TCGA-CN-6017 1 0 0 0 1 0
508: TCGA-CR-7393 1 0 0 0 0 0
Nonstop_Mutation Splice_Site Translation_Start_Site total
1: 2 60 1 2613
2: 0 34 3 1794
3: 0 24 0 1441
4: 0 27 1 822
5: 1 10 1 661
---
504: 0 0 0 11
505: 0 0 0 11
506: 0 0 0 4
507: 0 0 0 2
508: 0 0 0 1
查看基因的情況:
> getGeneSummary(maf)
Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins Missense_Mutation Nonsense_Mutation Nonstop_Mutation
1: TP53 49 21 8 3 220 68 0
2: TTN 6 5 0 1 310 18 0
3: FAT1 25 23 3 0 26 61 0
4: CDKN2A 15 3 3 0 20 51 0
5: MUC16 2 2 0 3 109 11 0
---
15432: ZNHIT1 0 0 0 0 1 0 0
15433: ZNHIT2 0 0 0 0 1 0 0
15434: ZP1 0 0 0 0 1 0 0
15435: ZRSR1 0 0 0 0 1 0 0
15436: ZSCAN30 0 0 0 0 1 0 0
Splice_Site Translation_Start_Site total MutatedSamples AlteredSamples
1: 28 1 398 340 340
2: 6 0 346 194 194
3: 5 0 143 114 114
4: 12 0 104 101 101
5: 2 0 129 97 97
---
15432: 0 0 1 1 1
15433: 0 0 1 1 1
15434: 0 0 1 1 1
15435: 0 0 1 1 1
15436: 0 0 1 1 1
除此之外,你還可以使用下面的代碼查看其它信息:
#shows clinical data associated with samples查看臨床信息
> getClinicalData(maf)
#Shows all fields in MAF查看maf文件里所有的可用信息
> getFields(maf)
可視化
(1)畫(huà)MAF文件的summary圖
> plotmafSummary(maf = maf, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)
(2)Oncoplot圖
oncoplot圖也叫做瀑布圖:
> oncoplot(maf = maf, top = 10)
當(dāng)你的樣品數(shù)很多的時(shí)候玄货,比如像我這個(gè)MAF文件有508個(gè)樣品皇钞,你還可以添加一個(gè)參數(shù),讓樣品之間沒(méi)有灰色的框框松捉,看起來(lái)會(huì)更好看一些(參考: 腫瘤變異數(shù)據(jù)分析和可視化工具maftools:突變數(shù)據(jù)下載和可視化)
> oncoplot(maf = maf, top = 10,borderCol=NULL)#樣品不顯示邊界
如果你不喜歡圖里的顏色夹界,也可以根據(jù)這篇文章里的代碼來(lái)更改顏色:maftools | 從頭開(kāi)始繪制發(fā)表級(jí)oncoplot(瀑布圖)
對(duì)于oncoplot這個(gè)函數(shù),你還可以在圖上添加其他的信息隘世,可以參考官網(wǎng):here
(3)轉(zhuǎn)換和顛換的可視化
titv
功能可以把SNP分類為轉(zhuǎn)換和顛換可柿,并且返回一個(gè)總結(jié)性的table也拜。這個(gè)總結(jié)性的數(shù)據(jù)也可以被可視化,并且可以展示每一個(gè)樣品里每一種類型的比例:
> hnsc.titv = titv(maf = maf, plot = FALSE, useSyn = TRUE)
#plot titv summary
> plotTiTv(res = hnsc.titv)
(3)氨基酸改變的Lollipop plots圖(棒棒糖圖)
如果你想畫(huà)氨基酸change的棒棒糖圖趾痘,首先要確保你的MAF文件里有相應(yīng)的信息慢哈。然而MAF文件沒(méi)有明確的標(biāo)準(zhǔn)對(duì)于氨基酸changes這一項(xiàng)信息的命名,在不同的study里可能有著不同的命名方法永票。默認(rèn)情況下卵贱,棒棒糖圖是根據(jù)MAF文件里AAChanges這一列畫(huà)的。如果你的MAF文件里沒(méi)有AAChanges這一項(xiàng)侣集,那么會(huì)返回一個(gè)warning message提示你你的MAF文件里都有什么信息键俱。比如我使用的這個(gè)MAF文件里,如果根據(jù)官網(wǎng)上寫的參數(shù)設(shè)置世分,R會(huì)返回給我一個(gè)warning message:
#在官網(wǎng)的例子里编振,他們用的MAF文件里關(guān)于氨基酸change的命名是“Protein_change”,而在我的數(shù)據(jù)里不是這個(gè)命名臭埋,所以出現(xiàn)以下報(bào)錯(cuò):
> lollipopPlot(maf = maf, gene = 'TTN', AACol = 'Protein_Change', showMutationRate = TRUE)
Available fields:
[1] "Hugo_Symbol" "Entrez_Gene_Id"
[3] "Center" "NCBI_Build"
[5] "Chromosome" "Start_Position"
[7] "End_Position" "Strand"
[9] "Variant_Classification" "Variant_Type"
[11] "Reference_Allele" "Tumor_Seq_Allele1"
[13] "Tumor_Seq_Allele2" "dbSNP_RS"
[15] "dbSNP_Val_Status" "Tumor_Sample_Barcode"
[17] "Matched_Norm_Sample_Barcode" "Match_Norm_Seq_Allele1"
[19] "Match_Norm_Seq_Allele2" "Tumor_Validation_Allele1"
[21] "Tumor_Validation_Allele2" "Match_Norm_Validation_Allele1"
[23] "Match_Norm_Validation_Allele2" "Verification_Status"
[25] "Validation_Status" "Mutation_Status"
[27] "Sequencing_Phase" "Sequence_Source"
[29] "Validation_Method" "Score"
[31] "BAM_File" "Sequencer"
[33] "Tumor_Sample_UUID" "Matched_Norm_Sample_UUID"
[35] "HGVSc" "HGVSp"
[37] "HGVSp_Short" "Transcript_ID"
[39] "Exon_Number" "t_depth"
[41] "t_ref_count" "t_alt_count"
[43] "n_depth" "n_ref_count"
[45] "n_alt_count" "all_effects"
[47] "Allele" "Gene"
[49] "Feature" "Feature_type"
[51] "One_Consequence" "Consequence"
[53] "cDNA_position" "CDS_position"
[55] "Protein_position" "Amino_acids"
[57] "Codons" "Existing_variation"
[59] "ALLELE_NUM" "DISTANCE"
[61] "TRANSCRIPT_STRAND" "SYMBOL"
[63] "SYMBOL_SOURCE" "HGNC_ID"
[65] "BIOTYPE" "CANONICAL"
[67] "CCDS" "ENSP"
[69] "SWISSPROT" "TREMBL"
[71] "UNIPARC" "RefSeq"
[73] "SIFT" "PolyPhen"
[75] "EXON" "INTRON"
[77] "DOMAINS" "GMAF"
[79] "AFR_MAF" "AMR_MAF"
[81] "ASN_MAF" "EAS_MAF"
[83] "EUR_MAF" "SAS_MAF"
[85] "AA_MAF" "EA_MAF"
[87] "CLIN_SIG" "SOMATIC"
[89] "PUBMED" "MOTIF_NAME"
[91] "MOTIF_POS" "HIGH_INF_POS"
[93] "MOTIF_SCORE_CHANGE" "IMPACT"
[95] "PICK" "VARIANT_CLASS"
[97] "TSL" "HGVS_OFFSET"
[99] "PHENO" "MINIMISED"
[101] "ExAC_AF" "ExAC_AF_Adj"
[103] "ExAC_AF_AFR" "ExAC_AF_AMR"
[105] "ExAC_AF_EAS" "ExAC_AF_FIN"
[107] "ExAC_AF_NFE" "ExAC_AF_OTH"
[109] "ExAC_AF_SAS" "GENE_PHENO"
[111] "FILTER" "CONTEXT"
[113] "src_vcf_id" "tumor_bam_uuid"
[115] "normal_bam_uuid" "case_id"
[117] "GDC_FILTER" "COSMIC"
[119] "MC3_Overlap" "GDC_Validation_Status"
Error in lollipopPlot(maf = maf, gene = "TTN", AACol = "Protein_Change", :
Column Protein_Change not found.
那么根據(jù)我的MAF文件的情況踪央,則要選擇“HGVSp_Short”這一項(xiàng)來(lái)進(jìn)行畫(huà)圖:
> lollipopPlot(maf = maf, gene = 'FAT1', AACol = 'HGVSp_Short', showMutationRate = TRUE)
也可以針對(duì)某一個(gè)突變位點(diǎn)進(jìn)行標(biāo)注:
> lollipopPlot(maf = maf, gene = 'FAT1', showDomainLabel = FALSE, labelPos = 4136)
(4)Rainfall plots圖
腫瘤基因組,特別是實(shí)體腫瘤瓢阴,具有局部超突變(hyper-mutations)畅蹂。這種超突變基因組區(qū)域可以通過(guò)在線性基因組尺度上繪制變異間距離來(lái)進(jìn)行可視化。這種圖通常稱為Rainfall plots荣恐,我們可以使用rainfallPlot
來(lái)畫(huà)圖液斜。如果detectChangePoints參數(shù)設(shè)置為TRUE,rainfallPlot圖還可以突出顯示突變間距離潛在變化所在的區(qū)域叠穆。NOTE:因?yàn)槲沂褂玫臄?shù)據(jù)里沒(méi)有檢測(cè)到changepoints少漆,所以我就借用一下官網(wǎng)的圖
> rainfallPlot(maf = maf, detectChangePoints = TRUE, pointSize = 0.6)
(5)與TCGA cohort比較突變負(fù)荷
將自己的MAF與自帶的33個(gè)TCGA cohorts進(jìn)行比較:
> hnsc.mutload = tcgaCompare(maf = maf, cohortName = 'Example-hnsc')
Performing pairwise t-test for differences in mutation burden..
Warning message:
In FUN(X[[i]], ...) : Removed 0 samples with zero mutations.
(6)變異等位基因頻率(VAF)可視化
NOTE:由于我的數(shù)據(jù)里沒(méi)有i_TumorVAF_WU
這一項(xiàng)信息,所以這里的代碼會(huì)有所不同硼被。如果你使用的MAF文件里也沒(méi)有相關(guān)信息也沒(méi)有關(guān)系示损,將參數(shù)設(shè)置成vafCol = NULL,plotVaf
會(huì)自動(dòng)計(jì)算并進(jìn)行可視化:
> plotVaf(maf = maf, vafCol = NULL)
(7)Somatic Interactions
對(duì)top基因的突變事件畫(huà)相關(guān)性圖:
> somaticInteractions(maf = maf, top = 25, pvalue = c(0.05, 0.01))
gene1 gene2 pValue oddsRatio 00 11 01 10
1: CDKN2A TP53 3.652900e-09 5.7583311 158 91 249 10
2: DMD PAPPA2 2.851697e-07 7.9922491 436 15 29 28
3: TTN RYR2 3.054658e-07 4.8395698 299 38 15 156
4: FLG MUC16 2.233155e-06 3.9689241 373 28 69 38
5: PCLO CSMD3 2.291170e-06 3.8123129 372 29 60 47
---
296: PAPPA2 PIK3CA 1.000000e+00 0.9221460 385 7 79 37
297: FAM135B KMT2D 1.000000e+00 0.9455631 389 7 72 40
298: CASP8 USH2A 1.000000e+00 0.9510660 401 6 55 46
299: PKHD1L1 CASP8 1.000000e+00 0.9716819 411 5 47 45
300: XIRP2 CASP8 1.000000e+00 0.9961390 412 5 47 44
Event pair event_ratio
1: Co_Occurence CDKN2A, TP53 91/259
2: Co_Occurence DMD, PAPPA2 15/57
3: Co_Occurence RYR2, TTN 38/171
4: Co_Occurence FLG, MUC16 28/107
5: Co_Occurence CSMD3, PCLO 29/107
---
296: Mutually_Exclusive PAPPA2, PIK3CA 7/116
297: Mutually_Exclusive FAM135B, KMT2D 7/112
298: Mutually_Exclusive CASP8, USH2A 6/101
299: Mutually_Exclusive CASP8, PKHD1L1 5/92
300: Mutually_Exclusive CASP8, XIRP2 5/91
(8)鑒定腫瘤driver 基因
maftools有一個(gè)功能oncodrive
祷嘶,它可以鑒定一個(gè)MAF文件里的腫瘤driver基因屎媳。這個(gè)功能根據(jù)oncodriveCLUST算法進(jìn)行運(yùn)算夺溢。這個(gè)概念是基于大多數(shù)致癌基因的變異在少數(shù)特定位點(diǎn)(即熱點(diǎn))富集论巍。
> hnsc.sig = oncodrive(maf = maf, AACol = 'HGVSp_Short', minMut = 5, pvalMethod = 'zscore')
Estimating background scores from synonymous variants..
|===========================================================================| 100%Not enough genes to build background. Using predefined values. (Mean = 0.279; SD = 0.13)
Estimating cluster scores from non-syn variants..
|===========================================================================| 100%Comapring with background model and estimating p-values..
Done !
> head(hnsc.sig)
Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
1: HRAS 0 1 0 0
2: MAPK1 0 0 0 0
3: RAC1 0 0 0 0
4: HIST1H4K 0 0 0 0
5: HIST1H3C 0 0 0 0
6: MOS 0 0 0 0
Missense_Mutation Nonsense_Mutation Nonstop_Mutation Splice_Site
1: 28 1 0 0
2: 7 1 0 0
3: 12 0 0 0
4: 5 1 0 0
5: 6 0 0 0
6: 6 0 0 0
Translation_Start_Site total MutatedSamples AlteredSamples clusters
1: 0 30 29 29 2
2: 0 8 8 8 1
3: 0 12 12 12 3
4: 0 6 6 6 2
5: 0 6 6 6 1
6: 0 6 6 6 2
muts_in_clusters clusterScores protLen zscore pval fdr
1: 26 0.8666667 189 4.520513 3.084500e-06 0.0006719096
2: 7 0.8750000 360 4.584615 2.274114e-06 0.0006719096
3: 11 0.8627961 211 4.490739 3.548818e-06 0.0006719096
4: 5 0.7845178 103 3.888598 5.041239e-05 0.0071585591
5: 5 0.7500000 136 3.623077 1.455596e-04 0.0165355705
6: 4 0.6666667 346 2.982051 1.431620e-03 0.1355267092
fract_muts_in_clusters
1: 0.8666667
2: 0.8750000
3: 0.9166667
4: 0.8333333
5: 0.8333333
6: 0.6666667
> plotOncodrive(res = hnsc.sig, fdrCutOff = 0.01, useFraction = TRUE,labelSize = 1)
plotOncodrive將結(jié)果繪制成散點(diǎn)圖,點(diǎn)的大小與基因中發(fā)現(xiàn)的簇?cái)?shù)成比例风响。x軸顯示了在這些clusters中觀察到的突變數(shù)量(或突變的部分)嘉汰。
(9)添加pfam domain區(qū)域
maftools附帶pfamDomains功能,可以在氨基酸changes里添加pfam域信息状勤。根據(jù)受影響的結(jié)構(gòu)域鞋怀,還總結(jié)了氨基酸結(jié)構(gòu)域的變化双泪。這有助于了解特定癌癥群體中哪些domain最容易受到影響。
> hnsc.pfam = pfamDomains(maf = maf, AACol = 'HGVSp_Short', top = 10)
#Protein summary (Printing first 7 columns for display convenience)
> hnsc.pfam$proteinSummary[,1:7, with = FALSE]
HGNC AAPos Variant_Classification N total fraction
1: PIK3CA 545 Missense_Mutation 24 88 0.27272727
2: CDKN2A 80 Nonsense_Mutation 20 104 0.19230769
3: PIK3CA 542 Missense_Mutation 18 88 0.20454545
4: PIK3CA 1047 Missense_Mutation 15 88 0.17045455
5: TP53 273 Missense_Mutation 13 398 0.03266332
---
65993: ZZZ3 90 Missense_Mutation 1 6 0.16666667
65994: pk 669 Missense_Mutation 1 4 0.25000000
65995: pk 448 Missense_Mutation 1 4 0.25000000
65996: pk 466 Missense_Mutation 1 4 0.25000000
65997: pk 353 Missense_Mutation 1 4 0.25000000
DomainLabel
1: PI3Ka_I
2: ANK
3: PI3Ka_I
4: PI3Kc_IA_alpha
5: P53
---
65993: <NA>
65994: <NA>
65995: <NA>
65996: <NA>
65997: <NA>
#用下面的代碼查看計(jì)算出的pfam區(qū)域的信息
#Domain summary (Printing first 3 columns for display convenience)
> hnsc.pfam$domainSummary[,1:3, with = FALSE]
DomainLabel nMuts nGenes
1: 7tm_1 1490 495
2: COG5048 969 343
3: Cadherin_repeat 814 106
4: FN3 544 115
5: I-set 380 86
---
5725: zf-UBR 1 1
5726: zf-ZPR1 1 1
5727: zf-piccolo 1 1
5728: zf-rbx1 1 1
5729: zinc finger 1 1
(10)臨床富集分析
clinicalEnrichment
功能可以把任何臨床特征與相關(guān)的樣品進(jìn)行富集分析密似。 比如我想把臨床分期與突變基因之間的關(guān)系表示出來(lái):
> adcc_clinical_stage.ce = clinicalEnrichment(maf = maf, clinicalFeature = 'ajcc_clinical_stage')
Sample size per factor in ajcc_clinical_stage:
'-- Stage I Stage II Stage III Stage IVA Stage IVB Stage IVC
14 20 95 104 260 9 6
> adcc_clinical_stage.ce$groupwise_comparision[p_value < 0.001]
Hugo_Symbol Group1 Group2 n_mutated_group1 n_mutated_group2
1: ANO9 '-- Rest 3 of 14 4 of 494
2: ADAMTSL1 Stage III Rest 8 of 104 4 of 404
3: INPP4B Stage IVB Rest 3 of 9 8 of 499
4: FMN2 '-- Rest 5 of 14 26 of 494
5: ITGB1 Stage III Rest 7 of 104 3 of 404
p_value OR_low OR_high fdr
1: 0.0005491163 0 0.1740052 1
2: 0.0005801867 0 0.3843828 1
3: 0.0005938635 0 0.1801677 1
4: 0.0008275002 0 0.3323002 1
5: 0.0008798140 0 0.3826840 1
比如上面的ADAMTSL1 這個(gè)基因焙矛,主要在stageIII這個(gè)分期有突變的富集,同樣的残腌,INPP4B基因的突變主要在Stage IVB這個(gè)分期中出現(xiàn)村斟。因?yàn)樵谂R床樣品的信息里,有些病人的信息并不完整抛猫,所以有一個(gè)群體是--蟆盹,這里就會(huì)出現(xiàn)有的基因在“--”群體里有突變的富集。
> plotEnrichmentResults(enrich_res = adcc_clinical_stage.ce, pVal = 0.001)
(11)藥物-基因相互作用
drugInteractions
功能是用來(lái)檢查藥物-基因相互作用闺金,以及基因的成藥性逾滥,這個(gè)分析是由 Drug Gene Interaction database匯編而成的。
> dgi = drugInteractions(maf = maf, fontSize = 0.75)
上面的圖顯示了在這個(gè)MAF文件中涉及到的前5個(gè)基因败匹,以及潛在的可給藥的基因類別寨昙。也可以提取藥物-基因相互作用的信息(基因名稱、相互作用類型掀亩、藥物名稱)毅待。例如,下面是已知/報(bào)道的藥物與RYR2相互作用的結(jié)果归榕。NOTE:這個(gè)藥物-基因信息只用于研究使用尸红,并不具有臨床指導(dǎo)意義
> RYR2.dgi = drugInteractions(genes = "RYR2", drugs = TRUE)
Number of claimed drugs for given genes:
Gene N
1: RYR2 9
> RYR2.dgi[,.(Gene, interaction_types, drug_name, drug_claim_name)]
Gene interaction_types drug_name
1: RYR2 activator,channel blocker RYANODINE
2: RYR2 activator SURAMIN
3: RYR2 antagonist MAGNESIUM
4: RYR2 activator ADENOSINE TRIPHOSPHATE
5: RYR2 activator CAFFEINE
6: RYR2 activator,antagonist cA2
7: RYR2 antagonist
8: RYR2 channel blocker PROCAINE
9: RYR2 channel blocker RUTHENIUM RED
drug_claim_name
1: 4303
2: 1728
3: 708
4: 1713
5: 407
6: 707
7: DAP000499
8: 4291
9: 2432
(12)Oncogenic信號(hào)通路富集
OncogenicPathways
功能可以查看樣品里關(guān)于已知Oncogenic信號(hào)通路的富集情況:
> OncogenicPathways(maf = maf)
Pathway N n_affected_genes fraction_affected
1: NRF2 3 3 1.0000000
2: TP53 6 6 1.0000000
3: TGF-Beta 7 7 1.0000000
4: Cell_Cycle 15 10 0.6666667
5: MYC 13 11 0.8461538
6: PI3K 29 26 0.8965517
7: Hippo 38 36 0.9473684
8: NOTCH 71 58 0.8169014
9: WNT 68 61 0.8970588
10: RTK-RAS 85 80 0.9411765
Mutated_samples Fraction_mutated_samples
1: 57 0.11220472
2: 349 0.68700787
3: 47 0.09251969
4: 127 0.25000000
5: 37 0.07283465
6: 163 0.32086614
7: 279 0.54921260
8: 274 0.53937008
9: 171 0.33661417
10: 278 0.54724409
也可以對(duì)某一個(gè)完整的信號(hào)通路進(jìn)行可視化:
> PlotOncogenicPathways(maf = maf, pathways = "PI3K")
(13)腫瘤異質(zhì)性和MATH分?jǐn)?shù)
腫瘤通常是異質(zhì)性的盅蝗,即由多個(gè)克隆組成。這種異質(zhì)性可以通過(guò)聚類變異等位基因頻率來(lái)推斷姆蘸。inferHeterogeneity
函數(shù)使用vaf信息聚類變異(使用mclust)墩莫,以推斷克隆。默認(rèn)情況下逞敷,inferHeterogeneity
函數(shù)查找MAF文件里包含的vaf信息的(t_vaf)狂秦。但是,如果你的MAF文件里的vaf信息名稱不是t_vaf推捐,可以使用參數(shù)vafCol手動(dòng)指定裂问。在我選的這個(gè)MAF文件里并沒(méi)有vaf信息,參數(shù)設(shè)置vafCol = NULL,它會(huì)自動(dòng)計(jì)算堪簿。
#從MAF文件里隨便選取一個(gè)樣品來(lái)查看腫瘤異質(zhì)性
> tcga.cn.4741.het = inferHeterogeneity(maf = maf, tsb = 'TCGA-CN-4741', vafCol = NULL)
t_vaf field is missing, but found t_ref_count & t_alt_count columns. Estimating vaf..
Processing TCGA-CN-4741..
> print(tcga.cn.4741.het$clusterMeans)
Tumor_Sample_Barcode cluster meanVaf
1: TCGA-CN-4741 2 0.3846020
2: TCGA-CN-4741 1 0.0878295
3: TCGA-CN-4741 3 0.7790545
4: TCGA-CN-4741 outlier 0.2222222
> plotClusters(clusters = tcga.cn.4741.het)
從上圖中可以清楚的看到有三個(gè)克隆痊乾,平均變異等位基因頻率約為77.9%(主要克隆),38.46%和8.8%椭更。
盡管突變等位基因頻率的聚類給我們一個(gè)關(guān)于異質(zhì)性的概念哪审,它也可能是衡量異質(zhì)性的程度的一個(gè)數(shù)值。MATH分?jǐn)?shù)(在上面的圖中是54.585)是一種簡(jiǎn)單的腫瘤內(nèi)異質(zhì)性的定量測(cè)量虑瀑,它計(jì)算出vaf分布的寬度协饲。MATH的分?jǐn)?shù)越高,代表著越差的outcome(可以理解為預(yù)后效果)缴川。所以MATH分?jǐn)?shù)也可以作為生存分析的代理變量茉稠。
(14)Mutational Signatures
每一種癌癥在其發(fā)展過(guò)程中都會(huì)留下以特定的核苷酸替換為特征的標(biāo)志。Alexandrov等人從7000多個(gè)癌癥樣本中發(fā)現(xiàn)了這種突變特征(Mutational Signatures)把夸。這些特征可以通過(guò)分解核苷酸置換矩陣來(lái)提取而线,并根據(jù)突變堿基周圍的直接堿基將其劃分為96個(gè)置換類別。提取的特征也可以與那些已經(jīng)經(jīng)過(guò)驗(yàn)證的特征進(jìn)行比較恋日。
特征分析的第一步是獲得突變堿基周圍的堿基膀篮,形成突變矩陣。注意:早期版本的maftools需要一個(gè)fasta文件作為輸入岂膳。從1.8.0版本開(kāi)始誓竿,BSgenome對(duì)象被用于更快的序列提取。
#這里需要注意的是谈截,你要知道你的樣品是根據(jù)哪一個(gè)基因組比對(duì)來(lái)的筷屡,使用的包也會(huì)不一樣,
#這里官網(wǎng)用的是hg19簸喂,然而我這里如果用hg19就會(huì)報(bào)錯(cuò)
#當(dāng)你用maf查看你的MAF文件信息summary的時(shí)候毙死,如果NCBI_build那一項(xiàng)是hg19,這里就要用hg19;我的MAF文件是用hg38來(lái)build的喻鳄,所以這里也要用相應(yīng)的基因組
> BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")
> library(BSgenome.Hsapiens.UCSC.hg38, quietly = TRUE)
> hnsc.tnm = trinucleotideMatrix(maf = maf, ref_genome = "BSgenome.Hsapiens.UCSC.hg38")
-Extracting 5' and 3' adjacent bases
-Extracting +/- 20bp around mutated bases for background C>T estimation
-Estimating APOBEC enrichment scores
--Performing one-way Fisher's test for APOBEC enrichment
---APOBEC related mutations are enriched in 43.984 % of samples (APOBEC enrichment score > 2 ; 223 of 507 samples)
-Creating mutation matrix
--matrix of dimension 507x96
上面的分析過(guò)程主要是分兩步:
(1)計(jì)算APOBEC富集分?jǐn)?shù)
(2)準(zhǔn)備特征分析的突變矩陣
APOBEC誘導(dǎo)的突變?cè)趯?shí)體腫瘤中更為常見(jiàn)扼倘,主要與TCW motif中發(fā)生的C>T轉(zhuǎn)換事件有關(guān)。使用Roberts等人所描述的方法計(jì)算上述命令中的APOBEC富集分?jǐn)?shù)除呵。簡(jiǎn)單地說(shuō)再菊,在一個(gè)給定的樣本中,將TCW motif中發(fā)生的C>T突變與所有C>T突變的富集情況的比例與背景胞嘧啶和發(fā)生在突變堿基20bp內(nèi)的TCWs進(jìn)行比較颜曾。
我們還可以分析APOBEC富集和非APOBEC富集的樣品在突變模式上的差異纠拔。plotApobecDiff
函數(shù)采用trinucleotideMatrix計(jì)算APOBEC富集分?jǐn)?shù),將樣本分為APOBEC富集和非APOBEC富集泛啸。分組后绿语,比較這兩組秃症,以確定改變的基因的差異候址。
> plotApobecDiff(tnm = hnsc.tnm, maf = maf, pVal = 0.01)
Showing top 10 of 24 differentially mutated genes
$results
Hugo_Symbol Enriched nonEnriched pval or ci.up ci.low adjPval
1: NSD1 13 44 0.0005989521 0.3383091 0.6608655 0.16251777 1
2: CASP8 34 18 0.0011488374 2.6532642 5.1495987 1.40831724 1
3: SMTN 8 0 0.0013041485 Inf Inf 2.22107156 1
4: FSTL5 2 18 0.0019547843 0.1341226 0.5700077 0.01491342 1
5: TGFBR2 16 5 0.0029623612 4.3012453 15.2586418 1.47518180 1
---
10105: ZSCAN4 3 3 1.0000000000 1.2766417 9.6232806 0.16933136 1
10106: ZSCAN5B 3 4 1.0000000000 0.9546251 5.7067946 0.13838894 1
10107: ZSWIM2 2 3 1.0000000000 0.8479356 7.4690287 0.07022655 1
10108: ZWILCH 1 2 1.0000000000 0.6356815 12.2842428 0.01072353 1
10109: ZZZ3 3 3 1.0000000000 1.2766417 9.6232806 0.16933136 1
$SampleSummary
Cohort SampleSize Mean Median
1: Enriched 223 133.915 108
2: nonEnriched 284 131.824 81
特征分析
特征分析包括以下幾步:
(1)estimateSignatures:在一系列的值上運(yùn)行NMF并測(cè)量擬合優(yōu)度吕粹。
(2)plotCophenetic:繪制elblow圖并幫助你決定特征的最佳數(shù)量。最佳可能的特征是共同相關(guān)性顯著下降的值岗仑。
(3)extractSignatures:使用非負(fù)矩陣分解將矩陣分解為n個(gè)特征匹耕。根據(jù)上述兩個(gè)步驟選擇n的值。如果你可以自己估計(jì)n的值荠雕,可以跳過(guò)以上兩步稳其。
(4)compareSignatures:從上一步驟中提取的特征可以與COSMIC數(shù)據(jù)庫(kù)中已知的特征進(jìn)行比較,并計(jì)算余弦相似度以確定最佳匹配炸卑。
(5)plotSignatures:可視化
> library('NMF')
> hnsc.sign = estimateSignatures(mat = hnsc.tnm, nTry = 10) #這里我用的nTry是10既鞠,官網(wǎng)用的是6
-Running NMF for 10 ranks
Compute NMF rank= 2 ... + measures ... OK
Compute NMF rank= 3 ... + measures ... OK
Compute NMF rank= 4 ... + measures ... OK
Compute NMF rank= 5 ... + measures ... OK
Compute NMF rank= 6 ... + measures ... OK
Compute NMF rank= 7 ... + measures ... OK
Compute NMF rank= 8 ... + measures ... OK
Compute NMF rank= 9 ... + measures ... OK
Compute NMF rank= 10 ... + measures ... OK
-Finished in 00:06:53 elapsed (29.9s cpu)
> plotCophenetic(res = hnsc.sign)
最佳可能值是y軸上相關(guān)值顯著下降之前的值。那么上圖的話盖文,有兩個(gè)顯著下降嘱蛋,一個(gè)是2,另一個(gè)是在6五续。這里官網(wǎng)的示例數(shù)據(jù)里只有一個(gè)顯著下降的值洒敏,并且官網(wǎng)在這里寫到:對(duì)于突變負(fù)荷較高的實(shí)體腫瘤,如果有足夠的樣本數(shù)量疙驾,會(huì)有更多的特征凶伙。我所用的HNSC腫瘤突變負(fù)荷比官網(wǎng)使用的LAML突變負(fù)荷要高,所以我選擇n=6它碎。確定好n的值函荣,就可以正式分析了。
> hnsc.sig = extractSignatures(mat = hnsc.tnm, n = 6)
-Running NMF for factorization rank: 6
-Finished in9.390s elapsed (7.940s cpu)
> hnsc.og30.cosm = compareSignatures(nmfRes = hnsc.sig, sig_db = "legacy")
-Comparing against COSMIC signatures
------------------------------------
--Found Signature_1 most similar to COSMIC_13
Aetiology: APOBEC Cytidine Deaminase (C>G) [cosine-similarity: 0.832]
--Found Signature_2 most similar to COSMIC_1
Aetiology: spontaneous deamination of 5-methylcytosine [cosine-similarity: 0.937]
--Found Signature_3 most similar to COSMIC_4
Aetiology: exposure to tobacco (smoking) mutagens [cosine-similarity: 0.946]
--Found Signature_4 most similar to COSMIC_7
Aetiology: UV exposure [cosine-similarity: 0.938]
--Found Signature_5 most similar to COSMIC_6
Aetiology: defective DNA mismatch repair [cosine-similarity: 0.798]
--Found Signature_6 most similar to COSMIC_5
Aetiology: Unknown [cosine-similarity: 0.805]
------------------------------------
> hnsc.v3.cosm = compareSignatures(nmfRes = hnsc.sig, sig_db = "SBS")
-Comparing against COSMIC signatures
------------------------------------
--Found Signature_1 most similar to SBS2
Aetiology: APOBEC Cytidine Deaminase (C>T) [cosine-similarity: 0.714]
--Found Signature_2 most similar to SBS1
Aetiology: spontaneous or enzymatic deamination of 5-methylcytosine [cosine-similarity: 0.933]
--Found Signature_3 most similar to SBS4
Aetiology: exposure to tobacco (smoking) mutagens [cosine-similarity: 0.919]
--Found Signature_4 most similar to SBS7b
Aetiology: UV exposure [cosine-similarity: 0.979]
--Found Signature_5 most similar to SBS6
Aetiology: defective DNA mismatch repair [cosine-similarity: 0.719]
--Found Signature_6 most similar to SBS5
Aetiology: Unknown [cosine-similarity: 0.78]
------------------------------------
> library('pheatmap')
> pheatmap::pheatmap(mat = hnsc.og30.cosm$cosine_similarities, cluster_rows = FALSE, main = "cosine similarity against validated signatures")
> maftools::plotSignatures(nmfRes = hnsc.sig, title_size = 1, sig_db = "SBS")
特征富集分析
上面鑒定出的特征可以進(jìn)一步分配到樣品上扳肛,并且使用signatureEnrichment功能進(jìn)行富集分析偏竟,鑒定每一個(gè)特征的突變富集。
> hnsc.se <- signatureEnrichment(maf = maf, sig_res = hnsc.sig)
Running k-means for signature assignment..
Performing pairwise and groupwise comparisions..
Sample size per factor in Signature:
Signature_1 Signature_2
194 313
Estimating mutation load and signature exposures..
這一步有一點(diǎn)問(wèn)題敞峭,上一步計(jì)算特征數(shù)量的時(shí)候踊谋,如果我用的n=6得出的hnsc.sig,在這里就會(huì)報(bào)錯(cuò)旋讹。我查閱了網(wǎng)上殖蚕,也有人和我有一樣的報(bào)錯(cuò)情況,他是把n值降低沉迹,這一步就不會(huì)報(bào)錯(cuò)了(signatureEnrichment not assigning all samples to signatures),所以我試了n=2又重新運(yùn)行了一遍睦疫,這里就沒(méi)有出現(xiàn)報(bào)錯(cuò)信息,在這個(gè)網(wǎng)址里也提到鞭呕,如果他使用n>3的數(shù)值也會(huì)報(bào)錯(cuò)蛤育,不知道是不是個(gè)bug。當(dāng)然上面的圖也要重新做。這里我就不再貼一遍圖了瓦糕。
>plotEnrichmentResults(enrich_res = hnsc.se, pVal = 0.005)