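When analyzing mutations, we often choose a waterfall plot to present the results. Waterfall plots look complex and sophisticated, but in practice a single R package, GenVisR, is all you need. Today let's look at how to make one. Also, since this is not my own research area, please point out anything I describe inaccurately.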
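What is a waterfall plot
Wikipedia describes two kinds of waterfall plot, a 2D form and a 3D form. Here is a brief introduction to the 2D form. This kind of chart shows how a series of intermediate positive or negative values affects an initial value. The initial and final values (the endpoints) are usually drawn as whole columns, while the intermediate values appear as floating columns that start from the value of the previous column. The columns can be color-coded to distinguish positive from negative values.
As you can see, the example shows a profitability analysis. The waterfall plot used for mutations is not quite the same as this traditional waterfall chart, although the two look similar.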
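To make the arithmetic of the traditional 2D chart concrete, here is a minimal sketch in base R. The numbers are hypothetical and chosen only for illustration; the point is that each floating column starts where the previous one ended.

```r
# Hypothetical figures, chosen only to illustrate the arithmetic.
start   <- 100                    # initial value (drawn as a whole column)
changes <- c(30, -20, 15, -40)    # intermediate positive/negative values

# Each floating column starts where the previous one ended.
col_end   <- start + cumsum(changes)
col_start <- c(start, head(col_end, -1))
final     <- tail(col_end, 1)     # final value (drawn as a whole column)

steps <- data.frame(
  change    = changes,
  col_start = col_start,
  col_end   = col_end,
  direction = ifelse(changes >= 0, "positive", "negative")
)
print(steps)
```

The `direction` column is what a plotting layer would map to color.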
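In an SNP waterfall plot, the horizontal axis shows the samples and the vertical axis shows the genes. A filled cell means that the gene is mutated in that sample, and different colors represent different mutation types. The bar chart on top summarizes the mutations in each sample.
From this plot we therefore get both the details for each sample and a global view of the whole group of samples, which makes it a good way to present mutation status.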
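The layout described above can be sketched as a small gene-by-sample grid in base R. This is only an illustration of the data structure, not how GenVisR builds its plot; the sample names, genes, and mutation types are made up.

```r
# Toy mutation calls; sample and gene names are hypothetical.
calls <- data.frame(
  sample = c("S1", "S1", "S2", "S3", "S3", "S3"),
  gene   = c("TP53", "PIK3CA", "TP53", "TP53", "BRCA1", "PIK3CA"),
  type   = c("missense", "missense", "nonsense",
             "silent", "missense", "frame_shift"),
  stringsAsFactors = FALSE
)

# Main panel: genes on the vertical axis, samples on the horizontal axis.
# A filled cell holds the mutation type; NA means no mutation.
genes   <- sort(unique(calls$gene))
samples <- sort(unique(calls$sample))
grid <- matrix(NA_character_, nrow = length(genes), ncol = length(samples),
               dimnames = list(genes, samples))
grid[cbind(calls$gene, calls$sample)] <- calls$type
print(grid)

# Top bar chart: number of mutations per sample.
per_sample <- table(calls$sample)
print(per_sample)
```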
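How to make a waterfall plot
For this plot we use an R package called GenVisR. Besides waterfall plots, the package provides several other fairly complex plot types for showing mutations across multiple samples (see the figure below). For the details of each plot, refer to the GenVisR manual.
Today we use a data table shipped with the package called brcaMAF. As the name suggests, this is breast cancer data. It contains 50 samples, comes from TCGA, and is in MAF format.
MAF stands for Mutation Annotation Format. It is a tab-delimited text file that aggregates mutation information from VCF files. The format standard was defined by TCGA and covers the common information about a mutation; for further details see the MAF format description.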
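Because MAF is plain tab-delimited text, a file of your own can be read with base R. The sketch below writes a tiny made-up file first so that it is self-contained; the sample barcodes are hypothetical, and the leading `#version` line is an assumption about how some MAF files begin.

```r
# Write a tiny, made-up MAF-like file so the example is self-contained.
maf_path <- tempfile(fileext = ".maf")
writeLines(c(
  "#version 2.4",
  "Hugo_Symbol\tChromosome\tVariant_Classification\tTumor_Sample_Barcode",
  "TP53\t17\tMissense_Mutation\tSAMPLE-01",
  "PIK3CA\t3\tSilent\tSAMPLE-02"
), maf_path)

# MAF is tab-delimited; lines starting with '#' are skipped as comments here.
maf <- read.delim(maf_path, comment.char = "#", stringsAsFactors = FALSE)
str(maf)
```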
1) What data format is required
Let's first look at the brcaMAF data. It has 55 columns, such as Hugo_Symbol and Chromosome, and records 2773 mutations in total.
library(GenVisR)  # loads the package and its bundled brcaMAF data set
colnames(brcaMAF)
[1] "Hugo_Symbol" "Entrez_Gene_Id" "Center" "NCBI_Build" "Chromosome" "Start_Position"
[7] "End_Position" "Strand" "Variant_Classification" "Variant_Type" "Reference_Allele" "Tumor_Seq_Allele1"
[13] "Tumor_Seq_Allele2" "dbSNP_RS" "dbSNP_Val_Status" "Tumor_Sample_Barcode" "Matched_Norm_Sample_Barcode" "Match_Norm_Seq_Allele1"
[19] "Match_Norm_Seq_Allele2" "Tumor_Validation_Allele1" "Tumor_Validation_Allele2" "Match_Norm_Validation_Allele1" "Match_Norm_Validation_Allele2" "Verification_Status"
[25] "Validation_Status" "Mutation_Status" "Sequencing_Phase" "Sequence_Source" "Validation_Method" "Score"
[31] "BAM_File" "Sequencer" "Tumor_Sample_UUID" "Matched_Norm_Sample_UUID" "chromosome_name_WU" "start_WU"
[37] "stop_WU" "reference_WU" "variant_WU" "type_WU" "gene_name_WU" "transcript_name_WU"
[43] "transcript_species_WU" "transcript_source_WU" "transcript_version_WU" "strand_WU" "transcript_status_WU" "trv_type_WU"
[49] "c_position_WU" "amino_acid_change_WU" "ucsc_cons_WU" "domain_WU" "all_domains_WU" "deletion_substructures_WU"
[55] "transcript_error"
head(brcaMAF)
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status
1 A2ML1 144568 genome.wustl.edu 37 12 8994108 8994108 + Missense_Mutation SNP G G C novel
2 AADAC 13 genome.wustl.edu 37 3 151545656 151545656 + Missense_Mutation SNP A A G novel
3 AADAT 51166 genome.wustl.edu 37 4 170991750 170991750 + Silent SNP G G A novel
4 AASS 10157 genome.wustl.edu 37 7 121756793 121756793 + Missense_Mutation SNP G G A novel
5 ABAT 0 genome.wustl.edu 37 16 8857982 8857982 + Silent SNP G G A novel
6 ABCA3 21 genome.wustl.edu 37 16 2335631 2335631 + Missense_Mutation SNP C T T novel
Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2
1 TCGA-A1-A0SO-01A-22D-A099-09 TCGA-A1-A0SO-10A-03D-A099-09 G G G C G G
2 TCGA-A2-A0EU-01A-22W-A071-09 TCGA-A2-A0EU-10A-01W-A071-09 A A
3 TCGA-A2-A0ER-01A-21W-A050-09 TCGA-A2-A0ER-10A-01W-A055-09 G G
4 TCGA-A2-A0EN-01A-13D-A099-09 TCGA-A2-A0EN-10A-01D-A099-09 G G G A G G
5 TCGA-A1-A0SI-01A-11D-A142-09 TCGA-A1-A0SI-10B-01D-A142-09 G G
6 TCGA-A2-A0D0-01A-11W-A019-09 TCGA-A2-A0D0-10A-01W-A021-09 C C
Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID
1 Unknown Valid Somatic Phase_IV WXS Illumina_WXS_gDNA 1 dbGAP Illumina GAIIx b3568259-c63c-4eb1-bbc7-af711ddd33db 17ba8cdb-e35b-4496-a787-d1a7ee7d4a1e
2 Unknown Untested Somatic Phase_IV WXS none 1 dbGAP Illumina GAIIx de30da8f-903f-428e-a63d-59625fc858a9 1583a7c5-c835-44fa-918a-1448abf6533d
3 Unknown Untested Somatic Phase_IV WXS none 1 dbGAP Illumina GAIIx 31ed187e-9bfe-4ca3-8cbb-10c1e0184331 2bc2fdaf-fb2f-4bfd-9e20-e20edff6633a
4 Unknown Valid Somatic Phase_IV WXS Illumina_WXS_gDNA 1 dbGAP Illumina GAIIx 12362ad7-6866-4e7a-9ec6-8a0a68df8896 ad478c68-a18b-4529-ad7a-86039e6da6b1
5 Unknown Untested Somatic Phase_IV WXS none 1 dbGAP Illumina GAIIx e218c272-a7e1-4bc9-b8c5-d2d1c903550f fbcab9dc-4a6b-4928-9459-699c9932e3e1
6 Unknown Untested Somatic Phase_IV WXS none 1 dbGAP Illumina GAIIx 3f20d0fe-aaa1-40f1-b2c1-7f070f93aef5 bbf1c43d-d7b3-4574-a074-d22ad537829c
chromosome_name_WU start_WU stop_WU reference_WU variant_WU type_WU gene_name_WU transcript_name_WU transcript_species_WU transcript_source_WU transcript_version_WU strand_WU transcript_status_WU trv_type_WU
1 12 8994108 8994108 G C SNP A2ML1 NM_144670.3 human genbank 58_37c 1 validated missense
2 3 151545656 151545656 A G SNP AADAC NM_001086.2 human genbank 58_37c 1 reviewed missense
3 4 170991750 170991750 G A SNP AADAT NM_016228.3 human genbank 58_37c -1 reviewed silent
4 7 121756793 121756793 G A SNP AASS NM_005763.3 human genbank 58_37c -1 reviewed missense
5 16 8857982 8857982 G A SNP ABAT NM_000663.4 human genbank 58_37c 1 reviewed silent
6 16 2335631 2335631 C T SNP ABCA3 NM_001089.2 human genbank 58_37c -1 reviewed missense
c_position_WU amino_acid_change_WU ucsc_cons_WU domain_WU
1 c.1224 p.W408C 0.995 NULL
2 c.896 p.N299S 0.000 HMMPfam_Abhydrolase_3,superfamily_alpha/beta-Hydrolases
3 c.708 p.L236 1.000 HMMPfam_Aminotran_1_2,superfamily_PLP-dependent transferases
4 c.788 p.T263M 1.000 HMMPfam_AlaDh_PNT_C
5 c.423 p.E141 0.987 HMMPfam_Aminotran_3,superfamily_PyrdxlP-dep_Trfase_major
6 c.3295 p.D1099N 0.980 NULL
all_domains_WU
1 HMMPfam_A2M,HMMPfam_A2M_N,superfamily_Terpenoid cyclases/Protein prenyltransferases,HMMPfam_A2M_recep,superfamily_Alpha-macroglobulin receptor domain,HMMPfam_A2M_N_2,HMMPfam_A2M_comp,HMMPfam_Thiol-ester_cl,PatternScan_ALPHA_2_MACROGLOBULIN
2 PatternScan_LIPASE_GDXG_SER,HMMPfam_Abhydrolase_3,superfamily_alpha/beta-Hydrolases
3 HMMPfam_Aminotran_1_2,superfamily_PLP-dependent transferases
4 HMMPfam_Saccharop_dh,HMMPfam_AlaDh_PNT_C,HMMPfam_AlaDh_PNT_N,superfamily_NAD(P)-binding Rossmann-fold domains,superfamily_Formate/glycerate dehydrogenase catalytic domain-like,superfamily_Glyceraldehyde-3-phosphate dehydrogenase-like C-terminal domain
5 HMMPfam_Aminotran_3,PatternScan_AA_TRANSFER_CLASS_3,superfamily_PyrdxlP-dep_Trfase_major
6 HMMPfam_ABC_tran,HMMSmart_SM00382,PatternScan_ABC_TRANSPORTER_1,superfamily_P-loop containing nucleoside triphosphate hydrolases
deletion_substructures_WU transcript_error
1 - no_errors
2 - no_errors
3 - no_errors
4 - no_errors
5 - no_errors
6 - no_errors
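So does our own MAF file need that much information? No. As with the input for many other plots, some fields are required and the rest are optional.
Specifically, the waterfall function accepts three data formats:
1. MAF
Must include columns named "Tumor_Sample_Barcode", "Hugo_Symbol", "Variant_Classification"
2. MGI
Must include columns named "sample", "gene_name", "trv_type"
3. Custom
Must include columns named "sample", "gene", "variant_class"
MGI is also a tab-delimited text file; for details see the MGI format description.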
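The three sets of required columns listed above can be checked before plotting. The helper below is a hypothetical convenience function written for this post, not part of GenVisR; it only compares column names.

```r
required_cols <- list(
  MAF    = c("Tumor_Sample_Barcode", "Hugo_Symbol", "Variant_Classification"),
  MGI    = c("sample", "gene_name", "trv_type"),
  Custom = c("sample", "gene", "variant_class")
)

# check_waterfall_input() is a hypothetical helper, not a GenVisR function.
check_waterfall_input <- function(x, fileType = c("MAF", "MGI", "Custom")) {
  fileType <- match.arg(fileType)
  missing_cols <- setdiff(required_cols[[fileType]], colnames(x))
  if (length(missing_cols) > 0) {
    stop("Missing columns for fileType '", fileType, "': ",
         paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

toy <- data.frame(sample = "S1", gene = "TP53", variant_class = "missense")
check_waterfall_input(toy, "Custom")   # passes silently
ok_as_maf <- tryCatch(check_waterfall_input(toy, "MAF"),
                      error = function(e) FALSE)
print(ok_as_maf)
```

Running a check like this first gives a clearer message than a failure deep inside the plotting call.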
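2) How to draw the plot
The waterfall function has many parameters, so mutation information can be displayed according to your needs. Let's draw the plot step by step and show a few commonly used parameters; for the meaning of the others, see the help page ?waterfall
(The code in this post comes from the official GenVisR manual.)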
# The most basic plot
waterfall(brcaMAF, fileType="MAF")
# Show only mutations present in at least 6% of samples
waterfall(brcaMAF, mainRecurCutoff = 0.06)
# Mutation landscape for specific genes
waterfall(brcaMAF, plotGenes = c("PIK3CA", "TP53", "USH2A", "MLL3", "BRCA1"))
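To see what a recurrence cutoff such as 0.06 means, the sketch below computes, for toy data, the fraction of samples in which each gene is mutated. It assumes recurrence is "samples with at least one mutation in the gene, divided by all samples"; GenVisR's internal handling (for example, whether the boundary is inclusive) may differ. All values are hypothetical.

```r
# Toy data using the MAF column names; values are hypothetical.
toy_maf <- data.frame(
  Tumor_Sample_Barcode = c("S1", "S1", "S2", "S3", "S4", "S4"),
  Hugo_Symbol          = c("TP53", "TP53", "TP53", "PIK3CA", "TP53", "BRCA1"),
  stringsAsFactors = FALSE
)

n_samples <- length(unique(toy_maf$Tumor_Sample_Barcode))

# Count each gene at most once per sample, then divide by the sample count.
gene_sample <- unique(toy_maf[, c("Hugo_Symbol", "Tumor_Sample_Barcode")])
recurrence  <- table(gene_sample$Hugo_Symbol) / n_samples
print(recurrence)

# Genes that would pass a 50% cutoff in this toy example.
kept <- names(recurrence)[recurrence >= 0.5]
print(kept)
```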
We often need to analyze mutations together with the clinical information of the samples. So how do we show the clinical information in the same plot?
# Build the clinical information
# Five subtype groups
subtype <- c("lumA", "lumB", "her2", "basal", "normal")
# Assign a subtype to each sample at random, with replacement
subtype <- sample(subtype, 50, replace = TRUE)
# Four age groups
age <- c("20-30", "31-50", "51-60", "61+")
# Assign an age group to each sample at random
age <- sample(age, 50, replace = TRUE)
# Get the sample IDs
sample <- as.character(unique(brcaMAF$Tumor_Sample_Barcode))
# Combine everything into the clinical data frame
clinical <- as.data.frame(cbind(sample, subtype, age))
head(clinical)
sample subtype age
1 TCGA-A1-A0SO-01A-22D-A099-09 basal 51-60
2 TCGA-A2-A0EU-01A-22W-A071-09 basal 31-50
3 TCGA-A2-A0ER-01A-21W-A050-09 normal 61+
4 TCGA-A2-A0EN-01A-13D-A099-09 normal 31-50
5 TCGA-A1-A0SI-01A-11D-A142-09 lumA 31-50
6 TCGA-A2-A0D0-01A-11W-A019-09 normal 61+
# Melt the clinical data into 'long' format.
library(reshape2)
# Reshape the table: one row per sample/variable pair
clinical <- melt(clinical, id.vars = c("sample"))
head(clinical)
sample variable value
1 TCGA-A1-A0SO-01A-22D-A099-09 subtype basal
2 TCGA-A2-A0EU-01A-22W-A071-09 subtype basal
3 TCGA-A2-A0ER-01A-21W-A050-09 subtype normal
4 TCGA-A2-A0EN-01A-13D-A099-09 subtype normal
5 TCGA-A1-A0SI-01A-11D-A142-09 subtype lumA
6 TCGA-A2-A0D0-01A-11W-A019-09 subtype normal
# Run waterfall
waterfall(brcaMAF, clinData = clinical,
          # colors for each clinical value
          clinVarCol = c(lumA = "blue4", lumB = "deepskyblue", her2 = "hotpink2",
                         basal = "firebrick2", normal = "green4",
                         `20-30` = "#ddd1e7", `31-50` = "#bba3d0",
                         `51-60` = "#9975b9", `61+` = "#7647a2"),
          plotGenes = c("PIK3CA", "TP53", "USH2A", "MLL3", "BRCA1"),
          clinLegCol = 2,
          # order of the clinical values in the legend
          clinVarOrder = c("lumA", "lumB", "her2", "basal", "normal",
                           "20-30", "31-50", "51-60", "61+"))
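Before passing clinical data to the plot, it can help to confirm that the table is in long format and that every clinical value has a color. The following is a base-R sketch with hypothetical sample names; the reshaping step mirrors what melt() does above without needing reshape2.

```r
# Toy clinical table in wide form; sample names and values are hypothetical.
clin_wide <- data.frame(
  sample  = c("S1", "S2", "S3"),
  subtype = c("lumA", "basal", "lumA"),
  age     = c("31-50", "61+", "31-50"),
  stringsAsFactors = FALSE
)

# Base-R equivalent of the melt() step: one row per sample/variable pair.
clin_long <- data.frame(
  sample   = rep(clin_wide$sample, times = 2),
  variable = rep(c("subtype", "age"), each = nrow(clin_wide)),
  value    = c(clin_wide$subtype, clin_wide$age),
  stringsAsFactors = FALSE
)

palette <- c(lumA = "blue4", basal = "firebrick2",
             `31-50` = "#bba3d0", `61+` = "#7647a2")

# Every clinical value should have a named color; this should be empty.
uncolored <- setdiff(unique(clin_long$value), names(palette))
print(uncolored)
```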
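You can arrange your own data in MAF format, or choose the Custom format. Note that the MAF and MGI formats have fixed requirements for the names of mutation types. If your mutations are named differently, or include types that are not in the lists below, choose Custom, which places no restriction on mutation type names. The two columns of the table are independent lists of accepted names, not row-by-row equivalents.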
MAF | MGI |
---|---|
Nonsense_Mutation | nonsense |
Frame_Shift_Ins | frame_shift_del |
Frame_Shift_Del | frame_shift_ins |
Translation_Start_Site | splice_site_del |
Splice_Site | splice_site_ins |
Nonstop_Mutation | splice_site |
In_Frame_Ins | nonstop |
In_Frame_Del | in_frame_del |
Missense_Mutation | in_frame_ins |
5'Flank | missense |
3'Flank | splice_region_del |
5'UTR | splice_region_ins |
3'UTR | splice_region |
RNA | 5_prime_flanking_region |
Intron | 3_prime_flanking_region |
IGR | 3_prime_untranslated_region |
Silent | 5_prime_untranslated_region |
Targeted_Region | rna |
 | intronic |
 | silent |
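If your mutation names do not fit either list, the Custom route can be sketched as follows. The input table, its column names (`id`, `symbol`, `effect`), and the mutation labels are all hypothetical. As far as I understand the GenVisR manual, Custom input also needs the `variant_class_order` argument, which gives the mutation types from most to least severe; please check ?waterfall for your installed version.

```r
# Hypothetical in-house calls with non-standard mutation names.
my_calls <- data.frame(
  id     = c("P01", "P01", "P02", "P03"),
  symbol = c("TP53", "PIK3CA", "TP53", "BRCA1"),
  effect = c("stop_gain", "aa_change", "aa_change", "synonymous"),
  stringsAsFactors = FALSE
)

# Rename to the three columns the Custom format requires.
custom <- data.frame(
  sample        = my_calls$id,
  gene          = my_calls$symbol,
  variant_class = my_calls$effect,
  stringsAsFactors = FALSE
)

# Mutation types from most to least severe (an assumption for this toy data).
class_order <- c("stop_gain", "aa_change", "synonymous")
stopifnot(all(custom$variant_class %in% class_order))

# Only attempt the plot in an interactive session with GenVisR installed.
if (interactive() && requireNamespace("GenVisR", quietly = TRUE)) {
  GenVisR::waterfall(custom, fileType = "Custom",
                     variant_class_order = class_order)
}
```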
That's all for this post on waterfall plots~