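Background
With advances in cancer genomics, the Mutation Annotation Format (MAF) has become widely accepted for storing detected somatic variants. The Cancer Genome Atlas (TCGA) project sequenced more than 30 different cancers, with more than 200 samples per cancer type, and the resulting somatic variant data are stored in MAF format. As long as the data are in MAF format, the maftools package attempts to summarize, analyze, annotate and visualize MAF files efficiently, whether they come from TCGA or from any in-house study.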
Preparation
Before using maftools, convert your files to MAF format. VCF files can be converted with vcf2maf.
A MAF file contains:
Mandatory fields: Hugo_Symbol, Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2, Variant_Classification, Variant_Type and Tumor_Sample_Barcode.
Recommended optional fields: non MAF specific fields containing VAF (Variant Allele Frequency) and amino acid change information.
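Before handing a table to maftools it can help to confirm that the mandatory columns listed above are present. The following is a minimal sketch in base R; `missing_maf_fields` is a hypothetical helper written for this illustration, not a maftools function, and the toy table is made up.

```r
# Mandatory MAF columns, as listed above.
required_maf_fields <- c(
  "Hugo_Symbol", "Chromosome", "Start_Position", "End_Position",
  "Reference_Allele", "Tumor_Seq_Allele2", "Variant_Classification",
  "Variant_Type", "Tumor_Sample_Barcode"
)

# Hypothetical helper (not part of maftools): returns the mandatory
# columns that are absent from a data.frame.
missing_maf_fields <- function(df) {
  setdiff(required_maf_fields, colnames(df))
}

# Toy table (made-up values) that lacks the last four mandatory columns.
toy <- data.frame(
  Hugo_Symbol = "TP53", Chromosome = "17",
  Start_Position = 7577121, End_Position = 7577121,
  Reference_Allele = "G", stringsAsFactors = FALSE
)
missing_fields <- missing_maf_fields(toy)
print(missing_fields)
```

An empty result means every mandatory column is present.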
Format conversion
# Annotate the mutation calls with ANNOVAR; each input file yields <prefix>.hg19_multianno.txt
for i in *.somatic.anno; do perl ~/software/Desktop/annovar/table_annovar.pl "$i" /home/yang.zou/database/humandb_new/ -buildver hg19 -out "${i%%.*}" --otherinfo -remove -protocol ensGene -operation g -nastring NA; done
# Append the file-name prefix as an extra tab-separated column to every .hg19_multianno.txt file, concatenate them into one file, then extract the exonic records
for i in *.hg19_multianno.txt; do sed '1d' "$i" | sed "s/$/\t${i%%.*}/" >> all_annovar; done
grep -P "\texonic\t" all_annovar > all_annovar2
# Format conversion
perl to-maftools.pl all_annovar2 # adds a header and writes all_annovar3, the input for annovarToMaf
# to-maftools.pl
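use strict;
use warnings;
# Input file: first command-line argument, falling back to all_annovar2
my $in = shift // "all_annovar2";
open(my $fa, "<", $in) or die "Cannot open $in: $!";
open(my $fb, ">", "all_annovar3") or die "Cannot open all_annovar3: $!";
# Header expected by annovarToMaf, plus the sample column added earlier
print $fb "Chr\tStart\tEnd\tRef\tAlt\tFunc.ensGene\tGene.ensGene\tGeneDetail.ensGene\tExonicFunc.ensGene\tAAChange.ensGene\tTumor_Sample_Barcode\n";
while (<$fa>) {
    chomp;
    my @l = split /\t/, $_;
    # Keep the first 11 columns only
    print $fb join("\t", @l[0..10]), "\n";
}
close $fa;
close $fb;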
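The shell and Perl steps above (add a sample column, concatenate, keep exonic records) could also be done in R alone. The following is a minimal sketch under stated assumptions: the two input files are toy tables written to a temporary directory, the names `sampleA`/`sampleB` and the gene IDs are made up, and the tables contain only the ten `ensGene` columns.

```r
# Toy ANNOVAR-like table (made-up values) written once per sample.
tmp <- tempfile("annovar_demo_")
dir.create(tmp)
demo <- data.frame(
  Chr = "1", Start = c(100, 200), End = c(100, 200), Ref = "A", Alt = "G",
  Func.ensGene = c("exonic", "intronic"),
  Gene.ensGene = c("ENSG_demo1", "ENSG_demo2"),
  GeneDetail.ensGene = NA, ExonicFunc.ensGene = c("nonsynonymous SNV", NA),
  AAChange.ensGene = NA, stringsAsFactors = FALSE
)
for (s in c("sampleA", "sampleB")) {
  write.table(demo, file.path(tmp, paste0(s, ".hg19_multianno.txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# Read every file and add the file-name prefix (text before the first dot)
# as Tumor_Sample_Barcode, like ${i%%.*} in the shell loop.
files <- list.files(tmp, pattern = "\\.hg19_multianno\\.txt$", full.names = TRUE)
tables <- lapply(files, function(f) {
  x <- read.delim(f, stringsAsFactors = FALSE)
  x$Tumor_Sample_Barcode <- sub("\\..*$", "", basename(f))
  x
})
all_annovar <- do.call(rbind, tables)

# Keep records whose Func.ensGene field is exactly "exonic".
exonic <- all_annovar[all_annovar$Func.ensGene == "exonic", ]
print(exonic[, c("Chr", "Start", "Func.ensGene", "Tumor_Sample_Barcode")])
```

Because the header is kept by `read.delim`, no separate header-writing step is needed in this variant.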
Overall analysis workflow
Installing maftools
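# biocLite works for the R 3.4 series used here
source("http://bioconductor.org/biocLite.R")
biocLite("maftools")
# On R >= 3.5 use BiocManager instead: install.packages("BiocManager"); BiocManager::install("maftools")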
library(maftools)
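Note: installation was particularly troublesome and took me several days. R 3.3 or later is required, but avoid the very latest release, because some packages may not have been updated for it yet. I used: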
version.string R version 3.4.1 (2017-06-30)
Main analysis
Reading the ANNOVAR file and converting it to MAF: annovarToMaf
# Convert the ANNOVAR table to MAF and write it to disk
var.annovar.maf = annovarToMaf(annovar = "all_annovar3", Center = 'NA', refBuild = 'hg19', tsbCol = 'Tumor_Sample_Barcode', table = 'ensGene',sep = "\t")
write.table(x=var.annovar.maf,file="var_annovar_maf",quote= F,sep="\t",row.names=F)
The annovarToMaf function
Description
Converts variant annotations from Annovar into a basic MAF.
Usage
annovarToMaf(annovar, Center = NULL, refBuild = "hg19", tsbCol = NULL,
table = "refGene", basename = NULL, sep = "\t", MAFobj = FALSE,
sampleAnno = NULL)
Arguments
| Argument | Description |
| --- | --- |
| annovar | input annovar annotation file. |
| Center | Center field in MAF file will be filled with this value. Default NA. |
| refBuild | NCBI_Build field in MAF file will be filled with this value. Default hg19. |
| tsbCol | column name containing Tumor_Sample_Barcode or sample names in input file. |
| table | reference table used for gene-based annotations. Can be 'ensGene' or 'refGene'. Default 'refGene'. |
| basename | If provided writes resulting MAF file to an output file. |
| sep | field separator for input file. Default tab separated. |
| MAFobj | If TRUE, returns results as an MAF object. |
| sampleAnno | annotations associated with each sample/Tumor_Sample_Barcode in input annovar file. If provided it will be included in MAF object. Could be a text file or a data.frame. Ideally annotation would contain clinical data, survival information and other necessary features associated with samples. Default NULL. |
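Next, use Linux command-line tools to deal with uninformative records (rows whose first field is NA or that contain an UNKNOWN field); these can also be filtered out later through function arguments. Matching the whole field (NA followed by a tab) avoids hitting gene symbols that merely start with "NA".
# Option 1: relabel rows whose first field is exactly NA
sed 's/^NA\t/unknown\t/' var_annovar_maf > var_annovar_maf2
# Option 2 (used below; overwrites the file from option 1): drop those rows, and rows with an UNKNOWN field
grep -v -P "^NA\t" var_annovar_maf | grep -v -P "\tUNKNOWN\t" > var_annovar_maf2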
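The same filtering can be done in R before calling read.maf. This is a minimal sketch on a made-up table; it assumes the uninformative rows are those with a missing `Hugo_Symbol` or a `Variant_Classification` of `UNKNOWN`, which is narrower than the grep above (that one matches UNKNOWN in any field).

```r
# Toy MAF-like table (made-up values).
maf_df <- data.frame(
  Hugo_Symbol = c("TP53", NA, "KRAS", "NAT1"),
  Variant_Classification = c("Missense_Mutation", "Silent", "UNKNOWN",
                             "Missense_Mutation"),
  stringsAsFactors = FALSE
)

# Keep rows with a gene symbol and a known classification.
# "NAT1" is kept: only a truly missing symbol is dropped.
keep <- !is.na(maf_df$Hugo_Symbol) &
  maf_df$Variant_Classification != "UNKNOWN"
maf_clean <- maf_df[keep, ]
print(maf_clean)
```

read.maf also accepts an already loaded data.frame, so `maf_clean` could be passed on directly once it has all mandatory columns.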
Reading the MAF file: read.maf
var_maf = read.maf(maf ="var_annovar_maf2")
plotmafSummary(maf = var_maf, rmOutlier = TRUE, addStat = 'median')
oncoplot(maf = var_maf, top = 400, writeMatrix=T,removeNonMutated = F,showTumorSampleBarcodes=T)
Description
Takes tab delimited MAF (can be plain text or gz compressed) file as an input and summarizes it in various ways. Also creates oncomatrix - helpful for visualization.
Usage
read.maf(maf, clinicalData = NULL, removeDuplicatedVariants = TRUE,
useAll = TRUE, gisticAllLesionsFile = NULL, gisticAmpGenesFile = NULL,
gisticDelGenesFile = NULL, gisticScoresFile = NULL, cnLevel = "all",
cnTable = NULL, isTCGA = FALSE, vc_nonSyn = NULL, verbose = TRUE)
Arguments
maf
tab delimited MAF file. File can also be gz compressed. Required. Alternatively, you can also provide an already read MAF file as a data.frame.
clinicalData
Clinical data associated with each sample/Tumor_Sample_Barcode in MAF. Could be a text file or a data.frame. Default NULL.
removeDuplicatedVariants
removes repeated variants in a particular sample, mapped to multiple transcripts of same Gene. See Description. Default TRUE.
useAll
logical. Whether to use all variants irrespective of values in Mutation_Status. Defaults to TRUE. If FALSE, only uses variants with the value Somatic.
gisticAllLesionsFile
All Lesions file generated by gistic, e.g. all_lesions.conf_XX.txt, where XX is the confidence level. Default NULL.
gisticAmpGenesFile
Amplification Genes file generated by gistic, e.g. amp_genes.conf_XX.txt, where XX is the confidence level. Default NULL.
gisticDelGenesFile
Deletion Genes file generated by gistic, e.g. del_genes.conf_XX.txt, where XX is the confidence level. Default NULL.
gisticScoresFile
scores.gistic file generated by gistic. Default NULL.
cnLevel
level of CN changes to use. Can be 'all', 'deep' or 'shallow'. Default uses all, i.e. genes with either 'shallow' or 'deep' CN changes.
cnTable
Custom copynumber data if gistic results are not available. Input file or a data.frame should contain three columns with gene name, Sample name and copy number status (either 'Amp' or 'Del'). Default NULL.
isTCGA
Is input MAF file from TCGA source. If TRUE uses only first 12 characters from Tumor_Sample_Barcode.
vc_nonSyn
NULL. Provide manual list of variant classifications to be considered as non-synonymous. Rest will be considered as silent variants. Default uses Variant Classifications with High/Moderate variant consequences (http://asia.ensembl.org/Help/Glossary?id=535): "Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site", "Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Del", "In_Frame_Ins", "Missense_Mutation"
verbose
TRUE logical. Default to be talkative and prints summary.
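Several examples in this post use an object called `laml`, the TCGA LAML example data shipped with maftools. The following is a minimal sketch of how it can be created, assuming maftools is installed and that its bundled files `tcga_laml.maf.gz` and `tcga_laml_annot.tsv` are available (as in the maftools vignette). The block does nothing if the package is missing.

```r
# Runs only when maftools is installed.
if (requireNamespace("maftools", quietly = TRUE)) {
  library(maftools)
  # Example MAF and clinical annotation bundled with the package.
  laml.maf <- system.file("extdata", "tcga_laml.maf.gz", package = "maftools")
  laml.clin <- system.file("extdata", "tcga_laml_annot.tsv", package = "maftools")
  laml <- read.maf(maf = laml.maf, clinicalData = laml.clin)
  # Per-sample and per-gene summary tables.
  print(getSampleSummary(laml))
  print(getGeneSummary(laml))
}
```

For your own data, replace the two paths with your MAF file (for example `var_annovar_maf2`) and, optionally, a clinical table keyed by Tumor_Sample_Barcode.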
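Plotting a summary of the MAF file: plotmafSummary
This function shows the number of variants in each sample as a stacked bar plot and the variant types as a box plot summarized by Variant_Classification. A mean or median line can be added to the stacked bar plot to show the mean/median number of variants across the cohort.
# laml is the TCGA LAML example MAF object used in the maftools vignette
plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)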
Description
Plots maf summary.
Usage
plotmafSummary(maf, file = NULL, rmOutlier = TRUE, dashboard = TRUE,
titvRaw = TRUE, width = 10, height = 7, addStat = NULL,
showBarcodes = FALSE, fs = 10, textSize = 2, color = NULL,
statFontSize = 3, titleSize = c(10, 8), titvColor = NULL, top = 10)
Arguments
file
If given pdf file will be generated.
rmOutlier
If TRUE removes outlier from boxplot.
dashboard
If FALSE plots simple summary instead of dashboard style.
titvRaw
TRUE. If false instead of raw counts, plots fraction.
width
plot parameter for output file.
height
plot parameter for output file.
addStat
Can be either mean or median. Default NULL.
showBarcodes
include sample names in the top bar plot.
fs
base size for text. Default 10.
color
named vector of colors for each Variant_Classification.
titvColor
colors for SNV classifications.
top
include top n genes dashboard plot. Default 10.
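Drawing waterfall plots: oncoplot
The oncoplot function uses ComplexHeatmap to draw oncoplots. Specifically, oncoplot is a wrapper around ComplexHeatmap's OncoPrint function, with a few modifications and some automation that make plotting easier. The side bar plot and the top bar plot can be controlled with the drawRowBar and drawColBar arguments, respectively.
The appropriate value of top depends on the data set.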
oncoplot(maf = var_maf, top = 400, writeMatrix=T,removeNonMutated = F,showTumorSampleBarcodes=T)
Description
takes output generated by read.maf and draws an oncoplot
Usage
oncoplot(maf, top = 20, genes = NULL, mutsig = NULL, mutsigQval = 0.1,
drawRowBar = TRUE, drawColBar = TRUE, clinicalFeatures = NULL,
annotationDat = NULL, annotationColor = NULL, genesToIgnore = NULL,
showTumorSampleBarcodes = FALSE, removeNonMutated = TRUE, colors = NULL,
sortByMutation = FALSE, sortByAnnotation = FALSE,
annotationOrder = NULL, keepGeneOrder = FALSE, GeneOrderSort = TRUE,
sampleOrder = NULL, writeMatrix = FALSE, fontSize = 10,
SampleNamefontSize = 10, titleFontSize = 15, legendFontSize = 12,
annotationFontSize = 12, annotationTitleFontSize = 12)
Arguments
maf
an MAF object generated by read.maf
top
how many top genes to be drawn. Defaults to 20.
genes
Just draw oncoplot for these genes. Default NULL.
mutsig
MutSig results if available, usually a file named sig_genes.txt. If provided, plots significant genes and the corresponding Q-values as a side row-bar. Default NULL.
mutsigQval
Q-value to choose significant genes from mutsig results. Default 0.1.
genesToIgnore
do not show these genes in Oncoplot. Default NULL.
showTumorSampleBarcodes
logical to include sample names.
removeNonMutated
Logical. If TRUE removes samples with no mutations in the oncoplot for better visualization. Default TRUE.
sortByMutation
Force sort matrix according to mutations. Helpful in case the MAF was read along with copy number data. Default FALSE.
sortByAnnotation
logical. Sort oncomatrix (samples) by provided 'clinicalFeatures'. Sorts based on first 'clinicalFeatures'. Defaults to FALSE (column-sort).
annotationOrder
Manually specify order for annotations. Works only for first 'clinicalFeatures'. Default NULL.
keepGeneOrder
logical whether to keep order of given genes. Default FALSE, order according to mutation frequency.
GeneOrderSort
logical; applicable when 'keepGeneOrder' is TRUE. Default TRUE.
sampleOrder
Manually specify sample names for oncoplot ordering. Default NULL.
writeMatrix
writes character coded matrix used to generate the plot to an output file. This can be used as an input for ComplexHeatmap oncoPrint function if you wish to customize the plot.
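Oncoplots can be improved further by including annotations associated with the samples (clinical features), changing the colors of the variant classifications, and including q-values of significance (generated by MutSig or similar programs).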
col = RColorBrewer::brewer.pal(n = 8, name = 'Paired')
names(col) = c('Frame_Shift_Del','Missense_Mutation', 'Nonsense_Mutation', 'Multi_Hit', 'Frame_Shift_Ins','In_Frame_Ins', 'Splice_Site', 'In_Frame_Del')
#Color coding for FAB classification; try getAnnotations(x = laml) to see available annotations.
fabcolors = RColorBrewer::brewer.pal(n = 8,name = 'Spectral')
names(fabcolors) = c("M0", "M1", "M2", "M3", "M4", "M5", "M6", "M7")
fabcolors = list(FAB_classification = fabcolors)
#MutSig results
laml.mutsig <- system.file("extdata", "LAML_sig_genes.txt.gz", package = "maftools")
oncoplot(maf = laml, colors = col, mutsig = laml.mutsig, mutsigQval = 0.01, clinicalFeatures = 'FAB_classification', sortByAnnotation = TRUE, annotationColor = fabcolors)
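The oncostrip function visualizes any set of genes, drawing the mutations in each sample in a way similar to the OncoPrinter tool on cBioPortal. oncostrip can draw any number of genes using the top or genes arguments.
# Show specific genes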
oncostrip(maf = laml, genes = c('DNMT3A','NPM1', 'RUNX1'))
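Drawing box plots: titv
The titv function classifies SNPs into transitions and transversions and returns a list of summarized tables in various ways. The summarized data can also be visualized as a box plot showing the overall distribution of the six different conversions, and as a stacked bar plot showing the fraction of conversions in each sample.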
Description
takes output generated by read.maf and classifies Single Nucleotide Variants into Transitions and Transversions.
Usage
titv(maf, useSyn = FALSE, plot = TRUE, file = NULL)
Arguments
useSyn
Logical. Whether to include synonymous variants in analysis. Defaults to FALSE.
plot
plots a titv fractions. default TRUE.
file
basename for output file name. If given writes summaries to output file. Default NULL.
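To make the "six different conversions" concrete, here is a minimal base R sketch of how single-nucleotide changes can be collapsed onto a pyrimidine reference (C or T) and labelled as transitions (Ti) or transversions (Tv). `classify_snv` is a hypothetical helper written for this illustration; it is not the code titv uses internally.

```r
# Hypothetical helper: classify SNVs into six conversions and Ti/Tv.
classify_snv <- function(ref, alt) {
  comp <- c(A = "T", C = "G", G = "C", T = "A")
  # Purine references (A, G) are replaced by their complement, so every
  # change is expressed with C or T as the reference base.
  flip <- ref %in% c("A", "G")
  ref2 <- ifelse(flip, unname(comp[ref]), ref)
  alt2 <- ifelse(flip, unname(comp[alt]), alt)
  conv <- paste0(ref2, ">", alt2)
  data.frame(
    conversion = conv,
    # C>T and T>C are transitions; the other four are transversions.
    class = ifelse(conv %in% c("C>T", "T>C"), "Ti", "Tv"),
    stringsAsFactors = FALSE
  )
}

res <- classify_snv(ref = c("G", "C", "A", "T"), alt = c("A", "A", "G", "G"))
print(res)
```

Here G>A is reported as C>T (Ti) and A>G as T>C (Ti), while C>A and T>G are transversions.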
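Drawing lollipop plots: lollipopPlot
Lollipop plots are a simple and effective way of showing mutation spots on a protein structure. Many oncogenes have preferential sites that are mutated more often than any other locus. These spots are considered mutational hotspots, and lollipop plots can display them along with the rest of the mutations. Such plots can be drawn with the function lollipopPlot. This function requires amino acid change information in the MAF file. However, MAF files have no clear guideline on naming the field for amino acid changes, and different studies use different field (or column) names. By default, lollipopPlot looks for the column AAChange; if it is not found in the MAF file, the function prints all available fields with a warning message. In the maftools example data (TCGA LAML), the amino acid changes are stored under the field/column name "Protein_Change", which is specified manually with the argument AACol. The function also returns the plot as a ggplot object, which the user can modify later if needed.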
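A minimal sketch of such a call, assuming maftools is installed and using its bundled TCGA LAML example file; the gene `DNMT3A` is chosen only because it appears elsewhere in this post. The block does nothing if the package is missing.

```r
# Runs only when maftools is installed.
if (requireNamespace("maftools", quietly = TRUE)) {
  library(maftools)
  laml <- read.maf(
    maf = system.file("extdata", "tcga_laml.maf.gz", package = "maftools")
  )
  # The example MAF stores amino acid changes in 'Protein_Change',
  # so the column is named explicitly through AACol.
  lollipopPlot(maf = laml, gene = "DNMT3A", AACol = "Protein_Change")
}
```

For a MAF produced by annovarToMaf, check the column names of your file first and pass whichever column holds the amino acid change to AACol.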
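maftools can produce many other plots as well, for example:
The geneCloud function draws a word cloud of mutated genes. The size of each gene is proportional to the total number of samples in which it is mutated/altered.
Many disease-causing genes in cancer co-occur or show strong exclusivity in their mutation patterns. Such mutually exclusive or co-occurring sets of genes can be detected with the somaticInteractions function, which performs pair-wise Fisher's Exact tests to detect such significant pairs of genes. somaticInteractions also uses cometExactTest to identify potentially altered gene sets involving more than two genes.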
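The idea behind the pair-wise test can be illustrated with base R alone. This is a minimal sketch on made-up mutation calls for two hypothetical genes in 12 samples; it shows the principle of a Fisher's Exact test on a 2x2 table, not the internal code of somaticInteractions.

```r
# Toy mutation status (TRUE = mutated) for two hypothetical genes.
geneA <- rep(c(TRUE, FALSE), times = c(5, 7))
geneB <- c(rep(FALSE, 5), rep(TRUE, 5), rep(FALSE, 2))

# 2x2 contingency table of mutated / not mutated in each gene.
tab <- table(
  geneA = factor(geneA, levels = c(FALSE, TRUE)),
  geneB = factor(geneB, levels = c(FALSE, TRUE))
)
print(tab)

# Fisher's Exact test; an odds ratio below 1 points towards mutual
# exclusivity, above 1 towards co-occurrence.
ft <- fisher.test(tab)
event <- if (ft$estimate < 1) "mutually exclusive" else "co-occurring"
cat("p-value:", ft$p.value, "| tendency:", event, "\n")
```

In this toy table no sample carries both mutations, so the estimated odds ratio is 0 and the pair is labelled mutually exclusive; whether that is significant depends on the p-value.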
The maftools package is very powerful; for details, see:
http://bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/maftools.html