Fusion Detection with FACTERA

FACTERA: https://factera.stanford.edu/download.php

FACTERA (Fusion And Chromosomal Translocation Enumeration and Recovery Algorithm) is a tool for detection of genomic fusions in paired-end targeted (or genome-wide) sequencing data.

Command

perl factera.pl [options] tumor.bam exons.bed hg19.2bit [optional: targets.bed]
The main program is a Perl script, so you can modify parts of it yourself to make it detect more fusions.
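When many samples are processed, the command line above can be assembled programmatically. The following is a minimal sketch and not part of FACTERA: the helper name `build_factera_command` and its arguments are hypothetical, and it only uses option letters listed in the Options section of this post.

```python
from typing import Dict, List, Optional, Union

OptionValue = Union[int, float, str, bool]


def build_factera_command(
    bam: str,
    exons_bed: str,
    twobit: str,
    targets_bed: Optional[str] = None,
    options: Optional[Dict[str, OptionValue]] = None,
    script: str = "factera.pl",
) -> List[str]:
    """Build argv for: perl factera.pl [options] bam exons.bed ref.2bit [targets.bed]."""
    cmd = ["perl", script]
    for flag, value in (options or {}).items():
        if value is True:
            cmd.append(f"-{flag}")          # switch without a value, e.g. -F
        elif value is False:
            continue                         # switch left off
        else:
            cmd.extend([f"-{flag}", str(value)])
    cmd.extend([bam, exons_bed, twobit])     # positional arguments, in order
    if targets_bed is not None:
        cmd.append(targets_bed)              # optional fourth positional argument
    return cmd


cmd = build_factera_command(
    "tumor.bam", "exons.bed", "hg19.2bit",
    targets_bed="targets.bed",
    options={"r": 5, "p": 10, "F": True},
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell quoting problems with file paths.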

Input

tumor.bam should consist of paired-end reads aligned by a mapping algorithm capable of soft-clipping, such as BWA. The BAM file does not need to be realigned or deduped, but should be position-sorted and have a corresponding index file (bam.bai created using SAMtools index) in the same directory in order to estimate the total sequencing depth in the neighborhood of each detected fusion.
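Because the index must sit next to the BAM file, a quick pre-flight check can save a failed run. This is a minimal sketch, assuming the index follows the `<name>.bam.bai` naming produced by `samtools index`; the function names are hypothetical, and the empty files in the demo only stand in for real data.

```python
import os
import tempfile


def bam_index_path(bam_path: str) -> str:
    """Index path next to the BAM, assuming the <name>.bam.bai convention."""
    return bam_path + ".bai"


def has_bam_index(bam_path: str) -> bool:
    """True if both the BAM and its .bai index exist in the same directory."""
    return os.path.isfile(bam_path) and os.path.isfile(bam_index_path(bam_path))


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    bam = os.path.join(tmp, "tumor.bam")
    open(bam, "wb").close()
    before = has_bam_index(bam)              # BAM present, index missing
    open(bam_index_path(bam), "wb").close()
    after = has_bam_index(bam)               # both present

print(before, after)
```

The check only confirms that the files exist; it does not verify that the BAM is position-sorted or that the index is up to date.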

exons.bed contains chromosomal coordinates (such as exon boundaries) in BED format. The first three columns give the coordinates (chr start end), and the fourth column contains gene names, exon names, or any arbitrary identifier, which is used to group corresponding coordinates. This grouping allows the resolution of fusion detection to be restricted to inter-gene or inter-exon fusions, for example. Make sure to use coordinates from the same genome version as the 2bit reference sequence given as the third argument.
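To illustrate how column 4 groups coordinates, the sketch below collects BED intervals by their identifier. It is not FACTERA's own code; the function name is hypothetical, the file is assumed to be whitespace-delimited with identifiers that contain no spaces, and the coordinates in the demo are invented rather than real exon boundaries.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]


def group_bed_by_name(lines: Iterable[str]) -> Dict[str, List[Interval]]:
    """Group (chrom, start, end) intervals by the identifier in column 4."""
    groups: Dict[str, List[Interval]] = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        # Skip blank lines, comments, and UCSC track/browser header lines.
        if not fields or fields[0].startswith("#") or fields[0] in ("track", "browser"):
            continue
        if len(fields) < 4:
            raise ValueError(f"line {line_no}: expected at least 4 columns, got {len(fields)}")
        chrom, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
        if start >= end:
            raise ValueError(f"line {line_no}: start must be smaller than end")
        groups[name].append((chrom, start, end))
    return dict(groups)


# Invented coordinates, for illustration only.
demo = [
    "chr2\t1000\t1200\tGENE_A",
    "chr2\t1500\t1700\tGENE_A",
    "chr5\t9000\t9300\tGENE_B",
]
groups = group_bed_by_name(demo)
for name, intervals in groups.items():
    print(name, len(intervals))
```

Using gene symbols in column 4 puts all exons of a gene in one group (inter-gene resolution), whereas a distinct identifier per exon keeps them separate (inter-exon resolution).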

Users can download this exons.bed file, which combines hg19 RefSeq and GENCODE v17 exon coordinates (downloaded from UCSC 02-23-14) with corresponding HUGO gene symbols in column 4. With this file, FACTERA will identify inter-gene fusions. To identify inter-exon fusions in hg19, use this exons.bed file.

hg19.2bit is a 2bit-encoded human reference genome, used for fast genome subsequence retrieval. Of note, FACTERA is not restricted to human sequences, and any 2bit reference genome can be used as long as coordinates in exons.bed are consistent. To create a 2bit file for a genome of interest, download the FASTA to 2bit conversion tool (faToTwoBit) from the appropriate system folder and follow these instructions.

targets.bed is optional and allows the user to restrict the FACTERA search to genomic regions of interest, such as those targeted by a sequencing capture library. Format is a standard 3-column BED (chr start end). The use of a targets.bed file can greatly improve running time when only a subset of sequenced regions is known to be relevant for fusion detection.

Output

Each FACTERA run produces 9 main output files, each of which is described below:

parameters.txt = all input files and parameter values.

discordantpair.depth.txt = ranked list of discordant read clusters.

discordantpair.details.txt = discordant read positions.

fusiontargets.bed = bed coordinates of candidate fusions – used to restrict search space for soft-clipped reads.

blastreads.fa = used to build blast database of soft-clipped, improperly paired, and unmapped reads.

blastquery.fa = file used to search individual candidate fusion sequences (query) for hits in blastreads.fa (target database).

fusionseqs.fa = all detected breakpoints with 500bp of additional flanking sequences.

fusions.bed = bed output for detected fusions. Useful for comparing runs or somatic vs germline (column 4 is fusion ID).

fusions.txt = all detected fusion events, including details, described below:

Field                   Description
Est_Type                Estimated structural variant type: TRA = translocation; INV = inversion; DEL = deletion; '-' = not determined
Region1                 Name of genomic region closest to breakpoint 1 (e.g., gene 1, exon 1, etc.)
Region2                 Name of genomic region closest to breakpoint 2 (e.g., gene 2, exon 2, etc.)
Break1                  Chromosomal breakpoint 1
Break2                  Chromosomal breakpoint 2
Break_support1          Number of reads supporting breakpoint 1
Break_support2          Number of reads supporting breakpoint 2
Break_Offset            Breakpoint adjustment in bases (e.g., owing to microhomology)
Order1                  Orientation of read clipping with respect to breakpoint 1: CN, clipped followed by not clipped; NC, vice versa
Order2                  Same as Order1, but for breakpoint 2
Break_depth             Number of breakpoint-spanning reads
Proper_pair_support     Number of properly paired and previously soft-clipped reads that map to fusion
Unmapped_support        Number of previously unmapped reads that map to fusion
Improper_pair_support   Number of previously discordantly paired reads that map to fusion
Paired_end_depth        Total number of paired-end reads that flank breakpoint
Total_depth             Mean total depth for regions flanking both breakpoints (+/-500bp by default)
Fusion_seq              Estimated fusion sequence (50 bases flanking breakpoint by default)
Non-templated_seq       Non-templated (i.e., non-reference) sequence segment (if any) enclosed in brackets
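For downstream filtering, fusions.txt can be loaded into dictionaries keyed by the field names above. The sketch below assumes the file is tab-delimited with a header row whose names match the Field column; that layout is an assumption, not something stated in this post. The function names are hypothetical, and the demo table is abbreviated and invented.

```python
import csv
import io
from typing import Dict, Iterable, List


def read_fusions(handle: Iterable[str]) -> List[Dict[str, str]]:
    """Read a tab-delimited table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def filter_by_break_depth(rows: List[Dict[str, str]], min_depth: int) -> List[Dict[str, str]]:
    """Keep fusions with at least min_depth breakpoint-spanning reads."""
    return [row for row in rows if int(row["Break_depth"]) >= min_depth]


# Abbreviated, invented table: only a few of the documented columns are shown.
demo_text = (
    "Est_Type\tRegion1\tRegion2\tBreak_support1\tBreak_support2\tBreak_depth\n"
    "TRA\tGENE_A\tGENE_B\t12\t9\t21\n"
    "INV\tGENE_C\tGENE_D\t3\t2\t5\n"
)
rows = read_fusions(io.StringIO(demo_text))
kept = filter_by_break_depth(rows, min_depth=10)
for row in kept:
    print(row["Est_Type"], row["Region1"], row["Region2"], row["Break_depth"])
```

For a real run, replace the `io.StringIO` object with `open(path, newline="")` on the fusions.txt file in the output directory.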

Requirements

Unix operating system (Linux, Mac OS X, etc.)
Perl 5, with the following external dependency: Statistics::Descriptive.
    To install Statistics::Descriptive from CPAN, issue the following command:
    sudo cpan Statistics::Descriptive
    Other Perl dependencies are included in the Perl 5 Core Modules and should already be installed: IPC::Open3, List::Util, File::Spec, Symbol, Getopt::Std, File::Basename.
twoBitToFa
    Find and download the executable from the appropriate system folder, then copy/link/move it to a directory on PATH (e.g., /usr/bin).
hg19.2bit, to run FACTERA on the hg19 human genome.
    Note that hg38.2bit is now available. To use another reference genome, make sure that input BED coordinates are consistent (the exons.bed file provided here is currently hg19 only).
blast+
    After downloading, find blastn and makeblastdb in ncbi-blast-version/bin and copy/link/move them to a directory on PATH (e.g., /usr/bin).
SAMtools
    After downloading, find samtools and copy/link/move it to a directory on PATH (e.g., /usr/bin).
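Before the first run, it can help to confirm that the executables listed above are actually reachable on PATH. This is a minimal sketch using Python's standard `shutil.which`; the function name `missing_tools` and the `REQUIRED_TOOLS` tuple are hypothetical conveniences, and the `which` parameter exists only so the lookup can be swapped out in tests.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Executables named in the Requirements list of this post.
REQUIRED_TOOLS = ("perl", "twoBitToFa", "blastn", "makeblastdb", "samtools")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All required executables were found on PATH.")
```

The Perl module is not covered by this check; it can be tested separately with `perl -MStatistics::Descriptive -e 1`, which exits with an error if the module is not installed.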

Options (defaults): Description
-o Output directory (tumor.bam directory).
-r <int> Minimum number of breakpoint-spanning reads needed for output (5).
-m <int> Minimum number of discordant reads needed for a candidate fusion (2).
-x <int> Maximum number of breakpoints to examine for any given pair of genomic regions (5).
-s <int> Minimum number of reads with the same breakpoint (1).
-f <0-1> Minimum fraction of read bases required for alignment to fusion template (0.9).
-S <0-1> Minimum similarity required for alignment of read to fusion template (0.95).
-k <int> k-mer size for fragment comparison (10 bases).
-c <int> Minimum size of soft-clipped region to consider (16 bases).
-b <int> Number of bases flanking breakpoint for fusion template (500).
-p <int> Number of threads for blastn search (4; 10 or more recommended).
-a <int> Number of bases flanking breakpoint to provide in output (50).
-e Disable grouping of input coordinates by column 4 of exons.bed (off).
-v Disable verbose output (off).
-t Disable running time output (off).
-C Disable addition of 'chr' prefix to chromosome names (off)***
-F Force remake of BLAST database for a particular input (off).
***Required if 'chr' is absent from all input files, including reference.2bit.
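Whether -C applies depends on the chromosome naming used across the inputs. The sketch below classifies the names found in a BED file as all prefixed, none prefixed, or mixed. It is only an illustration with hypothetical function names; it inspects BED text alone, so the BAM and 2bit files would still need to be checked by other means.

```python
from typing import Iterable, Iterator


def bed_chromosomes(lines: Iterable[str]) -> Iterator[str]:
    """Yield the chromosome name (column 1) of each BED data line."""
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments, and UCSC track/browser header lines.
        if not fields or fields[0].startswith("#") or fields[0] in ("track", "browser"):
            continue
        yield fields[0]


def chr_prefix_status(chrom_names: Iterable[str]) -> str:
    """Return 'all', 'none', or 'mixed' depending on use of the 'chr' prefix."""
    names = list(chrom_names)
    if not names:
        raise ValueError("no chromosome names found")
    with_prefix = sum(name.startswith("chr") for name in names)
    if with_prefix == len(names):
        return "all"
    if with_prefix == 0:
        return "none"
    return "mixed"


# Invented BED lines without the 'chr' prefix.
bed_lines = ["1\t100\t200\tGENE_A", "2\t300\t400\tGENE_B"]
status = chr_prefix_status(bed_chromosomes(bed_lines))
print(status)  # none
```

A result of "none" for every input suggests that -C is needed, while "mixed" points to inconsistent naming that is better fixed before running FACTERA.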

FAQ

1. Which aligners are supported?

Answer: FACTERA was developed and optimized using targeted sequencing data aligned by bwa aln, and we currently recommend that users employ bwa aln for best performance. While FACTERA can be applied to data mapped by bwa mem, users should be aware of the following considerations when interpreting results. The most notable difference between bwa aln and mem with respect to fusion detection is the use of hard clipping in addition to soft clipping by bwa mem. Absent from bwa aln, hard clipping enables bwa mem to improve the mapping rate by realigning (rather than truncating) sufficiently long read segments in chimeric sequences. In contrast, bwa aln will truncate such reads without realignment (soft-clipping), and FACTERA leverages soft clipped, but not hard clipped, reads for breakpoint detection. Hard clipped reads will be supported in a future release of FACTERA, and we will notify registered users when this version is available.

2. According to the paper, FACTERA has high specificity. Why does FACTERA report some fusions that appear to be false positives?

Answer: False positive calls may arise from mapping artifacts (due to repeat sequences), PCR template switching, and other sequencing errors, and are increasingly difficult to avoid as the sequencing space grows in size and complexity. While we have implemented a variety of post-processing algorithms to reduce the false positive rate compared to previous methods (see Reference), the elimination of all fusions with repetitive content would risk discarding genuine events. We therefore recommend that users inspect the FACTERA output for possible false positives by using BLAT and the UCSC human genome browser. This is particularly important when using FACTERA to analyze exome or genome-scale datasets. In cases where paired normal datasets are available, we recommend leveraging this information to reduce the false positive rate. Finally, we would welcome suggestions from users on how to best discriminate real fusions in repeat regions from sequencing artifacts. Please send us your feedback/suggestions along with fusion results that you suspect are not real. This will help us to compile a blacklist of poorly behaving genomic regions that might be useful as a post-processing filter.
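One simple way to use a paired normal sample is to drop tumor calls whose pair of regions is also reported in the normal. The sketch below is a heuristic of this kind and not a method described by the FACTERA authors. It assumes both runs have been loaded as dictionaries with the Region1 and Region2 fields from fusions.txt; the function names and the demo rows are invented.

```python
from typing import Dict, FrozenSet, List


def region_pair(row: Dict[str, str]) -> FrozenSet[str]:
    """Order-independent key, so A-B and B-A count as the same pair."""
    return frozenset((row["Region1"], row["Region2"]))


def tumor_only(
    tumor_rows: List[Dict[str, str]],
    normal_rows: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Keep tumor fusions whose region pair is absent from the normal run."""
    seen_in_normal = {region_pair(row) for row in normal_rows}
    return [row for row in tumor_rows if region_pair(row) not in seen_in_normal]


# Invented calls for illustration.
tumor = [
    {"Region1": "GENE_A", "Region2": "GENE_B"},
    {"Region1": "GENE_C", "Region2": "GENE_D"},
]
normal = [
    {"Region1": "GENE_D", "Region2": "GENE_C"},  # same pair, reversed order
]
somatic_candidates = tumor_only(tumor, normal)
for row in somatic_candidates:
    print(row["Region1"], row["Region2"])
```

Matching on region names is coarse: it removes every tumor call between two regions as soon as the normal has any call there. Comparing breakpoint coordinates would be stricter, at the cost of missing artifacts whose position shifts slightly between runs.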

Reference

Aaron M. Newman, Scott V. Bratman, Henning Stehr, Luke J. Lee, Chih Long Liu, Maximilian Diehn* and Ash A. Alizadeh* (2014) FACTERA: a practical method for the discovery of genomic rearrangements at breakpoint resolution, Bioinformatics DOI: 10.1093/bioinformatics/btu549.
