基于參考基因組的基因組組裝和注釋

將基因組組裝到染色體水平無非就是兩種方式：

獨(dú)立組裝（de novo）逝薪；
基于參考基因組的組裝（reference-guided），即基于近緣或同一物種的基因組染色體同源性進(jìn)行組裝苍息。

獨(dú)立組裝中的contigs或scaffolds一般用遺傳圖譜、BioNano光學(xué)圖譜或HiC技術(shù)進(jìn)行染色體水平的掛闰集，現(xiàn)在比較常用是HiC技術(shù)恰梢。基于參考基因組的組裝一般是將contigs或scaffolds與近緣或同一物種的基因組進(jìn)行比對酿秸，從而提升至染色體水平灭翔，比較常用的是RagTag工具包，是RaGOO的升級版。有的時(shí)候肝箱，我們沒有相關(guān)圖譜數(shù)據(jù)哄褒，只能采用此方法將基因組提升至染色體水平。

RagTag可以進(jìn)行錯(cuò)誤組裝校正煌张、scaffold組裝和修補(bǔ)呐赡、scaffold合并等。之后骏融，可以用Liftoff進(jìn)行基因注釋链嘀。

RagTag

Liftoff

RagTag簡介

RagTag的流程一共四步：correct，scaffold档玻，patch怀泊，merge。

image-20210710151440177

軟件安裝

conda install -c bioconda ragtag

correct

校正是使用參考基因組來鑒定和校正contigs中的組裝錯(cuò)誤误趴，該步驟不會(huì)將序列減少或增加霹琼，僅僅是將序列在錯(cuò)誤組裝的位置進(jìn)行打斷。

用法

usage: ragtag.py correct <reference.fa> <query.fa>

Homology-based assembly correction: Correct sequences in 'query.fa' by comparing them to sequences in 'reference.fa'>

positional arguments:
  <reference.fa>        reference fasta file (uncompressed or bgzipped)
  <query.fa>            query fasta file (uncompressed or bgzipped)

optional arguments:
  -h, --help            show this help message and exit

correction options:
  -f INT                minimum unique alignment length [1000]
  --remove-small        remove unique alignments shorter than -f
  -q INT                minimum mapq (NA for Nucmer alignments) [10]
  -d INT                maximum alignment merge distance [100000]
  -b INT                minimum break distance from contig ends [5000]
  -e <exclude.txt>      list of reference headers to ignore [null]
  -j <skip.txt>         list of query headers to leave uncorrected [null]
  --inter               only break misassemblies between reference sequences
  --intra               only break misassemblies within reference sequences
  --gff <features.gff>  don't break sequences within gff intervals [null]

input/output options:
  -o PATH               output directory [./ragtag_output]
  -w                    overwrite intermediate files
  -u                    add suffix to unaltered sequence headers

mapping options:
  -t INT                number of minimap2/unimap threads [1]
  --aligner PATH        whole genome aligner executable ('nucmer', 'unimap' or 'minimap2') [minimap2]
  --mm2-params STR      space delimited minimap2 whole genome alignment parameters ['-x asm5']
  --unimap-params STR   space delimited unimap parameters ['-x asm5']
  --nucmer-params STR   space delimted nucmer whole genome alignment parameters ['--maxmatch -l 100 -c 500']

validation options:
  --read-aligner PATH   read aligner executable (only 'minimap2' is allowed) [minimap2]
  -R <reads.fasta>      validation reads (uncompressed or gzipped) [null]
  -F <reads.fofn>       same as '-R', but a list of files [null]
  -T STR                read type. 'sr', 'ont' and 'corr' accepted for Illumina, nanopore and error corrected long-reads, respectively
                        [null]
  -v INT                coverage validation window size [10000]
  --max-cov INT         break sequences at regions at or above this coverage level [AUTO]
  --min-cov INT         break sequences at regions at or below this coverage level [AUTO]

scaffold

該步驟是將相鄰的contigs序列用100個(gè)N連起來凉当，序列的位置和方向需要根據(jù)與參考基因組的比對結(jié)果確定枣申。

用法

usage: ragtag.py scaffold <reference.fa> <query.fa>

Homology-based assembly scaffolding: Order and orient sequences in 'query.fa' by comparing them to sequences in 'reference.fa'>

positional arguments:
  <reference.fa>       reference fasta file (uncompressed or bgzipped)
  <query.fa>           query fasta file (uncompressed or bgzipped)

optional arguments:
  -h, --help           show this help message and exit

scaffolding options:
  -e <exclude.txt>     list of reference sequences to ignore [null]
  -j <skip.txt>        list of query sequences to leave unplaced [null]
  -J <hard-skip.txt>   list of query headers to leave unplaced and exclude from 'chr0' ('-C') [null]
  -f INT               minimum unique alignment length [1000]
  --remove-small       remove unique alignments shorter than '-f'
  -q INT               minimum mapq (NA for Nucmer alignments) [10]
  -d INT               maximum alignment merge distance [100000]
  -i FLOAT             minimum grouping confidence score [0.2]
  -a FLOAT             minimum location confidence score [0.0]
  -s FLOAT             minimum orientation confidence score [0.0]
  -C                   concatenate unplaced contigs and make 'chr0'
  -r                   infer gap sizes. if not, all gaps are 100 bp
  -g INT               minimum inferred gap size [100]
  -m INT               maximum inferred gap size [100000]

input/output options:
  -o PATH              output directory [./ragtag_output]
  -w                   overwrite intermediate files
  -u                   add suffix to unplaced sequence headers

mapping options:
  -t INT               number of minimap2/unimap threads [1]
  --aligner PATH       aligner executable ('nucmer', 'unimap' or 'minimap2') [minimap2]
  --mm2-params STR     space delimited minimap2 parameters ['-x asm5']
  --unimap-params STR  space delimited unimap parameters ['-x asm5']
  --nucmer-params STR  space delimted nucmer parameters ['--maxmatch -l 100 -c 500']

patch

該步驟是用contigs序列對上一步得到的scaffold序列進(jìn)行g(shù)ap填補(bǔ)。該步驟比較耗時(shí)纤怒，如果急需使用基因組進(jìn)行后續(xù)分析糯而，可以省略該步驟。

用法

usage: ragtag.py patch <target.fa> <query.fa>

Homology-based continuous assembly scaffolding and gap-filling: Make continuous joins and fill gaps in 'target.fa' using sequences from 'query.fa'

positional arguments:
  <target.fa>          target fasta file (uncompressed or bgzipped)
  <query.fa>           query fasta file (uncompressed or bgzipped)

optional arguments:
  -h, --help           show this help message and exit

patching:
  -e <exclude.txt>     list of target sequences to ignore [null]
  -j <skip.txt>        list of query sequences to ignore [null]
  -f INT               minimum unique alignment length [1000]
  --remove-small       remove unique alignments shorter than '-f'
  -q INT               minimum mapq (NA for Nucmer alignments) [10]
  -d INT               maximum alignment merge distance [100000]
  -s INT               minimum merged alignment length [50000]
  -i FLOAT             maximum merged alignment distance from sequence terminus. fraction of the sequence length if < 1 [0.05]
  --fill-only          only fill existing target gaps. do not join target sequences
  --join-only          only join and patch target sequences. do not fill existing gaps

input/output options:
  -o PATH              output directory [./ragtag_output]
  -w                   overwrite intermediate files
  -u                   add suffix to unplaced sequence headers

mapping options:
  -t INT               number of minimap2/unimap threads [1]
  --aligner PATH       aligner executable ('nucmer' (recommended), 'unimap' or 'minimap2') [nucmer]
  --mm2-params STR     space delimited minimap2 parameters ['-x asm5']
  --unimap-params STR  space delimited unimap parameters ['-x asm5']
  --nucmer-params STR  space delimted nucmer parameters ['--maxmatch -l 100 -c 500']

merge

在scaffolding過程中泊窘，可能會(huì)根據(jù)不同參數(shù)或圖譜數(shù)據(jù)產(chǎn)生多個(gè)版本的基因組組裝結(jié)果，該步驟可以將多個(gè)結(jié)果根據(jù)權(quán)重進(jìn)行最終組裝結(jié)果的生成像寒。

如果有HiC數(shù)據(jù)烘豹，還可以加入HiC數(shù)據(jù)生成比較好的組裝結(jié)果。

用法

usage: ragtag.py merge <asm.fa> <scf1.agp> <scf2.agp> [...]

Scaffold merging: derive a consensus scaffolding solution by reconciling distinct scaffoldings of 'asm.fa'

positional arguments:
  <asm.fasta>           assembly fasta file (uncompressed or bgzipped)
  <scf1.agp> <scf2.agp> [...]
                        scaffolding AGP files

optional arguments:
  -h, --help            show this help message and exit

merging options:
  -f FILE               CSV list of (AGP file,weight) [null]
  -j <skip.txt>         list of query headers to leave unplaced [null]
  -l INT                minimum assembly sequence length [100000]
  -e FLOAT              minimum edge weight. NA if using Hi-C [0.0]
  --gap-func STR        function for merging gap lengths {'min', 'max', or 'mean'} [min]

input/output options:
  -o PATH               output directory [./ragtag_output]
  -w                    overwrite intermediate files
  -u                    add suffix to unplaced sequence headers

Hi-C options:
  -b FILE               Hi-C alignments in BAM format, sorted by read name [null]
  -r STR                CSV list of restriction enzymes/sites or 'DNase' [GATC]
  -p FLOAT              portion of the sequence termini to consider for links [1.0]
  --list-enzymes        list all available restriction enzymes/sites

Liftoff簡介

Liftoff 是一個(gè)可以準(zhǔn)確根據(jù)同一物種或近緣物種基因組進(jìn)行基因注釋映射的工具诺祸。該工具僅需兩個(gè)基因組序列和參考基因組的基因注釋文件即可進(jìn)行基因注釋携悯。Liftoff使用minimap2將參考基因組的基因序列與目標(biāo)基因組比對，這樣的好處是筷笨，即使兩個(gè)基因組間存在許多結(jié)構(gòu)上的差異憔鬼，也可將基因結(jié)構(gòu)鑒定出來。對于每一個(gè)鑒定的基因胃夏，Liftoff找到外顯子區(qū)的比對轴或，并使序列的一致性最大，同時(shí)保留轉(zhuǎn)錄本和基因結(jié)構(gòu)仰禀。如果兩個(gè)基因錯(cuò)誤地比對到同一個(gè)位點(diǎn)照雁，Liftoff會(huì)確定最好基因結(jié)構(gòu)。此外答恶，Liftoff還可以找到在目標(biāo)基因組中存在而在參考基因組中不存在的基因拷貝饺蚊。

安裝

conda install -c bioconda liftoff

用法

usage: liftoff [-h] (-g GFF | -db DB) [-o FILE] [-u FILE] [-exclude_partial]
               [-dir DIR] [-mm2_options =STR] [-a A] [-s S] [-d D] [-flank F]
               [-V] [-p P] [-m PATH] [-f TYPES] [-infer_genes]
               [-infer_transcripts] [-chroms TXT] [-unplaced TXT] [-copies]
               [-sc SC] [-overlap O] [-mismatch M] [-gap_open GO]
               [-gap_extend GE]
               target reference

Lift features from one genome assembly to another

Required input (sequences):
  target              target fasta genome to lift genes to
  reference           reference fasta genome to lift genes from

Required input (annotation):
  -g GFF              annotation file to lift over in GFF or GTF format
  -db DB              name of feature database; if not specified, the -g
                      argument must be provided and a database will be built
                      automatically

Output:
  -o FILE             write output to FILE in GFF3 format; by default, output
                      is written to terminal (stdout)
  -u FILE             write unmapped features to FILE; default is
                      "unmapped_features.txt"
  -exclude_partial    write partial mappings below -s and -a threshold to
                      unmapped_features.txt; if true partial/low sequence
                      identity mappings will be included in the gff file with
                      partial_mapping=True, low_identity=True in comments
  -dir DIR            name of directory to save intermediate fasta and SAM
                      files; default is "intermediate_files"

Alignments:
  -mm2_options =STR   space delimited minimap2 parameters. By default ="-a
                      --end-bonus 5 --eqx -N 50 -p 0.5"
  -a A                designate a feature mapped only if it aligns with
                      coverage ≥A; by default A=0.5
  -s S                designate a feature mapped only if its child features
                      (usually exons/CDS) align with sequence identity ≥S; by
                      default S=0.5
  -d D                distance scaling factor; alignment nodes separated by
                      more than a factor of D in the target genome will not be
                      connected in the graph; by default D=2.0
  -flank F            amount of flanking sequence to align as a fraction
                      [0.0-1.0] of gene length. This can improve gene
                      alignment where gene structure differs between target
                      and reference; by default F=0.0

Miscellaneous settings:
  -h, --help          show this help message and exit
  -V, --version       show program version
  -p P                use p parallel processes to accelerate alignment; by
                      default p=1
  -m PATH             Minimap2 path
  -f TYPES            list of feature types to lift over
  -infer_genes        use if annotation file only includes transcripts,
                      exon/CDS features
  -infer_transcripts  use if annotation file only includes exon/CDS features
                      and does not include transcripts/mRNA
  -chroms TXT         comma seperated file with corresponding chromosomes in
                      the reference,target sequences
  -unplaced TXT       text file with name(s) of unplaced sequences to map
                      genes from after genes from chromosomes in chroms.txt
                      are mapped; default is "unplaced_seq_names.txt"
  -copies             look for extra gene copies in the target genome
  -sc SC              with -copies, minimum sequence identity in exons/CDS for
                      which a gene is considered a copy; must be greater than
                      -s; default is 1.0
  -overlap O          maximum fraction [0.0-1.0] of overlap allowed by 2
                      features; by default O=0.1
  -mismatch M         mismatch penalty in exons when finding best mapping; by
                      default M=2
  -gap_open GO        gap open penalty in exons when finding best mapping; by
                      default GO=2
  -gap_extend GE      gap extend penalty in exons when finding best mapping;
                      by default GE=1

參數(shù)的詳細(xì)說明可以參考這里萍诱。

下面是我最近用此方法組裝一個(gè)基因組并進(jìn)行注釋的流程。

基因組組裝與注釋分析實(shí)戰(zhàn)

contigs校正

ragtag.py correct westar.fasta ZYCR1-genome.clean.fa -t 30 -u

輸出文件：

ragtag.correct.fasta: 校正的contigs序列污呼；
ragtag.correct.agp：包含contigs序列的斷點(diǎn)坐標(biāo)的文件裕坊。

生成scaffold

ragtag.py scaffold westar.fasta ragtag_output/ragtag.correct.fasta -t 30 -u -C

輸出文件：

ragtag.scaffold.agp：contigs序列的方向和順序（AGP格式）；
ragtag.scaffold.fasta：scaffold序列燕酷；
ragtag.scaffold.stats：scaffolding過程中的一些統(tǒng)計(jì)信息碍庵。

基因注釋

liftoff -g westar.gff3 -o ZYCR1.gene.gff3 -p 28 -chroms chr_pairs.txt ZYCR1.genome.asm.final.fasta westar.fa

為了提高基因映射的準(zhǔn)確性，這里僅允許相同染色體間的基因進(jìn)行映射悟狱，因此需要指定兩個(gè)基因組間的染色體對應(yīng)關(guān)系静浴，即-chroms參數(shù)，其格式如下：

chrA01,chrA01
chrA02,chrA02
chrA03,chrA03
chrA04,chrA04
chrA05,chrA05
chrA06,chrA06
chrA07,chrA07
chrA08,chrA08
chrA09,chrA09
chrA10,chrA10
chrC01,chrC01
chrC02,chrC02
chrC03,chrC03
chrC04,chrC04
chrC05,chrC05
chrC06,chrC06
chrC07,chrC07
chrC08,chrC08
chrC09,chrC09
Uncluster,Uncluster

基因重命名

這里可以使用一個(gè)腳本GeneIDRename進(jìn)行：

perl GeneIDRename -g ZYCR1.gene.gff3 -o ZYCR1.gene.rename.gff3 -c 6

提取mRNA和protein序列

gffread ZYCR1.gene.rename.gff3 -g ZYCR1.genome.asm.final.fasta -x ZYCR1.cds.fa -y ZYCR1.pep.fa

參考資料

https://github.com/malonge/RagTag
https://github.com/malonge/RaGOO
https://github.com/agshumate/Liftoff
Alonge, M., Soyk, S., Ramakrishnan, S. et al. RaGOO: fast and accurate reference-guided scaffolding of draft genomes. Genome Biol 20, 224 (2019). https://doi.org/10.1186/s13059-019-1829-6
Alaina Shumate, Steven L Salzberg, Liftoff: accurate mapping of gene annotations, Bioinformatics, 2021;, btaa1016, https://doi.org/10.1093/bioinformatics/btaa1016

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

人面猴
序言：七十年代末挤渐，一起剝皮案震驚了整個(gè)濱河市苹享，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌浴麻，老刑警劉巖得问，帶你破解...
沈念sama閱讀 218,682評論 6贊 507
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場離奇詭異软免，居然都是意外死亡宫纬，警方通過查閱死者的電腦和手機(jī)，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 93,277評論 3贊 395
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門膏萧，熙熙樓的掌柜王于貴愁眉苦臉地迎上來漓骚，“玉大人，你說我怎么就攤上這事榛泛◎蝓澹” “怎么了？”我有些...
開封第一講書人閱讀 165,083評論 0贊 355
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵曹锨，是天一觀的道長孤个。經(jīng)常有香客問我，道長沛简，這世上最難降的妖魔是什么齐鲤？我笑而不...
開封第一講書人閱讀 58,763評論 1贊 295
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮椒楣，結(jié)果婚禮上给郊，老公的妹妹穿的比我還像新娘。我一直安慰自己撒顿，他們只是感情好丑罪，可當(dāng)我...
茶點(diǎn)故事閱讀 67,785評論 6贊 392
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布。她就那樣靜靜地躺著，像睡著了一般吩屹。火紅的嫁衣襯著肌膚如雪跪另。梳的紋絲不亂的頭發(fā)上，一...
開封第一講書人閱讀 51,624評論 1贊 305
城市分裂傳說
那天煤搜，我揣著相機(jī)與錄音免绿，去河邊找鬼。笑死擦盾，一個(gè)胖子當(dāng)著我的面吹牛嘲驾，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播迹卢，決...
沈念sama閱讀 40,358評論 3贊 418
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼辽故，長吁一口氣：“原來是場噩夢啊……” “哼！你這毒婦竟也來了腐碱？” 一聲冷哼從身側(cè)響起誊垢，我...
開封第一講書人閱讀 39,261評論 0贊 276
萬榮殺人案實(shí)錄
序言：老撾萬榮一對情侶失蹤，失蹤者是張志新（化名）和其女友劉穎症见，沒想到半個(gè)月后喂走，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體，經(jīng)...
沈念sama閱讀 45,722評論 1贊 315
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡谋作，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 37,900評論 3贊 336
?白月光啟示錄
正文我和宋清朗相戀三年芋肠，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片遵蚜。...
茶點(diǎn)故事閱讀 40,030評論 1贊 350
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡帖池，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出谬晕，到底是詐尸還是另有隱情碘裕，我是刑警寧澤，帶...
沈念sama閱讀 35,737評論 5贊 346
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布攒钳，位于F島的核電站，受9級特大地震影響雷滋，放射性物質(zhì)發(fā)生泄漏不撑。R本人自食惡果不足惜，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 41,360評論 3贊 330
男人毒藥：我在死后第九天來索命
文/蒙蒙一晤斩、第九天我趴在偏房一處隱蔽的房頂上張望焕檬。院中可真熱鬧，春花似錦澳泵、人聲如沸实愚。這莊子的主人今日做“春日...
開封第一講書人閱讀 31,941評論 0贊 22
一樁弒父案，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽腊敲。三九已至击喂，卻和暖如春，著一層夾襖步出監(jiān)牢的瞬間碰辅，已是汗流浹背懂昂。一陣腳步聲響...
開封第一講書人閱讀 33,057評論 1贊 270
情欲美人皮
我被黑心中介騙來泰國打工，沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留没宾，地道東北人凌彬。一個(gè)月前我還...
沈念sama閱讀 48,237評論 3贊 371
代替公主和親
正文我出身青樓，卻偏偏與公主長得像循衰，于是被迫代替她去往敵國和親铲敛。傳聞我的和親對象是個(gè)殘疾皇子，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 44,976評論 2贊 355