將基因組組裝到染色體水平無非就是兩種方式:
- 獨(dú)立組裝(de novo)逝薪;
- 基于參考基因組的組裝(reference-guided),即基于近緣或同一物種的基因組染色體同源性進(jìn)行組裝苍息。
獨(dú)立組裝中的contigs或scaffolds一般用遺傳圖譜、BioNano光學(xué)圖譜或HiC技術(shù)進(jìn)行染色體水平的掛闰集,現(xiàn)在比較常用是HiC技術(shù)恰梢。基于參考基因組的組裝一般是將contigs或scaffolds與近緣或同一物種的基因組進(jìn)行比對酿秸,從而提升至染色體水平灭翔,比較常用的是RagTag工具包,是RaGOO的升級版。有的時(shí)候肝箱,我們沒有相關(guān)圖譜數(shù)據(jù)哄褒,只能采用此方法將基因組提升至染色體水平。
RagTag可以進(jìn)行錯(cuò)誤組裝校正煌张、scaffold組裝和修補(bǔ)呐赡、scaffold合并等。之后骏融,可以用Liftoff進(jìn)行基因注釋链嘀。
RagTag簡介
RagTag的流程一共四步:correct,scaffold档玻,patch怀泊,merge。
軟件安裝
conda install -c bioconda ragtag
correct
校正是使用參考基因組來鑒定和校正contigs中的組裝錯(cuò)誤误趴,該步驟不會(huì)將序列減少或增加霹琼,僅僅是將序列在錯(cuò)誤組裝的位置進(jìn)行打斷。
用法
usage: ragtag.py correct <reference.fa> <query.fa>
Homology-based assembly correction: Correct sequences in 'query.fa' by comparing them to sequences in 'reference.fa'>
positional arguments:
<reference.fa> reference fasta file (uncompressed or bgzipped)
<query.fa> query fasta file (uncompressed or bgzipped)
optional arguments:
-h, --help show this help message and exit
correction options:
-f INT minimum unique alignment length [1000]
--remove-small remove unique alignments shorter than -f
-q INT minimum mapq (NA for Nucmer alignments) [10]
-d INT maximum alignment merge distance [100000]
-b INT minimum break distance from contig ends [5000]
-e <exclude.txt> list of reference headers to ignore [null]
-j <skip.txt> list of query headers to leave uncorrected [null]
--inter only break misassemblies between reference sequences
--intra only break misassemblies within reference sequences
--gff <features.gff> don't break sequences within gff intervals [null]
input/output options:
-o PATH output directory [./ragtag_output]
-w overwrite intermediate files
-u add suffix to unaltered sequence headers
mapping options:
-t INT number of minimap2/unimap threads [1]
--aligner PATH whole genome aligner executable ('nucmer', 'unimap' or 'minimap2') [minimap2]
--mm2-params STR space delimited minimap2 whole genome alignment parameters ['-x asm5']
--unimap-params STR space delimited unimap parameters ['-x asm5']
--nucmer-params STR space delimted nucmer whole genome alignment parameters ['--maxmatch -l 100 -c 500']
validation options:
--read-aligner PATH read aligner executable (only 'minimap2' is allowed) [minimap2]
-R <reads.fasta> validation reads (uncompressed or gzipped) [null]
-F <reads.fofn> same as '-R', but a list of files [null]
-T STR read type. 'sr', 'ont' and 'corr' accepted for Illumina, nanopore and error corrected long-reads, respectively
[null]
-v INT coverage validation window size [10000]
--max-cov INT break sequences at regions at or above this coverage level [AUTO]
--min-cov INT break sequences at regions at or below this coverage level [AUTO]
scaffold
該步驟是將相鄰的contigs序列用100個(gè)N連起來凉当,序列的位置和方向需要根據(jù)與參考基因組的比對結(jié)果確定枣申。
用法
usage: ragtag.py scaffold <reference.fa> <query.fa>
Homology-based assembly scaffolding: Order and orient sequences in 'query.fa' by comparing them to sequences in 'reference.fa'>
positional arguments:
<reference.fa> reference fasta file (uncompressed or bgzipped)
<query.fa> query fasta file (uncompressed or bgzipped)
optional arguments:
-h, --help show this help message and exit
scaffolding options:
-e <exclude.txt> list of reference sequences to ignore [null]
-j <skip.txt> list of query sequences to leave unplaced [null]
-J <hard-skip.txt> list of query headers to leave unplaced and exclude from 'chr0' ('-C') [null]
-f INT minimum unique alignment length [1000]
--remove-small remove unique alignments shorter than '-f'
-q INT minimum mapq (NA for Nucmer alignments) [10]
-d INT maximum alignment merge distance [100000]
-i FLOAT minimum grouping confidence score [0.2]
-a FLOAT minimum location confidence score [0.0]
-s FLOAT minimum orientation confidence score [0.0]
-C concatenate unplaced contigs and make 'chr0'
-r infer gap sizes. if not, all gaps are 100 bp
-g INT minimum inferred gap size [100]
-m INT maximum inferred gap size [100000]
input/output options:
-o PATH output directory [./ragtag_output]
-w overwrite intermediate files
-u add suffix to unplaced sequence headers
mapping options:
-t INT number of minimap2/unimap threads [1]
--aligner PATH aligner executable ('nucmer', 'unimap' or 'minimap2') [minimap2]
--mm2-params STR space delimited minimap2 parameters ['-x asm5']
--unimap-params STR space delimited unimap parameters ['-x asm5']
--nucmer-params STR space delimted nucmer parameters ['--maxmatch -l 100 -c 500']
patch
該步驟是用contigs序列對上一步得到的scaffold序列進(jìn)行g(shù)ap填補(bǔ)。該步驟比較耗時(shí)纤怒,如果急需使用基因組進(jìn)行后續(xù)分析糯而,可以省略該步驟。
用法
usage: ragtag.py patch <target.fa> <query.fa>
Homology-based continuous assembly scaffolding and gap-filling: Make continuous joins and fill gaps in 'target.fa' using sequences from 'query.fa'
positional arguments:
<target.fa> target fasta file (uncompressed or bgzipped)
<query.fa> query fasta file (uncompressed or bgzipped)
optional arguments:
-h, --help show this help message and exit
patching:
-e <exclude.txt> list of target sequences to ignore [null]
-j <skip.txt> list of query sequences to ignore [null]
-f INT minimum unique alignment length [1000]
--remove-small remove unique alignments shorter than '-f'
-q INT minimum mapq (NA for Nucmer alignments) [10]
-d INT maximum alignment merge distance [100000]
-s INT minimum merged alignment length [50000]
-i FLOAT maximum merged alignment distance from sequence terminus. fraction of the sequence length if < 1 [0.05]
--fill-only only fill existing target gaps. do not join target sequences
--join-only only join and patch target sequences. do not fill existing gaps
input/output options:
-o PATH output directory [./ragtag_output]
-w overwrite intermediate files
-u add suffix to unplaced sequence headers
mapping options:
-t INT number of minimap2/unimap threads [1]
--aligner PATH aligner executable ('nucmer' (recommended), 'unimap' or 'minimap2') [nucmer]
--mm2-params STR space delimited minimap2 parameters ['-x asm5']
--unimap-params STR space delimited unimap parameters ['-x asm5']
--nucmer-params STR space delimted nucmer parameters ['--maxmatch -l 100 -c 500']
merge
在scaffolding過程中泊窘,可能會(huì)根據(jù)不同參數(shù)或圖譜數(shù)據(jù)產(chǎn)生多個(gè)版本的基因組組裝結(jié)果,該步驟可以將多個(gè)結(jié)果根據(jù)權(quán)重進(jìn)行最終組裝結(jié)果的生成像寒。
如果有HiC數(shù)據(jù)烘豹,還可以加入HiC數(shù)據(jù)生成比較好的組裝結(jié)果。
用法
usage: ragtag.py merge <asm.fa> <scf1.agp> <scf2.agp> [...]
Scaffold merging: derive a consensus scaffolding solution by reconciling distinct scaffoldings of 'asm.fa'
positional arguments:
<asm.fasta> assembly fasta file (uncompressed or bgzipped)
<scf1.agp> <scf2.agp> [...]
scaffolding AGP files
optional arguments:
-h, --help show this help message and exit
merging options:
-f FILE CSV list of (AGP file,weight) [null]
-j <skip.txt> list of query headers to leave unplaced [null]
-l INT minimum assembly sequence length [100000]
-e FLOAT minimum edge weight. NA if using Hi-C [0.0]
--gap-func STR function for merging gap lengths {'min', 'max', or 'mean'} [min]
input/output options:
-o PATH output directory [./ragtag_output]
-w overwrite intermediate files
-u add suffix to unplaced sequence headers
Hi-C options:
-b FILE Hi-C alignments in BAM format, sorted by read name [null]
-r STR CSV list of restriction enzymes/sites or 'DNase' [GATC]
-p FLOAT portion of the sequence termini to consider for links [1.0]
--list-enzymes list all available restriction enzymes/sites
Liftoff簡介
Liftoff 是一個(gè)可以準(zhǔn)確根據(jù)同一物種或近緣物種基因組進(jìn)行基因注釋映射的工具诺祸。該工具僅需兩個(gè)基因組序列和參考基因組的基因注釋文件即可進(jìn)行基因注釋携悯。Liftoff使用minimap2將參考基因組的基因序列與目標(biāo)基因組比對,這樣的好處是筷笨,即使兩個(gè)基因組間存在許多結(jié)構(gòu)上的差異憔鬼,也可將基因結(jié)構(gòu)鑒定出來。對于每一個(gè)鑒定的基因胃夏,Liftoff找到外顯子區(qū)的比對轴或,并使序列的一致性最大,同時(shí)保留轉(zhuǎn)錄本和基因結(jié)構(gòu)仰禀。如果兩個(gè)基因錯(cuò)誤地比對到同一個(gè)位點(diǎn)照雁,Liftoff會(huì)確定最好基因結(jié)構(gòu)。此外答恶,Liftoff還可以找到在目標(biāo)基因組中存在而在參考基因組中不存在的基因拷貝饺蚊。
安裝
conda install -c bioconda liftoff
用法
usage: liftoff [-h] (-g GFF | -db DB) [-o FILE] [-u FILE] [-exclude_partial]
[-dir DIR] [-mm2_options =STR] [-a A] [-s S] [-d D] [-flank F]
[-V] [-p P] [-m PATH] [-f TYPES] [-infer_genes]
[-infer_transcripts] [-chroms TXT] [-unplaced TXT] [-copies]
[-sc SC] [-overlap O] [-mismatch M] [-gap_open GO]
[-gap_extend GE]
target reference
Lift features from one genome assembly to another
Required input (sequences):
target target fasta genome to lift genes to
reference reference fasta genome to lift genes from
Required input (annotation):
-g GFF annotation file to lift over in GFF or GTF format
-db DB name of feature database; if not specified, the -g
argument must be provided and a database will be built
automatically
Output:
-o FILE write output to FILE in GFF3 format; by default, output
is written to terminal (stdout)
-u FILE write unmapped features to FILE; default is
"unmapped_features.txt"
-exclude_partial write partial mappings below -s and -a threshold to
unmapped_features.txt; if true partial/low sequence
identity mappings will be included in the gff file with
partial_mapping=True, low_identity=True in comments
-dir DIR name of directory to save intermediate fasta and SAM
files; default is "intermediate_files"
Alignments:
-mm2_options =STR space delimited minimap2 parameters. By default ="-a
--end-bonus 5 --eqx -N 50 -p 0.5"
-a A designate a feature mapped only if it aligns with
coverage ≥A; by default A=0.5
-s S designate a feature mapped only if its child features
(usually exons/CDS) align with sequence identity ≥S; by
default S=0.5
-d D distance scaling factor; alignment nodes separated by
more than a factor of D in the target genome will not be
connected in the graph; by default D=2.0
-flank F amount of flanking sequence to align as a fraction
[0.0-1.0] of gene length. This can improve gene
alignment where gene structure differs between target
and reference; by default F=0.0
Miscellaneous settings:
-h, --help show this help message and exit
-V, --version show program version
-p P use p parallel processes to accelerate alignment; by
default p=1
-m PATH Minimap2 path
-f TYPES list of feature types to lift over
-infer_genes use if annotation file only includes transcripts,
exon/CDS features
-infer_transcripts use if annotation file only includes exon/CDS features
and does not include transcripts/mRNA
-chroms TXT comma seperated file with corresponding chromosomes in
the reference,target sequences
-unplaced TXT text file with name(s) of unplaced sequences to map
genes from after genes from chromosomes in chroms.txt
are mapped; default is "unplaced_seq_names.txt"
-copies look for extra gene copies in the target genome
-sc SC with -copies, minimum sequence identity in exons/CDS for
which a gene is considered a copy; must be greater than
-s; default is 1.0
-overlap O maximum fraction [0.0-1.0] of overlap allowed by 2
features; by default O=0.1
-mismatch M mismatch penalty in exons when finding best mapping; by
default M=2
-gap_open GO gap open penalty in exons when finding best mapping; by
default GO=2
-gap_extend GE gap extend penalty in exons when finding best mapping;
by default GE=1
參數(shù)的詳細(xì)說明可以參考這里萍诱。
下面是我最近用此方法組裝一個(gè)基因組并進(jìn)行注釋的流程。
基因組組裝與注釋分析實(shí)戰(zhàn)
contigs校正
ragtag.py correct westar.fasta ZYCR1-genome.clean.fa -t 30 -u
輸出文件:
- ragtag.correct.fasta: 校正的contigs序列污呼;
- ragtag.correct.agp:包含contigs序列的斷點(diǎn)坐標(biāo)的文件裕坊。
生成scaffold
ragtag.py scaffold westar.fasta ragtag_output/ragtag.correct.fasta -t 30 -u -C
輸出文件:
- ragtag.scaffold.agp:contigs序列的方向和順序(AGP格式);
- ragtag.scaffold.fasta:scaffold序列燕酷;
- ragtag.scaffold.stats:scaffolding過程中的一些統(tǒng)計(jì)信息碍庵。
基因注釋
liftoff -g westar.gff3 -o ZYCR1.gene.gff3 -p 28 -chroms chr_pairs.txt ZYCR1.genome.asm.final.fasta westar.fa
為了提高基因映射的準(zhǔn)確性,這里僅允許相同染色體間的基因進(jìn)行映射悟狱,因此需要指定兩個(gè)基因組間的染色體對應(yīng)關(guān)系静浴,即-chroms
參數(shù),其格式如下:
chrA01,chrA01
chrA02,chrA02
chrA03,chrA03
chrA04,chrA04
chrA05,chrA05
chrA06,chrA06
chrA07,chrA07
chrA08,chrA08
chrA09,chrA09
chrA10,chrA10
chrC01,chrC01
chrC02,chrC02
chrC03,chrC03
chrC04,chrC04
chrC05,chrC05
chrC06,chrC06
chrC07,chrC07
chrC08,chrC08
chrC09,chrC09
Uncluster,Uncluster
基因重命名
這里可以使用一個(gè)腳本GeneIDRename進(jìn)行:
perl GeneIDRename -g ZYCR1.gene.gff3 -o ZYCR1.gene.rename.gff3 -c 6
提取mRNA和protein序列
gffread ZYCR1.gene.rename.gff3 -g ZYCR1.genome.asm.final.fasta -x ZYCR1.cds.fa -y ZYCR1.pep.fa
參考資料
Alonge, M., Soyk, S., Ramakrishnan, S. et al. RaGOO: fast and accurate reference-guided scaffolding of draft genomes. Genome Biol 20, 224 (2019). https://doi.org/10.1186/s13059-019-1829-6
Alaina Shumate, Steven L Salzberg, Liftoff: accurate mapping of gene annotations, Bioinformatics, 2021;, btaa1016, https://doi.org/10.1093/bioinformatics/btaa1016