一. 長(zhǎng)讀段比對(duì) (Long-Read Mapping)常用的比對(duì)軟件
長(zhǎng)讀段比對(duì)算法與一代/二代測(cè)序數(shù)據(jù)的比對(duì)算法有很大的不同,因?yàn)殚L(zhǎng)讀段通常更長(zhǎng)、包含更多錯(cuò)誤和變異谱俭,并且需要更復(fù)雜的比對(duì)策略履肃。
PacBio三代測(cè)序數(shù)據(jù)有很多比對(duì)軟件可供選擇,例如:
LAST:https://github.com/mcfrith/last-genome-alignments
BlasR:https://github.com/mchaisso/blasr
BLASR是第一個(gè)針對(duì)PacBio序列的比對(duì)工具斤程,2012年發(fā)表在《BMC Bioinformatics》期刊上,由PacBio研究團(tuán)隊(duì)開發(fā)菩混,最新版本停在2019.4.11忿墅,目前Google引用次數(shù)為1170(截止2023.10.25)。
BWA-MEM:https://github.com/lh3/bwa
BWA老牌比對(duì)軟件了沮峡。BWA-MEM 是一種新的比對(duì)算法疚脐,用于將測(cè)序 reads 或者組裝后 contigs 比對(duì)至大型參考基因組,例如人參考基因組邢疙。它該算法對(duì)測(cè)序錯(cuò)誤有良好的穩(wěn)定性棍弄,適用的reads 長(zhǎng)度范圍較廣,從70bp至幾Mb疟游。
GraphMap:https://github.com/isovic/graphmap
MECAT:https://github.com/xiaochuanle/MECAT
中山大學(xué)研究團(tuán)隊(duì)開發(fā)的新工具M(jìn)ECAT呼畸,集超快比對(duì)、校正颁虐、組裝于一體蛮原,可提高三代測(cè)序數(shù)據(jù)序列比對(duì),校正和組裝的運(yùn)算速度另绩,降低計(jì)算資源的消耗儒陨。
minimap2:https://github.com/lh3/minimap2
Minimaps2是李恒大神在2018年發(fā)表在Bioinformatics上的一款針對(duì)三代數(shù)據(jù)開發(fā)的比對(duì)工具。
NGMLR: https://github.com/philres/ngmlr
NextGenMap-LR(ngmlr)主要用于三代測(cè)序的長(zhǎng)reads(PacBio 笋籽、Oxford Nanopore)與參考基因組的比對(duì)蹦漠。
pbmm2: https://github.com/PacificBiosciences/pbmm2 等等。
現(xiàn)在較為常用的是pbmm2车海、minimap2笛园、NGMLR,其中pbmm2是PacBio官方基于minimap2進(jìn)行優(yōu)化的版本。
二. pbmm2的使用教程
在得到sample.CCS.bam
文件后研铆, 因?yàn)镠iFi數(shù)據(jù)質(zhì)量較高埋同,一把不需要額外的質(zhì)控步驟,就可以將HiFi數(shù)據(jù)和下載的參考基因組序列進(jìn)行比對(duì)了蚜印。
1.參考基因組的獲取
分析前莺禁,除測(cè)序數(shù)據(jù)外,我們還需準(zhǔn)備對(duì)應(yīng)物種的參考基因組fasta文件窄赋。對(duì)此可以根據(jù)自己研究的需要哟冬,在NCBI、Ensembl忆绰、UCSC等常見(jiàn)數(shù)據(jù)庫(kù)中進(jìn)行下載浩峡。
主流的基因注釋版本有三種:RefSeq/Ensembl/UCSC
Refseq=NCBI;Ensembl=Gencode
Ensemble注釋更全面错敢,Refseq適合那些不那么復(fù)雜的注釋翰灾。
對(duì)于人類和小鼠來(lái)說(shuō),我們還可以從Gencode數(shù)據(jù)庫(kù)中進(jìn)行下載稚茅。Gencode綜合HAVANA和Ensembl數(shù)據(jù)庫(kù)中的信息纸淮,通過(guò)實(shí)驗(yàn)手段加以驗(yàn)證,從而構(gòu)建了一個(gè)高質(zhì)量的注釋信息數(shù)據(jù)庫(kù)亚享。Gencode的注釋來(lái)源于兩部分咽块。分別是Ensembl-Havana團(tuán)隊(duì)生成的手動(dòng)基因注釋和Ensembl-genebuild的自動(dòng)基因注釋。
gencode中標(biāo)識(shí)HAVANA來(lái)源的欺税,這表示它是人工注釋的侈沪。但是這些注釋也有可能是由于Havana手動(dòng)注釋和Ensembl自動(dòng)注釋合并的結(jié)果;
而如果標(biāo)識(shí)的是ENSEMBL,則表明這條注釋是由的確是Ensembl自動(dòng)注釋得到的晚凿。
以下載人類參考基因組為例:
Gencode:https://www.gencodegenes.org/human/
點(diǎn)擊 圖1 所示的 fasta 連接即可下載完整的人類參考基因組(GRCh38.p14)亭罪。
Ensembl: https://useast.ensembl.org/Homo_sapiens/Info/Index
點(diǎn)擊 圖2 所示的 Download DNA sequence (FASTA) 連接進(jìn)入下載頁(yè)面,進(jìn)一步選擇Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz , 即可下載完整的人類參考基因組(GRCh38.p14)歼秽。
注釋:
Ensembl提供的參考基因組有2種組裝形式和3種重復(fù)序列處理方式应役,分別是
primary
,toplevel
,unmasked(dna)
,soft-masked(dna_sm)
,masked(dna_rm)
。一般選擇dna.primary
和dna_sm.primary
哲银。
Toplevel: 這樣的數(shù)據(jù)包含了所有的序列區(qū)域(比如染色體扛吞、非染色體以及用大量N填充的單倍型haplotypes或基因組補(bǔ)丁patches區(qū)域),比如:Homo_sapiens.GRCh37.dna.toplevel.fa.gz
Primary Assembly: 在上面toplevel的基礎(chǔ)上荆责,排除了單倍型或基因組補(bǔ)丁區(qū)域。如果看到目錄中不存在這種類型的數(shù)據(jù)(比如這里果蠅就沒(méi)有亚脆,而人類的基因組數(shù)據(jù)就存在)做院,那么就意味著基因組不包含單倍型或基因組補(bǔ)丁區(qū)域,其實(shí)也就是等同于Toplevel.
dna:原原本本的DNA序列。
dna_rm(hard_mask):利用RepeatMasker工具將重復(fù)區(qū)域和低復(fù)雜度區(qū)域標(biāo)記成一串N键耕,會(huì)造成信息的丟失寺滚。
dna_sm(soft_mask):所有重復(fù)區(qū)域和低復(fù)雜度區(qū)域替換為小寫的堿基
RefSeq:https://www.ncbi.nlm.nih.gov/datasets/taxonomy/9606/
點(diǎn)擊 圖3 中的Download即可下載人類基因參考組(GRCh38.p14)。
UCSC genome browser: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
點(diǎn)擊 圖4 中hg38.fa.gz
即可下載人類基因參考組(GRCh38.p14)屈雄。
2. pbmm2安裝
#使用conda安裝pbmm2
$ conda install -c bioconda pbmm2
#安裝版本
v1.13.0
3. pbmm2使用
- 建立人類參考基因組索引
Index: index reference and store as .mmi file
Usage: pbmm2 index [options] <ref.fa|xml> <out.mmi>
--preset 比對(duì)模式默認(rèn)為CCS, 如果為CCS - HiFi數(shù)據(jù)村视,則不用設(shè)置
#建立基因組文件夾
$ mkdir human_ref
#將參考基因組GRCh38.fa拷入human_ref 文件夾, 進(jìn)入human_ref
$ cd human_ref
#進(jìn)入?yún)⒖蓟蚪M所在文件夾,對(duì)參考序列進(jìn)行索引
$ pbmm2 index GRCh38.fa GRCh38.mmi #此過(guò)程在我們計(jì)算服務(wù)上比較快
#如果運(yùn)行時(shí)間較長(zhǎng)酒奶,使用 nohup 加 & 將程序不掛斷運(yùn)行并放入后臺(tái)
$ nohup pbmm2 index GRCh38.fa GRCh38.mmi &
- 數(shù)據(jù)與人類參考基因組進(jìn)行比對(duì)
Usage: pbmm2 align [options] <ref.fa|xml|mmi> <in.bam|xml|fa|fq> [out.aligned.bam|xml]
除了PacBio默認(rèn)的 bam文件蚁孔, fastq / fasta 作為輸入文件也是可以的。
#比對(duì)參考序列惋嚎,-j -J 如果在服務(wù)器空閑時(shí)可以不指定
$ pbmm2 align --preset CCS --sort -j 24 -J 4 \
--log-level INFO --log-file pbmm2.log --sample sample_name \
human_ref/GRCh38.mmi sample.ccs.bam sample.align.bam
# --log-level INFO杠氢,給出基本mapping統(tǒng)計(jì)結(jié)果和時(shí)間
# --sample 指定樣品名稱;-j 指定比對(duì)線程另伍;-J 8 指定排序步驟線程鼻百,8為最大;--sort 輸出排序后的bam摆尝,--preset選擇參數(shù)集-默認(rèn)
pbmm2 index 幫助文檔
pbmm2 index - Index reference and store as .mmi file
Usage:
pbmm2 index [options] <ref.fa|xml> <out.mmi>
ref.fa|xml STR Reference FASTA, ReferenceSet XML
out.mmi STR Output Reference Index
Parameter Set Option:
--preset STR Set alignment mode. See below for preset parameter details. Valid choices: (SUBREAD,
CCS, ISOSEQ, UNROLLED). [CCS]
Parameter Override Options:
-k INT k-mer size (no larger than 28). [-1]
-w INT Minimizer window size. [-1]
-u,--no-kmer-compression Disable homopolymer-compressed k-mer (compression is active for SUBREAD & UNROLLED
presets).
-h,--help Show this help and exit.
--version Show application version and exit.
-j,--num-threads INT Number of threads to use, 0 means autodetection. [0]
--log-level STR Set log level. Valid choices: (TRACE, DEBUG, INFO, WARN, FATAL). [WARN]
--log-file FILE Log to a file, instead of stderr.
Alignment modes of --preset:
SUBREAD : -k 19 -w 10
CCS or HiFi : -k 19 -w 10 -u
ISOSEQ : -k 15 -w 5 -u
UNROLLED : -k 15 -w 15
Copyright (C) 2004-2023 Pacific Biosciences of California, Inc.
This program comes with ABSOLUTELY NO WARRANTY; it is intended for
Research Use Only and not for use in diagnostic procedures.
pbmm2 align幫助文檔
pbmm2 align - Align PacBio reads to reference sequences
Usage:
pbmm2 align [options] <ref.fa|xml|mmi> <in.bam|xml|fa|fq|gz|fofn> [out.aligned.bam|xml]
ref.fa|xml|mmi STR Reference FASTA, ReferenceSet XML, or Reference Index
in.bam|xml|fa|fq|gz|fofn STR Input BAM, DataSet XML, FASTA, or FASTQ
out.aligned.bam|xml STR Output BAM or DataSet XML
Basic Options:
--chunk-size INT Process N records per chunk. [100]
Sorting Options:
--sort Generate sorted BAM file.
-m,--sort-memory STR Memory per thread for sorting. [768M]
-J,--sort-threads INT Number of threads used for sorting; 0 means 25% of -j, maximum 8. [0]
Parameter Set Options:
--preset STR Set alignment mode. See below for preset parameter details. Valid choices:
(SUBREAD, CCS, HIFI, ISOSEQ, UNROLLED). [CCS]
General Parameter Override Options:
-k INT k-mer size (no larger than 28). [-1]
-w INT Minimizer window size. [-1]
-u,--no-kmer-compression Disable homopolymer-compressed k-mer (compression is active for SUBREAD & UNROLLED
presets).
-A INT Matching score. [-1]
-B INT Mismatch penalty. [-1]
-z INT Z-drop score. [-1]
-Z INT Z-drop inversion score. [-1]
-r INT Bandwidth used in chaining and DP-based alignment. [-1]
-g INT Stop chain enlongation if there are no minimizers in N bp. [-1]
Gap Parameter Override Options (a k-long gap costs min{o+k*e,O+k*E}):
-o,--gap-open-1 INT Gap open penalty 1. [-1]
-O,--gap-open-2 INT Gap open penalty 2. [-1]
-e,--gap-extend-1 INT Gap extension penalty 1. [-1]
-E,--gap-extend-2 INT Gap extension penalty 2. [-1]
IsoSeq Parameter Override Options:
-G INT Max intron length (changes -r). [-1]
-C INT Cost for a non-canonical GT-AG splicing (effective in ISOSEQ preset). [-1]
--no-splice-flank Do not prefer splice flanks GT-AG (effective in ISOSEQ preset).
Read Group Options:
--sample STR Sample name for all read groups. Defaults, in order of precedence: SM field in
input read group, biosample name, well sample name, "UnnamedSample".
--rg STR Read group header line such as '@RG\tID:xyz\tSM:abc'. Only for FASTA/Q inputs.
Identity Filter Options (combined with AND):
-y,--min-gap-comp-id-perc FLOAT Minimum gap-compressed sequence identity in percent. [70]
Output Options:
-l,--min-length INT Minimum mapped read length in basepairs. [50]
-N,--best-n INT Output at maximum N alignments for each read, 0 means no maximum. [0]
--strip Remove all kinetic and extra QV tags. Output cannot be polished.
--split-by-sample One output BAM per sample.
--unmapped Include unmapped records in output.
--bam-index STR Generate index for sorted BAM output. Valid choices: (NONE, BAI, CSI). [BAI]
--short-sa-cigar Populate SA tag with short cigar representation.
Input Manipulation Options (mutually exclusive):
--median-filter Pick one read per ZMW of median length.
--zmw Process ZMW Reads, subreadset.xml input required (activates UNROLLED preset).
--hqregion Process HQ region of each ZMW, subreadset.xml input required (activates UNROLLED
preset).
Sequence Manipulation Options:
--collapse-homopolymers Collapse homopolymers in reads and reference.
-h,--help Show this help and exit.
--version Show application version and exit.
-j,--num-threads INT Number of threads to use, 0 means autodetection. [0]
--log-level STR Set log level. Valid choices: (TRACE, DEBUG, INFO, WARN, FATAL). [WARN]
--log-file FILE Log to a file, instead of stderr.
Alignment modes of --preset:
SUBREAD : -k 19 -w 19 -o 5 -O 56 -e 4 -E 1 -A 2 -B 5 -z 400 -Z 50 -r 2000 -g 5000
CCS or HiFi : -k 19 -w 19 -u -o 6 -O 26 -e 2 -E 1 -A 1 -B 4 -z 400 -Z 50 -r 2000 -g 5000
ISOSEQ : -k 15 -w 5 -u -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -r 200000 -g 2000 -C 5 -G 200000
UNROLLED : -k 15 -w 15 -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -r 2000 -g 10000
Copyright (C) 2004-2023 Pacific Biosciences of California, Inc.
This program comes with ABSOLUTELY NO WARRANTY; it is intended for
Research Use Only and not for use in diagnostic procedures.
小技巧:
- 如果擔(dān)心比對(duì)結(jié)果不佳温艇,需要輸出未比對(duì)的reads進(jìn)行排查,可以加上--unmapped參數(shù)2堕汞。
- 由于pbmm2生成的bam文件MD tag格式有些特別勺爱,無(wú)法用于后續(xù)的sniffles進(jìn)行SV檢測(cè)。所以推薦使用pbsv進(jìn)行后續(xù)SV檢測(cè)2臼朗。
5. 公共數(shù)據(jù)演示:
(1) 從gencode數(shù)據(jù)庫(kù)下載人類參考基因組, 進(jìn)行pbmm2索引邻寿。
PacBio推薦人類參考基因組(詳細(xì)參照李恒博客),所以采用推薦基因組進(jìn)行后續(xù)分析视哑。
李恒2017年年底博客
https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use
推薦的人類參考基因組:GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
#新建文件夾
$ mkdir Human_ref
#下載參考基因組 GRCh38
$ wget -c -t 0 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
#對(duì)參考基因組進(jìn)行解壓
$ gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
#對(duì)參考基因組從新命名
$ mv GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GRCh38.fa
#pbmm2 index
$ nohup pbmm2 index GRCh38.fa GRCh38.mmi
(2) 下載示例數(shù)據(jù)绣否。
Example Datasets
如圖6所示:下載示例人類基因組數(shù)據(jù)。
德系猶太人家系:HG002(子)挡毅、HG003(父)蒜撮、HG004(母),屬于個(gè)人基因組計(jì)劃中的樣本跪呈。
HG002_1:m84011_220902_175841_s1.hifi_reads.bam
HG003:m84010_220919_235306_s2.hifi_reads.bam
HG004:m84010_220919_232145_s1.hifi_reads.bam
(3) 示例數(shù)據(jù)與人類參考基因組進(jìn)行比對(duì)段磨。
#HG002_1
$ nohup pbmm2 align --preset CCS --sort -j 24 -J 4 \
--log-level INFO --log-file pbmm2.HG002_1.log --sample HG002_1 \
Human_ref/GRCh38.mmi m84011_220902_175841_s1.hifi_reads.bam HG002_1.align.bam &
#HG003
$ nohup pbmm2 align --preset CCS --sort -j 24 -J 4 \
--log-level INFO --log-file pbmm2.HG003.log --sample HG003 \
Human_ref/GRCh38.mmi m84010_220919_235306_s2.hifi_reads.bam HG003.align.bam &
#HG004
$ nohup pbmm2 align --preset CCS --sort -j 24 -J 4 \
--log-level INFO --log-file pbmm2.HG004.log --sample HG004 \
Human_ref/GRCh38.mmi m84010_220919_232145_s1.hifi_reads.bam HG004.align.bam &