2020-01-15 Understanding the different versions of the human reference genome and how to choose one

GRCh38 was released in 2013. Since then, patches have been released periodically without changing the sequence or the coordinates of the assembly.
https://www.ncbi.nlm.nih.gov/grc/help/patches/
The Biostar Handbook recommends using the latest version of the genome and knowing how to map information between genome builds (that is, how to lift over coordinates).

liftOver from UCSC (web tool and command-line tool)
https://genome.ucsc.edu/cgi-bin/hgLiftOver
Remap from NCBI (web tool)
https://www.ncbi.nlm.nih.gov/genome/tools/remap
CrossMap (command-line tool)
http://crossmap.sourceforge.net/

A liftover needs a chain file, which describes the differences between the old and the new build:

conda install crossmap -y
# Run without arguments to check the installation; it should print the usage message.
CrossMap.py

# Get the chain file that maps from hg19 to hg38.
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
# Get a test data file (in BED format) that will be remapped.
wget http://data.biostarhandbook.com/data/ucsc/test.hg19.bed
# Run the remapping process.
CrossMap.py bed hg19ToHg38.over.chain.gz test.hg19.bed test.hg38.bed
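
To get a feel for what a chain file encodes, here is a minimal sketch that lifts BED intervals through a single ungapped alignment block. The block coordinates (old start 1000, new start 1500, size 200) and the helper name lift_one are made up for illustration, and the sketch ignores chromosome names and strand. Real chain files contain many blocks separated by gaps, so CrossMap or liftOver should be used for actual work.

```shell
# Toy liftover: shift BED intervals through ONE ungapped chain block.
# All numbers are hypothetical, not taken from hg19ToHg38.over.chain.gz.
lift_one() {
    # usage: lift_one <old_block_start> <new_block_start> <block_size> < in.bed
    awk -v OFS='\t' -v t="$1" -v q="$2" -v size="$3" '
        # intervals that lie completely inside the block are shifted
        $2 >= t && $3 <= t + size {
            print $1, $2 - t + q, $3 - t + q
            next
        }
        # anything else cannot be lifted by this single block
        { print $1, $2, $3, "unmapped" }
    '
}

printf 'chr1\t1010\t1050\nchr1\t5000\t5100\n' | lift_one 1000 1500 200
```

The first interval falls inside the block and is shifted by 500 bases; the second lies outside and is reported as unmapped, which is roughly what happens to regions that have no counterpart in the new build.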

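I did not know what a *.bed file was, so I read up on it:
"Do you know the formats of these common files in bioinformatics analysis (fastq/bed/gtf/sam/bam/wig) and how to view them?" https://blog.csdn.net/qazplm12_3/article/details/85222665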
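
As a quick orientation, a BED file is tab-separated text whose first three columns are chromosome, start and end. The start is 0-based and the end is exclusive, so the interval length is simply end minus start. The sketch below feeds two made-up BED lines to a small helper (bed_lengths is a hypothetical name) that appends the length of each interval.

```shell
# Print chrom, start, end and the interval length (end - start) for BED input.
bed_lengths() {
    awk -v OFS='\t' '{ print $1, $2, $3, $3 - $2 }' "$@"
}

# Two made-up records: chrom, start (0-based), end (exclusive), name.
printf 'chr1\t100\t200\tfeatureA\nchr2\t0\t50\tfeatureB\n' | bed_lengths
```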

Heng Li, the author of bwa, gave some advice on choosing a reference genome in a 2017 blog post:
https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

1. To align to GRCh37 (hg19), use hs37-1kg:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz

2. To align to GRCh37 if you believe that decoy sequences* help variant calling, use hs37d5:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

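On decoy sequences: they are mentioned in the blog post "Notes on the components of the human reference genome FASTA file", with the Epstein-Barr virus (EBV) genome as an example: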
http://www.reibang.com/p/5b73773e30ef

3. To align to GRCh38 (hg38):

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

GRCh37 (hg19) and GRCh38 (hg38) also come in other minor versions.

Problems that the various versions of the genome may have:

1. Inclusion of ALT contigs.

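Because the reference genome is represented as a haploid sequence, ALT sequences are needed to represent the alternative alleles and similar variation found in diploid genomes.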
ALT contigs are large variations with very long flanking sequences nearly identical to the primary human assembly. Most read mappers will give mapping quality zero to reads mapped in the flanking sequences. This will reduce the sensitivity of variant calling and many other analyses. You can resolve this issue with an ALT-aware mapper, but no mainstream variant callers or other tools can take the advantage of ALT-aware mapping.

The yellow part is the flanking sequence, which plays a regulatory role.

2. Padding ALT contigs with long “N”s. (?)

This has the same problem with 1 and also increases the size of genome unnecessarily. It is worse.

3. Inclusion of multi-placed sequences.

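The pseudoautosomal regions (PARs), PAR1 and PAR2, are homologous nucleotide sequences on the X and Y chromosomes. Pseudoautosomal genes (at least 29 have been found so far) show an autosomal rather than a sex-linked pattern of inheritance.

A normal male has two copies of each pseudoautosomal gene: one in the pseudoautosomal region of his Y chromosome and one in the corresponding part of his X chromosome. A normal female also has two copies, because both of her X chromosomes contain the pseudoautosomal regions. Crossing over between the X and Y chromosomes is normally restricted to the pseudoautosomal regions. As a result, a female can inherit an allele that was originally on her father's Y chromosome.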

On Wikipedia, "alpha satellites" redirects to the Centromere article:
https://en.wikipedia.org/wiki/Centromere#The_centromeric_sequence

In both GRCh37 and GRCh38, the pseudo-autosomal regions (PARs) of chrX are also placed on to chrY. If you use a reference genome that contains both copies, you will not be able to call any variants in PARs with a standard pipeline. In GRCh38, some alpha satellites are placed multiple times, too. The right solution is to hard mask PARs on chrY and those extra copies of alpha repeats.
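
Hard masking means overwriting the bases of a region with N so that no read can align there. The sketch below does this with awk on a toy FASTA in which each sequence sits on a single line. The helper name mask_region and the coordinates are hypothetical. For a real genome, a tool such as bedtools maskfasta with a BED file of the chrY PARs would be the more usual route.

```shell
# Replace the bases in the 0-based, half-open interval [start, end) with N.
# Assumes ONE sequence line per record (toy input only, not a wrapped FASTA).
mask_region() {
    # usage: mask_region <contig> <start> <end> < in.fa
    awk -v contig="$1" -v s="$2" -v e="$3" '
        /^>/ { name = substr($1, 2); print; next }
        name == contig {
            n = e - s
            mask = sprintf("%" n "s", "")   # n spaces ...
            gsub(/ /, "N", mask)            # ... turned into n Ns
            print substr($0, 1, s) mask substr($0, e + 1)
            next
        }
        { print }
    '
}

# chrX is left untouched; positions 2..5 of chrY become N.
printf '>chrX\nACGTACGTAC\n>chrY\nACGTACGTAC\n' | mask_region chrY 2 6
```

Masking only the chrY copy keeps the region callable on chrX, which is the point of the recommendation quoted above.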

4. Not using the rCRS mitochondrial sequence.

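rCRS is the revised version of the Cambridge Reference Sequence (CRS) for human mitochondrial DNA, which was published in 1981. It is stored in the NCBI GenBank database under accession number NC_012920.
Other reference sequences also exist: an African (Yoruba) reference sequence, an African (Uganda) reference sequence, a Swedish reference sequence, a Japanese reference sequence, and the Reconstructed Sapiens Reference Sequence (RSRS).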

rCRS is widely used in population genetics. However, the official GRCh37 comes with a mitochondrial sequence 2bp longer than rCRS. If you want to analyze mitochondrial phylogeny, this 2bp insertion will cause troubles. GRCh38 uses rCRS.
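
One quick hint about which mitochondrial sequence a reference carries is its length in the FASTA index (.fai, as written by samtools faidx): rCRS is 16,569 bp long, so a sequence that is 2 bp longer shows up as 16,571 bp. The sketch below checks a made-up index. The contig names chrM and MT are the common spellings and are assumed here, check_mito is a hypothetical helper, and a matching length is only a hint, not proof that the sequence is rCRS.

```shell
# Report whether the mitochondrial contig in a .fai index has the rCRS length.
# .fai columns: name, length, offset, bases per line, bytes per line.
check_mito() {
    awk '
        $1 == "chrM" || $1 == "MT" {
            if ($2 == 16569) print $1, $2, "rCRS length"
            else             print $1, $2, "not rCRS length"
        }
    ' "$@"
}

# Made-up index lines; only the first two columns matter for this check.
printf 'chr1\t249250621\t6\t50\t51\nchrM\t16571\t99\t50\t51\n' | check_mito
```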

5. Converting semi-ambiguous IUB codes to “N”.

All degenerate bases such as R, Y, K and M are replaced with N.

This is a very minor issue, though. Human chromosomal sequences contain few semi-ambiguous bases.
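
To see what this conversion does, the sketch below replaces every character other than A, C, G, T and N (in either case) on the sequence lines with N. The record is made up and iub_to_n is a hypothetical helper name.

```shell
# Collapse semi-ambiguous IUB codes (R, Y, K, M, ...) to N on sequence lines.
iub_to_n() {
    awk '
        /^>/ { print; next }                    # headers are left untouched
        { gsub(/[^ACGTNacgtn]/, "N"); print }   # any other code becomes N
    ' "$@"
}

printf '>toy\nACGTRYKMacgtn\n' | iub_to_n
```

The information lost is that R, for example, means "A or G", while N means "any base".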

6. Using accession numbers instead of chromosome names.

Accession numbers are used instead of chromosome names.

Do you know CM000663.2 corresponds to chr1 in GRCh38?
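
If a FASTA file uses accession numbers, its headers can be renamed with a lookup table. The following is a minimal sketch, assuming a two-column, tab-separated table of accession and chromosome name. The table lists only chr1 and chr2 for illustration, rename_contigs is a hypothetical helper, and the sequences are made up. Names missing from the table are left unchanged.

```shell
# Rename FASTA headers using a table of <accession><TAB><new name>.
rename_contigs() {
    # usage: rename_contigs <table.tsv> < in.fa
    awk '
        NR == FNR { map[$1] = $2; next }    # first input: the lookup table
        /^>/ {
            id = substr($1, 2)              # header up to the first space
            if (id in map) { print ">" map[id]; next }   # description is dropped
        }
        { print }
    ' "$1" -
}

table=$(mktemp)
printf 'CM000663.2\tchr1\nCM000664.2\tchr2\n' > "$table"
printf '>CM000663.2 Homo sapiens chromosome 1\nACGT\n>KI270706.1\nGGCC\n' |
    rename_contigs "$table"
rm -f "$table"
```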

7. Not including unplaced and unlocalized contigs.

The reference leaves out the unlocalized and unplaced sequences.

This will force reads originated from these contigs to be mapped to the chromosomal assembly and lead to false variant calls.

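Brief summary of the problems found in different genome versions:

1. ALT contigs are present → lower sensitivity in variant calling and other analyses → use ALT-aware tools
2. ALT contigs are padded with Ns → the same problem as 1
3. PARs are included twice → a standard pipeline cannot call variants in the PARs → hard mask the PARs on chrY
4. rCRS is not used → problems when analyzing mitochondrial phylogeny
5. All degenerate bases are written as N → not a big problem
6. Accession numbers are used instead of chromosome names
7. Unlocalized and unplaced sequences are left out → false variant calls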

Problems found in some commonly used reference files (the numbers refer to the list above):

hg19/chromFa.tar.gz from UCSC: 1, 3, 4 and 5.

hg38/hg38.fa.gz from UCSC: 1, 3 and 5.

GCA_000001405.15_GRCh38_genomic.fna.gz from NCBI: 1, 3, 5 and 6.

Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz from EnsEMBL: 3.

Homo_sapiens.GRCh38.dna.toplevel.fa.gz from EnsEMBL: 1, 2 and 3.
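
When it is unclear which of these problems a given FASTA file has, the contig names in its .fai index are a reasonable first check. The sketch below counts contigs by category, assuming UCSC-style names: the suffix _alt for ALT contigs, _random for unlocalized contigs, the prefix chrUn_ for unplaced contigs and the suffix _decoy for decoys. Files that use other naming schemes (for example accession numbers) would need different patterns. The helper name classify_contigs is hypothetical and the contig lengths in the input are dummy values.

```shell
# Count contigs per category from the first column of a .fai index.
classify_contigs() {
    awk '
        $1 ~ /_alt$/    { n["alt"]++;         next }
        $1 ~ /_decoy$/  { n["decoy"]++;       next }
        $1 ~ /_random$/ { n["unlocalized"]++; next }
        $1 ~ /^chrUn_/  { n["unplaced"]++;    next }
                        { n["primary"]++ }
        END {
            split("primary alt unlocalized unplaced decoy", order, " ")
            for (i = 1; i <= 5; i++) print order[i], n[order[i]] + 0
        }
    ' "$@"
}

# One name per line with a dummy length of 1.
printf '%s\t1\n' chr1 chrM chr1_KI270706v1_random chrUn_KI270302v1 chr1_KI270762v1_alt |
    classify_contigs
```

A non-zero "alt" count points to problem 1, and zero "unlocalized" and "unplaced" counts point to problem 7.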