Dragon star Day 1 關于測序技術、NGS數(shù)據(jù)格式礼旅、變異檢測

Dragon star Day 1 20190729

關于測序技術膳叨、NGS數(shù)據(jù)格式、變異檢測

中英混雜3000字超長補丁痘系，一只酸菜魚瘋狂填補知識盲區(qū)的筆記

Dragonstar2019 by Kai Wang

Genomic technologies in disease studies

NGS data formats and variant calling

Part Ⅰ Genomic technologies in disease studies

1 Types of genetic variation

SNVs (Single Nucleotide Variants) - 單核苷酸變異
Indel (Insertion or deletion < 50 bp)
SV (Structural Variants) - 結構變異

can be balanced or unbalanced
- Balanced events do not involve gain or loss of genetic materials
- Inversions and translocations
  - Complex SVs (several types together)
- Unbalanced events:
- deletions/insertions/duplications (Deletions and duplications are two subtypes of CNVs (Copy Number Variants)
  - Chromosomal aneuploidies (such as trisomy 21 21三體) are extreme cases of unbalanced SV.
Allele frequency and effect size - 等位基因頻率與效應量/效應值

Manolio et al, Nature, 2009

Genome-wide association studies (GWAS) are effective in detecting common alleles that contribute to the inherited component of common multifactorial diseases. Typically, the alleles identified by this approach have modest effect sizes that cannot fully account for disease susceptibility. This discrepancy may exist because it is hard to identify rare alleles with a low to modest penetrance using GWAS. Penetrance is a measure of the proportion of individuals in a population carrying a particular allele that expresses the related phenotype. In contrast to multifactorial diseases, Mendelian diseases have a high penetrance and very rare allele frequency.

McCarthy, Mark I., et al. "Genome-wide association studies for complex traits: consensus, uncertainty and challenges." Nature reviews genetics 9.5 (2008): 356.

在統(tǒng)計學中菲嘴，效應值（Effect size）是量化現(xiàn)象強度的數(shù)值。[1]效應值實際的統(tǒng)計量包括了二個變數(shù)間的相關程度汰翠、回歸模型中的回歸系數(shù)龄坪、不同處理間平均值的差異……等等。無論哪種效應值奴璃，其絕對值越大表示效應越強悉默，也就是現(xiàn)象越明顯。效應值與特效檢驗的概念是互補的苟穆。在估算統(tǒng)計檢定力抄课、需要的樣本數(shù)與進行元分析時唱星，效應值經常扮演重要角色。

https://zh.wikipedia.org/wiki/效應值

Effect size is a simple way of quantifying the difference between two groups that has many advantages over the use of tests of statistical significance alone. Effect size emphasises the size of the difference rather than confounding this with sample size.*

Coe R. It's the effect size, stupid: What effect size is and why it is important[J]. 2002.

2 Revolution: single-molecule long-read sequencing

單分子長讀長測序

2.1 Oxford Nanopore Sequencing

測量電流

The DNA/RNA is sequenced when it is going though a protein pore.

DNA (or RNA)
Motor protein
Adapter sequence

The nucleotides in the DNA/RNA block the ionic current and induce changes of current, which can be measured.

2.2 PacBio Single-molecule real-time (SMRT) sequencing

測量熒光

SMRTbell library: 環(huán)狀
Adapter: 環(huán)狀
Types of SMRT sequencing reads
- Continuous Long Reads (CLR)
  
  Long inserts so that the polymerase can synthesize along a single strand
- Circular Consensus Sequencing (CCS跟磨， HiFi）
  
  Short inserts, so polymerase can continue around the entire SMRTbell multiple times and generate multiple sub-reads from the same single molecule
- 存在由于技術造成的 error insertion
  
  Insertions tend to be more than deletions and substitutions.

2.3 Linked-read Sequencing

依靠barcode

By adding a unique barcode to every short read generated from an individual molecule, the short reads are linked together.

10X Genomics 公司的 linked reads 技術本質上是將 barcode 序列引入長序列片段间聊，透過將長片段分配到不同的油滴微粒中，利用 GemCode 平臺對長片段序列進行擴增引入 barcode 序列以及定序接頭引物抵拘，然後將序列打斷成適合定序大小的片段進行定序哎榴，通過 barcode 序列資訊追蹤來自每個大片段DNA模板的多個 Reads，從而獲得大片段的遺傳資訊僵蛛。通過 linked reads 結合常規(guī)二代定序組裝得到的Scaffold尚蝌，可建構準確度更長的Scaffold。

http://toolsbiotech.blog.fc2.com/blog-entry-37.html

What are molecular barcodes?
The concept of molecular barcodes is that each original DNA fragment, within the same sample, is attached to a unique sequence barcode.

https://pdfs.semanticscholar.org/310b/3bac42989485c98406848217418ff22c22e7.pdf

Linked read use molecular barcoding to preserve long-range information
Short read pairs (2 x150 bp) are generated using barcode-containing primers.
Short reads that contains the same barcode and within a certain distance can be linked together to “reconstruct” the original long DNA fragment.

2.4 Single-molecule optical mapping (Bionano Genomics)

依靠限制性內切酶位點充尉，測量光譜

Optical mapping

By mapping the location of restriction enzyme sites along the unknown DNA of an organism, the spectrum of resulting DNA fragments collectively serves as a unique "fingerprint" or "barcode" for that sequence.

https://en.wikipedia.org/wiki/Optical_mapping

Part Ⅱ NGS data formats and variant calling

1 Basic Concepts in NGS

Insert – the DNA portion that is used for sequencing *note: 和突變中的insertion不同

Insert size is the length of the DNA (or RNA) that you want to sequence and that is "inserted" between the adapters (so adapters excluded).

https://www.biostars.org/p/95803/
Read – the part of the insert that is sequenced 讀長
Single End – a sequencing procedure by which the insert is sequenced from one end only 單端
Paired End – a sequencing procedure by which the insert is sequenced from both ends 雙端

2 Data formats

2.1 The Rawset of Raw Data

Typically: images 根據(jù)不同堿基發(fā)射出不同顏色的熒光

base calling: call nucleotides at each base of each read.

2.2 NGS data formats

FASTQ: The raw sequence data format

Millions of short reads from unknown genetic locations.

Basic Qualities: a "Phred" scale

BQ = -10log₁₀(ε) where ε is the probability of an error.

Phred: Phil's revised editor by Phil Green

https://www.pnas.org/content/101/39/13991>

Base quality conversion

Nowadays, we settled on using quality scores on the original Sanger format,(Phred+33).

需要知道測序年份/公司飘言，來確定Phred

https://en.wikipedia.org/wiki/FASTQ_format
BED: Genomic region format
- The first 3 required BED fields are:
  1. chrom: "chr1"
  2. chromStart: "0"
  3. chromEnd: "3"
  只包含這三種信息的BED: Minimal BED format. So-called BED3 format.
- 再加上：
  1. labels: "foo", "bar", "biz"...
  2. scores: "3.1"
  3. strands: "+", "-"
  This is so-called BED6 format.
GFF: annotates one line per feature
BAM/SAM: Genome alignment format
- SAM: Sequence Alignment/Map format (tab-delimited text file).
- BAM: The binary equivalent of a SAM file, which stores the same data in a compressed binary representation
https://www.samformat.info/sam-format-flag
CRAM

和 BAM/SAM 類似的另一種格式，無損壓縮驼侠，適用于大型人群基因組/外顯子組測序項目姿鸿，如 UK Biobank.
- CRAM was designed to be an efficient reference-based alternative to SAM/BAM file formats
- Better lossless compression than BAM, but also allow for controlled loss of BAM data
VCF: Variant Call Format
- A gold standard for describing variants.
- One locus per line, and it may contain more than one mutations, but most lines contain one variant only.
- 主要分為：
  - the header line
    
    包括：CHROM POS ID REF ALT QUAL FILTER INFO
    
    存在genotype時，還有：FORMAT
  - the INFO line
    
    用分號分隔的keys, 用等號定義可選的值
  - the genotypes
    
    定義數(shù)據(jù)類型及順序
gVCF (Genomic VCF):
- the basic format specification is the same as for a regular VCF, but gVCF contains extra information.
- gVCF儲存了變異及非變異位點的測序信息倒源，適用于臨床分析

2.3 Formats use different coordinate systems, which adds confusion

不同文件格式對染色體坐標的起始位置定義不同苛预，造成了一定的困擾

BED: 0-based, half-open
GFF: 1-based, fully closed
SAM: 1-based, fully closed
BAM: 0-based, half-open
VCF: 1-based, fully closed
…

3 Visualization of genomic data: IGV

強大的IGV

IGV: a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.

4 Coverage

4.1 What is coverage?

The depth of sequencing coverage can be defined theoretically as LN/G, where L is the read length, N is the number of reads and G is the haploid genome length.

測序深度可理解為：(read長度/read數(shù)量)/單倍體基因組長度
The breadth of coverage is the percentage of target bases that have been sequenced for a given number of times.
The accuracy of variant calling is affected by sequence quality, uniformity of coverage and the threshold of false-discovery rate that is used.
What is variant calling?

Variant calling is the process by which we identify variants from sequence data (Figure 11).
1. Carry out whole genome or whole exome sequencing to create FASTQ files.
2. Align the sequences to a reference genome, creating BAM or CRAM files.
3. Identify where the aligned reads differ from the reference genome and write to a VCF file.
Identify where the aligned reads differ from the reference genome and write to a VCF file.

Figure 11 A CRAM file aligned to a reference genomic region as visualised in Ensembl. Differences are highlighted in red in the reads, and will be called as variants.

reads中標紅的堿基為variants.

https://www.ebi.ac.uk/training/online/course/human-genetic-variation-i-introduction-2019/variant-identification-and-analysis

4.2 Coverage: how many reads we need to cover the genome?

Depth of coverage model

鳥槍法測序中，genome size G, read size S, N reads

某個read出現(xiàn)在某個間隙(interval)的概率為：p=L/G

D: 出現(xiàn)在與read同等長度的間隙的read的數(shù)量

D ~ Binomial(N, L/G)笋熬，D與N, p 符合二項分布

即：b(x; n, P) = _nC_x* P^x * (1 - P) ^n-x**

? b(D; N, L/G) = _NC_D* (L/G)^D * (1 - L/G) ^N-D**

Let L = S, D is also the number of reads that cover the last position of the interval → D is depth of coverage.

因為 S << G, N的值特別大热某，depth可近似理解為泊松分布：

D ~ Poisson(λ)
λ= SN/G (average depth of coverage)

$P(D=k)=\frac{e^{-\lambda}\lambda^k}{k!}$
Fraction of genome that are covered

根據(jù)上述公式可以計算：

Given λ=40, the fraction of genome that are covered more than 30x (D>30) is: 0.938.

4.3 Overdispersion

Main cause of overdispersion

GC bias and other technical factors lead to systematic bias in coverage, resulting in overdispersion.
How to model overdispersion
- Ideal situation (Poisson distrition):
  $Var(D) = μ$
- Gamma-Poisson is equivalent to Negative Binomial, which is a commonly used model for dealing with overdispersion in count data:
  $Var(D) = μ+ μ^2/k$
- 過離散模型也適用于其他因素，如 biological noise

4.4 Question on coverage

Why do we need average 30-50x in a typical WGS experiment, and 100x in WES?

WGS is less biased compared to WES. We do not need as much depth to call a variant confidently. Check the following publications for more detailed information.

" Exome-seq achieves 95% SNP detection sensitivity at a mean on-target depth of 40 reads, whereas WGS only requires a mean of 14 reads. "

Article Variant detection sensitivity and biases in whole genome and...

Shu-Hong Lin

https://www.researchgate.net/post/Why_is_NGS_coverage_of_seemingly_less_complex_exome_sequences_higher_than_that_of_whole_genome_sequencing

"WGS is less biased compared to WES."的原因：WES的樣品首先需要經過PCR

Why 30X WGS beats 100X WES for variant coverage

https://www.variantyx.com/variantyx-posts/why-30x-wgs-beats-100x-wes-for-variant-coverage/

5 General strategy for variant calling

5.1 Possible Genotypes

當只有參考基因組時突诬，各種情形的概率均為1

P(reads|A/A, read mapped) = P(C observed|A/A, read mapped) = 1.0

P(reads|A/C, read mapped) = P(C observed|A/C, read mapped) = 1.0

P(reads|C/C, read mapped) = P(C observed|C/C, read mapped) = 1.0
假定error rate為0.01
當?shù)?條read的該位點為C時

P(reads|A/A, read mapped) = 0.01

P(reads|A/C, read mapped) = 0.5

P(reads|C/C, read mapped) = 1 - 0.01 = 0.99
當?shù)?條read的該位點為C時

P(reads|A/A, read mapped) = 0.01² = 0.0001

P(reads|A/C, read mapped) = 0.5² = 0.25

P(reads|C/C, read mapped) = 0.99² = 0.9801
當?shù)?條read的該位點為C時

P(reads|A/A, read mapped) = 0.01³ = 0.000001

P(reads|A/C, read mapped) = 0.5³ = 0.125

P(reads|C/C, read mapped) = 0.99³ = 0.970299
當?shù)?條read的該位點為A時

P(reads|A/A, read mapped) = 0.01³ * 0.99 = 0.00000099

P(reads|A/C, read mapped) = 0.5⁴ = 0.0625

P(reads|C/C, read mapped) = 0.99³ * 0.01 = 0.00970299
當?shù)?條read的該位點為A

P(reads|A/A, read mapped) = 0.01³ * 0.99² = 0.00000098

P(reads|A/C, read mapped) = 0.5⁵ = 0.03125

P(reads|C/C, read mapped) = 0.99³ * 0.01² = 0.0000970299 ≈ 0.000097

總結出貝葉斯公式：

Combine these likelihoods with a prior incorporating information from other individuals and flanking sites to assign a genotype.

5.2 From Sequence to Genotype: Individual Based Prior

Individual Based Prior: Evry site has 1/1000 probability of varying.

Ingredients That Go Into Prior

Most sites don’t vary

P(non-reference base) ~ 0.001

When a site does vary, it is usually heterozygous
P(non-reference heterozygote) ~ 0.001 * 2/3
P(non-reference homozygote) ~ 0.001 * 1/3

Mutation model
Transitions account for most variants (C?T or A?G)
Transversions account for minority of variants

https://pdfs.semanticscholar.org/0156/04c3ba76cf247a8010f74ec1386e58ceb530.pdf

因此苫拍，各位點的先驗概率為：

Prior(A/A) = 0.001 * 1/3 ≈ 0.00033 0.00034?

Prior(A/C) = 0.001 * 2/3 ≈ 0.00067 0.00066?

Prior(C/C) = 1 - 0.001 = 0.999

琢磨了兩個小時并把google搜索翻到了第十頁芜繁，依然沒看懂這里的后驗概率是怎么算的旺隙，如果有統(tǒng)計大神看到這里，求賜教??

5.3 From Sequence to Genotype: Population Based Prior

Population Based Prior: Use frequency information from examining others at the same site. In the example above, we estimated P(A) = 0.20

因此骏令，各位點的先驗概率為：

Prior(A/A) = 0.2 * 0.2 = 0.04

Prior(A/C) = 1 - 0.04 - 0.64 = 0.32

Prior(C/C) = 0.8 * 0.8 = 0.64

同樣的迷惑：

5.4 Sequence Based Genotype Calls

Individual Based Prior

Assumes all sites have an equal probability of showing polymorphism

Specifically, assumption is that about 1/1000 bases differ from reference

If reads where error free and sampling Poisson …

… 14x coverage would allow for 99.8% genotype accuracy

… 30x coverage of the genome needed to allow for errors and clustering

Population Based Prior

Uses frequency information obtained from examining other individuals

Calling very rare polymorphisms still requires 20-30x coverage of the genome

Calling common polymorphisms requires much less data

https://pdfs.semanticscholar.org/0156/04c3ba76cf247a8010f74ec1386e58ceb530.pdf

最后蔬捷，向大家隆重推薦生信技能樹的一系列干貨！

生信技能樹全球公益巡講：https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
B站公益74小時生信工程師教學視頻合輯：https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
招學徒：https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw

最后編輯于：2019.08.10 03:55:12

?著作權歸作者所有,轉載或內容合作請聯(lián)系作者

人面猴
序言：七十年代末榔袋，一起剝皮案震驚了整個濱河市周拐，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌凰兑，老刑警劉巖妥粟，帶你破解...
沈念sama閱讀 206,968評論 6贊 482
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場離奇詭異吏够，居然都是意外死亡勾给，警方通過查閱死者的電腦和手機滩报，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 88,601評論 2贊 382
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進店門，熙熙樓的掌柜王于貴愁眉苦臉地迎上來播急，“玉大人脓钾，你說我怎么就攤上這事∽” “怎么了可训？”我有些...
開封第一講書人閱讀 153,220評論 0贊 344
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵，是天一觀的道長捶枢。經常有香客問我握截，道長，這世上最難降的妖魔是什么烂叔？我笑而不...
開封第一講書人閱讀 55,416評論 1贊 279
?港島之戀（遺憾婚禮）
正文為了忘掉前任川蒙，我火速辦了婚禮，結果婚禮上长已，老公的妹妹穿的比我還像新娘畜眨。我一直安慰自己，他們只是感情好术瓮，可當我...
茶點故事閱讀 64,425評論 5贊 374
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布康聂。她就那樣靜靜地躺著，像睡著了一般胞四。火紅的嫁衣襯著肌膚如雪恬汁。梳的紋絲不亂的頭發(fā)上，一...
開封第一講書人閱讀 49,144評論 1贊 285
城市分裂傳說
那天辜伟，我揣著相機與錄音氓侧，去河邊找鬼。笑死导狡，一個胖子當著我的面吹牛约巷，可吹牛的內容都是我干的。我是一名探鬼主播旱捧，決...
沈念sama閱讀 38,432評論 3贊 401
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼独郎，長吁一口氣：“原來是場噩夢啊……” “哼！你這毒婦竟也來了枚赡？” 一聲冷哼從身側響起氓癌，我...
開封第一講書人閱讀 37,088評論 0贊 261
萬榮殺人案實錄
序言：老撾萬榮一對情侶失蹤，失蹤者是張志新（化名）和其女友劉穎贫橙，沒想到半個月后贪婉，有當?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體，經...
沈念sama閱讀 43,586評論 1贊 300
?護林員之死
正文獨居荒郊野嶺守林人離奇死亡卢肃，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內容為張勛視角年9月15日...
茶點故事閱讀 36,028評論 2贊 325
?白月光啟示錄
正文我和宋清朗相戀三年疲迂，在試婚紗的時候發(fā)現(xiàn)自己被綠了星压。大學時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
茶點故事閱讀 38,137評論 1贊 334
活死人
序言：一個原本活蹦亂跳的男人離奇死亡鬼譬，死狀恐怖娜膘，靈堂內的尸體忽然破棺而出，到底是詐尸還是另有隱情优质，我是刑警寧澤竣贪，帶...
沈念sama閱讀 33,783評論 4贊 324
?日本核電站爆炸內幕
正文年R本政府宣布，位于F島的核電站巩螃，受9級特大地震影響演怎，放射性物質發(fā)生泄漏。R本人自食惡果不足惜避乏，卻給世界環(huán)境...
茶點故事閱讀 39,343評論 3贊 307
男人毒藥：我在死后第九天來索命
文/蒙蒙一爷耀、第九天我趴在偏房一處隱蔽的房頂上張望。院中可真熱鬧拍皮，春花似錦歹叮、人聲如沸。這莊子的主人今日做“春日...
開封第一講書人閱讀 30,333評論 0贊 19
一樁弒父案咆耿，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽。三九已至爹橱，卻和暖如春萨螺，著一層夾襖步出監(jiān)牢的瞬間，已是汗流浹背愧驱。一陣腳步聲響...
開封第一講書人閱讀 31,559評論 1贊 262
情欲美人皮
我被黑心中介騙來泰國打工慰技，沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留，地道東北人组砚。一個月前我還...
沈念sama閱讀 45,595評論 2贊 355
代替公主和親
正文我出身青樓吻商，卻偏偏與公主長得像，于是被迫代替她去往敵國和親惫确。傳聞我的和親對象是個殘疾皇子手报，可洞房花燭夜當晚...
茶點故事閱讀 42,901評論 2贊 345

Dragon star Day 1 關于測序技術采驻、NGS數(shù)據(jù)格式审胚、變異檢測