《Modern Statistics for Modern Biology》Chapter 一: 離散數(shù)據(jù)模型的預(yù)測(cè)(1.1 - 1.3)
《Modern Statistics for Modern Biology》Chapter 一: 離散數(shù)據(jù)模型的預(yù)測(cè)(1.4 - 1.5)
《Modern Statistics for Modern Biology》Chapter 二: 統(tǒng)計(jì)建模(2.1-2.3)
《Modern Statistics for Modern Biology》Chapter 二: 統(tǒng)計(jì)建模(2.4-2.5)
《Modern Statistics for Modern Biology》Chapter 二 統(tǒng)計(jì)建模(2.6 - 2.7)
《Modern Statistics for Modern Biology》Chapter 二 統(tǒng)計(jì)建模(2.8 - 2.9)
從這章開始最開始記錄一些markdown等小知識(shí)
$\hat{p}=\frac{1}{12}$
:
掌握R語言中的apply函數(shù)族
卡方檢驗(yàn)
Hardy-Weinberg equilibrium( 哈迪-溫伯格平衡 )
帶你理解beta分布
簡(jiǎn)單介紹一下R中的幾種統(tǒng)計(jì)分布及常用模型
- 統(tǒng)計(jì)分布每一種分布有四個(gè)函數(shù):
d――density(密度函數(shù)),p――分布函數(shù)豌研,q――分位數(shù)函數(shù)淤击,r――隨機(jī)數(shù)函數(shù)
失晴。比如履恩,正態(tài)分布的這四個(gè)函數(shù)為dnorm菇爪,pnorm朴下,qnorm镜雨,rnorm
囱嫩。下面我們列出各分布后綴恃疯,前面加前綴d、p墨闲、q或r就構(gòu)成函數(shù)名:norm:正態(tài)
,t:t分布
鸳碧,f:F分布
盾鳞,chisq:卡方
(包括非中心)unif:均勻
,exp:指數(shù)
瞻离,weibull:威布爾,gamma:伽瑪
套利,beta:貝塔
lnorm:對(duì)數(shù)正態(tài)推励,logis:邏輯分布,cauchy:柯西
肉迫,binom:二項(xiàng)分布
验辞,geom:幾何分布
,hyper:超幾何
喊衫,nbinom:負(fù)二項(xiàng)
跌造,pois:泊松
signrank:符號(hào)秩,wilcox:秩和
族购,tukey:學(xué)生化極差
如何預(yù)測(cè)一條序列是否含有CpG島
2.10 示例:基因組中出現(xiàn)核苷酸模式
到目前為止壳贪,我們看到的例子集中在離散計(jì)數(shù)和分類數(shù)據(jù)的分布上。讓我們來看一個(gè)距離分布的例子联四,它是準(zhǔn)連續(xù)的撑碴。對(duì)基因組序列中特定基序?qū)嵗g距離分布的這一案例研究也將使我們能夠探索生物導(dǎo)體中的特定基因組序列操作。
Biostrings包提供了對(duì)序列數(shù)據(jù)進(jìn)行操作的工具朝墩,
DNAString
andDNAStringSet.
是R中基礎(chǔ)的數(shù)據(jù)結(jié)構(gòu)醉拓。這些工具能讓我對(duì)單條或者多條序列進(jìn)行有效的操作伟姐。
library("Biostrings")
? 問題
通過瀏覽教程小片段,了解Biostring包中提供的一些有用的數(shù)據(jù)和函數(shù)亿卤。
函數(shù) | 功能 |
---|---|
length | 返回序列長(zhǎng)度 |
names | 返回序列的名稱 |
[ | 從object中提取序列 |
head; tail | 從object中提取第一條或者最后一條序列 |
rev | 對(duì)序列進(jìn)行反向排序 |
width, nchar | 返回一個(gè)object中的所有序列的大蟹弑(比如堿基的數(shù)目) |
==, != | 兩個(gè)object元素范圍內(nèi)的比較 |
match, %in% | 匹配 |
duplicated, unique | 去重復(fù)功能 |
sort, order | 排序 |
relist, split, extractList |
-
更多內(nèi)容
Table 1:Low-level manipulation ofDNAStringSetandAAStringSetobjects.
? 解
- 第一行打印遺傳密碼信息,第二行返回
IUPAC
核苷酸類型排吴。第三行列出了Biostring軟件包中所有可用的
vignettes(有哪些參考說明書)秆乳,第四行顯示了一個(gè)特定的
vignettes`(功能參數(shù)預(yù)覽)。
> GENETIC_CODE
TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT CTC CTA CTG CCT CCC CCA CCG CAT CAC CAA CAG
"F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "*" "W" "L" "L" "L" "L" "P" "P" "P" "P" "H" "H" "Q" "Q"
CGT CGC CGA CGG ATT ATC ATA ATG ACT ACC ACA ACG AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA GCG
"R" "R" "R" "R" "I" "I" "I" "M" "T" "T" "T" "T" "N" "N" "K" "K" "S" "S" "R" "R" "V" "V" "V" "V" "A" "A" "A" "A"
GAT GAC GAA GAG GGT GGC GGA GGG
"D" "D" "E" "E" "G" "G" "G" "G"
attr(,"alt_init_codons")
[1] "TTG" "CTG"
> IUPAC_CODE_MAP
A C G T M R W S Y K V H D B N
"A" "C" "G" "T" "AC" "AG" "AT" "CG" "CT" "GT" "ACG" "ACT" "AGT" "CGT" "ACGT"
> vignette(package = "Biostrings")
> vignette("BiostringsQuickOverview", package = "Biostrings")
-
BSgenome
包提供了許多物種的基因組信息钻哩,您可以通過鍵入以下命令來訪問包含整個(gè)基因組序列的數(shù)據(jù)包的名稱
> library("BSgenome")
> ag = available.genomes()
> length(ag) ## 可以看到有91個(gè)基因組的信息
[1] 91
> ag[1:2]
[1] "BSgenome.Alyrata.JGI.v1" "BSgenome.Amellifera.BeeBase.assembly4"
- 接下來我們將探索
AGGAGGT
在E.coli
(大腸桿菌)中出現(xiàn)屹堰。我們使用一個(gè)特定菌株的基因組序列(Escherichia coli str. K12 substr.DH10B),其NCBI登錄號(hào)為NC_010473街氢。
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BSgenome.Ecoli.NCBI.20080805", version = "3.8")
library("BSgenome.Ecoli.NCBI.20080805")
> Ecoli
E. coli genome:
# organism: Escherichia coli (E. coli)
# provider: NCBI
# provider version: 2008/08/05
# release date: NA
# release name: NA
# 13 sequences:
# NC_008253 NC_008563 NC_010468 NC_004431 NC_009801 NC_009800 NC_002655 NC_002695 NC_010498 NC_007946 NC_010473 NC_000913
# AC_000091
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator to access a given sequence)
> shineDalgarno = "AGGAGGT"
> ecoli = Ecoli$NC_010473
- 我們可以使用
countPattern
函數(shù)計(jì)算寬度為50000的窗口中出現(xiàn)的pattern
扯键。
> window = 50000
> starts = seq(1, length(ecoli) - window, by = window)
> ends = starts + window - 1
> numMatches = vapply(seq_along(starts), function(i) {
+ countPattern(shineDalgarno, ecoli[starts[i]:ends[i]],
+ max.mismatch = 0)
+ }, numeric(1))
> table(numMatches)
numMatches
0 1 2 3 4
48 32 8 3 2
> numMatches
[1] 0 1 0 1 4 0 0 0 1 0 0 1 0 0 1 0 0 0 0 3 0 0 4 0 1 0 2 2 1 0 1 0 1 1 1 0 0 0 0 1 3 1 1 1 1 0 0 0 0 1 0 0 1 0 2 1 2 1 1 0 0 0 0
[64] 0 1 0 1 1 0 1 2 0 0 0 0 0 0 0 0 1 1 0 3 1 0 1 2 1 2 1 2 0 1
> str(starts)
num [1:93] 1 50001 100001 150001 200001 ...
> str(numMatches)
num [1:93] 0 1 0 1 4 0 0 0 1 0 ...
?問題 What distribution might this table fit ?
? 解
- 我們可以使用
matchPattern
函數(shù)檢查AGGAGGT
匹配到的位置信息。
> sdMatches = matchPattern(shineDalgarno, ecoli, max.mismatch = 0)
> sdMatches
Views on a 4686137-letter DNAString subject
subject: AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC...TGCTGCATGATATTGAAAAAAATATCACCAAATAAAAAACGCCTTAGTAAGTATTTTTC
views:
start end width
[1] 56593 56599 7 [AGGAGGT]
[2] 199644 199650 7 [AGGAGGT]
[3] 202176 202182 7 [AGGAGGT]
[4] 214433 214439 7 [AGGAGGT]
[5] 217429 217435 7 [AGGAGGT]
... ... ... ... ...
[61] 4438786 4438792 7 [AGGAGGT]
[62] 4498085 4498091 7 [AGGAGGT]
[63] 4536658 4536664 7 [AGGAGGT]
[64] 4546821 4546827 7 [AGGAGGT]
[65] 4611626 4611632 7 [AGGAGGT]
- 可以在R命令行中鍵入
sdMatches
以獲取此對(duì)象的summary
珊肃。它包含所有65
個(gè)模式匹配的位置荣刑,表示為一組所謂的原始序列視圖。它們之間的距離是多少伦乔?
> betweenmotifs = gaps(sdMatches)
> betweenmotifs
Views on a 4686137-letter DNAString subject
subject: AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC...TGCTGCATGATATTGAAAAAAATATCACCAAATAAAAAACGCCTTAGTAAGTATTTTTC
views:
start end width
[1] 1 56592 56592 [AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAA...TGAATAATTTCGCTACCTACCACGCTCCAGGTGTCAGAACCCGGCAGAC]
[2] 56600 199643 143044 [AAAGCTACCGTTATCCAGAATGGTATAGCGGTTTTCGCCACGTTGTTTC...ACATGTTATGGGGCTATAGCTCAGCTGGGAGAGCGCCTGCTTTGCACGC]
[3] 199651 202175 2525 [CTGCGGTTCGATCCCGCATAGCTCCACCATCTCTGTAGTGGTTAAATAA...TAAAGAGTAACGGAGGAGCACGAAGGTTGGCTAATCCTGGTCGGACATC]
[4] 202183 214432 12250 [TAGTGCAATGGCATAAGCCAGCTTGACTGCGAGCGTGACGGCGCGAGCA...GATATTGGAAATATCTGATTTGCAAATTATCGTGTTATCGCCAGGCTTT]
[5] 214440 217428 2989 [TAATAACATGGGCAGGATAAGCTCGGGAGGAATGATGTTTAAGGCAATA...CCGTAGCGAGAATACTCAAAATCATCATAACGAAAAGCCCCTTACTTGT]
... ... ... ... ...
[62] 4438793 4498084 59292 [GGATTTAATCACGGTAACATTTTGCTGGCTTAAAGCATCTGCCAGACGC...GAGCAACAGGCGGCAGAGCAAGGTTGGGAGTCATTGCATCGTCAACTTC]
[63] 4498092 4536657 38566 [AGATCCGGTTGCGGCAGCAAGGATTCATCCAAATGATCCACAAAGGCTT...AGGATATAGTAAGCAAATTAAAACGGGGTAAATTTGAACTCCAAATACC]
[64] 4536665 4546820 10156 [GGAATTAAAGAATGCGATGGATGGTATATTTACGAAAAAATAATTGATG...GGGGTAAAACGTTTCCTGTAGCACCGTGAGTTATACTTTGTATAACTTA]
[65] 4546828 4611625 64798 [GCAGATGCGTATTACCATAAAAAGATGGGGGAACAGTGCAGGTATGGTC...ATGGGGCACCGACACACCGAGCGTGGTGGCGCGCGCATCGCCGAGTGCA]
[66] 4611633 4686137 74505 [CGAGATCGCGGCAAAAACTCAGGCTCAGCGGCAGAAATAAAATCATCAG...TATTGAAAAAAATATCACCAAATAAAAAACGCCTTAGTAAGTATTTTTC]
- 因此厉亏,這些實(shí)際上是66個(gè)互補(bǔ)的區(qū)域。現(xiàn)在烈和,讓我們找到一個(gè)模型爱只,分布之間的差距大小的主題。如果模體發(fā)生在隨機(jī)位置斥杜,我們期望間隙長(zhǎng)度服從指數(shù)分布虱颗。下面的代碼(其輸出如
圖2.23
所示)評(píng)估了這一假設(shè)。如果指數(shù)分布是一個(gè)很好的擬合
蔗喂,點(diǎn)應(yīng)該大致在一條直線上忘渔。指數(shù)分布只有一個(gè)參數(shù),即速率
缰儿,并給出了與數(shù)據(jù)估計(jì)相對(duì)應(yīng)的斜率直線
畦粮。
> library("Renext")
載入需要的程輯包:evd
Warning messages:
1: 程輯包‘Renext’是用R版本3.5.2 來建造的
2: 程輯包‘evd’是用R版本3.5.2 來建造的
> expplot(width(betweenmotifs), rate = 1/mean(width(betweenmotifs)),
+ labels = "fit")
2.10.1在依賴關(guān)系的情況下建模(Modeling in the case of dependencies)
正如我們?cè)诘?.8節(jié)中看到的,核苷酸序列通常是依賴的:
在給定的位置發(fā)現(xiàn)某個(gè)核苷酸的概率往往取決于周圍的序列
乖阵。在這里宣赔,我們將使用馬爾可夫鏈
進(jìn)行依賴建模。我們將研究人類基因組第8號(hào)染色體的區(qū)域瞪浸,并試圖發(fā)現(xiàn)被稱為CpG島的區(qū)域與其他區(qū)域之間的差異儒将。我們使用數(shù)據(jù)(由Irizarry,Wu对蒲,和Feinberg(2009)生成)钩蚊,這些數(shù)據(jù)告訴我們
CpG
的起點(diǎn)和終點(diǎn)在基因組中的什么位置贡翘,并觀察核苷酸
的頻率和‘CG’、‘CT’砰逻、‘CA’鸣驱、‘CC’
的頻率。因此蝠咆,我們可以問踊东,核苷酸的出現(xiàn)之間是否存在依賴關(guān)系,如果存在刚操,如何對(duì)它們進(jìn)行建模闸翅。
> library("BSgenome.Hsapiens.UCSC.hg19")
> chr8 = Hsapiens$chr8
> setwd("e://compute language/NGS/Modern Statistics for Modern Biology/data/")
> CpGtab = read.table("model-based-cpg-islands-hg19.txt",header = TRUE)
> nrow(CpGtab)
[1] 65699
> head(CpGtab)
chr start end length CpGcount GCcontent pctGC obsExp
1 chr10 93098 93818 721 32 403 0.559 0.572
2 chr10 94002 94165 164 12 97 0.591 0.841
3 chr10 94527 95302 776 65 538 0.693 0.702
4 chr10 119652 120193 542 53 369 0.681 0.866
5 chr10 122133 122621 489 51 339 0.693 0.880
6 chr10 180265 180720 456 32 256 0.561 0.893
> irCpG = with(dplyr::filter(CpGtab, chr == "chr8"), IRanges(start = start, end = end))
> head(irCpG)
IRanges object with 6 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 32437 33708 1272
[2] 40354 40656 303
[3] 44536 46203 1668
[4] 99457 100721 1265
[5] 155954 156761 808
[6] 179033 179756 724
- 在上面的行中,我們將數(shù)據(jù)集
CpGtab
的子集(篩選)僅設(shè)為第8染色體
赡茸,然后創(chuàng)建一個(gè)IRange
對(duì)象缎脾,該對(duì)象的開始
和結(jié)束
位置由該數(shù)據(jù)集中命名相同的列定義。在IRange
函數(shù)調(diào)用(它從其參數(shù)構(gòu)造對(duì)象)中占卧,第一個(gè)開始是函數(shù)的參數(shù)名稱,第二個(gè)開始是作為filter
的輸出在數(shù)據(jù)集中獲得的列联喘;對(duì)于end
也是類似的华蜒。IRange
是一種通用的數(shù)學(xué)間隔容器。IRange中
的“I”代表“間隔”豁遭,Granges
中的“G”代表“基因組”叭喜。
> grCpG = GRanges(ranges = irCpG, seqnames = "chr8", strand = "+")
> grCpG
GRanges object with 2855 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr8 32437-33708 +
[2] chr8 40354-40656 +
[3] chr8 44536-46203 +
[4] chr8 99457-100721 +
[5] chr8 155954-156761 +
... ... ... ...
[2851] chr8 146126089-146128165 +
[2852] chr8 146175867-146176338 +
[2853] chr8 146228006-146228907 +
[2854] chr8 146232962-146233086 +
[2855] chr8 146276730-146278360 +
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> genome(grCpG) = "hg19"
> library("Gviz")
載入需要的程輯包:grid
Warning message:
程輯包‘Gviz’是用R版本3.5.1 來建造的
> ideo = IdeogramTrack(genome = "hg19", chromosome = "chr8")
Warning messages:
1: In readChar(f, 1) : 在non-UTF-8 MBCS語言環(huán)境里只能讀取字節(jié)
2: In readChar(f, 1) : 在non-UTF-8 MBCS語言環(huán)境里只能讀取字節(jié)
> ideo
Ideogram track 'chr8' for chromosome 8 of the hg19 genome
> plotTracks(
+ list(GenomeAxisTrack(),
+ AnnotationTrack(grCpG, name = "CpG"), ideo),
+ from = 2200000, to = 5800000,
+ shape = "box", fill = "#006400", stacking = "dense")
- 我們現(xiàn)在定義所謂的染色體序列的觀點(diǎn),對(duì)應(yīng)于
CpG島
蓖谢,irCpG
捂蕴,和之間的區(qū)域(gap(IrCpG)
)。生成的對(duì)象CGIview
和nonCGIview
只包含坐標(biāo)闪幽,而不包含序列本身(這些對(duì)象保存在對(duì)象Hsapiens$chr8
中)啥辨,因此它們?cè)诖鎯?chǔ)方面是相當(dāng)輕量級(jí)的。
> CGIview = Views(unmasked(Hsapiens$chr8), irCpG)
> CGIview
Views on a 146364022-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
views:
start end width
[1] 32437 33708 1272 [CGGGTGCCATCTCAGCAGCTCACGGTGTGGAAACTGCGACACTCACA...ATTGGAACTGTGCGTTTGCGGAAGAACACGGCGGGGGGGGCGGCGC]
[2] 40354 40656 303 [CGATTAGCGGGAGCGCAAAGCGAAGGGGCGGCCTGCGACGTTTTCCT...ACGAGGAGGGAGCCGCGCACGCGCGGTCGGCTCGGCGAGGAGCCGG]
[3] 44536 46203 1668 [CGGCGCTGCAGGAGAGGAGATGCCCAGGCCAGGCGGCCGGCGCACGT...TGTGATGTTGTGTTCTCGGTGTGAGTTCATGGGTGTGACGGGGCGT]
[4] 99457 100721 1265 [CGGCTGGCCGGGCAGGGGGGCTGACCCCCCCCACCTCCCTCCTGGAC...GCTCCGCTGCCCGCTGAACTCCATCCTCCCGGCGGTCGGGCGGCGG]
[5] 155954 156761 808 [CGGGAAAGATTTTATTCACCGTCGATGCGGCCCCGAGTTGTCCCAAA...GGCACGGCAGGGCTCTCTTGCTCGCAGTATACTGGCGGCACGCCGC]
... ... ... ... ...
[2851] 146126089 146128165 2077 [CGGCAAGTAGCACACTCGACCAACTGCTGAAACGAAAGCAGCTCGTT...AAAGAGAAGGCTTTATTATTTCAAGGTCGACGGCCAAGGAGACCGG]
[2852] 146175867 146176338 472 [CGACAGGCGGGAACGCCACCAGCCTGTGGGCGGAGTCCTGGCTCGGC...TATCTGGGGCAGCGTAGCCTCTCTAGATGGCGGGAGGCCGGGGCGA]
[2853] 146228006 146228907 902 [CGCCACCCTCGCCCTGCCCGCGCCTCGGCCAACACAAAGCGGGCGCT...GGGAGAGGCCAGGTGCGACCCGGGCAGGTGAGGAGCACGGGCCCGG]
[2854] 146232962 146233086 125 [CGCCTCCAGCTGCGCAAGGGCAGGTGCGGGCAAGGAAGTCCCACTGG...TGGGTGAGGGTAGAGAGGATGCCTCAGTGCGTCCGGACACCACCGA]
[2855] 146276730 146278360 1631 [CGTCTCAGGAAACACCGCTTTCCACTCCTGTGTCCCCGAAAGGAATC...ATCCACGGGTGCGAATCCACGGGCACGTGTGATCTGGGGTCCGCGG]
> NonCGIview = Views(unmasked(Hsapiens$chr8), gaps(irCpG))
> NonCGIview
Views on a 146364022-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
views:
start end width
[1] 33709 40353 6645 [TCTTCTCATCACGTGCTCTCAGGGGCCACGATGTCAACATGCCTCA...AATGCTGGGTGCCCGCAAGAACGCGGAGCAGCAGGCACTCATTCC]
[2] 40657 44535 3879 [TCTCCAAGTGCCGCCAGCTGCGGGATTTCCTCTGCAAAAGACAAAC...CCAGGCCTGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGG]
[3] 46204 99456 53253 [GTGCTGTGTGAGAACATGTGTGTAGTGTTCACATGTCCTCTGTGCG...AGTAGGGGCGGCCGGGCAGAGGCGCCCCTCACCTCCCGGACGGGG]
[4] 100722 155953 55232 [CGGCTGCCCTAAGTGATCTTTTTAAGTAAAGGAGAAACTCACCAGG...AACCGTTCTAACTGGTCTCTGACCTTGATTATTAACGGCTGCAAC]
[5] 156762 179032 22271 [CTGCTGGCAGCTAGGGACATTGCAGGGCCCTCTTGCTCACATTGTA...CCATGTCTGCATCCCGGGCGGTATGTACGCTCTGAGAGATACATG]
... ... ... ... ...
[2850] 146125886 146126088 203 [GGGGGATGACTTTTATATATTTCTGACATGCCTGCATTAGAAGAAG...AGAAAAGACAGCATCGACAGCAAAATGGGGAGAGGATAAGACACC]
[2851] 146128166 146175866 47701 [AGGCATGGCTCAAATGTGGCTCCCACATGTGGGCTTAGGTCAAAGT...CCACCCCTTCACCCCGTCGCTCCGCCCGAGCTCGAATGGTGGCAC]
[2852] 146176339 146228005 51667 [CTCTGTGGTTCAGCCCCAGACCTGGTCCCACTAGCCCTGAGGCCAG...ACAGCCCGTAGACCTGCACGCCCCGCTGCCTCCGCTCGTGCCTCA]
[2853] 146228908 146232961 4054 [AGGGAGGACTTACGAGAATGGAAAGGAAAGCAGCCCGTCTTCAGTG...CTGAAGACAGCGCAGCTATCCGTAAGGATCACATGCGGCTTCCCT]
[2854] 146233087 146276729 43643 [GACCATAGTTCTCCAAGGGCACCTAGAGTGGATCCAGCTGAGTTCA...CCTTCGAGACCTTGCAGACCAGGCAGCTAACACTTCCCACCCCAC]
- 我們利用這些數(shù)據(jù)計(jì)算了
CpG島
和非CpG島
的躍遷計(jì)數(shù)盯腌。
> seqCGI = as(CGIview, "DNAStringSet")
> seqCGI
A DNAStringSet instance of length 2855
width seq
[1] 1272 CGGGTGCCATCTCAGCAGCTCACGGTGTGGAAACTGCGACACTCACACGGGTGCCATC...TCTCTGATTACATTGGAACTGTGCGTTTGCGGAAGAACACGGCGGGGGGGGCGGCGC
[2] 303 CGATTAGCGGGAGCGCAAAGCGAAGGGGCGGCCTGCGACGTTTTCCTGTAAAGTTGGG...GCTCAGCACGGACGAGGAGGGAGCCGCGCACGCGCGGTCGGCTCGGCGAGGAGCCGG
[3] 1668 CGGCGCTGCAGGAGAGGAGATGCCCAGGCCAGGCGGCCGGCGCACGTGGGTTCTCTGT...GAGTCCCTGTGTGTGATGTTGTGTTCTCGGTGTGAGTTCATGGGTGTGACGGGGCGT
[4] 1265 CGGCTGGCCGGGCAGGGGGGCTGACCCCCCCCACCTCCCTCCTGGACGGGGCGGCTGG...AGACTGAGACAGCTCCGCTGCCCGCTGAACTCCATCCTCCCGGCGGTCGGGCGGCGG
[5] 808 CGGGAAAGATTTTATTCACCGTCGATGCGGCCCCGAGTTGTCCCAAAGCCAGGCAGTG...GCTGGCGACGGGGCACGGCAGGGCTCTCTTGCTCGCAGTATACTGGCGGCACGCCGC
... ... ...
[2851] 2077 CGGCAAGTAGCACACTCGACCAACTGCTGAAACGAAAGCAGCTCGTTTTCGTCTTGAC...AGTTTTTGAGAAAAGAGAAGGCTTTATTATTTCAAGGTCGACGGCCAAGGAGACCGG
[2852] 472 CGACAGGCGGGAACGCCACCAGCCTGTGGGCGGAGTCCTGGCTCGGCCCCTCGCGGCG...GCCGCTCGTTCTATCTGGGGCAGCGTAGCCTCTCTAGATGGCGGGAGGCCGGGGCGA
[2853] 902 CGCCACCCTCGCCCTGCCCGCGCCTCGGCCAACACAAAGCGGGCGCTCAGCACCCCGC...AGCGGAGTGAGGGGAGAGGCCAGGTGCGACCCGGGCAGGTGAGGAGCACGGGCCCGG
[2854] 125 CGCCTCCAGCTGCGCAAGGGCAGGTGCGGGCAAGGAAGTCCCACTGGGGCTTTTGGGG...GCTCGGGCGAGTGGGTGAGGGTAGAGAGGATGCCTCAGTGCGTCCGGACACCACCGA
[2855] 1631 CGTCTCAGGAAACACCGCTTTCCACTCCTGTGTCCCCGAAAGGAATCATCAGACATCT...GGAAGGGTGGAATCCACGGGTGCGAATCCACGGGCACGTGTGATCTGGGGTCCGCGG
> seqNonCGI = as(NonCGIview, "DNAStringSet")
> dinucCpG = sapply(seqCGI, dinucleotideFrequency)
> dinucNonCpG = sapply(seqNonCGI, dinucleotideFrequency)
> dinucNonCpG[, 1]
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
389 351 400 436 498 560 112 603 359 336 403 336 330 527 519 485
> dinucCpG[, 1]
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
49 127 55 8 132 44 116 112 54 129 107 100 4 104 112 18
> NonICounts = rowSums(dinucNonCpG)
> IslCounts = rowSums(dinucCpG)
> NonICounts
AA AC AG AT CA CC CG CT GA GC GG GT TA TC
14223322 7129981 9795572 11291315 10241505 6931732 1174927 9779692 8372468 5706958 6934755 7112531 9602902 8359398
TG TT
10221420 14197986
> IslCounts
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
64103 83109 122131 44000 113346 193847 133549 122377 105186 177326 194517 86752 31210 106738 114592 65861
- For a four state Markov chain as we have, we define the transition matrix as a matrix where the rows are the from state and the columns are the to state. (這里放原文比較好理解)
> TI = matrix( IslCounts, ncol = 4, byrow = TRUE)
> TI
[,1] [,2] [,3] [,4]
[1,] 64103 83109 122131 44000
[2,] 113346 193847 133549 122377
[3,] 105186 177326 194517 86752
[4,] 31210 106738 114592 65861
> TnI = matrix(NonICounts, ncol = 4, byrow = TRUE)
> dimnames(TI) = dimnames(TnI) =
+ list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))
> TI
A C G T
A 64103 83109 122131 44000
C 113346 193847 133549 122377
G 105186 177326 194517 86752
T 31210 106738 114592 65861
- 我們使用每種類型的躍遷次數(shù)的計(jì)數(shù)來計(jì)算頻率溉知,并將它們放入兩個(gè)矩陣中 。
> TI
A C G T
A 64103 83109 122131 44000
C 113346 193847 133549 122377
G 105186 177326 194517 86752
T 31210 106738 114592 65861
> MI = TI /rowSums(TI)
> MI
A C G T
A 0.20457773 0.2652333 0.3897678 0.1404212
C 0.20128250 0.3442381 0.2371595 0.2173200
G 0.18657245 0.3145299 0.3450223 0.1538754
T 0.09802105 0.3352314 0.3598984 0.2068492
> MN = TnI / rowSums(TnI)
> MN
A C G T
A 0.3351380 0.1680007 0.23080886 0.2660524
C 0.3641054 0.2464366 0.04177094 0.3476871
G 0.2976696 0.2029017 0.24655406 0.2528746
T 0.2265813 0.1972407 0.24117528 0.3350027
? 問題
- 這意味著不同行中的轉(zhuǎn)換是否不同腕够?例如级乍,
? 解
- 轉(zhuǎn)變是不同的。例如帚湘,對(duì)于CpG島(MI)轉(zhuǎn)換矩陣玫荣,從C到A的轉(zhuǎn)換和從T到A的轉(zhuǎn)換看起來非常不同(0.201和0.098)。
? 問題
- CpG島上不同核苷酸的相對(duì)頻率是否不同大诸?
? 解
> freqIsl = alphabetFrequency(seqCGI, baseOnly = TRUE, collapse = TRUE)[1:4]
> freqIsl / sum(freqIsl)
A C G T
0.1781693 0.3201109 0.3206298 0.1810901
> freqNon = alphabetFrequency(seqNonCGI, baseOnly = TRUE, collapse = TRUE)[1:4]
> freqNon / sum(freqNon)
A C G T
0.3008292 0.1993832 0.1993737 0.3004139
- 從結(jié)果可以看出捅厂,在
CpG島
中材诽,C和G
的頻率大概是0.32
,在非-CpG島區(qū)
恒傻,有A和T
的頻率為0.30脸侥。
? 問題
- 我們?nèi)绾卫眠@些差異來決定一個(gè)給定的序列是否來自CpG島?
? 解
使用
χ2-平方(卡方檢驗(yàn))
統(tǒng)計(jì)量來比較觀察到的頻率與freqIsI
和freqNon
頻率之間的差異盈厘。對(duì)于較短的序列睁枕,這可能不夠敏感,下面給出一個(gè)更敏感的方法沸手。給定一個(gè)序列外遇,我們不知道它是否在一個(gè)CpG島上,我們可以問它與其他地方相比契吉,它屬于一個(gè)CpG島的概率有多大跳仿。我們根據(jù)所謂的優(yōu)勢(shì)比來計(jì)算分?jǐn)?shù)。讓我們舉個(gè)例子:假設(shè)我們的序列x是
ACGTTATACTACG
捐晶,我們想要決定它是否
來自CpG島
菲语。假設(shè)序列來自CpG島,如果我們將該序列建模為一階
Markov鏈
惑灵,則可以這樣寫:
我們將把這個(gè)概率與
非-CpG島
的概率進(jìn)行比較山上。正如我們?cè)谏厦婵吹降模@些概率往往是非常不同的英支。我們將取它們的比值佩憾,看看它是大于還是小于1。這些主權(quán)將是許多小term
的產(chǎn)物干花,并將變得非常小妄帘。我們可以通過取對(duì)數(shù)來解決這個(gè)問題。
這是對(duì)數(shù)似然比分?jǐn)?shù)池凄。為了加快計(jì)算速度抡驼,我們計(jì)算了對(duì)數(shù)比
,然后對(duì)相關(guān)的數(shù)據(jù)進(jìn)行匯總得到我們的分?jǐn)?shù)。
> alpha = log((freqIsl/sum(freqIsl)) / (freqNon/sum(freqNon)))
> alpha
A C G T
-0.5238084 0.4734387 0.4751059 -0.5061665
> beta = log(MI / MN)
> beta
A C G T
A -0.4935945 0.4566418 0.5239611 -0.6390469
C -0.5927341 0.3342289 1.7365319 -0.4699321
G -0.4671646 0.4383576 0.3360277 -0.4967508
T -0.8379216 0.5303960 0.4002977 -0.4821485
> x = "ACGTTATACTACG"
> scorefun = function(x) {
+ s = unlist(strsplit(x, ""))
+ score = alpha[s[1]]
+ if (length(s) >= 2)
+ for (j in 2:length(s))
+ score = score + beta[s[j-1], s[j]]
+ score
+ }
> scorefun(x)
A
-0.2824623
- 在下面的代碼中修赞,我們從
seqCGI
對(duì)象中的2855
個(gè)序列中選擇長(zhǎng)度為len=100
的序列婶恼,然后從seqNonCGI
對(duì)象中的2854
個(gè)序列中選擇(每個(gè)序列都是一個(gè)DNAStringSet
)。在generateRandomScore
函數(shù)的前三行中柏副,我們刪除了包含除A勾邦、C、T割择、G以外的任何字母的序列眷篇,例如"."
(用于未定義核苷酸的字符)。在其余的序列中荔泳,我們以與其長(zhǎng)度減去len成正比的概率進(jìn)行采樣蕉饼,然后從中提取長(zhǎng)度len的序列虐杯。對(duì)子序列的起始點(diǎn)進(jìn)行均勻采樣,同時(shí)滿足子序列必須滿足的約束條件昧港。
> generateRandomScores = function(s, len = 100, B = 1000) {
+ alphFreq = alphabetFrequency(s)
+ isGoodSeq = rowSums(alphFreq[, 5:ncol(alphFreq)]) == 0
+ s = s[isGoodSeq]
+ slen = sapply(s, length)
+ prob = pmax(slen - len, 0)
+ prob = prob / sum(prob)
+ idx = sample(length(s), B, replace = TRUE, prob = prob)
+ ssmp = s[idx]
+ start = sapply(ssmp, function(x) sample(length(x) - len, 1))
+ scores = sapply(seq_len(B), function(i)
+ scorefun(as.character(ssmp[[i]][start[i]+(1:len)]))
+ )
+ scores / len
+ }
> scoresCGI = generateRandomScores(seqCGI)
> scoresNonCGI = generateRandomScores(seqNonCGI)
> br = seq(-0.6, 0.7, length.out = 50)
> h1 = hist(scoresCGI, breaks = br, plot = FALSE)
> h2 = hist(scoresNonCGI, breaks = br, plot = FALSE)
> plot(h1, col = rgb(0, 0, 1, 1/4), xlim = c(-0.5, 0.5), ylim=c(0,120))
> plot(h2, col = rgb(1, 0, 0, 1/4), add = TRUE)
2.11 Summary of this chapter
Goodness of fit 我們使用了不同的可視化擎椰,并展示了如何運(yùn)行模擬實(shí)驗(yàn)來檢驗(yàn)我們的數(shù)據(jù)是否可以用一個(gè)公平的四盒多項(xiàng)式模型進(jìn)行擬合。我們遇到了
卡方檢驗(yàn)
创肥,看到了如何使用QQ圖
將模擬和理論進(jìn)行比較达舒。Estimation 我們解釋了
最大似然
和貝葉斯估計(jì)
過程。舉例說明了這些方法叹侄,包括核苷酸模式發(fā)現(xiàn)
和單倍型估計(jì)
巩搏。Prior and posterior distributions 在評(píng)估以前研究過的類型的數(shù)據(jù)(如單倍型)時(shí),計(jì)算數(shù)據(jù)的后驗(yàn)分布可能是有益的趾代。這使我們能夠通過簡(jiǎn)單的計(jì)算贯底,在決策過程中加入不確定性。只要有足夠的數(shù)據(jù)撒强,先驗(yàn)的選擇對(duì)結(jié)果影響不大禽捆。
CpG islands and Markov chains 我們了解了如何通過
馬爾可夫鏈
轉(zhuǎn)移來模擬DNA序列中的依賴關(guān)系
。我們利用這一點(diǎn)來建立基于似然比的分?jǐn)?shù)
尿褪,這樣我們就可以看到長(zhǎng)DNA序列是否來自CpG島
睦擂。當(dāng)我們制作分?jǐn)?shù)直方圖時(shí),我們?cè)?code>圖 2.25中看到了一個(gè)值得注意的特征:它似乎是由兩個(gè)部分組成的杖玲。這種雙峰是我們第一次遇到混合物,它們是第四章的主題淘正。這是在一些訓(xùn)練數(shù)據(jù)上建立模型的第一個(gè)例子:我們知道序列在CpG島上摆马,我們以后可以用這些序列對(duì)新的數(shù)據(jù)進(jìn)行分類。我們將在
第12章
中制定一種更完整的方法來實(shí)現(xiàn)這一點(diǎn)鸿吆。
References
Cleveland, William S. 1988. The Collected Works of John W. Tukey: Graphics 1965-1985. Vol. 5. CRC Press.
Rice, John. 2006. Mathematical Statistics and Data Analysis. Cengage Learning.
Elson, D, and E Chargaff. 1952. “On the Desoxyribonucleic Acid Content of Sea Urchin Gametes.” Experientia 8 (4). Springer: 143–45.
Mourant, AE, Ada Kopec, and K Domaniewska-Sobczak. 1976. “The Distribution of the Human Blood Groups 2nd Edition.” Oxford University Press London.
Finetti, Bruno de. 1926. “Considerazioni Matematiche Sull’ereditarieta Mendeliana.” Metron 6: 3–41.
Cannings, Chris, and Anthony WF Edwards. 1968. “Natural Selection and the de Finetti Diagram.” Annals of Human Genetics 31 (4). Wiley Online Library: 421–28.
Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.” Gnome Biology 15 (12): 1–21.
Irizarry, Rafael A, Hao Wu, and Andrew P Feinberg. 2009. “A Species-Generalized Probabilistic Model-Based Definition of Cpg Islands.” Mammalian Genome 20 (9-10). Springer: 674–80.
Durbin, Richard, Sean Eddy, Anders Krogh, and Graeme Mitchison. 1998. Biological Sequence Analysis. Cambridge University Press.
Freedman, David, Robert Pisani, and Roger Purves. 1997. Statistics. New York, NY: WW Norton.
Agresti, Alan. 2007. An Introduction to Categorical Data Analysis. John Wiley.
Grantham, Richard, Christian Gautier, Manolo Gouy, M Jacobzone, and R Mercier. 1981. “Codon Catalog Usage Is a Genome Strategy Modulated for Gene Expressivity.” Nucleic Acids Research 9 (1). Oxford Univ Press: 213–13.
Perrière, Guy, and Jean Thioulouse. 2002. “Use and Misuse of Correspondence Analysis in Codon Usage Studies.” Nucleic Acids Research 30 (20). Oxford Univ Press: 4548–55.
Robert, Christian, and George Casella. 2009. Introducing Monte Carlo Methods with R. Springer Science & Business Media.
Marin, Jean-Michel, and Christian Robert. 2007. Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer Science & Business Media.
McElreath, Richard. 2015. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman; Hall/CRC.