文件格式
可以是最基本的甲基化調(diào)用文件也可以是bismark下游的BAM/SAM文件。
在這里使用的是經(jīng)過處理得到的文本文件铡原。
典型的文件應(yīng)該是這樣的
## chrBase chr base strand coverage freqC freqT
## 1 chr21.9764539 chr21 9764539 R 12 25.00 75.00
## 2 chr21.9764513 chr21 9764513 R 12 0.00 100.00
## 3 chr21.9820622 chr21 9820622 F 13 0.00 100.00
## 4 chr21.9837545 chr21 9837545 F 11 0.00 100.00
## 5 chr21.9849022 chr21 9849022 F 124 72.58 27.42
導(dǎo)入文件
library(methylKit)
file.list <- list(
"brain_1.txt",
"brain_2.txt",
"adrenal_1.txt",
"adrenal_2.txt")
myobj <- methRead(
file.list,
sample.id=list(
"brain_1",
"brain_2",
"adrenal_1",
"adrenal_2"
),
assembly="hg19",
sep = " ",
treatment=c(1,1,0,0),
context="CpG",
mincov = 10
)
1.樣本描述性統(tǒng)計(jì)
1.1甲基化百分比的分布直方圖
一般呈現(xiàn)兩邊高中間低的特點(diǎn)。要么高甲基化瘩燥,要么低甲基化奸例。
1.2甲基化覆蓋率的直方圖
如果數(shù)據(jù)受到了PCR擴(kuò)增的影響供填,那么右邊還會(huì)出現(xiàn)一個(gè)峰渣磷。
1.3根據(jù)覆蓋率對(duì)位點(diǎn)進(jìn)行篩選婿着,去除過高或者過低的位點(diǎn)
2比較分析
2.1對(duì)樣本進(jìn)行merge
meth=unite(tile, destrand=FALSE)
head(tile)
得到在所有樣本中都有cover的甲基化位點(diǎn)or甲基化區(qū)域(這里是甲基化區(qū)域)。
methylBase object with 6 rows
--------------
chr start end strand coverage1 numCs1 numTs1 coverage2 numCs2 numTs2 coverage3 numCs3 numTs3 coverage4 numCs4
1 chr1 714501 715000 * 218 3 215 204 6 198 176 5 171 168 3
2 chr1 799001 799500 * 84 78 6 245 217 28 98 68 30 144 106
3 chr1 805001 805500 * 573 26 547 1005 22 983 823 16 807 1219 28
4 chr1 809001 809500 * 76 76 0 56 54 2 46 40 6 77 68
5 chr1 839001 839500 * 132 34 98 161 52 109 157 103 54 254 170
6 chr1 839501 840000 * 281 43 238 374 56 318 327 39 288 508 73
numTs4
1 165
2 38
3 1191
4 9
5 84
6 435
--------------
sample.ids: brain_1 brain_2 adrenal_1 adrenal_2
destranded FALSE
assembly: hg19
context: CpG
treament: 1 1 0 0
resolution: region
2.2樣本相關(guān)性分析
得到相關(guān)系數(shù)圖
2.3樣本聚類信息
2.4 PCA分析(是什么醋界?)
We can also do a PCA analysis on our samples. The following function will plot a scree plot for importance of components.
We can also plot PC1 and PC2 axis and a scatter plot of our samples on those axis which will reveal how they cluster.
2.5找到差異甲基化區(qū)域
The calculateDiffMeth() function is the main function to calculate differential methylation. Depending on the sample size per each set it will either use Fisher’s exact or logistic regression to calculate P-values. P-values will be adjusted to Q-values using SLIM method (Wang, Tuominen, and Tsai 2011). If you have replicates, the function will automatically use logistic regression.
calculateDiffMeth()函數(shù)是計(jì)算差異甲基化的主要函數(shù)竟宋。根據(jù)每個(gè)組的樣本多少,它將使用Fisher精確檢驗(yàn)或邏輯回歸檢驗(yàn)來計(jì)算p值物独。使用SLIM方法將p值調(diào)整為q值(Wang, Tuominen, and Tsai 2011)袜硫。如果有重復(fù)樣本氯葬,函數(shù)將自動(dòng)使用邏輯回歸檢驗(yàn)挡篓。
這個(gè)原理我也不是很懂。
myDiff=calculateDiffMeth(meth)
# get hyper methylated bases
myDiff25p.hyper=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hyper")
# get hypo methylated bases
myDiff25p.hypo=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hypo")
# get all differentially methylated bases
myDiff25p=getMethylDiff(myDiff,difference=25,qvalue=0.01)
可視化每條染色體的低甲基化/高甲基化堿基/區(qū)域的分布:
3.注釋差異甲基化位點(diǎn)或者區(qū)域
得到差異甲基化的基因組分布
Summary of target set annotation with genic parts
Rows in target set: 12066
-----------------------
percentage of target features overlapping with annotation:
promoter exon intron intergenic
14.63 26.44 60.27 31.86
percentage of target features overlapping with annotation:
(with promoter > exon > intron precedence):
promoter exon intron intergenic
14.63 16.91 36.61 31.86
percentage of annotation boundaries with feature overlap:
promoter exon intron
3.91 1.36 3.16
summary of distances to the nearest TSS:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 2890 11562 36476 36108 1718029
得到差異甲基化位點(diǎn)的CpG島分布:
summary of target set annotation with feature annotation:
Rows in target set: 12066
----------------------------
percentage of target elements overlapping with features:
CpGi shores other
66.34 7.09 29.15
percentage of feature elements overlapping with target:
CpGi shores
15.25 1.96
得到距離TSS最近的基因名稱
target.row dist.to.feature feature.name feature.strand
57 1 -7315 uc001abu.1 +
92 2 3201 uc031pkn.1 +
64 3 -2155 uc031pko.1 +
64.1 4 -1655 uc031pko.1 +
106 5 1126 uc001ace.3 +
106.1 6 1626 uc001ace.3 +
讀取CpG島(CpGi)和CpG海岸注釋時(shí)報(bào)錯(cuò)了帚称,不過好像沒有什么關(guān)系官研,幫助文檔里也這么用。
a GenomicRangesList contatining one GRanges object for flanks and one for GRanges object for the main feature.
NOTE: This can not return a GRangesList at the moment because flanking regions do not have to have the same column name as the feature. GRangesList elements should resemble each other in the column content. We can not satisfy that criteria for the flanks
與內(nèi)含子/外顯子/啟動(dòng)子/CpG島等重疊的差異甲基化區(qū)域的百分比/數(shù)目
讀入甲基化調(diào)用文件之后會(huì)返回一個(gè)methylRawList對(duì)象闯睹。
但是當(dāng)樣本很多而且樣本的規(guī)格很大的時(shí)候戏羽,可以返回不一樣的對(duì)象
methylRawListDB, methylRawDB, methylBaseDB and methylDiffDB統(tǒng)稱為methylDB objects。
Sometimes, when dealing with multiple samples and increased sample sizes coming from genome wide bisulfite sequencing experiments, the memory of your computer might not be sufficient enough.
Therefore methylKit offers a new group of classes, that are basically pendants to the original methylKit classes with one important difference: The methylation information, which normally is internally stored as data.frame, is stored in an external bgzipped file and is indexed by tabix (Li 2011), to enable fast retrieval of records or regions. This group contains methylRawListDB, methylRawDB, methylBaseDB and methylDiffDB, let us call them methylDB objects.
We can now create a methylRawListDB object, which stores the same content as myobj from above. But the single methylRaw objects retrieve their data from the tabix-file linked under dbpath.