參考學(xué)習(xí)資料:ChIPseeker: an R package for ChIP peak Annotation, Comparison and Visualization
這個(gè)R包原來(lái)是作者設(shè)計(jì)用來(lái)注釋ChIP seq及可視化的一個(gè)工具包扯键,但是后面經(jīng)過(guò)多次更新完善了很多新的功能:
- ChIP profiling
- Peak Annotation
- Functional enrichment analysis
- ChIP peak data set comparison
- Statistical testing of ChIP seq overlap
-
Data Mining with ChIP seq data deposited in GEO
其實(shí)我更感興趣的是后面2個(gè)項(xiàng)目锥涕,用自己的數(shù)據(jù)去測(cè)試和這個(gè)軟件分析的ChIP seq overlap,還有最后一個(gè)在GEO數(shù)據(jù)庫(kù)中“挖礦”豪嚎。
這個(gè)包可以自己新建感興趣的TxDb
對(duì)象箫攀,通過(guò)從 UCSC和BioMart數(shù)據(jù)庫(kù)提取信息然后通過(guò) R函數(shù)TxDbFromBiomart
和makeTxDbFromUCSC
. TxDb
對(duì)象再拿來(lái)做peak annotation.
在R里面輸入?makeTxDbFromBiomart()
即可打開(kāi)幫助文檔通過(guò)實(shí)例來(lái)練習(xí):
以下是我從幫助文檔找到的例子较屿,換上自己感興趣的基因的轉(zhuǎn)錄本進(jìn)行練習(xí)忙厌。
## We can use listDatasets() from the biomaRt package to list the
## datasets available in the "ENSEMBL_MART_ENSEMBL" BioMart database:
library(biomaRt)
listMarts(host="www.ensembl.org")
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="www.ensembl.org")
datasets <- listDatasets(mart)
head(datasets)
subset(datasets, grepl("elegans", dataset, ignore.case=TRUE))
## Retrieve the full transcript dataset for Worm:
txdb1 <- makeTxDbFromBiomart(dataset="celegans_gene_ensembl")
txdb1
## Retrieve an incomplete transcript dataset for Human:
transcript_ids <- c(
"ENST00000313806",
"ENST00000443408",
"ENST00000399272",
"ENST00000381132",
"ENST00000381135",
"ENST00000620920",
"ENST00000481448",
"ENST00000492600"
)
if (interactive()) {
txdb2 <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl",
transcript_ids=transcript_ids)
txdb2 # note that these annotations match the GRCh38 genome assembly
}
save(txdb2, file = "RCAN1-ALLisoform.Rdata")
## ---------------------------------------------------------------------
## B. ACCESSING THE EnsemblGenomes MARTS
## ---------------------------------------------------------------------
library(biomaRt)
## Note that BioMart is not currently available for Ensembl Bacteria.
## ---------------------
## --- Ensembl Fungi ---
mart <- useMart(biomart="fungi_mart", host="fungi.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
yeast_txdb <- makeTxDbFromBiomart(biomart="fungi_mart",
dataset="scerevisiae_eg_gene",
host="fungi.ensembl.org")
yeast_txdb
## Note that the dataset for Yeast on Ensembl Fungi is not necessarily
## the same as on the main Ensembl database:
yeast_txdb0 <- makeTxDbFromBiomart(dataset="scerevisiae_gene_ensembl")
all(transcripts(yeast_txdb0) %in% transcripts(yeast_txdb))
all(transcripts(yeast_txdb) %in% transcripts(yeast_txdb0))
## -----------------------
## --- Ensembl Metazoa ---
## The metazoa mart is slow and at the same time it doesn't seem to
## support requests that take more than 1 min at the moment. So a call to
## biomaRt::getBM() will fail with a "Timeout was reached" error if the
## requested data takes more than 1 min to download. This unfortunately
## happens with the example below so we don't try to run it for now.
## Not run:
mart <- useMart(biomart="metazoa_mart", host="metazoa.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
worm_txdb <- makeTxDbFromBiomart(biomart="metazoa_mart",
dataset="celegans_eg_gene",
host="metazoa.ensembl.org")
worm_txdb
## Note that even if the dataset for Worm on Ensembl Metazoa contains
## the same transcript as on the main Ensembl database, the transcript
## type might be annotated with slightly different terms (e.g. antisense
## vs antisense_RNA):
filter <- list(tx_name="Y71G12B.44")
transcripts(worm_txdb, filter=filter, columns=c("tx_name", "tx_type"))
transcripts(txdb1, filter=filter, columns=c("tx_name", "tx_type"))
## End(Not run)
## ----------------------
## --- Ensembl Plants ---
## Like the metazoa mart (see above), the plants mart is also slow and
## doesn't seem to support requests that take more than 1 min either.
## So we don't try to run the example below for now.
## Not run:
mart <- useMart(biomart="plants_mart", host="plants.ensembl.org")
datasets <- listDatasets(mart)
datasets[ , 1:2]
athaliana_txdb <- makeTxDbFromBiomart(biomart="plants_mart",
dataset="athaliana_eg_gene",
host="plants.ensembl.org")
athaliana_txdb
## End(Not run)
## ------------------------
## --- Ensembl Protists ---
mart <- useMart(biomart="protists_mart", host="protists.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
tgondii_txdb <- makeTxDbFromBiomart(biomart="protists_mart",
dataset="tgondii_eg_gene",
host="protists.ensembl.org")
tgondii_txdb
## ---------------------------------------------------------------------
## C. USING AN Ensembl MIRROR
## ---------------------------------------------------------------------
## You can use the 'host' argument to access the "ENSEMBL_MART_ENSEMBL"
## BioMart database at a mirror (e.g. at uswest.ensembl.org). A gotcha
## when doing this is that the name of the database on the mirror might
## be different! We can check this with listMarts() from the biomaRt
## package:
listMarts(host="useast.ensembl.org")
## Therefore in addition to setting 'host' to "uswest.ensembl.org" we
## might also need to specify the 'biomart' argument:
if (interactive()) {
txdb3 <- makeTxDbFromBiomart(biomart="ENSEMBL_MART_ENSEMBL",
dataset="hsapiens_gene_ensembl",
transcript_ids=transcript_ids,
host="useast.ensembl.org")
txdb3
}
## ---------------------------------------------------------------------
## D. USING FILTERS
## ---------------------------------------------------------------------
## We can use listFilters() from the biomaRt package to get valid filter
## names:
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
dataset="hsapiens_gene_ensembl",
host="www.ensembl.org")
head(listFilters(mart))
## Retrieve transcript dataset for Ensembl gene ENSG00000011198:
my_filter <- list(ensembl_gene_id="ENSG00000011198")
if (interactive()) {
txdb4 <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl",
filter=my_filter)
txdb4
transcripts(txdb4, columns=c("tx_id", "tx_name", "gene_id"))
transcriptLengths(txdb4)
}
## ---------------------------------------------------------------------
## E. RETRIEVING CHROMOSOME INFORMATION ONLY
## ---------------------------------------------------------------------
chrominfo <- getChromInfoFromBiomart(dataset="celegans_gene_ensembl")
chrominfo
實(shí)例主要是通過(guò)線蟲(chóng)來(lái)演示涉馁,可能是因?yàn)檫@個(gè)生物的基因組小,數(shù)據(jù)量少的緣故峭火』傧埃總的來(lái)講還是不錯(cuò)的,后面還需要勤加練習(xí)卖丸。
這個(gè)包的可視化做的真的很棒例如用以下代碼實(shí)現(xiàn)疊加韋恩圖的效果:
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
library(clusterProfiler)
files <- getSampleFiles()
print(files)
peak <- readPeakFile(files[[4]])
peakAnno <- annotatePeak(files[[4]], tssRegion=c(-3000, 3000),
TxDb=txdb, annoDb="org.Hs.eg.db")
plotAnnoPie(peakAnno)
plotAnnoBar(peakAnno)
vennpie(peakAnno)
#BiocManager::install("ggupset")
#BiocManager::install("ggplot2")
#BiocManager::install("ggimage")
library(ggimage)
library(ggupset)
library(ggplot2)
upsetplot(peakAnno)
upsetplot(peakAnno, vennpie=TRUE)
關(guān)于富集分析ChIPseeker 還提供一個(gè)函數(shù), seq2gene
, 可以完成連接基因區(qū)域到基因多對(duì)多的映射. 作者考慮到host gene (exon/intron), promoter region and flanking gene from intergenic region 這些區(qū)域可能通過(guò)cis-regulation來(lái)調(diào)控纺且。 這個(gè)函數(shù)的設(shè)計(jì)初衷應(yīng)該是聯(lián)系基因組上編碼和非編碼基因之間的功能預(yù)測(cè)。
因?yàn)樯镅芯款I(lǐng)域富集分析運(yùn)用廣泛稍浆,這個(gè)包的作者Y叔也開(kāi)發(fā)了一些Bioconductor packages來(lái)進(jìn)行分析這浩如煙海的數(shù)據(jù)载碌,例如可以探究是否篩選的基因與某個(gè)特殊的生物學(xué)過(guò)程是相關(guān)的分析工具:包括 DOSE(Yu et al. 2015) 做疾病富集分析 Disease Ontology, ReactomePA 做 reactome pathway, 還有鼎鼎大名的clusterProfiler(Yu et al. 2012) 做的 Gene Ontology 和 KEGG 富集分析.
下一個(gè)重點(diǎn)就是注釋了
需要先下載ReactomePA
包
這個(gè)包有個(gè).db文件有點(diǎn)大,下載之前必須要配置好鏡像衅枫。否則可能導(dǎo)致失敗嫁艇。
下載后我查看了一下結(jié)果如下
Last login: Thu Nov 14 07:45:10 on console
(base) Cheng-MBP:~ chelsea$ cd /var/folders/tm/q03dw2_n18v2v6nlsw81yhc80000gn/T//RtmpLJTJI5/downloaded_packages
(base) Cheng-MBP:downloaded_packages chelsea$ ls
ChIPseeker_1.22.0.tgz ggplot2_3.2.1.tgz magick_2.2.tgz
ReactomePA_1.30.0.tgz ggupset_0.1.0.tgz reactome.db_1.70.0.tar.gz
ggimage_0.2.4.tgz graphite_1.32.0.tgz
(base) Cheng-MBP:downloaded_packages chelsea$ ll
total 964864
-rw-r--r-- 1 chelsea staff 3671144 Nov 14 08:00 ChIPseeker_1.22.0.tgz
-rw-r--r-- 1 chelsea staff 1419743 Nov 14 11:40 ReactomePA_1.30.0.tgz
-rw-r--r-- 1 chelsea staff 476247 Nov 14 11:07 ggimage_0.2.4.tgz
-rw-r--r-- 1 chelsea staff 3973186 Nov 14 10:59 ggplot2_3.2.1.tgz
-rw-r--r-- 1 chelsea staff 1416035 Nov 14 10:57 ggupset_0.1.0.tgz
-rw-r--r-- 1 chelsea staff 844840 Nov 14 11:38 graphite_1.32.0.tgz
-rw-r--r-- 1 chelsea staff 27211445 Nov 14 11:07 magick_2.2.tgz
-rw-r--r-- 1 chelsea staff 454979427 Nov 14 11:42 reactome.db_1.70.0.tar.gz
總共有400多兆,配置鏡像相關(guān)可以參考https://mp.weixin.qq.com/s/nf2bzkBAkWp4W-KB5uBj2g
然后是注釋結(jié)果可視化
#BiocManager::install("ReactomePA")
library(ReactomePA)
pathway1 <- enrichPathway(as.data.frame(peakAnno)$geneId)
head(pathway1, 2)
gene <- seq2gene(peak, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb=txdb)
pathway2 <- enrichPathway(gene)
head(pathway2, 2)
dotplot(pathway2)
然后是對(duì)peak進(jìn)行比較分析
## promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
## tagMatrixList <- lapply(files, getTagMatrix, windows=promoter)
##
## to speed up the compilation of this vigenette, we load a precaculated tagMatrixList
data("tagMatrixList")
plotAvgProf(tagMatrixList, xlim=c(-3000, 3000))
分面顯示多個(gè)比較結(jié)果置信區(qū)間參數(shù)設(shè)置為
conf=0.95
plotAvgProf(tagMatrixList, xlim=c(-3000, 3000), conf=0.95,resample=500, facet="row")
還可用熱圖展示:
tagHeatmap(tagMatrixList, xlim=c(-3000, 3000), color=NULL)
除此之外還有對(duì)注釋后的比較
有2個(gè)函數(shù)可以接受peaks注釋后的列表信息(output of annotatePeak):plotAnnoBar
和 plotDistToTSS
.
peakAnnoList <- lapply(files, annotatePeak, TxDb=txdb,
tssRegion=c(-3000, 3000), verbose=FALSE)
接下來(lái)科研用plotAnnoBar
函數(shù)來(lái)比較genomic annotation.
plotAnnoBar(peakAnnoList)
比較結(jié)果如下圖所示:
R 函數(shù)
plotDistToTSS
可以以用來(lái)展示與TSS 的距離數(shù)據(jù)步咪。也就是展示轉(zhuǎn)錄因子結(jié)合位點(diǎn)在TSS附近的分布情況。
plotDistToTSS(peakAnnoList)
然后是功能分析:
genes = lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
names(genes) = sub("_", "\n", names(genes))
compKEGG <- compareCluster(geneCluster = genes,
fun = "enrichKEGG",
pvalueCutoff = 0.05,
pAdjustMethod = "BH")
dotplot(compKEGG, showCategory = 15, title = "KEGG Pathway Enrichment Analysis")
尋找peaks和注釋的基因的overlap
genes= lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
vennplot(genes)
作者認(rèn)為尋找overlap非常重要
Overlap is very important, if two ChIP experiment by two different proteins overlap in a large fraction of their peaks, they may cooperative in regulation. Calculating the overlap is only touch the surface. ChIPseeker implemented statistical methods to measure the significance of the overlap.
此外益楼,找到重疊僅僅是流于表面的東西猾漫,更重要的是這個(gè)軟件可以幫助計(jì)算顯著性点晴,這樣得到的結(jié)果才可靠。
> p <- GRanges(seqnames=c("chr1", "chr3"),
+ ranges=IRanges(start=c(1, 100), end=c(50, 130)))
> shuffle(p, TxDb=txdb)
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 227521923-227521972 *
[2] chr3 1960405-1960435 *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
> enrichPeakOverlap(queryPeak = files[[5]],
+ targetPeak = unlist(files[1:4]),
+ TxDb = txdb,
+ pAdjustMethod = "BH",
+ nShuffle = 50,
+ chainFile = NULL,
+ verbose = FALSE)
qSample tSample qLen
ARmo_0M GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz GSM1174480_ARmo_0M_peaks.bed.gz 1663
ARmo_1nM GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz GSM1174481_ARmo_1nM_peaks.bed.gz 1663
ARmo_100nM GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz GSM1174482_ARmo_100nM_peaks.bed.gz 1663
CBX6_BF GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz 1663
tLen N_OL pvalue p.adjust
ARmo_0M 812 0 0.88235294 0.88235294
ARmo_1nM 2296 8 0.17647059 0.35294118
ARmo_100nM 1359 3 0.33333333 0.44444444
CBX6_BF 1331 968 0.01960784 0.07843137
可以看到結(jié)果顯示悯周,CBX6_BF分析結(jié)果是有顯著性的粒督。
作者講他們收集了目前GEO庫(kù)中17,000 bed文件,可以通過(guò)getGEOspecies()
函數(shù)得到他們的物種相關(guān)總結(jié)性的信息禽翼。
也可以通過(guò)以下函數(shù)下載bed文件相關(guān)信息
> downloadGEObedFiles(genome="hg19", destDir="hg19")
There were 50 or more warnings (use warnings() to see the first 50)
> gsm <- hg19$gsm[sample(nrow(hg19), 10)]
> downloadGSMbedFiles(gsm, destDir="hg19")
There were 36 warnings (use warnings() to see them)
至于那些警告是什么我也沒(méi)有繼續(xù)探究坠陈,提示什么非零,curl.rb的問(wèn)題捐康,我完全不知道在說(shuō)啥仇矾。不過(guò)暫時(shí)也沒(méi)時(shí)間去探究了。