2019-11-14學(xué)習(xí)使用ChIPseeker包

參考學(xué)習(xí)資料：ChIPseeker: an R package for ChIP peak Annotation, Comparison and Visualization
這個(gè)R包原來(lái)是作者設(shè)計(jì)用來(lái)注釋ChIP seq及可視化的一個(gè)工具包扯键，但是后面經(jīng)過(guò)多次更新完善了很多新的功能：

ChIP profiling
Peak Annotation
Functional enrichment analysis
ChIP peak data set comparison
Statistical testing of ChIP seq overlap
Data Mining with ChIP seq data deposited in GEO
其實(shí)我更感興趣的是后面2個(gè)項(xiàng)目锥涕，用自己的數(shù)據(jù)去測(cè)試和這個(gè)軟件分析的ChIP seq overlap，還有最后一個(gè)在GEO數(shù)據(jù)庫(kù)中“挖礦”豪嚎。

這個(gè)包可以自己新建感興趣的TxDb 對(duì)象箫攀，通過(guò)從 UCSC和BioMart數(shù)據(jù)庫(kù)提取信息然后通過(guò) R函數(shù)TxDbFromBiomart 和makeTxDbFromUCSC. TxDb 對(duì)象再拿來(lái)做peak annotation.
在R里面輸入?makeTxDbFromBiomart()即可打開(kāi)幫助文檔通過(guò)實(shí)例來(lái)練習(xí)：

以下是我從幫助文檔找到的例子较屿，換上自己感興趣的基因的轉(zhuǎn)錄本進(jìn)行練習(xí)忙厌。

## We can use listDatasets() from the biomaRt package to list the
## datasets available in the "ENSEMBL_MART_ENSEMBL" BioMart database:
library(biomaRt)
listMarts(host="www.ensembl.org")
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="www.ensembl.org")
datasets <- listDatasets(mart)
head(datasets)
subset(datasets, grepl("elegans", dataset, ignore.case=TRUE))

## Retrieve the full transcript dataset for Worm:
txdb1 <- makeTxDbFromBiomart(dataset="celegans_gene_ensembl")
txdb1

## Retrieve an incomplete transcript dataset for Human:
transcript_ids <- c(
  "ENST00000313806",
  "ENST00000443408",
  "ENST00000399272",
  "ENST00000381132",
  "ENST00000381135",
  "ENST00000620920",
  "ENST00000481448",
  "ENST00000492600"
)

if (interactive()) {
  txdb2 <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl",
                               transcript_ids=transcript_ids)
  txdb2  # note that these annotations match the GRCh38 genome assembly
}
save(txdb2, file = "RCAN1-ALLisoform.Rdata")
## ---------------------------------------------------------------------
## B. ACCESSING THE EnsemblGenomes MARTS
## ---------------------------------------------------------------------

library(biomaRt)

## Note that BioMart is not currently available for Ensembl Bacteria.

## ---------------------
## --- Ensembl Fungi ---

mart <- useMart(biomart="fungi_mart", host="fungi.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
yeast_txdb <- makeTxDbFromBiomart(biomart="fungi_mart",
                                  dataset="scerevisiae_eg_gene",
                                  host="fungi.ensembl.org")
yeast_txdb

## Note that the dataset for Yeast on Ensembl Fungi is not necessarily
## the same as on the main Ensembl database:
yeast_txdb0 <- makeTxDbFromBiomart(dataset="scerevisiae_gene_ensembl")
all(transcripts(yeast_txdb0) %in% transcripts(yeast_txdb))
all(transcripts(yeast_txdb) %in% transcripts(yeast_txdb0))

## -----------------------
## --- Ensembl Metazoa ---

## The metazoa mart is slow and at the same time it doesn't seem to
## support requests that take more than 1 min at the moment. So a call to
## biomaRt::getBM() will fail with a "Timeout was reached" error if the
## requested data takes more than 1 min to download. This unfortunately
## happens with the example below so we don't try to run it for now.

## Not run: 
mart <- useMart(biomart="metazoa_mart", host="metazoa.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
worm_txdb <- makeTxDbFromBiomart(biomart="metazoa_mart",
                                 dataset="celegans_eg_gene",
                                 host="metazoa.ensembl.org")
worm_txdb

## Note that even if the dataset for Worm on Ensembl Metazoa contains
## the same transcript as on the main Ensembl database, the transcript
## type might be annotated with slightly different terms (e.g. antisense
## vs antisense_RNA):
filter <- list(tx_name="Y71G12B.44")
transcripts(worm_txdb, filter=filter, columns=c("tx_name", "tx_type"))
transcripts(txdb1, filter=filter, columns=c("tx_name", "tx_type"))

## End(Not run)
## ----------------------
## --- Ensembl Plants ---

## Like the metazoa mart (see above), the plants mart is also slow and
## doesn't seem to support requests that take more than 1 min either.
## So we don't try to run the example below for now.

## Not run: 
mart <- useMart(biomart="plants_mart", host="plants.ensembl.org")
datasets <- listDatasets(mart)
datasets[ , 1:2]
athaliana_txdb <- makeTxDbFromBiomart(biomart="plants_mart",
                                      dataset="athaliana_eg_gene",
                                      host="plants.ensembl.org")
athaliana_txdb

## End(Not run)
## ------------------------
## --- Ensembl Protists ---

mart <- useMart(biomart="protists_mart", host="protists.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
tgondii_txdb <- makeTxDbFromBiomart(biomart="protists_mart",
                                    dataset="tgondii_eg_gene",
                                    host="protists.ensembl.org")
tgondii_txdb

## ---------------------------------------------------------------------
## C. USING AN Ensembl MIRROR
## ---------------------------------------------------------------------

## You can use the 'host' argument to access the "ENSEMBL_MART_ENSEMBL"
## BioMart database at a mirror (e.g. at uswest.ensembl.org). A gotcha
## when doing this is that the name of the database on the mirror might
## be different! We can check this with listMarts() from the biomaRt
## package:
listMarts(host="useast.ensembl.org")

## Therefore in addition to setting 'host' to "uswest.ensembl.org" we
## might also need to specify the 'biomart' argument:
if (interactive()) {
  txdb3 <- makeTxDbFromBiomart(biomart="ENSEMBL_MART_ENSEMBL",
                               dataset="hsapiens_gene_ensembl",
                               transcript_ids=transcript_ids,
                               host="useast.ensembl.org")
  txdb3
}

## ---------------------------------------------------------------------
## D. USING FILTERS
## ---------------------------------------------------------------------

## We can use listFilters() from the biomaRt package to get valid filter
## names:
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
                dataset="hsapiens_gene_ensembl",
                host="www.ensembl.org")
head(listFilters(mart))

## Retrieve transcript dataset for Ensembl gene ENSG00000011198:
my_filter <- list(ensembl_gene_id="ENSG00000011198")

if (interactive()) {
  txdb4 <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl",
                               filter=my_filter)
  txdb4
  transcripts(txdb4, columns=c("tx_id", "tx_name", "gene_id"))
  transcriptLengths(txdb4)
}

## ---------------------------------------------------------------------
## E. RETRIEVING CHROMOSOME INFORMATION ONLY
## ---------------------------------------------------------------------

chrominfo <- getChromInfoFromBiomart(dataset="celegans_gene_ensembl")
chrominfo

實(shí)例主要是通過(guò)線蟲(chóng)來(lái)演示涉馁，可能是因?yàn)檫@個(gè)生物的基因組小，數(shù)據(jù)量少的緣故峭火』傧埃總的來(lái)講還是不錯(cuò)的，后面還需要勤加練習(xí)卖丸。
這個(gè)包的可視化做的真的很棒例如用以下代碼實(shí)現(xiàn)疊加韋恩圖的效果：

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
library(clusterProfiler)
files <- getSampleFiles()
print(files)

peak <- readPeakFile(files[[4]])
peakAnno <- annotatePeak(files[[4]], tssRegion=c(-3000, 3000),
                         TxDb=txdb, annoDb="org.Hs.eg.db")

plotAnnoPie(peakAnno)
plotAnnoBar(peakAnno)
vennpie(peakAnno)
#BiocManager::install("ggupset")
#BiocManager::install("ggplot2")
#BiocManager::install("ggimage")
library(ggimage)
library(ggupset)
library(ggplot2)
upsetplot(peakAnno)
upsetplot(peakAnno, vennpie=TRUE)

疊加韋恩圖的效果

關(guān)于富集分析ChIPseeker 還提供一個(gè)函數(shù), seq2gene, 可以完成連接基因區(qū)域到基因多對(duì)多的映射. 作者考慮到host gene (exon/intron), promoter region and flanking gene from intergenic region 這些區(qū)域可能通過(guò)cis-regulation來(lái)調(diào)控纺且。這個(gè)函數(shù)的設(shè)計(jì)初衷應(yīng)該是聯(lián)系基因組上編碼和非編碼基因之間的功能預(yù)測(cè)。

因?yàn)樯镅芯款I(lǐng)域富集分析運(yùn)用廣泛稍浆，這個(gè)包的作者Y叔也開(kāi)發(fā)了一些Bioconductor packages來(lái)進(jìn)行分析這浩如煙海的數(shù)據(jù)载碌，例如可以探究是否篩選的基因與某個(gè)特殊的生物學(xué)過(guò)程是相關(guān)的分析工具：包括 DOSE(Yu et al. 2015) 做疾病富集分析 Disease Ontology, ReactomePA 做 reactome pathway, 還有鼎鼎大名的clusterProfiler(Yu et al. 2012) 做的 Gene Ontology 和 KEGG 富集分析.

下一個(gè)重點(diǎn)就是注釋了
需要先下載ReactomePA包
這個(gè)包有個(gè).db文件有點(diǎn)大，下載之前必須要配置好鏡像衅枫。否則可能導(dǎo)致失敗嫁艇。
下載后我查看了一下結(jié)果如下

Last login: Thu Nov 14 07:45:10 on console
(base) Cheng-MBP:~ chelsea$ cd /var/folders/tm/q03dw2_n18v2v6nlsw81yhc80000gn/T//RtmpLJTJI5/downloaded_packages
(base) Cheng-MBP:downloaded_packages chelsea$ ls
ChIPseeker_1.22.0.tgz     ggplot2_3.2.1.tgz         magick_2.2.tgz
ReactomePA_1.30.0.tgz     ggupset_0.1.0.tgz         reactome.db_1.70.0.tar.gz
ggimage_0.2.4.tgz         graphite_1.32.0.tgz
(base) Cheng-MBP:downloaded_packages chelsea$ ll
total 964864
-rw-r--r--  1 chelsea  staff    3671144 Nov 14 08:00 ChIPseeker_1.22.0.tgz
-rw-r--r--  1 chelsea  staff    1419743 Nov 14 11:40 ReactomePA_1.30.0.tgz
-rw-r--r--  1 chelsea  staff     476247 Nov 14 11:07 ggimage_0.2.4.tgz
-rw-r--r--  1 chelsea  staff    3973186 Nov 14 10:59 ggplot2_3.2.1.tgz
-rw-r--r--  1 chelsea  staff    1416035 Nov 14 10:57 ggupset_0.1.0.tgz
-rw-r--r--  1 chelsea  staff     844840 Nov 14 11:38 graphite_1.32.0.tgz
-rw-r--r--  1 chelsea  staff   27211445 Nov 14 11:07 magick_2.2.tgz
-rw-r--r--  1 chelsea  staff  454979427 Nov 14 11:42 reactome.db_1.70.0.tar.gz

總共有400多兆，配置鏡像相關(guān)可以參考https://mp.weixin.qq.com/s/nf2bzkBAkWp4W-KB5uBj2g

然后是注釋結(jié)果可視化

#BiocManager::install("ReactomePA")
library(ReactomePA)

pathway1 <- enrichPathway(as.data.frame(peakAnno)$geneId)
head(pathway1, 2)

gene <- seq2gene(peak, tssRegion = c(-1000, 1000), flankDistance = 3000, TxDb=txdb)
pathway2 <- enrichPathway(gene)
head(pathway2, 2)
dotplot(pathway2)

dot圖可視化

然后是對(duì)peak進(jìn)行比較分析

## promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
## tagMatrixList <- lapply(files, getTagMatrix, windows=promoter)
##
## to speed up the compilation of this vigenette, we load a precaculated tagMatrixList
data("tagMatrixList")
plotAvgProf(tagMatrixList, xlim=c(-3000, 3000))

比較結(jié)果

分面顯示多個(gè)比較結(jié)果置信區(qū)間參數(shù)設(shè)置為conf=0.95

plotAvgProf(tagMatrixList, xlim=c(-3000, 3000), conf=0.95,resample=500, facet="row")

多個(gè)比較結(jié)果展示

還可用熱圖展示：

tagHeatmap(tagMatrixList, xlim=c(-3000, 3000), color=NULL)

可能是畫(huà)布的設(shè)置有點(diǎn)問(wèn)題弦撩，沒(méi)有能顯示全面

除此之外還有對(duì)注釋后的比較

有2個(gè)函數(shù)可以接受peaks注釋后的列表信息(output of annotatePeak)：plotAnnoBar 和 plotDistToTSS .

peakAnnoList <- lapply(files, annotatePeak, TxDb=txdb,
                       tssRegion=c(-3000, 3000), verbose=FALSE)

接下來(lái)科研用plotAnnoBar函數(shù)來(lái)比較genomic annotation.

plotAnnoBar(peakAnnoList)

比較結(jié)果如下圖所示：

`plotAnnoBar`函數(shù)比對(duì)結(jié)果

R 函數(shù)plotDistToTSS 可以以用來(lái)展示與TSS 的距離數(shù)據(jù)步咪。也就是展示轉(zhuǎn)錄因子結(jié)合位點(diǎn)在TSS附近的分布情況。

plotDistToTSS(peakAnnoList)

plotDistToTSS函數(shù)可視化的結(jié)果

然后是功能分析：

genes = lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
names(genes) = sub("_", "\n", names(genes))
compKEGG <- compareCluster(geneCluster   = genes,
                           fun           = "enrichKEGG",
                           pvalueCutoff  = 0.05,
                           pAdjustMethod = "BH")
dotplot(compKEGG, showCategory = 15, title = "KEGG Pathway Enrichment Analysis")

這個(gè)功能分析還做了比較

尋找peaks和注釋的基因的overlap

genes= lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
vennplot(genes)

peaks和注釋的基因的overlap

作者認(rèn)為尋找overlap非常重要

Overlap is very important, if two ChIP experiment by two different proteins overlap in a large fraction of their peaks, they may cooperative in regulation. Calculating the overlap is only touch the surface. ChIPseeker implemented statistical methods to measure the significance of the overlap.

此外益楼，找到重疊僅僅是流于表面的東西猾漫，更重要的是這個(gè)軟件可以幫助計(jì)算顯著性点晴，這樣得到的結(jié)果才可靠。

> p <- GRanges(seqnames=c("chr1", "chr3"),
+              ranges=IRanges(start=c(1, 100), end=c(50, 130)))
> shuffle(p, TxDb=txdb)
GRanges object with 2 ranges and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chr1 227521923-227521972      *
  [2]     chr3     1960405-1960435      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
> enrichPeakOverlap(queryPeak     = files[[5]],
+                   targetPeak    = unlist(files[1:4]),
+                   TxDb          = txdb,
+                   pAdjustMethod = "BH",
+                   nShuffle      = 50,
+                   chainFile     = NULL,
+                   verbose       = FALSE)
                                                      qSample                                            tSample qLen
ARmo_0M    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz                    GSM1174480_ARmo_0M_peaks.bed.gz 1663
ARmo_1nM   GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz                   GSM1174481_ARmo_1nM_peaks.bed.gz 1663
ARmo_100nM GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz                 GSM1174482_ARmo_100nM_peaks.bed.gz 1663
CBX6_BF    GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz 1663
           tLen N_OL     pvalue   p.adjust
ARmo_0M     812    0 0.88235294 0.88235294
ARmo_1nM   2296    8 0.17647059 0.35294118
ARmo_100nM 1359    3 0.33333333 0.44444444
CBX6_BF    1331  968 0.01960784 0.07843137

可以看到結(jié)果顯示悯周，CBX6_BF分析結(jié)果是有顯著性的粒督。
作者講他們收集了目前GEO庫(kù)中17,000 bed文件，可以通過(guò)getGEOspecies()函數(shù)得到他們的物種相關(guān)總結(jié)性的信息禽翼。

也可以通過(guò)以下函數(shù)下載bed文件相關(guān)信息

> downloadGEObedFiles(genome="hg19", destDir="hg19")
There were 50 or more warnings (use warnings() to see the first 50)
> gsm <- hg19$gsm[sample(nrow(hg19), 10)]
> downloadGSMbedFiles(gsm, destDir="hg19")
There were 36 warnings (use warnings() to see them)

至于那些警告是什么我也沒(méi)有繼續(xù)探究坠陈，提示什么非零，curl.rb的問(wèn)題捐康，我完全不知道在說(shuō)啥仇矾。不過(guò)暫時(shí)也沒(méi)時(shí)間去探究了。

最后編輯于：2019.11.14 16:27:35

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

人面猴
序言：七十年代末解总，一起剝皮案震驚了整個(gè)濱河市贮匕，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌花枫，老刑警劉巖刻盐，帶你破解...
沈念sama閱讀 206,013評(píng)論 6贊 481
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場(chǎng)離奇詭異劳翰，居然都是意外死亡敦锌，警方通過(guò)查閱死者的電腦和手機(jī)，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 88,205評(píng)論 2贊 382
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門(mén)佳簸，熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)乙墙，“玉大人，你說(shuō)我怎么就攤上這事生均√耄” “怎么了？”我有些...
開(kāi)封第一講書(shū)人閱讀 152,370評(píng)論 0贊 342
道士緝兇錄：失蹤的賣(mài)姜人
文/不壞的土叔我叫張陵马胧，是天一觀的道長(zhǎng)汉买。經(jīng)常有香客問(wèn)我，道長(zhǎng)佩脊，這世上最難降的妖魔是什么蛙粘？我笑而不...
開(kāi)封第一講書(shū)人閱讀 55,168評(píng)論 1贊 278
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮威彰，結(jié)果婚禮上出牧，老公的妹妹穿的比我還像新娘。我一直安慰自己抱冷，他們只是感情好崔列，可當(dāng)我...
茶點(diǎn)故事閱讀 64,153評(píng)論 5贊 371
惡毒庶女頂嫁案：這布局不是一般人想出來(lái)的
文/花漫我一把揭開(kāi)白布梢褐。她就那樣靜靜地躺著旺遮，像睡著了一般赵讯。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上耿眉，一...
開(kāi)封第一講書(shū)人閱讀 48,954評(píng)論 1贊 283
城市分裂傳說(shuō)
那天边翼，我揣著相機(jī)與錄音，去河邊找鬼鸣剪。笑死组底，一個(gè)胖子當(dāng)著我的面吹牛，可吹牛的內(nèi)容都是我干的筐骇。我是一名探鬼主播债鸡，決...
沈念sama閱讀 38,271評(píng)論 3贊 399
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開(kāi)眼，長(zhǎng)吁一口氣：“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼铛纬！你這毒婦竟也來(lái)了厌均？” 一聲冷哼從身側(cè)響起，我...
開(kāi)封第一講書(shū)人閱讀 36,916評(píng)論 0贊 259
萬(wàn)榮殺人案實(shí)錄
序言：老撾萬(wàn)榮一對(duì)情侶失蹤告唆，失蹤者是張志新（化名）和其女友劉穎，沒(méi)想到半個(gè)月后，有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體虐秋，經(jīng)...
沈念sama閱讀 43,382評(píng)論 1贊 300
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡趋观，尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 35,877評(píng)論 2贊 323
?白月光啟示錄
正文我和宋清朗相戀三年，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了懂牧。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片侈净。...
茶點(diǎn)故事閱讀 37,989評(píng)論 1贊 333
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡，死狀恐怖僧凤，靈堂內(nèi)的尸體忽然破棺而出用狱，到底是詐尸還是另有隱情，我是刑警寧澤拼弃，帶...
沈念sama閱讀 33,624評(píng)論 4贊 322
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布夏伊，位于F島的核電站，受9級(jí)特大地震影響吻氧，放射性物質(zhì)發(fā)生泄漏溺忧。R本人自食惡果不足惜，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 39,209評(píng)論 3贊 307
男人毒藥：我在死后第九天來(lái)索命
文/蒙蒙一盯孙、第九天我趴在偏房一處隱蔽的房頂上張望鲁森。院中可真熱鬧，春花似錦振惰、人聲如沸歌溉。這莊子的主人今日做“春日...
開(kāi)封第一講書(shū)人閱讀 30,199評(píng)論 0贊 19
一樁弒父案，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽(yáng)痛垛。三九已至草慧，卻和暖如春，著一層夾襖步出監(jiān)牢的瞬間匙头，已是汗流浹背漫谷。一陣腳步聲響...
開(kāi)封第一講書(shū)人閱讀 31,418評(píng)論 1贊 260
情欲美人皮
我被黑心中介騙來(lái)泰國(guó)打工，沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留蹂析，地道東北人舔示。一個(gè)月前我還...
沈念sama閱讀 45,401評(píng)論 2贊 352
代替公主和親
正文我出身青樓，卻偏偏與公主長(zhǎng)得像电抚，于是被迫代替她去往敵國(guó)和親惕稻。傳聞我的和親對(duì)象是個(gè)殘疾皇子，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 42,700評(píng)論 2贊 345