一僻孝、 基本介紹
ChIPseeker可以用來對(duì)ChIP-seq數(shù)據(jù)進(jìn)行注釋與可視化搓译。
二炉媒、 使用方法
(1) 安裝并載入R包
ls() #列出當(dāng)前工作空間中的對(duì)象
rm(list=ls()) #移除全部對(duì)象
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
options(BioC_mirror="http://mirrors.tuna.tsinghua.edu.cn/bioconductor")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
#BiocManager::install("ChIPpeakAnno")
BiocManager::install("ChIPseeker")
BiocManager::install("TxDb.Dmelanogaster.UCSC.dm6.ensGene")
BiocManager::install("org.Dm.eg.db")
#library(ChIPpeakAnno)
library(ChIPseeker)
library(TxDb.Dmelanogaster.UCSC.dm6.ensGene)
library(org.Dm.eg.db)
library(ggplot2)
library(clusterProfiler)
(2) bed文件處理
$ ls 11_intersect/*.bed | while read id ; do (cat $id | awk '{if($1=="2L"||$1=="2R"||$1=="3L"||$1=="3R"||$1=="4"||$1=="X"||$1=="Y") {print "chr"$1"\t"$2"\t"$3"\t"$4}}' > 12_ChIPseeker/bed/$(basename $id)) ; done &
(3) 峰在整個(gè)染色體的分布
peak1 <- readPeakFile("12_ChIPseeker/bed/WTF_intersect.bed") #GRanges object
peak2 <- readPeakFile("12_ChIPseeker/bed/WTM_intersect.bed")
peaks <- list(WTF=peak1, WTM=peak2)
covplot(peak1)
covplot(peaks) + facet_grid(chr ~ .id)
covplot(peaks, chrs=c("chrX"), xlim=c(1e7,1.1e7)) + facet_grid(chr ~ .id)
(4) 峰注釋
txdb = TxDb.Dmelanogaster.UCSC.dm6.ensGene
peakAnno <- annotatePeak(peak1,
tssRegion = c(-3000, 3000), #啟動(dòng)子區(qū)域
genomicAnnotationPriority = c("Promoter", "5UTR", "3UTR", "Exon", "Intron","Downstream", "Intergenic"),
flankDistance = 5000,
#sameStrand = T,
TxDb = txdb,
annoDb = "org.Dm.eg.db")
peakAnno
peakAnno_df <- as.data.frame(peakAnno)
head(peakAnno_df)
write.csv(peakAnno_df,file="12_ChIPseeker/WTF_peakAnno.csv",row.names=F)
注釋結(jié)果:
seqnames start end width strand V4 annotation geneChr geneStart geneEnd geneLength geneStrand geneId transcriptId distanceToTSS ENTREZID SYMBOL GENENAME
chrX 138244 138423 180 * wt_msl2_peak_1 Exon (FBtr0340053/FBgn0025836, exon 9 of 14) 6 140722 142197 1476 2 FBgn0267029 FBtr0345995 3774 19834782 lncRNA:CR45473 long non-coding RNA:CR45473
這里注釋有兩種特纤,一種是genomic annotation (也就是annotation這一列)岁疼,還有就是nearest gene annotation(也就是多出來的其它列)波俄。genomic annotation注釋的是peak的位置晨逝,它落在什么地方了,可以是UTR懦铺,可以是內(nèi)含子或者外顯子捉貌。而最近基因是peak相對(duì)于轉(zhuǎn)錄起始位點(diǎn)的距離,不管這個(gè)peak是落在內(nèi)含子或者別的什么位置上冬念,即使它落在基因間區(qū)上趁窃,我都能夠找到一個(gè)離它最近的基因(即使它可能非常遠(yuǎn))。
針對(duì)不同的問題急前,這兩種注釋的策略是不一樣的醒陆。第一種策略peak所在位置,可能就是調(diào)控的根本裆针,比如你要做可變剪切的刨摩,最近基因的注釋顯然不是你關(guān)注的點(diǎn)寺晌。而做基因表達(dá)調(diào)控的,當(dāng)然promoter區(qū)域是重點(diǎn)澡刹,離結(jié)合位點(diǎn)最近的基因更有可能被調(diào)控折剃。最近基因的注釋信息雖然是以基因?yàn)閱挝唤o出,但我們針對(duì)的是轉(zhuǎn)錄起始位點(diǎn)來計(jì)算距離像屋,針對(duì)于不同的轉(zhuǎn)錄本怕犁,一個(gè)基因可能有多個(gè)轉(zhuǎn)錄起始位點(diǎn),所以注釋是在轉(zhuǎn)錄本的水平上進(jìn)行的己莺,我們可以看到輸出有一列是transcriptId奏甫。
兩種注釋有時(shí)候還不夠,我想看peak上下游某個(gè)范圍內(nèi)(比如說-5k到5k的距離)都有什么基因凌受,annotatePeak也可以做到阵子。你只要傳個(gè)參數(shù)說你要這個(gè)信息,還有什么區(qū)間內(nèi)胜蛉,就可以了挠进。輸出中多三列:flank_txIds、flank_geneIds和flank_gene_distances誊册,在指定范圍內(nèi)所有的基因都被列出领突。
最近基因給出來了,但都是各種人類不友好的ID案怯,只需要給annotatePeak傳入annoDb參數(shù)就行了君旦。如果你的TxDb的基因ID是Entrez,它會(huì)轉(zhuǎn)成ENSEMBL嘲碱,反之亦然金砍,當(dāng)然不管是那一種,都會(huì)給出SYMBOL麦锯,還有描述性的GENENAME恕稠。
The annotation column annotates genomic features of the peak, that is whether peak is located in Promoter, Exon, UTR, Intron or Intergenic. If the peak is annotated by Exon or Intron, more detail information will be provided. For instance, Exon (38885 exon 3 of 11) indicates that the peak is located at the 3th exon of gene 38885 (entrezgene ID) which contain 11 exons in total.
一個(gè)peak所在的位置,可能是一個(gè)基因的外顯子扶欣,同時(shí)又是另一個(gè)基因的內(nèi)含子鹅巍。使用參數(shù)genomicAnnotationPriority指定優(yōu)先順序,默認(rèn)順序是:Promoter => 5’ UTR => 3’ UTR => Exon => Intron => Downstream => Distal Intergenic宵蛀。
Downstream is defined as the downstream of gene end. Promoter定義為轉(zhuǎn)錄起始位點(diǎn)(TSS)的上下游區(qū)域昆著。另外下游是基因間區(qū)的一部分凯傲,更確切是指緊接著基因的下游业崖;這里的上游和下游其實(shí)都是基因間區(qū)健田,單獨(dú)拿出來是因?yàn)楹突蛑苯舆B接沧烈,是很近的區(qū)域收捣,即近端基因間區(qū)困食。當(dāng)然笼蛛,基因間區(qū)還包含更遠(yuǎn)的間區(qū)(Distal intergenic)芽唇,即遠(yuǎn)端基因間區(qū)。
(5) 比對(duì)到TSS附近的峰分布
# plot the heatmap of peaks align to flank sequences of TSS.
#和deeptools的reference-point mode差不多脓豪,但是畫的圖有點(diǎn)丑
peakHeatmap(peak1,
TxDb=txdb,
upstream=3000, downstream=3000, #指定轉(zhuǎn)錄起始位點(diǎn)上下游
color=rainbow(length(peak1)))
peakHeatmap(peaks, #list
TxDb=txdb,
upstream=3000, downstream=3000,
color=rainbow(length(peaks)))
# plot the profile of peaks that align to flank sequences of TSS.
plotAvgProf2(peak1,
TxDb=txdb,
upstream=3000, downstream=3000,
xlab="Genomic Region (5'->3')",
ylab = "Read Count Frequency",
conf = 0.95, resample = 1000)
plotAvgProf2(peaks,
TxDb=txdb,
upstream=3000, downstream=3000,
xlab="Genomic Region (5'->3')",
ylab = "Read Count Frequency",
conf = 0.95, resample = 1000) + facet_grid(. ~ .id)
(6) 峰在基因組的位置
peakAnno <- annotatePeak(peak1,
tssRegion=c(-3000, 3000),
TxDb=txdb,
annoDb="org.Dm.eg.db")
plotAnnoPie(peakAnno)
peakAnnoList <- lapply(peaks, annotatePeak,
TxDb=txdb,
tssRegion=c(-3000, 3000))
plotAnnoBar(peakAnnoList)
(7) 相對(duì)于轉(zhuǎn)錄起始點(diǎn)的距離
相對(duì)于轉(zhuǎn)錄起始位點(diǎn)的距離找最近的基因巷帝。不管peak落在內(nèi)含子、基因間區(qū)還是其他位置扫夜,按照peak相對(duì)于轉(zhuǎn)錄起始位點(diǎn)的距離楞泼,都能找到一個(gè)離它最近的基因。
plotDistToTSS(peakAnno,
title="Distribution of transcription factor-binding loci\nrelative to TSS")
plotDistToTSS(peakAnnoList,
title="Distribution of transcription factor-binding loci\nrelative to TSS")
(8) 距離最近的基因的交集和功能
genes <- lapply(peakAnnoList, function(i) as.data.frame(i)$geneId)
names(genes)