RIPSeeker
clp
17 June, 2020
RIPSeeker: 用于從RIP-seq實(shí)驗(yàn)中識(shí)別蛋白質(zhì)相關(guān)轉(zhuǎn)錄本的統(tǒng)計(jì)R包
RIPSeeker使用具有負(fù)二項(xiàng)分布概率的雙態(tài)HMM(two-state HMM with negative binomial emission probability)從RIP-Seq比對(duì)中推斷和區(qū)分RIP峰值誉结。雖然RIPSeeker是專門為RIP-seq數(shù)據(jù)分析量身定做的缕减,但它也提供了一套集成在這個(gè)獨(dú)立軟件包中的生物信息學(xué)工具,全面解決了從比對(duì)后處理到可視化和注釋的各種問(wèn)題。此外,還提供了一種基于規(guī)則的方法人芽,作為一個(gè)名為[rulebaseRIPSeek](http://127.0.0.1:18829/help/library/RIPSeeker/help/rulebaseRIPSeek)
的附加函數(shù)冰啃,用戶可以根據(jù)給定單端或雙端比對(duì)的自動(dòng)檢索在線Ensembl注釋,獲得RIP(和對(duì)照)中基因/轉(zhuǎn)錄本表達(dá)的RPKM/FPKM(和fold-change)蛀缝。
[ripSeek](http://127.0.0.1:18829/help/library/RIPSeeker/help/ripSeek)
的前端主功能對(duì)于大多數(shù)應(yīng)用程序來(lái)說(shuō)已經(jīng)足夠了顷链。該函數(shù)將比對(duì)獲得文件(BAM/BED/SAM)的路徑作為唯一必需的參數(shù),并輸出預(yù)測(cè)的RIP區(qū)域屈梁∴土罚可選參數(shù),用戶可以通過(guò)'cNAME'
指示控制第一文件參數(shù)列表中的哪些文件以實(shí)現(xiàn)經(jīng)驗(yàn)錯(cuò)誤發(fā)現(xiàn)率(eFDR)計(jì)算在讶。如果設(shè)置了參數(shù)'biomaRt_dataset'
和/或'goAnno'
煞抬,則ripSeek
將返回與RIP預(yù)測(cè)的基因組上下游相對(duì)應(yīng)的帶注釋的RIP預(yù)測(cè)和GO富集通路。用戶也可以通過(guò)logOddCutoff
构哺、pvalCutoff
革答、pvalAdjCutoff
、eFDRCutoff
指定統(tǒng)計(jì)顯著性分?jǐn)?shù)的閾值曙强。
下載并安裝包
rm(list = ls())
options(stringsAsFactors = F)
options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor/")
options("repos" = c(CRAN="http://mirrors.cloud.tencent.com/CRAN/"))
options(download.file.method = 'libcurl')
options(url.method='libcurl')
#BiocManager::install("RIPSeeker")
#BiocManager::install("RIPSeekerData")
library(RIPSeeker)
#> Warning: package 'Rsamtools' was built under R version 3.6.2
#?RIPSeeker
ls("package:RIPSeeker")
#> [1] "addDummyProb" "addPseudoAlignment" "annotateRIP"
#> [4] "binCount" "combineAlignGals" "combineRIP"
#> [7] "computeLogOdd" "computeRPKM" "disambiguateMultihits"
#> [10] "empiricalFDR" "evalBinSize" "exportGRanges"
#> [13] "galp2gal" "getAlignGal" "logScoreWithControl"
#> [16] "logScoreWithoutControl" "mainSeek" "mainSeekSingleChrom"
#> [19] "nbh" "nbh_chk" "nbh_em"
#> [22] "nbh_gen" "nbh_init" "nbh_vit"
#> [25] "nbh.GRanges" "nbh.integer" "nbm_chk"
#> [28] "nbm_em" "plotCoverage" "plotStrandedCoverage"
#> [31] "randindx" "ripSeek" "rulebaseRIPSeek"
#> [34] "scoreMergedBins" "seekRIP" "selectBinSize"
#> [37] "statdis" "viewRIP"
示例
library(RIPSeekerData)
extdata.dir <- system.file("extdata", package="RIPSeekerData")
bamFiles <- list.files(extdata.dir, "\\.bam$",
recursive=TRUE, full.names=TRUE)
bamFiles <- grep("PRC2/", bamFiles, value=TRUE)
查看bam文件
# Retrieve system files
# OR change it to the extdata.dir from the code chunk above
# to get RIP predictions on the full alignment data
extdata.dir <- system.file("extdata", package="RIPSeeker")
bamFiles <- list.files(extdata.dir, "\\.bam$",
recursive=TRUE, full.names=TRUE)
bamFiles <- grep("PRC2", bamFiles, value=TRUE)
# RIP alignment for Ezh2 in mESC
ripGal <- combineAlignGals(grep(pattern="SRR039214", bamFiles, value=TRUE, invert=TRUE),
reverseComplement=TRUE, genomeBuild="mm9")
# Control RIP alignments for mutant Ezh2 -/- mESC
ctlGal <- combineAlignGals(grep(pattern="SRR039214", bamFiles, value=TRUE, invert=FALSE),
reverseComplement=TRUE, genomeBuild="mm9")
ripGal
#> GAlignments object with 31054 alignments and 1 metadata column:
#> seqnames strand cigar qwidth start end
#> <Rle> <Rle> <character> <integer> <integer> <integer>
#> SRR039210.4322179 chrX + 36M 36 3000326 3000361
#> SRR039210.5524106 chrX - 36M 36 3001293 3001328
#> SRR039210.5069294 chrX - 36M 36 3002790 3002825
#> SRR039210.1476279 chrX - 36M 36 3016328 3016363
#> SRR039210.711491 chrX + 36M 36 3021161 3021196
#> ... ... ... ... ... ... ...
#> SRR039211.1427139 chrX - 36M 36 166443604 166443639
#> SRR039211.447965 chrX - 36M 36 166445241 166445276
#> SRR039211.1352670 chrX - 36M 36 166446255 166446290
#> SRR039211.918096 chrX - 36M 36 166446315 166446350
#> SRR039211.67451 chrX - 36M 36 166446621 166446656
#> width njunc | uniqueHit
#> <integer> <integer> | <logical>
#> SRR039210.4322179 36 0 | FALSE
#> SRR039210.5524106 36 0 | FALSE
#> SRR039210.5069294 36 0 | FALSE
#> SRR039210.1476279 36 0 | TRUE
#> SRR039210.711491 36 0 | FALSE
#> ... ... ... . ...
#> SRR039211.1427139 36 0 | FALSE
#> SRR039211.447965 36 0 | FALSE
#> SRR039211.1352670 36 0 | FALSE
#> SRR039211.918096 36 0 | FALSE
#> SRR039211.67451 36 0 | TRUE
#> -------
#> seqinfo: 22 sequences from mm9 genome
ctlGal
#> GAlignments object with 8276 alignments and 1 metadata column:
#> seqnames strand cigar qwidth start end
#> <Rle> <Rle> <character> <integer> <integer> <integer>
#> SRR039214.4949641 chrX + 36M 36 3028539 3028574
#> SRR039214.5310910 chrX - 35M 35 3039791 3039825
#> SRR039214.4625677 chrX - 36M 36 3084567 3084602
#> SRR039214.1854227 chrX + 36M 36 3139109 3139144
#> SRR039214.5753635 chrX + 36M 36 3139131 3139166
#> ... ... ... ... ... ... ...
#> SRR039214.904524 chrX - 32M 32 166445788 166445819
#> SRR039214.2473565 chrX - 21M 21 166445960 166445980
#> SRR039214.576581 chrX - 36M 36 166446074 166446109
#> SRR039214.3756853 chrX - 36M 36 166446781 166446816
#> SRR039214.2347055 chrX + 36M 36 166650104 166650139
#> width njunc | uniqueHit
#> <integer> <integer> | <logical>
#> SRR039214.4949641 36 0 | FALSE
#> SRR039214.5310910 35 0 | FALSE
#> SRR039214.4625677 36 0 | FALSE
#> SRR039214.1854227 36 0 | FALSE
#> SRR039214.5753635 36 0 | FALSE
#> ... ... ... . ...
#> SRR039214.904524 32 0 | FALSE
#> SRR039214.2473565 21 0 | FALSE
#> SRR039214.576581 36 0 | FALSE
#> SRR039214.3756853 36 0 | FALSE
#> SRR039214.2347055 36 0 | TRUE
#> -------
#> seqinfo: 22 sequences from mm9 genome
我們都知道RIP(RNA Binding Protein Immunoprecipitation)是RNA結(jié)合蛋白免疫共沉淀残拐。顧名思義,是進(jìn)行RNA結(jié)合蛋白(RBP)與RNA之間的研究碟嘴。利用免疫共沉淀的方法將蛋白-RNA復(fù)合物拉下來(lái)溪食,將RNA結(jié)合蛋白所結(jié)合的RNA提取出來(lái),并對(duì)其進(jìn)行高通量測(cè)序娜扇,鑒定出基因和蛋白結(jié)合位點(diǎn)错沃,那么將會(huì)對(duì)RNA與蛋白質(zhì)相互作用在調(diào)節(jié)mRNA和非編碼RNA功能方面提供研究的途徑。理想情況下雀瓢,通過(guò)RIP對(duì)目的蛋白結(jié)合的RNA進(jìn)行富集后捎废,目的蛋白結(jié)合的RNA或者目的蛋白在RNA上的結(jié)合區(qū)域,在對(duì)應(yīng)的參考基因組位置上致燥,測(cè)序reads的覆蓋度會(huì)顯著升高登疗,相對(duì)其他非結(jié)合區(qū)域形成明顯的peak
。所以嫌蚤,通過(guò)測(cè)序數(shù)據(jù)檢測(cè)peak辐益,可以獲知目的蛋白結(jié)合的RNA,以及可能的結(jié)合的區(qū)域等信息脱吱。
RIPSeeker是一個(gè)基于隱馬爾可夫模型進(jìn)行從頭分析RIP peaks預(yù)測(cè)的免費(fèi)開(kāi)源Bioconductor R包智政,有著較高的敏感性和特異性。RIPSeeker區(qū)分正負(fù)鏈箱蝠,可以鑒定鏈特異性的peak區(qū)域续捂,有利于鏈特異性非編碼RNA的鑒定垦垂。并且,RIPSeeker不局限于鑒定狹義上的peak區(qū)域牙瓢,其適用于檢測(cè)更大范圍的峰值分布劫拗,以鑒定不同長(zhǎng)度范圍分布的整條結(jié)合轉(zhuǎn)錄本。所以矾克,這也是為什么進(jìn)行RIP-seq數(shù)據(jù)分析的時(shí)候页慷,我們選擇RIPSeeker的主要原因。不同于其他的peak calling方式胁附,可以說(shuō)RIPSeeker是為RIP-seq量身定做的酒繁。
雖然RIP-seq實(shí)驗(yàn)和ChIP-seq以及RNA-seq有著相似之處,但是RIP-seq有一個(gè)最根本不同的目標(biāo)控妻,就是發(fā)現(xiàn)目的結(jié)合蛋白相關(guān)的轉(zhuǎn)錄本州袒。如上圖所示,展示了3種不同模式下的比較弓候。在ChIP-seq中郎哭,目的蛋白結(jié)合的雙鏈DNA被抗體拉下來(lái),然后進(jìn)行高通量測(cè)序弓叛。由于reads通常短于雙鏈DNA片段彰居,測(cè)序數(shù)據(jù)會(huì)顯示出一個(gè)特性,即真正的結(jié)合位點(diǎn)撰筷,會(huì)在與片段長(zhǎng)度接近的一個(gè)距離d上陈惰,產(chǎn)生正負(fù)鏈的一個(gè)對(duì)稱峰,如上圖(a)所示毕籽。而由于RNA轉(zhuǎn)錄本是單鏈的抬闯,在ChIP-seq中觀察到的雙鏈DNA的雙峰性質(zhì)就不適用于RNA-seq和RIP-seq。因此关筒,也就不適合用通過(guò)查找這種雙峰的模式來(lái)進(jìn)行RIP-seq的分析溶握。并且,針對(duì)RNA來(lái)說(shuō)蒸播,我們還需要考慮到剪切比對(duì)的形式睡榆,因?yàn)榧羟惺录拇嬖冢膊荒芨p鏈DNA的模式相同袍榆,比對(duì)reads不能直接沿著基因組延伸胀屿。
Peak calling 方法比較
RIPSeeker的作者,將RIPSeeker與各種高通量測(cè)序分析中流行的其他算法進(jìn)行了比較包雀。作者選擇了三種ChIP-seq算法宿崭,包括MACS、QuEST和HPeak才写;兩種RNA-seq算法Cufflinks+Cuffdiff和Rulebased算法和一種PAR-CLIP算法PARalyzer葡兑。
通過(guò)各方法的比較奖蔓,發(fā)現(xiàn)對(duì)于相同的數(shù)據(jù)來(lái)源,所得到的峰的個(gè)數(shù)差別很大讹堤。這可能是由于不同方法之間使用的打分規(guī)則和peak長(zhǎng)度不同所致吆鹤。MACS在ENCODE數(shù)據(jù)上表現(xiàn)出來(lái)是檢測(cè)的峰數(shù)量更多,這是由于別的方法在很大程度上會(huì)將接近的peak峰區(qū)段連接成一段較長(zhǎng)的連續(xù)區(qū)間蜕劝。針對(duì)RIPSeeker和個(gè)別其他方法來(lái)說(shuō)檀头,針對(duì)兩個(gè)生物學(xué)重復(fù)轰异,它們鑒定得到的peak重復(fù)度一般高于50%(如下圖柱狀圖灰色部分)岖沛。
為了比較不同方法之間結(jié)果異質(zhì)性虫溜,作者使用同一數(shù)據(jù)集,對(duì)任意兩種方法的結(jié)果重疊情況進(jìn)行了比較股缸。結(jié)果顯示衡楞,RIPSeeker和其他方法之間有比較好的結(jié)果重疊(一般>50%)。
基于敏感度和特異性的ROC評(píng)價(jià)顯示敦姻,RIPSeeker在大多數(shù)測(cè)試中占主導(dǎo)地位瘾境。RIPSeeker在識(shí)別信號(hào)峰方面的敏感度和特異性優(yōu)于其他方法。
為了證明RIPSeeker軟件包的實(shí)用性镰惦,作者將RIPSeeker和其他發(fā)表的6個(gè)工具迷守,作用于3個(gè)RIP-seq數(shù)據(jù)集和2個(gè)PAR-CLIP數(shù)據(jù)集⊥耄基于受試者曲線兑凿,RIPSeeker表現(xiàn)出優(yōu)越的敏感性和特異性。來(lái)自RIPSeeker鑒定所得的peaks茵瘾,在后續(xù)的生物學(xué)研究中礼华,可通過(guò)特定基因的富集情況、已發(fā)表的一些motif結(jié)果龄捡、與目的蛋白相關(guān)的典型轉(zhuǎn)錄本等卓嫂,被進(jìn)一步證實(shí)。
參考文獻(xiàn):
[1]. Li Yue,Zhao Dorothy Yanling,Greenblatt Jack F et al. RIPSeeker: a statistical package for identifying protein-associated transcripts from RIP-seq experiments.[J] .Nucleic Acids Res., 2013, 41: e94.
[2]. Zhang,Y., Liu,T., Meyer,C.A., Eeckhoute,J., Johnson,D.S.,Bernstein,B.E., Nusbaum,C., Myers,R.M., Brown,M., Li,W. et al.(2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol.,9, R137
[3]. Valouev,A., Johnson,D.S., Sundquist,A., Medina,C., Anton,E.,Batzoglou,S., Myers,R.M. and Sidow,A. (2008) Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat. Methods, 5, 829–834.
[4]. Qin,Z.S., Yu,J., Shen,J., Maher,C.A., Hu,M., KalyanaSundaram,S., Yu,J. and Chinnaiyan,A.M. (2010) HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data. BMC Bioinformatics, 11, 369.
[5]. Trapnell,C., Williams,B.A., Pertea,G., Mortazavi,A., Kwan,G., van Baren,M.J., Salzberg,S.L., Wold,B.J. and Pachter,L. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol., 28, 516–520
[6]. Roberts,A., Goff,L., Pertea,G., Kim,D., Kelley,D.R., Pimentel,H., Salzberg,S.L., Rinn,J.L., Pachter,L. and Trapnell,C. (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc., 7, 562–578
[7]. Corcoran,D.L., Georgiev,S., Mukherjee,N., Gottwein,E., Skalsky,R.L., Keene,J.D. and Ohler,U. (2011) PARalyzer: definition of RNA binding sites from PAR-CLIP short-read sequence data. Genome Biol., 12, R79
[8]. Zhao,J., Ohsumi,T.K., Kung,J.T., Ogawa,Y., Grau,D.J., Sarma,K., Song,J.J., Kingston,R.E., Borowsky,M. and Lee,J.T. (2010) Genome-wide identification of polycomb-Associated RNAs by RIP-seq. Mol. Cell, 40, 939–953