導(dǎo)航:
A. 簡(jiǎn)介
B. 操作An Overview of the IRanges package
2.1Normality
2.2Lists of IRanges objects
2.3Vector Extraction
2.4Finding Overlapping Ranges
2.5Counting Overlapping Ranges
2.6Finding Neighboring Ranges
2.7Transforming Ranges
2.8Set Operations
C. 常見問題Biostars
A. 簡(jiǎn)介
IRanges
用來處理IRanges數(shù)據(jù)實(shí)例。The IRanges
function is a constructor that can be used to create IRanges instances. IRanges是Bioconductor上的基礎(chǔ)數(shù)據(jù)類型,一般認(rèn)為一段整數(shù)區(qū)間integer ranges
或者說是interval ranges
哨啃,如基因結(jié)構(gòu)中一個(gè)外顯子區(qū)域,也可用于解釋基因組上的位置問題骂租,由開始位置(start)和結(jié)束位置(end)定義。
-Data, e.g., aligned reads, ChIP peaks, SNPs, CpG islands, …
-Annotations, e.g., gene models, regulatory elements, methylated regions
-Ranges are defined by chromosome, start, end, and strand
-Often, metadata is associated with each range, e.g., quality of alignment, strength of ChIP peak
B. 用法
1.安裝并加載
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("IRanges")
library(IRanges)
2.IRanges對(duì)象操作
ir1 <- IRanges(start=1:10, width=10:1)
ir2 <- IRanges(start=1:10, end=11)
ir3 <- IRanges(end=11, width=10:1)
identical(ir1, ir2) && identical(ir1, ir3)#&&比較兩個(gè)長(zhǎng)度為1的向量,也叫標(biāo)量比較
ir <- IRanges(c(1, 8, 14, 15, 19, 34, 40),
width=c(12, 6, 6, 15, 6, 2, 7),names=letters[1:7])
class(ir)
#算術(shù)、切片、邏輯操作
start(ir)
end(ir)
width(ir)
start(ir) + 4
ir[1:4]
ir[start(ir) <= 15]
#可視化
plotRanges <- function(x, xlim = x, main = deparse(substitute(x)), col = "black", sep=0.5){
height = 1
if (is(xlim, "Ranges"))
xlim = c(min(start(xlim)), max(end(xlim)))## 得到ir整個(gè)區(qū)域的最小值和最大值
bins = disjointBins(IRanges(start(x), end(x)+1))# 從不相交的區(qū)間開始排列,建立index為1巡验,剩下的bin不相交的建立index為2,以此類推
plot.new()
plot.window(xlim, c(0, max(bins) * (height + sep)))
ybottom = bins * (sep + height) - height
rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, col = col)
title(main)
axis(1)
invisible(ybottom)
}
plotRanges(ir, col = "royalblue3")
2.1 Normality-reduce() merge redundant ranges to NormalIRanges
dir <- reduce(ir) # 合并有交集的區(qū)域,如果是測(cè)序數(shù)據(jù)片段,連接成一段序列
names(dir) <- paste0("gene",letters[24:26])
plotRanges(reduce(ir),col = "royalblue3")
2.2 Lists of IRanges objects將多個(gè)IRanges對(duì)象合并為list
rl <- IRangesList(ir, rev(ir))#rev()反向排列輸出
start(rl)
class(rl)
2.3 Vector Extraction-提取一段序列
set.seed(0) # 隨機(jī)數(shù)據(jù)
lambda <- c(rep(0.001, 4500), seq(0.001, 10, length=500),seq(10, 0.001, length=500))#seq()隨機(jī)生成數(shù)據(jù)
xVector <- rpois(1e7, lambda)# Poisson分布生成λ為lambda的1e7個(gè)數(shù)據(jù)
# 泊松分布的參數(shù)λ是單位時(shí)間(或單位面積)內(nèi)隨機(jī)事件的平均發(fā)生率碘耳。適合于描述單位時(shí)間內(nèi)隨機(jī)事件發(fā)生的次數(shù)显设。
yVector <- rpois(1e7, lambda[c(251:length(lambda), 1:250)])
xRle <- Rle(xVector) #為了在R中存放占用大容量的序列,IRanges使用run-length encoding, RLE來壓縮這些序列,如444333222就可以表示為3個(gè)4,3個(gè)3辛辨,3個(gè)2捕捂。
yRle <- Rle(yVector)
irextract <- IRanges(start=c(4501, 4901) , width=100)
xRle[irextract]
2.4 Finding Overlapping Ranges
尋找overlap是許多組學(xué)分析任務(wù)的必要環(huán)節(jié)。RNA-seq愉阎,overlaps用于細(xì)胞的活動(dòng)绞蹦,看表達(dá)量多少,鑒定不同的isoform榜旦;findOverlaps(query, subject, maxgap = -1L, minoverlap = 0L, type = c("any", "start", "end", "within", "equal"), select = c("all", "first", "last", "arbitrary"), ...)
ol <- findOverlaps(ir, dir)#找到的overlap結(jié)果儲(chǔ)存在ol這個(gè)對(duì)象中幽七,包含兩列
as.matrix(ol)
#使用accessor函數(shù)queryHits()和subjectHits()獲得具體的索引號(hào)
names(ir)[queryHits(ol)]
names(dir)[subjectHits(ol)]
#Overlaps表示的是query和subject之間的映射關(guān)系(可以理解成位置),以matrix呈現(xiàn)兩列,進(jìn)行了命名理解起來更容易。
#例如query的1號(hào)(也就是a序列)與subject的1號(hào)(也就是genex序列有重疊)溅呢;而subjectHits中有5個(gè)1出現(xiàn)澡屡,那么就是genex序列有5段分別和a,b,c,d,e序列重疊
#type參數(shù)-它默認(rèn)采用any的比對(duì)模式,也就是兩個(gè)序列只要有重疊咐旧,就計(jì)算在內(nèi)驶鹉。如果想找query完全落在subject中的序列,就需要使用within模式;
#selecte參數(shù)-如果一個(gè)query同時(shí)比對(duì)到subject的多個(gè)位置,方便選擇
2.5 Counting Overlapping Ranges
coverage()計(jì)算每個(gè)位置上的Ranges數(shù),也可理解為一定長(zhǎng)度序列中ranges的重疊深度
cov <- coverage(ir) # 按照是否相交排序分類,歸于哪一類的問題铣墨,以及每個(gè)位點(diǎn)對(duì)應(yīng)的個(gè)數(shù)
plotRanges(ir) #可視化后明顯分為三類
cov <- as.vector(cov)
mat <- cbind(seq_along(cov)-0.5, cov)
d <- diff(cov) != 0
mat <- rbind(cbind(mat[d,1]+1, mat[d,2]), mat)
mat <- mat[order(mat[,1]),]
lines(mat, col="red", lwd=4)#有數(shù)據(jù)1,2部分?jǐn)?shù)據(jù)重疊2,以此類推
axis(2)
2.6 Finding Neighboring Ranges尋找鄰近Ranges,計(jì)算距離
Nearest函數(shù)查找最近相鄰范圍(重疊為零),precede和follow函數(shù)查找特定邊上不重疊的最近相鄰范圍
qry <- IRanges(start = 6, end = 13, names = 'query')
sbj <- IRanges(start = c(2,4,18,19), end=c(4,5,21,24),names=1:4)
nearest(qry, sbj) #最近的距離
precede(qry,sbj) #尋找前面的ranges
follow(qry,sbj) #尋找后面的ranges
#也可以vertorization
qry2 <- IRanges(start=c(6,7), width=3)
nearest(qry2,sbj)
#distanceToNearest()和distance()確定兩個(gè)ranges之間的距離
qry <- IRanges(sample(seq_len(1000),5),width=10)
sbj <- IRanges(sample(seq_len(1000),5),width=10)
distanceToNearest(qry,sbj)
distance(qry,sbj) #僅返回距離
2.7 Transforming Ranges
2.7.1 Adjusting starts, ends and widths 調(diào)整開始室埋、終止、寬度
shift(ir, 10)# 表示該區(qū)域整體向前平移十個(gè)單位比如[1, 10] → [11, 20]
# 同樣的還有函數(shù) narrow, resize, flank, reflect, restrict, and threebands
# narrow(x, start=NA, end=NA, width=NA, use.names=TRUE)
narrow(ir, start=1:5, width=2) # 將原來的開始位置按照1到5逐級(jí)增加,保持寬度為2姚淆,比如原來的c(1, 2, 3, 4, 5, 6) → c(1, 3, 5, 7, 9, 6)
# resize(x, width, fix="start", use.names=TRUE, ...)
resize(ir, width = 2, fix = "start") # 保持start不變孕蝉,將寬度調(diào)整為2
resize(ir, width = 2, fix = "end") # 保持end不變,將寬度調(diào)整為2
# restrict(x, start=NA, end=NA, keep.all.ranges=FALSE, use.names=TRUE)
restrict(ir, start=2, end=3)
# threebands(x, start=NA, end=NA, width=NA)延伸了narrow函數(shù)的功能腌逢,返回3個(gè)相關(guān)范圍降淮,分別對(duì)應(yīng)到"left","middle" 和 "right",其中"middle"范圍對(duì)應(yīng)的就是narrow函數(shù)返回的范圍:
threebands(ir, start=1:5, width=2)
# arithmetic operators +, - and *
ir + seq_len(length(ir))
ir * -2 # 將區(qū)域擴(kuò)大兩倍
ir * 2 # 將區(qū)域縮小兩倍
2.7.2 Making ranges disjoint
# disjoin函數(shù)通過將IRanges對(duì)象分割為重疊Ranges相同的最寬范圍搏讶,使其不相交佳鳖。
disjoin(ir)
plotRanges(disjoin(ir))
# disjointBins將范圍劃分為多個(gè)bins,這樣每個(gè)容器中的范圍都是不相交的媒惕。返回值是bins的整數(shù)向量
disjointBins(ir)
2.7.3 Other transformations
# reflect 在一組公共引用邊界內(nèi)翻轉(zhuǎn)每個(gè)范圍系吩。
reflect(ir, IRanges(start(ir), width=width(ir)*2))
# flank返回指定寬度的范圍,一個(gè)例子是為一組基因形成啟動(dòng)子區(qū)域
flank(ir, width=seq_len(length(ir)))
# flank(x, width, start=TRUE, both=FALSE, use.names=TRUE, ...)獲得兩端部分(flank),
#比如想得到左側(cè):轉(zhuǎn)錄起始位點(diǎn)吓笙;右側(cè):轉(zhuǎn)錄終止位點(diǎn)
#http://www.bioconductor.org/help/course-materials/2014/SeattleOct2014/A01.3_BioconductorForSequenceAnalysis.html
#flank默認(rèn)是計(jì)算上游,如果要計(jì)算下游的終止位點(diǎn)設(shè)置(start=FALSE)
ir.trans <- flank(ir, width = 2, both = T) # 表示以start為中心區(qū)域向兩側(cè)擴(kuò)展2個(gè)單位淑玫,即1變?yōu)閏(-1,2)
plots <- function(x, xlim = x, main = deparse(substitute(x)), col = "black",
add = FALSE, ybottom = NULL, ...) {
require(scales)
col <- alpha(col, 0.5)
height <- 1
sep <- 0.5
if (is(xlim, "Ranges")) {
xlim <- c(min(start(xlim)), max(end(xlim)) * 1.2)
}
if (!add) {
bins <- disjointBins(IRanges(start(x), end(x) + 1))
ybottom <- bins * (sep + height) - height
par(mar = c(3, 0.5, 2.5, 0.5), mgp = c(1.5, 0.5, 0))
plot.new()
plot.window(xlim, c(0, max(bins) * (height + sep)))
}
rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, col = col,
...)
text((start(x) + end(x))/2, ybottom + height/2, 1:length(x), col = "white",
xpd = TRUE)
title(main)
axis(1)
invisible(ybottom)
}
xlim <- c(0, max(end(ir, ir.trans)) * 1.3)
ybottom <- plots(ir, col = "red")
plots(ir.trans, col = "blue", add = TRUE, ybottom = ybottom)
2.8 Set Operations集合運(yùn)算
# gap()(gap:前一個(gè)大區(qū)域的結(jié)尾到后一個(gè)大區(qū)域的開頭)
gaps(ir, start=1, end=50) # 查找1到50這個(gè)區(qū)間巾腕,ir沒有覆蓋到的區(qū)域即gap
plotRanges(gaps(ir, start=1, end=50), c(1,50))
#union并集,intersect交集,setdiff補(bǔ)集,逐對(duì)操作面睛,pintersect,psetdiff,punion,pgap
3 Vector Views
Views存儲(chǔ)一個(gè)類似向量的對(duì)象,稱為“subject”尊搬,以及一個(gè)IRanges對(duì)象叁鉴,定義subject的范圍。每一個(gè)Range都代表了基于這個(gè)主題的View佛寿。
3.1 Creating Views創(chuàng)建
#Views()基于指標(biāo)幌墓,slice()基于數(shù)字邊界。
xViews <- Views(xRle, xRle >= 1)
xViews <- slice(xRle, 1)
xRleList <- RleList(xRle, 2L * rev(xRle))
xViewsList <- slice(xRleList, 1)
3.2 Aggregating Views
#本地函數(shù)viewMaxs冀泻、viewMins常侣、viewSums和viewMeans用于描述性統(tǒng)計(jì)分析
head(viewSums(xViews))
viewSums(xViewsList)
head(viewMaxs(xViews))
viewMaxs(xViewsList)
4 Lists of Atomic Vectors
showClass("RleList")
args(IntegerList)
cIntList1 <- IntegerList(x=xVector, y=yVector)
cIntList1
sIntList2 <- IntegerList(x=xVector, y=yVector, compress=FALSE)
sIntList2
# sparse integer list整數(shù)list
xExploded <- lapply(xVector[1:5000], function(x) seq_len(x))
cIntList2 <- IntegerList(xExploded)
sIntList2 <- IntegerList(xExploded, compress=FALSE)
object.size(cIntList2)
object.size(sIntList2)
#lengths()整數(shù)向量包含每個(gè)元素的長(zhǎng)度;
#length()向量衍生對(duì)象元素的數(shù)目,list衍生的像simple list 或compressed list
length(cIntList2)
Rle(lengths(cIntList2))
#lapply/sapply循環(huán), [[元素提取,c連接
system.time(sapply(xExploded, mean))
system.time(sapply(sIntList2, mean))
system.time(sapply(cIntList2, mean))
identical(sapply(xExploded, mean), sapply(sIntList2, mean))
identical(sapply(xExploded, mean), sapply(cIntList2, mean))
#AtomicList objects support the Ops (e.g. +, ==, &), Math (e.g.log, sqrt), Math2 (e.g. round, signif), Summary (e.g. min, max, sum), and Complex (e.g. Re,Im) group generics.
xRleList > 0
yRleList <- RleList(yRle, 2L * rev(yRle))
xRleList + yRleList
sum(xRleList > 0 | yRleList > 0)
# 原子列表來自List,它們還可以使用循環(huán)函數(shù)endoapply來執(zhí)行自同態(tài)
safe.max <- function(x) { if(length(x)) max(x) else integer(0) }
endoapply(sIntList2, safe.max)
5 Session Information
sessionInfo()#查看編譯此文檔的系統(tǒng)上的輸出
C. IRanges包中常見問題梳理
這兒訪問的是biostars問答網(wǎng)站https://www.biostars.org/弹渔,這是一個(gè)專注于生物信息類的問答的網(wǎng)站胳施,隨著生物數(shù)據(jù)分析的流行,biostars人氣是越來越高肢专。通過多年的累計(jì)舞肆,“幾乎”你所遇到的技術(shù)問題,這里面都能找到答案博杖。
首先椿胯,單純的問答網(wǎng)頁(yè),清晰明了剃根,用過社交網(wǎng)站應(yīng)該一看就會(huì)哩盲;
點(diǎn)擊右上角ALL,可以看到關(guān)鍵帖子的頻數(shù);RNA-seq是熱點(diǎn)廉油,然后是R(2020-03-12)
其次镣丑,支持關(guān)鍵詞搜索。遇到問題先檢索娱两,然后再提問莺匠,也能提高學(xué)習(xí)效率;我們?cè)谒阉骺蜉斎隝Ranges十兢,得到50個(gè)結(jié)果趣竣;
我們可以看到Overlaps 提到9+,IRanges中熱點(diǎn)問題是關(guān)于Overlaps的旱物,可見很多人都跳過坑遥缕,如果你也感興趣,來參與討論吧宵呛,分享你的經(jīng)驗(yàn)或者困惑~ S4 class 提到2次单匣,要理解每個(gè)包支持的數(shù)據(jù)結(jié)構(gòu)。此外宝穗,通過看問題户秤,深刻理解了找區(qū)域,找位點(diǎn)逮矛,是IRanges最常用的功能鸡号,也是必須要掌握的。下面是我簡(jiǎn)單將問題進(jìn)行羅列:
問題:I can not load a fasta file in Rstudio.無(wú)法在Rstudio中加載Fasta文件须鼎? 大概是安裝方式出了問題鲸伴,解答讓安裝最新版解決。
答:What Is The Best Library For Handling Interval Logic? 什么是處理間隔邏輯的最佳庫(kù)晋控?
答:Calculate Mapping Density Along Chromosome Coordinates計(jì)算沿染色體坐標(biāo)的映射密度汞窗?
答:Creating fasta files to BLAST- what's that fastest way創(chuàng)建BLAST的Fasta文件-最快的方法是什么
答:How To Determine Overlaps From Coordinates如何根據(jù)坐標(biāo)確定Overlaps?
答:Finding overlapping ranges in R; 在R中找overlaps
Bioconductor - Error : Function found is not S4 generic Bioconductor-報(bào)錯(cuò)Error:發(fā)現(xiàn)的函數(shù)不是S4通用的 #重啟解決
答:The Longest Chromosome > Sizeof(Int32)最長(zhǎng)的染色體> Sizeof(Int32)
答:Finding Common Annotations In Gtf And Gff Files Of Different Origin在不同來源的GTF和GFF文件中查找通用注釋
答:How do I subset a GRanges on chromosome, region and strand?如何在染色體,區(qū)域和鏈上劃分GRanges赡译?
答:Bedtools, report feature with highest overlap if overlaps two features如果兩個(gè)功能重疊仲吏,則報(bào)表功能重疊最多。#其實(shí)有提到Bedtools可以替代上述代碼的功能捶朵,日后可以了解一下
答:Quick Programming Challenge: How Do I Calculate Reference Coverage From A Table快速編程挑戰(zhàn):如何從表中計(jì)算參考coverage
答:How To Use R To Segment Genome And Count Reads From Sequencing Data? 如何使用R分割基因組并計(jì)數(shù)測(cè)序數(shù)據(jù)中的讀數(shù)蜘矢?#Countoverlap
答:What Is The Quickest Algorithm For Range Overlap? Overlap最快的算法是什么?
A:Extract intergenic coordinate from gff從gff提取基因間坐標(biāo)
答:Tools for gene annotation基因注釋工具
答:Look for SNPs that span segments with R? 尋找跨越帶有R的片段的SNP
Cannot load package "hgu133a.db"無(wú)法加載軟件包“ hgu133a.db”
A:How To Intersect A Range With Single Positions如何使區(qū)域與單個(gè)位置相交
答:How do I get ranges for first or last n nucleotides in genomicranges如何獲得基因組范圍中前n個(gè)或后n個(gè)核苷酸的范圍
答:Retrieve all the differentially methylated CpG's in a region檢索區(qū)域中所有差異甲基化的CpG
A:Found Correspondent Numbers In Integer Intervals (R)以整數(shù)間隔(R)找到對(duì)應(yīng)的號(hào)碼
I can not load a fasta file in Rstudio我無(wú)法在Rstudio中加載Fasta文件
答:Cross-Correlation On Chip-Seq Like Data? 像芯片序列數(shù)據(jù)上的互相關(guān)综看?
(by findOverlaps in R IRanges or using BEDTools).答:Remove entries in a IRangesList刪除IRangesList中的條目
Alternate Transcripts Paralogs ? A: 副記錄
A: Quick And Easy Way To Find Whether A Locus Is Falling In Which Region Of Exon一種快速簡(jiǎn)便的方法來確定一個(gè)位點(diǎn)落在外顯子的哪個(gè)區(qū)域
Representing Interaction Data Using Iranges/Granges For 5C/Hi-C Data Analysis使用Iranges / Granges表示交互數(shù)據(jù)進(jìn)行5C / Hi-C數(shù)據(jù)分析?
答:SciClone installation in R version 3.3.3 R版本3.3.3中的SciClone安裝
Error in .normargSEW0(start, "start"):'start' must be a numeric vector (or NULL).normargSEW0(start品腹,“ start”)中的錯(cuò)誤:“ start”必須是數(shù)字矢量(或NULL)
Cannot coerce class 'structure("IRanges", package = "IRanges")' into a data.frame error.無(wú)法將類'structure(“ IRanges”,package =“ IRanges”)轉(zhuǎn)換為data.frame红碑。
Calculate Gene Density Per Kb And Plot Density Over Position For All Scaffolds Of A Draft Genome Using R使用R計(jì)算草案基因組的所有支架的每Kb基因密度和位置上的圖密度
答:Multiple Samples Data And Chromosome Ideograms From Cnv Caller來自Cnv調(diào)用者的多個(gè)樣本數(shù)據(jù)和染色體表意文字
In R bioconductor, how to combine DNAString views please ?答:在R Bioconductor中舞吭,如何結(jié)合DNAString views泡垃?
SciClone installation in R version 3.3.3 R版本3.3.3中的SciClone安裝?
答:Quick Programming Challenge: Calculate Common And Unique Regions From A List Of快速編程挑戰(zhàn):從列表中計(jì)算公共和唯一區(qū)域
What is a good strategy to find hit regions from a GWAS in R?從R中的GWAS中找到命中區(qū)域的好策略是什么?
Intersecting A Set Of Bam Reads With A Set Of Coordinates A: 將一組Bam reads與一組坐標(biāo)相交
Plot average expression profile from one bed file across overlapping regions A: 跨重疊區(qū)域從一個(gè)bed文件繪制平均表達(dá)式配置文件
A:New Hg18 Genome In R R中新的Hg18基因組
A:Finding overlapping ranges in R? 在R中找到重疊范圍
CellCODE installation issues CellCODE安裝問題
How to get overlap count (in basepair) of two IRange objects?如何獲得兩個(gè)IRange對(duì)象的overlap count(以堿基對(duì)計(jì))羡鸥?
Remove entries in a IRangesList? 刪除IRangesList中的條目
答:Common genomic intervals in R ? R中常見的基因組間隔
如果還有很多疑惑蔑穴,關(guān)于Bioconductor的包接下來會(huì)一一介紹。
參考資料(方便想進(jìn)一步了解的去看源文件):
[1] Michael L., Wolfgang H., Herve′ Page`s, et al. (2013)Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8):e1003118. doi:10.1371/journal.pcbi.1003118
[2][Bioconductor基礎(chǔ)--IRanges和Range thinking]https://mp.weixin.qq.com/s?src=11×tamp=1584153503&ver=2215&signature=oxgHX99O10ky9BK8-ORjaArSjsbZkpK4st-W6E3rQMIa34Ikq1JlN6tpBgty7TDSBdF4rtRPgiLTkcZtVmD9nZojH5BR9f3RbDcZ7LeLheEMs48*KhrA09rzjzsciI&new=1
[3][Bioconductor for Sequence Analysis]http://www.bioconductor.org/help/course-materials/2014/SeattleOct2014/A01.3_BioconductorForSequenceAnalysis.html
[4]Parnell LD, Lindenbaum P, Shameer K, Dall’Olio GM, Swan DC, et al. (2011) BioStar: An Online Question & Answer Resource for the Bioinformatics Community. PLoS Comput Biol 7(10): e1002216. doi:10.1371/journal.pcbi.1002216