跟著官方文檔來看看GRanges這個對象

劉小澤寫于2020.8.5
今天跟著官方文檔來看看GRanges這個對象

1 前言

內(nèi)容來自:https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html

GenomicRanges是Bioconductor各個項目都在使用的基因組坐標的存儲方式,它基于IRanges 建立郁油,目前為BSgenome腕巡、Rsamtools桶蝎、ShortRead 、rtracklayer蜓陌、GenomicFeatures货矮、 GenomicAlignments对途、VariantAnnotation 等提供支持

安裝加載

if (!require("BiocManager"))install.packages("BiocManager")
if (!require("GenomicRanges"))BiocManager::install("GenomicRanges")

library(GenomicRanges)

2 GRanges: Genomic Ranges

存儲了一系列的基因組起始、終止坐標啸蜜,可以為連續(xù)的結(jié)合位點坑雅、轉(zhuǎn)錄本、外顯子等提供支持

gr <- GRanges(
    seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
    ranges = IRanges(101:110, end = 111:120, names = head(letters, 10)),
    strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
    score = 1:10,
    GC = seq(1, 0, length=10))
gr
#> GRanges object with 10 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   a     chr1   101-111      - |         1                 1
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   c     chr2   103-113      + |         3 0.777777777777778
#>   d     chr2   104-114      * |         4 0.666666666666667
#>   e     chr1   105-115      * |         5 0.555555555555556
#>   f     chr1   106-116      + |         6 0.444444444444444
#>   g     chr3   107-117      + |         7 0.333333333333333
#>   h     chr3   108-118      + |         8 0.222222222222222
#>   i     chr3   109-119      - |         9 0.111111111111111
#>   j     chr3   110-120      - |        10                 0
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome; no seqlengths

左邊是基因組坐標(染色體名稱衬横、范圍裹粤、正負鏈),右邊是meta信息(score蜂林、GC等等)

2.1 認識這個數(shù)據(jù)

其中包含多少行
names(gr)
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
# 或者
length(gr)
#> [1] 10
分別提取seqnames遥诉、ranges、strand
seqnames(gr)
#> factor-Rle of length 10 with 4 runs
#>   Lengths:    1    3    2    4
#>   Values : chr1 chr2 chr1 chr3
#> Levels(3): chr1 chr2 chr3
ranges(gr)
#> IRanges object with 10 ranges and 0 metadata columns:
#>         start       end     width
#>     <integer> <integer> <integer>
#>   a       101       111        11
#>   b       102       112        11
#>   c       103       113        11
#>   d       104       114        11
#>   e       105       115        11
#>   f       106       116        11
#>   g       107       117        11
#>   h       108       118        11
#>   i       109       119        11
#>   j       110       120        11
strand(gr)
#> factor-Rle of length 10 with 5 runs
#>   Lengths: 1 2 2 3 2
#>   Values : - + * + -
#> Levels(3): + - *
一次性提取seqnames噪叙、ranges矮锈、strand
granges(gr)
#> GRanges object with 10 ranges and 0 metadata columns:
#>     seqnames    ranges strand
#>        <Rle> <IRanges>  <Rle>
#>   a     chr1   101-111      -
#>   b     chr2   102-112      +
#>   c     chr2   103-113      +
#>   d     chr2   104-114      *
#>   e     chr1   105-115      *
#>   f     chr1   106-116      +
#>   g     chr3   107-117      +
#>   h     chr3   108-118      +
#>   i     chr3   109-119      -
#>   j     chr3   110-120      -
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome; no seqlengths
一次性提取meta信息

得到一個數(shù)據(jù)框

mcols(gr)
#> DataFrame with 10 rows and 2 columns
#>       score                GC
#>   <integer>         <numeric>
#> a         1                 1
#> b         2 0.888888888888889
#> c         3 0.777777777777778
#> d         4 0.666666666666667
#> e         5 0.555555555555556
#> f         6 0.444444444444444
#> g         7 0.333333333333333
#> h         8 0.222222222222222
#> i         9 0.111111111111111
#> j        10                 0

如果只想提取其中的某部分

mcols(gr)$score
#>  [1]  1  2  3  4  5  6  7  8  9 10
還可以加入染色體長度
seqlengths(gr) <- c(249250621, 243199373, 198022430)
# 提取
seqlengths(gr)
#>      chr1      chr2      chr3 
#> 249250621 243199373 198022430

2.2 拆分+取子集

拆分

利用split會得到一個GRangesList對象

sp <- split(gr, rep(1:2, each=5))
sp
#> GRangesList object of length 2:
#> $`1`
#> GRanges object with 5 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   a     chr1   101-111      - |         1                 1
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   c     chr2   103-113      + |         3 0.777777777777778
#>   d     chr2   104-114      * |         4 0.666666666666667
#>   e     chr1   105-115      * |         5 0.555555555555556
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
#> 
#> $`2`
#> GRanges object with 5 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   f     chr1   106-116      + |         6 0.444444444444444
#>   g     chr3   107-117      + |         7 0.333333333333333
#>   h     chr3   108-118      + |         8 0.222222222222222
#>   i     chr3   109-119      - |         9 0.111111111111111
#>   j     chr3   110-120      - |        10                 0
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome

如果要再連起來

c(sp[[1]], sp[[2]])
#> GRanges object with 10 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   a     chr1   101-111      - |         1                 1
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   c     chr2   103-113      + |         3 0.777777777777778
#>   d     chr2   104-114      * |         4 0.666666666666667
#>   e     chr1   105-115      * |         5 0.555555555555556
#>   f     chr1   106-116      + |         6 0.444444444444444
#>   g     chr3   107-117      + |         7 0.333333333333333
#>   h     chr3   108-118      + |         8 0.222222222222222
#>   i     chr3   109-119      - |         9 0.111111111111111
#>   j     chr3   110-120      - |        10                 0
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
簡單取子集

操作和向量的操作類似

gr[2:3]
#> GRanges object with 2 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   c     chr2   103-113      + |         3 0.777777777777778
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome

如果要再選取一些meta信息,操作又和數(shù)據(jù)框類似

gr[2:3, "GC"]
#> GRanges object with 2 ranges and 1 metadata column:
#>     seqnames    ranges strand |                GC
#>        <Rle> <IRanges>  <Rle> |         <numeric>
#>   b     chr2   102-112      + | 0.888888888888889
#>   c     chr2   103-113      + | 0.777777777777778
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
對子集重新賦值
# 先把全部的行拆成一個列表睁蕾,然后把第一行賦值給第二行
singles <- split(gr, names(gr))
grMod <- gr
grMod[2] <- singles[[1]]
head(grMod, n=3)
#> GRanges object with 3 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   a     chr1   101-111      - |         1                 1
#>   b     chr1   101-111      - |         1                 1
#>   c     chr2   103-113      + |         3 0.777777777777778
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
還有重復(fù)苞笨、翻轉(zhuǎn)、取特定區(qū)域等操作
# 重復(fù)
rep(singles[[2]], times = 3)
#> GRanges object with 3 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
# 翻轉(zhuǎn)
rev(gr)
#> GRanges object with 10 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   j     chr3   110-120      - |        10                 0
#>   i     chr3   109-119      - |         9 0.111111111111111
#>   h     chr3   108-118      + |         8 0.222222222222222
#>   g     chr3   107-117      + |         7 0.333333333333333
#>   f     chr1   106-116      + |         6 0.444444444444444
#>   e     chr1   105-115      * |         5 0.555555555555556
#>   d     chr2   104-114      * |         4 0.666666666666667
#>   c     chr2   103-113      + |         3 0.777777777777778
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   a     chr1   101-111      - |         1                 1
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
# 前后兩行
head(gr,n=2)
#> GRanges object with 2 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   a     chr1   101-111      - |         1                 1
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
tail(gr,n=2)
#> GRanges object with 2 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   i     chr3   109-119      - |         9 0.111111111111111
#>   j     chr3   110-120      - |        10                 0
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
# 取第2-4行
window(gr, start=2,end=4)
#> GRanges object with 3 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   c     chr2   103-113      + |         3 0.777777777777778
#>   d     chr2   104-114      * |         4 0.666666666666667
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
# 取2-3行 + 7-9行
gr[IRanges(start=c(2,7), end=c(3,9))]
#> GRanges object with 5 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   c     chr2   103-113      + |         3 0.777777777777778
#>   g     chr3   107-117      + |         7 0.333333333333333
#>   h     chr3   108-118      + |         8 0.222222222222222
#>   i     chr3   109-119      - |         9 0.111111111111111
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome

3 對區(qū)間操作

3.1 intra-range(同一個GRange中;行內(nèi)部調(diào)整)

獲取數(shù)據(jù)
g <- gr[1:3]
g <- append(g, singles[[10]])
start(g)
#> [1] 101 102 103 110
end(g)
#> [1] 111 112 113 120
width(g)
#> [1] 11 11 11 11
# 如果存在兩行包含的關(guān)系猫缭,range就會合并顯示
range(g)
#> GRanges object with 3 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1   101-111      -
#>   [2]     chr2   102-113      +
#>   [3]     chr3   110-120      -
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
g
#> GRanges object with 4 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   a     chr1   101-111      - |         1                 1
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   c     chr2   103-113      + |         3 0.777777777777778
#>   j     chr3   110-120      - |        10                 0
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
對側(cè)翼操作
# 取上游10bp(區(qū)分正負鏈)
flank(g, 10)
#> GRanges object with 4 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   a     chr1   112-121      - |         1                 1
#>   b     chr2    92-101      + |         2 0.888888888888889
#>   c     chr2    93-102      + |         3 0.777777777777778
#>   j     chr3   121-130      - |        10                 0
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
# 取下游10bp(區(qū)分正負鏈)
flank(g, 10, start=FALSE)
#> GRanges object with 4 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   a     chr1    91-100      - |         1                 1
#>   b     chr2   113-122      + |         2 0.888888888888889
#>   c     chr2   114-123      + |         3 0.777777777777778
#>   j     chr3   100-109      - |        10                 0
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
整體移動
# 全部移動5bp(不分正負鏈)
shift(g, 5)
#> GRanges object with 4 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   a     chr1   106-116      - |         1                 1
#>   b     chr2   107-117      + |         2 0.888888888888889
#>   c     chr2   108-118      + |         3 0.777777777777778
#>   j     chr3   115-125      - |        10                 0
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
直接修改區(qū)間范圍
# 全部修改為30bp(區(qū)分正負鏈葱弟;默認固定start)
resize(g, 30,fix="start")
#> GRanges object with 4 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   a     chr1    82-111      - |         1                 1
#>   b     chr2   102-131      + |         2 0.888888888888889
#>   c     chr2   103-132      + |         3 0.777777777777778
#>   j     chr3    91-120      - |        10                 0
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome

3.2 inter-range (同一個GRange中;行與行之間調(diào)整)

g
#> GRanges object with 4 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   a     chr1   101-111      - |         1                 1
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   c     chr2   103-113      + |         3 0.777777777777778
#>   j     chr3   110-120      - |        10                 0
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
將重疊區(qū)域合并
reduce(g)
#> GRanges object with 3 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1   101-111      -
#>   [2]     chr2   102-113      +
#>   [3]     chr3   110-120      -
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
得到?jīng)]有任何重疊的結(jié)果
# 如果有重疊猜丹,則單獨列出來
disjoin(g)
#> GRanges object with 5 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1   101-111      -
#>   [2]     chr2       102      +
#>   [3]     chr2   103-112      +
#>   [4]     chr2       113      +
#>   [5]     chr3   110-120      -
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
對沒有任何重疊的結(jié)果數(shù)量進行統(tǒng)計
coverage(g)
#> RleList of length 3
#> $chr1
#> integer-Rle of length 249250621 with 3 runs
#>   Lengths:       100        11 249250510
#>   Values :         0         1         0
#> 
#> $chr2
#> integer-Rle of length 243199373 with 5 runs
#>   Lengths:       101         1        10         1 243199260
#>   Values :         0         1         2         1         0
#> 
#> $chr3
#> integer-Rle of length 198022430 with 3 runs
#>   Lengths:       109        11 198022310
#>   Values :         0         1         0
看GRange對象以外的區(qū)域
# 也就是每個染色體長度除去了g包含的坐標
gaps(g)
#> GRanges object with 12 ranges and 0 metadata columns:
#>        seqnames        ranges strand
#>           <Rle>     <IRanges>  <Rle>
#>    [1]     chr1   1-249250621      +
#>    [2]     chr1         1-100      -
#>    [3]     chr1 112-249250621      -
#>    [4]     chr1   1-249250621      *
#>    [5]     chr2         1-101      +
#>    ...      ...           ...    ...
#>    [8]     chr2   1-243199373      *
#>    [9]     chr3   1-198022430      +
#>   [10]     chr3         1-109      -
#>   [11]     chr3 121-198022430      -
#>   [12]     chr3   1-198022430      *
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome

3.3 between-range (兩個GRange對象之間)

重點操作是:findOverlaps芝加、unionintersect射窒、setdiff

數(shù)據(jù)準備
g2 <- head(gr, n=2)
g
#> GRanges object with 4 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   a     chr1   101-111      - |         1                 1
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   c     chr2   103-113      + |         3 0.777777777777778
#>   j     chr3   110-120      - |        10                 0
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
g2
#> GRanges object with 2 ranges and 2 metadata columns:
#>     seqnames    ranges strand |     score                GC
#>        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
#>   a     chr1   101-111      - |         1                 1
#>   b     chr2   102-112      + |         2 0.888888888888889
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
并集
union(g, g2)
#> GRanges object with 3 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1   101-111      -
#>   [2]     chr2   102-113      +
#>   [3]     chr3   110-120      -
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
交集
intersect(g, g2)
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1   101-111      -
#>   [2]     chr2   102-112      +
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
補集
setdiff(g, g2)
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr2       113      +
#>   [2]     chr3   110-120      -
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome
findOverlaps

數(shù)據(jù)一:https://raw.githubusercontent.com/Bioconductor/BiocWorkshops/master/100_Morgan_RBiocForAll/CpGislands.Hsapiens.hg38.UCSC.bed

suppressMessages(library("rtracklayer"))
library(stringr)
fname='CpGislands.Hsapiens.hg38.UCSC.bed.txt'
# 重命名藏杖,不需要后綴.txt
aft_name <- gsub('.txt','',fname)
file.rename(fname,aft_name)
#> Warning in file.rename(fname, aft_name): cannot rename file 'CpGislands.Hsapiens.hg38.UCSC.bed.txt' to
#> 'CpGislands.Hsapiens.hg38.UCSC.bed', reason 'No such file or directory'
#> [1] FALSE
cpg <- import(aft_name)
# 原本BED文件是0-based,半開半閉區(qū)間脉顿,比如[0,10) = 0..9十個數(shù)
# 而import()函數(shù)自動將BED的類型轉(zhuǎn)成Bioconductor統(tǒng)一類型:1-based蝌麸、全閉區(qū)間,也就是將[0,10)變成了[1,10]艾疟,還是原來的十個數(shù)值
head(cpg)
#> GRanges object with 6 ranges and 1 metadata column:
#>       seqnames              ranges strand |        name
#>          <Rle>           <IRanges>  <Rle> | <character>
#>   [1]     chr1 155188537-155192004      * |    CpG:_361
#>   [2]     chr1     2226774-2229734      * |    CpG:_366
#>   [3]     chr1   36306230-36307408      * |    CpG:_110
#>   [4]     chr1   47708823-47710847      * |    CpG:_164
#>   [5]     chr1   53737730-53739637      * |    CpG:_221
#>   [6]     chr1 144179072-144179313      * |     CpG:_20
#>   -------
#>   seqinfo: 254 sequences from an unspecified genome; no seqlengths
seqnames(cpg)
#> factor-Rle of length 30477 with 254 runs
#>   Lengths:                    2535                    1682 ...                       1                       2
#>   Values :                    chr1                    chr2 ... chr22_KI270735v1_random chr22_KI270738v1_random
#> Levels(254): chr1 chr2 chr3 chr4 ... chr22_KI270734v1_random chr22_KI270735v1_random chr22_KI270738v1_random
# 只對標準的染色體1-22+X+Y進行處理
cpg <- keepStandardChromosomes(cpg, pruning.mode = "coarse")
head( start(cpg) )
#> [1] 155188537   2226774  36306230  47708823  53737730 144179072
hist(log10(width(cpg)))
# 取子集
subset(cpg, seqnames %in% c("chr1", "chr2"))
#> GRanges object with 4217 ranges and 1 metadata column:
#>          seqnames              ranges strand |        name
#>             <Rle>           <IRanges>  <Rle> | <character>
#>      [1]     chr1 155188537-155192004      * |    CpG:_361
#>      [2]     chr1     2226774-2229734      * |    CpG:_366
#>      [3]     chr1   36306230-36307408      * |    CpG:_110
#>      [4]     chr1   47708823-47710847      * |    CpG:_164
#>      [5]     chr1   53737730-53739637      * |    CpG:_221
#>      ...      ...                 ...    ... .         ...
#>   [4213]     chr2 242003256-242004412      * |     CpG:_79
#>   [4214]     chr2 242006590-242010686      * |    CpG:_263
#>   [4215]     chr2 242045491-242045723      * |     CpG:_16
#>   [4216]     chr2 242046615-242047706      * |    CpG:_170
#>   [4217]     chr2 242088150-242089411      * |    CpG:_149
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths

數(shù)據(jù)二:TxDb.Hsapiens.UCSC.hg38.knownGene

library("TxDb.Hsapiens.UCSC.hg38.knownGene")
tx <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
# 如果取外顯子坐標:exons(TxDb.Hsapiens.UCSC.hg38.knownGene)
# 如果取基因坐標:genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
tx
#> GRanges object with 247541 ranges and 2 metadata columns:
#>                    seqnames        ranges strand |     tx_id           tx_name
#>                       <Rle>     <IRanges>  <Rle> | <integer>       <character>
#>        [1]             chr1   11869-14409      + |         1 ENST00000456328.2
#>        [2]             chr1   12010-13670      + |         2 ENST00000450305.2
#>        [3]             chr1   29554-31097      + |         3 ENST00000473358.1
#>        [4]             chr1   30267-31109      + |         4 ENST00000469289.1
#>        [5]             chr1   30366-30503      + |         5 ENST00000607096.1
#>        ...              ...           ...    ... .       ...               ...
#>   [247537] chrUn_GL000220v1 155997-156149      + |    247537 ENST00000619779.1
#>   [247538] chrUn_KI270442v1 380608-380726      + |    247538 ENST00000620265.1
#>   [247539] chrUn_KI270442v1 217250-217401      - |    247539 ENST00000611690.1
#>   [247540] chrUn_KI270744v1   51009-51114      - |    247540 ENST00000616830.1
#>   [247541] chrUn_KI270750v1 148668-148843      + |    247541 ENST00000612925.1
#>   -------
#>   seqinfo: 595 sequences (1 circular) from hg38 genome

# 同樣也是保留標準染色體信息
tx <- keepStandardChromosomes(tx, pruning.mode="coarse")
seqnames(tx)
#> factor-Rle of length 227462 with 25 runs
#>   Lengths: 19915 16586 14251  9364 10786 10427 10798  9307 ...  4437 13623  5403  2925  4920  6970   978    37
#>   Values :  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8 ... chr18 chr19 chr20 chr21 chr22  chrX  chrY  chrM
#> Levels(25): chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 ... chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chrM

開始取交集

findOverlaps(tx,cpg)
#> Hits object with 140077 hits and 0 metadata columns:
#>            queryHits subjectHits
#>            <integer>   <integer>
#>        [1]         3          10
#>        [2]        32          18
#>        [3]        33          18
#>        [4]        34          18
#>        [5]        35          18
#>        ...       ...         ...
#>   [140073]    227385       13824
#>   [140074]    227386       13823
#>   [140075]    227386       13824
#>   [140076]    227419       13829
#>   [140077]    227424       13831
#>   -------
#>   queryLength: 227462 / subjectLength: 27949

# 統(tǒng)計每個轉(zhuǎn)錄本上有多少CpG位點
olaps <- countOverlaps(tx, cpg)
length(olaps)
#> [1] 227462
table(olaps)
#> olaps
#>      0      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15 
#> 126181  81273  12621   3804   1557    713    427    266    175     96     68     42     43     33     25     22 
#>     16     17     18     19     20     21     22     23     24     25     26     27     28     29     30     31 
#>     20      7     14      7      7      3      7      6      3      1      4      1      3      1      6      5 
#>     32     33     34     35     36     37     38     63     65     71 
#>      3      3      1      2      3      2      1      4      1      1

# 添加結(jié)果到GRanges
tx$cpgOverlaps <- olaps
tx
#> GRanges object with 227462 ranges and 3 metadata columns:
#>            seqnames      ranges strand |     tx_id           tx_name cpgOverlaps
#>               <Rle>   <IRanges>  <Rle> | <integer>       <character>   <integer>
#>        [1]     chr1 11869-14409      + |         1 ENST00000456328.2           0
#>        [2]     chr1 12010-13670      + |         2 ENST00000450305.2           0
#>        [3]     chr1 29554-31097      + |         3 ENST00000473358.1           1
#>        [4]     chr1 30267-31109      + |         4 ENST00000469289.1           0
#>        [5]     chr1 30366-30503      + |         5 ENST00000607096.1           0
#>        ...      ...         ...    ... .       ...               ...         ...
#>   [227458]     chrM   5826-5891      - |    227458 ENST00000387409.1           0
#>   [227459]     chrM   7446-7514      - |    227459 ENST00000387416.2           0
#>   [227460]     chrM 14149-14673      - |    227460 ENST00000361681.2           0
#>   [227461]     chrM 14674-14742      - |    227461 ENST00000387459.1           0
#>   [227462]     chrM 15956-16023      - |    227462 ENST00000387461.2           0
#>   -------
#>   seqinfo: 25 sequences (1 circular) from hg38 genome
# 根據(jù)這個結(jié)果取子集
subset(tx, cpgOverlaps > 10)
#> GRanges object with 281 ranges and 3 metadata columns:
#>         seqnames            ranges strand |     tx_id           tx_name cpgOverlaps
#>            <Rle>         <IRanges>  <Rle> | <integer>       <character>   <integer>
#>     [1]     chr1   2050411-2185395      + |       292 ENST00000378567.8          15
#>     [2]     chr1   2050485-2146108      + |       293 ENST00000468310.5          11
#>     [3]     chr1   2073462-2185390      + |       298 ENST00000400921.6          13
#>     [4]     chr1   2073986-2185190      + |       300 ENST00000461106.6          13
#>     [5]     chr1   3069168-3434342      + |       382 ENST00000511072.5          32
#>     ...      ...               ...    ... .       ...               ...         ...
#>   [277]     chrX 40051246-40177329      - |    223668 ENST00000378455.8          11
#>   [278]     chrX 40051248-40177329      - |    223669 ENST00000342274.8          11
#>   [279]     chrX 40062955-40177083      - |    223675 ENST00000406200.3          11
#>   [280]     chrY     333933-386907      - |    226968 ENST00000390665.9          14
#>   [281]     chrY     344896-386955      - |    226974 ENST00000445792.7          12
#>   -------
#>   seqinfo: 25 sequences (1 circular) from hg38 genome

4 更復(fù)雜的GRangesList

多個GRanges對象放到一個列表組成了 GRangesList

4.1 構(gòu)建

gr1 <- GRanges(
    seqnames = "chr2",
    ranges = IRanges(103, 106),
    strand = "+",
    score = 5L, GC = 0.45)
gr2 <- GRanges(
    seqnames = c("chr1", "chr1"),
    ranges = IRanges(c(107, 113), width = 3),
    strand = c("+", "-"),
    score = 3:4, GC = c(0.3, 0.5))
grl <- GRangesList("txA" = gr1, "txB" = gr2)
grl
#> GRangesList object of length 2:
#> $txA
#> GRanges object with 1 range and 2 metadata columns:
#>       seqnames    ranges strand |     score        GC
#>          <Rle> <IRanges>  <Rle> | <integer> <numeric>
#>   [1]     chr2   103-106      + |         5      0.45
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 
#> $txB
#> GRanges object with 2 ranges and 2 metadata columns:
#>       seqnames    ranges strand |     score        GC
#>          <Rle> <IRanges>  <Rle> | <integer> <numeric>
#>   [1]     chr1   107-109      + |         3       0.3
#>   [2]     chr1   113-115      - |         4       0.5
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths

4.2 基本操作

seqnames(grl)
#> RleList of length 2
#> $txA
#> factor-Rle of length 1 with 1 run
#>   Lengths:    1
#>   Values : chr2
#> Levels(2): chr2 chr1
#> 
#> $txB
#> factor-Rle of length 2 with 1 run
#>   Lengths:    2
#>   Values : chr1
#> Levels(2): chr2 chr1
ranges(grl)
#> IRangesList object of length 2:
#> $txA
#> IRanges object with 1 range and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]       103       106         4
#> 
#> $txB
#> IRanges object with 2 ranges and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]       107       109         3
#>   [2]       113       115         3
strand(grl)
#> RleList of length 2
#> $txA
#> factor-Rle of length 1 with 1 run
#>   Lengths: 1
#>   Values : +
#> Levels(3): + - *
#> 
#> $txB
#> factor-Rle of length 2 with 2 runs
#>   Lengths: 1 1
#>   Values : + -
#> Levels(3): + - *
length(grl)
#> [1] 2
names(grl)
#> [1] "txA" "txB"
seqlengths(grl)
#> chr2 chr1 
#>   NA   NA
isEmpty(grl)
#> [1] FALSE

# 提取meta信息来吩,需要先unlist
mcols(grl) #沒結(jié)果
#> DataFrame with 2 rows and 0 columns
mcols(unlist(grl))
#> DataFrame with 3 rows and 2 columns
#>         score        GC
#>     <integer> <numeric>
#> txA         5      0.45
#> txB         3       0.3
#> txB         4       0.5

4.3 列表的循環(huán)操作

lapply(grl, length)
#> $txA
#> [1] 1
#> 
#> $txB
#> [1] 2
sapply(grl, length)
#> txA txB 
#>   1   2

還可以先unlist,計算完重新list

gr <- unlist(grl)
gr$log_score <- log(gr$score)
grl <- relist(gr, grl)
grl
#> GRangesList object of length 2:
#> $txA
#> GRanges object with 1 range and 3 metadata columns:
#>       seqnames    ranges strand |     score        GC       log_score
#>          <Rle> <IRanges>  <Rle> | <integer> <numeric>       <numeric>
#>   txA     chr2   103-106      + |         5      0.45 1.6094379124341
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 
#> $txB
#> GRanges object with 2 ranges and 3 metadata columns:
#>       seqnames    ranges strand |     score        GC        log_score
#>          <Rle> <IRanges>  <Rle> | <integer> <numeric>        <numeric>
#>   txB     chr1   107-109      + |         3       0.3 1.09861228866811
#>   txB     chr1   113-115      - |         4       0.5 1.38629436111989
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths

歡迎關(guān)注我們的公眾號~_~  
我們是兩個農(nóng)轉(zhuǎn)生信的小碩蔽莱,打造生信星球弟疆,想讓它成為一個不拽術(shù)語、通俗易懂的生信知識平臺盗冷。需要幫助或提出意見請后臺留言或發(fā)送郵件到jieandze1314@gmail.com

Welcome to our bioinfoplanet!

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末怠苔,一起剝皮案震驚了整個濱河市,隨后出現(xiàn)的幾起案子仪糖,更是在濱河造成了極大的恐慌柑司,老刑警劉巖,帶你破解...
    沈念sama閱讀 206,013評論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件锅劝,死亡現(xiàn)場離奇詭異攒驰,居然都是意外死亡,警方通過查閱死者的電腦和手機故爵,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,205評論 2 382
  • 文/潘曉璐 我一進店門讼育,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人稠集,你說我怎么就攤上這事奶段。” “怎么了剥纷?”我有些...
    開封第一講書人閱讀 152,370評論 0 342
  • 文/不壞的土叔 我叫張陵痹籍,是天一觀的道長。 經(jīng)常有香客問我晦鞋,道長蹲缠,這世上最難降的妖魔是什么棺克? 我笑而不...
    開封第一講書人閱讀 55,168評論 1 278
  • 正文 為了忘掉前任,我火速辦了婚禮线定,結(jié)果婚禮上娜谊,老公的妹妹穿的比我還像新娘。我一直安慰自己斤讥,他們只是感情好纱皆,可當(dāng)我...
    茶點故事閱讀 64,153評論 5 371
  • 文/花漫 我一把揭開白布。 她就那樣靜靜地躺著芭商,像睡著了一般派草。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上铛楣,一...
    開封第一講書人閱讀 48,954評論 1 283
  • 那天近迁,我揣著相機與錄音,去河邊找鬼簸州。 笑死鉴竭,一個胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的岸浑。 我是一名探鬼主播搏存,決...
    沈念sama閱讀 38,271評論 3 399
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢啊……” “哼助琐!你這毒婦竟也來了祭埂?” 一聲冷哼從身側(cè)響起面氓,我...
    開封第一講書人閱讀 36,916評論 0 259
  • 序言:老撾萬榮一對情侶失蹤兵钮,失蹤者是張志新(化名)和其女友劉穎,沒想到半個月后舌界,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體掘譬,經(jīng)...
    沈念sama閱讀 43,382評論 1 300
  • 正文 獨居荒郊野嶺守林人離奇死亡,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 35,877評論 2 323
  • 正文 我和宋清朗相戀三年呻拌,在試婚紗的時候發(fā)現(xiàn)自己被綠了葱轩。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點故事閱讀 37,989評論 1 333
  • 序言:一個原本活蹦亂跳的男人離奇死亡藐握,死狀恐怖靴拱,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情猾普,我是刑警寧澤袜炕,帶...
    沈念sama閱讀 33,624評論 4 322
  • 正文 年R本政府宣布,位于F島的核電站初家,受9級特大地震影響偎窘,放射性物質(zhì)發(fā)生泄漏乌助。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點故事閱讀 39,209評論 3 307
  • 文/蒙蒙 一陌知、第九天 我趴在偏房一處隱蔽的房頂上張望他托。 院中可真熱鬧,春花似錦仆葡、人聲如沸赏参。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,199評論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽登刺。三九已至,卻和暖如春嗡呼,著一層夾襖步出監(jiān)牢的瞬間纸俭,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 31,418評論 1 260
  • 我被黑心中介騙來泰國打工南窗, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留揍很,地道東北人。 一個月前我還...
    沈念sama閱讀 45,401評論 2 352
  • 正文 我出身青樓万伤,卻偏偏與公主長得像窒悔,于是被迫代替她去往敵國和親。 傳聞我的和親對象是個殘疾皇子敌买,可洞房花燭夜當(dāng)晚...
    茶點故事閱讀 42,700評論 2 345