GDAS009 - Genome Annotation in Bioconductor


title: GDAS009 - Genome Annotation in Bioconductor
date: 2019-09-06 12:0:00
type: "tags"
tags:

  • Bioconductor
  • Genome annotation
  • KEGG
  • GO

categories:

  • Genomics Data Analysis Series

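Preface

This part covers working in R with human genes, introns, exons, and transcripts, along with AnnotationHub, genome annotation packages, GO analysis, and KEGG analysis. The reference at the end of these notes is the original text.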

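Basic annotation resources and their discovery

In this part we review the Bioconductor tools for handling and annotating genomic sequence. We look at reference genome sequences, transcripts and genes, and conclude with gene pathways. The ultimate aim is to use annotation information to help produce reliable interpretations of genomic experiments. A basic objective of Bioconductor is to make it easy to incorporate information on genome structure and function into statistical analysis procedures.

A hierarchy of annotation concepts

Bioconductor includes many different types of genomic annotation. These annotation resources can be understood as a hierarchy.

  • At the base is the reference genome sequence for an organism. It is always arranged into chromosomes, each specified by a linear sequence of nucleotides.
  • Above this is the organization of chromosomal sequence into regions of interest. The most prominent regions of interest are genes, but other structures such as SNPs or CpG sites are annotated as well. Genes have internal structure, with parts that are transcribed and parts that are not, and "gene models" define how these structures are labeled and laid out in genomic coordinates.
    • Within the concept of regions of interest we also identify platform-oriented annotation. This type of annotation is usually provided first by the manufacturer of an assay, and is then refined as research resolves ambiguities in, or updates, what was initially declared for the platform. The brainarray project at the University of Michigan illustrates this process for Affymetrix array annotation. We return to platform-oriented annotation at the very end of this section.
  • Above this is the organization of regions (most often genes or gene products) into groups with shared structural or functional properties. Examples include groups of genes found together in cells or identified as cooperating in biological processes (as I understand it, this is the territory of GO and KEGG analysis).

Discovering available reference genomes

Bioconductor's collection of annotation packages brings every element of this hierarchy into a programmable environment. Reference genome sequences are managed with the tools in the Biostrings and BSgenome packages, and the available.genomes function lists the reference genome builds currently available for human and various model organisms: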

library(BSgenome)  # defines available.genomes(); attaching it also attaches Biostrings
ag = available.genomes()
length(ag)
## [1] 87
head(ag)
## [1] "BSgenome.Alyrata.JGI.v1"                
## [2] "BSgenome.Amellifera.BeeBase.assembly4"  
## [3] "BSgenome.Amellifera.UCSC.apiMel2"       
## [4] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [5] "BSgenome.Athaliana.TAIR.04232008"       
## [6] "BSgenome.Athaliana.TAIR.TAIR9"
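
The package names above follow the pattern BSgenome.Organism.Provider.Build, with an optional suffix such as "masked". As a rough sketch in base R (no Bioconductor packages needed), the names can be split into those fields. The data frame and its column names (genome_info, organism, provider, build, masked) are illustrative choices, not part of BSgenome.

```r
# Package names copied from the listing above
pkgs <- c("BSgenome.Alyrata.JGI.v1",
          "BSgenome.Amellifera.UCSC.apiMel2",
          "BSgenome.Amellifera.UCSC.apiMel2.masked",
          "BSgenome.Athaliana.TAIR.TAIR9")

# Split each name on literal dots
parts <- strsplit(pkgs, ".", fixed = TRUE)

# Fields 2-4 are organism, provider, and build; a fifth field marks a masked variant
genome_info <- data.frame(
  package  = pkgs,
  organism = vapply(parts, `[`, character(1), 2),
  provider = vapply(parts, `[`, character(1), 3),
  build    = vapply(parts, `[`, character(1), 4),
  masked   = vapply(parts, function(p) length(p) > 4 && p[5] == "masked", logical(1)),
  stringsAsFactors = FALSE
)
genome_info
```

Filtering such a table by organism or build is one way to pick the package that matches the coordinate system of the data at hand.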

Reference build versions are important

The reference build for an organism is created de novo and then refined as algorithms and sequencing data improve. For humans, the Genome Reference Consortium signed off on build 37 in 2009 and on build 38 in 2013.

Once a reference build is completed, it becomes easy to perform informative genomic sequence analysis on individuals, because one can focus on the regions that are known to harbor allelic diversity.

Note that genome sequence packages have long names that include the build version. This naming is meant to prevent confusion between different builds of the reference genome. In the liftOver video we show how to use UCSC's liftOver utility, through its interface in the rtracklayer package, to convert genomic coordinates between builds.

To help users avoid mixing up data collected on the coordinate systems of different reference builds, we provide a "genome" tag that can be filled in for most objects that hold sequence information. We will see some examples shortly. Software for sequence comparison can check for compatible tags on the sequences being compared, which helps to ensure meaningful results.
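
A minimal sketch of how the genome tag can guard against mixing builds, assuming the GenomicRanges package is installed. The two ranges and the object names (peaks_hg19, peaks_hg38) are made up for illustration; only the genome<- setter and findOverlaps() are real GenomicRanges calls.

```r
library(GenomicRanges)

# Two toy ranges on the same chromosome name, tagged with different builds
peaks_hg19 <- GRanges("chr1", IRanges(start = 100, end = 200))
peaks_hg38 <- GRanges("chr1", IRanges(start = 150, end = 250))
genome(peaks_hg19) <- "hg19"
genome(peaks_hg38) <- "hg38"

# Overlap computation merges the sequence information of both objects,
# so conflicting genome tags are expected to stop with an error
result <- try(findOverlaps(peaks_hg19, peaks_hg38), silent = TRUE)
mixed_builds_rejected <- inherits(result, "try-error")
mixed_builds_rejected
```

The check only works when the tag has been set, which is why objects imported from plain text files deserve an explicit genome assignment.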

The reference sequence for H. sapiens

The reference sequence for Homo sapiens is obtained by installing and attaching a single package. The package defines an object, Hsapiens, that is the source of chromosomal sequence; when printed on its own, it reports the origin of the sequence data it contains:

library(BSgenome.Hsapiens.UCSC.hg19)
Hsapiens
## Human genome:
## # organism: Homo sapiens (Human)
## # provider: UCSC
## # provider version: hg19
## # release date: Feb. 2009
## # release name: Genome Reference Consortium GRCh37
## # 93 sequences:
## #   chr1                  chr2                  chr3                 
## #   chr4                  chr5                  chr6                 
## #   chr7                  chr8                  chr9                 
## #   chr10                 chr11                 chr12                
## #   chr13                 chr14                 chr15                
## #   ...                   ...                   ...                  
## #   chrUn_gl000235        chrUn_gl000236        chrUn_gl000237       
## #   chrUn_gl000238        chrUn_gl000239        chrUn_gl000240       
## #   chrUn_gl000241        chrUn_gl000242        chrUn_gl000243       
## #   chrUn_gl000244        chrUn_gl000245        chrUn_gl000246       
## #   chrUn_gl000247        chrUn_gl000248        chrUn_gl000249       
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)
head(genome(Hsapiens))  # see the tag
##   chr1   chr2   chr3   chr4   chr5   chr6 
## "hg19" "hg19" "hg19" "hg19" "hg19" "hg19"

We use the $ operator to obtain the sequence of chromosome 17:

Hsapiens$chr17
##   81195210-letter "DNAString" instance
## seq: AAGCTTCTCACCCTGTTCCTGCATAGATAATTGC...GGTGTGGGTGTGGTGTGTGGGTGTGGGTGTGGT

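Transcripts and genes for a reference sequence

UCSC annotation

The TxDb family of packages and data objects manages information on transcripts and gene models. Here we consider those derived from the annotation tables of the UCSC genome browser: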

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene # abbreviate
txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1

We use genes() to get the genomic addresses of genes, identified by Entrez Gene ID:

ghs = genes(txdb)
ghs
## GRanges object with 23056 ranges and 1 metadata column:
##         seqnames                 ranges strand |     gene_id
##            <Rle>              <IRanges>  <Rle> | <character>
##       1    chr19 [ 58858172,  58874214]      - |           1
##      10     chr8 [ 18248755,  18258723]      + |          10
##     100    chr20 [ 43248163,  43280376]      - |         100
##    1000    chr18 [ 25530930,  25757445]      - |        1000
##   10000     chr1 [243651535, 244006886]      - |       10000
##     ...      ...                    ...    ... .         ...
##    9991     chr9 [114979995, 115095944]      - |        9991
##    9992    chr21 [ 35736323,  35743440]      + |        9992
##    9993    chr22 [ 19023795,  19109967]      - |        9993
##    9994     chr6 [ 90539619,  90584155]      + |        9994
##    9997    chr22 [ 50961997,  50964905]      - |        9997
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Filtering is also possible with suitable identifiers. Here we extract the exons of two different genes, specified by their Entrez Gene IDs:

exons(txdb, columns=c("EXONID", "TXNAME", "GENEID"),
                  filter=list(gene_id=c(100, 101)))
## GRanges object with 39 ranges and 3 metadata columns:
##        seqnames                 ranges strand |    EXONID
##           <Rle>              <IRanges>  <Rle> | <integer>
##    [1]    chr10 [135075920, 135076737]      - |    144421
##    [2]    chr10 [135077192, 135077269]      - |    144422
##    [3]    chr10 [135080856, 135080921]      - |    144423
##    [4]    chr10 [135081433, 135081570]      - |    144424
##    [5]    chr10 [135081433, 135081622]      - |    144425
##    ...      ...                    ...    ... .       ...
##   [35]    chr20   [43254210, 43254325]      - |    256371
##   [36]    chr20   [43255097, 43255240]      - |    256372
##   [37]    chr20   [43257688, 43257810]      - |    256373
##   [38]    chr20   [43264868, 43264929]      - |    256374
##   [39]    chr20   [43280216, 43280376]      - |    256375
##                                  TXNAME          GENEID
##                         <CharacterList> <CharacterList>
##    [1] uc009ybi.3,uc010qva.2,uc021qbe.1             101
##    [2]            uc009ybi.3,uc021qbe.1             101
##    [3] uc009ybi.3,uc010qva.2,uc021qbe.1             101
##    [4]                       uc009ybi.3             101
##    [5]            uc010qva.2,uc021qbe.1             101
##    ...                              ...             ...
##   [35]                       uc002xmj.3             100
##   [36]                       uc002xmj.3             100
##   [37]                       uc002xmj.3             100
##   [38]                       uc002xmj.3             100
##   [39]                       uc002xmj.3             100
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

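ENSEMBL annotation

From the Ensembl home page: Ensembl creates, integrates, and distributes reference databases and tools for genome research. The project is hosted at the European Molecular Biology Laboratory, which has supported the interoperation of its annotation resources with Bioconductor.

The ensembldb package comes with a brief description, which reads as follows:

The ensembldb package provides functions to create and use transcript-centric annotation databases and packages. The annotation for the databases is fetched directly from Ensembl using its Perl API. The functionality and data are similar to those of the TxDb packages from the GenomicFeatures package; in addition to retrieving all gene and transcript models and annotations from the database, ensembldb provides a filter framework for retrieving annotations for specific entries, such as genes encoded in a chromosomal region or transcript models of lincRNA genes. From version 1.7 on, EnsDb databases created by ensembldb also contain protein annotation data (see Section 11 of the vignette for the database layout and an overview of the available attributes/columns). For more information on using the protein annotations, refer to the proteins vignette.

The tables held in an EnsDb database can be listed as follows: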

library(ensembldb)
library(EnsDb.Hsapiens.v75)
names(listTables(EnsDb.Hsapiens.v75))
##  [1] "gene"           "tx"             "tx2exon"        "exon"          
##  [5] "chromosome"     "protein"        "uniprot"        "protein_domain"
##  [9] "entrezgene"     "metadata"

For example:

edb = EnsDb.Hsapiens.v75  # abbreviate
txs <- transcripts(edb, filter = GenenameFilter("ZBTB16"),  # newer AnnotationFilter releases spell this GeneNameFilter()
                   columns = c("protein_id", "uniprot_id", "tx_biotype"))
txs
## GRanges object with 20 ranges and 5 metadata columns:
##                   seqnames                 ranges strand |      protein_id
##                      <Rle>              <IRanges>  <Rle> |     <character>
##   ENST00000335953       11 [113930315, 114121398]      + | ENSP00000338157
##   ENST00000335953       11 [113930315, 114121398]      + | ENSP00000338157
##   ENST00000335953       11 [113930315, 114121398]      + | ENSP00000338157
##   ENST00000335953       11 [113930315, 114121398]      + | ENSP00000338157
##   ENST00000335953       11 [113930315, 114121398]      + | ENSP00000338157
##               ...      ...                    ...    ... .             ...
##   ENST00000392996       11 [113931229, 114121374]      + | ENSP00000376721
##   ENST00000539918       11 [113935134, 114118066]      + | ENSP00000445047
##   ENST00000545851       11 [114051488, 114118018]      + |            <NA>
##   ENST00000535379       11 [114107929, 114121279]      + |            <NA>
##   ENST00000535509       11 [114117512, 114121198]      + |            <NA>
##                     uniprot_id              tx_biotype           tx_id
##                    <character>             <character>     <character>
##   ENST00000335953  ZBT16_HUMAN          protein_coding ENST00000335953
##   ENST00000335953 Q71UL7_HUMAN          protein_coding ENST00000335953
##   ENST00000335953 Q71UL6_HUMAN          protein_coding ENST00000335953
##   ENST00000335953 Q71UL5_HUMAN          protein_coding ENST00000335953
##   ENST00000335953 F5H6C3_HUMAN          protein_coding ENST00000335953
##               ...          ...                     ...             ...
##   ENST00000392996 F5H5Y7_HUMAN          protein_coding ENST00000392996
##   ENST00000539918         <NA> nonsense_mediated_decay ENST00000539918
##   ENST00000545851         <NA>    processed_transcript ENST00000545851
##   ENST00000535379         <NA>    processed_transcript ENST00000535379
##   ENST00000535509         <NA>         retained_intron ENST00000535509
##                     gene_name
##                   <character>
##   ENST00000335953      ZBTB16
##   ENST00000335953      ZBTB16
##   ENST00000335953      ZBTB16
##   ENST00000335953      ZBTB16
##   ENST00000335953      ZBTB16
##               ...         ...
##   ENST00000392996      ZBTB16
##   ENST00000539918      ZBTB16
##   ENST00000545851      ZBTB16
##   ENST00000535379      ZBTB16
##   ENST00000535509      ZBTB16
##   -------
##   seqinfo: 1 sequence from GRCh37 genome

Your data will be someone else's annotation: import/export

The ENCODE project is a good illustration of the idea that today's experiment is tomorrow's annotation. You should think of your own experiments in the same way. (Of course, for an experiment to serve as reliable and durable annotation, it must address an important question about genomic structure or function, and it must use an appropriate, properly executed protocol. ENCODE is noteworthy for linking protocols to data very explicitly.)

As an example, consider estrogen receptor (ER) binding data, published by ENCODE as narrowPeak files. At base this is ASCII text, so it can easily be imported as a set of text lines. If the record fields are sufficiently regular, the file can be imported as a table.

But we want to go beyond importing: the data should be computable objects once imported. Recognizing the connection between the narrowPeak and bedGraph formats, we can import directly into a GRanges.

To illustrate this, we find the path to the raw narrowPeak data file in the ERBS package:

f1 = dir(system.file("extdata",package="ERBS"), full.names=TRUE)[1]  # path to the first raw data file
readLines(f1, 4) # look at a few lines
## [1] "chrX\t1509354\t1512462\t5\t0\t.\t157.92\t310\t32.000000\t1991"    
## [2] "chrX\t26801421\t26802448\t6\t0\t.\t147.38\t310\t32.000000\t387"   
## [3] "chr19\t11694101\t11695359\t1\t0\t.\t99.71\t311.66\t32.000000\t861"
## [4] "chr19\t4076892\t4079276\t4\t0\t.\t84.74\t310\t32.000000\t1508"
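
Because the fields are regular, the same records can also be read as a plain table. The sketch below uses base R only and two of the lines shown above. The column names follow the ENCODE narrowPeak layout, which is an assumption about this particular file, and start1 is an illustrative name for the 1-based start.

```r
# Two records copied from the readLines() output above
np_lines <- c("chrX\t1509354\t1512462\t5\t0\t.\t157.92\t310\t32.000000\t1991",
              "chr19\t11694101\t11695359\t1\t0\t.\t99.71\t311.66\t32.000000\t861")

# Assumed field names, following the ENCODE narrowPeak column order
np_cols <- c("chrom", "chromStart", "chromEnd", "name", "score", "strand",
             "signalValue", "pValue", "qValue", "peak")

np <- read.table(text = np_lines, sep = "\t", col.names = np_cols,
                 stringsAsFactors = FALSE)

# BED-style starts are 0-based; adding 1 gives the 1-based starts used by GRanges
np$start1 <- np$chromStart + 1L
np[, c("chrom", "start1", "chromEnd", "signalValue", "peak")]
```

A table like this is readable but not yet a computable genomic object: nothing in it knows about chromosomes, strands, or overlaps, which is the gap that importing into a GRanges closes.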

The import command is straightforward:

library(rtracklayer)
imp = import(f1, format="bedGraph")
imp
## GRanges object with 1873 ranges and 7 metadata columns:
##          seqnames               ranges strand |     score       NA.
##             <Rle>            <IRanges>  <Rle> | <numeric> <integer>
##      [1]     chrX [ 1509355,  1512462]      * |         5         0
##      [2]     chrX [26801422, 26802448]      * |         6         0
##      [3]    chr19 [11694102, 11695359]      * |         1         0
##      [4]    chr19 [ 4076893,  4079276]      * |         4         0
##      [5]     chr3 [53288568, 53290767]      * |         9         0
##      ...      ...                  ...    ... .       ...       ...
##   [1869]    chr19 [11201120, 11203985]      * |      8701         0
##   [1870]    chr19 [ 2234920,  2237370]      * |       990         0
##   [1871]     chr1 [94311336, 94313543]      * |      4035         0
##   [1872]    chr19 [45690614, 45691210]      * |     10688         0
##   [1873]    chr19 [ 6110100,  6111252]      * |      2274         0
##               NA.1      NA.2      NA.3      NA.4      NA.5
##          <logical> <numeric> <numeric> <numeric> <integer>
##      [1]      <NA>    157.92       310        32      1991
##      [2]      <NA>    147.38       310        32       387
##      [3]      <NA>     99.71    311.66        32       861
##      [4]      <NA>     84.74       310        32      1508
##      [5]      <NA>      78.2   299.505        32      1772
##      ...       ...       ...       ...       ...       ...
##   [1869]      <NA>      8.65     7.281   0.26576      2496
##   [1870]      <NA>      8.65    26.258  1.995679      1478
##   [1871]      <NA>      8.65    12.511   1.47237      1848
##   [1872]      <NA>      8.65     6.205         0       298
##   [1873]      <NA>      8.65    17.356  2.013228       496
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths
genome(imp)  # genome identifier tag not set, but you should set it
##  chrX chr19  chr3 chr17  chr8 chr11 chr16  chr1  chr2  chr6  chr9  chr7 
##    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA 
##  chr5 chr12 chr20 chr21 chr22 chr18 chr10 chr14 chr15  chr4 chr13 
##    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

We obtain a GRanges in one stroke. There are some additional fields in the metadata columns whose names should be specified, but if we are interested only in the ranges, we are done, with the exception of adding the genome metadata to protect against illegitimate combination with data recorded in an incompatible coordinate system.

For communicating with other scientists or systems we have two main options. We can save the GRanges as an RData object, which is easily passed to another R user. Or we can export it in another standard format. For example, if we are interested only in the interval addresses and the binding scores, saving in "bed" format is sufficient:

export(imp, "demoex.bed")  # implicit format choice
cat(readLines("demoex.bed", n=5), sep="\n")
## chrX 1509354 1512462 .   5   .
## chrX 26801421    26802448    .   6   .
## chr19    11694101    11695359    .   1   .
## chr19    4076892 4079276 .   4   .
## chr3 53288567    53290767    .   9   .

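We have carried out a "round trip" of importing, modeling, and exporting experimental data that can be integrated with other data to advance biological understanding.

What we have to watch out for is the idea that annotation is somehow permanently correct, insulated from the uncertainty at the frontier of research. We have seen that even the reference sequences of human chromosomes are subject to revision. In using the ERBS package, we have treated the results of experiments we know little about as defining ER binding sites for potential biological interpretation. The uncertainty, that is, the variable quality of peak identification, has not been explicitly reckoned, but it should be.

Bioconductor has taken pains to address many facets of this situation. We maintain archives of prior versions of software and annotation so that past work can be checked or revised. We update central annotation resources twice a year, so that ongoing work has stability as well as access to new knowledge. And we have made it simple to import experimental and annotation data and to create representations of them.

AnnotationHub

The AnnotationHub package can be used to obtain GRanges or other suitably designed containers for institutionally curated annotation: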

library(AnnotationHub)
## 
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:Biobase':
## 
##     cache
ah = AnnotationHub()
## snapshotDate(): 2017-10-27
ah
## AnnotationHub with 42282 records
## # snapshotDate(): 2017-10-27 
## # $dataprovider: BroadInstitute, Ensembl, UCSC, ftp://ftp.ncbi.nlm.nih....
## # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos ta...
## # $rdataclass: GRanges, BigWigFile, FaFile, TwoBitFile, Rle, ChainFile,...
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH2"]]' 
## 
##             title                                                         
##   AH2     | Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa             
##   AH3     | Ailuropoda_melanoleuca.ailMel1.69.dna_rm.toplevel.fa          
##   AH4     | Ailuropoda_melanoleuca.ailMel1.69.dna_sm.toplevel.fa          
##   AH5     | Ailuropoda_melanoleuca.ailMel1.69.ncrna.fa                    
##   AH6     | Ailuropoda_melanoleuca.ailMel1.69.pep.all.fa                  
##   ...       ...                                                           
##   AH58988 | org.Flavobacterium_piscicida.eg.sqlite                        
##   AH58989 | org.Bacteroides_fragilis_YCH46.eg.sqlite                      
##   AH58990 | org.Pseudomonas_mendocina_ymp.eg.sqlite                       
##   AH58991 | org.Salmonella_enterica_subsp._enterica_serovar_Typhimurium...
##   AH58992 | org.Acinetobacter_baumannii.eg.sqlite

A number of experimental data objects related to the HepG2 cell line are available through AnnotationHub:

query(ah, "HepG2")
## AnnotationHub with 440 records
## # snapshotDate(): 2017-10-27 
## # $dataprovider: UCSC, BroadInstitute, Pazar
## # $species: Homo sapiens, NA
## # $rdataclass: GRanges, BigWigFile
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH22246"]]' 
## 
##             title                                                         
##   AH22246 | pazar_CEBPA_HEPG2_Schmidt_20120522.csv                        
##   AH22249 | pazar_CTCF_HEPG2_Schmidt_20120522.csv                         
##   AH22273 | pazar_HNF4A_HEPG2_Schmidt_20120522.csv                        
##   AH22309 | pazar_STAG1_HEPG2_Schmidt_20120522.csv                        
##   AH22348 | wgEncodeAffyRnaChipFiltTransfragsHepg2CytosolLongnonpolya.b...
##   ...       ...                                                           
##   AH41564 | E118-H4K5ac.imputed.pval.signal.bigwig                        
##   AH41691 | E118-H4K8ac.imputed.pval.signal.bigwig                        
##   AH41818 | E118-H4K91ac.imputed.pval.signal.bigwig                       
##   AH46971 | E118_15_coreMarks_mnemonics.bed.gz                            
##   AH49484 | E118_RRBS_FractionalMethylation.bigwig

The query method can take a vector of filtering strings. To limit the response to annotation resources addressing the histone H4K5, simply add that tag:

query(ah, c("HepG2", "H4K5"))
## AnnotationHub with 1 record
## # snapshotDate(): 2017-10-27 
## # names(): AH41564
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: BigWigFile
## # $rdatadateadded: 2015-05-08
## # $title: E118-H4K5ac.imputed.pval.signal.bigwig
## # $description: Bigwig File containing -log10(p-value) signal tracks fr...
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: BigWig
## # $sourceurl: http://egg2.wustl.edu/roadmap/data/byFileType/signal/cons...
## # $sourcesize: 226630905
## # $tags: c("EpigenomeRoadMap", "signal", "consolidatedImputed",
## #   "H4K5ac", "E118", "ENCODE2012", "LIV.HEPG2.CNCR", "HepG2
## #   Hepatocellular Carcinoma Cell Line") 
## # retrieve record with 'object[["AH41564"]]'

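The OrgDb gene annotation maps

Packages named org.*.eg.db collect information at the gene level, with links to location, protein product identifiers, KEGG pathways and GO terms, PMIDs, and identifiers for other annotation resources: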

library(org.Hs.eg.db)
keytypes(org.Hs.eg.db) # columns() gives same answer
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"
head(select(org.Hs.eg.db, keys="ORMDL3", keytype="SYMBOL", 
   columns="PMID"))
## 'select()' returned 1:many mapping between keys and columns
##   SYMBOL     PMID
## 1 ORMDL3 11042152
## 2 ORMDL3 12093374
## 3 ORMDL3 12477932
## 4 ORMDL3 14702039
## 5 ORMDL3 15489334
## 6 ORMDL3 16169070
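
When a single value per key is enough, mapIds() from AnnotationDbi is a compact alternative to select(). A brief sketch, assuming org.Hs.eg.db is installed; the variable name entrez is illustrative.

```r
library(org.Hs.eg.db)

# mapIds() returns a named vector with one entry per key;
# multiVals = "first" keeps the first match when a key maps to several values
entrez <- mapIds(org.Hs.eg.db, keys = "ORMDL3", keytype = "SYMBOL",
                 column = "ENTREZID", multiVals = "first")
entrez
```

For one-to-many columns such as PMID, select() as used above remains the better fit, because it returns every match as a row.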

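Gene sets and pathway resources

Gene Ontology

Gene Ontology (GO) is a widely used structured vocabulary that organizes terms relevant to the roles of genes and gene products in:

  • biological processes
  • molecular functions
  • cellular components

The vocabulary itself is intended to be relevant to all organisms. It takes the form of a directed acyclic graph, with terms as nodes and "is-a" and "part-of" relationships making up most of the links.

The annotation that links organism-specific genes to terms in Gene Ontology is separate from the vocabulary itself, and involves different types of evidence. These are recorded in Bioconductor annotation packages.

We have immediate access to the GO vocabulary through the GO.db package: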

library(GO.db)
GO.db # metadata
## GODb object:
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
## | GOSOURCEDATE: 2017-Nov01
## | Db type: GODb
## | package: AnnotationDbi
## | DBSCHEMA: GO_DB
## | GOEGSOURCEDATE: 2017-Nov6
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | DBSCHEMAVERSION: 2.1
## 
## Please see: help('select') for usage information

The keys, columns, and select functions of the AnnotationDbi package make it easy to map between IDs and the different term fields:

k5 = keys(GO.db)[1:5]
cgo = columns(GO.db)
select(GO.db, keys=k5, columns=cgo[1:3])
## 'select()' returned 1:1 mapping between keys and columns
##         GOID
## 1 GO:0000001
## 2 GO:0000002
## 3 GO:0000003
## 4 GO:0000006
## 5 GO:0000007
##                                                                                                                                                                                                                                                                                      DEFINITION
## 1                                                                                                       The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton.
## 2                                                                                                                                             The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome.
## 3                                                                                                                                                                  The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms.
## 4                                      Enables the transfer of zinc ions (Zn2+) from one side of a membrane to the other, probably powered by proton motive force. In high-affinity transport the transporter is able to bind the solute even if it is only present at very low concentrations.
## 5 Enables the transfer of a solute or solutes from one side of a membrane to the other according to the reaction: Zn2+ = Zn2+, probably powered by proton motive force. In low-affinity transport the transporter is able to bind the solute only if it is present at very high concentrations.
##   ONTOLOGY
## 1       BP
## 2       BP
## 3       BP
## 4       MF
## 5       MF

The graph structure of the vocabulary is encoded in tables in a SQLite database. We can query it through the RSQLite interface:

library(DBI)  # provides dbListTables() and dbGetQuery()
con = GO_dbconn()
dbListTables(con)
##  [1] "go_bp_offspring" "go_bp_parents"   "go_cc_offspring"
##  [4] "go_cc_parents"   "go_mf_offspring" "go_mf_parents"  
##  [7] "go_obsolete"     "go_ontology"     "go_synonym"     
## [10] "go_term"         "map_counts"      "map_metadata"   
## [13] "metadata"        "sqlite_stat1"

The following query reveals some internal identifiers:

dbGetQuery(con, "select _id, go_id, term from go_term limit 5")
##   _id      go_id                                        term
## 1  30 GO:0000001                   mitochondrion inheritance
## 2  32 GO:0000002            mitochondrial genome maintenance
## 3  33 GO:0000003                                reproduction
## 4  37 GO:0042254                         ribosome biogenesis
## 5  38 GO:0044183 protein binding involved in protein folding

We can trace the "mitochondrion inheritance" term to its parent and grandparent terms:

dbGetQuery(con, "select * from go_bp_parents where _id=30")
##   _id _parent_id relationship_type
## 1  30      26537              is_a
## 2  30      26540              is_a
dbGetQuery(con, "select _id, go_id, term from go_term where _id=26616")
##     _id      go_id
## 1 26616 GO:0048387
##                                                              term
## 1 negative regulation of retinoic acid receptor signaling pathway
dbGetQuery(con, "select * from go_bp_parents where _id=26616")
##     _id _parent_id    relationship_type
## 1 26616       8389                 is_a
## 2 26616      26614                 is_a
## 3 26616      26613 negatively_regulates
dbGetQuery(con, "select _id, go_id, term from go_term where _id=5932")
##    _id      go_id                    term
## 1 5932 GO:0019237 centromeric DNA binding

It makes sense to regard "mitochondrion inheritance" as a conceptual refinement of the processes "mitochondrion distribution" and "organelle inheritance", the two terms recorded as its parents in the database.
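
Tracing parents and grandparents is a walk up a directed acyclic graph. The base R sketch below shows the idea on a toy graph. Only the first entry (the two parents of "mitochondrion inheritance") comes from the text above; the remaining terms and edges are hypothetical placeholders, and the names parents_of and ancestors are illustrative, not part of GO.db.

```r
# Child -> parents edges. The first entry reflects the text above;
# the "placeholder" terms and their edges are made up for illustration.
parents_of <- list(
  "mitochondrion inheritance"  = c("mitochondrion distribution",
                                   "organelle inheritance"),
  "mitochondrion distribution" = "placeholder term A",
  "organelle inheritance"      = "placeholder term A",
  "placeholder term A"         = "placeholder root"
)

# Collect every ancestor by following parent links breadth-first,
# visiting each term once even when it is reachable by several paths
ancestors <- function(term, graph) {
  found <- character(0)
  queue <- graph[[term]]
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (!(current %in% found)) {
      found <- c(found, current)
      queue <- c(queue, graph[[current]])  # NULL for terms with no parents
    }
  }
  found
}

ancestors("mitochondrion inheritance", parents_of)
```

Because a term can have more than one parent, the same ancestor may be reachable along several paths; the membership check is what keeps it from being reported twice.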

The entire database schema can be viewed with GO_dbschema().

KEGG

KEGG annotation has been available in Bioconductor since the project's inception, but the licensing of the KEGG database has changed. When we attach KEGG.db, the following message appears:

library(KEGG.db)
KEGG.db contains mappings based on older data because the original
  resource was removed from the the public domain before the most
  recent update was produced. This package should now be considered
  deprecated and future versions of Bioconductor may not have it
  available.  Users who want more current data are encouraged to look
  at the KEGGREST or reactome.db packages

We therefore turn to the KEGGREST package, which requires an active internet connection. It offers a very useful query facility based on Entrez identifiers. Here we look up BRCA2, whose Entrez ID is 675:

library(KEGGREST)
brca2K = keggGet("hsa:675")
names(brca2K[[1]])
##  [1] "ENTRY"      "NAME"       "DEFINITION" "ORTHOLOGY"  "ORGANISM"  
##  [6] "PATHWAY"    "DISEASE"    "BRITE"      "POSITION"   "MOTIF"     
## [11] "DBLINKS"    "STRUCTURE"  "AASEQ"      "NTSEQ"

The list of genes making up a pathway model can also be obtained with keggGet:

brpat = keggGet("path:hsa05212")
names(brpat[[1]])
##  [1] "ENTRY"       "NAME"        "DESCRIPTION" "CLASS"       "PATHWAY_MAP"
##  [6] "DISEASE"     "DRUG"        "ORGANISM"    "GENE"        "COMPOUND"   
## [11] "KO_PATHWAY"  "REFERENCE"
brpat[[1]]$GENE[seq(1,132,2)] # entrez gene ids
##  [1] "3845"  "5290"  "5293"  "5291"  "5295"  "5296"  "8503"  "9459" 
##  [9] "5879"  "5880"  "5881"  "4790"  "5970"  "207"   "208"   "10000"
## [17] "1147"  "3551"  "8517"  "572"   "598"   "842"   "369"   "673"  
## [25] "5894"  "5604"  "5594"  "5595"  "5599"  "5602"  "5601"  "5900" 
## [33] "5898"  "5899"  "10928" "998"   "7039"  "1950"  "1956"  "2064" 
## [41] "2475"  "6198"  "6199"  "3716"  "6774"  "6772"  "7422"  "1029" 
## [49] "1019"  "1021"  "595"   "5925"  "1869"  "1870"  "1871"  "7157" 
## [57] "1026"  "1647"  "4616"  "10912" "581"   "578"   "1643"  "51426"
## [65] "7040"  "7042"
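
The seq(1, 132, 2) index works because, in the output shown here, the GENE field alternates between an Entrez ID and its description. A small base R sketch of that layout follows. The three IDs are taken from the listing above, while the description strings are placeholders rather than actual KEGG text, and gene_field is an illustrative name.

```r
# Toy vector in the alternating layout: Entrez ID, then description.
# The description text after each symbol is a placeholder.
gene_field <- c("3845", "KRAS; (description placeholder)",
                "5290", "PIK3CA; (description placeholder)",
                "7157", "TP53; (description placeholder)")

# Odd positions hold the IDs, even positions the descriptions
odd <- seq(1, length(gene_field), by = 2)
entrez_ids <- gene_field[odd]

# Keep only the symbol that precedes the first semicolon
symbols <- sub(";.*$", "", gene_field[odd + 1])

# Named vector: symbol -> Entrez ID
setNames(entrez_ids, symbols)
```

Deriving the upper bound from length() instead of hard-coding 132 keeps the extraction correct if the pathway's gene list changes between KEGG releases.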

There is much more to explore in KEGGREST. For example, we can retrieve a static image of the (human) pancreatic cancer pathway in which BRCA2 is implicated:

library(png)
library(grid)
brpng = keggGet("hsa05212", "image")
grid.raster(brpng)
[Figure: static KEGG pathway image for hsa05212, drawn by grid.raster()]

Other ontologies

The rols package provides an interface to the EMBL-EBI Ontology Lookup Service.

library(rols)
oo = Ontologies()
oo
## Object of class 'Ontologies' with 198 entries
##    GENEPIO, MP ... SEPIO, SIBO
oo[[1]]
## Ontology: Genomic Epidemiology Ontology (genepio)  
##   The Genomic Epidemiology Ontology (GenEpiO) covers vocabulary
##   necessary to identify, document and research foodborne pathogens
##   and associated outbreaks.
##    Loaded: 2017-04-10 Updated: 2017-10-20 Version: 2017-04-09 
##    4351 terms  137 properties  38 individuals

To control the amount of network traffic involved in query retrieval, searching proceeds in stages:

glis = OlsSearch("glioblastoma")
glis
## Object of class 'OlsSearch':
##   query: glioblastoma 
##   requested: 20 (out of 502)
##   response(s): 0
res = olsSearch(glis)
dim(res)
## NULL
resdf = as(res, "data.frame") # get content
resdf[1:4,1:4]
##                                                     id
## 1 ncit:class:http://purl.obolibrary.org/obo/NCIT_C3058
## 2     omit:http://purl.obolibrary.org/obo/OMIT_0007102
## 3    ordo:class:http://www.orpha.net/ORDO/Orphanet_360
## 4   hp:class:http://purl.obolibrary.org/obo/HP_0100843
##                                           iri   short_form        label
## 1   http://purl.obolibrary.org/obo/NCIT_C3058   NCIT_C3058 Glioblastoma
## 2 http://purl.obolibrary.org/obo/OMIT_0007102 OMIT_0007102 Glioblastoma
## 3      http://www.orpha.net/ORDO/Orphanet_360 Orphanet_360 Glioblastoma
## 4   http://purl.obolibrary.org/obo/HP_0100843   HP_0100843 Glioblastoma
resdf[1,5]  # full description for one instance
## [[1]]
## [1] "The most malignant astrocytic tumor (WHO grade IV).  It is composed of poorly differentiated neoplastic astrocytes and it is characterized by the presence of cellular polymorphism, nuclear atypia, brisk mitotic activity, vascular thrombosis, microvascular proliferation and necrosis. It typically affects adults and is preferentially located in the cerebral hemispheres. It may develop from diffuse astrocytoma WHO grade II or anaplastic astrocytoma (secondary glioblastoma, IDH-mutant), but more frequently, it manifests after a short clinical history de novo, without evidence of a less malignant precursor lesion (primary glioblastoma, IDH- wildtype). (Adapted from WHO)"

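The ontologyIndex package supports importing data in the Open Biological Ontologies (OBO) format, and includes efficient tools for querying and visualizing ontology systems.

General gene set management

The GSEABase package has excellent infrastructure for managing gene sets and collections of gene sets. We illustrate it by importing glioblastoma-related gene sets from MSigDb: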

library(GSEABase)
glioG = getGmt(system.file("gmt/glioSets.gmt", package="ph525x"))
## Warning in readLines(con, ...): incomplete final line found on '/
## Library/Frameworks/R.framework/Versions/3.4/Resources/library/ph525x/gmt/
## glioSets.gmt'
glioG
## GeneSetCollection
##   names: BALDWIN_PRKCI_TARGETS_UP, BEIER_GLIOMA_STEM_CELL_DN, ..., ZHENG_GLIOBLASTOMA_PLASTICITY_UP (47 total)
##   unique identifiers: ADA, AQP9, ..., ZFP28 (3671 total)
##   types in collection:
##     geneIdType: NullIdentifier (1 total)
##     collectionType: NullCollection (1 total)
head(geneIds(glioG[[1]]))
## [1] "ADA"      "AQP9"     "ATP2B4"   "ATP6V1G1" "CBX6"     "CCDC165"
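To make the input to getGmt less opaque, here is a minimal base-R sketch of the GMT layout: one gene set per line, tab-separated, with the set name first, a description second, and the gene identifiers after that. The set names, the memberships, and the helper parse_gmt are invented for illustration; only the gene symbols are borrowed from the output above. This is not how GSEABase is implemented.

```r
# Toy GMT text: one gene set per line, fields separated by tabs.
# Field 1 = set name, field 2 = description, remaining fields = gene identifiers.
# Set names and memberships are invented for illustration.
gmt_lines <- c(
  "TOY_SET_A\tfirst toy set\tADA\tAQP9\tATP2B4",
  "TOY_SET_B\tsecond toy set\tAQP9\tCBX6"
)

# Hypothetical helper: parse GMT lines into a named list of character vectors.
parse_gmt <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  sets <- lapply(fields, function(f) unique(f[-(1:2)]))  # drop name and description
  names(sets) <- vapply(fields, `[`, character(1), 1)    # first field is the set name
  sets
}

toy_sets  <- parse_gmt(gmt_lines)
set_sizes <- lengths(toy_sets)                               # genes per set
all_genes <- sort(unique(unlist(toy_sets)))                  # identifiers across the collection
shared    <- intersect(toy_sets$TOY_SET_A, toy_sets$TOY_SET_B)  # overlap between two sets
```

A GeneSetCollection carries the same information, plus identifier-type and collection-type metadata, which is why glioG above reports NullIdentifier and NullCollection for sets read from a bare GMT file.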

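A unified, self-describing approach for model organisms

The OrganismDb packages simplify access to annotation. Queries that work against the TxDb and org.[Nn].eg.db packages can be directed at the OrganismDb object instead: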

library(Homo.sapiens)
class(Homo.sapiens)
## [1] "OrganismDb"
## attr(,"package")
## [1] "OrganismDbi"
Homo.sapiens
## OrganismDb Object:
## # Includes GODb Object:  GO.db 
## # With data about:  Gene Ontology 
## # Includes OrgDb Object:  org.Hs.eg.db 
## # Gene data about:  Homo sapiens 
## # Taxonomy Id:  9606 
## # Includes TxDb Object:  TxDb.Hsapiens.UCSC.hg19.knownGene 
## # Transcriptome data about:  Homo sapiens 
## # Based on genome:  hg19 
## # The OrgDb gene id ENTREZID is mapped to the TxDb gene id GENEID .
tx = transcripts(Homo.sapiens)
## 'select()' returned 1:1 mapping between keys and columns
keytypes(Homo.sapiens)
##  [1] "ACCNUM"       "ALIAS"        "CDSID"        "CDSNAME"     
##  [5] "DEFINITION"   "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [9] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL" 
## [13] "EXONID"       "EXONNAME"     "GENEID"       "GENENAME"    
## [17] "GO"           "GOALL"        "GOID"         "IPI"         
## [21] "MAP"          "OMIM"         "ONTOLOGY"     "ONTOLOGYALL" 
## [25] "PATH"         "PFAM"         "PMID"         "PROSITE"     
## [29] "REFSEQ"       "SYMBOL"       "TERM"         "TXID"        
## [33] "TXNAME"       "UCSCKG"       "UNIGENE"      "UNIPROT"
columns(Homo.sapiens)
##  [1] "ACCNUM"       "ALIAS"        "CDSCHROM"     "CDSEND"      
##  [5] "CDSID"        "CDSNAME"      "CDSSTART"     "CDSSTRAND"   
##  [9] "DEFINITION"   "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
## [13] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL" 
## [17] "EXONCHROM"    "EXONEND"      "EXONID"       "EXONNAME"    
## [21] "EXONRANK"     "EXONSTART"    "EXONSTRAND"   "GENEID"      
## [25] "GENENAME"     "GO"           "GOALL"        "GOID"        
## [29] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [33] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [37] "PROSITE"      "REFSEQ"       "SYMBOL"       "TERM"        
## [41] "TXCHROM"      "TXEND"        "TXID"         "TXNAME"      
## [45] "TXSTART"      "TXSTRAND"     "TXTYPE"       "UCSCKG"      
## [49] "UNIGENE"      "UNIPROT"
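Because the key types and columns listed above combine those of org.Hs.eg.db and the TxDb, a single select() call can return gene-level and transcript-level fields together. The sketch below assumes the Homo.sapiens package is installed and does nothing otherwise. The choice of DDR1 as the key and of these three columns is only an example.

```r
# Assumes the Homo.sapiens annotation package is installed; otherwise nothing is run.
res <- NULL
if (requireNamespace("Homo.sapiens", quietly = TRUE)) {
  suppressPackageStartupMessages(library(Homo.sapiens))
  # One query keyed by gene symbol, returning a column that originates in
  # org.Hs.eg.db (ENTREZID) and two that originate in the TxDb (TXNAME, TXCHROM).
  res <- select(Homo.sapiens,
                keys = "DDR1", keytype = "SYMBOL",
                columns = c("ENTREZID", "TXNAME", "TXCHROM"))
  print(head(res))
}
```

The result is an ordinary data.frame with one row per combination of the requested values, so a gene with several transcripts appears on several rows.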

Platform-oriented annotation

Sorting the GPL information page at NCBI GEO shows that the most commonly used oligonucleotide array platform (4760 series in the database) is the Affy Human Genome U133 plus 2.0 array (GPL 570). We can annotate data from this platform with hgu133plus2.db:

library(hgu133plus2.db)
## 
hgu133plus2.db
## ChipDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: ChipDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: HUMANCHIP_DB
## | ORGANISM: Homo sapiens
## | SPECIES: Human
## | MANUFACTURER: Affymetrix
## | CHIPNAME: Human Genome U133 Plus 2.0 Array
## | MANUFACTURERURL: http://www.affymetrix.com/support/technical/byproduct.affx?product=hg-u133-plus
## | EGSOURCEDATE: 2015-Sep27
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | CENTRALID: ENTREZID
## | TAXID: 9606
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
## | GOSOURCEDATE: 20150919
## | GOEGSOURCEDATE: 2015-Sep27
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
## | GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19
## | GPSOURCEDATE: 2010-Mar22
## | ENSOURCEDATE: 2015-Jul16
## | ENSOURCENAME: Ensembl
## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
## | UPSOURCENAME: Uniprot
## | UPSOURCEURL: http://www.uniprot.org/
## | UPSOURCEDATE: Thu Oct  1 23:31:58 2015
## 
## Please see: help('select') for usage information

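The basic purpose of this resource (and of all instances of the ChipDb class) is to map between probeset identifiers and higher-level genomic annotation.

Detailed information on the probes (the constituents of the probe sets) is provided by packages whose names end in "probe":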

library(hgu133plus2probe)
head(hgu133plus2probe)
##                    sequence    x   y Probe.Set.Name
## 1 CACCCAGCTGGTCCTGTGGATGGGA  718 317      1007_s_at
## 2 GCCCCACTGGACAACACTGATTCCT 1105 483      1007_s_at
## 3 TGGACCCCACTGGCTGAGAATCTGG  584 901      1007_s_at
## 4 AAATGTTTCCTTGTGCCTGCTCCTG  192 205      1007_s_at
## 5 TCCTTGTGCCTGCTCCTGTACTTGT  844 979      1007_s_at
## 6 TGCCTGCTCCTGTACTTGTCCTCAG  537 971      1007_s_at
##   Probe.Interrogation.Position Target.Strandedness
## 1                         3330           Antisense
## 2                         3443           Antisense
## 3                         3512           Antisense
## 4                         3563           Antisense
## 5                         3570           Antisense
## 6                         3576           Antisense
dim(hgu133plus2probe)
## [1] 604258      6
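The probe table is an ordinary data.frame, so base R is enough to summarise it. As a small sketch, the six rows printed above are re-entered by hand (only two of the six columns are kept) and used to compute probe length, GC fraction, and the number of probes per probe set. With the package loaded, the same expressions could be applied to hgu133plus2probe itself.

```r
# The six probe rows shown above, re-entered by hand for a self-contained example.
probes <- data.frame(
  sequence = c("CACCCAGCTGGTCCTGTGGATGGGA",
               "GCCCCACTGGACAACACTGATTCCT",
               "TGGACCCCACTGGCTGAGAATCTGG",
               "AAATGTTTCCTTGTGCCTGCTCCTG",
               "TCCTTGTGCCTGCTCCTGTACTTGT",
               "TGCCTGCTCCTGTACTTGTCCTCAG"),
  Probe.Set.Name = "1007_s_at",
  stringsAsFactors = FALSE
)

probe_length   <- nchar(probes$sequence)                      # bases per probe
gc_count       <- nchar(gsub("[^GC]", "", probes$sequence))   # count of G and C bases
gc_fraction    <- gc_count / probe_length                     # GC content per probe
probes_per_set <- table(probes$Probe.Set.Name)                # probes in each probe set
```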

Mapping probeset identifiers to gene-level information can reveal some interesting ambiguities:

select(hgu133plus2.db, keytype="PROBEID", 
  columns=c("SYMBOL", "GENENAME", "PATH", "MAP"), keys="1007_s_at")
## 'select()' returned 1:many mapping between keys and columns
##     PROBEID  SYMBOL                                    GENENAME PATH
## 1 1007_s_at    DDR1 discoidin domain receptor tyrosine kinase 1 <NA>
## 2 1007_s_at MIR4640                               microRNA 4640 <NA>
##       MAP
## 1 6p21.33
## 2 6p21.33

Apparently this probeset can be used to quantify the abundance of both an mRNA (DDR1) and a miRNA (MIR4640). As a sanity check, we see that the two distinct symbols map to the same cytoband, 6p21.33, so a single probeset detecting both is plausible.
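The same check can be written down for any select() result. The sketch below rebuilds the two rows shown above as a data.frame, keeping only the columns it needs, then flags probesets that map to more than one symbol and counts the distinct cytobands for each probeset. The variable names are illustrative.

```r
# The select() result shown above, re-entered by hand (PROBEID, SYMBOL, MAP only).
mapping <- data.frame(
  PROBEID = c("1007_s_at", "1007_s_at"),
  SYMBOL  = c("DDR1", "MIR4640"),
  MAP     = c("6p21.33", "6p21.33"),
  stringsAsFactors = FALSE
)

# Number of distinct gene symbols per probeset; more than one means ambiguity.
n_symbols <- tapply(mapping$SYMBOL, mapping$PROBEID,
                    function(s) length(unique(s)))
ambiguous <- names(n_symbols)[n_symbols > 1]

# Sanity check: number of distinct cytobands per probeset.
n_bands <- tapply(mapping$MAP, mapping$PROBEID,
                  function(m) length(unique(m)))
```

A probeset that appears in ambiguous but has a single cytoband in n_bands, as here, points to overlapping features at one locus rather than to cross-hybridisation between distant genes.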

Summary

We have now covered a lot of material, from the nucleotide level to the pathway level. The existing resources can be surveyed through the "View" pages at Bioconductor.org.

References

  1. Genomic annotation in Bioconductor: The general situation