GEOquery 包使用指南

1 Overview of GEO
2 Getting Started using GEOquery
3 GEOquery Data Structures
- 3.1 The GDS, GSM, and GPL classes
- 3.2 The GSE class
4 Converting to BioConductor ExpressionSets and limma MALists
5 Accessing Raw Data from GEO
6 Use Cases
- 6.1 Getting all Series Records for a Given Platform

# 1 GEO 介紹

GEO（The NCBI Gene Expression Omnibus）是NCBI專門儲存高通量測序的庫拼坎。如基于芯片數(shù)據(jù)（mRNA、DNA完疫、蛋白豐度）泰鸡，蛋白質(zhì)質(zhì)譜數(shù)據(jù)和高通量測序數(shù)據(jù)。
GEO數(shù)據(jù)主要有4種基本類型壳鹤。Sample, Platform 和 Series是由作者上傳的數(shù)據(jù)盛龄，dataset是由GEO官方從做和提交的數(shù)據(jù)整理出來的。

## 1.1 Platforms
GEO 號：GPLxxx芳誓。
芯片的組成信息余舶，例如 cDNAs, oligonucleotide probesets, ORFs, antibodies ∏绿剩或者其它定量檢測平臺信息匿值，例如SAGE tags, peptides。

## 1.2 Samples
GEO 號: GSMxxx

描述單個樣本信息葛圃，處理步驟千扔、處理條件以及實驗測得的結(jié)果憎妙。一個樣本可能屬于多個研究（Series）。

## 1.3 Series
GEO 號：GSExxx

涉及同一個研究的記錄曲楚，包括處理過的數(shù)據(jù)厘唾、總結(jié)和分析；信息可以從GSEMatrix文件解析快速得到龙誊。

##1.4 Datasets
GEO 號：GDSxxx

一套經(jīng)過整理的GEO 數(shù)據(jù)集抚垃。每套數(shù)據(jù)都是可以進(jìn)行生物學(xué)或者統(tǒng)計學(xué)上比較的樣本，是GEO自帶工具進(jìn)行數(shù)據(jù)分析和展示的基礎(chǔ)趟大。一個 GDS數(shù)據(jù)集來自同一個平臺鹤树，數(shù)據(jù)分析和標(biāo)準(zhǔn)化都具有一致性。

# 2 GEOquery快速上手

#GEOquery包安裝：
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GEOquery")

#使用
library("GEOquery")
gsm <- getGEO(filename=system.file("extdata/GSM11805.txt.gz",package="GEOquery"))

getGEO函數(shù)可以從GEO官網(wǎng)獲取數(shù)據(jù)或者將固定格式數(shù)據(jù)解析為R格式的數(shù)據(jù)逊朽。

# 3 GEOquery 數(shù)據(jù)結(jié)構(gòu)

GEOquery 數(shù)據(jù)結(jié)構(gòu)大致分為兩類罕伯。第一種是GDS, GPL和GSM,他們的操作和數(shù)據(jù)類型差不多；第二種是GSE叽讳，GSE數(shù)據(jù)是由GSM和GPL整合而成追他。

## 3.1 GDS, GSM 和 GPL

這些數(shù)據(jù)類組成

一個元數(shù)據(jù)頭(與the SOFT 格式的頭差不多)
一個GEODataTable(數(shù)據(jù))。

可以使用show()查看這些數(shù)據(jù)類岛蚤。

查看 gsm metadata

head(Meta(gsm))

## $channel_count
## [1] "1"
## 
## $comment
## [1] "Raw data provided as supplementary file"
## 
## $contact_address
## [1] "715 Albany Street, E613B"
## 
## $contact_city
## [1] "Boston"
## 
## $contact_country
## [1] "USA"
## 
## $contact_department
## [1] "Genetics and Genomics"

查看 gsm data

Table(gsm)[1:5,]
##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P

查看GSM數(shù)據(jù)列的含義

Columns(gsm)
##     Column
## 1   ID_REF
## 2    VALUE
## 3 ABS_CALL
##                                                                  Description
## 1                                                                           
## 2                         MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call  with Alpha1 = 0.05, Alpha2 = 0.065

head(Meta(gds))
Table(gds)[1:5,]
Columns(gds)
Columns(gds)[1,]

##3.2 GSE類

GSE類組成：

一個元數(shù)據(jù)頭
GPLList邑狸，列表，包含GPL信息
GSMList涤妒，列表单雾，包含GSM信息
- 包含GEODataTable

gse <- getGEO("GSE781",GSEMatrix=FALSE)

head(Meta(gse))
## $contact_address
## [1] "715 Albany Street, E613B"
## 
## $contact_city
## [1] "Boston"
## 
## $contact_country
## [1] "USA"
## 
## $contact_department
## [1] "Genetics and Genomics"
## 
## $contact_email
## [1] "mlenburg@bu.edu"
## 
## $contact_fax
## [1] "617-414-1646"

# names of all the GSM objects contained in the GSE
names(GSMList(gse))
##  [1] "GSM11805" "GSM11810" "GSM11814" "GSM11815" "GSM11823" "GSM11827"
##  [7] "GSM11830" "GSM11832" "GSM12067" "GSM12069" "GSM12075" "GSM12078"
## [13] "GSM12079" "GSM12083" "GSM12098" "GSM12099" "GSM12100" "GSM12101"
## [19] "GSM12105" "GSM12106" "GSM12268" "GSM12269" "GSM12270" "GSM12274"
## [25] "GSM12283" "GSM12287" "GSM12298" "GSM12299" "GSM12300" "GSM12301"
## [31] "GSM12399" "GSM12412" "GSM12444" "GSM12448"

# and get the first GSM object on the list
GSMList(gse)[[1]]
## An object of class "GSM"
## channel_count 
## [1] "1"
## comment 
## [1] "Raw data provided as supplementary file"
## contact_address 
## [1] "715 Albany Street, E613B"
## contact_city 
## [1] "Boston"
## contact_country 
## [1] "USA"
## contact_department 
## [1] "Genetics and Genomics"
## contact_email 
## [1] "mlenburg@bu.edu"
## contact_fax 
## [1] "617-414-1646"
## contact_institute 
## [1] "Boston University School of Medicine"
## contact_name 
## [1] "Marc,E.,Lenburg"
## contact_phone 
## [1] "617-414-1375"
## contact_state 
## [1] "MA"
## contact_web_link 
## [1] "http://gg.bu.edu"
## contact_zip/postal_code 
## [1] "02130"
## data_row_count 
## [1] "22283"
## description 
## [1] "Age = 70; Gender = Female; Right Kidney; Adjacent Tumor Type = clear cell; Adjacent Tumor Fuhrman Grade = 3; Adjacent Tumor Capsule Penetration = true; Adjacent Tumor Perinephric Fat Invasion = true; Adjacent Tumor Renal Sinus Invasion = false; Adjacent Tumor Renal Vein Invasion = true; Scaling Target = 500; Scaling Factor = 7.09; Raw Q = 2.39; Noise = 2.60; Background = 55.24."
## [2] "Keywords = kidney"                                                                                                                                                                                                                                                                                                                                                                           
## [3] "Keywords = renal"                                                                                                                                                                                                                                                                                                                                                                            
## [4] "Keywords = RCC"                                                                                                                                                                                                                                                                                                                                                                              
## [5] "Keywords = carcinoma"                                                                                                                                                                                                                                                                                                                                                                        
## [6] "Keywords = cancer"                                                                                                                                                                                                                                                                                                                                                                           
## [7] "Lot batch = 2004638"                                                                                                                                                                                                                                                                                                                                                                         
## geo_accession 
## [1] "GSM11805"
## last_update_date 
## [1] "May 28 2005"
## molecule_ch1 
## [1] "total RNA"
## organism_ch1 
## [1] "Homo sapiens"
## platform_id 
## [1] "GPL96"
## series_id 
## [1] "GSE781"
## source_name_ch1 
## [1] "Trizol isolation of total RNA from normal tissue adjacent to Renal Cell Carcinoma"
## status 
## [1] "Public on Nov 25 2003"
## submission_date 
## [1] "Oct 20 2003"
## supplementary_file 
## [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM11nnn/GSM11805/GSM11805.CEL.gz"
## title 
## [1] "N035 Normal Human Kidney U133A"
## type 
## [1] "RNA"
## An object of class "GEODataTable"
## ****** Column Descriptions ******
##     Column
## 1   ID_REF
## 2    VALUE
## 3 ABS_CALL
##                                                                  Description
## 1                                                                           
## 2                         MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call  with Alpha1 = 0.05, Alpha2 = 0.065
## ****** Data Table ******
##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P
## 22278 more rows ...

# and the names of the GPLs represented
names(GPLList(gse))
## [1] "GPL96" "GPL97"

# 4 Converting to BioConductor ExpressionSets and limma MALists

GEO datasets與limma 數(shù)據(jù)結(jié)構(gòu)MAList 和Biobase數(shù)據(jù)結(jié)構(gòu) ExpressionSet比較相似∷希可以相互轉(zhuǎn)換：

GDS2MA 和 GDS2eSet

## 4.1 Getting GSE Series Matrix files as an ExpressionSet
GEO Series是一套實驗數(shù)據(jù)的集合硅堆，有SOFT，MINiML格式文件贿讹，以及一個 Series Matrix File(s)文本硬萍。Series Matrix File是tab-delimited text，getGEO函數(shù)可以解析围详，解析結(jié)果就是ExpressionSets。

# Note that GSEMatrix=TRUE is the default
# GSEMatrix=TRUE下載GSE2553_series_matrix.txt.gz和GPL1977.soft文件
# GSEMatrix=F下載GSE2553.soft.gz
gse2553 <- getGEO('GSE2553',GSEMatrix=TRUE)
show(gse2553)

show(gse2553)
## $GSE2553_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 12600 features, 181 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM48681 GSM48682 ... GSM48861 (181 total)
##   varLabels: title geo_accession ... data_row_count (30 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 1 2 ... 12600 (12600 total)
##   fvarLabels: ID PenAt ... Chimeric_Cluster_IDs (13 total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 16230383 
## Annotation: GPL1977

phenoData(gse2553[[1]]))
pData(gse2553[[1]]))
featureData(gse2553[[1]]))
fData(gse2553[[1]]))
experimentData(gse2553[[1]]))

一個GSE下如果存在多個GPL測序祖屏，篩選特定的GPL數(shù)據(jù)助赞；GSE會有多個列表gset[[idx]]

show(pData(gse2553[[1]])[1:5,c(1,6,8)])
##                                                                  title type
## GSM48681                      Patient sample ST18, Dermatofibrosarcoma  RNA
## GSM48682                           Patient sample ST410, Ewing Sarcoma  RNA
## GSM48683                            Patient sample ST130, Sarcoma, NOS  RNA
## GSM48684 Patient sample ST293, Malignant Peripheral Nerve Sheath Tumor  RNA
## GSM48685                             Patient sample ST367, Liposarcoma  RNA
##                                  source_name_ch1
## GSM48681                     Dermatofibrosarcoma
## GSM48682                           Ewing Sarcoma
## GSM48683                            Sarcoma, NOS
## GSM48684 Malignant Peripheral Nerve Sheath Tumor
## GSM48685                             Liposarcoma

##4.2 Converting GDS to an ExpressionSet

eset <- GDS2eSet(gds,do.log2=TRUE)

eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22645 features, 17 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM11815 GSM11832 ... GSM12448 (17 total)
##   varLabels: sample disease.state individual description
##   varMetadata: labelDescription
## featureData
##   featureNames: 200000_s_at 200001_at ... AFFX-TrpnX-M_at (22645 total)
##   fvarLabels: ID Gene title ... GO:Component ID (21 total)
##   fvarMetadata: Column labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 14641932 
## Annotation:

pData(eset)[,1:3]
##            sample disease.state individual
## GSM11815 GSM11815           RCC        035
## GSM11832 GSM11832           RCC        023
## GSM12069 GSM12069           RCC        001
## GSM12083 GSM12083           RCC        005
## GSM12101 GSM12101           RCC        011
## GSM12106 GSM12106           RCC        032
## GSM12274 GSM12274           RCC          2
## GSM12299 GSM12299           RCC          3
## GSM12412 GSM12412           RCC          4
## GSM11810 GSM11810        normal        035
## GSM11827 GSM11827        normal        023
## GSM12078 GSM12078        normal        001
## GSM12099 GSM12099        normal        005
## GSM12269 GSM12269        normal          1
## GSM12287 GSM12287        normal          2
## GSM12301 GSM12301        normal          3
## GSM12448 GSM12448        normal          4

##4.3 Converting GDS to an MAList
ExpressionSet不包含注釋信息，getGEO可以幫助我們獲取袁勺。

#get the platform from the GDS metadata
Meta(gds)$platform
## [1] "GPL97"

#So use this information in a call to getGEO
gpl <- getGEO(filename=system.file("extdata/GPL97.annot.gz",package="GEOquery"))

與ExpressionSet不同雹食，the limma MAList包含基因注釋信息。上面的gpl包含注釋信息期丰。

MAList不僅包含數(shù)據(jù)群叶，還包含樣本信息吃挑，和注釋信息。

MA <- GDS2MA(gds,GPL=gpl)
class(MA)## 
[1] "MAList"
## attr(,"package")
## [1] "limma"

4.4 Converting GSE to an ExpressionSet
GSE轉(zhuǎn)換成ExpressionSet

gsmplatforms <- lapply(GSMList(gse),function(x) {Meta(x)$platform_id})
head(gsmplatforms)

## $GSM11805
## [1] "GPL96"
## 
## $GSM11810
## [1] "GPL97"
## 
## $GSM11814
## [1] "GPL96"
## 
## $GSM11815
## [1] "GPL97"
## 
## $GSM11823
## [1] "GPL96"
## 
## $GSM11827
## [1] "GPL97"

這個GSE包含兩個GPLs街立，GPL96 和 GPL97舶衬。

篩選使用GPL96 的GSM。

gsmlist = Filter(function(gsm) {Meta(gsm)$platform_id=='GPL96'},GSMList(gse))
length(gsmlist)
## [1] 17

Table(gsmlist[[1]])[1:5,]
##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P

the single measurement for each array is called the VALUE column

# and get the column descriptions
Columns(gsmlist[[1]])[1:5,]

獲取表達(dá)矩陣：

# get the probeset ordering
probesets <- Table(GPLList(gse)[[1]])$ID
colnames(Table(GPLList(gse)[[1]]))
[1] "ID"                               "GB_ACC"                           "SPOT_ID"                         
[4] "Species Scientific Name"          "Annotation Date"                  "Sequence Type"                   
[7] "Sequence Source"                  "Target Description"               "Representative Public ID"        
[10] "Gene Title"                       "Gene Symbol"                      "ENTREZ_GENE_ID"                  
[13] "RefSeq Transcript ID"             "Gene Ontology Biological Process" "Gene Ontology Cellular Component"
[16] "Gene Ontology Molecular Function"

# make the data matrix from the VALUE columns from each GSM
# being careful to match the order of the probesets in the platform
# with those in the GSMs
data.matrix <- do.call('cbind',lapply(gsmlist,function(x) 
{tab <- Table(x)
mymatch <- match(probesets,tab$ID_REF)
return(tab$VALUE[mymatch])
}))
data.matrix <- apply(data.matrix,2,function(x) {as.numeric(as.character(x))})
data.matrix <- log2(data.matrix)
data.matrix[1:5,]

##       GSM11805  GSM11814  GSM11823  GSM11830  GSM12067  GSM12075  GSM12079
## [1,] 10.926963 11.105254 11.275019 11.438636 11.424376 11.222795 11.469845
## [2,]  5.749534  7.908092  7.093814  7.514122  7.901470  6.407693  5.165912
## [3,]  7.066089  7.750205  7.244126  7.962896  7.337176  6.569856  7.477354
## [4,] 12.660353 12.479755 12.215897 11.458355 11.397568 12.529870 12.240046
## [5,]  6.195741  6.061776  6.565293  6.583459  6.877744  6.652486  3.981853
##       GSM12098  GSM12100  GSM12105  GSM12268  GSM12270  GSM12283  GSM12298
## [1,] 10.823367 10.835971 10.810893 11.062653 10.323055 11.181028 11.566387
## [2,]  6.556123  8.207014  6.816344  6.563768  7.353147  5.770829  6.912889
## [3,]  7.708739  7.428779  7.754888  7.126188  8.742815  7.339850  7.602142
## [4,] 12.336534 11.762839 11.237509 12.412490 11.213408 12.678380 12.232901
## [5,]  5.501439  6.247928  6.017922  6.525129  6.683696  5.918863  5.837943
##       GSM12300  GSM12399  GSM12444
## [1,] 11.078151 11.535178 11.105450
## [2,]  4.812498  7.471675  7.488644
## [3,]  7.383704  7.432959  7.381110
## [4,] 12.090939 11.421802 12.172834
## [5,]  6.281698  5.419539  5.469235

構(gòu)造ExpressionSet

require(Biobase)
# go through the necessary steps to make a compliant ExpressionSet
rownames(data.matrix) <- probesets
colnames(data.matrix) <- names(gsmlist)
pdata <- data.frame(samples=names(gsmlist))
rownames(pdata) <- names(gsmlist)
pheno <- as(pdata,"AnnotatedDataFrame")
eset2 <- new('ExpressionSet',exprs=data.matrix,phenoData=pheno)
eset2
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 17 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM11805 GSM11814 ... GSM12444 (17 total)
##   varLabels: samples
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

5 Accessing Raw Data from GEO

getGEOSuppFiles函數(shù)

6 Use Cases

##6.1 Getting all Series Records for a Given Platform

獲取GPL相關(guān)的所有GSE 和 GSM accessions

gpl97 <- getGEO('GPL97')
Meta(gpl97)$title
## [1] "[HG-U133B] Affymetrix Human Genome U133B Array"
head(Meta(gpl97)$series_id)
## [1] "GSE362" "GSE473" "GSE620" "GSE674" "GSE781" "GSE907"
length(Meta(gpl97)$series_id)
## [1] 165
head(Meta(gpl97)$sample_id)
## [1] "GSM3922" "GSM3924" "GSM3926" "GSM3928" "GSM3930" "GSM3932"
length(Meta(gpl97)$sample_id)
## [1] 7917

gsmids <- Meta(gpl97)$sample_id
gsmlist <- sapply(gsmids[1:5],getGEO)
names(gsmlist)
## [1] "GSM3922" "GSM3924" "GSM3926" "GSM3928" "GSM3930"

英文版原文見：[Using the GEOquery Package

基因芯片數(shù)據(jù)分析-1: 使用GEOquery 包從GEO獲取數(shù)據(jù)