GEOquery 包使用指南
- 1 Overview of GEO
- 2 Getting Started using GEOquery
- 3 GEOquery Data Structures
- 4 Converting to BioConductor ExpressionSets and limma MALists
- 5 Accessing Raw Data from GEO
- 6 Use Cases
# 1 GEO 介紹
GEO(The NCBI Gene Expression Omnibus)是NCBI專門儲存高通量測序的庫拼坎。如基于芯片數(shù)據(jù)(mRNA、DNA完疫、蛋白豐度)泰鸡,蛋白質(zhì)質(zhì)譜數(shù)據(jù)和高通量測序數(shù)據(jù)。
GEO數(shù)據(jù)主要有4種基本類型壳鹤。Sample, Platform 和 Series是由作者上傳的數(shù)據(jù)盛龄,dataset是由GEO官方從做和提交的數(shù)據(jù)整理出來的。
## 1.1 Platforms
GEO 號:GPLxxx芳誓。
芯片的組成信息余舶,例如 cDNAs, oligonucleotide probesets, ORFs, antibodies ∏绿剩或者其它定量檢測平臺信息匿值,例如SAGE tags, peptides。
## 1.2 Samples
GEO 號: GSMxxx
描述單個樣本信息葛圃,處理步驟千扔、處理條件以及實驗測得的結(jié)果憎妙。一個樣本可能屬于多個研究(Series)。
## 1.3 Series
GEO 號:GSExxx
涉及同一個研究的記錄曲楚,包括處理過的數(shù)據(jù)厘唾、總結(jié)和分析;信息可以從GSEMatrix文件解析快速得到龙誊。
##1.4 Datasets
GEO 號:GDSxxx
一套經(jīng)過整理的GEO 數(shù)據(jù)集抚垃。每套數(shù)據(jù)都是可以進(jìn)行生物學(xué)或者統(tǒng)計學(xué)上比較的樣本,是GEO自帶工具進(jìn)行數(shù)據(jù)分析和展示的基礎(chǔ)趟大。一個 GDS數(shù)據(jù)集來自同一個平臺鹤树,數(shù)據(jù)分析和標(biāo)準(zhǔn)化都具有一致性。
# 2 GEOquery快速上手
#GEOquery包安裝:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GEOquery")
#使用
library("GEOquery")
gsm <- getGEO(filename=system.file("extdata/GSM11805.txt.gz",package="GEOquery"))
getGEO
函數(shù)可以從GEO官網(wǎng)獲取數(shù)據(jù)或者將固定格式數(shù)據(jù)解析為R格式的數(shù)據(jù)逊朽。
# 3 GEOquery 數(shù)據(jù)結(jié)構(gòu)
GEOquery 數(shù)據(jù)結(jié)構(gòu)大致分為兩類罕伯。第一種是GDS, GPL和GSM,他們的操作和數(shù)據(jù)類型差不多;第二種是GSE叽讳,GSE數(shù)據(jù)是由GSM和GPL整合而成追他。
## 3.1 GDS, GSM 和 GPL
這些數(shù)據(jù)類組成
- 一個元數(shù)據(jù)頭(與the SOFT 格式的頭差不多)
- 一個GEODataTable(數(shù)據(jù))。
可以使用show()查看這些數(shù)據(jù)類岛蚤。
- 查看 gsm metadata
head(Meta(gsm))
## $channel_count
## [1] "1"
##
## $comment
## [1] "Raw data provided as supplementary file"
##
## $contact_address
## [1] "715 Albany Street, E613B"
##
## $contact_city
## [1] "Boston"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
- 查看 gsm data
Table(gsm)[1:5,]
## ID_REF VALUE ABS_CALL
## 1 AFFX-BioB-5_at 953.9 P
## 2 AFFX-BioB-M_at 2982.8 P
## 3 AFFX-BioB-3_at 1657.9 P
## 4 AFFX-BioC-5_at 2652.7 P
## 5 AFFX-BioC-3_at 2019.5 P
- 查看GSM數(shù)據(jù)列的含義
Columns(gsm)
## Column
## 1 ID_REF
## 2 VALUE
## 3 ABS_CALL
## Description
## 1
## 2 MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
head(Meta(gds))
Table(gds)[1:5,]
Columns(gds)
Columns(gds)[1,]
##3.2 GSE類
GSE類組成:
- 一個元數(shù)據(jù)頭
- GPLList邑狸,列表,包含GPL信息
- GSMList涤妒,列表单雾,包含GSM信息
- 包含GEODataTable
gse <- getGEO("GSE781",GSEMatrix=FALSE)
head(Meta(gse))
## $contact_address
## [1] "715 Albany Street, E613B"
##
## $contact_city
## [1] "Boston"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
##
## $contact_email
## [1] "mlenburg@bu.edu"
##
## $contact_fax
## [1] "617-414-1646"
# names of all the GSM objects contained in the GSE
names(GSMList(gse))
## [1] "GSM11805" "GSM11810" "GSM11814" "GSM11815" "GSM11823" "GSM11827"
## [7] "GSM11830" "GSM11832" "GSM12067" "GSM12069" "GSM12075" "GSM12078"
## [13] "GSM12079" "GSM12083" "GSM12098" "GSM12099" "GSM12100" "GSM12101"
## [19] "GSM12105" "GSM12106" "GSM12268" "GSM12269" "GSM12270" "GSM12274"
## [25] "GSM12283" "GSM12287" "GSM12298" "GSM12299" "GSM12300" "GSM12301"
## [31] "GSM12399" "GSM12412" "GSM12444" "GSM12448"
# and get the first GSM object on the list
GSMList(gse)[[1]]
## An object of class "GSM"
## channel_count
## [1] "1"
## comment
## [1] "Raw data provided as supplementary file"
## contact_address
## [1] "715 Albany Street, E613B"
## contact_city
## [1] "Boston"
## contact_country
## [1] "USA"
## contact_department
## [1] "Genetics and Genomics"
## contact_email
## [1] "mlenburg@bu.edu"
## contact_fax
## [1] "617-414-1646"
## contact_institute
## [1] "Boston University School of Medicine"
## contact_name
## [1] "Marc,E.,Lenburg"
## contact_phone
## [1] "617-414-1375"
## contact_state
## [1] "MA"
## contact_web_link
## [1] "http://gg.bu.edu"
## contact_zip/postal_code
## [1] "02130"
## data_row_count
## [1] "22283"
## description
## [1] "Age = 70; Gender = Female; Right Kidney; Adjacent Tumor Type = clear cell; Adjacent Tumor Fuhrman Grade = 3; Adjacent Tumor Capsule Penetration = true; Adjacent Tumor Perinephric Fat Invasion = true; Adjacent Tumor Renal Sinus Invasion = false; Adjacent Tumor Renal Vein Invasion = true; Scaling Target = 500; Scaling Factor = 7.09; Raw Q = 2.39; Noise = 2.60; Background = 55.24."
## [2] "Keywords = kidney"
## [3] "Keywords = renal"
## [4] "Keywords = RCC"
## [5] "Keywords = carcinoma"
## [6] "Keywords = cancer"
## [7] "Lot batch = 2004638"
## geo_accession
## [1] "GSM11805"
## last_update_date
## [1] "May 28 2005"
## molecule_ch1
## [1] "total RNA"
## organism_ch1
## [1] "Homo sapiens"
## platform_id
## [1] "GPL96"
## series_id
## [1] "GSE781"
## source_name_ch1
## [1] "Trizol isolation of total RNA from normal tissue adjacent to Renal Cell Carcinoma"
## status
## [1] "Public on Nov 25 2003"
## submission_date
## [1] "Oct 20 2003"
## supplementary_file
## [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM11nnn/GSM11805/GSM11805.CEL.gz"
## title
## [1] "N035 Normal Human Kidney U133A"
## type
## [1] "RNA"
## An object of class "GEODataTable"
## ****** Column Descriptions ******
## Column
## 1 ID_REF
## 2 VALUE
## 3 ABS_CALL
## Description
## 1
## 2 MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
## ****** Data Table ******
## ID_REF VALUE ABS_CALL
## 1 AFFX-BioB-5_at 953.9 P
## 2 AFFX-BioB-M_at 2982.8 P
## 3 AFFX-BioB-3_at 1657.9 P
## 4 AFFX-BioC-5_at 2652.7 P
## 5 AFFX-BioC-3_at 2019.5 P
## 22278 more rows ...
# and the names of the GPLs represented
names(GPLList(gse))
## [1] "GPL96" "GPL97"
# 4 Converting to BioConductor ExpressionSets and limma MALists
GEO datasets與limma 數(shù)據(jù)結(jié)構(gòu)MAList 和Biobase數(shù)據(jù)結(jié)構(gòu) ExpressionSet比較相似∷希可以相互轉(zhuǎn)換:
- GDS2MA 和 GDS2eSet
## 4.1 Getting GSE Series Matrix files as an ExpressionSet
GEO Series是一套實驗數(shù)據(jù)的集合硅堆,有SOFT,MINiML格式文件贿讹,以及一個 Series Matrix File(s)文本硬萍。Series Matrix File是tab-delimited text,getGEO
函數(shù)可以解析围详,解析結(jié)果就是ExpressionSets。
# Note that GSEMatrix=TRUE is the default
# GSEMatrix=TRUE下載GSE2553_series_matrix.txt.gz和GPL1977.soft文件
# GSEMatrix=F下載GSE2553.soft.gz
gse2553 <- getGEO('GSE2553',GSEMatrix=TRUE)
show(gse2553)
show(gse2553)
## $GSE2553_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 12600 features, 181 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM48681 GSM48682 ... GSM48861 (181 total)
## varLabels: title geo_accession ... data_row_count (30 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1 2 ... 12600 (12600 total)
## fvarLabels: ID PenAt ... Chimeric_Cluster_IDs (13 total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 16230383
## Annotation: GPL1977
phenoData(gse2553[[1]]))
pData(gse2553[[1]]))
featureData(gse2553[[1]]))
fData(gse2553[[1]]))
experimentData(gse2553[[1]]))
一個GSE下如果存在多個GPL測序祖屏,篩選特定的GPL數(shù)據(jù)助赞;GSE會有多個列表gset[[idx]]
show(pData(gse2553[[1]])[1:5,c(1,6,8)])
## title type
## GSM48681 Patient sample ST18, Dermatofibrosarcoma RNA
## GSM48682 Patient sample ST410, Ewing Sarcoma RNA
## GSM48683 Patient sample ST130, Sarcoma, NOS RNA
## GSM48684 Patient sample ST293, Malignant Peripheral Nerve Sheath Tumor RNA
## GSM48685 Patient sample ST367, Liposarcoma RNA
## source_name_ch1
## GSM48681 Dermatofibrosarcoma
## GSM48682 Ewing Sarcoma
## GSM48683 Sarcoma, NOS
## GSM48684 Malignant Peripheral Nerve Sheath Tumor
## GSM48685 Liposarcoma
##4.2 Converting GDS to an ExpressionSet
eset <- GDS2eSet(gds,do.log2=TRUE)
eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22645 features, 17 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM11815 GSM11832 ... GSM12448 (17 total)
## varLabels: sample disease.state individual description
## varMetadata: labelDescription
## featureData
## featureNames: 200000_s_at 200001_at ... AFFX-TrpnX-M_at (22645 total)
## fvarLabels: ID Gene title ... GO:Component ID (21 total)
## fvarMetadata: Column labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 14641932
## Annotation:
pData(eset)[,1:3]
## sample disease.state individual
## GSM11815 GSM11815 RCC 035
## GSM11832 GSM11832 RCC 023
## GSM12069 GSM12069 RCC 001
## GSM12083 GSM12083 RCC 005
## GSM12101 GSM12101 RCC 011
## GSM12106 GSM12106 RCC 032
## GSM12274 GSM12274 RCC 2
## GSM12299 GSM12299 RCC 3
## GSM12412 GSM12412 RCC 4
## GSM11810 GSM11810 normal 035
## GSM11827 GSM11827 normal 023
## GSM12078 GSM12078 normal 001
## GSM12099 GSM12099 normal 005
## GSM12269 GSM12269 normal 1
## GSM12287 GSM12287 normal 2
## GSM12301 GSM12301 normal 3
## GSM12448 GSM12448 normal 4
##4.3 Converting GDS to an MAList
ExpressionSet不包含注釋信息,getGEO
可以幫助我們獲取袁勺。
#get the platform from the GDS metadata
Meta(gds)$platform
## [1] "GPL97"
#So use this information in a call to getGEO
gpl <- getGEO(filename=system.file("extdata/GPL97.annot.gz",package="GEOquery"))
與ExpressionSet不同雹食,the limma MAList
包含基因注釋信息。上面的gpl包含注釋信息期丰。
MAList不僅包含數(shù)據(jù)群叶,還包含樣本信息吃挑,和注釋信息。
MA <- GDS2MA(gds,GPL=gpl)
class(MA)##
[1] "MAList"
## attr(,"package")
## [1] "limma"
4.4 Converting GSE to an ExpressionSet
GSE轉(zhuǎn)換成ExpressionSet
gsmplatforms <- lapply(GSMList(gse),function(x) {Meta(x)$platform_id})
head(gsmplatforms)
## $GSM11805
## [1] "GPL96"
##
## $GSM11810
## [1] "GPL97"
##
## $GSM11814
## [1] "GPL96"
##
## $GSM11815
## [1] "GPL97"
##
## $GSM11823
## [1] "GPL96"
##
## $GSM11827
## [1] "GPL97"
這個GSE包含兩個GPLs街立,GPL96 和 GPL97舶衬。
篩選使用GPL96 的GSM。
gsmlist = Filter(function(gsm) {Meta(gsm)$platform_id=='GPL96'},GSMList(gse))
length(gsmlist)
## [1] 17
Table(gsmlist[[1]])[1:5,]
## ID_REF VALUE ABS_CALL
## 1 AFFX-BioB-5_at 953.9 P
## 2 AFFX-BioB-M_at 2982.8 P
## 3 AFFX-BioB-3_at 1657.9 P
## 4 AFFX-BioC-5_at 2652.7 P
## 5 AFFX-BioC-3_at 2019.5 P
the single measurement for each array is called the
VALUE
column
# and get the column descriptions
Columns(gsmlist[[1]])[1:5,]
獲取表達(dá)矩陣:
# get the probeset ordering
probesets <- Table(GPLList(gse)[[1]])$ID
colnames(Table(GPLList(gse)[[1]]))
[1] "ID" "GB_ACC" "SPOT_ID"
[4] "Species Scientific Name" "Annotation Date" "Sequence Type"
[7] "Sequence Source" "Target Description" "Representative Public ID"
[10] "Gene Title" "Gene Symbol" "ENTREZ_GENE_ID"
[13] "RefSeq Transcript ID" "Gene Ontology Biological Process" "Gene Ontology Cellular Component"
[16] "Gene Ontology Molecular Function"
# make the data matrix from the VALUE columns from each GSM
# being careful to match the order of the probesets in the platform
# with those in the GSMs
data.matrix <- do.call('cbind',lapply(gsmlist,function(x)
{tab <- Table(x)
mymatch <- match(probesets,tab$ID_REF)
return(tab$VALUE[mymatch])
}))
data.matrix <- apply(data.matrix,2,function(x) {as.numeric(as.character(x))})
data.matrix <- log2(data.matrix)
data.matrix[1:5,]
## GSM11805 GSM11814 GSM11823 GSM11830 GSM12067 GSM12075 GSM12079
## [1,] 10.926963 11.105254 11.275019 11.438636 11.424376 11.222795 11.469845
## [2,] 5.749534 7.908092 7.093814 7.514122 7.901470 6.407693 5.165912
## [3,] 7.066089 7.750205 7.244126 7.962896 7.337176 6.569856 7.477354
## [4,] 12.660353 12.479755 12.215897 11.458355 11.397568 12.529870 12.240046
## [5,] 6.195741 6.061776 6.565293 6.583459 6.877744 6.652486 3.981853
## GSM12098 GSM12100 GSM12105 GSM12268 GSM12270 GSM12283 GSM12298
## [1,] 10.823367 10.835971 10.810893 11.062653 10.323055 11.181028 11.566387
## [2,] 6.556123 8.207014 6.816344 6.563768 7.353147 5.770829 6.912889
## [3,] 7.708739 7.428779 7.754888 7.126188 8.742815 7.339850 7.602142
## [4,] 12.336534 11.762839 11.237509 12.412490 11.213408 12.678380 12.232901
## [5,] 5.501439 6.247928 6.017922 6.525129 6.683696 5.918863 5.837943
## GSM12300 GSM12399 GSM12444
## [1,] 11.078151 11.535178 11.105450
## [2,] 4.812498 7.471675 7.488644
## [3,] 7.383704 7.432959 7.381110
## [4,] 12.090939 11.421802 12.172834
## [5,] 6.281698 5.419539 5.469235
構(gòu)造ExpressionSet
require(Biobase)
# go through the necessary steps to make a compliant ExpressionSet
rownames(data.matrix) <- probesets
colnames(data.matrix) <- names(gsmlist)
pdata <- data.frame(samples=names(gsmlist))
rownames(pdata) <- names(gsmlist)
pheno <- as(pdata,"AnnotatedDataFrame")
eset2 <- new('ExpressionSet',exprs=data.matrix,phenoData=pheno)
eset2
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 17 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM11805 GSM11814 ... GSM12444 (17 total)
## varLabels: samples
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
5 Accessing Raw Data from GEO
- getGEOSuppFiles函數(shù)
6 Use Cases
##6.1 Getting all Series Records for a Given Platform
- 獲取GPL相關(guān)的所有GSE 和 GSM accessions
gpl97 <- getGEO('GPL97')
Meta(gpl97)$title
## [1] "[HG-U133B] Affymetrix Human Genome U133B Array"
head(Meta(gpl97)$series_id)
## [1] "GSE362" "GSE473" "GSE620" "GSE674" "GSE781" "GSE907"
length(Meta(gpl97)$series_id)
## [1] 165
head(Meta(gpl97)$sample_id)
## [1] "GSM3922" "GSM3924" "GSM3926" "GSM3928" "GSM3930" "GSM3932"
length(Meta(gpl97)$sample_id)
## [1] 7917
gsmids <- Meta(gpl97)$sample_id
gsmlist <- sapply(gsmids[1:5],getGEO)
names(gsmlist)
## [1] "GSM3922" "GSM3924" "GSM3926" "GSM3928" "GSM3930"
英文版原文見:[Using the GEOquery Package