今天開始會(huì)寫一系列關(guān)于用R進(jìn)行基因芯片分析的文章田篇,算是對(duì)該流程的總結(jié)。之前學(xué)習(xí)了通過circos來畫圈圖箍铭,以及通過MCScanX進(jìn)行基因家族的共線性分析泊柬,放一張結(jié)果先:
逼格滿滿,容我先傲嬌一下~
##################我是分割線##################
GEO數(shù)據(jù)庫(kù)基礎(chǔ)知識(shí)
Gene Expression Omnibus(簡(jiǎn)稱GEO)是一個(gè)將科研團(tuán)體提交的基因芯片诈火、下一代測(cè)序數(shù)據(jù)兽赁、以及其他高通量功能基因組數(shù)據(jù)形式進(jìn)行歸檔以供免費(fèi)下載的公共數(shù)據(jù)庫(kù)。
在GEO中記錄的數(shù)據(jù)形式主要有一下幾種:
- GEO Platform (GPL) 平臺(tái),包括芯片和測(cè)序平臺(tái)刀崖,每個(gè)平臺(tái)都被賦予唯一的GEO號(hào)
- GEO Series (GSE) study的ID號(hào)惊科,是相關(guān)實(shí)驗(yàn)樣本的集合,并提供了對(duì)于該研究的關(guān)鍵描述性信息
- GEO Sample (GSM) 樣本ID號(hào)亮钦,提供了單個(gè)樣本處理的實(shí)驗(yàn)條件以及該樣本的測(cè)量數(shù)據(jù)
- GEO Dataset (GDS) 數(shù)據(jù)集的ID號(hào)
一個(gè)GDS可以有多個(gè)GSM馆截,一個(gè)GSM可以有多個(gè)GSE,至于GPL蜂莉,一般不接觸蜡娶。
簡(jiǎn)單來講,分析一張表達(dá)芯片主要包括以下幾個(gè)步驟:
1. 安裝相關(guān)的packages(installation)
2. 從GEO下載數(shù)據(jù)(download dataset)
3. 歸一化(normalization)
4. 芯片質(zhì)控(quality control)
5. 差異表達(dá)基因分析(differential expressed gene analysis)
我們通常接觸的都是GSE系列(一個(gè)GSE里面有多個(gè)GSM)的數(shù)據(jù)巡语。
Installation
從GEO數(shù)據(jù)庫(kù)中下載感興趣的芯片翎蹈,需要事先安裝好bioconductor的核心包,然后安裝專門用于下載標(biāo)準(zhǔn)GEO數(shù)據(jù)結(jié)構(gòu)的GEOquery包男公。
R console 輸入以下命令荤堪,進(jìn)行GEOquery包的安裝:
source("http://bioconductor.org/biocLite.R")
biocLite("GEOquery")
Getting the data
這次以數(shù)據(jù)集GSE97045為對(duì)象進(jìn)行分析。這張芯片以栽培番茄(S. lycopersicum (Sl) (cv. P73) )和野生潘那利番茄(S. pennellii (Sp) (acc. PE47) )為研究對(duì)象枢赔,分析干旱處理前后澄阳,兩種番茄種內(nèi)與種間基因表達(dá)的差異,并進(jìn)行聚類分析踏拜∷橛可以發(fā)現(xiàn),該實(shí)驗(yàn)包括兩個(gè)因素:不同番茄(栽培 vs 野生)和不同處理(unstressed vs drought)速梗。每一條件各有三個(gè)重復(fù)肮塞,因此,該實(shí)驗(yàn)的總樣本(芯片)數(shù)為:2(兩種番茄)* 2 (兩個(gè)處理)* 3 (三次重復(fù))= 12
首先姻锁,通過GEOquery包進(jìn)行芯片數(shù)據(jù)的下載或直接在本地打開文件:
setwd('D:\\work\\Bioinformatics\\data\\array')
gse97045 <- getGEO(filename = "D:\\work\\Bioinformatics\\data\\array\\GSE97045_series_matrix.txt.gz", getGPL = F)
#class(gse97045)
#[1] "ExpressionSet"
#attr(,"package")
#[1] "Biobase"
直接打開的就是"ExpressionSet"類文件枕赵,可以用該類的方法,主要包括4種:
#geneNames(): 該方法已失效
#sampleNames(): 返回樣本(芯片)編號(hào)
> sampleNames(gse97045)
[1] "GSM2550523" "GSM2550524" "GSM2550525" "GSM2550526" "GSM2550527" "GSM2550528" "GSM2550529" "GSM2550530" "GSM2550531" "GSM2550532" "GSM2550533" "GSM2550534"
#pData(): 返回與實(shí)驗(yàn)相關(guān)的表型數(shù)據(jù)或元數(shù)據(jù)位隶,即phenotypic data的縮寫拷窜,以data.frame的格式呈現(xiàn)
> colnames(pData(gse97045))
[1] "title" "geo_accession" "status"
[4] "submission_date" "last_update_date" "type"
[7] "channel_count" "source_name_ch1" "organism_ch1"
[10] "characteristics_ch1" "characteristics_ch1.1" "characteristics_ch1.2"
[13] "characteristics_ch1.3" "treatment_protocol_ch1" "growth_protocol_ch1"
[16] "molecule_ch1" "extract_protocol_ch1" "label_ch1"
[19] "label_protocol_ch1" "taxid_ch1" "hyb_protocol"
[22] "scan_protocol" "description" "data_processing"
[25] "data_processing.1" "platform_id" "contact_name"
[28] "contact_email" "contact_laboratory" "contact_department"
[31] "contact_institute" "contact_address" "contact_city"
[34] "contact_state" "contact_zip/postal_code" "contact_country"
[37] "supplementary_file" "data_row_count" "age:ch1"
[40] "condition:ch1" "cultivar/accession:ch1" "tissue:ch1"
看一下元數(shù)據(jù)的“title”列信息,告訴我們樣本信息涧黄、處理信息篮昧、以及重復(fù)信息,factor格式數(shù)據(jù):
> pData(gse97045)$title
[1] Tomato (S. lycopersicum) cv. P73, control condition, rep. 1
[2] Tomato (S. lycopersicum) cv. P73, control condition, rep. 2
[3] Tomato (S. lycopersicum) cv. P73, control condition, rep. 3
[4] Tomato (S. lycopersicum) cv. P73, water stress condition, rep. 1
[5] Tomato (S. lycopersicum) cv. P73, water stress condition, rep. 2
[6] Tomato (S. lycopersicum) cv. P73, water stress condition, rep. 3
[7] S. pennellii acc. PE47, control condition, rep. 1
[8] S. pennellii acc. PE47, control condition, rep. 2
[9] S. pennellii acc. PE47, control condition, rep. 3
[10] S. pennellii acc. PE47, water stress condition, rep. 1
[11] S. pennellii acc. PE47, water stress condition, rep. 2
[12] S. pennellii acc. PE47, water stress condition, rep. 3
12 Levels: S. pennellii acc. PE47, control condition, rep. 1 ...
#exprs(): Retrieve expression data from eSets.
> head(exprs(gse97045))
GSM2550523 GSM2550524 GSM2550525 GSM2550526 GSM2550527 GSM2550528
AFFX-BioB-3_at 9.154191 8.935244 8.765674 8.839061 8.275339 8.609382
AFFX-BioB-5_at 9.115401 8.814133 8.664128 8.730531 8.321579 8.564144
AFFX-BioB-M_at 8.980967 8.748333 8.642012 8.750143 8.249259 8.606312
AFFX-BioC-3_at 10.598291 10.439536 10.294270 10.344476 9.912000 10.224729
AFFX-BioC-5_at 10.409473 10.211119 10.013224 10.126247 9.690638 9.981757
AFFX-BioDn-3_at 12.484206 12.394027 12.261628 12.482427 12.047057 12.336597
GSM2550529 GSM2550530 GSM2550531 GSM2550532 GSM2550533 GSM2550534
AFFX-BioB-3_at 9.477754 9.479547 9.513700 9.179431 9.239031 9.238477
AFFX-BioB-5_at 9.402628 9.429561 9.535614 9.166966 9.273455 9.244174
AFFX-BioB-M_at 9.405048 9.328257 9.393749 9.061191 9.068758 9.223863
AFFX-BioC-3_at 10.878668 10.895548 10.923604 10.612944 10.669775 10.707216
AFFX-BioC-5_at 10.588091 10.621175 10.646094 10.341966 10.442737 10.434916
AFFX-BioDn-3_at 12.603844 12.596163 12.627703 12.487721 12.563084 12.577017
這里再插點(diǎn)有關(guān)exprs()函數(shù)的知識(shí):
- exprs()屬于Biobase package笋妥,專門用于處理eSet類數(shù)據(jù)懊昨,返回值為基因表達(dá)矩陣。
- eSet類是Bioconductor 為基因表達(dá)數(shù)據(jù)格式所制定的標(biāo)準(zhǔn)春宣,基本可稱為一個(gè)Meta類疚颊,也是一個(gè)虛類狈孔,衍生出了許多重要的類,包括ExpressionSet類材义,SnpSet類以及AffyBatch類等1均抽。
將該過程包裝為函數(shù):
downGSE <- function(studyID, destdir = ".") {
library(GEOquery)
eSet <- getGEO(studyID, destdir = destdir, getGPL = F)
exprSet <- exprs(eSet[[1]])
pdata <- pData(eSet[[1]])
write.csv(exprSet, paste0(studyID, "_exprSet.csv"))
write.csv(pdata, paste0(studyID, "_metadata.csv"))
return(eSet)
}
對(duì)GSE97045進(jìn)行作圖,看其是否已經(jīng)歸一化:
> library(affyPLM)
載入需要的程輯包:gcrma
載入需要的程輯包:preprocessCore
> boxplot(gse97045, col = cols)
可見其掂,下載的GSE97045_series_matrix.txt.gz是經(jīng)過歸一化處理的表達(dá)矩陣油挥,可以直接應(yīng)用于下游數(shù)據(jù)分析。
Reference:
- 高山款熬, 歐劍虹深寥,肖凱,《R語(yǔ)言與Bioconductor生物信息學(xué)應(yīng)用》