TCGA新版數(shù)據(jù)差異分析

上一篇文章里面簡單學(xué)習(xí)了一下表達(dá)矩陣的提取店乐,順便探索了一下SummarizedExperiment對象橄教。

今天學(xué)習(xí)下用TCGAbiolinks做差異分析橄维。

加載R包和數(shù)據(jù)

rm(list = ls())

library(SummarizedExperiment)
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
## 
##     windows
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
library(TCGAbiolinks)

首先是查詢枣接、下載、整理硕旗,但是這一步我們在之前已經(jīng)做好了窗骑,直接加載就可以了!

下載方法:請翻看之前的推文

# 查詢
query <- GDCquery(project = "TCGA-COAD",
                    data.category = "Transcriptome Profiling",
                    data.type = "Gene Expression Quantification",
                    workflow.type = "STAR - Counts"
                    )
# 下載
GDCdownload(query, files.per.chunk = 100) #每次下載100個文件
  
# 整理
GDCprepare(query,save = T,save.filename = "TCGA-COAD_mRNA.Rdata")

我們已經(jīng)下載好了卵渴,就直接加載即可:

load("TCGA-mRNA/TCGA-COAD_mRNA.Rdata")

通常我們會區(qū)分mRNA和lncRNA,所以我們這里只選擇mRNA即可鲤竹,方法在上一篇也說過了浪读,非常簡單!直接對SummarizedExperiment對象取子集即可辛藻!

se_mrna <- data[rowData(data)$gene_type == "protein_coding",]

se_mrna
## class: RangedSummarizedExperiment 
## dim: 19962 521 
## metadata(1): data_release
## assays(6): unstranded stranded_first ... fpkm_unstrand fpkm_uq_unstrand
## rownames(19962): ENSG00000000003.15 ENSG00000000005.6 ...
##   ENSG00000288674.1 ENSG00000288675.1
## rowData names(10): source type ... hgnc_id havana_gene
## colnames(521): TCGA-A6-5664-01A-21R-1839-07
##   TCGA-D5-6530-01A-11R-1723-07 ... TCGA-A6-2683-01A-01R-0821-07
##   TCGA-A6-2683-11A-01R-A32Z-07
## colData names(107): barcode patient ... paper_vascular_invasion_present
##   paper_vital_status

數(shù)據(jù)預(yù)處理

在進(jìn)行差異分析前進(jìn)行一些預(yù)處理碘橘。

dim(se_mrna)
## [1] 19962   521

首先根據(jù)spearman相關(guān)系數(shù)去除異常值:

coad_coroutliers <- TCGAanalyze_Preprocessing(se_mrna,cor.cut = 0.7)
## Number of outliers: 0

dim(coad_coroutliers)
## [1] 19962   521

還會生成一張相關(guān)圖:Array Array Intensity correlation (AAIC):


image.png

接下來進(jìn)行標(biāo)準(zhǔn)化:

# normalization of genes
coadNorm <- TCGAanalyze_Normalization(
    tabDF = coad_coroutliers, 
    geneInfo =  geneInfoHT
)
## I Need about  127 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]
## Step 1 of 4: newSeqExpressionSet ...
## Step 2 of 4: withinLaneNormalization ...
## Step 3 of 4: betweenLaneNormalization ...
## Step 4 of 4: exprs ...

會使用EDASeq包中的方法:

EDASeq::newSeqExpressionSet
EDASeq::withinLaneNormalization
EDASeq::betweenLaneNormalization
EDASeq::counts
dim(coadNorm)
## [1] 19469   521

然后過濾掉低表達(dá)的基因:

# quantile filter of genes
coadFilt <- TCGAanalyze_Filtering(
    tabDF = coadNorm,
    method = "quantile", 
    qnt.cut =  0.25
)

看看一通操作下來后還剩多少基因?

dim(coadFilt)
## [1] 14600   521

從最開始的19962變成了現(xiàn)在的14600吱肌,過濾掉了5000+基因......

最后是根據(jù)腫瘤組織和正常組織進(jìn)行分組:
這里我們只選擇了實(shí)體瘤和部分正常組織痘拆。如果你想選擇更多,只要在typesample參數(shù)中添加更多類型即可氮墨。

可選類型見下圖纺蛆,也是根據(jù)TCGA-barcode進(jìn)行判斷的:


image.png
# selection of normal samples "NT"
samplesNT <- TCGAquery_SampleTypes(
    barcode = colnames(coadFilt),
    typesample = c("NT")
)

# selection of tumor samples "TP"
samplesTP <- TCGAquery_SampleTypes(
    barcode = colnames(coadFilt), 
    typesample = c("TP")
)

所以分組這一步我們還是自己搞定吐葵!就是根據(jù)barcode的第14,15位數(shù),結(jié)合上面那張圖判斷桥氏,毫無難度温峭。


image.png
# 小于10就是tumor
samplesTumor <- as.numeric(substr(colnames(coadFilt),14,15))<10

差異分析

非常簡單,支持edgeRlimma兩種方法字支,當(dāng)然也可以無縫連接DESeq2進(jìn)行差異分析凤藏!

# Diff.expr.analysis (DEA)
coadDEGs <- TCGAanalyze_DEA(
    mat1 = coadFilt[,!samplesTumor], # normal矩陣
    mat2 = coadFilt[,samplesTumor], # tumor矩陣
    Cond1type = "Normal",
    Cond2type = "Tumor",
    fdr.cut = 0.01, 
    logFC.cut = 1,
    pipeline = "edgeR", # limma
    method = "glmLRT"
)
## Batch correction skipped since no factors provided
## ----------------------- DEA -------------------------------
## o 41 samples in Cond1type Normal
## o 480 samples in Cond2type Tumor
## o 14600 features as miRNA or genes
## This may take some minutes...
## ----------------------- END DEA -------------------------------

差異分析就做好了!結(jié)果非常完美堕伪,同時(shí)提供了gene_name和gene_type揖庄,也就是說我們一開始不取子集也是可以的~~,最后再取也行欠雌!

image.png

使用DESeq2進(jìn)行差異分析

連接DESeq2那真是太簡單了蹄梢,無縫銜接!桨昙!

library(DESeq2)

直接把SummarizedExperiment對象傳給DESeqDataSet()函數(shù)即可检号。

不過需要分組信息,這個需要我們手動制作一下蛙酪。

制作分組信息

其實(shí)我們的對象中包含了sample_type這一列信息齐苛,就在coldata中,但是有點(diǎn)過于詳細(xì)了桂塞。

table(colData(se_mrna)$sample_type)
## 
##          Metastatic       Primary Tumor     Recurrent Tumor Solid Tissue Normal 
##                   1                 478                   1                  41

我們給它修改一下~

new_type <- ifelse(colData(se_mrna)$sample_type == "Solid Tissue Normal",
                   "Normal",
                   "Tumor")
colData(se_mrna)$sample_type <- new_type

table(colData(se_mrna)$sample_type)
## 
## Normal  Tumor 
##     41    480

這樣就把分組搞定了凹蜂!skr!

差異分析

然后就是愉快的進(jìn)行差異分析~

ddsSE <- DESeqDataSet(se_mrna, design = ~ sample_type)
## renaming the first element in assays to 'counts'
ddsSE
## class: DESeqDataSet 
## dim: 19962 521 
## metadata(2): data_release version
## assays(6): counts stranded_first ... fpkm_unstrand fpkm_uq_unstrand
## rownames(19962): ENSG00000000003.15 ENSG00000000005.6 ...
##   ENSG00000288674.1 ENSG00000288675.1
## rowData names(10): source type ... hgnc_id havana_gene
## colnames(521): TCGA-A6-5664-01A-21R-1839-07
##   TCGA-D5-6530-01A-11R-1723-07 ... TCGA-A6-2683-01A-01R-0821-07
##   TCGA-A6-2683-11A-01R-A32Z-07
## colData names(107): barcode patient ... paper_vascular_invasion_present
##   paper_vital_status

先過濾,這里我們簡單點(diǎn)阁危,

keep <- rowSums(counts(ddsSE)) >= 50
ddsSE <- ddsSE[keep,]

ddsSE
## class: DESeqDataSet 
## dim: 18820 521 
## metadata(2): data_release version
## assays(6): counts stranded_first ... fpkm_unstrand fpkm_uq_unstrand
## rownames(18820): ENSG00000000003.15 ENSG00000000005.6 ...
##   ENSG00000288674.1 ENSG00000288675.1
## rowData names(10): source type ... hgnc_id havana_gene
## colnames(521): TCGA-A6-5664-01A-21R-1839-07
##   TCGA-D5-6530-01A-11R-1723-07 ... TCGA-A6-2683-01A-01R-0821-07
##   TCGA-A6-2683-11A-01R-A32Z-07
## colData names(107): barcode patient ... paper_vascular_invasion_present
##   paper_vital_status

確定誰和誰比玛痊,我們設(shè)定Tumor-Normal

ddsSE$sample_type <- factor(ddsSE$sample_type, levels = c("Tumor","Normal"))

接下來就可以進(jìn)行差異分析了:

dds <- DESeq(ddsSE)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 1544 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
res <- results(dds, contrast = c("sample_type","Tumor","Normal"))
res
## log2 fold change (MLE): sample_type Tumor vs Normal 
## Wald test p-value: sample_type Tumor vs Normal 
## DataFrame with 18820 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE       stat      pvalue
##                    <numeric>      <numeric> <numeric>  <numeric>   <numeric>
## ENSG00000000003.15  5198.688       0.456284 0.1445636    3.15628 1.59793e-03
## ENSG00000000005.6     42.064       0.414580 0.3014120    1.37546 1.68989e-01
## ENSG00000000419.13  1743.704       0.849473 0.1134251    7.48929 6.92491e-14
## ENSG00000000457.14   463.909      -0.261646 0.0690276   -3.79046 1.50371e-04
## ENSG00000000460.17   328.546       1.272811 0.0940574   13.53229 1.00837e-41
## ...                      ...            ...       ...        ...         ...
## ENSG00000288658.1   5.317261    -1.45986802  0.274148 -5.3251160 1.00889e-07
## ENSG00000288660.1   4.648268     1.47152585  0.450554  3.2660389 1.09063e-03
## ENSG00000288669.1   0.143343    -0.13840184  1.339247 -0.1033431 9.17691e-01
## ENSG00000288674.1   3.201667     0.00548347  0.174731  0.0313824 9.74965e-01
## ENSG00000288675.1  15.048025     2.01434132  0.174631 11.5348224 8.80688e-31
##                           padj
##                      <numeric>
## ENSG00000000003.15 2.75622e-03
## ENSG00000000005.6  2.13290e-01
## ENSG00000000419.13 3.14116e-13
## ENSG00000000457.14 2.95220e-04
## ENSG00000000460.17 2.77449e-40
## ...                        ...
## ENSG00000288658.1  2.74461e-07
## ENSG00000288660.1  1.92231e-03
## ENSG00000288669.1  9.29645e-01
## ENSG00000288674.1  9.78866e-01
## ENSG00000288675.1  1.20917e-29

結(jié)果很棒狂打,不過沒有g(shù)ene_symbol了擂煞,需要自己添加哦~

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個濱河市趴乡,隨后出現(xiàn)的幾起案子对省,更是在濱河造成了極大的恐慌,老刑警劉巖晾捏,帶你破解...
    沈念sama閱讀 206,013評論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件蒿涎,死亡現(xiàn)場離奇詭異,居然都是意外死亡惦辛,警方通過查閱死者的電腦和手機(jī)劳秋,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,205評論 2 382
  • 文/潘曉璐 我一進(jìn)店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人玻淑,你說我怎么就攤上這事嗽冒。” “怎么了岁忘?”我有些...
    開封第一講書人閱讀 152,370評論 0 342
  • 文/不壞的土叔 我叫張陵辛慰,是天一觀的道長探熔。 經(jīng)常有香客問我媚值,道長蚯涮,這世上最難降的妖魔是什么堕战? 我笑而不...
    開封第一講書人閱讀 55,168評論 1 278
  • 正文 為了忘掉前任筑凫,我火速辦了婚禮庐冯,結(jié)果婚禮上盈滴,老公的妹妹穿的比我還像新娘欲账。我一直安慰自己五鲫,他們只是感情好溺职,可當(dāng)我...
    茶點(diǎn)故事閱讀 64,153評論 5 371
  • 文/花漫 我一把揭開白布。 她就那樣靜靜地躺著位喂,像睡著了一般浪耘。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上塑崖,一...
    開封第一講書人閱讀 48,954評論 1 283
  • 那天七冲,我揣著相機(jī)與錄音,去河邊找鬼规婆。 笑死澜躺,一個胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的抒蚜。 我是一名探鬼主播掘鄙,決...
    沈念sama閱讀 38,271評論 3 399
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢啊……” “哼嗡髓!你這毒婦竟也來了操漠?” 一聲冷哼從身側(cè)響起,我...
    開封第一講書人閱讀 36,916評論 0 259
  • 序言:老撾萬榮一對情侶失蹤饿这,失蹤者是張志新(化名)和其女友劉穎浊伙,沒想到半個月后,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體蛹稍,經(jīng)...
    沈念sama閱讀 43,382評論 1 300
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡吧黄,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 35,877評論 2 323
  • 正文 我和宋清朗相戀三年部服,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了唆姐。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點(diǎn)故事閱讀 37,989評論 1 333
  • 序言:一個原本活蹦亂跳的男人離奇死亡廓八,死狀恐怖奉芦,靈堂內(nèi)的尸體忽然破棺而出赵抢,到底是詐尸還是另有隱情,我是刑警寧澤声功,帶...
    沈念sama閱讀 33,624評論 4 322
  • 正文 年R本政府宣布烦却,位于F島的核電站,受9級特大地震影響先巴,放射性物質(zhì)發(fā)生泄漏其爵。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,209評論 3 307
  • 文/蒙蒙 一伸蚯、第九天 我趴在偏房一處隱蔽的房頂上張望摩渺。 院中可真熱鬧,春花似錦剂邮、人聲如沸摇幻。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,199評論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽绰姻。三九已至,卻和暖如春引瀑,著一層夾襖步出監(jiān)牢的瞬間狂芋,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 31,418評論 1 260
  • 我被黑心中介騙來泰國打工伤疙, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留银酗,地道東北人。 一個月前我還...
    沈念sama閱讀 45,401評論 2 352
  • 正文 我出身青樓徒像,卻偏偏與公主長得像黍特,于是被迫代替她去往敵國和親。 傳聞我的和親對象是個殘疾皇子锯蛀,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 42,700評論 2 345

推薦閱讀更多精彩內(nèi)容