上一篇文章里面簡單學(xué)習(xí)了一下表達(dá)矩陣的提取店乐,順便探索了一下SummarizedExperiment
對象橄教。
今天學(xué)習(xí)下用TCGAbiolinks
做差異分析橄维。
加載R包和數(shù)據(jù)
rm(list = ls())
library(SummarizedExperiment)
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
library(TCGAbiolinks)
首先是查詢枣接、下載、整理硕旗,但是這一步我們在之前已經(jīng)做好了窗骑,直接加載就可以了!
下載方法:請翻看之前的推文
# 查詢
query <- GDCquery(project = "TCGA-COAD",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "STAR - Counts"
)
# 下載
GDCdownload(query, files.per.chunk = 100) #每次下載100個文件
# 整理
GDCprepare(query,save = T,save.filename = "TCGA-COAD_mRNA.Rdata")
我們已經(jīng)下載好了卵渴,就直接加載即可:
load("TCGA-mRNA/TCGA-COAD_mRNA.Rdata")
通常我們會區(qū)分mRNA和lncRNA,所以我們這里只選擇mRNA即可鲤竹,方法在上一篇也說過了浪读,非常簡單!直接對SummarizedExperiment
對象取子集即可辛藻!
se_mrna <- data[rowData(data)$gene_type == "protein_coding",]
se_mrna
## class: RangedSummarizedExperiment
## dim: 19962 521
## metadata(1): data_release
## assays(6): unstranded stranded_first ... fpkm_unstrand fpkm_uq_unstrand
## rownames(19962): ENSG00000000003.15 ENSG00000000005.6 ...
## ENSG00000288674.1 ENSG00000288675.1
## rowData names(10): source type ... hgnc_id havana_gene
## colnames(521): TCGA-A6-5664-01A-21R-1839-07
## TCGA-D5-6530-01A-11R-1723-07 ... TCGA-A6-2683-01A-01R-0821-07
## TCGA-A6-2683-11A-01R-A32Z-07
## colData names(107): barcode patient ... paper_vascular_invasion_present
## paper_vital_status
數(shù)據(jù)預(yù)處理
在進(jìn)行差異分析前進(jìn)行一些預(yù)處理碘橘。
dim(se_mrna)
## [1] 19962 521
首先根據(jù)spearman相關(guān)系數(shù)去除異常值:
coad_coroutliers <- TCGAanalyze_Preprocessing(se_mrna,cor.cut = 0.7)
## Number of outliers: 0
dim(coad_coroutliers)
## [1] 19962 521
還會生成一張相關(guān)圖:Array Array Intensity correlation (AAIC):
接下來進(jìn)行標(biāo)準(zhǔn)化:
# normalization of genes
coadNorm <- TCGAanalyze_Normalization(
tabDF = coad_coroutliers,
geneInfo = geneInfoHT
)
## I Need about 127 seconds for this Complete Normalization Upper Quantile [Processing 80k elements /s]
## Step 1 of 4: newSeqExpressionSet ...
## Step 2 of 4: withinLaneNormalization ...
## Step 3 of 4: betweenLaneNormalization ...
## Step 4 of 4: exprs ...
會使用EDASeq
包中的方法:
EDASeq::newSeqExpressionSet
EDASeq::withinLaneNormalization
EDASeq::betweenLaneNormalization
EDASeq::counts
dim(coadNorm)
## [1] 19469 521
然后過濾掉低表達(dá)的基因:
# quantile filter of genes
coadFilt <- TCGAanalyze_Filtering(
tabDF = coadNorm,
method = "quantile",
qnt.cut = 0.25
)
看看一通操作下來后還剩多少基因?
dim(coadFilt)
## [1] 14600 521
從最開始的19962變成了現(xiàn)在的14600吱肌,過濾掉了5000+基因......
最后是根據(jù)腫瘤組織和正常組織進(jìn)行分組:
這里我們只選擇了實(shí)體瘤和部分正常組織痘拆。如果你想選擇更多,只要在typesample
參數(shù)中添加更多類型即可氮墨。
可選類型見下圖纺蛆,也是根據(jù)TCGA-barcode進(jìn)行判斷的:
# selection of normal samples "NT"
samplesNT <- TCGAquery_SampleTypes(
barcode = colnames(coadFilt),
typesample = c("NT")
)
# selection of tumor samples "TP"
samplesTP <- TCGAquery_SampleTypes(
barcode = colnames(coadFilt),
typesample = c("TP")
)
所以分組這一步我們還是自己搞定吐葵!就是根據(jù)barcode的第14,15位數(shù),結(jié)合上面那張圖判斷桥氏,毫無難度温峭。
# 小于10就是tumor
samplesTumor <- as.numeric(substr(colnames(coadFilt),14,15))<10
差異分析
非常簡單,支持edgeR
和limma
兩種方法字支,當(dāng)然也可以無縫連接DESeq2
進(jìn)行差異分析凤藏!
# Diff.expr.analysis (DEA)
coadDEGs <- TCGAanalyze_DEA(
mat1 = coadFilt[,!samplesTumor], # normal矩陣
mat2 = coadFilt[,samplesTumor], # tumor矩陣
Cond1type = "Normal",
Cond2type = "Tumor",
fdr.cut = 0.01,
logFC.cut = 1,
pipeline = "edgeR", # limma
method = "glmLRT"
)
## Batch correction skipped since no factors provided
## ----------------------- DEA -------------------------------
## o 41 samples in Cond1type Normal
## o 480 samples in Cond2type Tumor
## o 14600 features as miRNA or genes
## This may take some minutes...
## ----------------------- END DEA -------------------------------
差異分析就做好了!結(jié)果非常完美堕伪,同時(shí)提供了gene_name和gene_type揖庄,也就是說我們一開始不取子集也是可以的~~,最后再取也行欠雌!
使用DESeq2進(jìn)行差異分析
連接DESeq2
那真是太簡單了蹄梢,無縫銜接!桨昙!
library(DESeq2)
直接把SummarizedExperiment
對象傳給DESeqDataSet()
函數(shù)即可检号。
不過需要分組信息,這個需要我們手動制作一下蛙酪。
制作分組信息
其實(shí)我們的對象中包含了sample_type
這一列信息齐苛,就在coldata
中,但是有點(diǎn)過于詳細(xì)了桂塞。
table(colData(se_mrna)$sample_type)
##
## Metastatic Primary Tumor Recurrent Tumor Solid Tissue Normal
## 1 478 1 41
我們給它修改一下~
new_type <- ifelse(colData(se_mrna)$sample_type == "Solid Tissue Normal",
"Normal",
"Tumor")
colData(se_mrna)$sample_type <- new_type
table(colData(se_mrna)$sample_type)
##
## Normal Tumor
## 41 480
這樣就把分組搞定了凹蜂!skr!
差異分析
然后就是愉快的進(jìn)行差異分析~
ddsSE <- DESeqDataSet(se_mrna, design = ~ sample_type)
## renaming the first element in assays to 'counts'
ddsSE
## class: DESeqDataSet
## dim: 19962 521
## metadata(2): data_release version
## assays(6): counts stranded_first ... fpkm_unstrand fpkm_uq_unstrand
## rownames(19962): ENSG00000000003.15 ENSG00000000005.6 ...
## ENSG00000288674.1 ENSG00000288675.1
## rowData names(10): source type ... hgnc_id havana_gene
## colnames(521): TCGA-A6-5664-01A-21R-1839-07
## TCGA-D5-6530-01A-11R-1723-07 ... TCGA-A6-2683-01A-01R-0821-07
## TCGA-A6-2683-11A-01R-A32Z-07
## colData names(107): barcode patient ... paper_vascular_invasion_present
## paper_vital_status
先過濾,這里我們簡單點(diǎn)阁危,
keep <- rowSums(counts(ddsSE)) >= 50
ddsSE <- ddsSE[keep,]
ddsSE
## class: DESeqDataSet
## dim: 18820 521
## metadata(2): data_release version
## assays(6): counts stranded_first ... fpkm_unstrand fpkm_uq_unstrand
## rownames(18820): ENSG00000000003.15 ENSG00000000005.6 ...
## ENSG00000288674.1 ENSG00000288675.1
## rowData names(10): source type ... hgnc_id havana_gene
## colnames(521): TCGA-A6-5664-01A-21R-1839-07
## TCGA-D5-6530-01A-11R-1723-07 ... TCGA-A6-2683-01A-01R-0821-07
## TCGA-A6-2683-11A-01R-A32Z-07
## colData names(107): barcode patient ... paper_vascular_invasion_present
## paper_vital_status
確定誰和誰比玛痊,我們設(shè)定Tumor-Normal。
ddsSE$sample_type <- factor(ddsSE$sample_type, levels = c("Tumor","Normal"))
接下來就可以進(jìn)行差異分析了:
dds <- DESeq(ddsSE)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 1544 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
res <- results(dds, contrast = c("sample_type","Tumor","Normal"))
res
## log2 fold change (MLE): sample_type Tumor vs Normal
## Wald test p-value: sample_type Tumor vs Normal
## DataFrame with 18820 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003.15 5198.688 0.456284 0.1445636 3.15628 1.59793e-03
## ENSG00000000005.6 42.064 0.414580 0.3014120 1.37546 1.68989e-01
## ENSG00000000419.13 1743.704 0.849473 0.1134251 7.48929 6.92491e-14
## ENSG00000000457.14 463.909 -0.261646 0.0690276 -3.79046 1.50371e-04
## ENSG00000000460.17 328.546 1.272811 0.0940574 13.53229 1.00837e-41
## ... ... ... ... ... ...
## ENSG00000288658.1 5.317261 -1.45986802 0.274148 -5.3251160 1.00889e-07
## ENSG00000288660.1 4.648268 1.47152585 0.450554 3.2660389 1.09063e-03
## ENSG00000288669.1 0.143343 -0.13840184 1.339247 -0.1033431 9.17691e-01
## ENSG00000288674.1 3.201667 0.00548347 0.174731 0.0313824 9.74965e-01
## ENSG00000288675.1 15.048025 2.01434132 0.174631 11.5348224 8.80688e-31
## padj
## <numeric>
## ENSG00000000003.15 2.75622e-03
## ENSG00000000005.6 2.13290e-01
## ENSG00000000419.13 3.14116e-13
## ENSG00000000457.14 2.95220e-04
## ENSG00000000460.17 2.77449e-40
## ... ...
## ENSG00000288658.1 2.74461e-07
## ENSG00000288660.1 1.92231e-03
## ENSG00000288669.1 9.29645e-01
## ENSG00000288674.1 9.78866e-01
## ENSG00000288675.1 1.20917e-29
結(jié)果很棒狂打,不過沒有g(shù)ene_symbol了擂煞,需要自己添加哦~