============================================
寫在前面:可以參考另外一篇《得到差異基因后怎么做兔院?》
============================================
序
- 走到這一步查坪,似乎順暢了許多,最主要時間不用花費那么多了颗品。另外缴阎,以前曾經(jīng)處理過不計其數(shù)的芯片质涛,挑選差異表達基因,篩選關(guān)鍵基因衫冻,功能富集,還有基于全部數(shù)據(jù)的WGCNA(當然你也可以用差異基因來做谒出,雖然不推薦隅俘,看不少文章也這么發(fā)),GSEA笤喳,PPI等等为居,這些后續(xù)我會慢慢發(fā)出來。
- 但是杀狡,因為以前處理的芯片表達譜數(shù)據(jù)是符合正態(tài)分布蒙畴,所以可以用t檢驗來篩選差異表達基因,但RNA-seq的read count普遍認為符合泊松分布呜象。所以篩選DEGs的方法還是不一樣
------------要求---------------
- 載入表達矩陣
- 設(shè)置好分組信息
- 用DEseq2進行差異分析
- 輸出差異分析結(jié)果
來源于生信技能樹
-------------------------參考文章-----------------------------------
- 2018.7月:Analyzing RNA-seq data with DESeq2
- Count-Based Differential Expression Analysis of RNA-seq Data
- 可以下載的pdf havard,點此下載
-------------------開始------------------------
-
正式載入數(shù)據(jù)之前膳凝,先看一下數(shù)據(jù)結(jié)構(gòu)data structure,因為這是我們下面一切分析的起點恭陡。
-
我們需要準備2個table:一個是countData蹬音,一個是colData
關(guān)于上面兩個表的說明
-
countData表示的是count矩陣,行代表gene子姜,列代表樣品祟绊,中間的數(shù)字代表對應(yīng)count數(shù)楼入。colData表示sample的元數(shù)據(jù),因為這個表提供了sample的元數(shù)據(jù)牧抽。
because this table supplies metadata/information about the columns of the countData matrix. Notice that the first column of colData must match the column names of countData (except the first, which is the gene ID column). - colData的行名與countData的列名一致(除去代表gene ID的第一列)
1 載入數(shù)據(jù)(countData和colData)
> library(tidyverse)
> library(DESeq2)
> #import data
> setwd("F:/rna_seq/data/matrix")
> mycounts<-read.csv("readcount.csv")
> head(mycounts)
X control1 control2 treat1 treat2
1 ENSMUSG00000000001 1648 2306 2941 2780
2 ENSMUSG00000000003 0 0 0 0
3 ENSMUSG00000000028 835 950 1366 1051
4 ENSMUSG00000000031 65 83 52 53
5 ENSMUSG00000000037 70 53 94 66
6 ENSMUSG00000000049 0 3 4 5
#這里有個x嘉熊,需要去除,先把第一列當作行名來處理
> rownames(mycounts)<-mycounts[,1]
#把帶X的列刪除
> mycounts<-mycounts[,-1]
> head(mycounts)
control1 control2 treat1 treat2
ENSMUSG00000000001 1648 2306 2941 2780
ENSMUSG00000000003 0 0 0 0
ENSMUSG00000000028 835 950 1366 1051
ENSMUSG00000000031 65 83 52 53
ENSMUSG00000000037 70 53 94 66
ENSMUSG00000000049 0 3 4 5
# 這一步很關(guān)鍵扬舒,要明白condition這里是因子阐肤,不是樣本名稱;小鼠數(shù)據(jù)有對照組和處理組讲坎,各兩個重復(fù)
> condition <- factor(c(rep("control",2),rep("treat",2)), levels = c("control","treat"))
> condition
[1] control control treat treat
Levels: control treat
#colData也可以自己在excel做好另存為.csv格式孕惜,再導(dǎo)入即可
> colData <- data.frame(row.names=colnames(mycounts), condition)
> colData
condition
control1 control
control2 control
treat1 treat
treat2 treat
2構(gòu)建dds對象,開始DESeq流程
注釋:dds=DESeqDataSet Object
> dds <- DESeqDataSetFromMatrix(mycounts, colData, design= ~ condition)
> dds <- DESeq(dds)
> # 查看一下dds的內(nèi)容
> dds
顯示為
class: DESeqDataSet
dim: 6 4
metadata(1): version
assays(3): counts mu cooks
rownames(6): ENSMUSG00000000001 ENSMUSG00000000003 ... ENSMUSG00000000037 ENSMUSG00000000049
rowData names(21): baseMean baseVar ... deviance maxCooks
colnames(4): control1 control2 treat1 treat2
colData names(2): condition sizeFactor
3 總體結(jié)果查看
接下來,我們要查看treat versus control的總體結(jié)果晨炕,并根據(jù)p-value進行重新排序衫画。利用summary
命令統(tǒng)計顯示一共多少個genes上調(diào)和下調(diào)(FDR0.1)
> res = results(dds, contrast=c("condition", "control", "treat"))
#或下面命令
> res= results(dds)
> res = res[order(res$pvalue),]
> head(res)
> summary(res)
#所有結(jié)果先進行輸出
> write.csv(res,file="All_results.csv")
> table(res$padj<0.05)
上述代碼的結(jié)果顯示
> res = results(dds2, contrast=c("condition", "control", "treat"))
> res = res[order(res$pvalue),]
> head(res)
log2 fold change (MLE): condition control vs treat
Wald test p-value: condition control vs treat
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000003309 548.1926 3.231611 0.2658125 12.157485 5.234568e-34 8.193146e-30
ENSMUSG00000046323 404.1894 3.067050 0.2628220 11.669687 1.820923e-31 1.425055e-27
ENSMUSG00000001123 341.8542 2.797485 0.2766499 10.112004 4.887441e-24 2.549941e-20
ENSMUSG00000023906 951.9460 2.382307 0.2510718 9.488551 2.342684e-21 9.116395e-18
ENSMUSG00000018569 485.4839 3.136031 0.3312999 9.465836 2.912214e-21 9.116395e-18
ENSMUSG00000000184 601.0842 -2.827750 0.3154171 -8.965112 3.099648e-19 8.085948e-16
> summary(res)
out of 28335 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 445, 1.6%
LFC < 0 (down) : 625, 2.2%
outliers [1] : 0, 0%
low counts [2] : 12683, 45%
(mean count < 18)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> write.csv(res,file="All_results.csv")
> table(res$padj<0.05)
FALSE TRUE
14909 743
可見,一共445個genes上調(diào)瓮栗,625個genes下調(diào)削罩,沒有離群值。padj小于0.05的共有743個费奸。
4 提取差異表達genes(DEGs)并進行g(shù)ene symbol注釋
差異表達基因的界定很不統(tǒng)一弥激,但log2FC是用的最廣泛同時也是最不精確的方式,但因為其好理解所以廣泛被應(yīng)用尤其芯片數(shù)據(jù)處理中愿阐,記的是havard universit做過一個統(tǒng)計微服,F(xiàn)C=2相對比較可靠。但無論怎么缨历,這種界定人為因素太大以蕴,過于武斷。所以GSEA戈二,WGCNA是拿全部表達數(shù)據(jù)(可以進行初步過濾)來進行分析舒裤,并且WGCNA采取軟閾值的方式來挑選genes更加合理。既然挑選差異表達基因觉吭,還是采取log2FC和padj來進行腾供。
獲取padj(p值經(jīng)過多重校驗校正后的值)小于0.05,表達倍數(shù)取以2為對數(shù)后大于1或者小于-1的差異表達基因鲜滩。代碼如下
> diff_gene_deseq2 <-subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
#或
#> diff_gene_deseq2 <-subset(res,padj < 0.05 & (log2FoldChange > 1 | log2FoldChange < -1))
> dim(diff_gene_deseq2)
> head(diff_gene_deseq2)
> write.csv(diff_gene_deseq2,file= "DEG_treat_vs_control.csv")
結(jié)果顯示如下:
> diff_gene_deseq2 <-subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
> dim(diff_gene_deseq2)
[1] 431 6
> head(diff_gene_deseq2)
log2 fold change (MLE): condition control vs treat
Wald test p-value: condition control vs treat
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000003309 548.1926 3.231611 0.2658125 12.157485 5.234568e-34 8.193146e-30
ENSMUSG00000046323 404.1894 3.067050 0.2628220 11.669687 1.820923e-31 1.425055e-27
ENSMUSG00000001123 341.8542 2.797485 0.2766499 10.112004 4.887441e-24 2.549941e-20
ENSMUSG00000023906 951.9460 2.382307 0.2510718 9.488551 2.342684e-21 9.116395e-18
ENSMUSG00000018569 485.4839 3.136031 0.3312999 9.465836 2.912214e-21 9.116395e-18
ENSMUSG00000000184 601.0842 -2.827750 0.3154171 -8.965112 3.099648e-19 8.085948e-16
5 用bioMart對差異表達基因進行注釋
和RNA-seq(6): reads計數(shù)伴鳖,合并矩陣并進行注釋代碼一樣
library('biomaRt')
library("curl")
mart <- useDataset("mmusculus_gene_ensembl", useMart("ensembl"))
my_ensembl_gene_id<-row.names(diff_gene_deseq2)
#listAttributes(mart)
mms_symbols<- getBM(attributes=c('ensembl_gene_id','external_gene_name',"description"),
filters = 'ensembl_gene_id', values = my_ensembl_gene_id, mart = mart)
結(jié)果如下:
> library('biomaRt')
> library("curl")
> mart <- useDataset("mmusculus_gene_ensembl", useMart("ensembl"))
> my_ensembl_gene_id<-row.names(diff_gene_deseq2)
> mms_symbols<- getBM(attributes=c('ensembl_gene_id','external_gene_name',"description"),
+ filters = 'ensembl_gene_id', values = my_ensembl_gene_id, mart = mart)
> head(mms_symbols)
ensembl_gene_id external_gene_name description
1 ENSMUSG00000000120 Ngfr nerve growth factor receptor (TNFR superfamily, member 16) [Source:MGI Symbol;Acc:MGI:97323]
2 ENSMUSG00000000184 Ccnd2 cyclin D2 [Source:MGI Symbol;Acc:MGI:88314]
3 ENSMUSG00000000276 Dgke diacylglycerol kinase, epsilon [Source:MGI Symbol;Acc:MGI:1889276]
4 ENSMUSG00000000308 Ckmt1 creatine kinase, mitochondrial 1, ubiquitous [Source:MGI Symbol;Acc:MGI:99441]
5 ENSMUSG00000000320 Alox12 arachidonate 12-lipoxygenase [Source:MGI Symbol;Acc:MGI:87998]
6 ENSMUSG00000000708 Kat2b K(lysine) acetyltransferase 2B [Source:MGI Symbol;Acc:MGI:1343094]
5合并數(shù)據(jù):res結(jié)果+mms_symbols合并成一個文件
合并的話兩個數(shù)據(jù)必須有共同的列名,我們先看一下
> head(diff_gene_deseq2)
log2 fold change (MLE): condition control vs treat
Wald test p-value: condition control vs treat
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000003309 548.1926 3.231611 0.2658125 12.157485 5.234568e-34 8.193146e-30
ENSMUSG00000046323 404.1894 3.067050 0.2628220 11.669687 1.820923e-31 1.425055e-27
ENSMUSG00000001123 341.8542 2.797485 0.2766499 10.112004 4.887441e-24 2.549941e-20
ENSMUSG00000023906 951.9460 2.382307 0.2510718 9.488551 2.342684e-21 9.116395e-18
ENSMUSG00000018569 485.4839 3.136031 0.3312999 9.465836 2.912214e-21 9.116395e-18
ENSMUSG00000000184 601.0842 -2.827750 0.3154171 -8.965112 3.099648e-19 8.085948e-16
> head(mms_symbols)
ensembl_gene_id external_gene_name
1 ENSMUSG00000000120 Ngfr
2 ENSMUSG00000000184 Ccnd2
3 ENSMUSG00000000276 Dgke
4 ENSMUSG00000000308 Ckmt1
5 ENSMUSG00000000320 Alox12
6 ENSMUSG00000000708 Kat2b
可見徙硅,兩個文件沒有共同的列名榜聂,所以要先給'diff_gene_deseq2'添加一個‘ensembl_gene_id’的列名,方法如下:(應(yīng)該有更簡便的方法)
> ensembl_gene_id<-rownames(diff_gene_deseq2)
> diff_gene_deseq2<-cbind(ensembl_gene_id,diff_gene_deseq2)
> colnames(diff_gene_deseq2)[1]<-c("ensembl_gene_id")
> diff_name<-merge(diff_gene_deseq2,mms_symbols,by="ensembl_gene_id")
>head(diff_name)
#查看Akap8的情況
Akap8 <- diff_name[diff_name$external_gene_name=="Akap8",]
中間顯示過程如下:
> ensembl_gene_id<-rownames(diff_gene_deseq2)
> diff_gene_deseq2<-cbind(ensembl_gene_id,diff_gene_deseq2)
> colnames(diff_gene_deseq2)[1]<-c("ensembl_gene_id")
> diff_name<-merge(diff_gene_deseq2,mms_symbols,by="ensembl_gene_id")
> head(diff_name)
DataFrame with 6 rows and 9 columns
ensembl_gene_id baseMean log2FoldChange lfcSE stat pvalue padj external_gene_name
<character> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <character>
1 ENSMUSG00000000120 434.04177 -1.293648 0.2713146 -4.768072 1.859970e-06 2.064698e-04 Ngfr
2 ENSMUSG00000000184 601.08417 -2.827750 0.3154171 -8.965112 3.099648e-19 8.085948e-16 Ccnd2
3 ENSMUSG00000000276 668.12500 -1.071362 0.2445381 -4.381168 1.180446e-05 9.603578e-04 Dgke
4 ENSMUSG00000000308 207.46719 1.944949 0.3427531 5.674489 1.391035e-08 3.819733e-06 Ckmt1
5 ENSMUSG00000000320 61.96266 1.451927 0.4637101 3.131109 1.741473e-03 4.105051e-02 Alox12
6 ENSMUSG00000000708 1070.03203 -1.046546 0.2049062 -5.107440 3.265530e-07 5.056107e-05 Kat2b
description
<character>
1 nerve growth factor receptor (TNFR superfamily, member 16) [Source:MGI Symbol;Acc:MGI:97323]
2 cyclin D2 [Source:MGI Symbol;Acc:MGI:88314]
3 diacylglycerol kinase, epsilon [Source:MGI Symbol;Acc:MGI:1889276]
4 creatine kinase, mitochondrial 1, ubiquitous [Source:MGI Symbol;Acc:MGI:99441]
5 arachidonate 12-lipoxygenase [Source:MGI Symbol;Acc:MGI:87998]
6 K(lysine) acetyltransferase 2B [Source:MGI Symbol;Acc:MGI:1343094]
> Akap8 <- diff_name[diff_name$external_gene_name=="Akap8",]
> Akap8
DataFrame with 1 row and 9 columns
ensembl_gene_id baseMean log2FoldChange lfcSE stat pvalue padj external_gene_name
<character> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <character>
1 ENSMUSG00000024045 2281.053 -1.276329 0.2428315 -5.256028 1.471996e-07 2.775865e-05 Akap8
description
<character>
1 A kinase (PRKA) anchor protein 8 [Source:MGI Symbol;Acc:MGI:1928488]
至此嗓蘑,差異表達基因提取并注釋完畢须肆,下一步
-
先進行數(shù)據(jù)可視化(Data visulization)
-
然后進行富集分分析及可視化