1)簡介
edgeR作用對象是count文件掖举,rows 代表基因,行代表文庫娜庇,count代表的是比對到每個基因的reads數(shù)目塔次。它主要關(guān)注的是差異表達分析方篮,而不是定量基因表達水平。
edgeR works on a table of integer read counts, with rows corresponding to genes and columns to independent libraries. The counts represent the total number of reads aligning to each gene (or other genomic locus).edgeR is concerned with di?erential expression analysis rather than with the quanti?cation of expression levels. It is concerned with relative changes in expression levels between conditions,but not directly with estimating absolute expression levels.
edgeR作用的是真實的比對統(tǒng)計励负,因此不建議用預(yù)測的轉(zhuǎn)錄本
Note that edgeR is designed to work with actual read counts. We not recommend that predicted transcript abundances are input the edgeR in place of actual counts.
歸一化原因:
技術(shù)原因影響差異表達分析:
1)Sequencing depth:統(tǒng)計測序深度(即代表的是library size)藕溅;
2)RNA composition:個別異常高表達基因?qū)е缕渌虿蓸硬蛔?br> 3)GC content: sample-speci?c e?ects for GC-content can be detected
4)sample-speci?c e?ects for gene length have been detected
注意:edgeR必須是原始表達量,而不能是rpkm等矯正過的熄守。
Note that normalization in edgeR is model-based, and the original read counts are not themselves transformed. This means that users should not transform the read counts in any way before inputing them to edgeR. For example, users should not enter RPKM or FPKM values to edgeR in place of read counts. Such quantities will prevent edgeR from correctly estimating the mean-variance relationship in the data, which is a crucial to the statistical strategies underlying edgeR.Similarly, users should not add arti?cial values to the counts before inputing them to edgeR.
個人是不太推薦沒有重復(fù)的差異表達分析蜈垮,因為畢竟統(tǒng)計學(xué)上的p值是為了證明兩個樣本的差異是真實存在而不是抽樣誤差導(dǎo)致的,
因此每當(dāng)別人提問的時候, 我個人的建議就是定性看看倍數(shù)變化吧. 但是如果真的強行要算p值, 其實也不是不行, edgeR就是一種選擇.
2)裕照、首先安裝edgeR 包
#如果沒有安裝BiocMaRnager則先安裝BiocManager攒发,之后通過BiocManager安裝edgeR包
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("edgeR")
安裝結(jié)束之后開始處理文件
3)矩陣構(gòu)建及差異分析
需要構(gòu)建2個矩陣:1、表達矩陣晋南;2惠猿、分組矩陣( 實驗設(shè)計);
3.1 表達矩陣
3.11 讀取文件
# 首先讀取counts文件之后查看count文件前6行
> rawdata <- read.csv(file = "C://Users/My/Desktop/diff_name",header = T,stringsAsFactors = F)
# 查看讀取的diff_name文件
> head(rawdata)
X ensembl_gene_id gene_id control_W55 test_K54 external_gene_name
1 1 ENSMUSG00000000001 ENSMUSG00000000001.4 7 11 Gnai3
2 2 ENSMUSG00000000003 ENSMUSG00000000003.15 0 0 Pbsn
3 3 ENSMUSG00000000028 ENSMUSG00000000028.15 1 0 Cdc45
4 4 ENSMUSG00000000031 ENSMUSG00000000031.16 0 2 H19
5 5 ENSMUSG00000000037 ENSMUSG00000000037.17 0 0 Scml2
6 6 ENSMUSG00000000049 ENSMUSG00000000049.11 54 33 Apoh
讀取完數(shù)據(jù)之后我們先預(yù)處理一下數(shù)據(jù)负间,比如我只想要ensembl_gene_id偶妖、control_w55、tese_k54政溃、external_gene_name這幾列趾访,并調(diào)整一下順序。
> swap_rawdata <- cbind(rawdata$ensembl_gene_id,rawdata$external_gene_name,rawdata$control_W55,rawdata$test_K54)
> head(swap_rawdata)
[,1] [,2] [,3] [,4]
[1,] "ENSMUSG00000000001" "Gnai3" "7" "11"
[2,] "ENSMUSG00000000003" "Pbsn" "0" "0"
[3,] "ENSMUSG00000000028" "Cdc45" "1" "0"
[4,] "ENSMUSG00000000031" "H19" "0" "2"
[5,] "ENSMUSG00000000037" "Scml2" "0" "0"
[6,] "ENSMUSG00000000049" "Apoh" "54" "33"
# 得到的這個swap_rawdata是一個matrix董虱,如果想要讓其變?yōu)閐ata frame
> swap_rawdata <- data.frame(swap_rawdata)
# 查看一下是否轉(zhuǎn)化成功
> head(swap_rawdata)
ensembl_gene_id external_gene_name control_W55 test_K54
1 ENSMUSG00000000001 Gnai3 7 11
2 ENSMUSG00000000003 Pbsn 0 0
3 ENSMUSG00000000028 Cdc45 1 0
4 ENSMUSG00000000031 H19 0 2
5 ENSMUSG00000000037 Scml2 0 0
6 ENSMUSG00000000049 Apoh 54 33
# 轉(zhuǎn)化完轉(zhuǎn)化先存一份csv文件在電腦里扼鞋,便于之后用電腦查看
> write.csv(x = swap_rawdata,file = "C://Users/My/Desktop/swap_rawdata.csv")
# 存完之后直接從電腦導(dǎo)入你剛存的文件,這樣做可以避免出現(xiàn)numeric數(shù)據(jù)框變成factor形式
> swap_rawdata <- read.table("swap_rawdata.csv",header = T,sep = ",")
# 查看
> head(swap_rawdata)
X ensembl_gene_id external_gene_name control_W55 test_K54
1 1 ENSMUSG00000000001 Gnai3 7 11
2 2 ENSMUSG00000000003 Pbsn 0 0
3 3 ENSMUSG00000000028 Cdc45 1 0
4 4 ENSMUSG00000000031 H19 0 2
5 5 ENSMUSG00000000037 Scml2 0 0
6 6 ENSMUSG00000000049 Apoh 54 33
> data.class(swap_rawdata[1,1])
3.2 接著構(gòu)建DGEList對象
這里因為已經(jīng)有rawdata的count文件愤诱,因此直接用DGEList()函數(shù)就行了云头,否則要用readDGE()函數(shù)
# 首先載入edgeR 包
> library(edgeR)
# 構(gòu)建DGEList
> group <- 1:2
> y <- DGEList(counts = swap_rawdata[,4:5],genes = swap_rawdata[,2:3],group = group)
# 查看構(gòu)建完y的信息
> y
查看構(gòu)建DGElist的運行結(jié)果:
DGEList對象主要有三部分:
1、counts矩陣:包含的是整數(shù)counts;
2淫半、samples數(shù)據(jù)框:包含的是文庫(sample)信息溃槐。包含 lib.size列 :for the library size (sequencing depth) for each sample,如果不自定義, the library sizes will be computed from the column sums of the counts科吭。其中還有一個group列昏滴,用于指定每個sample組信息
3、一個可選的數(shù)據(jù)框genes:gene的注釋信息
第二步: 過濾 low counts數(shù)據(jù)对人。與DESeq2的預(yù)過濾不同影涉,DESeq2的預(yù)過濾只是為了改善后續(xù)運算性能,在運行過程中依舊會自動處理low count數(shù)據(jù)规伐,edgeR需要在分析前就要排除那些low count數(shù)據(jù)蟹倾,而且非常嚴格。從生物學(xué)角度,有生物學(xué)意義的基因的表達量必須高于某一個閾值鲜棠。從統(tǒng)計學(xué)角度上肌厨, low count的數(shù)據(jù)不太可能有顯著性差異,而且在多重試驗矯正階段還會拖后腿豁陆。
數(shù)據(jù)過濾
由于原來的表達量矩陣基因數(shù)太大, 可能存在某些基因根本沒有表達, 因此需要預(yù)先過濾
> keep <- rowSums(cpm(y)>1) >= 1
> y <- y[keep, , keep.lib.sizes=FALSE]
> y
這部分代碼的意思指的是保留在至少在一個樣本里有表達的基因(CPM > 1)柑爸。 基因數(shù)就從原來的55318變?yōu)?5868
標準化
考慮到測序深度不同, 我們需要對其進行標準化, 避免文庫大小不同導(dǎo)致的分析誤差.
edgeR里默認采用TMM(trimmed mean of M-values) 對配對樣本進行標準化,用到的函數(shù)是calcNormFactors
> y <- calcNormFactors(y)
執(zhí)行結(jié)果:
差異表達分析
不同差異表達分析工具的目標就是預(yù)測出dispersion(離散值), 有了離散值就能夠計算p值. 那么dispersion怎么計算呢? edgeR給了幾個方法
根據(jù)經(jīng)驗給定一個值(BCV, square-root-dispersion). edgeR給的建議是, 如果你是人類數(shù)據(jù), 且實驗做的很好(無過多的其他因素影響), 設(shè)置為0.4, 如果是遺傳上相似的模式物種(這里為小鼠), 設(shè)置為0.1 (查詢edgeR的bioconductor包所得)
Simply pick a reasonable dispersion value, based on your experience with similar data,and use that for exactTest or glmFit. Typical values for the common BCV (square-root dispersion) for datasets arising from well-controlled experiments are 0.4 for human data,0.1 for data on genetically identical model organisms or 0.01 for technical replicates.
> y_bcv <- y
# 因為本次的數(shù)據(jù)使用的是小鼠的數(shù)據(jù)盒音,所以使用0.1
> bcv <- 0.1
> et <- exactTest(y_bcv, dispersion = bcv ^ 2)
> gene1 <- decideTestsDGE(et, p.value = 0.05, lfc = 0)
> head(gene1)
> summary(gene1)
執(zhí)行結(jié)果:
由統(tǒng)計結(jié)果可知表鳍,下調(diào)的基因為816個,上調(diào)的基因為572個
如果覺得覺得基因較多的話祥诽,可以上調(diào)bcv的值
> y_bcv <- y
> bcv <- 0.2
> et2 <- exactTest(y_bcv, dispersion = bcv ^ 2)
> gene2 <- decideTestsDGE(et2, p.value = 0.05, lfc = 0)
> summary(gene2)
執(zhí)行結(jié)果:
由統(tǒng)計結(jié)果可知譬圣,下調(diào)的基因為377個,上調(diào)的基因為221個
與之前的結(jié)果相比雄坪,的確已經(jīng)減少很多
將結(jié)果整理成excel表
# 改一下gene1的名稱
> colnames(gene1) <- "Signifi"
# 組合將所需要的數(shù)據(jù)組成一個新的data.frame
> results <- cbind(y$genes,y$counts,et$table,gene1)
> head(results)
# 將新生成的results數(shù)據(jù)框?qū)懗梢粋€excel數(shù)據(jù)表
> write.csv(x = results,file = "C://Users/My/Desktop/DEresult.csv",row.names = F)
執(zhí)行結(jié)果:
生成的excel表可以將down expressed 和 up expressed基因分開
參考文獻:https://blog.csdn.net/u012110870/article/details/102804557
https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf