寫在前面
要幫朋友做RNA-seq的分析,cases vs. controls總共4個樣本(2 vs. 2)瘩欺,看到文獻(xiàn)(實(shí)驗(yàn)設(shè)計(jì)比較類似)里用的是EBSeq較為頻繁哗讥,所以用這個來做澎蛛。
但從文獻(xiàn)來看娇唯,EBSeq對樣本量依賴較大。
EBSeq
- TPR relatively independent of sample size and presence of outliers.
- Poor FDR control in most situations, relatively unaffected by outliers.
- Medium computational time requirement, increases slightly with sample size.
EBSeq的輸入數(shù)據(jù)是原始的read count戒良,可以通過featureCounts体捏、HTSeq-count等軟件包獲得。
命令參考鏈接 http://www.bioconductor.org/packages/release/bioc/html/EBSeq.html
安裝
數(shù)據(jù)不算大,PC應(yīng)該夠用了几缭,所以在window下安裝河泳。EBSeq依賴的安裝包blockmodeling
一直無法成功安裝,隔了兩天再看年栓,才想到要去仔細(xì)看報(bào)錯文件拆挥,發(fā)現(xiàn)其中有特殊字符無法識別,索性直接改了某抓,就能直接加載啦纸兔。
source("http://bioconductor.org/biocLite.R")
biocLite("EBSeq")
require(EBSeq)
載入需要的程輯包:blockmodeling
Error: package or namespace load failed for ‘blockmodeling’:
attachNamespace()里算'blockmodeling'時.onAttach失敗了,詳細(xì)內(nèi)容:
調(diào)用: parse(con)
錯誤: 16:28: 意外的INCOMPLETE_STRING
15: journal = "Social Networks",
16: author = as.person("Ale
^
錯誤: 無法載入程輯包‘blockmodeling’
In addition: Warning message:
In parse(con) :
invalid input found on input connection 'C:/Users/cc/Documents/R/win-library/3.5/blockmodeling/CITATION'
library(blockmodeling)
#出現(xiàn)同樣的報(bào)錯
修改文件'C:/Users/cc/Documents/R/win-library/3.5/blockmodeling/CITATION'
的內(nèi)容搪缨,其中對應(yīng)報(bào)錯的行有Ale? ?iberna
——帶帽子的字符食拜,刪掉或者改成別的常見字符就OK了鸵熟。當(dāng)然如果擔(dān)心以后出類似問題副编,改起來瑣碎,可以直接設(shè)置R的識別語言流强。
Note:搜索時使用關(guān)鍵詞INCOMPLETE_STRING
痹届,而非最后的invalid input found on input connection
。搜索方向決定了解決問題的效率打月。
library(EBSeq)
載入需要的程輯包:gplots
載入程輯包:‘gplots’
The following object is masked from ‘package:stats’:
lowess
載入需要的程輯包:testthat
實(shí)戰(zhàn)
Example
#example
data(GeneMat)
head(GeneMat)
#[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#Gene_1 1879 2734 2369 2636 2188 9743 9932 10099 9829 9831
#Gene_2 24 40 22 27 31 118 108 144 117 113
Sizes=MedianNorm(GeneMat) #EBSeq requires the library size factor ls for each sample s. Here, ls may be obtained via the function MedianNorm, which reproduces the median normalization approach in DESeq
Conditions=as.factor(rep(c("C1","C2"),each=5)) #設(shè)置樣本類型
Conditions
#[1] C1 C1 C1 C1 C1 C2 C2 C2 C2 C2
#Levels: C1 C2
EBOut = EBTest(GeneMat,Conditions=as.factor(rep(c("C1","C2"),each=5)),sizeFactors=Sizes, maxround=5)
EBDERes=GetDEResults(EBOut, FDR=0.05)
summary(EBDERes)
str(EBDERes$DEfound) #a list of genes identified with 5% FDR
head(EBDERes$PPMat) # two columns PPEE and PPDE, corresponding to the posterior probabilities of being EE or DE for each gene
str(EBDERes$Status) #EBDERes$Status contains each gene’s status called by EBSeq
實(shí)例
paste KD_In1.count KD_In2.count EGFP_In1.count EGFP_In2.count |awk '{print $1"\t"$2"\t"$4"\t"$6"\t"$8}'|sort -k 1 >combined.count
head combined.count
tail combined.count
sed -i '1,2d'combined.count #刪除1-2行
sed -i '53800,$d' combined.count #刪除53800-last 行
sed -i '1igene\tKD1\tKD2\tEGFP1\tEGFP2' combined.count #增加首行
data=read.table("combined.count",header=T,row.names=1)
head(data)
# EGFP1 EGFP2 KD1 KD2
#ENSMUSG00000000001 281 243 295 233
#ENSMUSG00000000003 0 0 0 0
#ENSMUSG00000000028 20 25 25 12
data=as.matrix(data)
#pre-filtering 這里僅保留了readcounts總和>=10的基因
keep <- rowSums(data[,1:4])>=10
data <- data[keep,]
#跑出結(jié)果
require(EBSeq)
Sizes=MedianNorm(data)
EBOut = EBTest(data,Conditions=as.factor(rep(c("EGFP","KD"),each=2)),sizeFactors=Sizes, maxround=15) #maxround迭代次數(shù)队腐,一般要求最后結(jié)果趨于收斂比較可信
EBDERes=GetDEResults(EBOut, FDR=0.05)
summary(EBDERes)
str(EBDERes$DEfound) #a list of genes identified with 5% FDR
head(EBDERes$PPMat) # two columns PPEE and PPDE, corresponding to the posterior probabilities of being EE or DE for each gene
# PPEE PPDE
#ENSMUSG00000000001 1.0000000 3.203313e-17
#ENSMUSG00000000003 NA NA
str(EBDERes$Status) #EBDERes$Status contains each gene’s status called by EBSeq
#輸出DEG(FDR<0.05)
write.table(EBDERes$DEfound,file="DEG_FDR005.txt",quote=F,sep="\t",row.names=F,col.names=F)
#計(jì)算 Fold Change (FC) of the raw data as well as the posterior FC of the normalized data.
GeneFC=PostFC(EBOut)
str(GeneFC)
PlotPostVsRawFC(EBOut,GeneFC) #FC vs.Posterior FC的圖
write.table(GeneFC,file="FC.txt",quote=F,sep="\t") #注意$Direction
- 如何確定maxround?
查看參數(shù)值:EBOut$Alpha
,EBOut$P
,EBOut$Beta
,最后兩個值相差<0.01
就差不多了奏篙。
關(guān)于差異表達(dá)(RNA-seq)
標(biāo)準(zhǔn)化方法不同+檢驗(yàn)不同=多種組合/軟件柴淘,用之前需要結(jié)合自己的樣本量來考慮,多參考有相似實(shí)驗(yàn)設(shè)計(jì)的文獻(xiàn)秘通,常用的方法都跑一下为严,自己評估下結(jié)果差異,再做定奪肺稀。(研究本來就是充滿了不確定性第股,一切都只能用“可能性”來定義,所以话原,采用同樣參數(shù)仍然無法完全重復(fù)出文獻(xiàn)中的結(jié)果也是常見夕吻。即使你心有芥蒂...)
針對配合meRIP-seq的RNA-seq,因?yàn)榕笥研枰戳藥灼墨I(xiàn)繁仁,基本都是2 vs. 2的樣本量涉馅,文獻(xiàn)里的作法也有差異。
- 直接取了fold change >2 (2017Cancer Cell-m6A Demethylase ALKBH5 Maintains Tumorigenicity of Glioblastoma Stem-like Cells by Sustaining FOXM1 Expression and Cell Proliferation Program)
- EBSeq (2017Cancer Cell-FTO Plays an Oncogenic Role in Acute Myeloid Leukemia as a N6-Methyladenosine RNA Demethylase)
- DESeq2