Key terms: differential gene expression analysis, differentially expressed gene (DEG), volcano plot.
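Fold change
Fold change is the ratio between two expression values. Suppose gene A has an expression value of 1 and gene B has a value of 3; then B's expression is 3 times that of A. Expression is usually measured as counts, TPM, or FPKM, so expression values are never negative, and for nonzero values the fold change lies in (0, +∞).
Why do negative values in a DEG list usually mean downregulation and positive values upregulation? Because we use the log2 fold change. When expr(A) < expr(B), the fold change of B relative to A is greater than 1 and the log2 fold change is greater than 0 (see the figure below), so B is upregulated relative to A. When expr(A) > expr(B), the fold change of B relative to A is less than 1 and the log2 fold change is less than 0, so B is downregulated. To avoid taking the log of zero (which gives -Inf or NaN instead of a usable number), we usually add 1 (or a very small number) to each expression value, that is, log2(B+1) - log2(A+1). [This requires a little background on logarithms.]
Why not simply use the difference in expression, which already has a sign? Suppose A is expressed at 1, B at 8, and C at 64. Using the difference, B is up by 7 relative to A, and C is up by 56 relative to B. Using the log2 fold change (without a pseudocount), B is up by 3 relative to A, and C is also up by only 3 relative to B. Sequencing shows that expression levels of different genes in a cell span a huge range, so the raw difference is clearly unsuitable; the log2 fold change is a better measure of relative change.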
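The A/B/C example above can be checked with a few lines of Python. This is only a minimal sketch: the function name `log2_fold_change` and its `pseudocount` parameter are made up for illustration and do not come from any particular DEG tool.

```python
import math

def log2_fold_change(a, b, pseudocount=0.0):
    """log2 fold change of b relative to a.

    The pseudocount (an illustrative parameter) is added to both values
    so that a zero expression value does not end up inside the log.
    """
    return math.log2(b + pseudocount) - math.log2(a + pseudocount)

# Expression values from the example in the text.
expr = {"A": 1.0, "B": 8.0, "C": 64.0}

# Raw differences depend strongly on the absolute expression level.
diff_b_vs_a = expr["B"] - expr["A"]
diff_c_vs_b = expr["C"] - expr["B"]

# log2 fold changes are the same for both comparisons (an 8-fold increase each).
lfc_b_vs_a = log2_fold_change(expr["A"], expr["B"])
lfc_c_vs_b = log2_fold_change(expr["B"], expr["C"])

# With a pseudocount of 1 the value shrinks, most noticeably at low expression.
lfc_b_vs_a_pc = log2_fold_change(expr["A"], expr["B"], pseudocount=1.0)

print(diff_b_vs_a, diff_c_vs_b)
print(lfc_b_vs_a, lfc_c_vs_b, lfc_b_vs_a_pc)
```

Note that the pseudocount version no longer gives exactly 3 for B versus A, which is one reason low-expression fold changes should be read with care.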
Although everyone uses the log2 fold change, it clearly has drawbacks. First, which is the bigger change: 5 to 10, or 100 to 120? Second, a change from 5 to 10 may simply be caused by technical noise. So when a gene's overall expression is low, the log2 fold change becomes less reliable, especially close to 0. A disadvantage and serious risk of using fold change in this setting is that it is biased[7] and may misclassify differentially expressed genes with large differences (B - A) but small ratios (B/A), leading to poor identification of changes at high expression levels. Furthermore, when the denominator is close to zero, the ratio is not stable, and the fold change value can be disproportionately affected by measurement noise.
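Significance of the difference (P-value)
This is where statistics comes in: significance is computed with a hypothesis test.
A hypothesis test first needs a hypothesis. We assume that A and B are expressed at the same level (H0, the null hypothesis). Under that assumption we use a t test (taking RT-PCR as the example) to compute the probability of observing a difference between A and B at least as large as the one we actually saw. That probability is the P-value. If the P-value is < 0.05, an unlikely event has occurred under H0, so we reject the null hypothesis and conclude that A and B are expressed differently, i.e. the difference is significant.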
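The logic of "assume H0, then ask how surprising the data are" can be sketched in code. To keep the example dependency-free, the sketch below swaps the t test for an exact two-sided permutation test on the difference of means; the idea is the same, but it is not the test named in the text. The replicate values are invented for illustration.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of group means.

    Under H0 the group labels are interchangeable, so every way of splitting
    the pooled values into two groups of the original sizes is equally likely.
    The p-value is the fraction of splits whose absolute mean difference is
    at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_b) - mean(group_a))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    for chosen in combinations(indices, n_a):
        chosen_set = set(chosen)
        relabeled_a = [pooled[i] for i in chosen]
        relabeled_b = [pooled[i] for i in indices if i not in chosen_set]
        diff = abs(mean(relabeled_b) - mean(relabeled_a))
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total

# Hypothetical relative expression values, four replicates per condition.
expr_a = [1.0, 1.2, 0.9, 1.1]
expr_b = [2.0, 2.3, 1.9, 2.2]

p_value = permutation_p_value(expr_a, expr_b)
print(p_value, "reject H0" if p_value < 0.05 else "do not reject H0")
```

Here only 2 of the 70 possible splits are as extreme as the observed one, so the p-value is 2/70 and H0 is rejected at the 0.05 level. The p-value says nothing about direction; that still comes from the fold change.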
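Significance only tells us that the difference is statistically significant; to see whether a gene is up- or downregulated, we have to go back to the fold change.
Only the most basic principle is described here; the algorithms inside real tools such as DESeq2 are of course far more complex.
This figure takes the negative log of the q-value (the corrected p-value), so the more significant a gene is, the larger its negative log. In a volcano plot, therefore, the "lava" farthest from the base is the most significant, and the points farthest out to the sides have the largest differences.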
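As a rough sketch of how volcano plot coordinates are derived, the snippet below maps each gene to an x value (log2 fold change) and a y value (negative log10 of the q-value) and labels it. The cutoffs `lfc_cutoff=1.0` and `q_cutoff=0.05`, the base-10 log, and the gene names are assumptions chosen for illustration, not values taken from the text.

```python
import math

def volcano_point(log2_fc, q_value, lfc_cutoff=1.0, q_cutoff=0.05):
    """Return (x, y, label) for one gene on a volcano plot.

    x is the log2 fold change, y is -log10(q-value). Assumes q_value > 0.
    The cutoffs are illustrative defaults, not a standard.
    """
    y = -math.log10(q_value)
    if q_value < q_cutoff and log2_fc >= lfc_cutoff:
        label = "up"
    elif q_value < q_cutoff and log2_fc <= -lfc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return log2_fc, y, label

# Hypothetical genes: (log2 fold change, q-value).
genes = {
    "geneA": (2.0, 0.001),
    "geneB": (-1.5, 0.01),
    "geneC": (3.0, 0.5),    # large change, but not significant
    "geneD": (0.2, 0.0001)  # significant, but tiny change
}

points = {name: volcano_point(lfc, q) for name, (lfc, q) in genes.items()}
for name, (x, y, label) in points.items():
    print(f"{name}\tx={x:.2f}\ty={y:.2f}\t{label}")
```

Genes C and D show why both axes matter: a large fold change without significance, or significance with a negligible fold change, would not normally be reported as a DEG.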
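If you only need to be able to read DEG results, you can stop here; if you want to dig deeper, read on.
Another article, on library construction: RNA-seq library construction technology | RNA sequence library construction
The topics discussed below are:
- The basic RNA-seq analysis workflow
- Common algorithms for DEG analysis
- The methods behind common DEG tools and how they compare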
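Preface
When analyzing biological, physiological, biochemical, or bioinformatics data, the phrase you hear most often is surely "differential (expression) gene analysis". From the original RT-PCR, to microarrays, to RNA-seq, and now to single cell RNA-seq, everything revolves around differentially expressed genes.
(A bit of speculation: the next step will probably be measuring the dynamic expression of specific genes within specific compartments of the cell.)
Expression level: we assume that the amount of mRNA transcribed from a gene reflects the gene's activity and also influences changes in downstream proteins and metabolites. What we care about here is the expression of the gene, not its structure and not its isoforms.
Why is differential gene analysis so popular? First, the central dogma is well established, and gene expression is one of its core steps, shaping the downstream proteome and metabolome. Second, library construction and sequencing have become widespread, so gene expression levels are easy to obtain.
In a living organism, gene expression changes constantly and is not necessarily uniform: it varies with time, developmental stage, tissue, and environmental stimulus.
Differential gene analysis is mainly applied to:
- Changes in the expression of key genes during development (developmental studies)
- Which core genes change expression in mutant material (regulation studies)
- Which genes change expression after cells are treated with a drug (drug development)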
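How far has our understanding of genes and the transcriptome come?
What are the basic library construction methods? Library construction directly determines which sequences we can detect, and therefore which analyses we can do!
What normalization methods exist for gene expression?
What are type I and type II errors?
How are multiple tests corrected for? FDR
The 10x pipeline explained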
The mean UMI counts per cell of this gene in cluster i
The log2 fold-change of this gene's expression in cluster i relative to other clusters
The p-value denoting significance of this gene's expression in cluster i relative to other clusters, adjusted to account for the number of hypotheses (i.e. genes) being tested.
The differential expression analysis seeks to find, for each cluster, genes that are more highly expressed in that cluster relative to the rest of the sample. Here a differential expression test was performed between each cluster and the rest of the sample for each gene.
The Log2 fold-change (L2FC) is an estimate of the log2 ratio of expression in a cluster to that in all other cells. A value of 1.0 indicates 2-fold greater expression in the cluster of interest.
The p-value is a measure of the statistical significance of the expression difference and is based on a negative binomial test. The p-value reported here has been adjusted for multiple testing via the Benjamini-Hochberg procedure.
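To make the Benjamini-Hochberg adjustment mentioned above concrete, here is a minimal textbook sketch of the procedure. It is not taken from the 10x pipeline, whose implementation may differ in detail, and the raw p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Each p-value is scaled by m / rank (m tests, rank 1 = smallest p-value),
    then a running minimum from the largest rank downwards keeps the
    adjusted values monotone and capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
print(adj)
```

Every adjusted value is at least as large as its raw p-value, which is why fewer genes pass a given threshold after correction.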
In this table you can click on a column to sort by that value. Also, in this table genes were filtered by (Mean UMI counts > 1.0) and the top N genes by L2FC for each cluster were retained. Genes with L2FC < 0 or adjusted p-value >= 0.10 were grayed out. The number of top genes shown per cluster, N, is set to limit the number of table entries shown to 10000; N=10000/K^2 where K is the number of clusters. N can range from 1 to 50. For the full table, please refer to the "differential_expression.csv" files produced by the pipeline.
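The rule for N described above can be written as a small function. The text gives the formula N=10000/K^2 and the range 1 to 50 but does not say how fractional values are rounded; the integer (floor) division below is an assumption, and the function name is hypothetical.

```python
def top_genes_per_cluster(k, max_entries=10000, n_min=1, n_max=50):
    """Number of top genes shown per cluster for k clusters.

    Implements N = max_entries / k^2 clamped to [n_min, n_max].
    Floor division is an assumption; the source does not specify rounding.
    """
    if k < 1:
        raise ValueError("k must be a positive number of clusters")
    n = max_entries // (k * k)
    return max(n_min, min(n_max, n))

for k in (5, 10, 20, 150):
    print(k, top_genes_per_cluster(k))
```

With few clusters the cap of 50 applies, and with very many clusters the floor of 1 applies.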
Comparison of single-cell DEG identification tools
For data with a high level of multimodality, methods that consider the behavior of each individual gene, such as DESeq2, EMDomics, Monocle2, DEsingle, and SigEMD, show better TPRs. These tools are highly sensitive: they miss few true DEGs, but their results include many false ones.
If the level of multimodality is low, however, SCDE, MAST, and edgeR can provide higher precision. These tools are highly precise: few of the DEGs they report are false, but the price is that they miss many true DEGs.
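The trade-off between TPR (sensitivity) and precision can be illustrated with a toy calculation. The gene sets below are invented and do not represent the output of any of the tools named above.

```python
def tpr_and_precision(true_degs, called_degs):
    """Return (TPR, precision) for a set of called DEGs.

    TPR = true positives / all true DEGs (how few real DEGs are missed).
    Precision = true positives / all called DEGs (how few calls are false).
    Returns 0.0 for a ratio whose denominator is empty.
    """
    true_set = set(true_degs)
    called_set = set(called_degs)
    tp = len(true_set & called_set)
    tpr = tp / len(true_set) if true_set else 0.0
    precision = tp / len(called_set) if called_set else 0.0
    return tpr, precision

# Hypothetical ground truth and two hypothetical tools.
truth = {"g1", "g2", "g3", "g4"}
sensitive_tool = {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"}
conservative_tool = {"g1", "g2"}

sensitive_scores = tpr_and_precision(truth, sensitive_tool)
conservative_scores = tpr_and_precision(truth, conservative_tool)
print("sensitive tool:    TPR=%.2f precision=%.2f" % sensitive_scores)
print("conservative tool: TPR=%.2f precision=%.2f" % conservative_scores)
```

The sensitive tool recovers every true DEG but half of its calls are false, while the conservative tool makes no false calls but misses half of the true DEGs.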
time-course DEG analysis
Comparative analysis of differential gene expression tools for RNA sequencing time course data
References:
Question: How to calculate "fold changes" in gene expression?
Exact Negative Binomial Test with edgeR
Differential gene expression analysis
Related articles:
Adding significance to ggplot boxplots | Add P-values and Significance Levels to ggplots | Analysis of variance (ANOVA)