RFECS: A Random-Forest Based Algorithm for Enhancer Identification from Chromatin State
因為最近剛好在學(xué)機器學(xué)習(xí)算法,結(jié)合在生信領(lǐng)域的具體應(yīng)用會更好,這里就大致解讀一下華人大神任兵老師作為通訊作者發(fā)表的基于隨機森林算法預(yù)測enhancer的文章:https://doi.org/10.1371/journal.pcbi.1002968
background
什么是熵?
熵是度量隨機變量不確定性的一個物理量谱仪,概率越小的,越不確定的绷蹲,蘊含信息量越大(因為存在的可能性越多)诺核,lnP越大(否則講一句廢話就表示信息量為零)衔肢,獲取信息就是消除熵的過程漂佩,噪音是信息獲取的干擾脖含,數(shù)據(jù)是信息+噪音,需要用知識分離投蝉。概率的輸入是微觀態(tài)养葵,熵的輸入是宏觀態(tài)(微觀態(tài)的加和)
參考:https://www.zhihu.com/question/22178202
在機器學(xué)習(xí)中,P往往用來表示樣本的真實分布墓拜,比如[1,0,0]表示當(dāng)前樣本屬于第一類港柜。Q用來表示模型所預(yù)測的分布请契,比如[0.7,0.2,0.1]
直觀的理解就是如果用P來描述樣本咳榜,那么就非常完美。而用Q來描述樣本爽锥,雖然可以大致描述涌韩,但是不是那么的完美,信息量不足氯夷,需要額外的一些“信息增量”才能達到和P一樣完美的描述臣樱。如果我們的Q通過反復(fù)訓(xùn)練,也能完美的描述樣本腮考,那么就不再需要額外的“信息增量”雇毫,Q等價于P。
什么是隨機森林踩蔚?隨機森林可以看成決策樹的延伸棚放,對決策樹而言,每個結(jié)點都會通過選擇最優(yōu)feature盡可能實現(xiàn)信息增量最大化馅闽,也就是最大熵減
隨機森林是指利用多棵決策樹對樣本數(shù)據(jù)進行訓(xùn)練飘蚯、分類并預(yù)測的一種方法馍迄,它在對數(shù)據(jù)進行分類的同時,還可以給出各個變量(基因)的重要性評分局骤,評估各個變量在分類中所起的作用攀圈。隨機森林的構(gòu)建大致如下:首先利用bootstrap方法又放回的從原始訓(xùn)練集中隨機抽取n個樣本,并構(gòu)建n個決策樹峦甩;然后假設(shè)在訓(xùn)練樣本數(shù)據(jù)中有m個特征赘来,那么每次分裂時選擇最好的特征進行分裂 每棵樹都一直這樣分裂下去,直到該節(jié)點的所有訓(xùn)練樣例都屬于同一類穴店;接著讓每顆決策樹在不做任何修剪的前提下最大限度的生長撕捍;最后將生成的多棵分類樹組成隨機森林,用隨機森林分類器對新的數(shù)據(jù)進行分類與回歸泣洞。對于分類問題忧风,按多棵樹分類器投票決定最終分類結(jié)果;而對于回歸問題球凰,則由多棵樹預(yù)測值的均值決定最終預(yù)測結(jié)果
參考:https://baijiahao.baidu.com/s?id=1612329431904493042&wfr=spider&for=pc
下列筆記參考StatQuest視頻做的記錄: https://statquest.org/video-index/
Building, using and evaluating random forest
Sometimes decision tree are not flexible when it comes to classifying new samples
Random forest combines the simplicity of decision tree and the flexibilty resulting in a vast improvement in accuracy
- Create a bootstrapped dataset, select sample randomly and the bootstrapped dataset is the same size as the original
- Create a decision tree using the bootstrapped dataset, but only use a random subset of variables (or columns) at each step, eg. each step consider two variables and choose one as the node
- And we generate a wide variety of trees
- We get a new patient, and we test this case in each tree we built, we can see which gets the most votes(yes have heart disease, or not) So we conclude based on the most votes
Bootstrapping the data plus using the aggregate to make a decision is called "Bagging"
Note: we allow duplicated entries in the bootstrapped dataset, that means we could have "Out-of-Bag Dataset", and we can see whether it is correctly labeled by random forest
Ultimately, we can measure how accurate our random forest is by the proportion of Out-Of -Bag samples that were corrrectly classified by the random forest
And we change the number of variables used per step, do this a bunch of times and then choose the most accurate random forest
Intro
基于隨機森林算法對enhancer進行預(yù)測狮腿,尋找optimal set of histone modifications for enhancer prediction in different cell
types. 用哪些組蛋白修飾對增強子進行預(yù)測可以取得最好的效果?
增強子鑒定困難來源于:
- 沒有共有的序列特征呕诉,非常diverse
- 距離受其調(diào)控的基因距離可以非常遠缘厢,從哪找?
先前的預(yù)測增強子的工作包括基于motif聚類和比較性分析的甩挫,但存在問題:
Computational techniques relying on transcription factor motif clustering or comparative analyses have had some success in identifying enhancers, but these predictions are neither comprehensive nor tissue-specific
另外贴硫,在全基因組范圍鑒定enhancer的很重要的一個方法就是chip-seq,但這個需基于先驗的TF的背景伊者,而且要在不同組織英遭、全基因組大范圍鑒定enhancer,工作量是巨大的
還有一種方法就是mapping the binding sites of transcriptional co-activators such as p300 and CBP亦渗,這種輔激活因子受大量序列特異TF募集挖诸,也就是募集到大量的enhancer上。但實際上并不是所有的enhancer都有set of co-activators法精,而且針對這些輔激活因子的chip級別的抗體也并非隨時available
最后一種方法就是尋找open chromatin 區(qū)域多律,但缺乏特異性,因為開放染色質(zhì)區(qū)還包括了silencers/repressors, insulators, promoters以及未知的核蛋白結(jié)合序列元件搂蜓,所以并不能特異地鑒定出我們想要的enhancer
certain histone modifications form a consistent signature of enhancers. It is on this approach that the present work is focused.文章基于的原理就是:組蛋白修飾marks的分布pattern應(yīng)該可以構(gòu)成功能元件的signatures狼荞,從而根據(jù)signatures來推斷潛在的enhancer元件
對于已有的計算生物學(xué)工具:
The supervised machine learning techniques include HMM, neural networks and genetic algorithm-optimized SVM based approaches, and have proved to be improvements over the profile-based method.
其實這些方法仍有局限性:一方面是受到訓(xùn)練集太少的局限,一方面僅測試了部分chromatin marks預(yù)測的效應(yīng)帮碰。而隨著更多的marks的發(fā)現(xiàn)相味,可以提高對于enhancer鑒定的精確性;也就是說收毫,使用一種方法確定最佳的marks集合攻走,充分利用更多的marks的信息殷勘,幫助我們更好地預(yù)測enhancer
首先,需要獲取足夠的訓(xùn)練集數(shù)據(jù)昔搂,作為Roadmap計劃的一部分玲销,作者在H1、IMR90兩種細胞系中檢測了24種chromatin modification摘符,使用ChIP-seq方法檢測了這兩種細胞系內(nèi)的p300 binding site
Additionally, we have experimentally determined a large number of promoter-distal p300 binding sites in each cell type, providing a rich training set for development of accurate and robust enhancer prediction algorithms 定義了啟動子遠端p300的結(jié)合位點(預(yù)示著潛在的enhancer)
Results
Prediction of enhancers using random forest and multiple chromatin marks
隨機森林算法在生物信息學(xué)中應(yīng)用廣泛贤斜,這得益于它對于大數(shù)據(jù)集高效的處理能力、不用擔(dān)心過擬合問題(算法本身是一個bagging逛裤、votes集成算法瘩绒,樣本和feature都加入了隨機,由于兩個隨機性的引入带族,一方面使得隨機森林不容易陷入過擬合锁荔,另一方面使其具有一定抗噪聲能力;用OOB數(shù)據(jù)作為驗證集)蝙砌。重要的是阳堕,隨機森林能夠衡量特征重要度。參考:https://blog.csdn.net/SMF0504/article/details/51939064
Random forests have recently become a popular machine learning technique in biology due to their ability to run efficiently on large datasets without over-fitting, and their inherently non-parametric structure. Since random forests use a
single variable at a time, they can give an automatic measure of feature importance
對 enhancer的初始定義:
We selected p300 binding sites overlapping DNase-I hypersensitive sites and distal to annotated TSS as active p300 binding sites representative of enhancers
于是择克,作者對他們定義的"enhancer"進行無監(jiān)督聚類恬总,每個cluster都有獨特的chromatin state pattern:
從上圖可以看出,這些cluster的H3K4me1信號值較強而H3K4me3信號較弱肚邢,這也與一直以來的認知:enhancer和promoter的marker差別是一致的壹堰;不過怎么解釋其他marker的信號差異呢?
Different clusters were characterized by varying levels of histone acetylation, H3K4me2 or H3K27me3. Clusters with presence or absence of H3K36me3 may represent genic and intergenic enhancers respectively.
一方面骡湖,活性染色質(zhì)marker的差異可能衡量了enhancer內(nèi)部subclasses還有獨特的“state”贱纠,PS個人觀點:獨特的"state“會不會與遠程相互作用的差異有關(guān)?標志不同的"blocks"?勺鸦;另一方面并巍,那些具有H3K36me3信號的cluster可能代表了分別落在基因內(nèi)和基因間區(qū)的enhancer
訓(xùn)練集的選饶磕尽:有活性的遠端p300結(jié)合位點作為enhancer class换途,TSS和隨機遠端100bp bins作為non-enhancer class
To train the forest, active and distal p300-binding sites (BS) were selected as representative of the enhancer class. As nonenhancer classes, we considered annotated transcription start sites (TSS) that overlap DNase-I, and random 100 bp bins that are distal to known p300 or TSS (see Methods)
Training一共分成兩步:
- 確定兩個class: p300 binding site和TSS、隨機背景序列
- 預(yù)測:用來做預(yù)測的數(shù)據(jù)以100bp的bin為單位刽射,每個element構(gòu)成20維的向量(each feature being a 20-dimensional vector of 100 bp bins from 21to +1 kb along the genomic element.)军拟,在每個node進行線性判別(這里作者運用的是一種特殊的隨機森林,每個node對應(yīng)一個線性分類器using Fischer Discriminant approach誓禁,應(yīng)該是判別分析的一種懈息,不是LDA,fisher是尋找這樣的一個空間摹恰,樣本投影在這個空間上辫继,類內(nèi)距離最小怒见,類間距離最大。那么怎么求這個空間呢姑宽,類似于PCA遣耍,求最大特征值對應(yīng)的特征向量組成的空間,參考:https://blog.csdn.net/u013943841/article/details/45080889)炮车,這樣舵变,每個bin最終都會被分配上enhancer or non-enhancer status
在上圖B,可以發(fā)現(xiàn)以p300結(jié)合位點為中心瘦穆,距離越遠纪隙,得票數(shù)越少;選擇得票數(shù)大于50%的用來做下游分析扛或,并假設(shè)在一定p300 binding距離范圍內(nèi)的bins都屬于同一個enhancer
基于OOB error的特征重要性度量:https://blog.csdn.net/m0_37770941/article/details/78330795
the importance of individual histone modifications for making enhancer predictions, we use an out-of-bag measure of variable importance implemented in Matlab as the function oobVarImp.
模型參數(shù)的優(yōu)化:經(jīng)典的ROC曲線
We used Receiver Operating Characteristic (ROC) curves to determine optimal parameters for our classification algorithm
但是作者在這里提到了一個問題绵咱,ROC曲線里面橫坐標的specificity并不是一個肯定的值,因為不能保證隨機選取的elements of non-p300 classes就都是真陰性熙兔。
Hence, in addition to the ROC curves generated using 5-fold cross-validation(5折交叉驗證), we also verified parameter selection by comparing the percentage of predicted enhancers at each cutoff that overlap markers of active enhancers (validation rate) or TSS (misclassification rate)
因此如上麸拄,作者在選取參數(shù)上進一步考量了不同閾值(即enhancer投票數(shù)高于多少才被認為是enhancer)下,有多少預(yù)測的enhancer的marker和已知的active enhancer/TSS的marker是比較一致的黔姜,已知的active enhancer marker/pattern比如超敏位點distal DNase-I hypersensitivity sites (HS), p300結(jié)合位點(excluding those used in training), occupancy by CBP or sequence specific TF known to act at embryonic stem cell enhancers such as NANOG, OCT4 and SOX2.
就隨機森林算法而言拢切,主要的一個需要決定的參數(shù)就是樹的數(shù)目。注意到在基因組中非增強子序列是遠多于增強子元件的(測試集的比例)秆吵,對訓(xùn)練集而言淮椰,non-p300 training sites也是要多于p300 classes的(需要考慮兩個class的比例)。另一個需要考慮的參數(shù)就是window的長度(-1kb--+1kb纳寂,在前面提到過主穗,這也是一個超參數(shù),0.5-1.0kb的window在ROC曲線的表現(xiàn)上其實差不多)
可以看到毙芜,雖然0.5k和1k的ROC曲線差不多忽媒,但是validation rate較低,錯誤分類率更高
B圖:Percentage of enhancers validated by true positive markers at different numbers of enhancers determined by various cutoffs (Validation rate or VR curve)腋粥,對訓(xùn)練集的驗證正確率
Enhancer predictions in H1 and IMR90 cells
決定樹的個數(shù):大于45棵樹后AUC值基本不變
In the end, we selected 65 trees for training the random forest as it appeared to be optimal for both cases. The training-set ratio of p300 to non-p300 was set at 1:7 since the ROC curve did not appear to change much beyond this ratio.
下圖的調(diào)參針對p300-non-p300 site的比例晦雨,前文有涉及
初步調(diào)好參數(shù)后,作者將訓(xùn)練好的RFECS算法運用在H1和IMR90的chromatin profiles上(24種組蛋白marks)
預(yù)測正確率的計算依據(jù)是:與超敏位點隘冲、p300結(jié)合位點闹瞧、一些特異轉(zhuǎn)錄因子結(jié)合位點相重合的enhancer占總的預(yù)測enhancer的比例(看上去條件比較苛刻)
We then calculated the validation rate as the percentage of predicted enhancers overlapping with DNase-I hypersensitivity sites and binding sites of p300 and a few sequence specific transcription factors known to function in each cell type (true positive markers).
Validation of enhancer predictions
對于同一個細胞系而言,如果分類器能正確地pick up更多的enhancer而不是tss展辞,那么它的表現(xiàn)就是好的奥邮,具體的評判分類是否正確的原則如下:
True Positive Markers (TPM) refer to DNase-I hypsersensitivity site, p300, CBP and Transcription factor binding sites
- If the nearest TPM lies within 2.5 kb of the enhancer and the nearest TSS is greater than 1 kb away from the TPM, the enhancer is ‘‘validated’’
- If a TSS lies within 2.5 kb of the enhancer, and the nearest TPM is either greater than 2.5 kb away from the enhancer or within 1 kb of the TSS, the enhancer is ‘‘misclassified’’
- If there is no TPM or TSS within 2.5 kb of the enhancer, it is ‘‘unknown’’.
作者除了評判validation rate和misclassification rate以外,還計算了這些在每個細胞系內(nèi)預(yù)測的enhancer到DNS, p300 結(jié)合位點的距離:基本都在200bp以內(nèi),說明預(yù)測效果良好
除了計算的驗證洽腺,作者還確認了proximal TSS上基因表達的激活效應(yīng)脚粟,即需要結(jié)合RNA-seq數(shù)據(jù)來確認細胞系特異的TSS,然后計算以這些TSS為中心蘸朋,預(yù)測enhancer的分布情況:可以看到兩種細胞系各自的predicted enhancer以各自TSS為中心呈現(xiàn)富集狀態(tài)
Random forest trained on one cell-type can accurately predict enhancers in other cell-types
雖然在各自細胞系(H1,IMR90)里珊楼,分類器的預(yù)測表現(xiàn)都不錯,但接下來的問題是度液,能不能把enhancer預(yù)測推廣到其他細胞系里厕宗?因為如果對每個細胞系都做分類器訓(xùn)練,然后測試堕担,這個工作是很耗時耗力的已慢。所以如果能用一個細胞系做訓(xùn)練集,另一個細胞系做測試集而且效果很好的話霹购,就可以省下不少麻煩佑惠。
這個想法很好,作者于是乎就干脆把H1作為訓(xùn)練集齐疙,IMR90作為測試集(反過來也是)來評估了一下分類器的泛化能力:
To evaluate the feasibility of such approach, we first trained a random-forest using chromatin modification profiles obtained in H1, and then applied it to the IMR90 cells.
Similarly, we also developed a random forest using the IMR90 data as the training set and then applied it to H1
上圖中紅點表示相同細胞系的validation膜楷,黑點表示不同細胞系間的validation(比如在H1 predict然后在IMR90里apply),可以看到分類器在不同細胞系間是有一定泛化能力的贞奋,不過正確率會比在相同細胞系里略低
那么接下來的問題就是赌厅,這種“略低”是技術(shù)誤差允許的還是細胞系之間真實的生物學(xué)差異造成的?
We sought to examine if this moderate decrease in performance was largely due to cell-type specific differences or was within the limits of technical or biological variability between replicates
下面這個地方可能需要理解一下轿塔,意思就是作者還測試了每個細胞系的replicate特愿,看有沒有可能是重復(fù)的批次造成了分類器上述的誤差。以上圖為例勾缭,藍色的點和藍色的星號分別表示同一細胞系和不同細胞系在replicate1中的驗證揍障,綠色的點和綠色的星號分別表示同一細胞系和不同細胞系在replicate2中的驗證,可以看出replicate1二者的差異很大而replicate2二者差異很小
we trained a random forest on one replicate of a cell-type, and made predictions on the other replicate of the same cell type. RFECS trained on IMR90 and then applied to the replicate 1 of the H1 profiles (blue dot vs asterisk) actually showed a higher validation rate and lower misclassification rate than RFECS trained using replicate 2 of H1 (fig. 2C,E)
因此俩由,作者發(fā)現(xiàn)不同的replicate可以造成很大的差別毒嫡,最后的結(jié)論是:之前的“略低”很可能是批次造成的,不是分類器本身的問題(不過我覺得replicate有點少幻梯,可以多一些兜畸;也可能作者只放了兩個在圖上)
In conclusion, predicting enhancers using the random forest built from a different cell type exhibits a modest decrease in performance compared to a same-cell training set. However, this decrease in performance is comparable to the decrease that can arise due to variability between two replicates of the same cell-type.
Optimal set of chromatin marks required for enhancer prediction
接下來關(guān)心的問題是,feature的importance? 到底哪些組蛋白mark對于預(yù)測enhancer是重要的礼旅?其實在上述隨機森林建樹的過程中膳叨,feature的重要性就可以衡量出來了:
In both H1 and IMR90, the variable importance was assessed for random forests trained on 5 cross-sections of data for each of the 2 sets of replicates individually as well as the set of averaged replicates. Upon ranking histone modifications by variable importance, it is apparent that H3K4me1 and H3K4me3 are the top 2 most robust modifications across replicates and cross-sectional samples in both cell types, followed by H3K4me2
上圖中可以看到排在前三的feature分別是H3K4me3洽洁、H3K4me1痘系、H3K4me2(H1和IMR90里的順序不一樣)
不過需要注意到的是,feature importance在不同細胞系里還是有差異的饿自,比如說下面這個相關(guān)性熱圖:
Differences observed in correlation clustering of the same 24 modifications in C.)H1 and D.)IMR90 explain some of the differences in ordering of variables in the two cell types. Same non-black colors of modifications indicate clusters that co-occur in both cell-types.
黑色的label表示是H1/IMR90獨有的汰翠,顯示出細胞系間mark importance的差異:一方面相對而言在H1中許多marker都是相關(guān)性很高的龄坪,說明這些marker代表的信息相對冗余(換句話說這些marker分布比較集中、一致)复唤,而IMR90中相關(guān)性相對弱一些健田,因此marker之間分布相對更獨立(信息不冗余,每個marker對variable importance)的貢獻比較平均
相關(guān)性熱圖具體的細節(jié)見method:
The Pearson correlation coefficient between any two modifications was computed for RPKM -normalized histone modification reads between -1to +1 kb for all elements within the selected training set. The correlation patterns of each histone modification was used to cluster the modifications and order them using MATLAB tools
接下來作者嘗試了使用不同組蛋白修飾集合對enhancer進行預(yù)測并衡量準確性(accuracy)
可以看出minimal set of H3K4me1-3表型還是不錯的(圖例如下)
In both cases the use of the minimal set of 3 modifications shows a much closer resemblance in performance to all 24 modifications than to the set of 2 marks H3K4me1 and H3K4me3
另外作者還檢驗了這個minimal set在replicate佛纫、chromosome1中的prediction表現(xiàn)(見上圖E)妓局,同樣是很好的。同時H3K27ac在此例中表現(xiàn)也很不錯呈宇,說明當(dāng)H3K4me2無相應(yīng)數(shù)據(jù)時可以用H3K27ac作為替代的feature好爬,而且后者是更為commonly used marker for active enhancer(因為H3K4me2數(shù)據(jù)相對少,這個marker關(guān)注度不如H3K27ac高甥啄,當(dāng)然希望feature越common越好)存炮,最后作者選擇H3K4me1, H3K4me3、H3K27ac作為最后的featrue
Comparison of RFECS with other enhancer prediction methods
接近尾聲蜈漓,比較一下這個方法和其他enhancer預(yù)測算法(比較當(dāng)然離不開ROC)穆桂,說明這個算法的重要性:
同樣地通過validation rate和misclassification rate來衡量:
The validation rate of RFECS predictions is around 70%, which is considerably higher than the other three methods (57% ChromaGenSVM, 51% CSIANN, 60% Chromia). Further, the misclassification rates of RFECS is less than 7%, much lower than the 27%, 35% and 15% rates of Chroma-GenSVM, CSIANN and Chromia, respectively.
不過比較的前提是訓(xùn)練集數(shù)據(jù)的上游處理步驟要一致:
we ran the various enhancer prediction methods on H3K4me1, H3K4me2 and H3K4me3 datasets of H1.
We tried to make the pre-processing stages of the various algorithms as consistent as possible by merging several replicates of each histone modification files and input files into single bed files and randomly selecting a smaller subset of p300 peaks for training, since these were the requirements of the other algorithms such as CSIANN and ChromaGenSVM.
Prediction of enhancers in multiple human cell-types
對ENCODE計劃里其他細胞系的enhancer進行預(yù)測:
We trained our random forest on the p300 ENCODE data in H1 and made enhancer predictions in 12 ENCODE cell-types using the three marks H3K4me1, H3K4me3 and H3K27ac since these were available for all the cell-types. Validation rates were assessed based on overlap with existing DNAse-I hypersensitivity data while misclassification rates were calculated based on overlap with UCSC TSS.
作者認為個別細胞系的validation rate較低可能是因為細胞系本身DHS比較少
Discussion
作者認為,除了p300以外融虽,結(jié)合包括其他enhancer結(jié)合蛋白的信息或許還能使RFECS增強對調(diào)控元件的預(yù)測能力享完,而且基于隨機森林的預(yù)測算法泛化能力好,有特征選擇能力有额,以后還可以用其他類型的featrue進行training驼侠;比如添加序列、motif谆吴、DNA甲基化信息來預(yù)測其他基因組元件
引用匯總:
RFECS: A Random-Forest Based Algorithm for Enhancer Identification from Chromatin State:https://doi.org/10.1371/journal.pcbi.1002968
機器學(xué)習(xí)概念:
https://www.zhihu.com/question/22178202
https://blog.csdn.net/tsyccnh/article/details/79163834
https://baijiahao.baidu.com/s?id=1612329431904493042&wfr=spider&for=pc