BETA轉錄因子與表達關聯(lián)分析

今天介紹和測試這款關聯(lián)ChIP-seq喘鸟、RNA-seq分析軟件，BETA（Binding and Expression Target Analysis)最盅。

該分析整合了ChIP-seq和轉錄表達水平來探究基因表達調控的機理突雪。此前，一些研究用于轉錄因子靶基因預測涡贱，但是鮮有高效的工具供人們使用咏删。

作者開發(fā)了名為BETA的軟件來研究結合位點與轉錄表達的關系，基于結合位點信息和差異表達信息可以進行如下三個主要分析：

（1）用于預測轉錄因子就有激活或抑制的功能问词；

（2）用于推斷轉錄因子的靶基因督函；

（3）用于鑒定轉錄因子的motif及其結合者。

該軟件具有三個主要的子命令：

第一部分：

BETA basic：轉錄因子激活或抑制功能預測和直接靶基因檢測激挪。

BETA basic \
    –p peaks.bed \
    –e expr.xls  \
    –k BSF  \
    –g hg19 \
    --da 500 \
    –n prefix \
    -o outdir

涉及參數(shù)及其他可選參數(shù)解釋如下：

-p peaks結合位點文件奏黑，BED格式跛溉，支持3列或5列格式，其格式為：

CHROM, START, END（NAME, SCORE）。

使用基本的3列信息即可忍啸。

示例如下（不包含表頭）：
chrom start end name（可選） score（可選）
chr1 1208689 1209509 AR_LNCaP_2 51.58
chr1 1334246 1335348 AR_LNCaP_7 54.55
chr1 2179351 2180790 AR_LNCaP_9 257.72

注：macs2檢峰時刻通過p-value進行peak的過濾只保留10000個結果嗜价。

-e 差異表達文件鸭廷，可以使用LIMMA和Cuffdiff的結果象颖，也可使用BETA specific format（BSF）格式豌鸡，其格式如下：

GeneID log2(FC) FDR/pval
NM_000015 1.28221 0.185744
NM_000016 0 1
NM_000017 0.0598207 0.844755

-k 差異表達文件的類型，即LIM, CUF, BSF, O段标，分別對應LIMMA涯冠、cuffdiff標準輸出格式、BSF格式或其他格式的文件需要使用--info指定逼庞。

-g 基因組蛇更，支持任何小鼠，hg38, hg19, hg18, mm10, mm9赛糟，也可使用-r指定其他基因組文件派任。

--gname2 給定次參數(shù)說明文件中的基因名稱或轉錄本名稱是官方的基因symbols，默認FALSE璧南。

--info 指定表達數(shù)據(jù)中的geneID, up/down status和statistcal values列掌逛。

-r 參考基因組文件。

-o 輸出路徑司倚。

--bl 是否使用CTCF邊界過濾基因周圍的peaks豆混，默認FALSE。

--bf CTCF保守peakBED文件动知，需要同時指定--bl參數(shù)且基因組不能指定為hg19和mm9皿伺。

--pn 打算分析的peaks數(shù)目，默認10000盒粮。

--method 支持score和distance鸵鸥，指定TF/CR給你預測的方法，score：調控潛能的打分丹皱，disrance：鄰近的結合peak的距離妒穴。

-n 輸出文件前綴名稱。

-d 從TSS一定范圍內獲取peaks摊崭，默認100000（100kb）讼油。

--df 輸入一個0~1范圍內的閾值來篩選最顯著的差異表達基因，默認1爽室，全部輸入基因汁讼。

--da 通過比例或數(shù)目選取最顯著的差異表達基因淆攻，默認0.5阔墩。如果想使用diff_fdr請設置此參數(shù)為1。

-c 設置0~1閾值瓶珊，通過p值（單尾KS檢驗）選擇最近的靶基因啸箫，默認0.001。

測試運行日志如下：

BETA basic \
  -p peaks.bed \
  -e beta_specific_expression.txt \
  -k BSF \
  -g hg19 \
  --da 500 \
  -n basic \
  -o basic_out

[15:41:26] Argument List:
[15:41:26] Name = basic
[15:41:26] Peak File = ../peaks.bed
[15:41:26] Top Peaks Number = 10000
[15:41:26] Distance = 100000 bp
[15:41:26] Genome = hg19
[15:41:26] Expression File = ../beta_specific_expression.txt
[15:41:26] BETA specific Expression Type
[15:41:26] Number of differential expressed genes = 500.0
[15:41:26] Differential expressed gene FDR Threshold = 1
[15:41:26] Up/Down Prediction Cutoff = 0.001000
[15:41:26] Function prediction based on regulatory potential
[15:41:26] Check ../peaks.bed successfully!
[15:41:26] #ID  Status  Value
 is the header of the expression file
[15:41:26] Checking the differential expression infomation...
[15:41:26] Take the first line with Differential Information as an example: NM_000014   -0.325878       0.618293

[15:41:26] Differential Expression file format successful passed
[15:41:26] You do not like filter peak by CFCT boundary, it will be filtered only by the distance
[15:41:26] Read file <../peaks.bed> OK! All <7031> peaks.
[15:41:41] Process <50801> genes
[15:41:56] Finished! Preliminary results saved into temporary file: <basic.txt>
[11815, 11643]
[15:41:56] Genes were seprated to two parts: up regulated and down regulated.
[15:41:56] Prepare file for the Up/Down Test
null device
          1
[15:42:02] Finished, Find the result in basic_function_prediction.pdf
[15:42:02] Get the Rank Product of the "upregulate" genes
[15:42:46] Get the Rank Product of the "downregulate" genes
['"upregulate"', '"downregulate"']
[15:43:30] pick out the peaks 100000bp around the selected genes
[15:43:30] Finished: Find target gene associated peaks in basic_out
total time: 0:2:4

輸出文件存儲于basic_out文件夾：

$tree
.
├── basic_downtarget_associate_peaks.bed
├── basic_downtarget.txt
├── basic_function_prediction.pdf
├── basic_function_prediction.R
├── basic_uptarget_associate_peaks.bed
└── basic_uptarget.txt

激活/抑制功能預測打分

圖片.png

上調調控靶標/下調調控靶標文件

#文件截取示例
Chroms  txStart txEnd   refseqID        rank product    Strands GeneSymbol
chr1    27992571        27998724        NM_002038       4.592e-06       -       IFI6
chr12   113376237       113411055       NM_006187       4.835e-06       +       OAS3
chr12   113344738       113357712       NM_016816       5.523e-06       +       OAS1
chr4    169277885       169401665       NM_001012967    5.616e-06       -       DDX60L

靶基因相關的peak文件

hrom   pStart  pEnd    Refseq  Symbol  Distance        Score
chr1    27971570        27971739        NM_002038       IFI6    -27070  0.205399174599
chr1    27971776        27971885        NM_002038       IFI6    -26894  0.20685028671
chr12   113364362       113364438       NM_006187       OAS3    -11837  0.377766121894
chr12   113364362       113364438       NM_016816       OAS1    19662   0.276241443606
chr4    169422368       169422769       NM_001012967    DDX60L  20903   0.262863603291

第二部分：

BETA plus：進行激活/抑制功能預測伞芹，直接靶基因預測和靶向區(qū)域motif分析忘苛，比基礎版本增加了motif分析

BETA plus \
    –p peaks.bed \
    –e expr.xls  \
    –k BSF \
    –g hg19 \
    --gs hg19.fa \
    --bl

該模塊與basic相似蝉娜，以下參數(shù)功能有差異：

--gs 基因序列文件。

--mn 指定0~1范圍內數(shù)值作為p-value或大于1作為數(shù)目來獲取motif的閾值扎唾，默認10召川。

其他參數(shù)參考BETA Basic參數(shù)描述。

測試運行日志如下：

[16:49:30] Argument List:
[16:49:30] Name = NA
[16:49:30] Peak File = ../peaks.bed
[16:49:30] Top Peaks Number = 10000
[16:49:30] Distance = 100000 bp
[16:49:30] Genome = hg19
[16:49:30] Expression File = ../beta_specific_expression.txt
[16:49:30] Genome Sequence fasta formated data = hg19.fa
[16:49:30] BETA specific Expression Type
[16:49:30] Number of differential expressed genes = 0.5
[16:49:30] Differential expressed gene FDR Threshold = 1
[16:49:30] Up/Down Prediction Cutoff = 0.001000
[16:49:30] Function prediction based on regulatory potential
[16:49:30] Check ../peaks.bed successfully!
[16:49:30] #ID  Status  Value
 is the header of the expression file
[16:49:30] Checking the differential expression infomation...
[16:49:30] Take the first line with Differential Information as an example: NM_000014   -0.325878       0.618293

[16:49:30] Differential Expression file format successful passed
[16:49:30] Read file <../peaks.bed> OK! All <7031> peaks.
[16:50:16] Process <50801> genes
[16:50:25] Finished! Preliminary results saved into temporary file: <NA.txt>
[11815, 11643]
[16:50:25] Genes were seprated to two parts: up regulated and down regulated.
cut: write error: Broken pipe
cut: write error: Broken pipe
[16:50:26] Prepare file for the Up/Down Test
null device
          1
[16:51:03] Finished, Find the result in NA_function_prediction.pdf
[16:51:03] Get the Rank Product of the "upregulate" genes
[16:51:45] Get the Rank Product of the "downregulate" genes
['"upregulate"', '"downregulate"']
[16:52:26] pick out the peaks 100000bp around the selected genes
[16:52:27] Finished: Find target gene associated peaks in plus_out
[16:52:27] get three regions of every peak around the up genes
['up_middle.bed', 'up_left.bed', 'up_right.bed']
[16:53:54] get the fasta format sequence data of the three regions of up
[16:53:55] get three regions of every peak around the down genes
['down_middle.bed', 'down_left.bed', 'down_right.bed']
[16:55:14] get the fasta format sequence data of the three regions of down
[16:55:14] get three regions of every peak around the non genes
['non_middle.bed', 'non_left.bed', 'non_right.bed']
[16:56:37] get the fasta format sequence data of the three regions of non
[16:56:37] run mis to do the known motif scan with cistrome motif database
Calculating Background..
A or T: 93234
C or G: 100766
Others: 0
Calculating Background..
A or T: 97696
C or G: 95334
Others: 0
Calculating Background..
A or T: 97264
C or G: 95766
Others: 0
Calculating Background..
A or T: 118812
C or G: 142588
Others: 0
Calculating Background..
A or T: 125071
C or G: 135022
Others: 0
Calculating Background..
A or T: 123892
C or G: 136201
Others: 0
Calculating Background..
A or T: 181589
C or G: 169211
Others: 0
Calculating Background..
A or T: 191633
C or G: 157413
Others: 0
Calculating Background..
A or T: 191487
C or G: 157559
Others: 0
[16:59:58] T statistic test performed here to do the significance testing of every motif
[17:58:18] motif result will be in html format
Success parser from table.
Success parser from table.
Success parser from table.
Success parser from table.
Success parser from table.
Success parser from table.
[17:58:22] Done: find motif result in beatmotif.html file
[17:58:22] Done: find all BETA result in plus_out
total time: 1:8:52

基本一個小時胸遇，2G內存即可完成荧呐。

同樣，plus功能生成上/下調靶標文件和靶基因相關的peak文件纸镊，此外生成了靶基因相關motif的結果文件（存放于motifresult文件夾倍阐，包含一個*motif.html的motif分析文件）。

motif分析示例截圖如下：

圖片.png

同時生成上調逗威、下調差異差異motif的文件：

表頭：MotifID Species Symbol DNA BindDom PSSM Tscore Pvalue

MS00266 Homo sapiens MLX Helix-Loop-Helix Family [[[0.534, 0.012, 0.331, 0.123], [0.01, 0.043, 0.152, 0.795], [0.011, 0.963, 0.011, 0.015], [0.955, 0.01, 0.025, 0.01], [0.01, 0.97, 0.01, 0.01], [0.01, 0.01, 0.97, 0.01], [0.01, 0.01, 0.01, 0.97], [0.01, 0.01, 0.97, 0.01], [0.8, 0.03, 0.117, 0.053], [0.066, 0.241, 0.017, 0.676]]] 5.08 4.30e-07

注： PSSM：位置特異性得分矩陣（position-specific scoring matrix）峰搪，PSSM示例格式為: [[0.2,0.3,0.3,0.2],[0.1,0.8,0.05,0.05]]； Tscore：t-test的統(tǒng)計量數(shù)值凯旭。

另外概耻，官網(wǎng)提供了motif分析結果的展示示例參考： http://cistrome.org/BETA/motifresult/betamotif.html。

這里面臨一個問題罐呼，有朋友會問一個轉錄因子靶向哪個或哪幾個基因：基于已知的知識可以檢索數(shù)據(jù)庫咐蚯，數(shù)據(jù)庫這里推薦：http://cistrome.org/db/#/或進行預測，比如本工具提供一系列的結合區(qū)域（peak）對其靶基因進行預測弄贿，但該工具未給出具體的轉錄因子和哪些靶基因存在直接的關系春锋，需要對motif的結合位置情況與peak進行關聯(lián)從而初步建立關系。

參考資料

官方使用手冊：http://cistrome.org/BETA/tutorial.html

BETA網(wǎng)頁版分析入口：http://cistrome.org/BETA/index.html#runweb

參考文獻

Wang, S., Sun, H., Ma, J., Zang, C., Wang, C., Wang, J., ... & Liu, X. S. (2013). Target analysis by integration of transcriptome and ChIP-seq data with BETA. Nature protocols, 8(12), 2502-2515.