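Introduction to Juicer Tools and preprocessing
The Juicer analysis workflow and its main modules are shown in the figure below:
Juicer consists of three main modules: Juicer Tools, Juicebox, and Straw
Juicer Tools: mainly for data analysis and feature annotation
Juicebox: mainly for Hi-C visualization
Straw: mainly for describing and extracting the data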
<u style="text-decoration: underline;">The basic file type for Juicer is the .hic file, a highly compressed binary file that stores the interaction (contact) data.</u>
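What can Juicer do?
Juicer can call A/B compartments, TADs, and loops, annotate loops, and identify motifs. It is an all-in-one package, as shown in the figure below:
So how is a .hic file generated?
We generally use the pre module of juicer_tools to generate the .hic file. The input is the HiC-Pro validPairs file (note that the validPairs format needs minor adjustment first; see hicpro2juicebox.sh).
pre validPairs format: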
Usage:
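Required arguments: infile path, outfile path, genomesize
infile: a text file that stores the interaction information. The format is as follows.
Note that fields must be separated by spaces.
Format 1: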
<readname> <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <mapq2>
Format 2:
<str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2>
str = strand (0 for forward, anything else for reverse)
chr = chromosome (must be a chromosome in the genome)
pos = position
frag = restriction site fragment
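# For other formats, see https://github.com/theaidenlab/juicer/wiki/Pre#file-format
outfile: path of the output file; note that the file name must end in .hic
genomesize: a two-column file with chromosome names and chromosome sizes
A simple usage example: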
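The adjustment from HiC-Pro validPairs to Format 1 can be sketched as a small awk wrapper. This is only a minimal sketch, not the hicpro2juicebox.sh script itself. It assumes the validPairs columns are read name, chr1, pos1, strand1, chr2, pos2, strand2, fragment size, frag1, frag2, mapq1, mapq2, and it writes dummy fragment values (0 and 1) because no restriction site file is used. The function name `validpairs_to_pre` is hypothetical. Check the Pre documentation for any sorting or chromosome-ordering requirements before feeding the result to `pre`.

```shell
# Hypothetical helper: convert HiC-Pro validPairs lines (assumed column
# layout, see text) to the long "pre" format:
# <readname> <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <mapq2>
validpairs_to_pre() {
  awk '
    {
      # strand: 0 for forward ("+"), 1 for anything else
      s1 = ($4 == "+") ? 0 : 1
      s2 = ($7 == "+") ? 0 : 1
      # frag1/frag2 are dummy values (0 and 1), an assumption of this sketch
      print $1, s1, $2, $3, 0, s2, $5, $6, 1, $11, $12
    }
  ' "$@"
}
```

Usage, assuming a gzipped input: `zcat sample.validPairs.gz | validpairs_to_pre | gzip > sample.pre.txt.gz` (file names are placeholders).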
java -Xmx10g -jar juicebox_tools.jar pre chrsvalidpair_sam1.chr10.validpairs.gz sam1.chr10.hic chrom_mm9.sizes
chrsvalidpair_sam1.chr10.validpairs.gz: [figure: preview of the file contents]
chrom_mm9.sizes:
Two columns: chromosome name and chromosome size
chr1 197195432
chr2 181748087
chr3 159599783
chr4 155630120
chr5 152537259
chr6 149517037
chr7 152524553
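If a FASTA index is available, a two-column sizes file like the one above can be derived from it. The sketch below assumes a `.fai` file produced by `samtools faidx`, whose first two columns are the sequence name and length. The function name `fai_to_chrom_sizes` and the file names are hypothetical, and the output is written tab-separated as an assumption.

```shell
# Hypothetical helper: print "<chromosome><TAB><size>" from a FASTA index.
# Assumes columns 1 and 2 of the .fai file are name and length.
fai_to_chrom_sizes() {
  awk '{ print $1 "\t" $2 }' "$1"
}
```

For example, `fai_to_chrom_sizes mm9.fa.fai > chrom_mm9.sizes`.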
A more detailed invocation:
java -Djava.io.tmpdir=/tmp -Djava.awt.headless=true -Djava.library.path=juice/lib64 -Xmx8000m -Xms5000m -jar juicer_tools.1.7.5_linux_x64_jcuda.0.8.jar pre chrsvalidpair_sam1.chr10.validpairs.gz sam1.chr10.hic chrom_mm9.sizes
Optional parameters:
-d only calculate intra-chromosomal interactions (default: false)
-f calculate by restriction fragment; requires a restriction site file
-m <int> only output entries whose read count is greater than the threshold
-q <int> filter by MAPQ score; only output reads with MAPQ score greater than or equal to q [not set]
-c <chromosome ID> only calculate one chromosome [not set]
-n do not normalize the matrices
…
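The options above go between `pre` and the positional arguments. As a minimal sketch, the helper below assembles the command line and prints it rather than running it, so the result can be inspected first. The function name `pre_cmd`, the `-Xmx10g` setting, and the example option values are assumptions for illustration. The `.hic` suffix check reflects the naming rule for outfile stated earlier.

```shell
# Hypothetical dry-run helper: print a juicer_tools "pre" command line.
# usage: pre_cmd <jar> <infile> <outfile.hic> <chrom.sizes> [pre options...]
pre_cmd() {
  jar=$1; infile=$2; outfile=$3; sizes=$4
  shift 4
  # The output file name must end in .hic
  case "$outfile" in
    *.hic) ;;
    *) echo "output file name must end in .hic: $outfile" >&2; return 1 ;;
  esac
  # Remaining arguments are passed through as pre options
  echo java -Xmx10g -jar "$jar" pre "$@" "$infile" "$outfile" "$sizes"
}
```

For example, `pre_cmd juicebox_tools.jar chrsvalidpair_sam1.chr10.validpairs.gz sam1.chr10.hic chrom_mm9.sizes -q 30 -n` prints a command that filters at MAPQ 30 and skips normalization. The value 30 is only a placeholder.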
If we chose not to normalize when running pre and have already generated the .hic file, but later want to normalize it, what should we do?
We can use the addNorm module.
Basic usage:
java -Xmx8000m -Xms5000m -jar juicer_tools.1.7.5_linux_x64_jcuda.0.8.jar addNorm sam1.chr10.hic -w 10000 -F
Parameters:
input_HiC_file: the input .hic file
-w : Smallest resolution to calculate genome-wide resolution
-F: do not normalize the matrices at restriction-fragment resolution
-d: For genome-wide normalization, include intra-chromosomal matrices; by default, inter-only matrices are used.
結(jié)果:.hic file 內(nèi)容發(fā)生了改變
java -Djava.io.tmpdir= /tmp -Djava.awt.headless=true -Djava.library.path=juice/lib64 -Xmx8000m -Xms5000m -jar juicer_tools.1.7.5_linux_x64_jcuda.0.8.jar addNorm sam1.chr10.hic -w 10000 -F
其核心代碼:
https://github.com/theaidenlab/Juicebox/tree/master/src/juicebox/tools
In addition, here is a detailed description of the normalization methods built into Juicer:
Normalization of Hi-C maps
To normalize the Hi-C maps, several methods are implemented.
Iterative Correction (IC) [1] This method normalizes the raw contact map by removing biases introduced by the experimental procedure. It is a matrix-balancing method; however, in the normalized matrix the row and column sums are not equal to one.
Knight-Ruiz Matrix Balancing (KR) [2] Knight-Ruiz (KR) matrix balancing is a fast algorithm for normalizing a symmetric matrix. The result is a doubly stochastic matrix, in which every row and every column sums to one.
Vanilla-Coverage (VC) [3] This method was first used for inter-chromosomal maps and was later applied to intra-chromosomal maps by Rao et al., 2014. It is a simple method in which each element is first divided by the sum of its row and then divided by the sum of its column.
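The Vanilla-Coverage description above can be sketched on a tiny dense matrix. The awk function below (the name `vc_normalize` is hypothetical and not part of Juicer) divides each entry by its row sum and then by its column sum, exactly as worded above. It is only an illustration under that reading; Juicer's own implementation works on the sparse data in the .hic file and may apply additional scaling.

```shell
# Hypothetical illustration of Vanilla-Coverage on a dense,
# whitespace-separated matrix read from stdin or a file:
# out[i][j] = m[i][j] / (rowsum[i] * colsum[j])
vc_normalize() {
  awk '
    {
      # accumulate row and column sums while storing the matrix
      for (j = 1; j <= NF; j++) {
        m[NR, j] = $j
        row[NR] += $j
        col[j] += $j
      }
      n = NF
    }
    END {
      for (i = 1; i <= NR; i++) {
        line = ""
        for (j = 1; j <= n; j++) {
          # empty rows or columns are left at 0 to avoid dividing by zero
          v = (row[i] > 0 && col[j] > 0) ? m[i, j] / (row[i] * col[j]) : 0
          line = line (j > 1 ? " " : "") sprintf("%.5f", v)
        }
        print line
      }
    }
  ' "$@"
}
```

For the matrix with rows `2 2` and `2 6`, the row and column sums are 4 and 8, so `printf '2 2\n2 6\n' | vc_normalize` prints `0.12500 0.06250` and `0.06250 0.09375`.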
Let's take a look at the effect of normalization:
References
[1] Imakaev et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nature Methods 9, 999–1003 (2012).
[2] Knight and Ruiz. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis 33(3), 1029–1047 (2013).
[3] Lieberman-Aiden et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).