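Software name: GEMINI
Version: 0.20.2-dev
1. Overview of software purpose
GEMINI (GEnome MINIng) is a software package for mining genomic variation. Because it relies on extensive annotation files, it is applicable only to human genome analysis. It integrates genetic variants, phenotypes, genotypes, and annotations into a SQLite database, on which a wide variety of analyses can then be run. Its range of application is broad: pedigree analysis (de novo mutations, autosomal dominant mutations, autosomal recessive mutations), population analysis, and paired-sample tumor analysis.
Website: http://gemini.readthedocs.io/en/latest/content/installation.html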
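2. Analysis principle
The upstream input is a VCF file, optionally accompanied by a PED file describing the samples. The software integrates genetic variants, phenotypes, genotypes, and annotations into a SQLite database and runs its analyses on that database. It bundles many annotation sources, such as ENCODE tracks, UCSC tracks, OMIM, dbSNP, KEGG, and HPRD, so annotation is built in.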
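To make the SQLite-based design concrete, here is a minimal sketch that builds a toy `variants` table and queries it with the `sqlite3` command-line shell. The table is a hypothetical stand-in, not the real GEMINI schema: only the column names chrom, start, end, ref, and alt are borrowed from the query example in this document, while the function names and the two rows are made up for illustration. It assumes `sqlite3` is installed.

```shell
# Build a toy database (hypothetical schema, NOT the real GEMINI one).
make_toy_db() {
    # $1: path of the SQLite file to create
    sqlite3 "$1" <<'SQL'
CREATE TABLE variants (chrom TEXT, start INTEGER, "end" INTEGER, ref TEXT, alt TEXT);
INSERT INTO variants VALUES ('chr1', 100, 101, 'A', 'G');
INSERT INTO variants VALUES ('chr2', 200, 201, 'C', 'T');
SQL
}

# Once all variants sit in one table, an analysis is an SQL query.
variants_on_chrom() {
    # $1: database file, $2: chromosome name
    sqlite3 -list -noheader "$1" \
        "SELECT chrom, start, ref, alt FROM variants WHERE chrom = '$2';"
}
```

For example, `make_toy_db toy.db` followed by `variants_on_chrom toy.db chr2` lists the made-up chr2 row.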
3. Implementation
3.1 Usage examples
1) Software installation:
wget https://github.com/arq5x/gemini/raw/master/gemini/scripts/gemini_install.py
python gemini_install.py $tools $data
PATH=$tools/bin:$data/anaconda/bin:$PATH
gemini update --dataonly --extra cadd_score
gemini update --dataonly --extra gerp_bp
# $tools is the software installation path; $data is the path where the software's databases are stored.
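2) Pre-analysis preparation:
GEMINI takes a VCF file as upstream input, optionally with a PED file. Versions after 0.12.2 require the VCF to be preprocessed: variant sites with more than two alleles are decomposed, and the file is normalized with the vt toolkit. GEMINI applies the same processing to the database files it uses to annotate the VCF. The steps are: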
- If working with GATK VCFs, you need to correct the AD INFO tag definition to play nicely with vt.
- Decompose the original VCF such that variants with multiple alleles are expanded into distinct variant records; one record for each REF/ALT combination.
- Normalize the decomposed VCF so that variants are left aligned and represented using the most parsimonious alleles.
- Annotate with VEP or snpEff.
- bgzip and tabix.
The workflow is as follows:
# setup
VCF=/path/to/my.vcf
NORMVCF=/path/to/my.norm.vcf.gz
REF=/path/to/human.b37.fasta
SNPEFFJAR=/path/to/snpEff.jar
# decompose, normalize and annotate VCF with snpEff.
# NOTE: can also swap snpEff with VEP
zless $VCF \
| sed 's/ID=AD,Number=./ID=AD,Number=R/' \
| vt decompose -s - \
| vt normalize -r $REF - \
| java -Xmx4G -jar $SNPEFFJAR GRCh37.75 \
| bgzip -c > $NORMVCF
tabix -p vcf $NORMVCF
# load the pre-processed VCF into GEMINI
gemini load --cores 3 -t snpEff -v $NORMVCF $db
# query away
gemini query -q "select chrom, start, end, ref, alt, (gts).(*) from variants" \
--gt-filter "gt_types.mom == HET and \
gt_types.dad == HET and \
gt_types.kid == HOM_ALT" \
$db
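The `sed` step in the workflow is the least self-explanatory one, so here it is in isolation as a small sketch. The function name `fix_ad_header` is made up for illustration. The pattern is the one from the workflow except that the dot is escaped, so that only a literal `Number=.` is rewritten.

```shell
# Rewrite the AD header definition from Number=. to Number=R so that
# vt can split the per-allele depths when it decomposes a site.
# Reads VCF text on stdin and writes the corrected text to stdout.
fix_ad_header() {
    sed 's/ID=AD,Number=\./ID=AD,Number=R/'
}

# Demonstration on a single AD header line of the kind GATK writes.
printf '%s\n' '##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths">' \
    | fix_ad_header
```

Lines that do not contain `ID=AD,Number=.` pass through unchanged, so the filter can sit in front of `vt decompose` for any VCF.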
3) Usage example
Load the VCF to be analyzed into the database:
gemini load -v snp.filter.vcf --cores 8 test.db
ROH (runs of homozygosity) analysis:
gemini roh --min-snps 50 --min-gt-depth 20 --min-size 1000000 -s S138 test.db
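3.2 Program description
The program accepts input in VCF format (single-sample or population), optionally with a PED file. It can use annotations from VEP or snpEff, so it accepts both unannotated and annotated files. The main parameters are:
-v the VCF to analyze
--cores number of cores used when loading the VCF
roh ROH analysis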
3.3 Detailed description of software parameters
gemini roh --min-snps 50    minimum number of SNPs an ROH must contain
--min-gt-depth 20    minimum genotype depth for the sample
--min-size 1000000    minimum ROH segment length, in bp
-s S138    sample name
roh_run.db    name of the database created by loading the VCF
3.4 Results and explanation
chrom start end sample num_of_snps density_per_kb run_length_in_bp
chr2 233336080 234631638 S138 2583 1.9953 1295558
chr2 238341281 239522281 S138 2899 2.4555 1181000
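Note: the results are printed to the screen, interspersed with log messages. The columns are:
- chrom: chromosome
- start: start position of the ROH on the chromosome
- end: end position of the ROH on the chromosome
- sample: sample name
- num_of_snps: number of SNPs within the ROH
- density_per_kb: SNP density per kb
- run_length_in_bp: length of the ROH in bp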
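Because the result rows arrive mixed with log messages, a small filter helps when saving them. The sketch below keeps the header and any line that has the seven columns described above with numeric start, end, and SNP-count fields. The function name `filter_roh_rows` is made up, and the filter assumes that log lines never have that exact shape, which has not been checked against every GEMINI log message.

```shell
# Keep only the ROH table (header + data rows) from mixed screen output.
# Reads the raw output on stdin and writes the cleaned table to stdout.
filter_roh_rows() {
    awk '
        # header line: seven fields, first one is the literal "chrom"
        NF == 7 && $1 == "chrom" { print; next }
        # data row: seven fields, integer start, end and num_of_snps
        NF == 7 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ { print }
    '
}
```

A possible use is `gemini roh --min-snps 50 --min-gt-depth 20 --min-size 1000000 -s S138 test.db | filter_roh_rows > roh.tsv`, where `roh.tsv` is an arbitrary output name.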
4. Notes
a) GEMINI solely supports human genetic variation mapped to build 37 (aka hg19) of the human genome.
b) GEMINI is very strict about adherence to VCF format 4.1.
c) For best performance, load and query GEMINI databases on the fastest hard drive to which you have access.
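d) Installation requires downloading databases: about 15 GB of databases come with the installation, and two more must be downloaded separately: CADD (39 GB) and GERP (7 GB).
e) Annotation requires VEP or snpEff.
f) The input VCF must strictly conform to VCF 4.1.
g) Before analysis, the VCF must be loaded into the SQLite database; loading a 723 MB VCF took 16 h!
h) Installation dependencies: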
- Python 2.7.x
- git
- wget
- a working C / C++ compiler such as gcc
- zlib (including headers)
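Before running the installer, it can save time to confirm that the command-line dependencies are on the PATH. The following is a minimal sketch; the function name `missing_tools` is made up, and it checks only that each command exists. It does not verify the Python version or the presence of the zlib headers.

```shell
# Print the name of every tool in the argument list that is not on PATH.
# Prints nothing when all tools are found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For the list above, `missing_tools python git wget gcc` prints one line per tool that still needs to be installed.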
5. References
Paila U, Chapman BA, Kirchner R, Quinlan AR (2013) GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations. PLoS Comput Biol 9(7): e1003153. doi:10.1371/journal.pcbi.1003153