vConTACT2病毒分類注釋

文章:Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks
中文:通過基因分享網(wǎng)絡(luò)給META中的病毒基因組做分類注釋
雜志:Nature Biotechnology
時間:2019

bitbucket: https://bitbucket.org/MAVERICLab/vcontact2/wiki/Home

安裝

conda install -n vcontact2 python=3
conda activate vcontact2
conda install -y -c bioconda vcontact2
conda install -y -c bioconda mcl blast diamond
# -y, --yes             Do not ask for confirmation.

獲取依賴cluster_one

# 下載聚類軟件,移動到conda/bin路徑 (可使用win下載代替)
wget -c http://www.paccanarolab.org/static_content/clusterone/cluster_one-1.0.jar

java -jar cluster_one-1.0.jar -h

查看數(shù)據(jù)庫:

安裝vcontact2也順便下載了數(shù)據(jù)庫

ll -alh /route/miniconda3/envs/vcontact2/lib/python3.8/site-packages/vcontact2/data

查看蛋白序列數(shù):

zcat ViralRefSeq-prokaryotes-v94.faa.gz | grep '^>' | wc -l
268145
zcat ViralRefSeq-prokaryotes-v88.faa.gz | grep '^>' | wc -l
230992
zcat ViralRefSeq-prokaryotes-v85.faa.gz | grep '^>' | wc -l
231165
zcat ViralRefSeq-prokaryotes-v201.faa.gz | grep '^>' | wc -l
363514

蛋白信息

zcat ViralRefSeq-prokaryotes-v94.faa.gz | grep '^>' | head
>NP_037662.1 terminase small subunit [Escherichia virus HK022]
>NP_037663.1 terminase large subunit [Escherichia virus HK022]
>NP_037664.1 head portal protein [Escherichia virus HK022]
>NP_037665.1 head maturation protease [Escherichia virus HK022]
>NP_037666.1 major capsid subunit precursor [Escherichia virus HK022]
>NP_037667.1 gp6 [Escherichia virus HK022]

注釋信息

less -S ViralRefSeq-prokaryotes-v94.Merged-reference.csv | head
Organism/Name,origin,order,family,subfamily,genus
Acholeplasma virus L2,RefSeq-94,,Plasmaviridae,,Plasmavirus
Acholeplasma virus MV-L51,RefSeq-94,,Inoviridae,,Plectrovirus

protein_id,contig_id,keywords

less -S ViralRefSeq-prokaryotes-v94.protein2contig.csv | head
protein_id,contig_id,keywords
NP_955551.1,Acholeplasma virus L2,envelope protein
NP_040808.1,Acholeplasma virus L2,envelope protein
NP_040809.1,Acholeplasma virus L2,hypothetical protein L2_02
NP_040810.1,Acholeplasma virus L2,hypothetical protein L2_03
NP_040811.1,Acholeplasma virus L2,hypothetical protein L2_04
NP_040812.1,Acholeplasma virus L2,hypothetical protein L2_05

vcontact2參數(shù)

vcontact2 --help

--raw-proteins FASTA-formatted proteins file
--proteins-fp A file linking the protein name and the genome names (csv or tsv)
--rel-mode {BLASTP,Diamond,MMSeqs2} 蛋白比對方法拾给,計算蛋白序列相似性
--pcs-mode {ClusterONE,MCL} 蛋白聚類方法
--vcs-mode {ClusterONE,MCL} 病毒聚類方法
--c1-bin "cluster_one-1.0.jar"的路徑
--db 參考庫
{None,
ProkaryoticViralRefSeq85-ICTV [default],
ProkaryoticViralRefSeq85-Merged,
ProkaryoticViralRefSeq88-Merged,
ProkaryoticViralRefSeq94-Merged,
ProkaryoticViralRefSeq97-Merged,
ProkaryoticViralRefSeq201-Merged,
ArchaeaViralRefSeq85-Merged,
ArchaeaViralRefSeq94-Merged,
ArchaeaViralRefSeq97-Merged,
ArchaeaViralRefSeq201-Merged}

推薦參數(shù):

vcontact --raw-proteins [proteins file] \
--rel-mode ‘Diamond’ \
--proteins-fp [gene-to-genome mapping file] \
--db 'ProkaryoticViralRefSeq94-Merged' \
--pcs-mode MCL \
--vcs-mode ClusterONE \
--c1-bin [path to ClusterONE] \
--output-dir [target output directory]

輸入數(shù)據(jù)格式:

1 prodigal獲取蛋白序列

提取中質(zhì)量ID和序列

# medium more quality sequence id
cat quality_summary.tsv | awk -F"\t" '{if($8=="Medium-quality") print $1}' > medium_more.contigs
# medium more quality sequence
for i in `cat medium_more.contigs`;
do
    cat combined.fna | grep -A 1 $i >> medium_more.fna
    echo -e "$i done..."
done

蛋白預(yù)測和翻譯

prodigal \
-a ./prodigal/out.faa \
-d ./prodigal/out.fna \
-f gff \
-g 11 \
-o ./prodigal/out.gff \
-p single \
-s ./prodigal/out.stat \
-i ./checkv/output_sop/medium_more.fna

2 準(zhǔn)備gene2genome文件

conda activate vcontact2
vcontact2_gene2genome \
--proteins out.faa \
--output out_map.csv \
--source-type Prodigal-FAA

必須使用csv后綴坠狡,否則后續(xù)分析報錯

參數(shù):
--source-type
{VIRSorter,Prodigal-coords,Prodigal-FAA, MetaGeneMark,NCBICodingSequence,NCBIFasta}

過程:

vcontact2_gene2genome:174: DeprecationWarning: 'U' mode is deprecated
  with open(results.proteins, 'rU') as proteins_fh:

結(jié)果:

3 vcontact2獲取PC和VC

vcontact2 \
--rel-mode 'Diamond' \
--pcs-mode MCL \
--vcs-mode ClusterONE \
--c1-bin /hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/ \
--db 'ProkaryoticViralRefSeq94-Merged' \
--verbose --threads 4 \
--raw-proteins ./prodigal/out.faa \
--proteins-fp ./prodigal/out_map.csv \
--output-dir ./vcontact2/

過程:

============================This is vConTACT2 0.9.19

----------------------------------Pre-Analysis

INFO:vcontact2: Found Diamond
INFO:vcontact2: Found MCL
INFO:vcontact2: Identified 4 CPUs
INFO:vcontact2: Using reference database: ProkaryoticViralRefSeq94-Merged
INFO:vcontact2: Using existing directory ./vcontact2/

------------------------------Reference databases

INFO:vcontact2: Merging ProkaryoticViralRefSeq94-Merged to user sequences...
INFO:vcontact2: Creating Diamond database and running Diamond...
INFO:vcontact2.protein_clusters: Creating Diamond database...
INFO:vcontact2.protein_clusters: Running Diamond...

-------------------------------Protein clustering

INFO:vcontact2: Loading proteins...
INFO:vcontact2: Merging ProkaryoticViralRefSeq94-Merged to user gene-to-genome mapping...
DEBUG:vcontact2: Read 268201 proteins from ./prodigal/out_map.csv.
DEBUG:vcontact2.protein_clusters: Generating abc file...
DEBUG:vcontact2.protein_clusters: Running MCL...
INFO:vcontact2: Building the cluster and profiles 
INFO:vcontact2: Saving intermediate files...

----------------------------------Loading data

DEBUG:vcontact2: Read 2617 entries from ./vcontact2/vConTACT2_contigs.csv
INFO:vcontact2: Read 232886 entries (dropped 2328 singletons)

--------------------------------Adding Taxonomy

------------------------Calculating Similarity Networks

DEBUG:vcontact2.pcprofiles: Hypergeometric contig-similarity network:
DEBUG:vcontact2.pcprofiles: 21269 PCs present in strictly more than 3 contigs
DEBUG:vcontact2.pcprofiles: Hypergeometric PCs-similarity network
DEBUG:vcontact2: Network Contigs

------------------------Contig Clustering & Affiliation

DEBUG:vcontact2.contig_clusters: 3 taxonomic levels detected: genus, order, fami
INFO:vcontact2.contig_clusters: Exporting for ClusterONE
DEBUG:vcontact2.contig_clusters: Saving network in file ./vcontact2/c1.ntw (9513ines).
INFO:vcontact2.contig_clusters: Clustering the PC Similarity-Network using ClustNE
INFO:vcontact2.contig_clusters: Running clusterONE
DEBUG:vcontact2.contig_clusters: ClusterONE results are being saved to ./vcontacc1.clusters.
INFO:vcontact2.contig_clusters: 346 clusters loaded (singletons and non-connecteodes are dropped).
INFO:vcontact2.contig_clusters: Computing membership matrix...
ERROR:vcontact2: Error in viral clusters
ERROR:vcontact2: type object 'object' has no attribute 'dtype'

Traceback (most recent call last):
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/bin/vcontact2", line 637, in main
    vc = vcontact2.cluster_refinements.ViralClusters(gc.contigs, profiles_fp, optimize=options.optimize)
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/vcontact2/cluster_refinements.py", line 37, in __in
    self.metrics = pd.DataFrame(columns=summary_headers)
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/pandas/core/frame.py", line 411, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 242, i
    val = construct_1d_arraylike_from_scalar(np.nan, len(index), nan_dtype)
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1221, in construc
    dtype = dtype.dtype
AttributeError: type object 'object' has no attribute 'dtype'

debug round 1

更新pandas泞莉,自動downgrade vcontact2梯啤,結(jié)果依然bug

conda install pandas=1.2.3

Downloading and Extracting Packages
certifi-2021.10.8    | 145 KB    | ####################################### | 100%
pandas-1.2.3         | 12.1 MB   | ####################################### | 100%
vcontact2-0.9.15     | 98.0 MB   | ####################################### | 100%
openssl-3.0.0        | 2.9 MB    | ####################################### | 100%
ca-certificates-2021 | 139 KB    | ####################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

# 運行
# vConTACT2 0.9.13
ERROR:vcontact2: Error in contig clustering
ERROR:vcontact2: 'DataFrame' object has no attribute 'ix'
AttributeError: 'DataFrame' object has no attribute 'ix'

debug round 2

此次vcontact2版本依然是0.9.15嫩实,沒有實現(xiàn)更新堕汞,僅更新了numpy等依賴罪帖。

conda update vcontact2

Downloading and Extracting Packages
numpy-1.21.2         | 6.2 MB    | ####################################### | 100%

conda list

vcontact2                 0.9.15                     py_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
pandas                    1.2.3            py38h51da96c_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
numpy                     1.21.2           py38he2449b9_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge

ERROR:vcontact2: Error in contig clustering
ERROR:vcontact2: 'DataFrame' object has no attribute 'ix'
AttributeError: 'DataFrame' object has no attribute 'ix'

debug round 3

conda list
pandas                    0.25.3
numpy                     1.21.2
conda install numpy=1.19.5

------------------------Contig Clustering & Affiliation-------------------------
DEBUG:vcontact2.contig_clusters: 3 taxonomic levels detected: genus, order, family
INFO:vcontact2.contig_clusters: Exporting for ClusterONE
DEBUG:vcontact2.contig_clusters: Network file already exist.
INFO:vcontact2.contig_clusters: Clustering the PC Similarity-Network using ClusterONE
DEBUG:vcontact2.contig_clusters: ClusterONE file ./vcontact2/c1.clusters already exist.
INFO:vcontact2.contig_clusters: 346 clusters loaded (singletons and non-connected nodes are dropped).
INFO:vcontact2.contig_clusters: Computing membership matrix...
DEBUG:vcontact2.cluster_refinements: 3 taxonomic levels detected: genus, order, family
INFO:vcontact2.cluster_refinements: Optimizing on distance: 9
INFO:vcontact2.evaluations: Performance evaluations at the genus level...
INFO:vcontact2.cluster_refinements: Identified a single best composite score 2.761417223726828 for distance 9
INFO:vcontact2.cluster_refinements: Merging optimal distance determined from performance evaluations.
DEBUG:vcontact2.evaluations: 3 taxonomic levels detected: order, family, genus
INFO:vcontact2.evaluations: Performance evaluations at the order level...
INFO:vcontact2.evaluations: Performance evaluations at the family level...
INFO:vcontact2.evaluations: Performance evaluations at the genus level...
INFO:vcontact2.cluster_refinements:              PPV  Sensitivity  Accuracy
order   1.000000     0.351764  0.593097
family  0.994381     0.630965  0.792098
genus   0.869642     0.972256  0.919519

--------------------------------Protein modules---------------------------------
DEBUG:vcontact2.modules: Filtered 0 edges according to the sig. threshold 1.0.
INFO:vcontact2.modules: Exporting the PC-network for MCL
DEBUG:vcontact2.modules: Saving network in file ./vcontact2/modules.ntwk (2292198 lines)
INFO:vcontact2.modules: Clustering the PC similarity-network
DEBUG:vcontact2.modules: MCL(5.0) results are saved in ./vcontact2/modules_mcl_5.0.clusters.
INFO:vcontact2.modules: Loading the clustering results
DEBUG:vcontact2.modules: Saving 622 modules containing 18958  protein clusters in ./vcontact2/modules_mcl_5.0_modules.pandas.
---------------------------Link modules and clusters----------------------------
INFO:vcontact2.modules: 2844 contigs-modules owning association, 50018 filtered (a contig must have 50% of the PCs to own a module).
INFO:vcontact2.modules: Linking 622 modules with 346 contigs clusters...
INFO:vcontact2.modules: Network done 346 clusters, 622 modules and 297 edges.


----------------------------Exporting results files-----------------------------
INFO:vcontact2: Identifying genomes that are not clustered (i.e. singletons, outliers and overlaps
There were 540 genomes (including refs) that were singleton, outlier or overlaps.
INFO:vcontact2: Building final summary table
INFO:vcontact2.exports.summaries: Reading edges for 2617 contigs
INFO:vcontact2.exports.summaries: Building PC array
INFO:vcontact2.exports.summaries: Calculating comparisons for back-calculations
...
INFO:vcontact2.exports.summaries: Writing viral cluster overview file...
INFO:vcontact2.exports.summaries: Examining each viral cluster and breaking it down into individual genomes...
INFO:vcontact2.exports.summaries: Writing the genome-by-genome overview file...

yesssssssssssssssss

4 結(jié)果

更多:
Supplementing and Colouring vConTACT2 Clusters
Applying vContact to Viral Sequences and Visualizing the Output
https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/
(2017). vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria. PeerJ
Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method. Computational and Structural Biotechnology Journal. 2020

https://github.com/pandas-dev/pandas/issues/39520

if you are stuck on pandas==0.24.2 (don't ask); downgrading to numpy==1.19.5 works
THANK YOU
also works with pandas==0.25.3

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個濱河市迫吐,隨后出現(xiàn)的幾起案子库菲,更是在濱河造成了極大的恐慌,老刑警劉巖志膀,帶你破解...
    沈念sama閱讀 218,941評論 6 508
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件熙宇,死亡現(xiàn)場離奇詭異鳖擒,居然都是意外死亡,警方通過查閱死者的電腦和手機(jī)烫止,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 93,397評論 3 395
  • 文/潘曉璐 我一進(jìn)店門蒋荚,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人馆蠕,你說我怎么就攤上這事期升。” “怎么了互躬?”我有些...
    開封第一講書人閱讀 165,345評論 0 356
  • 文/不壞的土叔 我叫張陵播赁,是天一觀的道長。 經(jīng)常有香客問我吼渡,道長容为,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 58,851評論 1 295
  • 正文 為了忘掉前任寺酪,我火速辦了婚禮坎背,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘房维。我一直安慰自己沼瘫,他們只是感情好,可當(dāng)我...
    茶點故事閱讀 67,868評論 6 392
  • 文/花漫 我一把揭開白布咙俩。 她就那樣靜靜地躺著,像睡著了一般湿故。 火紅的嫁衣襯著肌膚如雪阿趁。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 51,688評論 1 305
  • 那天坛猪,我揣著相機(jī)與錄音脖阵,去河邊找鬼。 笑死墅茉,一個胖子當(dāng)著我的面吹牛命黔,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播就斤,決...
    沈念sama閱讀 40,414評論 3 418
  • 文/蒼蘭香墨 我猛地睜開眼悍募,長吁一口氣:“原來是場噩夢啊……” “哼!你這毒婦竟也來了洋机?” 一聲冷哼從身側(cè)響起坠宴,我...
    開封第一講書人閱讀 39,319評論 0 276
  • 序言:老撾萬榮一對情侶失蹤,失蹤者是張志新(化名)和其女友劉穎绷旗,沒想到半個月后喜鼓,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體副砍,經(jīng)...
    沈念sama閱讀 45,775評論 1 315
  • 正文 獨居荒郊野嶺守林人離奇死亡,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 37,945評論 3 336
  • 正文 我和宋清朗相戀三年庄岖,在試婚紗的時候發(fā)現(xiàn)自己被綠了豁翎。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點故事閱讀 40,096評論 1 350
  • 序言:一個原本活蹦亂跳的男人離奇死亡隅忿,死狀恐怖谨垃,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情硼控,我是刑警寧澤刘陶,帶...
    沈念sama閱讀 35,789評論 5 346
  • 正文 年R本政府宣布,位于F島的核電站牢撼,受9級特大地震影響匙隔,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜熏版,卻給世界環(huán)境...
    茶點故事閱讀 41,437評論 3 331
  • 文/蒙蒙 一纷责、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧撼短,春花似錦再膳、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,993評論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽。三九已至禾嫉,卻和暖如春灾杰,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背熙参。 一陣腳步聲響...
    開封第一講書人閱讀 33,107評論 1 271
  • 我被黑心中介騙來泰國打工艳吠, 沒想到剛下飛機(jī)就差點兒被人妖公主榨干…… 1. 我叫王不留,地道東北人孽椰。 一個月前我還...
    沈念sama閱讀 48,308評論 3 372
  • 正文 我出身青樓昭娩,卻偏偏與公主長得像,于是被迫代替她去往敵國和親黍匾。 傳聞我的和親對象是個殘疾皇子栏渺,可洞房花燭夜當(dāng)晚...
    茶點故事閱讀 45,037評論 2 355

推薦閱讀更多精彩內(nèi)容