導讀
第一次做宏基因組還是在2018年诞帐,現(xiàn)在2021很多軟件數(shù)據(jù)庫大更新欣尼,這里搭建目前比較受認可的流程,即用Kneaddata(trimmomatic bowtie2) kraken2 humann3進行宏基因組數(shù)據(jù)基礎(chǔ)分析景埃。
一媒至、質(zhì)控 - KneadData
kneaddata
GITHUB: https://github.com/biobakery/kneaddata
kneaddata軟件獲取
conda install kneaddata
參考:Kneaddata, Fastqc, Trimmomatic, Bowtie2下載,安裝谷徙,使用
kneaddata數(shù)據(jù)處理 - 腳本
#!/usr/bin/env bash
# 幫助文檔
helpdoc(){
cat <<EOF
Description:
This is a help document
- Plot circos
Usage:
$0 -i <input file> -o <output file>
Option:
-i this is a input file/fold
-o this is a output file/fold
EOF
}
# 參數(shù)傳遞
while getopts ":i:o:" opt
do
case $opt in
i)
infile=`echo $OPTARG`
;;
o)
outfile=`echo $OPTARG`
;;
?)
echo "未知參數(shù)"
exit 1;;
esac
done
# 若無指定任何參數(shù)則輸出幫助文檔
if [ $# = 0 ]
then
helpdoc
exit 1
fi
# 核心代碼
source /hwfssz1/ST_META/PN/hutongyuan/software/miniconda3/etc/profile.d/conda.sh
conda activate r403
mkdir Result/kneaddata/$infile
kneaddata \
-i rawdata/${infile}_1.fq.gz \
-i rawdata/${infile}_2.fq.gz \
-o Result/kneaddata/$infile/ \
-db /hwfssz1/ST_META/PN/hutongyuan/database/hg38/ \
--trimmomatic /hwfssz1/ST_META/PN/hutongyuan/software/miniconda3/envs/r403/share/trimmomatic/ \
-t 64 \
--trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2 /hwfssz1/ST_META/PN/hutongyuan/software/miniconda3/envs/r403/bin/ \
--bowtie2-options "--very-sensitive --dovetail" \
--remove-intermediate-output
kneaddata數(shù)據(jù)處理 - 運行
mkdir rawdata
cd rawdata
# 軟連接原始數(shù)據(jù)
ln -s /[route]/E100011007_L01_501_1.fq.gz 501_1.fq.gz
ln -s /[route]/E100011007_L01_501_2.fq.gz 501_2.fq.gz
cd ../
mkdir Result
mkdir Result/kneaddata
# qsub提交
qsub -cwd -l vf=100G,p=64 -q st.q -P PID -binding linear:8 q_kneaddata.sh -i 502
qsub -cwd -l vf=100G,p=64 -q st.q -P PID -binding linear:8 q_kneaddata.sh -i 502
qsub -cwd -l vf=100G,p=64 -q st.q -P PID -binding linear:8 q_kneaddata.sh -i 503
kneaddata數(shù)據(jù)處理 - 過程
Running Trimmomatic ...
Total reads after trimming ( /route/Result/kneaddata/502/502_1_kneaddata.trimmed.1.fastq ): 184963001
Total reads after trimming ( /route/Result/kneaddata/502/502_1_kneaddata.trimmed.2.fastq ): 184963001
Total reads after trimming ( /route/Result/kneaddata/502/502_1_kneaddata.trimmed.single.1.fastq ): 9683036
Total reads after trimming ( /route/Result/kneaddata/502/502_1_kneaddata.trimmed.single.2.fastq ): 15706257
Decontaminating ...
kneaddata數(shù)據(jù)處理 - 結(jié)果
運行中已舍棄trim文件拒啰、比對文件,再舍棄污染序列完慧,舍棄unpaired序列谋旦。
參考:
KneadData質(zhì)控、SortMeRNA去rRNA
Quality Control
二屈尼、物種分類 - Kraken2
kraken2
官網(wǎng):https://ccb.jhu.edu/software/kraken2/
GITHUB:https://github.com/DerrickWood/kraken2/wiki
MANUAL:https://github.com/DerrickWood/kraken2/wiki/Manual
DATABASE: https://benlangmead.github.io/aws-indexes/k2
OLD_DB:http://ccb.jhu.edu/software/kraken2/downloads.shtml
KRAKEN2 2.1.1 (newest version):https://github.com/DerrickWood/kraken2
KRAKEN1/2文章:
標題:Improved metagenomic analysis with Kraken 2
譯題:kraken2優(yōu)化宏基因組分析
期刊:Genome Biology
時間:2019
標題:Kraken: ultrafast metagenomic sequence classification using exact alignments
譯題:利用精確比對進行超快的宏基因組序列鑒定
期刊:Genome Biology
時間:2014
kraken2軟件獲取
# 激活
source /hwfssz1/ST_META/PN/hutongyuan/software/miniconda3/etc/profile.d/conda.sh
# 新建
conda create -n kraken2
# 啟動
conda activate kraken2
# 安裝
conda install kraken2 -c bioconda
kraken2 --help
kraken2 --version # version 2.0.7-beta
# 升級
conda update kraken2 -c bioconda
kraken2 --version # All requested packages already installed.
kraken2是一個perl程序
kraken2數(shù)據(jù)庫獲取
DATABASE: https://benlangmead.github.io/aws-indexes/k2
網(wǎng)頁數(shù)據(jù)鏈接:
kraken2-build下載選項:
試圖下載標準庫 + PFP:
archaea, bacteria, viral, plasmid, human, UniVec_Core, protozoa, fungi, plant
古菌册着,細菌,病毒脾歧,質(zhì)粒甲捏,人,載體鞭执,原生生物司顿,真菌,植物的基因組
# 下載PLUSPFP
nohup wget -c https://genome-idx.s3.amazonaws.com/kraken/k2_pluspfp_20210127.tar.gz &
成功
kraken2數(shù)據(jù)處理 - 腳本
#!/usr/bin/env bash
# 幫助文檔
helpdoc(){
cat <<EOF
Description:
This is a help document
- Plot circos
Usage:
$0 -i <input file> -o <output file>
Option:
-i this is a input file/fold
-o this is a output file/fold
EOF
}
# 參數(shù)傳遞
while getopts ":i:o:" opt
do
case $opt in
i)
infile=`echo $OPTARG`
;;
o)
outfile=`echo $OPTARG`
;;
?)
echo "未知參數(shù)"
exit 1;;
esac
done
# 若無指定任何參數(shù)則輸出幫助文檔
if [ $# = 0 ]
then
helpdoc
exit 1
fi
# 核心代碼
source /hwfssz1/ST_META/PN/hutongyuan/software/miniconda3/etc/profile.d/conda.sh
conda activate kraken2
db_kraken2="/hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/database/kraken2"
mkdir Result/kraken2/$infile
kraken2 --db $db_kraken2 \
--threads 64 \
#--output Result/kraken2/$infile/${infile}_out.tsv \
--report Result/kraken2/$infile/${infile}_report.tsv \
--use-names \
--report-zero-counts \
--paired Result/kneaddata/$infile/${infile}_1_kneaddata_paired_1.fastq Result/kneaddata/$infile/${infile}_1_kneaddata_paired_2.fastq
kraken2數(shù)據(jù)處理 - 運行
qsub -cwd -l vf=100G,p=64 -q st.q -P P18 -binding linear:8 script/q_kraken2.sh -i 501
qsub -cwd -l vf=100G,p=64 -q st.q -P P18 -binding linear:8 script/q_kraken2.sh -i 502
qsub -cwd -l vf=100G,p=64 -q st.q -P P18 -binding linear:8 script/q_kraken2.sh -i 503
kraken2數(shù)據(jù)處理 - 過程
Loading database information... done.
59865565 sequences (11707.39 Mbp) processed in 697.071s (5152.9 Kseq/m, 1007.71 Mbp/m).
53282884 sequences classified (89.00%)
6582681 sequences unclassified (11.00%)
kraken2數(shù)據(jù)處理 -結(jié)果
out file
C E100011007L1C001R00300000012 Prevotella sp. oral taxon 299 str. F0039 (taxid 575
C E100011007L1C001R00300000018 Streptococcus sanguinis (taxid 1305) 100|100 0:6
C E100011007L1C001R00300000528 Streptococcus pneumoniae (taxid 1313) 100|100 0:1
C E100011007L1C001R00300000539 Streptococcus pneumoniae (taxid 1313) 100|100 0:2
C E100011007L1C001R00300002413 Streptococcus (taxid 1301) 63|100 0:15 1301:1
report file
11.00 6582681 6582681 U 0 unclassified
89.00 53282884 19056 R 1 root
88.91 53228791 72548 R1 131567 cellular organisms
87.30 52262033 397920 D 2 Bacteria
73.21 43825487 145121 D1 1783272 Terrabacteria group
68.81 41192272 92634 P 1239 Firmicutes
65.68 39320728 138891 C 91061 Bacilli
62.23 37256687 72766 O 186826 Lactobacillales
60.86 36431407 18819 F 1300 Streptococcaceae
60.81 36401562 11156413 G 1301 Streptococcus
report file explain
1 Percentage of fragments covered by the clade rooted at this taxon
2 Number of fragments covered by the clade rooted at this taxon
3 Number of fragments assigned directly to this taxon
4 A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., "G2" is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.
5 NCBI taxonomic ID number
6 Indented scientific name
舉例說明:這里的A = B兄纺,第二列是cover大溜,第三列是direct
多樣本結(jié)果數(shù)據(jù)合并-依賴和腳本
腳本:Kraken2-output-manipulation
依賴:
conda create -n python3.7 python=3.7 -c bioconda
conda activate python3.7
python3
import numpy
import scipy
import argparse
import pandas
import collection
python3 -m pip install numpy
python3 -m pip install collection
vi kraken_multiple.py # copy code
vi kraken_multiple_taxa.py # copy code
python /hutongyuan/analysis/oral/script/kraken_multiple.py --help
python /hutongyuan/analysis/oral/script/kraken_multiple_taxa.py --help
參數(shù):
optional arguments:
-h, --help show this help message and exit
-d DIRECTORY Enter a directory with kraken summary reports
-r {U,R,D,K,P,C,O,F,G,S} Enter a rank code
-c {1,2,3,4,5,6} Enter the column number in the report you would like to include in the output
-o OUTPUT Enter the output file name
col1: Percentage of fragments covered by the clade rooted at this taxon
col2: Number of fragments covered by the clade rooted at this taxon
col3: Number of fragments assigned directly to this taxon
col4: A rank code
col5: NCBI taxonomic ID number (this doesnt make sense to report for every column but its possible)
col6: Indented scientific name
合并數(shù)據(jù)
# 種水平
python /script/kraken_multiple_taxa.py \
-d kraken2_report -r S -c 2 -o kraken2_species_tax.tsv &
# 屬水平
python /script/kraken_multiple_taxa.py \
-d kraken2_report -r G -c 2 -o kraken2_genus_tax.tsv &
# 門水平
python /script/kraken_multiple_taxa.py \
-d kraken2_report -r P -c 2 -o kraken2_phylum_tax.tsv &
## 修改格式
cat kraken2_report_merge_raw/kraken2_species_tax.tsv | sed -e "s/\[//g;s/\]//g;s/'//g;s|\t|,|g" | sed -e "s/kraken2_report\///g;s/_report.tsv//g" > kraken2_report_merge_clean/kraken2_species_tax.tsv
cat kraken2_report_merge_raw/kraken2_genus_tax.tsv | sed -e "s/\[//g;s/\]//g;s/'//g;s|\t|,|g" | sed -e "s/kraken2_report\///g;s/_report.tsv//g" > kraken2_report_merge_clean/kraken2_genus_tax.tsv
cat kraken2_report_merge_raw/kraken2_phylum_tax.tsv | sed -e "s/\[//g;s/\]//g;s/'//g;s|\t|,|g" | sed -e "s/kraken2_report\///g;s/_report.tsv//g" > kraken2_report_merge_clean/kraken2_phylum_tax.tsv
結(jié)果
cat kraken2_phylum_tax.tsv | wc -l
74
cat kraken2_genus_tax.tsv | wc -l
3055
cat kraken2_species_tax.tsv | wc -l
17092
注釋結(jié)果包含所有分類,所以該kraken數(shù)據(jù)庫種有:74個門估脆,3055個屬钦奋,17092個種。舉例疙赠,門水平注釋付材,合并,格式轉(zhuǎn)換結(jié)果如下:
更多KRAKEN2:
Kraken2:宏基因組快速物種注釋神器
Kraken2+Bracken(Kraken2聯(lián)合Bracken宏基因組物種注釋)
Kraken拓展工具KrakenTools
bracken
官網(wǎng):https://ccb.jhu.edu/software/bracken/index.shtml
文章:Bracken: estimating species abundance in metagenomics data. peerJ Computer Science. 2017
bracken軟件獲取
conda create -n bracken
conda activate bracken
conda install bracken -c bioconda
bracken使用
source /your_route/miniconda3/etc/profile.d/conda.sh
conda activate bracken
mkdir ./kraken_quant/bracken
for i in `ls ./kraken_quant/$infile/`; do
base=${i%_report.txt}
bracken \
-d /public/home/zzumgg03/huty/projects/oral/kraken_database \
-i ./kraken_quant/$infile/$i \
-t 30 \
-o ./kraken_quant/bracken/${base}_report_bracken.txt \
-w ./kraken_quant/bracken/${base}_report_bracken.report
done
參數(shù):
-d kraken2數(shù)據(jù)庫地址
-i∑匝簟kraken2∩∽狻report文件
-t 線程數(shù)
-o 輸出結(jié)果
-w 輸出報告
bracken輸出文件
1 sample_report_bracken.txt
name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads
s__4 217 S 489584 44999 534583 0.02665
s__159 154 S 392118 123430 515548 0.02570
s__6 239 S 339437 143082 482519 0.02405
2 sample_report_bracken.report
100.00 20062259 0 R 1 root
100.00 20062259 0 R1 2 d__Bacteria
36.07 7237346 0 P 5 p__Firmicutes
36.07 7237346 0 C 11 c__Bacilli
32.20 6459679 0 O 27 o__Lactobacillales
29.05 5828230 0 F 55 f__Streptococcaceae
29.05 5827410 0 G 86 g__Streptococcus
2.66 534583 534583 S 217 s__4
2.57 515548 515548 S 154 s__159
3 過程文件
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i ./kraken_quant/merge/ERR589349.1_report.txt -o ./kraken_quant/bracken/ERR589349.1_report_bracken.txt -k /public/home/zzumgg03/huty/projects/oral/kraken_database_195/database100mers.kmer_distrib -l S -t 30
>> Checking report file: ./kraken_quant/merge/ERR589349.1_report.txt
PROGRAM START TIME: 04-26-2022 07:14:14
BRACKEN SUMMARY (Kraken report: ./kraken_quant/merge/ERR589349.1_report.txt)
>>> Threshold: 30
>>> Number of species in sample: 195
>> Number of species with reads > threshold: 192
>> Number of species with reads < threshold: 3
>>> Total reads in sample: 7570278
>> Total reads kept at species level (reads > threshold): 4931353
>> Total reads discarded (species reads < threshold): 46
>> Reads distributed: 5533145
>> Reads not distributed (eg. no species above threshold): 68
>> Unclassified reads: 2037065
BRACKEN OUTPUT PRODUCED: ./kraken_quant/bracken/ERR589349.1_report_bracken.txt
PROGRAM END TIME: 04-26-2022 07:14:14
Bracken complete.
三、功能注釋 - HUMAnN3
HUMAnN3
官網(wǎng):https://huttenhower.sph.harvard.edu/humann
GITHUB MANUAL: https://github.com/biobakery/humann
humann3軟件獲取
conda create -n humann3
conda activate humann3
conda install python=3.7
失敗
conda create -n humann3 python=3.7
conda activate humann3
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --add channels biobakery
conda config --show
# 安裝humann
conda install humann -c biobakery
humann --version # humann v3.0.0.alpha.3
humann3 --version # humann v3.0.0.alpha.3
metaphlan --version # MetaPhlAn version 3.0.13 (27 Jul 2021)
成功
bowtie2 --help報錯限佩,直接影響humann能否正常工作
miniconda3/envs/humann3/bin/bowtie2-align-s:
symbol lookup error: miniconda3/envs/humann3/bin/bowtie2-align-s:
undefined symbol: _ZN3tbb10interface78internal15task_arena_base19internal_initializeEv
(ERR): Description of arguments failed!
Exiting now ...
bowtie2-build-s: symbol lookup error, undefined symbol
conda安裝bowtie2的報錯:undefined symbol
一個成功的偏方
# bowtie2 error
# https://blog.csdn.net/u012110870/article/details/115166861
# 先啟動metawrap, 再啟動r403搞定
在另一臺服務(wù)器中葵诈,metawrap安裝不了裸弦,利用metawrap空環(huán)境python2.7僅安裝bowtie2也報錯。bowtie2默認會同時安裝perl tbb=2021作喘,下面試試安裝tbb=2020理疙,然后我成功了,嘎嘎嘎嘎泞坦。
conda create -y -n metawrap-env python=2.7
conda activate metawrap-env
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels ursky
# conda install biopython blas=2.5 blast=2.6.0 bmtagger bowtie2 bwa checkm-genome fastqc kraken=1.1 kraken=2.0 krona=2.7 matplotlib maxbin2 megahit metabat2 pandas prokka quast r-ggplot2 r-recommended salmon samtools=1.9 seaborn spades trim-galore
# 失敗
conda install bowtie2
bowtie2 --help # 失敗
conda install tbb=2020.2
bowtie2 --help # 成功
humann3數(shù)據(jù)庫獲取
http://huttenhower.sph.harvard.edu/humann_data/
物種庫:
舊http://huttenhower.sph.harvard.edu/humann2_data/chocophlan/
新http://huttenhower.sph.harvard.edu/humann_data/chocophlan/
功能庫:
舊http://huttenhower.sph.harvard.edu/humann2_data/uniprot/uniref_annotated/
新http://huttenhower.sph.harvard.edu/humann_data/
注釋庫:
舊http://huttenhower.sph.harvard.edu/humann2_data/
新http://huttenhower.sph.harvard.edu/humann_data/
wget -c http://huttenhower.sph.harvard.edu/humann2_data/chocophlan/full_chocophlan.v296_201901.tar.gz
wget -c http://huttenhower.sph.harvard.edu/humann2_data/uniprot/uniref_annotated/uniref90_annotated_v201901.tar.gz
wget -c http://huttenhower.sph.harvard.edu/humann2_data/full_mapping_v201901.tar.gz
成功
解壓窖贤,打開chocophlan看看
tar -zxvf chocophlan/full_chocophlan.v296_201901.tar.gz
ls chocophlan | wc -l
# 10670 個基因組序列文件
ls chocophlan | head
rm chocophlan/full_chocophlan.v296_201901.tar.gz
du -h chocophlan # 16G
g__Abditibacterium.s__Abditibacterium_utsteinense.centroids.v296_201901.ffn.gz
g__Abiotrophia.s__Abiotrophia_defectiva.centroids.v296_201901.ffn.gz
g__Abiotrophia.s__Abiotrophia_sp_HMSC24B09.centroids.v296_201901.ffn.gz
g__Absiella.s__Absiella_dolichum.centroids.v296_201901.ffn.gz
g__Absiella.s__Absiella_sp_AM09_45.centroids.v296_201901.ffn.gz
g__Abyssibacter.s__Abyssibacter_profundi.centroids.v296_201901.ffn.gz
g__Acaryochloris.s__Acaryochloris_marina.centroids.v296_201901.ffn.gz
g__Acaryochloris.s__Acaryochloris_sp_RCC1774.centroids.v296_201901.ffn.gz
g__Acetanaerobacterium.s__Acetanaerobacterium_elongatum.centroids.v296_201901.ffn.gz
解壓,打開uniref_annotated看看
tar -zxvf uniref90_annotated_v201901.tar.gz
ll -alh
rm uniref90_annotated_v201901.tar.gz
-rw-r--r-- 1 hutongyuan ST_META 34G Jun 12 2019 uniref90_201901.dmnd
-rw-r--r-- 1 hutongyuan ST_META 20G Apr 22 10:21 uniref90_annotated_v201901..gz
解壓贰锁,打開full_mapping文件看看
tar -zxvf full_mapping_v201901.tar.gz
rm full_mapping_v201901.tar.gz
ls -alh
配置數(shù)據(jù)庫路徑
humann_config # 更新前
humann_config --update database_folders nucleotide /hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/database/humann3/chocophlan/
humann_config --update database_folders protein /hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/database/humann3/uniref_annotated/
humann_config --update database_folders utility_mapping /hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/database/humann3/mapping/
humann_config # 更新后
humann3數(shù)據(jù)處理 - 腳本
#!/usr/bin/env bash
# 幫助文檔
helpdoc(){
cat <<EOF
Description:
This is a help document
- Plot circos
Usage:
$0 -i <input file> -o <output file>
Option:
-i this is a input file/fold
-o this is a output file/fold
EOF
}
# 參數(shù)傳遞
while getopts ":i:o:" opt
do
case $opt in
i)
infile=`echo $OPTARG`
;;
o)
outfile=`echo $OPTARG`
;;
?)
echo "未知參數(shù)"
exit 1;;
esac
done
# 若無指定任何參數(shù)則輸出幫助文檔
if [ $# = 0 ]
then
helpdoc
exit 1
fi
# 核心代碼
source /hwfssz1/ST_META/PN/hutongyuan/software/miniconda3/etc/profile.d/conda.sh
conda activate humann3
file="Result/kneaddata/$infile/${infile}_1_kneaddata_paired_cat.fastq"
if [ ! -f $file ]; then
echo -e "\033[32m 文件不存在赃梧,正在創(chuàng)建 \033[0m"
cat Result/kneaddata/$infile/${infile}_1_kneaddata_paired_1.fastq Result/kneaddata/$infile/${infile}_1_kneaddata_paired_2.fastq > Result/kneaddata/$infile/${infile}_1_kneaddata_paired_cat.fastq
else
echo -e "\033[32m 文件存在 \033[0m"
fi
mkdir Result/humann3/$infile
humann \
--input Result/kneaddata/$infile/${infile}_1_kneaddata_paired_cat.fastq \
--output Result/humann3/$infile/ \
--threads 64
關(guān)于paired-end處理,manaual給出意見:
HUMAnN 3.0 and paired-end sequencing data
End-pairing relationships are currently not taken into account during HUMAnN 3.0's alignment steps. This is due to the fact that HUMAnN 3.0 strictly aligns reads to isolated coding sequences: either at the nucleotide level or through translated search. As such, it will frequently be the case that one read (READ1) will map inside a given coding sequence while its mate-pair (READ2) will not.
運行bug
更多:
HUMAnN 3.0 (alpha)安裝及使用
Humann3 Paired end reads
Build an appropriate kraken2 database