Metagenomic analysis reveals oropharyngeal microbiota alterations in patients with COVID-19
哈工大省醫(yī)
宏基因組分析揭示了新冠病人口腔微生物的改變牢裳。
31 covid-19逢防,29 flu,28 normal 三組對(duì)照蒲讯。
宏基因組數(shù)據(jù)先區(qū)分出metagenomics忘朝,和COVID-19,reads伶椿,然后mapping to covid-19 reference.
組裝出whole genome 看看snp辜伟,哪些位點(diǎn)是否有變異臣淤?
virus discovery抄谐?
看鑒別出是哪種virus
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7441049/
鑒定出的COVID-19 樣本:定的標(biāo)準(zhǔn)是什么塘娶?
討論:
meta對(duì)疾病的惡化,和預(yù)后的影響偎痛。
[]metaquast 提供reference
[]prokka參數(shù) --kingdom 中Bacteria 和Viruses的區(qū)別
########################################################
用contig跟coronavirus reference做blast,
找到所有的coronavirus的contig独郎,并畫(huà)圖:
cd /USB/metagenomics/analysis/my_project/blastn_coronavirus_result
python? gather_blastn_contig.py /USB/metagenomics/analysis/my_project/blastn_coronavirus_result /USB/metagenomics/analysis/my_project/megahit_result >all_sample.coronavirus.contigs.fa
下一步用mega或什么對(duì)齊軟件畫(huà)個(gè)樹(shù)看看踩麦。
保留這個(gè)結(jié)果,并且驗(yàn)證跟kraken2的結(jié)果的一致性:
#######################################################
braken build database
/USB/tools/Bracken-2.6.2/bracken-build -d /USB/meta_genomics_pipeline/database/kraken2_database -t 32 -l 250
#######################################################
troubleshooting for kraken2 results
Question:kraken2的結(jié)果中有48% unclassified氓癌,并且沒(méi)有找到virus相關(guān)的任何比例谓谦。
Loading database information... done.
275881 sequences (55.16 Mbp) processed in 10.696s (1547.6 Kseq/m, 309.44 Mbp/m).
? 142595 sequences classified (51.69%)
? 133286 sequences unclassified (48.31%)
解決:在kraken2構(gòu)建完成的database中seqid2taxid.map文件里面沒(méi)有找到virus的序列,故此推斷在構(gòu)建時(shí)沒(méi)有把virus包含進(jìn)去贪婉,因?yàn)橹匦率褂胟raken2-build 構(gòu)建數(shù)據(jù)庫(kù)反粥。Note:重新構(gòu)建數(shù)據(jù)庫(kù)前必須要?jiǎng)h掉原來(lái)的seqid2taxid.map,taxo.k2d疲迂,hash.k2d才顿,opts.k2d四個(gè)文件,
第二次:after rebuild the database尤蒿,重新跑kraken2:
$ /USB/meta_genomics_pipeline/bin/kraken2 --db /USB/meta_genomics_pipeline/database/kraken2_database? --threads 50? --report /USB/metagenomics/analysis/my_project/kraken2_result/houmaohu.kraken2.report --use-names --output /USB/metagenomics/analysis/my_project/kraken2_result/houmaohu.kraken2.output /USB/metagenomics/analysis/my_project/kneaddata_result/houmaohu.kneaddata.fastq
Loading database information... done.
275881 sequences (55.16 Mbp) processed in 3.922s (4221.0 Kseq/m, 843.98 Mbp/m).
? 174591 sequences classified (63.28%)
? 101290 sequences unclassified (36.72%)
finally we found Covid-19 in kraken2 result. yeah. but still 36.72% of sequences are unclassified. we doubt that they are human related sequencing .
從第二次的結(jié)果中能夠找到virus結(jié)果了郑气,但是仍然有36%是unclassified,驗(yàn)證是否是human genome腰池,發(fā)現(xiàn)不是尾组。
驗(yàn)證方法如下:
使用bowtie2 將用于kraken2的fastq跟hg19 reference比對(duì),發(fā)現(xiàn)比對(duì)率為0.說(shuō)明該fastq已經(jīng)不含有human genome了示弓。那么懷疑是其它fungi讳侨,plasmid,等數(shù)據(jù)庫(kù)避乏,為了驗(yàn)證爷耀,
需要下載其他數(shù)據(jù)庫(kù),并重新使用kraken2-build 構(gòu)建index拍皮。
第三次更新數(shù)據(jù)庫(kù)之后
$ /USB/meta_genomics_pipeline/bin/kraken2 --db /USB/meta_genomics_pipeline/database/kraken2_database? --threads 32? --report /USB/metagenomics/analysis/my_project/kraken2_result/houmaohu.kraken2.report --use-names --output /USB/metagenomics/analysis/my_project/kraken2_result/houmaohu.kraken2.output /USB/metagenomics/analysis/my_project/kneaddata_result/houmaohu.kneaddata.fastq 2>/USB/metagenomics/analysis/my_project/kraken2_result/houmaohu.kraken2.log
Loading database information... done.
275881 sequences (55.16 Mbp) processed in 9.124s (1814.2 Kseq/m, 362.74 Mbp/m).
? 174783 sequences classified (63.35%)
? 101098 sequences unclassified (36.65%)
still 36.65% unclassified
那么這些序列是什么呢歹叮?輸出來(lái)看看吧跑杭,重新用kraken2 輸出unclassified reads。在ncbi上
發(fā)現(xiàn)還是virus和一些bacteria咆耿,那下載nt庫(kù)試試看吧德谅。
kraken2-build --download-library nt --db ./
######################################################################
kraken2 report explaination: here
######################################################################
'a' : all taxonomic levels
'k' : kingdoms
'p' : phyla only
'c' : classes only
'o' : orders only
'f' : families only
'g' : genera only
's' : species only
######################################################################
記錄cornonavirus的相關(guān)信息:
單獨(dú)的cov-reference放在:/USB/meta_genomics_pipeline/database/SARS-CoV-2-reference
NCBI名稱(chēng):
>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
長(zhǎng)度:29903bp
######################################################################
添加contig注釋功能:目的:鑒定組裝出來(lái)的contig都是什么物種。目標(biāo):能夠找到target物種萨螺。用于后續(xù)的分析窄做。
思路:基于kraken2使用的bacteria,virus慰技,fungi椭盏,archaea,等6個(gè)數(shù)據(jù)庫(kù)的fasta文件吻商,重新構(gòu)建blastn 所需index掏颊,然后使用blastn 將sample組裝好的contigalign到該數(shù)據(jù)庫(kù)上。使用in-house腳本統(tǒng)計(jì)并整理每個(gè)contig的物種信息艾帐,并匯總乌叶。
1)構(gòu)建數(shù)據(jù)庫(kù):
cd /USB/meta_genomics_pipeline/database/blastn_dbs
cat ../kraken2_database/library/*/library.fna > my_kraken_cat6dbs_20210726.fna
makeblastdb? -parse_seqids? -dbtype? nucl? -in? my_kraken_cat6dbs_20210726.fna
2)mapping 添加到main pipeline中,做一個(gè)測(cè)試:
blastn -db /USB/meta_genomics_pipeline/database/blastn_dbs/my_kraken_cat6dbs_20210726.fna -query /USB/metagenomics/analysis/my_project/megahit_result/houmaohu/houmaohu.megahit.final.contigs.fa -out ./sample01.corona.out -outfmt 7 -num_threads 30 -subject_besthit
3)寫(xiě)腳本統(tǒng)計(jì)每條contig都是什么物種(besthit)
輸入文件1:contig文件柒爸,
輸入文件2:blast out 文件准浴,cat 所有blast out文件到一個(gè)文件中,
輸入文件3:blast 時(shí)用到的reference文件中就有id捎稚,叫做seqid2taxid.6abfpuv.map
輸出文件1:contig fasta
思路二:
無(wú)需cat 6個(gè)database乐横,直接分別比對(duì),這樣節(jié)省時(shí)間阳藻,提高運(yùn)算速度晰奖,只是一個(gè)組裝文件會(huì)得到6個(gè)比對(duì)結(jié)果,再用腳本整理到一起
1)那么對(duì)6個(gè)database分別構(gòu)建blastn index
2)分別比對(duì)6次
3)統(tǒng)計(jì)結(jié)果
ls /USB/meta_genomics_pipeline/database/kraken2_database/library/*/library.fna |while read line;do name=`ls $line|awk -F '/' '{print$(NF-1)}'`;echo "makeblastdb? -parse_seqids? -dbtype? nucl? -in $line -out ./$name ";done |sh
有個(gè)數(shù)據(jù)庫(kù)出錯(cuò)了腥泥,檢查blastn的報(bào)錯(cuò)匾南,并重新構(gòu)建blast index,
######################################################################
遺留問(wèn)題:
[]下一步用mega或什么對(duì)齊軟件畫(huà)個(gè)樹(shù)看看蛔外。
[x]rebuild kraken2 database (running)
[x]rerun kraken2 to check ratio after 2 was done
[x]rerun blastn when finished build reference index.
[x]kraken2 結(jié)果畫(huà)圖蛆楞,不畫(huà)了,krona的圖就夠了夹厌。
[x]處理結(jié)果統(tǒng)計(jì)腳本豹爹,完成,將fasta文件用于galaxy展示矛纹。
[x]檢查blast index 的錯(cuò)誤臂聋,重跑blast
######################################################################
使用krona畫(huà)物種分類(lèi)圖:
注意:kraken2.output中第三列為taxa_id,so do not using --use-names when using kraken2.
le ./houmaohu.kraken2.output |cut -f 2,3 >houmaohu.kraken.krona
ktImportTaxonomy houmaohu.kraken.krona -o houmaohu.taxonomy.krona.html
[ WARNING ] Too many query IDs to store in chart; storing supplemental files in 'taxonomy.krona.html.files'.
會(huì)報(bào)一個(gè)warning,但不影響html上的圖
done
######################################################################
驗(yàn)證kneaddata之后的reads是否還有hg19.genome.
$ bowtie2? -p 20 -x /USB/meta_genomics_pipeline/database/kneaddata_database/hg37dec_v0.1 -U ../kneaddata_result/houmaohu.kneaddata.fastq -S houmaohu_r1_bowtie.fw.sam --nofw --local
275881 reads; of these:
? 275881 (100.00%) were unpaired; of these:
? ? 275870 (100.00%) aligned 0 times
? ? 1 (0.00%) aligned exactly 1 time
? ? 10 (0.00%) aligned >1 times
0.00% overall alignment rate
(prokka_env) ceid 15:08:09 /USB/metagenomics/analysis/my_project/test_kneaddata_2hg19
$ bowtie2? -p 20 -x /USB/meta_genomics_pipeline/database/kneaddata_database/hg37dec_v0.1 -U /USB/metagenomics/analysis/my_project/fastqc_result/houmaohu.raw.fastq.gz -S houmaohu_raw_bowtie.fw.sam --nofw --local
488767 reads; of these:
? 488767 (100.00%) were unpaired; of these:
? ? 474782 (97.14%) aligned 0 times
? ? 3171 (0.65%) aligned exactly 1 time
? ? 10814 (2.21%) aligned >1 times
2.86% overall alignment rate
1)原始fastq中就只有2.8%的human reads,
2)kneaddata后沒(méi)有human reads孩等,證明kneaddata會(huì)去除humanreads
3)那么問(wèn)題來(lái)了艾君,kraken2記過(guò)中37% no hit是什么?是什么肄方?是什么冰垄?
############################################################################
看VIP工具文章,寫(xiě)思路权她,出結(jié)果虹茶。
############################################################################
data from paper:
https://www.nature.com/articles/srep23774/tables/2
補(bǔ)充數(shù)據(jù)分析,新增PE數(shù)據(jù)隅要,適配流程蝴罪。
seqkit fx2tab -lg UniVec_Core_library.fixed.fna |cut -f 1,4,5|head -1
python /USB/meta_genomics_pipeline/bin/blastout2contig_annotation.py -c /USB/metagenomics_analysis/analysis/my_project/megahit_result/houmaohu/houmaohu.megahit.final.contigs.fa -b /USB/metagenomics_analysis/analysis/my_project/blastn_result/houmaohu/houmaohu.cat.all.blast.out -m /USB/meta_genomics_pipeline/database/blastn_dbs/seqid2taxid.6abfpuv.map -o /USB/metagenomics_analysis/analysis/my_project/blastn_result/houmaohu.annotated.final.contigs.fa
contig_name? annotation? ref_name? ? ref_gc? ? contig_len? ? contig_gc Genome_coverage
k141_259? ? ? ? Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome? ? 29903? 37.97? 7433? ? 39.63? 24.857%
個(gè)性化分析:
提取所有樣本中的target 物種的組裝結(jié)果序列,并檢查N50拾徙,N90等洲炊,看看全長(zhǎng)多少感局,能不能畫(huà)進(jìn)化樹(shù)之類(lèi)的尼啡,或者后續(xù)snp分析。
################################################
method for phylogeny:
conda create -n phylo blast mafft raxml iqtree bedtools
conda activate phylo
mafft --auto all.coronavirus.with.ref.fa > all.coronavirus.with.ref.out
raxmlHPC -s all.coronavirus.with.ref.out -m GTRGAMMA -n coronavirus_tree -p 12345
# or
iqtree -s all.coronavirus.with.ref.out
cat all.coronavirus.with.ref.out.treefile
paste treefile to https://itol.embl.de/upload.cgi
圖不直觀询微,應(yīng)該用gene,
用contig 注釋出gene崖瞭,在用gene作圖。
補(bǔ)充:
1)Genome Coverage (%)? 撑毛,contig長(zhǎng)度與reference長(zhǎng)度的比例书聚,在已有的腳本上修改。
2)Total reads? (過(guò)濾kneaddata后的fastq的reads藻雌,)
3)Number of reads (%) 統(tǒng)計(jì)比對(duì)上每個(gè)contig后的reads雌续,咋統(tǒng)計(jì)?
評(píng)估每種virus的結(jié)果胯杭。
參數(shù)修正:
megahit --min-len-contig 1000
blastn -evalue 10
reference:
https://genomics.sschmeier.com/ngs-orthology/index.html