This post is reproduced from Xiao Bin's ScienceNet blog.
Early strategies for 16S rRNA analysis included FISH (fluorescence in situ hybridization), DGGE (denaturing gradient gel electrophoresis), PCR cloning, and T-RFLP (terminal restriction fragment length polymorphism). With the development of NGS (next-generation sequencing), this post focuses on how NGS is applied to 16S rRNA analysis.
The main tools for 16S rRNA NGS data analysis referred to throughout this post are QIIME and mothur.
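The analysis of 16S rRNA NGS data has three major steps:
Raw data preprocessing: adapter removal, data filtering, denoising, chimera checking, and data normalization;
Microbial diversity analysis: defining OTUs and their representative sequences, including picking OTUs and representative sequences, taxonomic assignment, and phylogenetic tree analysis;
In-depth analysis and visualization: alpha and beta diversity analysis, clustering and correlation analysis, and data visualization.
The individual steps of the workflow are described in detail below.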
01
Adapter removal and data filtering
16S libraries are usually sequenced as pools, so the raw reads first have to be assigned to their samples based on the barcode sequences. In QIIME, "split_libraries.py" and "split_libraries_fastq.py" both demultiplex and quality-filter the data. mothur uses "trim.seqs". Neither QIIME nor mothur processes sff files (the format produced by 454 sequencers) directly, but "process_sff.py" (QIIME) and "sffinfo" (mothur) convert sff files into FASTA and QUAL files.
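The barcode-based splitting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how "split_libraries.py" or "trim.seqs" work internally; the function name `demultiplex`, the `max_mismatches` parameter, and the toy reads are assumptions made for this example, and all barcodes are assumed to have the same length and to sit at the start of each read.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample, max_mismatches=0):
    """Assign each (read_id, sequence) pair to a sample by its leading barcode.

    Assumes every barcode has the same length (hypothetical simplification).
    """
    barcode_len = len(next(iter(barcode_to_sample)))
    per_sample = defaultdict(list)
    for read_id, seq in reads:
        prefix = seq[:barcode_len]
        for barcode, sample in barcode_to_sample.items():
            mismatches = sum(a != b for a, b in zip(prefix, barcode))
            if len(prefix) == barcode_len and mismatches <= max_mismatches:
                # Strip the barcode before storing the read.
                per_sample[sample].append((read_id, seq[barcode_len:]))
                break
        else:
            # No barcode matched within the allowed number of mismatches.
            per_sample["unassigned"].append((read_id, seq))
    return dict(per_sample)

# Toy data, invented for illustration only.
reads = [
    ("r1", "ACGTTTGGCCAA"),
    ("r2", "TGCAGGCCTTAA"),
    ("r3", "GGGGAACCTTGG"),
]
barcodes = {"ACGT": "sampleA", "TGCA": "sampleB"}
result = demultiplex(reads, barcodes)
print(result)
```

Reads whose prefix matches no barcode are kept under an "unassigned" key rather than dropped, so they can be inspected later.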
Parameters considered during data filtering include: the minimum average quality score allowed in a read, the maximum number of ambiguous bases allowed, the minimum and maximum sequence length, the maximum homopolymer length allowed, the maximum number of mismatches allowed in the primer or barcode, whether to truncate the reverse primer, and so on.
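As a rough sketch of how several of these filtering parameters combine, the function below checks one read against a set of thresholds. The function name `passes_filter` and all default values are assumptions chosen for illustration; they are not the defaults of QIIME or mothur, and primer/barcode mismatch handling is left out.

```python
import re

def passes_filter(seq, quals, min_mean_q=25, max_ambiguous=0,
                  min_len=200, max_len=1000, max_homopolymer=6):
    """Return True if a read satisfies every threshold (illustrative values)."""
    if not seq or len(seq) != len(quals):
        return False
    seq = seq.upper()
    # Minimum and maximum sequence length.
    if not (min_len <= len(seq) <= max_len):
        return False
    # Minimum average quality score of the read.
    if sum(quals) / len(quals) < min_mean_q:
        return False
    # Maximum number of ambiguous bases (counted here as 'N').
    if seq.count("N") > max_ambiguous:
        return False
    # Maximum homopolymer length: longest run of one repeated base.
    longest_run = max(len(m.group(0)) for m in re.finditer(r"(.)\1*", seq))
    return longest_run <= max_homopolymer

# Toy reads with short length limits so the example stays readable.
ok = passes_filter("ACGTACGTAC", [30] * 10, min_len=5, max_len=20)
low_quality = passes_filter("ACGTACGTAC", [10] * 10, min_len=5, max_len=20)
ambiguous = passes_filter("ACGTNCGTAC", [30] * 10, min_len=5, max_len=20)
homopolymer = passes_filter("ACGAAAAAAAT", [30] * 11, min_len=5, max_len=20)
print(ok, low_quality, ambiguous, homopolymer)  # True False False False
```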
02
Denoising and chimera checking
Both the PCR step of 16S library preparation and the sequencing itself introduce errors into the reads, and these errors have to be removed effectively during analysis. Common tools for correcting sequencing errors are Denoiser (implemented in QIIME), AmpliconNoise, Acacia, and pre.cluster (implemented in mothur). Chimera detection tools include ChimeraSlayer, UCHIME, Perseus, and DECIPHER; ChimeraSlayer, UCHIME, and Perseus can all be called from within mothur. These tools still leave room for improvement: the different methods often disagree with one another on the list of identified chimeras, probably because of their different mechanisms or algorithms. More effort is required to evaluate these methods and reconcile their inconsistencies in chimera identification.
One point about archaeal sequences deserves attention: a very small proportion of archaeal sequences may be generated in 16S rRNA gene amplicon datasets amplified with bacteria-specific primers. These unexpected sequences should be identified after denoising and chimera removal, and are best discarded before data normalization.
03
Data normalization
Insufficient or uneven sequencing depth affects diversity estimates within a single sample (alpha diversity) as well as comparisons across samples (beta diversity), so data normalization is required. Two strategies are commonly used: relative abundance and random subsampling (rarefaction). In addition, z-scores are also used for normalization. Each of these methods has its own drawbacks.
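The two normalization strategies can be illustrated on a single sample's OTU counts. This is a minimal sketch: the function names, the fixed random seed, and the toy counts are assumptions made for the example, and rarefaction is implemented here as sampling reads without replacement.

```python
import random

def relative_abundance(counts):
    """Convert raw OTU counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {otu: n / total for otu, n in counts.items()}

def rarefy(counts, depth, seed=0):
    """Randomly subsample `depth` reads without replacement."""
    # Expand the counts into one entry per read, then draw from that pool.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)
    subsampled = {}
    for otu in rng.sample(pool, depth):
        subsampled[otu] = subsampled.get(otu, 0) + 1
    return subsampled

# Toy sample, invented for illustration only.
sample = {"OTU1": 50, "OTU2": 30, "OTU3": 20}
rel = relative_abundance(sample)
rare = rarefy(sample, depth=40)
print(rel)
print(rare)
```

Relative abundance keeps every read but hides the depth the proportions came from; rarefaction equalizes depth but discards reads, which is one way to see why neither method is free of drawbacks.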
04
OTU picking
OTUs are picked based on sequence identity, and various identity cutoffs of the 16S rRNA gene have been used for different taxonomic ranks. For example, the identity cutoffs recommended by MEGAN are 99% for species, 97% for genus, 95% for family, and 90% for order. The OTU picking strategy and algorithm have a significant effect on downstream data interpretation.
Depending on whether a reference database is used, OTU picking strategies are divided into de novo, closed-reference, and open-reference. Many clustering or alignment tools are available for OTU picking, such as Uclust, cd-hit, BLAST, mothur, usearch, and prefix/suffix. These tools are implemented in QIIME. Among them, the mothur method offers three clustering algorithms for picking de novo OTUs: nearest neighbor, furthest neighbor, and average neighbor.
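To make identity-based de novo OTU picking concrete, here is a deliberately simplified greedy sketch. It is not the algorithm of any tool named above: it assumes all sequences are already aligned to the same length, measures identity as the fraction of matching positions, and assigns each sequence to the first existing centroid that meets the cutoff. The function names and toy sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch expects equal-length (aligned) sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_otus(seqs, threshold=0.97):
    """Cluster (seq_id, sequence) pairs de novo.

    Returns {centroid_sequence: [member_ids]}. A sequence joins the first
    centroid it matches at >= threshold, otherwise it founds a new OTU.
    """
    otus = {}
    for seq_id, seq in seqs:
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid].append(seq_id)
                break
        else:
            otus[seq] = [seq_id]
    return otus

# Toy 10-base sequences; a 0.9 cutoff is used because one mismatch
# in 10 bases already drops identity to 90%.
seqs = [
    ("s1", "ACGTACGTAC"),
    ("s2", "ACGTACGTAA"),  # one mismatch versus s1
    ("s3", "TTTTGGGGCC"),  # unrelated
]
otus = greedy_otus(seqs, threshold=0.9)
print(otus)
```

Because the result depends on input order, a greedy approach like this one is usually run on sequences sorted by abundance or length; that choice is one reason different picking strategies lead to different downstream results.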
Once sequences have been clustered, each cluster represents one OTU, and the next step is to pick a representative sequence from it. The representative can be a random sequence, the longest sequence, the most abundant sequence (the default in QIIME), or the first sequence in the OTU cluster. Alternatively, the distance method in mothur picks the sequence with the smallest maximum distance to the other sequences in the OTU.
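Three of these selection rules are easy to express directly. The sketch below is an illustration only; the function name and the method labels ("first", "longest", "most_abundant") are names chosen for this example, and the random and distance-based rules are omitted.

```python
from collections import Counter

def pick_representative(sequences, method="most_abundant"):
    """Pick one representative from the list of sequences in an OTU."""
    if method == "first":
        return sequences[0]
    if method == "longest":
        return max(sequences, key=len)
    if method == "most_abundant":
        # The sequence string that occurs most often in the cluster.
        return Counter(sequences).most_common(1)[0][0]
    raise ValueError(f"unknown method: {method}")

# Toy OTU, invented for illustration only.
otu = ["ACGTAC", "ACGTACGG", "ACGTAA", "ACGTAA"]
print(pick_representative(otu, "first"))          # ACGTAC
print(pick_representative(otu, "longest"))        # ACGTACGG
print(pick_representative(otu, "most_abundant"))  # ACGTAA
```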
05
Taxonomic assignment
Strategies for taxonomic assignment include:
word match, e.g. RDP Classifier;
best hit;
lowest common ancestor (LCA), e.g. MEGAN and the SINA Alignment Service.
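The core idea of the lowest common ancestor strategy can be sketched as follows: given the lineages of all database hits for one read, keep only the ranks on which every hit agrees. This is a bare-bones illustration under that assumption; real tools apply additional criteria (for example on hit scores) that are not modeled here, and the function name and example hits are hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of the given taxonomic lineages."""
    if not lineages:
        return []
    shared = []
    # zip(*lineages) walks the ranks (domain, phylum, class, ...) in parallel.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared

# Lineages of three hypothetical hits for one read.
hits = [
    ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales", "Lactobacillaceae"],
    ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales", "Streptococcaceae"],
    ["Bacteria", "Firmicutes", "Clostridia", "Clostridiales", "Lachnospiraceae"],
]
lca = lowest_common_ancestor(hits)
print(lca)  # ['Bacteria', 'Firmicutes']
```

A best-hit strategy would instead report the full lineage of the single top hit, which gives deeper but less conservative assignments.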
06
Phylogenetic tree analysis
Phylogenetic relationships are usually represented as trees and are inferred mainly from sequence alignments. Alignment tools include ClustalW, MUSCLE, Clustal Omega, Kalign, T-Coffee, and COBALT. Trees can be built or viewed with the tools commonly used for 16S rRNA NGS data, such as MEGA, RAxML, MrBayes, PhyML, TreeView, Clearcut, and FastTree. Among these, RAxML and PhyML are the most widely used programs for maximum-likelihood phylogenetic analysis, probably because they are specifically designed and optimized for that purpose.
07
Alpha and beta diversity analysis
Alpha diversity can be expressed with many indices. mothur provides Shannon, Berger-Parker, Simpson, and Q statistic, as well as observed richness, Chao1, ACE, and jackknife. QIIME provides phylogenetic diversity (PD, whole tree), Chao1, and observed species.
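Three of the indices named above can be computed directly from a vector of OTU counts. The sketch below uses common textbook forms, which are assumptions on my part: Shannon with the natural logarithm, Simpson as 1 - sum(p_i^2), and the bias-corrected Chao1 estimator. QIIME and mothur may use different variants or log bases.

```python
import math

def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i), natural log."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

def simpson(counts):
    """Simpson's index of diversity, 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1 - sum((n / total) ** 2 for n in counts)

def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for n in counts if n > 0)
    singletons = sum(1 for n in counts if n == 1)
    doubletons = sum(1 for n in counts if n == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))

# Toy count vector: 7 observed OTUs, 3 singletons, 2 doubletons.
counts = [10, 5, 2, 2, 1, 1, 1]
print(shannon(counts))
print(simpson(counts))
print(chao1(counts))  # 8.0
```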
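Rarefaction curves are another way to compare species richness. QIIME mainly uses "single_rarefaction.py" and "multiple_rarefaction.py", while mothur mainly uses "rarefaction.single" and "rarefaction.shared".
Beta diversity mainly reflects the dissimilarity between samples. Several distance metrics, such as UniFrac, Bray-Curtis, Euclidean, Jaccard index, Yue & Clayton, and Morisita-Horn, are often employed. Depending on whether the relative abundance of OTUs is taken into account, beta diversity metrics are divided into quantitative and qualitative indices.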
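The difference between a quantitative and a qualitative index shows up clearly when Bray-Curtis (which uses abundances) is put next to the Jaccard distance (presence/absence only). This is a minimal sketch using the standard formulas; the function names and toy samples are assumptions for the example.

```python
def bray_curtis(a, b):
    """Quantitative dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    otus = set(a) | set(b)
    numerator = sum(abs(a.get(o, 0) - b.get(o, 0)) for o in otus)
    denominator = sum(a.get(o, 0) + b.get(o, 0) for o in otus)
    return numerator / denominator

def jaccard_distance(a, b):
    """Qualitative dissimilarity: 1 - |shared OTUs| / |all observed OTUs|."""
    present_a = {o for o, n in a.items() if n > 0}
    present_b = {o for o, n in b.items() if n > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)

# Toy samples, invented for illustration only.
s1 = {"OTU1": 10, "OTU2": 5, "OTU3": 0}
s2 = {"OTU1": 5, "OTU2": 5, "OTU4": 10}
print(bray_curtis(s1, s2))       # 15 / 35
print(jaccard_distance(s1, s2))  # 1 - 2 / 3
```

Both functions assume each sample contains at least one read; an all-zero pair would divide by zero.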
08
Statistical tests
For two-sample/group comparisons, the t-test is the usual choice. Note that the independent two-sample t-test in particular requires independence and equal variances (which can be tested with the F-test, Levene's test, etc.) of the two populations. When the data are not normally distributed, nonparametric two-sample tests that are robust to non-normality, such as the Wilcoxon signed-rank test (for paired samples) and the Mann-Whitney U test, are applicable for testing the significance of differences between group medians.
For multiple-sample/group tests, ANOVA is used.
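For the nonparametric two-sample case, the Mann-Whitney U statistic itself is simple to compute from pooled ranks. The sketch below returns only the statistic, with tied values given their average rank; it does not compute a p-value, for which a library routine such as `scipy.stats.mannwhitneyu` would normally be used. The function name and toy values are assumptions for this example.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) using average ranks for tied values."""
    # Pool both groups, remembering the group of origin (0 for x, 1 for y).
    pooled = sorted((v, g) for g, values in enumerate((x, y)) for v in values)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold tied values; their ranks are i+1..j.
        avg_rank = (i + 1 + j) / 2
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y

# Toy alpha diversity values for two groups, invented for illustration.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # (0.0, 9.0)
print(mann_whitney_u([1, 2], [2, 3]))        # (0.5, 3.5)
```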
09
Sample clustering analysis
Clustering reveals how closely samples are related to one another. Classification strategies are used to assign samples to categories.
10
Sample correlation analysis
Once the similarities and distances between samples have been computed, the relationships among samples can be explored with principal component analysis (PCA), principal coordinates analysis (PCoA, also known as metric multidimensional scaling), nonmetric multidimensional scaling (NMDS), canonical correspondence analysis (CCA), linear discriminant analysis (LDA), and redundancy analysis (RDA).
11
Network model construction
The 16S data can be analyzed together with gene expression, metabolite, protein, and other data to build co-expression networks or patterns (co-occurrence and co-exclusion patterns).
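One simple way to look for co-occurrence and co-exclusion patterns is to correlate OTU abundances across samples and keep strongly correlated pairs as network edges. The sketch below is only one possible approach and not a method prescribed by the source: it uses Spearman rank correlation, a hypothetical cutoff of 0.8, and toy abundance profiles, and it skips the significance testing and multiple-testing correction that a real analysis would need.

```python
def _ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2
        i = j
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sx * sy)

def cooccurrence_edges(table, threshold=0.8):
    """Return (otu_a, otu_b, label) for every pair with |rho| >= threshold."""
    names = sorted(table)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            rho = spearman(table[a], table[b])
            if abs(rho) >= threshold:
                label = "co-occurrence" if rho > 0 else "co-exclusion"
                edges.append((a, b, label))
    return edges

# Toy abundance profiles across five samples, invented for illustration.
table = {
    "OTU1": [1, 4, 9, 16, 25],
    "OTU2": [2, 3, 5, 8, 13],
    "OTU3": [30, 20, 12, 5, 1],
    "OTU4": [3, 1, 4, 1, 5],
}
edges = cooccurrence_edges(table)
print(edges)
```

The same pairwise approach extends to mixed tables in which some rows are OTUs and others are gene expression or metabolite measurements from the same samples.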
Reference: Ju F, Zhang T. 16S rRNA gene high-throughput sequencing data mining of microbiota diversity and interactions. Appl Microbiol Biotechnol. 2015, 99(10):4119-4129.