Window下如何下載SRA數(shù)據(jù)跑DADA2流程

由于沒有服務(wù)器力喷，只能自己搗鼓著用windows的cmd命令行來批量下載一些數(shù)據(jù)足删，但是windows下高通量測(cè)序數(shù)據(jù)（如sra）下載體驗(yàn)真的很差，沒辦法匣距，只能硬著頭皮去搞

一、下載sratoolkit軟件

一）軟件位置：https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

圖1.1.1 選擇箭頭所指windows版本

二）軟件安裝

將下載好的sratoolkit軟件解壓縮哎壳，然后配置環(huán)境變量

圖1.2.1

打開windows環(huán)境變量設(shè)置窗口：按快捷鍵win+R后,輸入sysdm.cpl,然后回車,完事毅待，具體步驟請(qǐng)跟著后續(xù)圖示來。

圖1.2.2

圖1.2.3

圖1.2.4

圖1.2.5

三）檢測(cè)軟件是否可以正常工作

快捷鍵Win+R归榕，輸入cmd進(jìn)入cmd命令行

圖1.3.1
進(jìn)入cmd命令行界面尸红，進(jìn)入軟件所在位置，如：bio-tools\sratoolkit\sratoolkit.2.10.7-win64\bin\prefetch.exe -h刹泄，如成功顯示幫助信息外里，說明軟件安裝成功

圖1.3.2

二、下載測(cè)試數(shù)據(jù)

如果你自己沒有數(shù)據(jù)（16S rRNA基因測(cè)序數(shù)據(jù)）特石，那就去下載別人上傳到網(wǎng)絡(luò)各大數(shù)據(jù)庫的數(shù)據(jù)盅蝗，比如以NCBI為例

打開NCBI官網(wǎng)：https://www.ncbi.nlm.nih.gov/，隨便輸入關(guān)鍵詞如16S rRNA sequencing姆蘸，選擇SRA數(shù)據(jù)庫墩莫，進(jìn)入。

圖2.1.1
選擇所需下載的個(gè)體

圖2.2.1

圖2.2.2

獲得所下載個(gè)體的SRR號(hào)逞敷，然后用sratoolkit軟件的prefetch函數(shù)批量下載狂秦，如下圖所示。

F:\>bio-tools\sratoolkit\sratoolkit.2.10.7-win64\bin\prefetch.exe --option-file C:\Users\jnzd_\Desktop\SRR_Acc_List.txt

可以看到下載信息：表示下載成功

2020-06-28T00:49:08 prefetch.2.10.7: 1) Downloading 'SRR12073641'...
2020-06-28T00:49:08 prefetch.2.10.7:  Downloading via HTTPS...
2020-06-28T00:54:27 prefetch.2.10.7:  HTTPS download succeed
2020-06-28T00:54:27 prefetch.2.10.7:   verifying 'SRR12073641'...
2020-06-28T00:54:27 prefetch.2.10.7:  'SRR12073641' is valid
2020-06-28T00:54:27 prefetch.2.10.7: 1) 'SRR12073641' was downloaded successfully
2020-06-28T00:54:28 prefetch.2.10.7: 'SRR12073641' has 0 unresolved dependencies

2020-06-28T00:54:30 prefetch.2.10.7: 2) Downloading 'SRR12073642'...
2020-06-28T00:54:30 prefetch.2.10.7:  Downloading via HTTPS...
2020-06-28T01:00:17 prefetch.2.10.7:  HTTPS download succeed
2020-06-28T01:00:17 prefetch.2.10.7:   verifying 'SRR12073642'...
2020-06-28T01:00:17 prefetch.2.10.7:  'SRR12073642' is valid
2020-06-28T01:00:17 prefetch.2.10.7: 2) 'SRR12073642' was downloaded successfully
2020-06-28T01:00:17 prefetch.2.10.7: 'SRR12073642' has 0 unresolved dependencies

2020-06-28T01:00:20 prefetch.2.10.7: 3) Downloading 'SRR12073643'...
2020-06-28T01:00:20 prefetch.2.10.7:  Downloading via HTTPS...
2020-06-28T01:08:38 prefetch.2.10.7:  HTTPS download succeed
2020-06-28T01:08:38 prefetch.2.10.7:   verifying 'SRR12073643'...
2020-06-28T01:08:38 prefetch.2.10.7:  'SRR12073643' is valid
2020-06-28T01:08:38 prefetch.2.10.7: 3) 'SRR12073643' was downloaded successfully
2020-06-28T01:08:38 prefetch.2.10.7: 'SRR12073643' has 0 unresolved dependencies

2020-06-28T01:08:40 prefetch.2.10.7: 4) Downloading 'SRR12073644'...
2020-06-28T01:08:40 prefetch.2.10.7:  Downloading via HTTPS...
2020-06-28T01:15:32 prefetch.2.10.7:  HTTPS download succeed
2020-06-28T01:15:32 prefetch.2.10.7:   verifying 'SRR12073644'...
2020-06-28T01:15:32 prefetch.2.10.7:  'SRR12073644' is valid
2020-06-28T01:15:32 prefetch.2.10.7: 4) 'SRR12073644' was downloaded successfully
2020-06-28T01:15:32 prefetch.2.10.7: 'SRR12073644' has 0 unresolved dependencies

sra數(shù)據(jù)轉(zhuǎn)換為fastq格式

for /l %i in (41,1,44) do (fastq-dump.exe --split-files --gzip D:\R\R-exercise\ncbi-data\SRR120736%i\SRR120736%i.sra -O D:\R\R-exercise\ncbi-data\output)

# 本想用--split-3參數(shù)兰粉，但windows下一直報(bào)錯(cuò)故痊，不知道怎么回事，用s--split-files不會(huì)玖姑，-O接轉(zhuǎn)換后的數(shù)據(jù)存放目錄

圖2.2.3 轉(zhuǎn)換后的數(shù)據(jù)愕秫，以fastq結(jié)尾

三、R語言運(yùn)行data2

將轉(zhuǎn)換為fastq的16S rRNA 基因測(cè)序數(shù)據(jù)用data2進(jìn)行測(cè)序數(shù)據(jù)分析焰络，包括去除引物戴甩、去除低質(zhì)量數(shù)據(jù)、去重復(fù)闪彼、去嵌合體等步驟：

#############################################
##  FGFP sample inference from amplicon data 
##  by Raul Tito & Rodrigo Bacigalupe
##  date: March 28th 2018
#############################################
### Pipeline created from https://benjjneb.github.io/dada2/tutorial.html
### More information available at https://benjjneb.github.io/dada2/index.html

### This pipeline requires demultiplexed files: Two fastq files per sample (*R1 and *R2), they can be *.gz

### Load required modules
# module load R/3.4.2   ### very important to use this version

### Load necessary libraries and set seed
# 國(guó)內(nèi)用戶清華鏡像站甜孤，加速國(guó)內(nèi)用戶下載
site="https://mirrors.tuna.tsinghua.edu.cn/CRAN"
# 檢查是否存在Biocondoctor安裝工具，沒有則安裝
if (!requireNamespace("BiocManager", quietly = TRUE))
 install.packages("BiocManager",repo=site)
# 加載安裝工具畏腕，并安裝dada2
library(BiocManager)
BiocManager::install("dada2")

library(dada2)
packageVersion("dada2")
set.seed(12345)

### Every new MiSeq or Hiseq run is quality checked for number of reads and presence of unused barcodes to estimate 
### levels of carryover from prevous runs or contamination events. As part of this initial analysis demultipled files 
### per each sample are generated (Using LotuS).

### Set directory and files
#' datasets used were downloaded from:https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP268421&o=acc_s%3Aa
path <- "D:\\R\\R-exercise\\ncbi-data\\output" # CHANGE to the directory containing your demultiplexed R1 and R2 fastq files
fns <- list.files(path) #列出某文件夾下的所有文件
#fns

####################
### load your data
####################
fastqs <- fns[grepl(".fastq.gz$", fns)]  # grepl查找匹配后返回true或false缴川，而grep查找匹配后返回的是索引
fastqs <- sort(fastqs) # Sort ensures forward/reverse reads are in same order
### make sure that R1 is for forward read and R2 for reverse

fnFs <- fastqs[grepl(".1.fastq.gz", fastqs)] ## Just the forward read files
fnRs <- fastqs[grepl(".2.fastq.gz", fastqs)] ## Just the reverse read files
## Get sample names from the first part of the forward read filenames
sample.names <- sapply(strsplit(fnFs, ".1.fastq.gz"), `[`, 1) ## check if it is 1 or 2

## Fully specify the path for the fnFs and fnRs
fnFs <- file.path(path, fnFs)
fnRs <- file.path(path, fnRs)

雙端測(cè)序數(shù)據(jù)的質(zhì)量值作圖

###########################################
## Examine quality profiles of F & R reads
###########################################
pdf("plotQualityProfile.pdf", onefile=T)
plotQualityProfile(fnFs[1:2]) ## remove 20 plus 10 (primers and first odd bases)
plotQualityProfile(fnRs[1:2]) ## remove 20 plus 10 (primers and first odd bases)
dev.off()

圖3.1

- 根據(jù)質(zhì)量圖對(duì)數(shù)據(jù)進(jìn)行過濾和截取
##################################
## Perform filtering and trimming
##################################
filt_path <- file.path(path, "filtered") # Place filtered files in filtered/subdirectory
filtFs <- file.path(filt_path, paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_path, paste0(sample.names, "_R_filt.fastq.gz"))

過濾雙端序列，F(xiàn)ilter the forward and reverse reads:Important to remove primers and low quality regions（去除引物和低質(zhì)量的數(shù)據(jù)）

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(130,200), ## Different settings tried,these are good for current primer constructs for MiSeq and HiSeq描馅，truncLen參數(shù)設(shè)定序列長(zhǎng)度范圍把夸，不在該范圍則過濾
                     trimLeft=c(30, 30), # 每條reads開頭的30個(gè)堿基一般質(zhì)量值偏低，建議過濾铭污，或者根據(jù)實(shí)際情況設(shè)定參數(shù)范圍恋日。
                     maxN=0, maxEE=c(2,2), truncQ=11, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE) #
head(out)

## Examine quality profiles of filtered reads
pdf("plotQualityProfile.filt.pdf", onefile=T)
plotQualityProfile(filtFs[1:2])
plotQualityProfile(filtRs[1:2])
dev.off()

圖3.2 過濾后的序列質(zhì)量分布圖

錯(cuò)誤率

## Learn the Error Rates
## DADA2算法使用機(jī)器學(xué)習(xí)構(gòu)建參數(shù)誤差模型膀篮，認(rèn)為每個(gè)擴(kuò)增子測(cè)序樣品都具有不同的誤差比率。該learnErrors方法通過交替估計(jì)錯(cuò)誤率和對(duì)參考樣本序列學(xué)習(xí)錯(cuò)誤模型岂膳，直到學(xué)習(xí)模型同真實(shí)錯(cuò)誤率收斂于一致誓竿。與許多機(jī)器學(xué)習(xí)方法一樣，算法有一個(gè)初始假設(shè)：數(shù)據(jù)中的最大可能錯(cuò)誤率就是只有最豐富的序列是正確的谈截，其余都是錯(cuò)誤筷屡。
#########################
## Learn forward error rates
errF <- learnErrors(filtFs, nread=1e6, multithread=TRUE) ## variable but this is the minimum number of reads
## Learn reverse error rates
errR <- learnErrors(filtRs, nread=1e6, multithread=TRUE) ## variable but this is the minimum number of reads
## Sample inference and merger of paired-end reads
mergers <- vector("list", length(sample.names))
names(mergers) <- sample.names

## Plot estimated error as sanity check 
pdf("plotErrors_F.pdf", onefile=T)
plotErrors(errF, nominalQ=TRUE)
dev.off()

pdf("plotErrors_R.pdf", onefile=T)
plotErrors(errR, nominalQ=TRUE)
dev.off()

圖3.3 錯(cuò)誤率模型圖

去重復(fù)序列

與usearch去冗余步驟類似，僅僅保留重復(fù)序列中的一條序列

#########################
## Perform dereplication
#########################
## Dereplicate the filtered fastq files
derepRs <- derepFastq(filtRs, verbose=TRUE)
derepFs <- derepFastq(filtFs, verbose=TRUE)

# Name the derep-class objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names


####################
## Sample Inference
####################
## Apply the core sequence-variant inference algorithm to the dereplicated data
## Infer the sequence variants in each sample
## 基于錯(cuò)誤模型進(jìn)一步質(zhì)控
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)

## Inspect the dada-class object returned by dada
dadaFs[[1]]
# 473 sequence variants were inferred from 5288 input unique sequences 從5288個(gè)輸入獨(dú)特序列中推斷出473個(gè)真實(shí)序列

去噪后的正反向序列進(jìn)行合并

合并的條件是：默認(rèn)情況下傻盟，僅當(dāng)正向和反向序列重疊至少12個(gè)堿基并且在重疊區(qū)域中彼此相同時(shí)才輸出合并序列

## Merge the denoised forward and reverse reads
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
## Inspect the merger data.frame from the first sample
head(mergers[[1]])

mergers對(duì)象格式為R語言的數(shù)據(jù)框速蕊，數(shù)據(jù)框中包含序列及其豐度信息嫂丙，未能成功合并的序列被刪除娘赴。

構(gòu)建ASV表

開始構(gòu)建 amplicon sequence variant（ASV）表，跟啤；類似于我們傳統(tǒng)的OTU表一樣诽表，使用makeSequenceTable參數(shù)進(jìn)行構(gòu)建

############################
## Construct sequence table 
############################
seqtab <- makeSequenceTable(mergers)
## Get dimensions
dim(seqtab)
# [1]    4 1237  可以看到我們是4個(gè)樣本，然后最終得到1237個(gè)ASV

## Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))

去除嵌合體序列

Dada核心質(zhì)控方法糾正了一些錯(cuò)誤隅肥，但嵌合體仍然存在竿奏。幸運(yùn)的是，去噪后序列的準(zhǔn)確性使得識(shí)別嵌合體比處理模糊OTU時(shí)更簡(jiǎn)單腥放。

## Remove chimeras
###################
## Remove chimeric sequences:
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
dim(seqtab.nochim)

sum(seqtab.nochim)/sum(seqtab)

統(tǒng)計(jì)序列信息

根據(jù)上述幾個(gè)步驟的序列篩選泛啸，來mark一下原始序列到最終可以采用的時(shí)候其艱辛旅程吧

####################################
## Track reads through the pipeline
####################################
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(mergers, getN), rowSums(seqtab), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoised", "merged", "tabled", "nonchim")
rownames(track) <- sample.names
head(track)

圖3.4 序列篩選過程統(tǒng)計(jì)

物種注釋

###################
## Assign taxonomy
###################
#' 幾個(gè)常用的注釋數(shù)據(jù)庫：https://benjjneb.github.io/dada2/training.html
taxHS <- assignTaxonomy(seqtab.nochim, "data/rdp_train_set_16.fa.gz", multithread=TRUE) ## CHANGE to directory and pertinent database
taxHS_spe <- addSpecies(taxHS, "data/rdp_species_assignment_16.fa.gz") # 需要注釋種水平另外加載相應(yīng)數(shù)據(jù)庫
colnames(taxHS) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")
unname(head(taxHS)) # Remove the names or dimnames attribute of an R object when presentation
unname(tail(taxHS))


#######################
## Write data to files
#######################
write.table(track, file = "track.tsv", quote=FALSE)
seqtab.nochim <- as.data.frame(seqtab.nochim) # 將其轉(zhuǎn)為data.frame
names(seqtab.nochim) <- paste("ASV",seq(1,dim(seqtab.nochim)[2],1), sep = "_") 
# ASV標(biāo)簽名稱太長(zhǎng)，這里將其轉(zhuǎn)換下命名
write.table(seqtab.nochim, file = "sequence_table_SV.tsv", quote=FALSE)
write.table(taxHS, file = "taxa_SV.tsv", quote=FALSE)
taxHS_spe <- as.data.frame(taxHS_spe)
rownames(taxHS_spe) <- paste("ASV",seq(1,dim(seqtab.nochim)[2],1), sep = "_")
write.table(taxHS_spe, file = "taxa_SV_spe.txt", quote = FALSE)

圖3.5 ASV物種注釋結(jié)果

圖3.6 個(gè)體對(duì)應(yīng)的ASV豐度表

參考

qiime2官網(wǎng)資料https://docs.qiime2.org/2020.6/
DADA2中文教程v1.8https://blog.csdn.net/woodcorpse/article/details/86773312

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

人面猴
序言：七十年代末秃症，一起剝皮案震驚了整個(gè)濱河市候址，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌种柑，老刑警劉巖岗仑，帶你破解...
沈念sama閱讀 206,013評(píng)論 6贊 481
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場(chǎng)離奇詭異聚请，居然都是意外死亡荠雕，警方通過查閱死者的電腦和手機(jī)，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 88,205評(píng)論 2贊 382
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門驶赏，熙熙樓的掌柜王于貴愁眉苦臉地迎上來炸卑，“玉大人，你說我怎么就攤上這事煤傍「俏模” “怎么了？”我有些...
開封第一講書人閱讀 152,370評(píng)論 0贊 342
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵患久，是天一觀的道長(zhǎng)椅寺。經(jīng)常有香客問我浑槽，道長(zhǎng)，這世上最難降的妖魔是什么返帕？我笑而不...
開封第一講書人閱讀 55,168評(píng)論 1贊 278
?港島之戀（遺憾婚禮）
正文為了忘掉前任桐玻，我火速辦了婚禮，結(jié)果婚禮上荆萤，老公的妹妹穿的比我還像新娘镊靴。我一直安慰自己，他們只是感情好链韭，可當(dāng)我...
茶點(diǎn)故事閱讀 64,153評(píng)論 5贊 371
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布偏竟。她就那樣靜靜地躺著，像睡著了一般敞峭。火紅的嫁衣襯著肌膚如雪踊谋。梳的紋絲不亂的頭發(fā)上，一...
開封第一講書人閱讀 48,954評(píng)論 1贊 283
城市分裂傳說
那天旋讹，我揣著相機(jī)與錄音殖蚕，去河邊找鬼。笑死，一個(gè)胖子當(dāng)著我的面吹牛，可吹牛的內(nèi)容都是我干的莹菱。我是一名探鬼主播，決...
沈念sama閱讀 38,271評(píng)論 3贊 399
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼蛤育，長(zhǎng)吁一口氣：“原來是場(chǎng)噩夢(mèng)啊……” “哼！你這毒婦竟也來了葫松？” 一聲冷哼從身側(cè)響起瓦糕，我...
開封第一講書人閱讀 36,916評(píng)論 0贊 259
萬榮殺人案實(shí)錄
序言：老撾萬榮一對(duì)情侶失蹤，失蹤者是張志新（化名）和其女友劉穎进宝，沒想到半個(gè)月后刻坊，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體，經(jīng)...
沈念sama閱讀 43,382評(píng)論 1贊 300
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡党晋，尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 35,877評(píng)論 2贊 323
?白月光啟示錄
正文我和宋清朗相戀三年谭胚，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片未玻。...
茶點(diǎn)故事閱讀 37,989評(píng)論 1贊 333
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡灾而，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出扳剿，到底是詐尸還是另有隱情旁趟，我是刑警寧澤，帶...
沈念sama閱讀 33,624評(píng)論 4贊 322
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布庇绽，位于F島的核電站锡搜，受9級(jí)特大地震影響橙困，放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜耕餐，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 39,209評(píng)論 3贊 307
男人毒藥：我在死后第九天來索命
文/蒙蒙一凡傅、第九天我趴在偏房一處隱蔽的房頂上張望。院中可真熱鬧肠缔，春花似錦夏跷、人聲如沸。這莊子的主人今日做“春日...
開封第一講書人閱讀 30,199評(píng)論 0贊 19
一樁弒父案槽华，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽。三九已至趟妥，卻和暖如春猫态，著一層夾襖步出監(jiān)牢的瞬間，已是汗流浹背煮纵。一陣腳步聲響...
開封第一講書人閱讀 31,418評(píng)論 1贊 260
情欲美人皮
我被黑心中介騙來泰國(guó)打工懂鸵，沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留偏螺，地道東北人行疏。一個(gè)月前我還...
沈念sama閱讀 45,401評(píng)論 2贊 352
代替公主和親
正文我出身青樓，卻偏偏與公主長(zhǎng)得像套像，于是被迫代替她去往敵國(guó)和親酿联。傳聞我的和親對(duì)象是個(gè)殘疾皇子，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 42,700評(píng)論 2贊 345