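There are three ways to download data from high-throughput sequencing databases:
- Conventional download (wget, Xunlei, IDM)
- aspera
- prefetch from the SRA Toolkit
Downloading with wget is slow, and the connection drops easily.
wget -c DOWNLOAD_URL
The -c option resumes an interrupted download from where it stopped.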
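To make the idea of resuming concrete, here is a minimal Python sketch of what a resumed transfer asks the server for. It assumes an HTTP(S) URL, where resuming is done with a `Range` header; the function name `build_resume_request` is made up for illustration and is not part of wget.

```python
import os
import urllib.request


def build_resume_request(url, local_path):
    """Build a request that asks only for the bytes not yet on disk.

    Assumption: the server speaks HTTP(S) and honours the Range header.
    Returns the request and the number of bytes already downloaded.
    """
    already = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    request = urllib.request.Request(url)
    if already > 0:
        # Ask for everything from byte offset `already` to the end of the file.
        request.add_header("Range", f"bytes={already}-")
    return request, already
```

The caller would then open the request and append the response body to the partial file, which is roughly what `wget -c` automates.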
A summary of ENA download methods: https://ena-docs.readthedocs.io/en/latest/retrieval/file-download.html#using-wget
Choosing a database
Preferred: download FASTQ files quickly from the EBI database.
The EBI database provides the aspera download commands directly; copy them to your own server and run them as they are.
cd ~/wes_cancer/project/1.raw_fq
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR318/008/SRR3182418/SRR3182418_2.fastq.gz .
ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR318/003/SRR3182423/SRR3182423_1.fastq.gz .
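Note: the commands above use the aspera download format. The trailing . means the files are saved to the current directory. It must not be omitted, otherwise the command fails.
Sometimes the transfer reports an SSH or UDP error. A likely cause is that port 33001 is not open on the server; opening that port fixes it.
With this method, the search box accepts an SRA accession, an SRR accession, or a project accession. Entering a PRJNA accession returns the download links for the whole project at once.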
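A newer tool: ffq
ffq can be used to fetch download URLs in batches.
ffq GitHub
Install it directly with pip: pip install ffq
ffq options
-o name of the output file
-t type of the input accession; the default is SRR, and the available options are SRR, ERR, DRR, SRP, ERP, DRP, GSE, DOI
ffq writes its results as JSON.
# Fetch the download URL for a single SRR accession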
ffq SRR12455819
The output is shown below. It includes the study title and abstract, the data type, the download URL, and the md5 checksum.
{
    "SRR12455819": {
        "accession": "SRR12455819",
        "experiment": {
            "accession": "SRX8950230",
            "title": "Illumina HiSeq 4000 sequencing; GH0202",
            "platform": "ILLUMINA",
            "instrument": "Illumina HiSeq 4000"
        },
        "study": {
            "accession": "SRP275570",
            "title": "Genebank resequencing of tetraploid cottons",
            "abstract": "Modern cultivated tetraploid cottons contain two species, Gossypium barbadense and Gossypium hirsutum. Among them, G. hirsutum is the significant species cultivated worldwide and contributes more than 90% natural fiber for the industry. To completely reveal the genetic diversity and population divergence within cultivated tetraploid cotton, we resequenced more than 1,700 accessions, which mostly included the South-China landraces, the elite introgression lines, and the obsoleted historic varieties. After integrating them with the major public resequencing data (PRJNA257154, PRJNA336461, PRJNA375965, PRJNA399050, and PRJNA414461), we obtained a whole Genebank variation map contained 3,248 tetraploid cottons (included 2,922 G. hirsutum). This variation map covered more than 1/3 of current G. hirsutum Genebank in China (~9,000 accessions), which could represent the genetic diversity of G. hirsutum."
        },
        "sample": {
            "accession": "SRS7128615",
            "title": "Gossypium hirsutum",
            "organism": "Gossypium hirsutum",
            "attributes": {
                "isolate": "not applicable",
                "cultivar": "GH0202",
                "ecotype": "not applicable",
                "age": "not applicable",
                "dev_stage": "not applicable",
                "geo_loc_name": "not applicable",
                "tissue": "Leaf",
                "BioSampleModel": "Plant",
                "ENA-SPOT-COUNT": "231824444",
                "ENA-BASE-COUNT": "34773666600",
                "ENA-FIRST-PUBLIC": "2020-12-09",
                "ENA-LAST-UPDATE": "2020-12-09"
            }
        },
        "title": "Illumina HiSeq 4000 sequencing; GH0202",
        "files": [
            {
                "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/019/SRR12455819/SRR12455819.fastq.gz",
                "md5": "056333e1c9fba8b9ff930496073cbb95",
                "size": "7552277314"
            }
        ]
    }
}
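The URL returned by ffq is ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/019/SRR12455819/SRR12455819.fastq.gz
The matching aspera address, in the format used above, is era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR124/019/SRR12455819/SRR12455819.fastq.gz
The two differ only in the prefix: replace the leading ftp://ftp.sra.ebi.ac.uk/ with era-fasp@fasp.sra.ebi.ac.uk: and the address is ready for ascp.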
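The prefix swap can also be done in a few lines of Python. This is only a sketch: the function names `ftp_to_aspera` and `ascp_command` are made up for illustration, and the key path, rate limit, and port are simply the values used in the ascp commands earlier in this post.

```python
FTP_PREFIX = "ftp://ftp.sra.ebi.ac.uk/"
ASPERA_PREFIX = "era-fasp@fasp.sra.ebi.ac.uk:"


def ftp_to_aspera(url):
    """Turn an EBI FTP URL into the user@host:path form that ascp expects."""
    if not url.startswith(FTP_PREFIX):
        raise ValueError(f"not an EBI FTP URL: {url}")
    # Drop the FTP prefix (including the slash after the host) and
    # put the aspera user, host, and colon in front of the remaining path.
    return ASPERA_PREFIX + url[len(FTP_PREFIX):]


def ascp_command(url, rate="300m", dest=".",
                 key="$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh"):
    """Return one ascp command line as a string (assumed defaults, see above)."""
    return f"ascp -QT -l {rate} -P33001 -i {key} {ftp_to_aspera(url)} {dest}"


if __name__ == "__main__":
    print(ascp_command(
        "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/019/SRR12455819/SRR12455819.fastq.gz"
    ))
```

Raising an error on an unexpected prefix avoids silently writing a broken command when a URL points to a different host.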
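Fetch download URLs in a batch and write them to a single JSON file
ffq -o SRA_Acc.json SRR12455818 SRR12455819 SRR12455820 SRR12455821 SRR12455822 SRR12455823 SRR12455824
All results are written to SRA_Acc.json.
Fetch download URLs in a batch and generate the ascp download commands automatically
# The output file is NCBI.json
ffq -o NCBI.json SRR12455818 SRR12455819 SRR12455820 SRR12455821 SRR12455822 SRR12455823 SRR12455824
The json2tab.py script is as follows:
The script currently handles the ffq output for SRR and ERP accessions. If other accession types produce a different layout, adjust the script by hand.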
#!/usr/bin/python3
## Usage: python3 json2tab.py NCBI.json
# Input: the JSON file written by ffq
# Output columns: "projectID","sampleID","fq_url","size","md5","organism"
import sys
import json
#jsonfile = "NCBI.json"
jsonfile = sys.argv[1]
#filename = jsonfile.split('.')[0]
# Parse the input JSON file
with open(jsonfile,"r") as load_f:
    load_dict = json.load(load_f)
# Optional header line (disabled)
#print("projectID","sampleID","fq1_url","fq2_url","size_1","size_1","md5_1","md5_2")
for i in load_dict.keys(): # i is the top-level accession; there is one entry per accession passed to ffq
    idlist=[]
    for j in load_dict[i].keys():
        idlist.append(j) # collect the keys of this entry: "files" can be printed directly, "runs" needs one more level of parsing
    if "files" in idlist:
        data_ID=load_dict[i] # "files" is present, so parse this entry directly
        for list_url in data_ID['files']:
            print(load_dict[i]['accession'],data_ID['accession'],list_url['url'],list_url['size'],list_url['md5'],data_ID['sample']['organism'],sep='\t')
    elif "runs" in idlist: # one more level of parsing is needed
        for id in load_dict[i]['runs'].keys(): # id is the accession of each run inside the project
            data_ID=load_dict[i]['runs'][id]
            for list_url in data_ID['files']:
                print(load_dict[i]['accession'],data_ID['accession'],list_url['url'],list_url['size'],list_url['md5'],data_ID['sample']['organism'],sep='\t')
    else: # some other layout; handle it yourself
        print("Unknown fields; the following content needs to be parsed by hand:")
        print(load_dict[i])
The script output looks like this:
SRR12455818 SRR12455818 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/018/SRR12455818/SRR12455818.fastq.gz 5425719230 67a1ba5cafd70f6137916589bf9bb437 Gossypium hirsutum
SRR12455819 SRR12455819 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/019/SRR12455819/SRR12455819.fastq.gz 7552277314 056333e1c9fba8b9ff930496073cbb95 Gossypium hirsutum
Generate the converted download commands directly
python3 json2tab.py NCBI.json >info.list
cat info.list|awk '{print $3}'|sed 's/ftp:\/\/ftp.sra.ebi.ac.uk/ascp -QT -l 30m -P33001 -i \$HOME\/.aspera\/connect\/etc\/asperaweb_id_dsa.openssh era-fasp\@fasp.sra.ebi.ac.uk:/g;s/$/ ./g' >download.sh
nohup bash download.sh & # run it in the background to start the download
The contents of download.sh are as follows:
ascp -QT -l 30m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR124/018/SRR12455818/SRR12455818.fastq.gz .
ascp -QT -l 30m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR124/019/SRR12455819/SRR12455819.fastq.gz .
Check with md5 that the downloaded data is complete
awk '{print $3,$5}' info.list|rev|cut -d "/" -f1|rev|awk '{print $2,$1}' >md5.txt
md5sum -c md5.txt
If every file reports OK, the data is intact.
The content below can be skipped
Fetch download URLs in a batch and write one JSON file per accession
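Where md5sum is not available, the same check can be sketched in Python with the standard hashlib module. It assumes md5.txt holds one "md5 filename" pair per line, as produced above, and that file names contain no spaces; the function names are made up for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 of a file in chunks, so large FASTQ files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_md5_list(md5_txt):
    """Return a list of (filename, ok) pairs for every line of md5_txt."""
    results = []
    with open(md5_txt) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            expected, name = parts
            results.append((name, md5_of_file(name) == expected.lower()))
    return results


if __name__ == "__main__":
    for name, ok in check_md5_list("md5.txt"):
        print(name, "OK" if ok else "FAILED")
```

Any file reported as FAILED should be downloaded again.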
ffq -o srr_split --split SRR12455818 SRR12455819 SRR12455820 SRR12455821 SRR12455822 SRR12455823 SRR12455824
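All results are written to the srr_split directory, as SRR12455818.json … SRR12455824.json
Not recommended: download the original SRA format from the NCBI database
From the NCBI database you can obtain an Accession List, which is a list of SRR accessions.
Download with prefetch
## Download a single accession by hand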
prefetch SRR3182423
## Download a batch automatically
cat SRR_Acc_List.txt | while read id
do
    prefetch ${id} -O ./
done
Data format
SRA is the NCBI database's own format. After downloading, you have to convert the files to FASTQ yourself.
Each SRR accession is downloaded as its own folder.