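Today I read through the whole workflow on the official site and followed it step by step. The data is downloading now; tomorrow I will check whether any errors were reported. I don't know yet what problems may still need solving.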
Source:
https://www.ncbi.nlm.nih.gov/books/NBK36439/
Download steps
Data is downloaded with the prefetch command-line tool from NCBI's SRA Toolkit, using either a cart file or an SRA accession.
- Download and install Aspera Connect
Aspera: a high-speed file transfer system that makes downloading data easier.
Download link: https://downloads.asperasoft.com/en/downloads/8?list
- Select the data and save the selection to a cart file
(Besides a cart file, you can also download by SRA accession; see step 5 for details.)
- Log in to dbGaP
- Click My Requests to see the approved requests
- View the request file
Choose dbGaP File Selector to download genotype and phenotype data
Choose SRA Run Selector to download SRA data
(Wait until the page has finished loading. Click the "Help" icon at the top of the page to see instructions and information about the selector.)
- Select the data and download the cart file (non-SRA data in this example)
The non-SRA cart file and the downloaded SRA cart file
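- Configure the SRA Toolkit
- Download the latest SRA Toolkit and unpack it
(https://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software)
- Before using the toolkit, configure it as described in the Protected Data Usage Guide and import the dbGaP repository key (this is not needed if your SRA Toolkit version is above 2.10.2). [After recently updating to version 3.0, I found that the dbGaP repository key no longer has to be imported separately.]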
Configuration steps:
The version I use is older than 2.10.2, so it has to be configured:
Quick Toolkit Configuration
https://github.com/ncbi/sra-tools/wiki/03.-Quick-Toolkit-Configuration
$ vdb-config -i
A. Select "Remote Access"
B. Go to "Cache", select "local file-caching", and set the path (it must be an empty folder)
C. Go to "cloud provider" and select "report cloud instance identity"
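- Import the "dbGaP repository key" while configuring the SRA Toolkit
After configuration, a folder like ~/ncbi/dbGap-XXXXX (also called the working directory) is created automatically.
This directory contains subdirectories such as sra, refseq, and so on.
[After recently updating to version 3.0, I found that the dbGaP repository key no longer has to be imported separately.] prefetch now has an --ngc option, so you just pass the key when downloading: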
prefetch --ngc prj_33085.ngc --cart cart_DAR116028_202209070105.krt
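- The dbGaP repository key file contains the information the SRA Toolkit needs to identify the applicant and the project that the dbGaP data belongs to. So how do you download it?
In the Actions column, find "get dbGaP repository key" for the project of the approved data and download it; you get a file in .ngc format.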
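What is a cart file or an SRA accession?
- Batch of data
A cart file lists a batch of dbGaP non-SRA and SRA data files
- Single SRA run
When you have a single SRR accession, you can download that single SRA run
In either case, the SRA Toolkit must be configured with the dbGaP repository key before you run the command.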
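- Download the data with prefetch
Run the prefetch command from the dbGaP project directory created during configuration, giving the full path to the cart file.
nohup and the trailing & run the command in the background
-X 9999999999999 raises the download size limit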
> nohup prefetch -X 9999999999999 /public/home/liuxs/taozy/dbGap/cart_DAR94672_202007210554.krt &
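The two ways of calling prefetch shown in this post (a cart file, or a single accession, with or without the key passed via --ngc) can be summarized in a small helper. This is only a sketch: the function name build_prefetch_cmd and the variables NGC_FILE and MAX_SIZE are made up for illustration, and it assumes the 3.0-style --cart form used earlier. It only prints the command, so it can be checked before anything is downloaded.

```shell
# Sketch only: print the prefetch command for a cart file or a single accession.
# NGC_FILE and MAX_SIZE are hypothetical variable names, not toolkit settings.
build_prefetch_cmd() {
    bp_target="$1"      # a .krt cart file, or an accession such as SRR7554958
    bp_cmd="prefetch"

    # Toolkit 3.0 style: hand over the repository key at download time.
    if [ -n "${NGC_FILE:-}" ]; then
        bp_cmd="$bp_cmd --ngc $NGC_FILE"
    fi

    # -X raises the maximum download size.
    if [ -n "${MAX_SIZE:-}" ]; then
        bp_cmd="$bp_cmd -X $MAX_SIZE"
    fi

    case "$bp_target" in
        *.krt) bp_cmd="$bp_cmd --cart $bp_target" ;;   # cart file
        *)     bp_cmd="$bp_cmd $bp_target" ;;          # single accession
    esac

    printf '%s\n' "$bp_cmd"
}
```

For example, with NGC_FILE=prj_33085.ngc set, `build_prefetch_cmd cart_DAR116028_202209070105.krt` prints the same command line as the 3.0 example above. Paths containing spaces are not handled; for those it is safer to type the command by hand.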
Converting an sra file to fastq reported an error, so I checked the file with vdb-validate:
(wes) [myname@HPC-login sra]$ vdb-validate SRR7554958
2020-07-23T02:26:44 vdb-validate.2.10.0 info: Validating '/public/home/liuxs/ncbi/dbGaP-26086/sra/SRR7554958.sra'...
2020-07-23T02:26:44 vdb-validate.2.10.0 info: Validating encrypted file '/public/home/liuxs/ncbi/dbGaP-26086/sra/SRR7554958.sra'...
2020-07-23T02:27:31 vdb-validate.2.10.0 info: Encrypted file '/public/home/liuxs/ncbi/dbGaP-26086/sra/SRR7554958.sra' appears valid
2020-07-23T02:27:34 vdb-validate.2.10.0 info: Database 'SRR7554958.sra' metadata: md5 ok
2020-07-23T02:27:34 vdb-validate.2.10.0 info: Table 'PRIMARY_ALIGNMENT' metadata: md5 ok
2020-07-23T02:27:34 vdb-validate.2.10.0 info: Column 'GLOBAL_REF_START': checksums ok
2020-07-23T02:27:35 vdb-validate.2.10.0 info: Column 'HAS_MISMATCH': checksums ok
2020-07-23T02:27:36 vdb-validate.2.10.0 info: Column 'HAS_REF_OFFSET': checksums ok
2020-07-23T02:27:36 vdb-validate.2.10.0 info: Column 'MAPQ': checksums ok
2020-07-23T02:27:37 vdb-validate.2.10.0 info: Column 'MISMATCH': checksums ok
2020-07-23T02:27:37 vdb-validate.2.10.0 info: Column 'REF_LEN': checksums ok
2020-07-23T02:27:38 vdb-validate.2.10.0 info: Column 'REF_OFFSET': checksums ok
2020-07-23T02:27:38 vdb-validate.2.10.0 info: Column 'REF_OFFSET_TYPE': checksums ok
2020-07-23T02:27:38 vdb-validate.2.10.0 info: Column 'REF_ORIENTATION': checksums ok
2020-07-23T02:27:39 vdb-validate.2.10.0 info: Column 'SEQ_READ_ID': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'SEQ_SPOT_ID': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Table 'REFERENCE' metadata: md5 ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'CGRAPH_HIGH': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'CGRAPH_INDELS': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'CGRAPH_LOW': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'CGRAPH_MISMATCHES': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'CIRCULAR': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'CS_KEY': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'OVERLAP_REF_LEN': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'OVERLAP_REF_POS': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'PRIMARY_ALIGNMENT_IDS': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'SECONDARY_ALIGNMENT_IDS': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'SEQ_ID': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'SEQ_LEN': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'SEQ_START': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Table 'SECONDARY_ALIGNMENT' metadata: md5 ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'GLOBAL_REF_START': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'HAS_REF_OFFSET': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'MAPQ': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'MATE_REF_ID': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'MATE_REF_ORIENTATION': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'MATE_REF_POS': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'REF_LEN': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'REF_OFFSET': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'REF_OFFSET_TYPE': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'REF_ORIENTATION': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'SEQ_READ_ID': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'SEQ_SPOT_ID': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'TEMPLATE_LEN': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'TMP_HAS_MISMATCH': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'TMP_MISMATCH': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Table 'SEQUENCE' metadata: md5 ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'ALIGNMENT_COUNT': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'CMP_ALTREAD': checksums ok
2020-07-23T02:27:44 vdb-validate.2.10.0 info: Column 'CMP_READ': checksums ok
2020-07-23T02:27:44 vdb-validate.2.10.0 info: Column 'PLATFORM': checksums ok
2020-07-23T02:27:47 vdb-validate.2.10.0 info: Column 'PRIMARY_ALIGNMENT_ID': checksums ok
2020-07-23T02:28:58 vdb-validate.2.10.0 info: Column 'QUALITY': checksums ok
2020-07-23T02:29:00 vdb-validate.2.10.0 info: Column 'RD_FILTER': checksums ok
2020-07-23T02:29:03 vdb-validate.2.10.0 info: Column 'READ_TYPE': checksums ok
2020-07-23T02:29:51 vdb-validate.2.10.0 info: Referential Integrity: SEQ_SPOT_ID <-> PRIMARY_ALIGNMENT_ID 76.3% complete
2020-07-23T02:29:53 vdb-validate.2.10.0 info: Referential Integrity: SEQ_SPOT_ID <-> PRIMARY_ALIGNMENT_ID 100.0% complete
2020-07-23T02:29:53 vdb-validate.2.10.0 info: Database '/public/home/liuxs/ncbi/dbGaP-26086/sra/SRR7554958.sra': SEQUENCE.PRIMARY_ALIGNMENT_ID <-> PRIMARY_ALIGNMENT.SEQ_SPOT_ID referential integrity ok
2020-07-23T02:30:10 vdb-validate.2.10.0 info: Referential Integrity: REF_ID <-> PRIMARY_ALIGNMENT_IDS 76.3% complete
2020-07-23T02:30:11 vdb-validate.2.10.0 info: Referential Integrity: REF_ID <-> PRIMARY_ALIGNMENT_IDS 100.0% complete
2020-07-23T02:30:11 vdb-validate.2.10.0 info: Database '/public/home/liuxs/ncbi/dbGaP-26086/sra/SRR7554958.sra': REFERENCE.PRIMARY_ALIGNMENT_IDS <-> PRIMARY_ALIGNMENT.REF_ID referential integrity ok
2020-07-23T02:30:11 vdb-validate.2.10.0 info: Database 'SRR7554958.sra' is consistent
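To check every run in a folder instead of one at a time, a loop like the one below might help. It is a rough sketch, not something from the official guide: validate_all and VALIDATE_CMD are hypothetical names, and it assumes that vdb-validate exits with a non-zero status when a file is not consistent.

```shell
# Sketch only: run the validator on every .sra file in a directory and
# print the accession of each file whose validation fails.
# Assumption: the validator exits non-zero for a bad file.
validate_all() {
    va_dir="$1"
    va_failed=0
    for va_file in "$va_dir"/*.sra; do
        [ -e "$va_file" ] || continue    # the directory has no .sra files
        if ! "${VALIDATE_CMD:-vdb-validate}" "$va_file" >/dev/null 2>&1; then
            basename "$va_file" .sra     # e.g. prints SRR7554958
            va_failed=1
        fi
    done
    return "$va_failed"                  # 0 if every file passed, 1 otherwise
}
```

Running something like `validate_all ~/ncbi/dbGaP-26086/sra > failed.txt` would leave the accessions that need another look in failed.txt.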
Decrypting phenotype data
The downloaded phenotype files have the suffix .ncbi_enc and need to be decrypted.
This takes two steps: importing the key, then decrypting.
$ vdb-config --import xxxx.ngc
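$ vdb-decrypt xx.ncbi_enc # decrypt a single file
$ vdb-decrypt ~/ncbi/dbGaP-26086/files/ # decrypt the whole folder that holds the phenotype data
After decryption, the suffix is gone and the files are back in their normal format.
[Newer versions changed this: vdb-config --import no longer works, and its function has been folded into vdb-decrypt --ngc.]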
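Since decrypted files lose the .ncbi_enc suffix, one quick way to confirm that a folder was fully decrypted is to look for files that still carry it. The helper below is a sketch with a made-up name (list_encrypted). The commented commands show how it might be combined with the newer --ngc form; the file names there are the placeholders used above.

```shell
# Sketch only: print every file under a directory that still ends in .ncbi_enc,
# i.e. files that have not been decrypted yet.
list_encrypted() {
    find "$1" -type f -name '*.ncbi_enc' | sort
}

# Possible usage with a newer toolkit (placeholder paths):
#   vdb-decrypt --ngc xxxx.ngc ~/ncbi/dbGaP-26086/files/
#   list_encrypted ~/ncbi/dbGaP-26086/files/
```

If the second command prints nothing, no encrypted files are left in the folder.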
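What to do when some sra files fail to download
Collect the names of the SRRXXX runs that failed, put them in a new file, and run prefetch on the entries in that file
Steps:
- Create a shell script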
$ vi download.sh
The script uses cat to read the file line by line; each line of my file is an SRA accession, which is passed directly to `prefetch`.
- Submit the shell script with nohup
The download starts...
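A script along these lines could look like the sketch below. It is not the original download.sh: the function names, the PREFETCH_CMD variable, and the idea of comparing a full accession list against the sra folder are all assumptions for illustration. It also assumes the flat layout seen above, where each run is stored as sra/SRRXXX.sra.

```shell
# Sketch only, not the original download.sh.
# missing_runs LIST DIR: print each accession in LIST that has no DIR/<accession>.sra
missing_runs() {
    while IFS= read -r mr_acc || [ -n "$mr_acc" ]; do
        [ -n "$mr_acc" ] || continue                 # skip blank lines
        [ -e "$2/$mr_acc.sra" ] || printf '%s\n' "$mr_acc"
    done < "$1"
}

# retry_downloads FILE: read FILE with cat and call prefetch once per accession
retry_downloads() {
    for rd_acc in $(cat "$1"); do
        "${PREFETCH_CMD:-prefetch}" "$rd_acc" || echo "still failing: $rd_acc" >&2
    done
}
```

With these two functions in download.sh followed by, for example, `missing_runs all_runs.txt ~/ncbi/dbGaP-26086/sra > failed.txt` and `retry_downloads failed.txt`, the script could be submitted with `nohup sh download.sh &` (all_runs.txt and failed.txt are hypothetical file names).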
File organization:
- From top to bottom: the cart file (selected accession for processing sra toolkit), the key, and the downloaded SRA content (full list of accession recordset)
[Image failed to upload]
- Download the phenotype data
- What are the files that appear during the download used for?
[Image failed to upload]