Biostar Study Notes (4): GenBank, FASTA, FASTQ, and downloading SRA files from NCBI

Sequence data formats

1. Common sequence data formats include GenBank, FASTA, and FASTQ. GenBank and FASTA usually represent curated sequence information, while FASTQ usually represents experimentally obtained data.

(1) GenBank file format

GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.
More information on the GenBank format is available on the NCBI website.

When do we use the GenBank format?

The GenBank format can represent a wide variety of information while keeping it human-readable. It is not well suited to automated data analysis.

(2) FASTA format

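In bioinformatics, FASTA is a text-based format for recording nucleotide or peptide sequences, in which each nucleotide or amino acid is represented by a single-letter code. The format also allows a name and comments to precede each sequence. It was originally defined by the FASTA software package but has since become a standard in bioinformatics.
The simplicity of FASTA makes sequences easy to manipulate and analyze with text-processing tools and with scripting languages such as Python, Ruby, and Perl.
FASTA does not contain sequence quality information.
Reference: Wikipedia, "FASTA format" (Chinese edition)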
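
As a small illustration of how easily a scripting language can handle FASTA, here is a minimal Python sketch of a parser. The function name `parse_fasta` and the two example records are made up for this note; it assumes each record starts with a `>` header line and that the first word of the header is the identifier.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: take the first word after '>' as the identifier
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            # Sequence lines may be wrapped, so concatenate them
            records[name] += line
    return records


# Hypothetical records, one nucleotide and one peptide
example = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2
MKVLA
"""
records = parse_fasta(example.splitlines())
print(records)
```

For real work a library such as Biopython or a tool such as bioawk is usually a better choice; the sketch only shows the structure of the format.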

(3) FASTQ file format

FASTQ extends the FASTA format with a sequencing quality score (Phred score) for each base.
Please refer to the following references:

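  1. "Interpreting FASTA and FASTQ format files" (Chinese-language article)
  2. Wikipedia, "FASTQ format" (Simplified Chinese or English edition)
    In a FASTQ file, a sequence record normally consists of four lines:
    Line 1 begins with @, followed by the sequence identifier and description (similar to the description line in FASTA)
    Line 2 holds the sequence
    Line 3 begins with +, optionally followed by the sequence identifier and description again
    Line 4 holds the quality scores, which correspond to the sequence on line 2 and must have the same length as line 2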
    The character '!' represents the lowest quality while '~' is the highest. Here are the quality value characters in left-to-right increasing order of quality (ASCII):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
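
The four-line layout and the quality characters above can be sketched in Python as follows. This is only an illustration: `read_fastq` and the example read are made up, and the code assumes the Phred+33 encoding (quality = ASCII code minus 33), which matches '!' being the lowest character. Older Illumina files that use a different offset would need a different constant.

```python
from typing import Iterator, List, Tuple


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (identifier line, sequence, quality scores) per four-line record."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (s.rstrip("\n") for s in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Assumed Phred+33 encoding: quality = ASCII code - 33
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# Hypothetical single read
example = ["@read1 hypothetical", "ACGT", "+", "!5I~"]
parsed = list(read_fastq(example))
print(parsed)
```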

Further reading:
Differences between FASTA, FASTQ and SAM formats

2. Databases that contain gene sequencing data

  1. NCBI GEO: search for datasets (sequencing data from a series of participants)
  2. NCBI SRA: search for sequencing data from an individual participant
  3. ArrayExpress: Experiments are submitted directly to ArrayExpress or are imported from the NCBI Gene Expression Omnibus database. For high-throughput sequencing based experiments the raw data is brokered to the European Nucleotide Archive, while the experiment descriptions and processed data are archived in ArrayExpress.
  4. European Nucleotide Archive: Learn more about how to use ENA by reading ENA: Guidelines and Tips.

I prefer NCBI GEO and SRA because I can use Aspera to download SRA files, which is very fast. It is best to keep the Aspera Connect software up to date.

Install Aspera Connect on Ubuntu Linux

mkdir -p ~/biosoft/ascp && cd ~/biosoft/ascp
wget https://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz
tar -zxvf aspera-connect-3.7.4.147727-linux-64.tar.gz
bash aspera-connect-3.7.4.147727-linux-64.sh
# Installing Aspera Connect
# Deploying Aspera Connect (/home/jshi/.aspera/connect) for the current user only.
# Unable to update desktop database, Aspera Connect may not be able to auto-launch
# Restart firefox manually to load the Aspera Connect plug-in
# Install complete.
# create a symbolic link so that ascp is on the PATH
sudo ln -s /home/jshi/.aspera/connect/bin/ascp /usr/bin/ascp
ascp -h # help
ascp -A # version

If you have an older version of Aspera Connect, uninstall it before installing the newer one. In practice this means deleting the related files in the following locations:

# ~/.mozilla/plugins/libnpasperaweb.so
# ~/.aspera/connect
# replace {connect build #} with the build number of the installed plug-in
rm ~/.mozilla/plugins/libnpasperaweb_{connect build #}.so
rm -rf ~/.aspera/connect

3. How to download SRA files from the NCBI SRA database

The SRA group recommends the prefetch program provided in the SRA Toolkit. More details can be found in the Download Guide.

1. Download SRA files by using prefetch

I don't recommend installing the SRA Toolkit with sudo apt-get install sratoolkit because the packaged version may be outdated. I prefer to install the latest release myself.
SRA files are saved in the default folder ~/ncbi/public/sra.

# Install SRAtoolkit
mkdir -p ~/biosoft/sratools && cd ~/biosoft/sratools
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.8.2-1/sratoolkit.2.8.2-1-ubuntu64.tar.gz
tar -zxvf sratoolkit.2.8.2-1-ubuntu64.tar.gz
# Add the toolkit's bin directory to PATH
echo 'export PATH=$PATH:/home/jshi/biosoft/sratools/sratoolkit.2.8.2-1-ubuntu64/bin' >> ~/.bashrc
source ~/.bashrc

prefetch can download SRA files in several different ways, and the default is Aspera. If you want prefetch to use only Aspera, you can use the following code.

mkdir -p ~/data/project/GSE48240 && cd ~/data/project/GSE48240
# manually generate SRA file list
touch GSE48240.txt
for i in $(seq -w 1 3); do echo "SRR92222""$i" >>GSE48240.txt;done
# Alternatively, use esearch and efetch (Entrez Direct) to generate the SRA file list
esearch -db sra -query PRJNA209632 | efetch -format runinfo | cut -f 1 -d ',' |grep SRR >> GSE48240.txt
prefetch -t ascp -a "/usr/bin/ascp|/home/jshi/.aspera/connect/etc/asperaweb_id_dsa.openssh" --option-file GSE48240.txt
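
The `cut -f 1 -d ',' | grep SRR` step keeps the first comma-separated column of the runinfo table and drops the header row. A rough Python equivalent is sketched below. The function name `run_accessions` and the shortened two-column sample with placeholder accessions are made up for illustration; real runinfo output has many more columns.

```python
import csv
import io
from typing import List


def run_accessions(runinfo_csv: str) -> List[str]:
    """Return first-column values containing 'SRR', like cut + grep above."""
    runs = []
    for row in csv.reader(io.StringIO(runinfo_csv)):
        if row and "SRR" in row[0]:
            runs.append(row[0])
    return runs


# Made-up, abbreviated sample (placeholder accessions and dates)
sample = "Run,ReleaseDate\nSRR0000001,2000-01-01\nSRR0000002,2000-01-01\n"
print("\n".join(run_accessions(sample)))
```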

Alternatively, you can use curl, wget, or ftp to download from generated download links, but these are much slower.

2. Convert SRA files to FASTQ files on the fly

This is the better approach if you don't have much disk space for the SRA files. fastq-dump will convert SRA files to FASTQ files on the fly.

# run fastq-dump once per accession listed in GSE48240.txt
cat GSE48240.txt | xargs -n 1 fastq-dump --split-files
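
If you would rather drive the conversion from a script, the same one-command-per-accession idea can be sketched in Python. The helper `fastq_dump_commands` is hypothetical; the sketch only builds and prints the commands (a dry run) and does not call fastq-dump itself.

```python
from typing import List


def fastq_dump_commands(accessions: List[str]) -> List[List[str]]:
    """Build one 'fastq-dump --split-files <accession>' argument list per run."""
    return [["fastq-dump", "--split-files", acc.strip()]
            for acc in accessions if acc.strip()]


# Accessions as written by the seq loop earlier in this post
accessions = ["SRR922221", "SRR922222", "SRR922223"]
commands = fastq_dump_commands(accessions)
for cmd in commands:
    print(" ".join(cmd))  # dry run only
```

To actually run the conversions, each argument list could be passed to `subprocess.run(cmd, check=True)`, assuming fastq-dump is on the PATH.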

other

  1. Renaming individual variables in R (reshape package): use the names() function
names(leadership)

names(leadership)[2] <- "testDate"

names(leadership)[6:10] <- c("item1", "item2", "item3", "item4", "item5")
  2. How do I remove part of a string?
    https://stackoverflow.com/questions/9704213/r-remove-part-of-string
    gsub
    sub_str
# install bioawk
sudo apt-get install bison
cd ~/biosoft
git clone https://github.com/lh3/bioawk
cd bioawk
make
sudo cp bioawk /usr/local/bin
# Download and unzip the file on the fly. 
curl http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz | gunzip -c > chr22.fa

# Look at the file
cat chr22.fa | head -4

# Count how many "N" are in chr22 sequence
cat chr22.fa | grep -o N  | wc -l

# Count how many bases are in Chr22?
cat chr22.fa | bioawk -c fastx '{ print length($seq) }' 
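
The two counts above (total bases and number of "N") can also be computed in one pass with a short Python sketch. The function name `count_bases` and the tiny example record are made up; like `grep -o N`, the sketch counts only uppercase N.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_bases(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total bases, number of 'N') over all sequence lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith(">"):
            continue  # header lines carry no sequence
        counts.update(line.strip())
    return sum(counts.values()), counts["N"]


# Hypothetical miniature FASTA record
example = [">chrTest\n", "NNNNACGT\n", "acgtNN\n"]
total, n_count = count_bases(example)
print(total, n_count)
```

On the real file this could be called as `count_bases(open("chr22.fa"))`, since a file object iterates line by line.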