Sequence data formats
1. Common sequence data formats include GenBank, FASTA, and FASTQ. GenBank and FASTA files usually hold curated sequence information, while FASTQ files usually hold experimentally obtained reads.
(1) GenBank file format
GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.
More information on GenBank format can be found here
When do we use the GenBank format?
The GenBank format can represent a variety of information while keeping it human-readable. It is not well suited to automated data analysis.
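Because GenBank files are written for human readers, a common first step before analysis is to pull the sequence out into a simpler format. The following is a minimal sketch, assuming a standard flat-file layout in which the LOCUS line names the record, the sequence follows the ORIGIN keyword, and "//" ends the record. The function name genbank_to_fasta and the sample record are hypothetical and only for illustration.

```shell
# genbank_to_fasta (hypothetical helper): read a GenBank flat file on stdin
# and print the sequence of each record in FASTA format.
genbank_to_fasta() {
  awk '
    /^LOCUS/  { name = $2 }                            # second field is the locus name
    /^ORIGIN/ { in_seq = 1; printf ">%s\n", name; next }
    /^\/\//   { in_seq = 0; next }                     # "//" terminates a record
    in_seq    { gsub(/[0-9 \t]/, ""); print toupper($0) }  # drop position numbers and spaces
  '
}

# Usage with a made-up record (not a real GenBank entry)
genbank_to_fasta <<'EOF'
LOCUS       EXAMPLE1      20 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical record for illustration only.
ORIGIN
        1 gatcctccat atacaacggt
//
EOF
```

The sketch discards all annotation, so it is only appropriate when the sequence alone is needed.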
(2) FASTA format
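In bioinformatics, FASTA is a text-based format for recording nucleotide or peptide sequences, in which each nucleotide or amino acid is represented by a single-letter code. The format also allows a name and comments to precede each sequence. It was originally defined by the FASTA software package but has since become a standard in bioinformatics.
The simplicity of the FASTA format makes sequences easy to manipulate and analyze with text-processing tools and scripting languages such as Python, Ruby, and Perl.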
FASTA represents nucleotide or peptide sequences as plain text. It does not contain sequence quality information.
Reference: Wikipedia, FASTA format (Chinese edition)
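To illustrate how easily FASTA can be handled with ordinary text tools, here is a small sketch that lists each record's identifier and sequence length. It assumes that the identifier is the first word after ">" and that a sequence may be wrapped over several lines. The function name fasta_lengths and the sample records are made up for this example.

```shell
# fasta_lengths (hypothetical helper): read FASTA on stdin and print
# "<identifier><TAB><sequence length>" for each record.
fasta_lengths() {
  awk '
    /^>/ {                                   # a header line starts a new record
      if (name != "") print name "\t" len
      name = substr($1, 2)                   # identifier = first word after ">"
      len = 0
      next
    }
    { gsub(/[ \t\r]/, ""); len += length($0) }   # sequence may span several lines
    END { if (name != "") print name "\t" len }
  '
}

# Usage with two made-up records (one nucleotide, one peptide)
fasta_lengths <<'EOF'
>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2
MKV
EOF
```

For the sample input this prints seq1 with length 12 and seq2 with length 3.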
(3) FASTQ file format
FASTQ extends the FASTA format by adding a sequencing quality score (Phred score) for each base.
Please refer to the following references:
- Interpreting FASTA and FASTQ format files (in Chinese)
- Wikipedia: FASTQ format (Simplified Chinese or English edition)
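In a FASTQ file, a sequence record normally consists of four lines:
- Line 1 begins with @, followed by the sequence identifier and an optional description (similar to the FASTA description line).
- Line 2 is the sequence itself.
- Line 3 begins with +, optionally followed by the sequence identifier and description again.
- Line 4 holds the quality scores. It corresponds to the sequence in line 2 and must be the same length as line 2.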
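The four-line layout described above can be checked mechanically. The sketch below assumes strict four-line records (no line wrapping inside a record), which is the usual layout but is not guaranteed by every tool. The function name fastq_check and the sample read are hypothetical.

```shell
# fastq_check (hypothetical helper): read FASTQ on stdin, print
# "<identifier><TAB><read length>" per record, and return non-zero
# if a record does not follow the four-line layout.
fastq_check() {
  awk '
    NR % 4 == 1 { ok = ($0 ~ /^@/); id = substr($1, 2) }   # line 1: @identifier
    NR % 4 == 2 { seq = $0 }                               # line 2: sequence
    NR % 4 == 3 { ok = ok && ($0 ~ /^\+/) }                # line 3: + separator
    NR % 4 == 0 {                                          # line 4: quality string
      ok = ok && (length($0) == length(seq))
      if (!ok) { print "malformed record ending at line " NR > "/dev/stderr"; bad = 1; exit }
      print id "\t" length(seq)
    }
    END { if (NR % 4 != 0) bad = 1; exit bad + 0 }         # truncated final record
  '
}

# Usage with one made-up read
fastq_check <<'EOF'
@read1 hypothetical example
ACGTACGT
+
IIIIIIII
EOF
```

A quality string that is shorter or longer than its sequence makes the function return a non-zero status, which is convenient in scripts that should stop on damaged input.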
The character '!' represents the lowest quality while '~' is the highest. Here are the quality value characters in left-to-right increasing order of quality (ASCII):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
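The character range above runs from '!' (ASCII 33) to '~' (ASCII 126). Assuming the Phred+33 encoding, in which the score is the ASCII code minus 33, a quality character can be decoded as in the sketch below. Older Illumina files used a different offset, so treat the offset as an assumption about your data. The function names phred33_score and phred_error_prob are made up for illustration; the error probability uses the standard Phred relation P = 10^(-Q/10).

```shell
# phred33_score (hypothetical helper): print the Phred score of a single
# quality character, assuming the Phred+33 offset.
phred33_score() {
  code=$(printf '%d' "'$1")      # ASCII code of the character
  echo "$((code - 33))"
}

# phred_error_prob (hypothetical helper): convert a Phred score Q into the
# estimated base-call error probability P = 10^(-Q/10).
phred_error_prob() {
  awk -v q="$1" 'BEGIN { printf "%.6f\n", 10 ^ (-q / 10) }'
}

phred33_score '!'      # prints 0
phred33_score 'I'      # prints 40
phred33_score '~'      # prints 93
phred_error_prob 20    # prints 0.010000
```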
Further reading:
Differences between FASTA, FASTQ and SAM formats
2. Databases that contain gene sequencing data
- NCBI GEO: search for datasets (sequencing data from a series of participants)
- NCBI SRA: search for sequencing data from individual participants
- ArrayExpress: Experiments are submitted directly to ArrayExpress or are imported from the NCBI Gene Expression Omnibus database. For high-throughput sequencing based experiments the raw data is brokered to the European Nucleotide Archive, while the experiment descriptions and processed data are archived in ArrayExpress.
- European Nucleotide Archive: Learn more about how to use ENA by reading ENA: Guidelines and Tips.
I prefer NCBI GEO and SRA because I can use Aspera to download SRA files, which is very fast. It's best to keep the Aspera Connect software up to date.
Install Aspera Connect on Ubuntu Linux
mkdir -p ~/biosoft/ascp && cd ~/biosoft/ascp
wget https://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz
tar -zxvf aspera-connect-3.7.4.147727-linux-64.tar.gz
bash aspera-connect-3.7.4.147727-linux-64.sh
# Installing Aspera Connect
# Deploying Aspera Connect (/home/jshi/.aspera/connect) for the current user only.
# Unable to update desktop database, Aspera Connect may not be able to auto-launch
# Restart firefox manually to load the Aspera Connect plug-in
# Install complete.
# Create a symbolic link so that ascp is on the PATH
sudo ln -s /home/jshi/.aspera/connect/bin/ascp /usr/bin/ascp
ascp -h # help
ascp -A # version
If you have an older version of Aspera Connect, uninstall it before installing a newer one. In practice, this means deleting the related files in the following locations:
# ~/.mozilla/plugins/libnpasperaweb.so
# ~/.aspera/connect
rm ~/.mozilla/plugins/libnpasperaweb_{connect build #}.so
rm -rf ~/.aspera/connect
3. How to download SRA files from the NCBI SRA database?
The SRA group recommends the prefetch program provided in the SRA Toolkit. More details can be found in the Download Guide.
1. Download SRA files by using prefetch
I don't recommend installing the SRA Toolkit with sudo apt-get install sra-toolkit, because the packaged version may be outdated. I prefer to install the latest release. Downloaded SRA files are stored in the default folder ~/ncbi/public/sra.
# Install SRAtoolkit
mkdir -p ~/biosoft/sratools && cd ~/biosoft/sratools
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.8.2-1/sratoolkit.2.8.2-1-ubuntu64.tar.gz
tar -zxvf sratoolkit.2.8.2-1-ubuntu64.tar.gz
# Add the toolkit's bin directory to PATH
echo 'export PATH=$PATH:/home/jshi/biosoft/sratools/sratoolkit.2.8.2-1-ubuntu64/bin' >> ~/.bashrc
source ~/.bashrc
prefetch can download SRA files in several ways, and Aspera is the default. To make prefetch use only Aspera, run the following commands.
mkdir -p ~/data/project/GSE48240 && cd ~/data/project/GSE48240
# manually generate SRA file list
touch GSE48240.txt
for i in $(seq -w 1 3); do echo "SRR92222""$i" >>GSE48240.txt;done
# Alternatively, use esearch and efetch (Entrez Direct) to generate the SRA run list
esearch -db sra -query PRJNA209632 | efetch -format runinfo | cut -f 1 -d ',' |grep SRR >> GSE48240.txt
prefetch -t ascp -a "/usr/bin/ascp|/home/jshi/.aspera/connect/etc/asperaweb_id_dsa.openssh" --option-file GSE48240.txt
Alternatively, you can use curl, wget, or ftp to download from the generated download links, but this will be as slow as a snail.
2. Convert SRA files to FASTQ files on the fly
This is a better approach if you don't have much disk space for the SRA files. fastq-dump will convert SRA files to FASTQ files on the fly.
# Run fastq-dump once per accession in the list (put echo before fastq-dump for a dry run)
xargs -n 1 fastq-dump --split-files < GSE48240.txt
other
- Renaming individual variables in R (reshape package) with the names() function
names(leadership)
names(leadership)[2] <- "testDate"
names(leadership)[6:10] <- c("item1", "item2", "item3", "item4", "item5")
- How do I remove part of a string?
https://stackoverflow.com/questions/9704213/r-remove-part-of-string
gsub
sub_str
# install bioawk
sudo apt-get install bison
cd ~/biosoft
git clone https://github.com/lh3/bioawk
cd bioawk
make
sudo cp bioawk /usr/local/bin
# Download and unzip the file on the fly.
curl http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz | gunzip -c > chr22.fa
# Look at the file
cat chr22.fa | head -4
# Count how many "N" are in chr22 sequence
cat chr22.fa | grep -o N | wc -l
# Count how many bases are in Chr22?
cat chr22.fa | bioawk -c fastx '{ print length($seq) }'
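If bioawk is not available, the same counts can be approximated with plain awk. The following is a sketch rather than a replacement for bioawk: it skips header lines, sums the length of every sequence line, and counts N characters. Unlike the grep command above, it also counts lowercase n, which is a deliberate choice for this example. The function name fasta_base_counts and the sample input are hypothetical.

```shell
# fasta_base_counts (hypothetical helper): read FASTA on stdin and print the
# total number of bases and the number of N/n characters across all records.
fasta_base_counts() {
  awk '
    /^>/ { next }                      # skip header lines
    {
      total += length($0)              # count bases before gsub edits the line
      n += gsub(/[Nn]/, "")            # gsub returns the number of replacements
    }
    END { printf "bases=%d N=%d\n", total, n }
  '
}

# Usage with a tiny made-up sequence; for the real file one could run
#   fasta_base_counts < chr22.fa
fasta_base_counts <<'EOF'
>toy
NNNNACGT
ACGTNN
EOF
```

For the toy input this prints bases=14 N=6.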