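Lately I have been mentoring one intern or rotation student after another, and I can already see that for the next few years, whenever the lab takes on an intern or a rotation student, nine times out of ten I will be the one mentoring them.

This post lists some commonly used bioinformatics tools and a few wonderful websites. Almost every tool comes with a one- or two-sentence summary of what it does and a link to its official site.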
To my junior labmate: everything you asked for is here.
Common bioinformatics tools
FASTQ format
- SRAtoolkit
  - The tool for downloading public data from the SRA database
- fastx toolkit
  - a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing
  - Offers all sorts of small utilities, such as extracting reverse-complement sequences.
  - http://hannonlab.cshl.edu/fastx_toolkit/
- fastqc
  - A quality control tool for high throughput sequence data
  - Assesses the quality of sequencing data
  - https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- MultiQC
  - Aggregate results from bioinformatics analyses across many samples into a single report
  - Summarizes many quality reports in one go, which saves time and makes samples easy to compare; supports fastqc output
  - https://github.com/ewels/MultiQC; http://multiqc.info/docs/
- Trim Galore
  - A wrapper around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files,
  - with some extra functionality for MspI-digested RRBS-type (Reduced Representation Bisulfite-Seq) libraries
  - From the same group as fastqc; can be combined with fastqc to clean raw data.
  - https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
- Trimmomatic
  - A flexible read trimming tool for Illumina NGS data
  - A tool dedicated to cleaning Illumina sequencing data
  - http://www.usadellab.org/cms/index.php?page=trimmomatic
- khmer
  - working with DNA shotgun sequencing data from genomes, transcriptomes, metagenomes, and single cells.
  - Can filter raw sequencing data, among other things
  - http://khmer.readthedocs.io/en/v2.1.1/user/scripts.htm
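To make the "reverse-complement" utility mentioned under fastx toolkit concrete, here is a minimal pure-Python sketch of the operation. It does not call fastx toolkit; the function names `reverse_complement` and `revcomp_fastq_record` are made up for illustration, and it assumes the usual four-line FASTQ record with only A/C/G/T/N bases.

```python
# Hypothetical helpers for illustration only; not part of fastx toolkit.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def revcomp_fastq_record(record: str) -> str:
    """Reverse-complement one four-line FASTQ record.

    The quality string is reversed too, so each score stays
    attached to the base it belongs to.
    """
    header, seq, plus, qual = record.strip().split("\n")
    return "\n".join([header, reverse_complement(seq), plus, qual[::-1]])
```

For example, `revcomp_fastq_record("@read1\nAACGT\n+\nIIIH#")` turns the sequence `AACGT` into `ACGTT` and the quality string `IIIH#` into `#HIII`. For real data the dedicated tools above are the better choice.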
BED format
- bedops
  - Handles BED files with ease, and is faster than bedtools
  - the fast, highly scalable and easily-parallelizable genome analysis toolkit
  - https://bedops.readthedocs.io/en/latest/index.html
- bedtools
  - The best-known toolkit for BED files, though it does not come from the same team as samtools
  - a powerful toolset for genome arithmetic
  - http://bedtools.readthedocs.io/en/latest/index.html
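"Genome arithmetic" mostly boils down to operations on intervals. As a rough sketch of the idea behind an intersect operation (not how bedops or bedtools implement it), the following assumes BED-style 0-based, half-open coordinates; the `intersect` function and the `Interval` alias are hypothetical names used only here.

```python
from typing import List, Tuple

# (chrom, start, end), 0-based and half-open as in BED
Interval = Tuple[str, int, int]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping segments between two interval lists.

    Naive O(len(a) * len(b)) comparison; real tools rely on sorted
    input or index structures to go much faster.
    """
    overlaps = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:  # half-open: touching intervals do not overlap
                overlaps.append((chrom_a, start, end))
    return overlaps
```

With half-open coordinates, `("chr1", 100, 200)` and `("chr1", 200, 300)` share no base, so they produce no overlap.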
SAM/BAM
- samtools
  - This one tool is all you need
  - Utilities for the Sequence Alignment/Map (SAM) format
  - http://www.htslib.org/doc/samtools.html
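One thing newcomers to SAM/BAM often stumble over is the FLAG column, which packs several yes/no properties into one integer. The sketch below decodes it using the bit values defined in the SAM specification; the short labels and the `decode_flag` function are my own naming for illustration, not samtools output.

```python
# Bit values follow the SAM specification; the labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]
```

For instance, `decode_flag(99)` gives `['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']`, since 99 = 0x1 + 0x2 + 0x20 + 0x40.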
SNP (VCF/BCF) format
- GATK
  - The most widely used software in this category
- bcftools
  - Performs all kinds of operations on VCF files
  - utilities for variant calling and manipulating VCFs and BCFs
  - http://www.htslib.org/doc/bcftools.html
- vcftools
  - Similar to bcftools
- SnpEff
  - Genetic variant annotation and effect prediction toolbox
  - Well suited to SNP annotation
  - Manual: http://snpeff.sourceforge.net/SnpEff_manual.html
  - http://snpeff.sourceforge.net/
  - Can also annotate ChIP-seq data
  - Supports non-coding annotation, such as histone modifications
- samtools mpileup
  - Utilities for the Sequence Alignment/Map (SAM) format
  - http://www.htslib.org/doc/samtools.html
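Before reaching for these tools it helps to know what a VCF record looks like. The following is a minimal sketch that splits one tab-separated data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). `parse_vcf_line` is a hypothetical helper, it ignores header lines and per-sample columns, and it is no substitute for bcftools or a proper VCF library.

```python
from typing import Any, Dict


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Parse the eight fixed columns of a single VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    info_dict: Dict[str, Any] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        # Flag-style entries such as "DB" carry no value
        info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),  # several ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }
```

Calling it on `"chr1\t12345\trs1\tA\tG,T\t50\tPASS\tDP=10;DB"` yields position `12345`, ALT alleles `['G', 'T']`, and INFO `{'DP': '10', 'DB': True}`.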
ChIP-seq/motif
Peak calling
- MACS
  - Model-based Analysis of ChIP-Seq
  - Mainly used for narrow peaks from histone modifications (H3K4me3 and H3K9/27ac)
  - Also for transcription factors, which are usually associated with sharp and isolated peaks
  - http://liulab.dfci.harvard.edu/MACS/README.html
- MACS2
  - The upgraded version of MACS; can also be used to call broad peaks
- SICER
  - Came out as a challenger to MACS; mainly used to find broader peaks, such as H3K9me3 and H3K36me3.
  - highly recommended for a practical ChIP-seq experiment design and can be used to account for local biases resulting from read mappability, DNA repeats, local GC content
  - https://www.genomatix.de/online_help/help_regionminer/sicer.html
- Tools that may come in handy for downstream analysis
- MAnorm
Long sequence alignment
Several commonly used programs for aligning long sequences
- MUMmer
  - rapid alignment of very large DNA and amino acid sequences
  - http://mummer.sourceforge.net/examples/
  - http://mummer.sourceforge.net/manual/
- GMAP
  - GMAP: A Genomic Mapping and Alignment Program for mRNA and EST Sequences
  - http://research-pub.gene.com/gmap/
- BLAT
  - Blat produces two major classes of alignments: at the DNA level between two sequences that are of 95% or greater identity, but which may include large inserts; at the protein or translated DNA level between sequences that are of 80% or greater identity and may also include large inserts.
  - https://genome.ucsc.edu/goldenpath/help/blatSpec.html
Short read alignment
Short-sequence alignment, i.e. aligning next-generation sequencing data
- BWA
  - Burrows-Wheeler Alignment Tool
  - mapping low-divergent sequences against a large reference genome
  - It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM.
  - http://bio-bwa.sourceforge.net/bwa.shtml
  - https://github.com/lh3/bwa
- GSNAP
  - Genomic Short-read Nucleotide Alignment Program
  - http://research-pub.gene.com/gmap/
- Bowtie
  - works best when aligning short reads to large genomes
  - does not yet report gapped alignments
  - http://bowtie-bio.sourceforge.net/manual.shtml
- Bowtie2
  - Differs from its predecessor in that it supports gapped alignments
  - ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences
  - supports gapped, local, and paired-end alignment modes
  - http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#reporting
- HISAT2
  - The successor to TopHat, built on HISAT and Bowtie2
  - HISAT2 is somewhat faster than STAR
  - http://ccb.jhu.edu/software/hisat2/manual.shtml
- STAR
  - Spliced Transcripts Alignment to a Reference
  - https://github.com/alexdobin/STAR
  - https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
Genome-guided assembly
- stringtie
  - highly efficient assembler of RNA-Seq alignments into potential transcripts
  - Relatively accurate at detecting alternative splicing
  - https://ccb.jhu.edu/software/stringtie/
- Cufflinks
  - Largely no longer used
- IDP
  - Isoform Detection and Prediction tool
  - GMAP + HISAT2, i.e. a combination of long- and short-read alignment; works well
  - https://www.healthcare.uiowa.edu/labs/au/IDP/IDP_manual.asp
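De novo assembly / gene prediction
Used together, the tools below cover the whole process from assembly through annotation to assessing assembly quality
Assembly
- Trinity
  - Tends to predict long alternatively spliced isoforms
  - Newer versions have moved away from the earlier over-prediction and become more conservative
  - Resource-hungry; as a rule of thumb, allocate 6 to 10 GB of memory per CPU
  - Supports both genome-guided and de novo transcriptome assembly
  - https://github.com/trinityrnaseq/trinityrnaseq/wiki
- oases
  - Usually yields a relatively high N50
  - Has some advantage in detecting lowly expressed genes
  - De novo transcriptome assembler for very short reads
  - https://github.com/dzerbino/oases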
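Since N50 comes up when comparing assemblers, here is a small sketch of how it is usually computed: sort contig lengths from longest to shortest and report the length at which the running total first reaches half of the total assembly size. The function name `n50` is just an illustrative choice, and the input is assumed to be a plain list of contig or transcript lengths.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")

    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare against half the total without using floats
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")
```

For lengths `[2, 3, 4, 5, 6, 7, 8, 9, 10]` the total is 54; the three longest contigs (10 + 9 + 8 = 27) already reach half of it, so the N50 is 8. A high N50 on its own does not guarantee a good transcriptome assembly, which is why the quality assessment tools in this post exist.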
Annotation
- PASA (includes BLAT and GMAP)
  - Once you have the assembled FASTA file, PASA can be used to predict gene structures
  - Gene Structure Annotation and Analysis Using PASA
  - http://pasapipeline.github.io/
- Maker
  - Gene prediction
  - can be used for de novo annotation of newly sequenced genomes, for updating existing annotations to reflect new evidence, or just to combine annotations, evidence, and quality control statistics
  - http://www.yandell-lab.org/software/maker.html
Quality assessment
- TransRate
  - Dedicated assembly quality assessment software with three evaluation modes.
  - reference free quality assessment of de novo transcriptome assemblies
  - http://hibberdlab.com/transrate/
- DETONATE
  - DE novo TranscriptOme rNa-seq Assembly with or without the Truth Evaluation
  - https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0553-5
- BUSCO
  - Its evaluation approach is rather different from the two above
  - based on evolutionarily-informed expectations of gene content from near-universal single-copy orthologs
Estimating transcript abundance
These tools fall into two groups, alignment-based and alignment-free: RSEM and eXpress are alignment-based, while the other two (kallisto and salmon) are alignment-free.
- RSEM
  - RNA-Seq by Expectation-Maximization
  - https://deweylab.github.io/RSEM/README.html
- eXpress
  - quantifying the abundances of a set of target sequences from sampled subsequences
  - https://pachterlab.github.io/eXpress/overview.html
- kallisto
  - Blazingly fast
  - Low sample-specific and read-length bias in its abundance estimates
  - quantifying abundances of transcripts from RNA-Seq data
  - https://pachterlab.github.io/kallisto/
- salmon
  - Also very fast
  - quantifying the expression of transcripts using RNA-seq data
  - https://combine-lab.github.io/salmon/
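These tools typically report abundance in TPM (transcripts per million). As a simplified sketch of what that unit means, the function below computes TPM from read counts and transcript lengths. It is not the estimation procedure any of the tools above use (they handle multi-mapping reads and effective lengths with statistical models); `tpm` and its dictionary inputs are assumptions made for illustration.

```python
from typing import Dict


def tpm(counts: Dict[str, float], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Convert per-transcript read counts into TPM.

    Step 1: divide each count by the transcript length (reads per base).
    Step 2: rescale so that the values sum to one million.
    """
    rates = {name: counts[name] / lengths_bp[name] for name in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    return {name: rate / total * 1e6 for name, rate in rates.items()}
```

Two transcripts of equal length with 10 and 20 reads get one third and two thirds of the million, respectively; a transcript twice as long needs twice as many reads to reach the same TPM.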
Read counting
- htseq-count
  - For counting reads, this is all you need
Differential expression
Mirroring the earlier steps, the tools here also fall into three classes: read-count-based, assembly-based, and alignment-free.
- limma
  - For analyzing microarray data
  - Linear Models for Microarray Data
  - http://bioconductor.org/packages/release/bioc/html/limma.html
- DESeq
- DESeq2
  - Performs relatively well among these tools
  - http://bioconductor.org/packages/release/bioc/html/DESeq2.html
- DEGseq
  - Identify Differentially Expressed Genes from RNA-seq data
  - http://www.bioconductor.org/packages/2.6/bioc/html/DEGseq.html
- edgeR
  - Empirical Analysis of Digital Gene Expression Data in R
  - http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
- Ballgown
  - Accuracy is sometimes not great
  - facilitate flexible differential expression analysis of RNA-Seq data
  - organize, visualize, and analyze the expression measurements for your transcriptome assembly.
  - https://github.com/alyssafrazee/ballgown
- sleuth
  - Designed to be used together with kallisto
  - https://pachterlab.github.io/sleuth/about
Data visualization
Visualization tools can be divided into local and online ones
- IGV
  - The go-to choice for viewing analysis results locally
  - Integrative Genomics Viewer
  - http://software.broadinstitute.org/software/igv/
- JBrowse
  - The go-to choice for presenting data publicly or sharing it with collaborators; fast and good-looking.
  - http://jbrowse.org/code/JBrowse-1.10.2/docs/tutorial/
- DEIVA
  - Online tool for visualizing differential expression
  - Interactive Visual Analysis of differential gene expression test results
  - http://hypercubed.github.io/DEIVA/
- Heatmapper
  - Online tool for drawing all kinds of heat maps
  - expression-based heat maps
  - pairwise distance maps
  - correlation maps
  - http://www.heatmapper.ca/
- START
  - A Shiny-based suite of RNA-seq data visualization tools
  - visualize RNA-seq data starting with count data
  - https://kcvi.shinyapps.io/START/
A few wonderful websites
- Biostars
- R book
- Python guide
- Biopython
- Rosalind
- bioinformatics tools
- Data Visualisation Catalogue
That's all for now. There are other tools that I rarely use myself, so I have left them out rather than burden anyone with them; I'll add more later.