還在利用hisat, tophat這些耳熟能詳?shù)能浖ead比對到基因組(轉(zhuǎn)錄組)上棒掠,然后統(tǒng)計(jì)每個基因的count數(shù)么?試試這些不需要比對屁商,速度更快的工具吧烟很。
- Salmon(Patro et al. 2016),
- Sailfish (Patro, Mount, and Kingsford 2014)
- kallisto (Bray et al. 2016)
- RSEM(B. Li and Dewey 2011)
這里就介紹一下Salmon工具,文章發(fā)表在nature method蜡镶,如下是摘要
We introduce Salmon, a lightweight method for quantifying transcript abundance from RNA–seq reads. Salmon combines a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure. It is the first transcriptome-wide quantifier to correct for fragment GC-content bias, which, as we demonstrate here, substantially improves the accuracy of abundance estimates and the sensitivity of subsequent differential expression analysis.
誰能給我解釋下:
- dual-phase parallel interence algorithm
- feature-rich bias modesl
算法這個東西雾袱,目前還不是我能力范圍所能掌控的。反正他說他能讓豐度預(yù)測更加準(zhǔn)確官还,讓后續(xù)的差異表達(dá)分析更加敏感芹橡。作者認(rèn)為目前的工具都缺少樣本特異性誤差模型,然后我用了新的模型彌補(bǔ)了這個缺陷望伦。(好吧林说,你最棒了)
existing methods for transcriptome-wide abundance estimation—both alignment-based and alignment-free—lack sample-specific bias models rich enough to capture important effects like fragment GC-content bias. When correction is not applied, these biases can lead to undesired effects, for example, a loss of false discovery rate (FDR) control in differential expression studies5.existing methods for transcriptome-wide abundance estimation—both alignment-based and alignment-free—lack sample-specific bias models rich enough to capture important effects like fragment GC-content bias. When correction is not applied, these biases can lead to undesired effects, for example, a loss of false discovery rate (FDR) control in differential expression studies5.
最后放一張Salmon的流程圖,大家意會吧屯伞。

安裝
Salmon非常貼心的提供了二進(jìn)制版本腿箩,所以可以在https://github.com/COMBINE-lab/salmon/releases 下載最新的版本,當(dāng)然你可以選擇下載source code劣摇,然后自己編譯珠移。
我習(xí)慣把軟件放到家目錄下的biosoft文件夾下,然后把程序的路徑存放到PATH中末融,最后的效果如下:
$ salmon -h
Salmon v0.8.2
Usage: salmon -h|--help or
salmon -v|--version or
salmon -c|--cite or
salmon [--no-version-check] <COMMAND> [-h | options]
Commands:
cite Show salmon citation information
index Create a salmon index
quant Quantify a sample
swim Perform super-secret operation
簡單說明
Salmon提供了官方文檔http://salmon.readthedocs.io/en/latest/, 所以哪里不懂就去這里先看钧惧,如果解決不了問題,就去Google(不知道如何上Google勾习,請百度如何上Google)浓瞪。
Salmon的輸入數(shù)據(jù)可以分為兩種:
- 參考轉(zhuǎn)錄組(記住是轉(zhuǎn)錄組,而不是全基因組)和你的測序結(jié)果(FASTA/FASTQ格式)
- 已比對文件(SAM/BAM)和參考轉(zhuǎn)錄組(記住是轉(zhuǎn)錄組语卤,而不是全基因組追逮,好了酪刀,你別說了,我記住了)
根據(jù)輸入數(shù)據(jù)不同钮孵,分為兩種模式
- quasi-mapping-based mode
- alignment-based mode
以Salmon自帶的測試數(shù)據(jù)為例(解壓sample_data.tar.tgz可得):
$ ls
reads_1.fastq reads_2.fastq transcripts.fasta
第一步:要建立參考轉(zhuǎn)錄組的索引(記住是轉(zhuǎn)錄組骂倘,而不是全基因組,好了巴席!你別說了历涝,我記住了)
salmon index -t transcripts.fasta -i transcripts_index --type quasi -k 31
# 參數(shù)說明
-t: 輸入的參考轉(zhuǎn)錄組名(記住是轉(zhuǎn)錄組,而不是全基因組漾唉,好了荧库!你別說了,我記住了)
-i: 輸出index的名字
-type: 索引類型赵刑,分為fmd, quasi, 建議quasi
-k: k-mers的長度分衫。read長度大于75bp, 作者建議31
第二步: 對RNA測序結(jié)果進(jìn)行定量
salmon quant -i transcripts_index -l A -1 reads_1.fastq -2 reads_2.fastq -o transcripts_quant
# 參數(shù)解釋
-i: 輸入之前的index路徑
-l/--libType: 文庫類型般此,請看后續(xù)的文檔
-1: read1蚪战,支持壓縮文件
-2: read2,支持壓縮文件
-o: 輸出目錄
文庫具體說明铐懊,見官方文檔:http://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype邀桑。 盡管你可以用-A
程序自動決定, 但是了解不同的文庫類型可以幫助你理解-A是如何發(fā)揮功能科乎。
輸出的結(jié)果如下:
$ ls
aux_info cmd_info.json lib_format_counts.json libParams logs quant.sf
$ head quant.sf
Name Length EffectiveLength TPM NumReads
NM_001168316 2283 2105.89 12603.4 160.897
NM_174914 2385 2207.89 112095 1500.32
NR_031764 1853 1675.89 10117.2 102.785
NM_004503 1681 1503.89 36217.5 330.184
NM_006897 1541 1363.89 80309.5 664
NM_014212 2037 1859.89 4878.13 55
NM_014620 2300 2122.89 45827 589.754
NM_017409 1959 1781.89 4351.06 47
NM_017410 2396 2218.89 3122.42 42
第三步: 將結(jié)果導(dǎo)入R
接著可以用DESeq2的txmport
將處理結(jié)果導(dǎo)入R壁畸,
txi.salmon <- tximport('quant.sf', type = "salmon", tx2gene = tx2gene)
現(xiàn)實(shí)中的數(shù)據(jù)
演示的數(shù)據(jù)來自于發(fā)表在Nature commmunication 上的一篇文章 “Temporal dynamics of gene expression and histone marks at the Arabidopsis shoot meristem during flowerin”。原文用RNA-Seq的方式研究在開花階段,芽分生組織在不同時期的基因表達(dá)變化茅茂。
原文的流程是: TopHat -> SummarizeOverlaps -> Deseq2 -> AmiGO
其中比對的參考基因組為TAIR10 ver.24 捏萍,并且屏蔽了ribosomal RNA
regions (2:3471–9557; 3:14,197,350–14,203,988)。Deseq2只計(jì)算至少在一個時間段的FPKM的count > 1 的基因玉吁。
數(shù)據(jù)存放在http://www.ebi.ac.uk/arrayexpress/, ID為E-MTAB-5130照弥。
實(shí)驗(yàn)設(shè)計(jì): 4個時間段(0,1,2,3),分別有4個生物學(xué)重復(fù)进副,一共有16個樣品这揣。
數(shù)據(jù)下載
首先下載數(shù)據(jù)說明文件:
$ wget http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-5130/E-MTAB-5130.sdrf.txt
然后根據(jù)數(shù)據(jù)說明文件提供的FTP鏈接下載
$ head -n1 E-MTAB-5130.sdrf.txt | tr '\t' '\n' | nl | grep URI
33 Comment[FASTQ_URI]
$ tail -n +2 E-MTAB-5130.sdrf.txt | cut -f 33 | xargs -i wget {}
根據(jù)下載速度,你們可以選擇去吃吃飯影斑,還是睡睡覺给赞。
下載完RNA-Seq數(shù)據(jù)后,我們還需要下載一個擬南芥cDNA序列(記住是轉(zhuǎn)錄組矫户,而不是全基因組片迅,好了,你別說了皆辽,我記住了)
$ curl ftp://ftp.ensemblgenomes.org/pub/plants/release-28/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz -o athal.fa.gz
然后用Salmon建立索引:
salmon index -t athal.fa.gz -i athal_index
數(shù)據(jù)定量
由于樣本一共有16個柑蛇,不可能一條條輸入命令芥挣,所以我們寫一個腳本:
#! /bin/bash
for fn in ERR1698{194..209};
do
samp=`basename ${fn}`
echo "Processin sample ${sampe}"
salmon quant -i athal_index -l A \
-1 ${samp}_1.fastq.gz \
-2 ${samp}_2.fastq.gz \
-p 8 -o quants/${samp}_quant
done
根據(jù)你電腦的配置,你可以選擇吃下午茶耻台,還是選擇睡個午覺空免。
數(shù)據(jù)導(dǎo)入R
當(dāng)然你完全不必真的去睡午覺,我們可以程序運(yùn)行的時候準(zhǔn)備一下tximport
所需要的輸入文件盆耽。
tximport
可以糾正不同樣本基因長度的潛在改變(比如說differential isoform usage)蹋砚;能夠用于導(dǎo)入 (Salmon, Sailfish, kallisto)程序的輸出文件;能夠避免丟棄那些比對到多個基因的同源序列摄杂,從而提高靈敏度
In particular, the tximport pipeline offers the following benefits: (i) this approach corrects for potential changes in gene length across samples (e.g. from differential isoform usage) (Trapnell et al. 2013), (ii) some of the upstream quantification methods (Salmon, Sailfish, kallisto) are substantially faster and require less memory and disk usage compared to alignment-based methods that require creation and storage of BAM files, and (iii) it is possible to avoid discarding those fragments that can align to multiple genes with homologous sequence, thus increasing sensitivity (Robert and Watson 2015)
雖然tximport
的參數(shù)看起來很多坝咐,但其實(shí)需要我們準(zhǔn)備的就是兩個files
和tx2gene
tximport(files, type = c("none", "salmon", "sailfish", "kallisto", "rsem"),
txIn = TRUE, txOut = FALSE, countsFromAbundance = c("no", "scaledTPM",
"lengthScaledTPM"), tx2gene = NULL, varReduce = FALSE,
dropInfReps = FALSE, ignoreTxVersion = FALSE, geneIdCol, txIdCol,
abundanceCol, countsCol, lengthCol, importer = NULL)
files
存放的是salmon的輸出文件,所以我們需要根據(jù)文件存放位置析恢,進(jìn)行聲明
dir <- "C:/Users/Xu/Desktop/"
list.files(dir)
sample <- paste0("ERR1698",seq(194,209),"_quant")
files <- file.path(dir,"quants",sample,"quant.sf")
names(files) <- paste0("sample",c(1:16))
all(file.exists(files))
然后我們還要準(zhǔn)備一個基因名和轉(zhuǎn)錄本名字相關(guān)的數(shù)據(jù)框
library(AnnotationHub)
ah <- AnnotationHub()
ath <- query(ah,'thaliana')
ath_tx <- ath[['AH52247']]
columns(ath_tx)
k <- keys(ath_tx,keytype = "GENEID")
df <- select(ath_tx, keys=k, keytype = "GENEID",columns = "TXNAME")
tx2gene <- df[,2:1] # TXID在前墨坚, GENEID在后
如果你電腦跑的夠快,基本上這個時候就可以導(dǎo)入數(shù)據(jù)了氮昧。
# install.packages("readr")
# install.packages("rsjon")
library("tximport")
library("readr")
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
# reading in files with read_tsv
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# summarizing abundance
# summarizing counts
# summarizing length
names(txi)
[1] "abundance" "counts" "length"
[4] "countsFromAbundance
head(txi$lenght)
head(txi$counts)
由于后續(xù)要用DESeq2包做差異表達(dá)分析框杜,所以需要用DESeqDataSetFromTximport
這個函數(shù)浦楣,當(dāng)然你還需要說明你的實(shí)驗(yàn)設(shè)計(jì)
library("DESeq2")
sampleTable <- data.frame(condition = factor(
rep(c("Day0","Day1","Day2","Day3"),each=4)))
rownames(sampleTable) <- colnames(txi$counts)
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)
總結(jié)
- 使用Salmon對樣本的轉(zhuǎn)錄水平進(jìn)行定量
- 使用tximport導(dǎo)入數(shù)據(jù)
- 使用DESeqDataSetFromTximport將數(shù)據(jù)導(dǎo)入DESeq2