IsoformSwitchAnalyzeR包是2019年才發(fā)布的一個分析包,更充分地發(fā)揮了RNA-seq數(shù)據(jù)的價值。學(xué)習(xí)筆記參考官方文檔凿滤,示例文件參考R包內(nèi)嵌數(shù)據(jù);回校后再利用資源庶近,根據(jù)筆記實操一下吧~
一翁脆、簡單介紹
1、IsoformSwitch
One of these examples is the pyruvate kinase(丙酮酸激酶). In normal adult homeostasis(體內(nèi)平衡), cells use the adult isoform (M1), which supports oxidative phosphorylation(氧化磷酸化). However, almost all cancer cells use the embryonic isoform (M2), which promotes aerobic glycolysis(有氧糖酵解), one of the hallmarks(標志) of cancer. Such shifts in isoform usage are termed ‘isoformisoform switching’
如上所述鼻种,IsoformSwitch就是同一基因產(chǎn)生的轉(zhuǎn)錄本發(fā)生變化(量與質(zhì))導(dǎo)致基因功能改變反番,從而最終導(dǎo)致對機體的影響。
2叉钥、IsoformSwitchAnalyze
簡單理解就是比較兩組樣本的基因轉(zhuǎn)錄本差異與否罢缸。這是區(qū)別于差異基因分析的一種分析思路。從某些角度投队,這是有很大意義的枫疆。比如兩組樣本的某一基因表達量(gene expression)差異較小,但是該基因所產(chǎn)生的轉(zhuǎn)錄本情況可能發(fā)生大的改變(highly significant switch in isoform usage)敷鸦,在差異基因分析時就會被忽略息楔。
3寝贡、大致流程workflow
如下圖,可以理解成三個步驟--
- (1)導(dǎo)入數(shù)據(jù)--將前期基礎(chǔ)數(shù)據(jù)導(dǎo)入R中值依,其中最關(guān)鍵的是Isoform quantification數(shù)據(jù)圃泡。
- (2)對樣本isoform從多角度進行注釋分析;
- (3)利用轉(zhuǎn)錄本注釋信息愿险,進行組間IsoformSwitch分析颇蜡,并將結(jié)果可視化。
步驟一:數(shù)據(jù)導(dǎo)入
1拯啦、Isoform quantification
即RNA-seq中reads比對到轉(zhuǎn)錄本中的情況澡匪,分析軟件有很多,這里推薦使用Salmon(high speed and increased accuracy)褒链。分析后唁情,直接將結(jié)果導(dǎo)入到R中即可。importIsoformExpression()
在導(dǎo)入的同時會對counts進行標準化甫匹,計算abundance甸鸟,更有比較意義。
library(IsoformSwitchAnalyzeR)
salmonQuant <- importIsoformExpression(
parentDir = system.file("extdata/", package="IsoformSwitchAnalyzeR"),
addIsofomIdAsColumn = TRUE
)
值得注意的是因為記得salmon分析會將每個樣本數(shù)據(jù)結(jié)果儲存單獨一個文件夾兵迅,因此指定
parentDir=
即可抢韭。
此外這里因為利用的是包內(nèi)嵌數(shù)據(jù);當(dāng)使用自己的數(shù)據(jù)時恍箭,直接指定路徑/文件即可刻恭,下面示例文件操作也是類似的。
2扯夭、實驗分組設(shè)計表
data.frame格式鳍贾,要有兩列:sampleID and condition
myDesign <- data.frame(
sampleID = colnames(salmonQuant$abundance)[-1],
condition = gsub('_.*', '', colnames(salmonQuant$abundance)[-1])
)
如下圖,示例實驗為三組condition交洗,每組兩個重復(fù)骑科。
3、GTF文件
主要提供以下兩方面信息:
- The transcript structure of the isoforms (in genomic coordinates) as well as information about which isoforms originate from the same gene.
即轉(zhuǎn)錄本的外顯子信息构拳,以及起源基因咆爽。 - the CoDing Sequence regions as the as the ORF regions.
其實CDS本質(zhì)與OPF是兩個概念,此時可以一同理解置森,就是轉(zhuǎn)錄本的編碼序列斗埂。
4、fasta文件
It is highly recommended that you supply the transcript fasta file while constructing the switchAnalyzeRlist. This will result in more accurate and much faster analysis in the entire workflow.
上述GTF與fasta注釋文件均可在Ensembl genome browser 99網(wǎng)站下載凫海,如下圖蜜笤。
5、導(dǎo)入合并
準備好上述四個文件盐碱,就可以使用importRdata()
函數(shù)進行導(dǎo)入合并了:并進行了以下分析:
- 分析轉(zhuǎn)錄本abundance把兔,得到基因表達情況沪伙;
- 根據(jù)分組,summarize gene/isoform expression/usage县好;
- 根據(jù)上一步围橡,計算組間差異(dIF);
- 為每個轉(zhuǎn)錄本缕贡,注釋genomic annotation and the nucleotide sequences翁授。
此時組間差異計算主要是
dIF
,the difference in isoform fraction。IF
即isoform fraction晾咪,用來評價該 isoform usage收擦;計算公式為isoform_exp / gene_exp。因此dIF
=IF2 - IF1谍倦,用于評價差異大小(effect size)塞赂。
sar1 <- importRdata(
isoformCountMatrix = salmonQuant$counts,
isoformRepExpression = salmonQuant$abundance,
designMatrix = myDesign,
isoformExonAnnoation = system.file("extdata/example.gtf.gz" , package="IsoformSwitchAnalyzeR"),
isoformNtFasta = system.file("extdata/example_isoform_nt.fasta.gz", package="IsoformSwitchAnalyzeR")
)
注意一下,此時sar1初步匯總結(jié)果仍可理解成:觀測/行名為所有的轉(zhuǎn)錄本昼蛀。
head(sar1,2)
head(sar1$isoformFeatures,2)
head(sar1$exons,2)
head(sar1$ntSequence,2)
6宴猾、初篩一下數(shù)據(jù)
這里的過濾主要基于low Gene/Isoform expression篩選,減少接下來不必要的運算時間叼旋。
sar2 <- preFilter(
switchAnalyzeRlist = sar1,
geneExpressionCutoff = 10,
isoformExpressionCutoff = 3,
removeSingleIsoformGenes = TRUE #默認為TRUE仇哆,刪除只有一個轉(zhuǎn)錄本的基因
)
#返回:The filtering removed 373 ( 34.16% of ) transcripts. There is now 719 isoforms left
-
preFilter
默認參數(shù)geneExpressionCutoff = 1
、isoformExpressionCutoff = 0
sar2 <- preFilter(sar1)
#返回:The filtering removed 273 ( 25% of ) transcripts. There is now 819 isoforms left
步驟二:添加注釋
1夫植、注釋Q值
這一步主要計算Switches的Q值與dIF讹剔,判斷isoform的顯著性與效應(yīng)大小
- The alpha argument indicating the FDR corrected P-value (Q-value) cutoff.
- The dIFcutoff argument indicating the minimum (absolute) change in isoform usage (dIF).
由于dIF者前期已經(jīng)計算,所以這里主要是計算Q值详民。有不同的的方法/算法延欠,推薦DEXSeq。
DEXSeq was originally designed for testing for differential exon usage but have recently been shown to perform exceptionally well for differential isoform usage.
sar3 <- isoformSwitchTestDEXSeq(
switchAnalyzeRlist = sar2,
reduceToSwitchingGenes=TRUE
)
-
reduceToSwitchingGenes=TRUE
參數(shù)設(shè)置篩選genes which each contains at least one differential used isoform, as indicated by the alpha and dIFcutoff cutoffs.
head(sar3,2)
2阐斜、注釋ORF(選做)
由于一般提供GTF文件會提供CDS數(shù)據(jù),所以不需要再添加诀紊。由于本次示例GTF文件好像未提供相關(guān)信息谒出。不過也沒關(guān)系,因為也可以利用現(xiàn)有的數(shù)據(jù)集構(gòu)建每個轉(zhuǎn)錄本的ORF邻奠。(utilizes the genomic coordinates of each transcript to extract the transcript nucleotide sequence from a reference genome )
sar4 <- analyzeORF(
sar3,
orfMethod = "longest", #有four different methods 笤喳,詳見官方文檔
showProgress=FALSE
)
head(sar4$orfAnalysis, 3)
3、生成轉(zhuǎn)錄本核苷酸與對應(yīng)氨基酸序列
sar5 <- extractSequence(sar4,
filterAALength=TRUE,
alsoSplitFastaFile=TRUE,
pathToOutput = '~/li/test/01',
writeToFile=TRUE)
For easier usage of the Pfam and SignalP web servers碌宴,進行以下設(shè)置:
-
filterAALength
參數(shù)表示氨基酸的有效長度為5~1000杀狡; -
alsoSplitFastaFile
參數(shù)表示對 the number of sequences。
4贰镣、External Sequence Analysis
(1)根據(jù)轉(zhuǎn)錄本核苷酸序列(Nucleotide/nt Sequences)
- predict whether an isoform is coding or not(CPAT/CPC2二選一)
(2)根據(jù)轉(zhuǎn)錄本氨基酸序列( Amino Acid/AA Sequences)
- Prediction of protein domains蛋白結(jié)構(gòu)域(Pfam)呜象;
- Prediction of Signal Peptides信號肽(SignalP)膳凝;
- Prediction of Intrinsically Disordered Regions,IDR內(nèi)在無序區(qū)域(IUPred2A/NetSurfP 2二選一)
根據(jù)所得序列文件,進行上述四種角度的轉(zhuǎn)錄本深入分析恭陡,網(wǎng)站鏈接見官方文檔蹬音。分析得到結(jié)果后,將結(jié)果注釋到sar5中休玩,方法可參考下述代碼:
- CPAT/CPC2
sar5.1 <- analyzeCPAT(
switchAnalyzeRlist = sar5,
pathToCPATresultFile = system.file("extdata/cpat_results.txt", package = "IsoformSwitchAnalyzeR"),
codingCutoff = 0.725, # the coding potential cutoff we suggested for human
removeNoncodinORFs = TRUE # 存疑V!拴疤!
)
#返回:Added coding potential to 350 (74.47%) transcripts
### Add CPC2 analysis
exampleSwitchListAnalyzed <- analyzeCPC2(
switchAnalyzeRlist = exampleSwitchListAnalyzed,
pathToCPC2resultFile = system.file("extdata/cpc2_result.txt", package = "IsoformSwitchAnalyzeR"),
codingCutoff = 0.725, # the coding potential cutoff we suggested for human
removeNoncodinORFs = TRUE # because ORF was predicted de novo
)
- Pfam
sar5.2 <- analyzePFAM(
switchAnalyzeRlist = sar5.1,
pathToPFAMresultFile = system.file("extdata/pfam_results.txt", package = "IsoformSwitchAnalyzeR"),
showProgress=FALSE
)
#Added domain information to 102 (74.64%) transcripts
- SignalP
sar5.3 <- analyzeSignalP(
switchAnalyzeRlist = sar5.2,
pathToSignalPresultFile = system.file("extdata/signalP_results.txt", package = "IsoformSwitchAnalyzeR")
)
#Added signal peptide information to 30 (6.38%) transcripts
- IUPred2A/NetSurfP
sar5.4 <- analyzeIUPred2A(
switchAnalyzeRlist = sar5.3,
pathToIUPred2AresultFile = system.file("extdata/iupred2a_result.txt.gz", package = "IsoformSwitchAnalyzeR"),
showProgress = FALSE
)
# Added IDR information to 69 (14.68%) transcripts
友情提示:if you had to submit your data as multiple runs at the Pfam or SignalP website you can just supply a vector of strings indicating the path to each of the resulting files to the functions above and IsoformSwitchAnalyzeR will read them all and integrate them.
5周偎、注釋Alternative Splicing可變剪切類型
基于 the exon structure of all isoforms in a given gene (with isoform switching)
sar6 <- analyzeAlternativeSplicing(
switchAnalyzeRlist = sar5.4,
quiet=TRUE
)
tmp=as.data.frame(sar6$AlternativeSplicingAnalysis)
table(sar6$AlternativeSplicingAnalysis$IR )
如下表格顯示纯续,61 isoforms contain a single intron retention (IR) and 19 isoforms each contain two or more intron retentions.
6、注釋Switch Consequences
- These isoforms are then divided into the isoforms that increase their contribution to gene expression (positive dIF values larger than dIFcutoff) and the isoforms that decrease their contribution (negative dIF values smaller than -dIFcutoff).
- The isoforms with increased contribution are then (in a pairwise manner) compared to the isoform with decreasing contribution.
- In each of these comparisons the isoforms compared are analyzed for differences in their annotation (controlled by the consequencesToAnalyze parameter).
cIn <- c('intron_retention','coding_potential','NMD_status','domains_identified','ORF_seq_similarity')
sar7 <- analyzeSwitchConsequences(
sar6,
consequencesToAnalyze = cIn,
dIFcutoff = 0.4, # very high cutoff for fast runtimes
showProgress=FALSE
)
sar7.df=as.data.frame(sar7$switchConsequence)
sar7.df=sar7$switchConsequence
結(jié)合sar7.df及上述定義,我個人的理解是首先篩選基因表達的多個轉(zhuǎn)錄本中既存在dIF>0.4的轉(zhuǎn)錄本拓瞪,也存在dIF< -0.4的轉(zhuǎn)錄本。然后將同一基因的這些轉(zhuǎn)錄本的注釋進行比較井辆,觀察借跪,這些注釋是否存在不同。
sar7
如下圖所示愧薛,步驟二就是不斷增加Feature analyzed條目的過程晨炕。到這里步驟二就大致完成了。之后的步驟就是結(jié)合這些前期數(shù)據(jù)進行最后結(jié)論性得分析毫炉、匯總了瓮栗。
值得注意的是 IsoformSwitchAnalyzeR包還提供了兩個便捷函數(shù):part1,part2,快速實現(xiàn)上述步驟二所有的注釋分析
data("exampleSwitchList")
exampleSwitchList #已完成了前期的數(shù)據(jù)導(dǎo)入
exampleSwitchList <- isoformSwitchAnalysisPart1(
switchAnalyzeRlist = exampleSwitchList,
dIFcutoff = 0.3, # Cutoff for finding switches - set high for short runtime in example data
# pathToOutput = 'path/to/where/output/should/be/'
outputSequences = FALSE, # change to TRUE whan analyzing your own data
prepareForWebServers = FALSE # change to TRUE if you will use webservers for external sequence analysis
)
# 利用產(chǎn)生的序列信息瞄勾,進行外部軟件分析费奸,得到相應(yīng)結(jié)果
exampleSwitchList <- isoformSwitchAnalysisPart2(
switchAnalyzeRlist = exampleSwitchList,
dIFcutoff = 0.3, # Cutoff for finding switches - set high for short runtime in example data
n = 10, # if plotting was enabled, it would only output the top 10 switches
removeNoncodinORFs = TRUE, # Because ORF was predicted de novo
pathToCPC2resultFile = system.file("extdata/cpc2_result.txt" , package = "IsoformSwitchAnalyzeR"),
pathToPFAMresultFile = system.file("extdata/pfam_results.txt" , package = "IsoformSwitchAnalyzeR"),
pathToIUPred2AresultFile = system.file("extdata/iupred2a_result.txt.gz" , package = "IsoformSwitchAnalyzeR"),
pathToSignalPresultFile = system.file("extdata/signalP_results.txt" , package = "IsoformSwitchAnalyzeR"),
outputPlots = FALSE # keeps the function from outputting the plots from this example
)
- 如下結(jié)果(Feature analyzed),上述利用
isoformSwitchAnalysisPart1
进陡,isoformSwitchAnalysisPart2
兩個函數(shù)可直接完成所有的分析愿阐,還是挺方便的~