RNA-seq學(xué)習(xí)：No.7IsoformSwitchAnalyzeR包（1）

IsoformSwitchAnalyzeR包是2019年才發(fā)布的一個分析包，更充分地發(fā)揮了RNA-seq數(shù)據(jù)的價值。學(xué)習(xí)筆記參考官方文檔凿滤，示例文件參考R包內(nèi)嵌數(shù)據(jù)；回校后再利用資源庶近，根據(jù)筆記實操一下吧~

一翁脆、簡單介紹

1、IsoformSwitch

One of these examples is the pyruvate kinase（丙酮酸激酶）. In normal adult homeostasis（體內(nèi)平衡）, cells use the adult isoform (M1), which supports oxidative phosphorylation（氧化磷酸化）. However, almost all cancer cells use the embryonic isoform (M2), which promotes aerobic glycolysis（有氧糖酵解）, one of the hallmarks（標志） of cancer. Such shifts in isoform usage are termed ‘isoformisoform switching’

如上所述鼻种，IsoformSwitch就是同一基因產(chǎn)生的轉(zhuǎn)錄本發(fā)生變化（量與質(zhì)）導(dǎo)致基因功能改變反番，從而最終導(dǎo)致對機體的影響。

2叉钥、IsoformSwitchAnalyze

簡單理解就是比較兩組樣本的基因轉(zhuǎn)錄本差異與否罢缸。這是區(qū)別于差異基因分析的一種分析思路。從某些角度投队，這是有很大意義的枫疆。比如兩組樣本的某一基因表達量(gene expression)差異較小，但是該基因所產(chǎn)生的轉(zhuǎn)錄本情況可能發(fā)生大的改變(highly significant switch in isoform usage)敷鸦，在差異基因分析時就會被忽略息楔。

3寝贡、大致流程workflow

如下圖，可以理解成三個步驟--

（1）導(dǎo)入數(shù)據(jù)--將前期基礎(chǔ)數(shù)據(jù)導(dǎo)入R中值依，其中最關(guān)鍵的是Isoform quantification數(shù)據(jù)圃泡。
（2）對樣本isoform從多角度進行注釋分析；
（3）利用轉(zhuǎn)錄本注釋信息愿险，進行組間IsoformSwitch分析颇蜡，并將結(jié)果可視化。

workflow

步驟一：數(shù)據(jù)導(dǎo)入

1拯啦、Isoform quantification

即RNA-seq中reads比對到轉(zhuǎn)錄本中的情況澡匪，分析軟件有很多，這里推薦使用Salmon(high speed and increased accuracy)褒链。分析后唁情，直接將結(jié)果導(dǎo)入到R中即可。importIsoformExpression()在導(dǎo)入的同時會對counts進行標準化甫匹，計算abundance甸鸟，更有比較意義。

library(IsoformSwitchAnalyzeR)
salmonQuant <- importIsoformExpression(
    parentDir = system.file("extdata/", package="IsoformSwitchAnalyzeR"),
    addIsofomIdAsColumn = TRUE
)

值得注意的是因為記得salmon分析會將每個樣本數(shù)據(jù)結(jié)果儲存單獨一個文件夾兵迅，因此指定parentDir=即可抢韭。
此外這里因為利用的是包內(nèi)嵌數(shù)據(jù)；當(dāng)使用自己的數(shù)據(jù)時恍箭，直接指定路徑/文件即可刻恭，下面示例文件操作也是類似的。

2扯夭、實驗分組設(shè)計表

data.frame格式鳍贾，要有兩列：sampleID and condition

myDesign <- data.frame(
    sampleID = colnames(salmonQuant$abundance)[-1],
    condition = gsub('_.*', '', colnames(salmonQuant$abundance)[-1])
)

如下圖，示例實驗為三組condition交洗，每組兩個重復(fù)骑科。

myDesign

3、GTF文件

主要提供以下兩方面信息：

The transcript structure of the isoforms (in genomic coordinates) as well as information about which isoforms originate from the same gene.
即轉(zhuǎn)錄本的外顯子信息构拳，以及起源基因咆爽。
the CoDing Sequence regions as the as the ORF regions.
其實CDS本質(zhì)與OPF是兩個概念，此時可以一同理解置森，就是轉(zhuǎn)錄本的編碼序列斗埂。

4、fasta文件

It is highly recommended that you supply the transcript fasta file while constructing the switchAnalyzeRlist. This will result in more accurate and much faster analysis in the entire workflow.

上述GTF與fasta注釋文件均可在Ensembl genome browser 99網(wǎng)站下載凫海，如下圖蜜笤。

GTF與fasta下載

5、導(dǎo)入合并

準備好上述四個文件盐碱，就可以使用importRdata()函數(shù)進行導(dǎo)入合并了：并進行了以下分析：

分析轉(zhuǎn)錄本abundance把兔，得到基因表達情況沪伙；
根據(jù)分組，summarize gene/isoform expression/usage县好；
根據(jù)上一步围橡，計算組間差異(dIF)；
為每個轉(zhuǎn)錄本缕贡，注釋genomic annotation and the nucleotide sequences翁授。

此時組間差異計算主要是dIF,the difference in isoform fraction。IF即isoform fraction晾咪，用來評價該 isoform usage收擦；計算公式為isoform_exp / gene_exp。因此dIF=IF2 - IF1谍倦，用于評價差異大小(effect size)塞赂。

sar1 <- importRdata(
    isoformCountMatrix   = salmonQuant$counts,
    isoformRepExpression = salmonQuant$abundance,
    designMatrix         = myDesign,
    isoformExonAnnoation = system.file("extdata/example.gtf.gz"             , package="IsoformSwitchAnalyzeR"),
    isoformNtFasta       = system.file("extdata/example_isoform_nt.fasta.gz", package="IsoformSwitchAnalyzeR")
)

注意一下，此時sar1初步匯總結(jié)果仍可理解成：觀測/行名為所有的轉(zhuǎn)錄本昼蛀。

head(sar1,2)
head(sar1$isoformFeatures,2)
head(sar1$exons,2)
head(sar1$ntSequence,2)

6宴猾、初篩一下數(shù)據(jù)

這里的過濾主要基于low Gene/Isoform expression篩選，減少接下來不必要的運算時間叼旋。

sar2 <- preFilter(
  switchAnalyzeRlist = sar1,
  geneExpressionCutoff = 10,
  isoformExpressionCutoff = 3,
  removeSingleIsoformGenes = TRUE  #默認為TRUE仇哆，刪除只有一個轉(zhuǎn)錄本的基因
)
#返回：The filtering removed 373 ( 34.16% of ) transcripts. There is now 719 isoforms left

preFilter默認參數(shù) geneExpressionCutoff = 1、isoformExpressionCutoff = 0

sar2 <- preFilter(sar1)
#返回：The filtering removed 273 ( 25% of ) transcripts. There is now 819 isoforms left

步驟二：添加注釋

1夫植、注釋Q值

這一步主要計算Switches的Q值與dIF讹剔，判斷isoform的顯著性與效應(yīng)大小

The alpha argument indicating the FDR corrected P-value (Q-value) cutoff.
The dIFcutoff argument indicating the minimum (absolute) change in isoform usage (dIF).
由于dIF者前期已經(jīng)計算，所以這里主要是計算Q值详民。有不同的的方法/算法延欠，推薦DEXSeq。

DEXSeq was originally designed for testing for differential exon usage but have recently been shown to perform exceptionally well for differential isoform usage.

sar3 <- isoformSwitchTestDEXSeq(
  switchAnalyzeRlist = sar2,
  reduceToSwitchingGenes=TRUE
)

reduceToSwitchingGenes=TRUE參數(shù)設(shè)置篩選genes which each contains at least one differential used isoform, as indicated by the alpha and dIFcutoff cutoffs.

head(sar3,2)

head(sar3,2)

2阐斜、注釋ORF（選做）

由于一般提供GTF文件會提供CDS數(shù)據(jù)，所以不需要再添加诀紊。由于本次示例GTF文件好像未提供相關(guān)信息谒出。不過也沒關(guān)系，因為也可以利用現(xiàn)有的數(shù)據(jù)集構(gòu)建每個轉(zhuǎn)錄本的ORF邻奠。(utilizes the genomic coordinates of each transcript to extract the transcript nucleotide sequence from a reference genome )

sar4 <- analyzeORF(
    sar3,
    orfMethod = "longest",  #有four different methods 笤喳，詳見官方文檔
    showProgress=FALSE
)
head(sar4$orfAnalysis, 3)

head(sar4$orfAnalysis, 3)

3、生成轉(zhuǎn)錄本核苷酸與對應(yīng)氨基酸序列

sar5 <- extractSequence(sar4,
                        filterAALength=TRUE,
                        alsoSplitFastaFile=TRUE,
                        pathToOutput = '~/li/test/01',
                        writeToFile=TRUE)

For easier usage of the Pfam and SignalP web servers碌宴，進行以下設(shè)置：

filterAALength參數(shù)表示氨基酸的有效長度為5~1000杀狡；
alsoSplitFastaFile參數(shù)表示對 the number of sequences。

4贰镣、External Sequence Analysis

（1）根據(jù)轉(zhuǎn)錄本核苷酸序列(Nucleotide/nt Sequences)

predict whether an isoform is coding or not（CPAT/CPC2二選一）

（2）根據(jù)轉(zhuǎn)錄本氨基酸序列( Amino Acid/AA Sequences)

Prediction of protein domains蛋白結(jié)構(gòu)域（Pfam）呜象；
Prediction of Signal Peptides信號肽（SignalP）膳凝；
Prediction of Intrinsically Disordered Regions,IDR內(nèi)在無序區(qū)域（IUPred2A/NetSurfP 2二選一）

根據(jù)所得序列文件，進行上述四種角度的轉(zhuǎn)錄本深入分析恭陡，網(wǎng)站鏈接見官方文檔蹬音。分析得到結(jié)果后，將結(jié)果注釋到sar5中休玩，方法可參考下述代碼：

CPAT/CPC2

sar5.1 <- analyzeCPAT(
  switchAnalyzeRlist   = sar5,
  pathToCPATresultFile = system.file("extdata/cpat_results.txt", package = "IsoformSwitchAnalyzeR"),
  codingCutoff         = 0.725, # the coding potential cutoff we suggested for human
  removeNoncodinORFs   = TRUE   # 存疑Ｖ！拴疤！
)
#返回：Added coding potential to 350 (74.47%) transcripts
### Add CPC2 analysis
exampleSwitchListAnalyzed <- analyzeCPC2(
    switchAnalyzeRlist   = exampleSwitchListAnalyzed,
    pathToCPC2resultFile = system.file("extdata/cpc2_result.txt", package = "IsoformSwitchAnalyzeR"),
    codingCutoff         = 0.725, # the coding potential cutoff we suggested for human
    removeNoncodinORFs   = TRUE   # because ORF was predicted de novo
)

Pfam

sar5.2 <- analyzePFAM(
  switchAnalyzeRlist   = sar5.1,
  pathToPFAMresultFile = system.file("extdata/pfam_results.txt", package = "IsoformSwitchAnalyzeR"),
  showProgress=FALSE
)
#Added domain information to 102 (74.64%) transcripts

SignalP

sar5.3 <- analyzeSignalP(
  switchAnalyzeRlist       = sar5.2,
  pathToSignalPresultFile  = system.file("extdata/signalP_results.txt", package = "IsoformSwitchAnalyzeR")
)
#Added signal peptide information to 30 (6.38%) transcripts

IUPred2A/NetSurfP

sar5.4 <- analyzeIUPred2A(
  switchAnalyzeRlist        = sar5.3,
  pathToIUPred2AresultFile = system.file("extdata/iupred2a_result.txt.gz", package = "IsoformSwitchAnalyzeR"),
  showProgress = FALSE
)
# Added IDR information to 69 (14.68%) transcripts

友情提示：if you had to submit your data as multiple runs at the Pfam or SignalP website you can just supply a vector of strings indicating the path to each of the resulting files to the functions above and IsoformSwitchAnalyzeR will read them all and integrate them.

5周偎、注釋Alternative Splicing可變剪切類型

基于 the exon structure of all isoforms in a given gene (with isoform switching)

sar6 <- analyzeAlternativeSplicing(
    switchAnalyzeRlist = sar5.4,
    quiet=TRUE
)
tmp=as.data.frame(sar6$AlternativeSplicingAnalysis)
table(sar6$AlternativeSplicingAnalysis$IR )

如下表格顯示纯续，61 isoforms contain a single intron retention (IR) and 19 isoforms each contain two or more intron retentions.

table

6、注釋Switch Consequences

These isoforms are then divided into the isoforms that increase their contribution to gene expression (positive dIF values larger than dIFcutoff) and the isoforms that decrease their contribution (negative dIF values smaller than -dIFcutoff).
The isoforms with increased contribution are then (in a pairwise manner) compared to the isoform with decreasing contribution.
In each of these comparisons the isoforms compared are analyzed for differences in their annotation (controlled by the consequencesToAnalyze parameter).

cIn <- c('intron_retention','coding_potential','NMD_status','domains_identified','ORF_seq_similarity')
sar7 <- analyzeSwitchConsequences(
  sar6,
  consequencesToAnalyze = cIn, 
  dIFcutoff = 0.4, # very high cutoff for fast runtimes
  showProgress=FALSE
)
sar7.df=as.data.frame(sar7$switchConsequence)
sar7.df=sar7$switchConsequence

結(jié)合sar7.df及上述定義，我個人的理解是首先篩選基因表達的多個轉(zhuǎn)錄本中既存在dIF>0.4的轉(zhuǎn)錄本拓瞪，也存在dIF< -0.4的轉(zhuǎn)錄本。然后將同一基因的這些轉(zhuǎn)錄本的注釋進行比較井辆，觀察借跪，這些注釋是否存在不同。

sar7.df

sar7

如下圖所示愧薛，步驟二就是不斷增加Feature analyzed條目的過程晨炕。到這里步驟二就大致完成了。之后的步驟就是結(jié)合這些前期數(shù)據(jù)進行最后結(jié)論性得分析毫炉、匯總了瓮栗。

sar7

值得注意的是 IsoformSwitchAnalyzeR包還提供了兩個便捷函數(shù)：part1,part2，快速實現(xiàn)上述步驟二所有的注釋分析

data("exampleSwitchList")
exampleSwitchList    #已完成了前期的數(shù)據(jù)導(dǎo)入
exampleSwitchList <- isoformSwitchAnalysisPart1(
    switchAnalyzeRlist   = exampleSwitchList,
    dIFcutoff            = 0.3, # Cutoff for finding switches - set high for short runtime in example data
    # pathToOutput = 'path/to/where/output/should/be/'
    outputSequences      = FALSE, # change to TRUE whan analyzing your own data 
    prepareForWebServers = FALSE  # change to TRUE if you will use webservers for external sequence analysis
)
# 利用產(chǎn)生的序列信息瞄勾，進行外部軟件分析费奸，得到相應(yīng)結(jié)果
exampleSwitchList <- isoformSwitchAnalysisPart2(
  switchAnalyzeRlist        = exampleSwitchList, 
  dIFcutoff                 = 0.3,   # Cutoff for finding switches - set high for short runtime in example data
  n                         = 10,    # if plotting was enabled, it would only output the top 10 switches
  removeNoncodinORFs        = TRUE,  # Because ORF was predicted de novo
  pathToCPC2resultFile      = system.file("extdata/cpc2_result.txt"         , package = "IsoformSwitchAnalyzeR"),
  pathToPFAMresultFile      = system.file("extdata/pfam_results.txt"        , package = "IsoformSwitchAnalyzeR"),
  pathToIUPred2AresultFile  = system.file("extdata/iupred2a_result.txt.gz"  , package = "IsoformSwitchAnalyzeR"),
  pathToSignalPresultFile   = system.file("extdata/signalP_results.txt"     , package = "IsoformSwitchAnalyzeR"),
  outputPlots               = FALSE  # keeps the function from outputting the plots from this example
)

如下結(jié)果(Feature analyzed)，上述利用isoformSwitchAnalysisPart1进陡，isoformSwitchAnalysisPart2兩個函數(shù)可直接完成所有的分析愿阐，還是挺方便的~

exampleSwitchList

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

人面猴
序言：七十年代末，一起剝皮案震驚了整個濱河市趾疚，隨后出現(xiàn)的幾起案子缨历，更是在濱河造成了極大的恐慌，老刑警劉巖糙麦，帶你破解...
沈念sama閱讀 216,324評論 6贊 498
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件辛孵，死亡現(xiàn)場離奇詭異，居然都是意外死亡赡磅，警方通過查閱死者的電腦和手機魄缚，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 92,356評論 3贊 392
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進店門，熙熙樓的掌柜王于貴愁眉苦臉地迎上來焚廊，“玉大人冶匹，你說我怎么就攤上這事习劫。” “怎么了徙硅？”我有些...
開封第一講書人閱讀 162,328評論 0贊 353
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵榜聂，是天一觀的道長。經(jīng)常有香客問我嗓蘑，道長须肆，這世上最難降的妖魔是什么？我笑而不...
開封第一講書人閱讀 58,147評論 1贊 292
?港島之戀（遺憾婚禮）
正文為了忘掉前任桩皿，我火速辦了婚禮豌汇，結(jié)果婚禮上，老公的妹妹穿的比我還像新娘泄隔。我一直安慰自己拒贱，他們只是感情好，可當(dāng)我...
茶點故事閱讀 67,160評論 6贊 388
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布佛嬉。她就那樣靜靜地躺著逻澳，像睡著了一般。火紅的嫁衣襯著肌膚如雪暖呕。梳的紋絲不亂的頭發(fā)上斜做，一...
開封第一講書人閱讀 51,115評論 1贊 296
城市分裂傳說
那天，我揣著相機與錄音湾揽，去河邊找鬼瓤逼。笑死，一個胖子當(dāng)著我的面吹牛库物，可吹牛的內(nèi)容都是我干的霸旗。我是一名探鬼主播，決...
沈念sama閱讀 40,025評論 3贊 417
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼戚揭，長吁一口氣：“原來是場噩夢啊……” “哼诱告！你這毒婦竟也來了？” 一聲冷哼從身側(cè)響起民晒，我...
開封第一講書人閱讀 38,867評論 0贊 274
萬榮殺人案實錄
序言：老撾萬榮一對情侶失蹤精居，失蹤者是張志新（化名）和其女友劉穎，沒想到半個月后镀虐，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體箱蟆，經(jīng)...
沈念sama閱讀 45,307評論 1贊 310
?護林員之死
正文獨居荒郊野嶺守林人離奇死亡沟绪，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點故事閱讀 37,528評論 2贊 332
?白月光啟示錄
正文我和宋清朗相戀三年刮便，在試婚紗的時候發(fā)現(xiàn)自己被綠了。大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片绽慈。...
茶點故事閱讀 39,688評論 1贊 348
活死人
序言：一個原本活蹦亂跳的男人離奇死亡恨旱，死狀恐怖辈毯，靈堂內(nèi)的尸體忽然破棺而出，到底是詐尸還是另有隱情搜贤，我是刑警寧澤谆沃，帶...
沈念sama閱讀 35,409評論 5贊 343
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布，位于F島的核電站仪芒，受9級特大地震影響唁影，放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜掂名，卻給世界環(huán)境...
茶點故事閱讀 41,001評論 3贊 325
男人毒藥：我在死后第九天來索命
文/蒙蒙一据沈、第九天我趴在偏房一處隱蔽的房頂上張望。院中可真熱鬧饺蔑，春花似錦锌介、人聲如沸。這莊子的主人今日做“春日...
開封第一講書人閱讀 31,657評論 0贊 22
一樁弒父案孔祸，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽。三九已至发皿，卻和暖如春崔慧，著一層夾襖步出監(jiān)牢的瞬間，已是汗流浹背雳窟。一陣腳步聲響...
開封第一講書人閱讀 32,811評論 1贊 268
情欲美人皮
我被黑心中介騙來泰國打工尊浪，沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留，地道東北人封救。一個月前我還...
沈念sama閱讀 47,685評論 2贊 368
代替公主和親
正文我出身青樓拇涤，卻偏偏與公主長得像，于是被迫代替她去往敵國和親誉结。傳聞我的和親對象是個殘疾皇子鹅士，可洞房花燭夜當(dāng)晚...
茶點故事閱讀 44,573評論 2贊 353