寫在前面:
整個(gè)流程是首次從軟件安裝-數(shù)據(jù)下載-上游分析-下游分析钦椭。
代碼沒毛病捡鱼,奇怪的是salmon比對(duì)后憨愉,mapping~15%(不正常)匆骗,一般來說期望mapping應(yīng)該是70-90%。這導(dǎo)致在下游分析時(shí)椒涯,找到的差異表達(dá)基因很少柄沮,雖然火山圖勉強(qiáng)畫出,但是這些基因功能注釋及信號(hào)通路沒辦法繼續(xù)废岂。因此我決定再分析一次祖搓,對(duì)其中的原因進(jìn)行探究,比較兩次分析的過程和結(jié)果湖苞。
1. 準(zhǔn)備工作之軟件安裝
1.1 下載miniconda來安裝軟件
#下載miniconda
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh
#更改鏡像
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
conda config --set show_channel_urls yes
1.2 先創(chuàng)建一個(gè)相對(duì)的環(huán)境,使用python2
conda create rna python=2
下載軟件
conda install -y fastqc trim-galore bwa bowtie salmon subread sra-tools
數(shù)據(jù)來源
Temporal dynamics of gene expression and histone marks at the Arabidopsis shoot meristem during flowerin
2. 下載數(shù)據(jù)及參考基因組
2.1 下載序列
創(chuàng)建文件夾放數(shù)據(jù)
mkdir -p project/rna
獲取樣本信息
wget http://www.ebi.ac.uk/arrayexpress/files/E-MTAB-5130/E-MTAB-5130.sdrf.txt
獲得樣本下載鏈接拯欧,并放入id.txt的文檔
tail -n +2 E-MTAB-5130.sdrf.txt | cut -f 33 >id.txt
寫腳本批量下載
cat >download.sh
cat id.txt|while read id;do (wget ${id});done
nohup bash download.sh &
2.2 下載參考基因組
#創(chuàng)建文件夾
mkdir reference && cd reference
#下載gff3(注釋基因組)和gtf(注釋基因)
nohup wget ftp://ftp.ensemblgenomes.org/pub/release-42/plants/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.42.gff3.gz &
nohup wget ftp://ftp.ensemblgenomes.org/pub/release-42/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.42.gtf.gz &
下載參考基因組dna和cdna,我下載到了本地的财骨。所以這里沒代碼镐作。注意的是:擬南芥中沒有primary assemble, 閱讀READMRE得知:如果沒有primary assemle就用toplevel
查閱其他類型的參考序列含義
一般參考基因組要下載primary assemble
dna_rm:有屏蔽的基因組隆箩,用RepeatMasker tool可以檢測到的分散的重復(fù)序列及復(fù)雜性較低的區(qū)域该贾,用“N”代替這些堿基
dna_sm:軟屏蔽,所有重復(fù)以及復(fù)雜性低的序列用小寫字母代替
mt 線粒體
**pt **葉綠體
toplevel :所有組裝到的序列標(biāo)簽信息捌臊,也包含沒有組裝到的染色體上的染色體杨蛋,區(qū)域以及用N代替單倍型/批次區(qū)域
primary assemble :除不包含單倍型及批次序列外與toplevel相同。用來序列相似性分析的最佳選擇理澎,單倍型或是批次序列的存在會(huì)對(duì)相似性結(jié)果產(chǎn)生影響逞力。
構(gòu)建索引
# salmon index --help 來查看salmon構(gòu)建索引的方法
# -t [--transcripts] arg 轉(zhuǎn)錄本fasta 文件
# 注意,其他軟件都是利用參考基因組(dna)的fa文件建立索引糠爬,這個(gè)軟件用的cdna的fa數(shù)據(jù)建立索引
# -i [--index] arg salmon的索引
# k-mers 長度31寇荧,大約大于等于75bp
salmon index -t Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz -i athal_index_salmon -k 31
不需要比對(duì)直接定量分析
#將下載鏈接信息中的ID抽出來,
#-d '/'指定分隔符為‘/‘
#-f8提取第8列
#ID號(hào)本身是以順序排列所以不用sort执隧,直接uniq
#然后定向到一個(gè)文本
cat id.txt |cut -d"/" -f8|uniq >sample.txt
mkdir quant_salmon && cd quant_salmon
#寫腳本
# salmon index 中的參數(shù):
#-l 字符串類型描述文庫類型揩抡;
#-i salmon index(索引);
#-1 文件中包含#1匹配殴玛;
#-2 文件中包含#2匹配捅膘;
#--validateMappings產(chǎn)生不對(duì)齊比對(duì),最佳或臨近對(duì)齊滚粟;
-p 線程數(shù) 默認(rèn)為2
#-o 輸出路徑
cat >quant_salmon.sh
index=/Users/wudandan/project/rna/reference/athal_index_salmon/
quant=/Users/wudandan/project/rna/quant_salmon/
cat sample.txt|while read id; do salmon quant -i $index -l IU -1 ${id}_1.fastq.gz -2 ${id}_2.fastq.gz -p 1 -o $quant/${id}_quant
done
nohup bash quant_salmon.sh &
在-MTAB-5130.sdrf.txt中提取分組信息寻仗,為DESeq2分析準(zhǔn)備
tail -n +2 E-MTAB-5130.sdrf.txt | cut -f 32,36|uniq >sampleTable.txt
下述流程在Rstudio中完成
1. 配置鏡像,下載軟件凡壤,下載Bioconductor,加載安裝包
rm(list = ls())
options()$repos
options()$BioC_mirror
options(BioC_mirror="https://mirrors.ustc.edu.cn/bioc/")
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
options()$repos
options()$BioC_mirror
install.packages("tidyverse") ; library(tidyverse)
install.packages("optparse") ; library(optparse)
install.packages("UpSetR") ; library(UpSetR)
install.packages("rjson") ; library(rjson)
# https://bioconductor.org/packages/release/bioc/html/GEOquery.html
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("tximport","KEGG.db","DESeq2","edgeR" ,"org.At.tair.db","pheatmap","AnnotationHub"),ask = F,update = F)
BiocManager::install(c("clusterProfiler","limma","DOSE","GenomicFeatures","RColorBrewer"),ask = F,update = F)
2.準(zhǔn)備quant files和 基因名-轉(zhuǎn)錄本名字相關(guān)的數(shù)據(jù)框tx2genes
rm(list = ls())
options(stringsAsFactors = F)
# 得到salmon_quant路徑
getwd()
dir=getwd()
# *.sf文件為得到表達(dá)矩陣信息
files=list.files(pattern = "*sf",dir,recursive = T)
files=file.path(dir,files)
all(file.exists(files))
# 安裝Bioconductor包中的Arabidopsis thaliana對(duì)基因組注釋包
BiocManager::install("TxDb.Athaliana.BioMart.plantsmart28", version = "3.8")
library(TxDb.Athaliana.BioMart.plantsmart28)
ls('package:TxDb.Athaliana.BioMart.plantsmart28')
a=TxDb.Athaliana.BioMart.plantsmart28
# 查看包有哪些列
columns(a)
# keys返回這個(gè)數(shù)據(jù)包可以當(dāng)作關(guān)鍵字查找的列署尤,
# keytypes返回的列等于或少于columns返回的結(jié)果耙替,不是所有的列都可以當(dāng)作對(duì)象查找
k=keys(a,keytype = "GENEID")
# select可以根據(jù)你提供的key取查找注釋數(shù)據(jù)庫,返回你需要的columns信息
df=select(a, keys=k, keytype = "GENEID",columns = "TXNAME")
# 檢查得到的矩陣曹体,得到列名為“GENEID”和“TXNAME”的兩列
head(df)
# 將“TXNAME”放在第一列俗扇,“GENEID”放在第二列
tx2gene=df[,2:1]
head(tx2gene)
#### 與上面的步驟一樣,都為得到Arabidopsis thaliana對(duì)基因組注釋包箕别,提取注釋信息铜幽,推薦這種,可以安裝數(shù)據(jù)庫中已包含的所有物種都基因注釋包串稀。
#### 安裝AnnotationHub注釋信息數(shù)據(jù)庫
# BiocManager::install('AnnotationHub')
#### 加載并創(chuàng)建AnnotationHub對(duì)象
# library(AnnotationHub)
# ah <- AnnotationHub()
#### 用query來查找數(shù)據(jù)庫中'Arabidopsis thaliana'的相關(guān)信息
# ath <- query(ah,'Arabidopsis thaliana')
#### 獲得'Arabidopsis thaliana'最新注釋ID為AH52247
# ath_tx <- ath[['AH52247']]
#### 與上述相同
# columns(ath_tx)
# k <- keys(ath_tx,keytype = "GENEID")
# df <- select(ath_tx, keys=k, keytype = "GENEID",columns = "TXNAME")
# tx2gene <- df[,2:1]
3. 將salmon_count得到的數(shù)據(jù)合并作為inputdata
# tximport
library('tximport')
library('readr')
txi=tximport(files ,type = "salmon",tx2gene = tx2gene)
head(txi)
head(txi$length)
files
# 加載stringr包,為了得到將txi的列名定義為樣本名
library(stringr)
# 以'\'為分隔符除抛,將含有quant.sf的路徑分割,并提第七列---取樣本名(ERR1698194_quant)
t1=sapply(strsplit(files,'\\/'),function(x)x[7])
# txi 的列counts的列名為樣本名,
colnames(txi$counts)=sapply(strsplit(t1,'_'),function(x)x[1])
tmp=txi$counts
head(tmp)
# 1為行母截,2為列到忽,將列都轉(zhuǎn)化為整數(shù)
exprSet=apply(tmp,2, as.integer)
rownames(exprSet)=rownames(tmp)
head(exprSet)
dim(exprSet)
save(exprSet,file=paste0('quants-exprSet.Rdata'))
表達(dá)矩陣有問題
最后發(fā)現(xiàn)是txilength都沒有列名喘漏,于是想當(dāng)然的加上與txi$counts一樣的列名
colnames(txi$abundance)=colnames(txi$counts)
colnames(txi$length)=colnames(txi$counts)
再構(gòu)建dds就不會(huì)報(bào)錯(cuò)了
dds<-DESeqDataSetFromTximport(txi,sampleTable,design = ~group_list)
均一化
suppressMessages(dds2 <- DESeq(dds))
##直接用DESeq函數(shù)即可
## 下面的代碼如果你不感興趣不需要運(yùn)行,免得誤導(dǎo)你
## 就是看看normalization前面的數(shù)據(jù)分布差異
rld <- rlogTransformation(dds2) ## 得到經(jīng)過DESeq2軟件normlization的表達(dá)矩陣华烟!
exprSet_new=assay(rld)
par(cex = 0.7)
n.sample=ncol(exprSet)
if(n.sample>40) par(cex = 0.5)
cols <- rainbow(n.sample*1.2)
par(mfrow=c(2,2))
boxplot(exprSet, col = cols,main="expression value",las=2)
boxplot(exprSet_new, col = cols,main="expression value",las=2)
hist(exprSet)
hist(exprSet_new)
制定比較矩陣翩迈,共三組,分別用day0與其他三組進(jìn)行比較垦江,得到DESeq差異分析結(jié)果
resultsNames(dds2)
res0vs1 <- results(dds2, contrast = c("group_list","day1","day0"))
res0vs2 <- results(dds2, contrast = c("group_list","day2","day0"))
res0vs3 <- results(dds2, contrast = c("group_list","day3","day0"))
從DESeq差異分析結(jié)果中提取矩陣,包含:mean帽馋,
log2FoldChange,IfcSE, stat, pvalue, padj搅方,用來畫圖
resOrdered1 <- res0vs1[order(res0vs1$padj),]
resOrdered1=as.data.frame(resOrdered1)
head(resOrdered1)
resOrdered2 <- res0vs2[order(res0vs2$padj),]
resOrdered2=as.data.frame(resOrdered2)
head(resOrdered1)
resOrdered3 <- res0vs3[order(res0vs3$padj),]
resOrdered3=as.data.frame(resOrdered3)
head(resOrdered3)
先畫熱圖比吭,在畫圖之前要去掉矩陣中的NA,
去掉NA前基因?yàn)?499姨涡,去掉后只含有表達(dá)量的基因數(shù)量減少衩藤。
resOrdered1=na.omit(resOrdered1)
choose_gene1=rownames(resOrdered1)
choose_matrix1=exprSet[choose_gene1,]
#歸一化
choose_matrix1=t(scale(t(choose_matrix1)))
#將熱圖保存在本地
pheatmap(choose_matrix1,filename = "DEG1_all_heatmap.png")
resOrdered2=na.omit(resOrdered2)
choose_gene2=rownames(resOrdered2)
choose_matrix2=exprSet[choose_gene2,]
choose_matrix2=t(scale(t(choose_matrix2)))
pheatmap(choose_matrix2,filename = "DEG2_all_heatmap.png")
resOrdered3=na.omit(resOrdered3)
choose_gene3=rownames(resOrdered3)
choose_matrix3=exprSet[choose_gene3,]
choose_matrix3=t(scale(t(choose_matrix3)))
pheatmap(choose_matrix3,filename = "DEG3_all_heatmap.png")
DEG1_all_heatmap.png與DEG2_all_heatmap.png類似涛漂,DEG3_all_heatmap.png略有不同
畫個(gè)火山圖
library("org.At.tair.db")
library("KEGG.db")
library("clusterProfiler")
library(ggplot2)
resOrdered1$gene_id=rownames(resOrdered1)
id2symbol=toTable(org.At.tairSYMBOL)
#rownames會(huì)變?yōu)閿?shù)字
resOrdered1=merge(resOrdered1,id2symbol,by='gene_id')
DEG=resOrdered1
DEG_analysis <- function(DEG,prefix="test"){
colnames(DEG)=c('gene_id' ,'baseMean','logFC','lfcSE','stat','pvalue' , 'P.Value' , 'symbol')
## 下面我都用padj赏表,拋棄了pvalue
## 我不想修改我以前的代碼,所以我更改了這個(gè)列名
############################################################
############ DEG filter #############################
############################################################
## please keep in mind that I only keep the genes with symbol.
DEG$symbol=as.character(DEG$symbol)
#作用匈仗?瓢剿??
DEG_filter=DEG[nchar(DEG$symbol)>1,]
DEG_filter=DEG_filter[!is.na(DEG_filter$symbol),]
############################################################
############ volcano plot #############################
############################################################
logFC_Cutof <- with(DEG_filter,mean(abs( logFC)) + 2*sd(abs( logFC)) )
logFC_Cutof = 0
## 這里我不準(zhǔn)備用logFC來挑選差異基因悠轩,僅僅是用padj即可
DEG_filter$change = as.factor(ifelse(DEG_filter$P.Value < 0.05 &
abs(DEG_filter$logFC) > logFC_Cutof,
ifelse(DEG_filter$logFC > logFC_Cutof ,'UP','DOWN'),'NOT'))
this_tile <- paste0('Cutoff for logFC is ',round(logFC_Cutof,3),
'\nThe number of up gene is ',nrow(DEG_filter[DEG_filter$change =='UP',]) ,
'\nThe number of down gene is ',nrow(DEG_filter[DEG_filter$change =='DOWN',])
)
g_volcano = ggplot(data=DEG_filter, aes(x=logFC, y=-log10(P.Value), color=change)) +
geom_point(alpha=0.4, size=1.75) +
theme_set(theme_set(theme_bw(base_size=20)))+
xlab("log2 fold change") + ylab("-log10 p-value") +
ggtitle( this_tile ) +
theme(plot.title = element_text(size=15,hjust = 0.5))+
scale_colour_manual(values = c('blue','black','red')) ## corresponding to the levels(res$change)
print(g_volcano)
https://github.com/shenmengyuan/RNA_seq_Biotrainee
比對(duì)建索引參考:
http://www.reibang.com/p/071c1757ded1
salmon_quant:
https://salmon.readthedocs.io/en/latest/salmon.html#quantifying-in-mapping-based-mode
用tximport將轉(zhuǎn)錄組數(shù)據(jù)導(dǎo)入Rstudio:
https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#import-transcript-level-estimates
注釋包:
http://www.reibang.com/p/ae94178918bc
擬南芥數(shù)據(jù)庫:
https://www.arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FGenes%2FTAIR10_genome_release