關于RNAseq
Nature重磅綜述 |關于RNA-seq巫湘,你想知道的都在這
準備工作
先在服務器安裝conda胯努。安裝好通過conda進一步安裝所需要的軟件谢揪。
通過conda安裝fastqc(質(zhì)控), hisat2(比對), multiqc(質(zhì)控合并), samtools(sam文件轉(zhuǎn)bam文件), subread(featureCounts集合在其中)第租。建議先創(chuàng)建一個虛擬環(huán)境探膊。
SRA toolkit軟件(用于下載和處理NCBI的數(shù)據(jù))下載睬愤,在官網(wǎng)下載自己合適的版本(使用conda安裝時失敗了)。
參考代碼:
conda create -n RNAseq #新建一個名為RNAseq的conda環(huán)境
conda activate RNAseq #激活并進入新建的RNAseq環(huán)境
conda install fastqc hisat2 samtools subread multiqc trimmomatic #安裝所需要的軟件
數(shù)據(jù)準備
數(shù)據(jù)下載
在GEO數(shù)據(jù)庫下載PRJNA639275數(shù)據(jù)集统锤,這個數(shù)據(jù)集共有34個樣本(16個新冠患者,17個正常樣本)炭庙,每個樣本測序兩次饲窿。
prefetch SRRXXXXXX -O output # output更換為路徑
prefetch -O output --option-file SRR_Acc_list.txt #批量下載
數(shù)據(jù)處理
sra轉(zhuǎn)fastq格式
for name in SRR120078*; do fastq-dump
--split-files #雙端測序時需要加這個參數(shù)
--gzip #加上-gzip可以使fq文件壓縮,節(jié)約空間
-O /path to save/ # -O 指定輸出位置
/path to input/${name}/*sra; done #遍歷處理所有下載的數(shù)據(jù)
如果是自己的測序文件則需要對比md5碼來檢查數(shù)據(jù)完整性
md5sum *gz > md5.txt && md5sum -c md5.txt
# *gz 任意以gz結(jié)尾的文件
# > 運行結(jié)果保存至
# && 連接符焕蹄,前一命令完成繼續(xù)執(zhí)行下一命令
質(zhì)量評估與質(zhì)控
multiqc是合并報告的一個Python包逾雄,報告解讀
操作說明見官方用戶手冊
multiqc只能識別 *report.html
方法一:使用fastp質(zhì)控
個人感覺fastp比trimmomatic要方便
單端:fastp -i raw.fq -o clean.fq
雙端:fastp -i raw_1.fq -I raw_2.fq -o clean_1.fq -O clean_2.fq
參數(shù):
-l
36 #過濾后的最短序列長度,默認為15
-j
xxx.json 過濾報告文件
-h
xxx.report.html 過濾報告文件
單端測序
for name in *fq;do fastp \
-i ./${name}/*gz \
-o./fastp_clean/clean_${name}.gz \
-l 36 \
-h ./fastp_reports/${name}.html;
done
multiqc ./fastp_reports/ -o ./multiqc_reports/
雙端測序
fastp \
-i R1_fastq.gz \
-I R2_fastq.gz \
-o R1_output \
-O R2_output\
-l 36 \
-j jsonfile \
-h htmlfile
方法二:使用fastqc質(zhì)量評估和trimmomatic質(zhì)控
fastqc -o fastqc_output ./rawdata/*gz
multiqc fastqc_output -o output
# 對所有gz結(jié)尾文件進行質(zhì)控并對當前目錄下所有質(zhì)控報告進行合并
trimmomatic
參考用法說明
trimmomatic SE -phred33 \
input.fq.gz output_fq.gz \
ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 \
LEADING:3 \
TRAILING:3 \
MINLEN:36 #單端
trimmomatic PE -phred33 \
input_forward.fq.gz input_reverse.fq.gz \
output_forward_paired.fq.gz output_forward_unpaired.fq.gz \
output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
LEADING:3 \
TRAILING:3 \
MINLEN:36 #雙端
hisat2比對
準備工作
在官網(wǎng)直接下載對應的基因組索引文件然后解壓即可生成8個索引文件。
wget https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz
tar -zxvf grch38_genome.tar.gz
#grch38/
#grch38/genome.5.ht2
#grch38/genome.2.ht2
#grch38/make_grch38.sh
#grch38/genome.3.ht2
#grch38/genome.4.ht2
#grch38/genome.7.ht2
#grch38/genome.1.ht2
#grch38/genome.6.ht2
#grch38/genome.8.ht2
或者根據(jù)不同需求在genecode下載基因組fa文件和注釋GTF文件,然后在通過hisat2-build去建立索引鸦泳。
比對
hisat2 -p 8 \
-x ref/grch38/genome \
-1 sample_1.fq.gz -2 sample_2.fq.gz \
-S sample.sam \
--summary-file sample.hisat2.summary
-p
: 線程數(shù)
-x
: 基因組索引前綴银锻。所下的基因組索引為多個文件,索引前綴到genome為止做鹰。
-1/-2
: fastq輸入文件击纬。當輸入為單端測序時使用-U 指定輸入。
單端或雙端的輸入都可使用逗號分隔輸入多個樣本誊垢,或使用多次-1 -2 / -U 指定多個輸入掉弛。
e.g.:-U
sample1.fq.gz,sample2.fq.gz -U sample3.fq.gz
-S
: 輸出sam文件路徑。
--summary-file
: 生成summary文件喂走。
sam轉(zhuǎn)bam
使用samtools將sam文件轉(zhuǎn)換為bam文件殃饿,bam文件為二進制文件,能大大壓縮體積
samtools sort -@ 8 -o sample.bam sample.sam
# sort:排序
# -@:線程數(shù)
# -o:輸出bam文件
計數(shù)
試用featureCounts進行計數(shù)芋肠,這一步需要基因組注釋文件乎芳,在genecode下載GTF文件即可。
featureCounts -p -T 8 -a ref/annotation.gtf.gz -o featurecounts.out \
[sample1.bam sample2.bam ...] *.bam
# -p: 此參數(shù)雙端測序時使用
# -T: 線程數(shù)
# -a: 基因組注釋文件
# -o: 輸出文件
# 最后為bam文件帖池,可指定多個輸入
DESeq2差異表達分析
Ensembl ID和gene symbol相互轉(zhuǎn)換
選擇對應物種奈惑,左側(cè)Attributes只選擇Gene stable ID 和 Gene name,然后點擊左上角的resuls睡汹,最后點擊GO導出文件肴甸。
根據(jù)參考文件在計數(shù)文件中添加symbol name。
或用R包實現(xiàn)囚巴。
R語言實現(xiàn)ID轉(zhuǎn)換
BiocManager::install("org.Mm.eg.db") #Mm:小鼠原在,Hs:人
library('org.Mm.eg.db')
matrix <- read.csv("./featurecounts.txt",sep="\t",stringsAsFactors = F,header = F)
colnames(matrix) <- matrix[2,]
matrix <- matrix[-1:-2,]
rownames(matrix) <- 1:nrow(matrix)
for (i in 1:nrow(matrix)){
matrix[i,"Geneid"] <- strsplit(matrix[i,"Geneid"], split = "\\.")[[1]][1]
}
mycounts <- matrix[,-2:-6]
colnames(mycounts)[1] <- "ensembl_id" #列名需要更改
g2s=toTable(org.Mm.egSYMBOL)
g2e=toTable(org.Mm.egENSEMBL)
b=merge(mycounts,g2e,by="ensembl_id",all.x=T)
d=merge(b,g2s,by="gene_id",all.x=T)
d=d[order(d$ensembl_id),]
table(d$symbol)[table(d$symbol)>1]
d=d[!duplicated(d$symbol),] #去重
d=d[match(mycounts$ensembl_id,d$ensembl_id),]
mycounts$symbol <- d$symbol
mycounts<-subset(mycounts,symbol!="NA") #去缺失值
rownames(mycounts) <- mycounts$symbol
R分析前需要準備的兩個文件:
featurecounts計數(shù)文件與分組文件
代碼部分
DEG分析
library(DESeq2)
setwd("E:/R/DEseq2")
mycounts <- read.csv("counts.csv",header= TRUE,row.names = 2)
mycounts <- mycounts[rowSums(mycounts[-1])!=0,]
mycounts <- mycounts[-1]
mymeta <- read.csv("mymeta heathly_vs_covid.csv",stringsAsFactors = T)
colnames(mycounts)==mymeta$id
dds <- DESeqDataSetFromMatrix(countData = mycounts, colData = mymeta,desig=~dex)
dds <- DESeq(dds)
res <- results(dds,contrast = c("dex","covid","Healthy"))
res <- data.frame(res)
DEG <- res[order(res$padj),] #按照padj排序
write.csv(DEG,"DEG.csv",quote = F)
save(DEG,file = "DEG_results.Rdata") #保存R數(shù)據(jù)方便下次繼續(xù)使用
得到的DEG結(jié)果
heatmap
library(pheatmap)
choose_gene=head(rownames(DEG),1000)
choose_matrix=mycounts[choose_gene,]
choose_matrix <- log2(choose_matrix+1)
choose_matrix=t(scale(t(choose_matrix)))
group_list <-c(rep("covid",17),rep("Healthy",19))
ac <- data.frame(dex=group_list)
rownames(ac) <- colnames(choose_matrix)
pheatmap(choose_matrix,filename = "heatmap.png", show_rownames = F, show_colnames = F,
annotation_col = ac, display_numbers = FALSE)
volcano
library(ggpubr)
library(ggthemes)
DEG$change =as.factor(ifelse(DEG$pvalue < 0.05 & DEG$log2FoldChange > 2,'UP',
ifelse(DEG$pvalue < 0.05 & DEG$log2FoldChange < -2,'DOWN',
"NOT")))
DEG$logp = -log10(DEG$pvalue)
table(DEG$change)
DEG$label <- ""
DEG <- DEG[order(DEG$pvalue),]
up_genes <- head(row.names(DEG)[which(DEG$change == "UP")],10)
down_genes <- head(row.names(DEG)[which(DEG$change =="DOWN")],10)
DEG_top10 <- c(as.character(up_genes),as.character(down_genes))
DEG$label[match(DEG_top10,row.names(DEG))] <- DEG_top10
ggscatter(DEG, x = "log2FoldChange", y = "logp",
color = "change",
palette = c("#2f5688","#BBBBBB","#CC0000"),
size = 1,
label = DEG$label,
font.label = 8,
repel = T,
xlab = "log2FoldChange",
ylab ="-log10(adjust p value)") + theme_base() +
geom_hline(yintercept = 1.5,linetype="dashed")+
geom_vline(xintercept = c(-1,1),linetype="dashed")
ggsave("volcano.png",dpi=600)
GO And KEGG
#####install packages#####
BiocManager::install("clusterProfiler")
BiocManager::install("topGO")
BiocManager::install("Rgraphviz")
BiocManager::install("pathview")
BiocManager::install("org.Hs.eg.db")
library(clusterProfiler)
library(topGO)
library(Rgraphviz)
library(pathview)
library(org.Hs.eg.db)
DEG_vector <- (DEG$padj < 0.005 & DEG$log2FoldChange > 3)|(DEG$padj < 0.005 & DEG$log2FoldChange < -3) #Padj根據(jù)需要改為pvalue,所有數(shù)值可以根據(jù)需求更改彤叉。
DEG.sign <- DEG[DEG_vector,]
DEG.entrez_id = mapIds(x = org.Hs.eg.db,
keys = rownames(DEG.sign),
keytype = "SYMBOL",
column = "ENTREZID")
DEG.sign$ENTREZID <- DEG.entrez_id
gene_up <- DEG.sign[DEG.sign$change == 'UP','ENTREZID']
gene_down <- DEG.sign[DEG.sign$change == 'DOWN','ENTREZID']
source('kegg_and_go_up_and_down.R') #提前準備好kegg_and_go_up_and_down.R文件
run_go(gene_up,gene_down,pro='convid_VS_healthy')
run_kegg(gene_up,gene_down,pro='convid_VS_healthy')
kegg_and_go_up_and_down.R
(via jimmy)
## KEGG pathway analysis
### 做KEGG數(shù)據(jù)集超幾何分布檢驗分析庶柿,重點在結(jié)果的可視化及生物學意義的理解。
run_kegg <- function(gene_up,gene_down,geneList=F,pro='test'){
gene_up=unique(gene_up)
gene_down=unique(gene_down)
gene_diff=unique(c(gene_up,gene_down))
### over-representation test
# 下面把3個基因集分開做超幾何分布檢驗
# 上調(diào)基因集秽浇。
kk.up <- enrichKEGG(gene = gene_up,
organism = 'hsa',
#universe = gene_all,
pvalueCutoff = 0.5,
qvalueCutoff = 0.5)
head(kk.up)[,1:6]
kk=kk.up
dotplot(kk)
kk=DOSE::setReadable(kk, OrgDb='org.Hs.eg.db',keyType = "ENTREZID")
write.csv(kk@result,paste0("./kegg_results/",pro,'_kk.up.csv'))
# 下調(diào)基因集浮庐。
kk.down <- enrichKEGG(gene = gene_down,
organism = 'hsa',
#universe = gene_all,
pvalueCutoff = 0.5,
qvalueCutoff =0.5)
head(kk.down)[,1:6]
kk=kk.down
dotplot(kk)
kk=DOSE::setReadable(kk, OrgDb='org.Hs.eg.db',keyType = "ENTREZID")
write.csv(kk@result,paste0("./kegg_results/",pro,'_kk.down.csv'))
# 上下調(diào)合并后的基因集。
kk.diff <- enrichKEGG(gene = gene_diff,
organism = 'hsa',
pvalueCutoff = 0.5)
head(kk.diff)[,1:6]
kk=kk.diff
dotplot(kk)
kk=DOSE::setReadable(kk, OrgDb='org.Hs.eg.db',keyType = "ENTREZID")
write.csv(kk@result,paste0("./kegg_results/",pro,'_kk.diff.csv'))
kegg_diff_dt <- as.data.frame(kk.diff)
kegg_down_dt <- as.data.frame(kk.down)
kegg_up_dt <- as.data.frame(kk.up)
down_kegg<-kegg_down_dt[kegg_down_dt$pvalue<0.1,];down_kegg$group=-1
up_kegg<-kegg_up_dt[kegg_up_dt$pvalue<0.1,];up_kegg$group=1
#畫圖設置, 這個圖很丑柬焕,大家可以自行修改审残。
g_kegg=kegg_plot(up_kegg,down_kegg)
print(g_kegg)
ggsave(g_kegg,filename = paste0("./kegg_results/",pro,'_kegg_up_down.png'))
if(geneList){
### GSEA
## GSEA算法跟上面的使用差異基因集做超幾何分布檢驗不一樣。
kk_gse <- gseKEGG(geneList = geneList,
organism = 'hsa',
nPerm = 1000,
minGSSize = 20,
pvalueCutoff = 0.9,
verbose = FALSE)
head(kk_gse)[,1:6]
gseaplot(kk_gse, geneSetID = rownames(kk_gse[1,]))
gseaplot(kk_gse, 'hsa04110',title = 'Cell cycle')
kk=DOSE::setReadable(kk_gse, OrgDb='org.Hs.eg.db',keytype='ENTREZID')
tmp=kk@result
write.csv(kk@result,paste0(pro,'_kegg.gsea.csv'))
# 這里如果找不到顯著下調(diào)的通路击喂,可以選擇調(diào)整閾值维苔,或者其它。
down_kegg<-kk_gse[kk_gse$pvalue<0.01 & kk_gse$enrichmentScore < 0,];down_kegg$group=-1
up_kegg<-kk_gse[kk_gse$pvalue<0.01 & kk_gse$enrichmentScore > 0,];up_kegg$group=1
g_kegg=kegg_plot(up_kegg,down_kegg)
print(g_kegg)
ggsave(g_kegg,filename = paste0("./kegg_results/",pro,'_kegg_gsea.png'))
}
}
### GO database analysis
### 做GO數(shù)據(jù)集超幾何分布檢驗分析懂昂,重點在結(jié)果的可視化及生物學意義的理解介时。
run_go <- function(gene_up,gene_down,pro='test'){
gene_up=unique(gene_up)
gene_down=unique(gene_down)
gene_diff=unique(c(gene_up,gene_down))
g_list=list(gene_up=gene_up,
gene_down=gene_down,
gene_diff=gene_diff)
# 因為go數(shù)據(jù)庫非常多基因集,所以運行速度會很慢。
if(T){
go_enrich_results <- lapply( g_list , function(gene) {
lapply( c('BP','MF','CC') , function(ont) {
cat(paste('Now process ',ont ))
ego <- enrichGO(gene = gene,
#universe = gene_all,
OrgDb = org.Hs.eg.db,
ont = ont ,
pAdjustMethod = "BH",
pvalueCutoff = 0.5,
qvalueCutoff = 0.5,
readable = TRUE)
print( head(ego) )
return(ego)
})
})
save(go_enrich_results,file =paste0(pro, '_go_enrich_results.Rdata'))
}
# 只有第一次需要運行沸柔,就保存結(jié)果哈循衰,下次需要探索結(jié)果,就載入即可褐澎。
load(file=paste0(pro, '_go_enrich_results.Rdata'))
n1= c('gene_up','gene_down','gene_diff')
n2= c('BP','MF','CC')
for (i in 1:3){
for (j in 1:3){
fn=paste0("./go_enrich_results/",pro, '_dotplot_',n1[i],'_',n2[j],'.png')
cat(paste0(fn,'\n'))
# png(fn,res=150,width = 1080)
# print( dotplot(go_enrich_results[[i]][[j]] ))
#dev.off()
dotplot(go_enrich_results[[i]][[j]] )
ggsave(fn)
}
}
}
kegg_plot <- function(up_kegg,down_kegg){
dat=rbind(up_kegg,down_kegg)
colnames(dat)
dat$pvalue = -log10(dat$pvalue)
dat$pvalue=dat$pvalue*dat$group
dat=dat[order(dat$pvalue,decreasing = F),]
g_kegg<- ggplot(dat, aes(x=reorder(Description,order(pvalue, decreasing = F)), y=pvalue, fill=group)) +
geom_bar(stat="identity") +
scale_fill_gradient(low="blue",high="red",guide = FALSE) +
scale_x_discrete(name ="Pathway names") +
scale_y_continuous(name ="log10P-value") +
coord_flip() + theme_bw()+theme(plot.title = element_text(hjust = 0.5))+
ggtitle("Pathway Enrichment")
}