RNA-seq小考核

教程對應(yīng)B站:【生信技能樹】轉(zhuǎn)錄組測序數(shù)據(jù)分析
鏈接:https://www.bilibili.com/video/av28453557?from=search&seid=17370292426064271948
前提了解Linux基本操作和R語言

題目

鏈接:http://www.bio-info-trainee.com/3920.html

準(zhǔn)備工作

1、參考基因組及注釋文件下載地址
參考基因組存儲網(wǎng)站三個:ENSEMBL、UCSC倍踪、NCBI

基因組注釋文件:gencode數(shù)據(jù)庫烦秩、ENSEMBL

注釋文件可以告訴我們基因組每條染色體上哪些序列是編碼蛋白的基因,哪些是非編碼基因倔毙,外顯子、內(nèi)含子、UTR位置等等瘫拣。

閱讀文獻(xiàn)

2、找到文章的測序數(shù)據(jù)
2018年12月的NC文章:Spatially and functionally distinct subclasses of breast cancer-associated fibroblasts revealed by single cell RNA sequencing()
文章研究了乳腺癌小鼠模型的768個間充質(zhì)細(xì)胞轉(zhuǎn)錄組的單細(xì)胞RNA測序(scRNA-seq)告喊,定義了三個不同的CAF(癌癥相關(guān)成纖維細(xì)胞)亞群麸拄。


3派昧、下載測序數(shù)據(jù)
GEO鏈接GSE111229

  • GEO Platform (GPL)
  • GEO Sample (GSM)
  • GEO Series (GSE)
  • GEO Dataset (GDS)

關(guān)于GEO數(shù)據(jù)庫:一篇文章可以有一個或者多個GSE數(shù)據(jù)集,一個GSE里面可以有一個或者多個GSM樣本拢切。多個研究的GSM樣本可以根據(jù)研究目的整合為一個GDS蒂萎,不過GDS本身用的很少。而每個數(shù)據(jù)集都有著自己對應(yīng)的芯片/平臺淮椰,就是GPL五慈。

原始數(shù)據(jù)鏈接SRP133642

關(guān)于SRA數(shù)據(jù)庫:用于存儲二代測序的原始數(shù)據(jù),包括454/illumina/SOLiD/IonTorrent/Helicos/CompleteGenomics

  • Studies 研究課題 SRP
  • Experiments 實驗設(shè)計 SRX
  • Runs 測序結(jié)果集 SRR
  • Samples 樣本信息 SRS
# 以下練習(xí)需要6個樣本主穗,我們從 SRR_Acc_List.txt 中選擇6個保存為 list.txt
cat list.txt |while read id ;do (prefetch  ${id} &);done


sra->fastq->bam->counts

4泻拦、任意挑選6個樣本走標(biāo)準(zhǔn)的RNA-SEQ上游流程

# source activate rna 創(chuàng)建小環(huán)境運行
(rna) yxu 21:16:47 ~/project/rnatest
$ tree -L 1
.
├── 1.sra_data
├── 2.raw_fq
├── 3.fastq_qc
├── 4.mapping
└── 5.counts

5 directories, 0 files

sra->fastq

(rna) yxu 21:29:34 ~/project/rnatest/1.sra_data
$ ll
total 29M
-rw-rw-r-- 1 yxu yxu 4.3M Mar 14 09:39 SRR6791080.sra
-rw-rw-r-- 1 yxu yxu 8.8M Mar 14 09:39 SRR6791081.sra
-rw-rw-r-- 1 yxu yxu 5.1M Mar 14 09:39 SRR6791082.sra
-rw-rw-r-- 1 yxu yxu 5.9M Mar 14 09:39 SRR6791083.sra
-rw-rw-r-- 1 yxu yxu 1.8M Mar 14 09:39 SRR6791084.sra
-rw-rw-r-- 1 yxu yxu 3.4M Mar 14 09:39 SRR6791085.sra
# 查看文件大小和網(wǎng)站上的大小是否一致

## step1: fastq-dump
ls ~/public/project/RNA/airway/sra/*  |while read id;do ( nohup fastq-dump --gzip --split-3 -O ./ ${id} & );done

(rna) yxu 21:35:46 ~/project/rnatest/2.raw_fq
$ ll
total 47M
-rw-rw-r-- 1 yxu yxu 6.8M May 14 23:50 SRR6791080.fastq.gz
-rw-rw-r-- 1 yxu yxu  15M May 14 23:51 SRR6791081.fastq.gz
-rw-rw-r-- 1 yxu yxu 8.2M May 14 23:51 SRR6791082.fastq.gz
-rw-rw-r-- 1 yxu yxu 9.5M May 14 23:51 SRR6791083.fastq.gz
-rw-rw-r-- 1 yxu yxu 2.4M May 14 23:52 SRR6791084.fastq.gz
-rw-rw-r-- 1 yxu yxu 5.0M May 14 23:52 SRR6791085.fastq.gz

## step2: check quality of sequence reads 
ls *gz |xargs fastqc -t 10
multiqc ./ 
## step3:filter the bad quality reads and remove adaptors
(rna) yxu 21:35:46 ~/project/rnatest/2.raw_fq
$ cat qc.sh
cat id | while read id; do (trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 -o ./ ${id});done
# id為6個樣本的id號文件,自己創(chuàng)建

(rna) yxu 21:35:46 ~/project/rnatest/2.raw_fq
$ nohup bash qc.sh &

(rna) yxu 21:53:50 ~/project/rnatest/3.fastq_qc/clean
$ ll *gz
-rw-rw-r-- 1 yxu yxu 6.7M May 14 00:33 SRR6791080_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu  15M May 14 00:33 SRR6791081_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu 8.2M May 14 00:33 SRR6791082_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu 9.5M May 14 00:33 SRR6791083_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu 2.4M May 14 00:33 SRR6791084_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu 5.0M May 14 00:33 SRR6791085_trimmed.fq.gz


fastq->bam

(rna) yxu 21:59:51 ~/project/rnatest/3.fastq_qc/clean
$ cat mapping.sh
ls *gz|cut -d"_" -f 1 |sort -u |while read id;do hisat2 -p 10 -x /home/yxu/refernce/mm10/index/mm10/genome -U ${id}_trimmed.fq.gz |samtools sort -@ 2 -o ~/project/rnatest/4.mapping/${id}.bam;done

(rna) yxu 22:01:25 ~/project/rnatest/4.mapping
$ ls *bam
SRR6791080.bam  SRR6791082.bam  SRR6791084.bam
SRR6791081.bam  SRR6791083.bam  SRR6791085.bam

(rna) yxu 22:05:40 ~/project/rnatest/4.mapping
$ samtools view SRR6791080.bam | less -SN | head -n 10
# 查看bam文件

(rna) yxu 22:08:56 ~/project/rnatest/4.mapping
$ cat stat.sh
# 處理bam文件
ls *.bam |xargs -i samtools index {}
ls *.bam |while read id ;do ( nohup samtools flagstat -@ 1 $id >  $(basename ${id} ".bam").flagstat  & );done
(rna) yxu 23:00:05 ~/project/rnatest/4.mapping/stat
$ cat SRR6791080.flagstat
198223 + 0 in total (QC-passed reads + QC-failed reads)
268 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
134843 + 0 mapped (68.03% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
# 統(tǒng)計輸入文件的相關(guān)數(shù)據(jù)并將這些數(shù)據(jù)輸出至屏幕顯示


bam->counts

(rna) yxu 23:02:30 ~/project/rnatest/5.counts
$ cat counts.sh
featureCounts -T 5 -p -t exon -g gene_id -a /home/yxu/refernce/mm10/Mus_musculus.GRCm38.96.chr.gtf -o  ~/project/rnatest/5.counts/counts /home/yxu/project/rnatest/4.mapping/*.bam
# 得到counts矩陣


表達(dá)矩陣的多種形式

5忽媒、理解RNA-SEQ上游流程得到的表達(dá)矩陣的多種形式

轉(zhuǎn)錄組數(shù)據(jù)定量的方式争拐,都是對表達(dá)量進(jìn)行標(biāo)準(zhǔn)化的方法。
標(biāo)準(zhǔn)化的對象就是基因長度與測序深度晦雨。

  • counts矩陣
    每個基因比對到 reads數(shù)量
  • rpm矩陣
    去除了每個細(xì)胞測序數(shù)據(jù)量(文庫大小)差異

  • rpkm矩陣
    去除了基因長度效益

  • tpm矩陣

參考資料:A short script to calculate RPKM and TPM from featureCounts output

差異分析

6架曹、任取6個樣本表達(dá)矩陣隨意分成2組走差異分析代碼
代碼參考:https://github.com/jmzeng1314/GEO/tree/master/airway_RNAseq
需要匯總PCA,heatmap,火山圖,MA圖金赦,CV圖等等

rm(list = ls())
options(stringAsFactors = F)
# 測試數(shù)據(jù)表達(dá)矩陣下載
suppressMessages(library(GEOquery))
gse111229 <- getGEO("GSE111229", destdir=".", AnnotGPL = F, getGPL = F) 
Gset <- gse111229[[1]]
pdata <- pData(Gset)
gse111229 <- gse111229$GSE111229_series_matrix.txt.gz
exprSet = as.data.frame(exprs(gse111229))
# 讀取counts矩陣
mydata <- read.table("counts", header = TRUE,quote = '\t',skip = 1)
sampleNames <- c('SRR6791080','SRR6791081','SRR6791082','SRR6791083','SRR6791084','SRR6791085')
names(mydata)[7:12] <- sampleNames
head(mydata)
# 取Geneid和后6列樣本
countMatrix <- as.matrix(mydata[7:12])
rownames(countMatrix) <- mydata$Geneid
head(countMatrix)
##                    SRR6791080 SRR6791081 SRR6791082 SRR6791083 SRR6791084
## ENSMUSG00000102693          0          0          0          0          0
## ENSMUSG00000064842          0          0          0          0          0
## ENSMUSG00000051951          0          0          0          0          0
## ENSMUSG00000102851          0          0          0          0          0
## ENSMUSG00000103377          0          0          0          0          0
## ENSMUSG00000104017          0          0          0          0          0
##                    SRR6791085
## ENSMUSG00000102693          0
## ENSMUSG00000064842          0
## ENSMUSG00000051951          0
## ENSMUSG00000102851          0
## ENSMUSG00000103377          0
## ENSMUSG00000104017          0
save(countMatrix,file = "expr.Rdata")
# 將六個樣本分兩組
group_list <- c('A','A','A','B','B','B')
condition <- factor(group_list)


  • DEGseq
## step1: 把count矩陣轉(zhuǎn)化為 DESeq2 的數(shù)據(jù)格式
suppressMessages(library(DESeq2))
dds <- DESeqDataSetFromMatrix(countMatrix, DataFrame(condition), design= ~ condition)

# 過濾掉那些 count 結(jié)果為0的數(shù)據(jù)音瓷,這些基因沒有表達(dá)
dds <- dds[rowSums(counts(dds)) > 1,]

## step2: 歸一化
suppressMessages(dds2 <- DESeq(dds)) 
rld <- rlogTransformation(dds2)
exprSet_new=assay(rld)

resultsNames(dds2)
## [1] "Intercept"        "condition_B_vs_A"

## step3:提取差異分析結(jié)果
res <- results(dds2)
write.table(res,"result.csv", sep = ",", row.names = TRUE)
summary(res)
## 
## out of 9251 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 4, 0.043%
## LFC < 0 (down)     : 0, 0%
## outliers [1]       : 478, 5.2%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# MA圖
# M表示log fold change,衡量基因表達(dá)量變化夹抗,上調(diào)還是下調(diào)绳慎。A表示每個基因的count的均值
suppressMessages(library(geneplotter))
plotMA(res, main="DESeq2", ylim=c(-1,1))
unnamed-chunk-11-1.png

# Heatmap圖
sum(res$padj < 0.1, na.rm=TRUE)

suppressMessages(library("pheatmap"))
select <- order(rowMeans(counts(dds2,normalized=TRUE)),decreasing=TRUE)[1:1000]
nt <- normTransform(dds2) # defaults to log2(x+1)
log2.norm.counts <- assay(nt)[select,]
df <- as.data.frame(colData(dds2))
# pdf('heatmap1000.pdf',width = 6, height = 7)  
pheatmap(log2.norm.counts, cluster_rows=TRUE, show_rownames=FALSE,
         cluster_cols=TRUE, annotation_col=df)
# dev.off()

P-value:p值,是統(tǒng)計學(xué)檢驗變量漠烧,代表差異顯著性杏愤,一般認(rèn)為P < 0.05 為顯著, P <0.01為非常顯著已脓。其含義為:由抽樣誤差導(dǎo)致樣本間差異的概率小于0.05 或0.01珊楼。
Padj:p adjust,轉(zhuǎn)錄組測序的差異表達(dá)分析是對大量的基因表達(dá)值進(jìn)行的獨立統(tǒng)計假設(shè)檢驗度液,存在假陽性問題厕宗,因此引入Padj對顯著性P值(P adjust)進(jìn)行校正。Padj是對P-value的再判斷堕担,篩選更為嚴(yán)格已慢。
Fold Change:適用于兩分組分析,表示兩樣本(組)間表達(dá)量的比值霹购。
log2FoldChange:對Fold Change取log2佑惠,一般默認(rèn)表達(dá)相差2倍以上是有意義的,可以根據(jù)情況適當(dāng)放寬至1.5/1.2,但最好不要低于1.2倍膜楷。

unnamed-chunk-11-2.png

# 火山圖
suppressMessages(library(ggplot2))
res$color <- ifelse(res$padj<0.05 & abs(res$log2FoldChange)>= 2,ifelse(res$log2FoldChange > 2,'red','blue'),'gray')
color <- c(red = "red",gray = "gray",blue = "blue")
res = as.data.frame(res)

#pdf('火山圖.pdf',width = 6, height = 7)
ggplot(res, aes(log2FoldChange, -log10(padj), col = color)) +
  geom_point() +
  theme_bw() +
  scale_color_manual(values = color) +
  labs(x="log2 (fold change)",y="-log10 (q-value)") +
  geom_hline(yintercept = -log10(0.05), lty=4,col="grey",lwd=0.6) +
  geom_vline(xintercept = c(-1, 1), lty=4,col="grey",lwd=0.6) +
  theme(legend.position = "none",
        panel.grid=element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14))
#dev.off()
unnamed-chunk-12-1.png
# 聚類分析
clust = t(countMatrix)
rownames(clust) <- colnames(countMatrix)
clust_dist <- dist(clust,method="euclidean")
hc <-hclust(clust_dist,"ward.D")
suppressMessages(library(factoextra))
#pdf('hc.pdf',width = 6, height = 7)
fviz_dend(hc, k = 4,
          cex = 0.5,
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, rect = TRUE)
#dev.off()
unnamed-chunk-12-2.png


  • edge2
rm (list = ls())
load("expr.Rdata")
exprSet <- countMatrix
group_list <- c('A','A','A','B','B','B')
table(group_list)
## group_list
## A B 
## 3 3

suppressMessages(library(edgeR))
d <- DGEList(counts=exprSet,group=factor(group_list))
keep <- rowSums(cpm(d)>1) >= 2
table(keep)
## keep
## FALSE  TRUE 
## 49082  6368
d <- d[keep, , keep.lib.sizes=FALSE]
d$samples$lib.size <- colSums(d$counts)

## 歸一化旭咽,使用edgeR中的calcNormFactor函數(shù),用TMMf方法
d <- calcNormFactors(d)
d$samples
# 增加一列$norm.factors
##            group lib.size norm.factors
## SRR6791080     A    80451    0.9744241
## SRR6791081     A   271535    0.6072700
## SRR6791082     A   128607    1.0075648
## SRR6791083     B   148799    1.0608086
## SRR6791084     B    22442    1.0858409
## SRR6791085     B    55050    1.4561091

dge=d
design <- model.matrix(~0+factor(group_list))
rownames(design)<-colnames(dge)
colnames(design)<-levels(factor(group_list))
dge=d
dge <- estimateGLMCommonDisp(dge,design)
dge <- estimateGLMTrendedDisp(dge, design)
dge <- estimateGLMTagwiseDisp(dge, design)
fit <- glmFit(dge, design)
lrt <- glmLRT(fit, contrast=c(1,-1)) 
nrDEG=topTags(lrt, n=nrow(dge))
nrDEG=as.data.frame(nrDEG)
head(nrDEG)
##                         logFC    logCPM       LR       PValue       FDR
## ENSMUSG00000031740   8.686641 10.805348 13.45857 0.0002438890 0.5066751
## ENSMUSG00000064080  10.544813  9.816362 12.28010 0.0004578135 0.5066751
## ENSMUSG00000019929   9.785441 10.890102 12.26568 0.0004613642 0.5066751
## ENSMUSG00000022360 -10.732952  9.994101 12.01815 0.0005268487 0.5066751
## ENSMUSG00000024909   9.312234  8.625223 11.83857 0.0005801650 0.5066751
## ENSMUSG00000031328   6.021571 10.642223 11.11854 0.0008546883 0.5066751
edgeR_DEG =nrDEG 
nrDEG=edgeR_DEG[,c(1,5)]
colnames(nrDEG)=c('log2FoldChange','pvalue') 
save(nrDEG, edgeR_DEG, file = "edgeR.Rdata")


  • limma
rm (list = ls())
load("expr.Rdata")
exprSet <- countMatrix
group_list <- factor(c('A','A','A','B','B','B'))
table(group_list)

## step1: 處理group_list
suppressMessages(library(limma))
design <- model.matrix(~0+factor(group_list))
colnames(design)=levels(factor(group_list))
rownames(design)=colnames(exprSet)
design
##            A B
## SRR6791080 1 0
## SRR6791081 1 0
## SRR6791082 1 0
## SRR6791083 0 1
## SRR6791084 0 1
## SRR6791085 0 1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$`factor(group_list)`
## [1] "contr.treatment"

## step2: 處理表達(dá)矩陣赌厅,過濾掉矩陣表達(dá)量太低或為0的
dge <- DGEList(counts=exprSet)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log=TRUE, prior.count=3)
logCPM[1:4,1:4]

comp='A-B'
v <- voom(dge,design,plot=TRUE, normalize="quantile")
unnamed-chunk-14-1.png
fit <- lmFit(v, design)
cont.matrix=makeContrasts(contrasts=c(comp),levels = design)
fit2=contrasts.fit(fit,cont.matrix)
fit2=eBayes(fit2)

## step3: 處理結(jié)束穷绵,拿數(shù)據(jù)
tempOutput = topTable(fit2, coef=comp, n=Inf)
DEG_limma_voom = na.omit(tempOutput)
head(DEG_limma_voom)
##                        logFC  AveExpr         t      P.Value    adj.P.Val
## ENSMUSG00000053907  4.405659 4.572770 119.25502 2.268083e-08 0.0008054679
## ENSMUSG00000025403  4.911621 4.829758 112.21979 2.905204e-08 0.0008054679
## ENSMUSG00000032714  5.335097 5.027842  70.25404 1.955055e-07 0.0011633706
## ENSMUSG00000033066  1.822959 3.285393  51.92294 6.690066e-07 0.0011633706
## ENSMUSG00000020520  4.507704 4.635493  44.98478 1.198824e-06 0.0011633706
## ENSMUSG00000032387 -6.577639 5.681617 -38.75511 2.197276e-06 0.0011633706
##                            B
## ENSMUSG00000053907 -23.06391
## ENSMUSG00000025403 -23.44447
## ENSMUSG00000032714 -25.87765
## ENSMUSG00000033066 -27.17210
## ENSMUSG00000020520 -28.02201
## ENSMUSG00000032387 -28.98618


nrDEG=DEG_limma_voom[,c(1,4)]
colnames(nrDEG)=c('log2FoldChange','pvalue')
save(nrDEG, DEG_limma_voom, file = "limma.Rdata")


7、挑選差異分析結(jié)果的統(tǒng)計學(xué)顯著上調(diào)下調(diào)基因集
在R里面察蹲,對統(tǒng)計學(xué)顯著上調(diào)下調(diào)基因集请垛,進(jìn)行GO/KEGG等數(shù)據(jù)庫的超幾何分布檢驗分析催训,原理參考:https://mp.weixin.qq.com/s/M6CRe39xmQ_lSQqeM99kow

基因富集分析(gene set enrichment analysis)是在一組基因或蛋白中找到一類過表達(dá)的基因或蛋白洽议。

# 跳躍學(xué)習(xí)一下如何用Bioconductor對基因組注釋
# 參考鏈接: http://www.reibang.com/p/ae94178918bc
# 乖巧的安裝:BiocManager::install("org.Mm.eg.db")

rm(list = ls())
options(stringsAsFactors = F)
suppressMessages(library("org.Mm.eg.db"))
suppressMessages(library("clusterProfiler"))
load("limma.Rdata")
colnames(DEG_limma_voom)
## [1] "logFC"     "AveExpr"   "t"         "P.Value"   "adj.P.Val" "B"
nrDEG = DEG_limma_voom

## step1: nrDEG添加一列change, 顯示上下調(diào)
logFC_cutoff <- with( nrDEG, mean( abs( logFC ) ) + 2 * sd( abs( logFC ) ) )
logFC_cutoff
## [1] 1.994444
logFC_cutoff = 1.2
nrDEG$change = as.factor(ifelse( nrDEG$P.Value < 0.05 & abs(nrDEG$logFC) > logFC_cutoff,ifelse( nrDEG$logFC > logFC_cutoff , 'UP', 'DOWN' ), 'NOT' ))
table(nrDEG$change)
##  DOWN   NOT    UP 
##   741 54063   646

## step2: 提取ENSG的ID和GENE
keytypes(org.Mm.eg.db)
nrDEG$ENSEMBL <- rownames(nrDEG)
df <- bitr(rownames(nrDEG), fromType = "ENSEMBL", toType = c( "ENTREZID" ), OrgDb = org.Mm.eg.db )
head(df)

## step3: 將GENE合并到nrDEG數(shù)據(jù)框內(nèi)
nrDEG$SYMBOL = rownames(nrDEG)
nrDEG = merge(nrDEG, df, by='ENSEMBL')
head(nrDEG)

## step4: 提取上調(diào)和下調(diào)的基因集
gene_up = nrDEG[ nrDEG$change == 'UP', 'ENTREZID' ] 
gene_down = nrDEG[ nrDEG$change == 'DOWN', 'ENTREZID' ]
gene_diff = c(gene_up, gene_down)

gene_all = as.character(nrDEG[ ,'ENTREZID'] )

geneList = nrDEG$logFC
names(geneList) = nrDEG$ENTREZID
geneList = sort(geneList, decreasing = T )
gene_list = as.list(geneList)
save(gene_list,file = "gene_list.txt")

## KEGG pathway analysis
kk.up <- enrichKEGG(gene =  gene_up,
                    organism = 'mmu',
                    universe =  gene_all,
                    pvalueCutoff  =  0.8,
                    qvalueCutoff  =  0.8)
kk.down <- enrichKEGG(gene = gene_down,
                      organism =  'mmu',
                      universe = gene_all,
                      pvalueCutoff  = 0.05)


suppressMessages(library(ggplot2))
kegg_down_dt <- as.data.frame( kk.down )
kegg_up_dt <- as.data.frame( kk.up )
down_kegg <- kegg_down_dt[ kegg_down_dt$pvalue < 0.05, ]
down_kegg$group = -1
up_kegg <- kegg_up_dt[ kegg_up_dt$pvalue < 0.05, ]
up_kegg$group = 1
  
dat = rbind( up_kegg, down_kegg )
dat$pvalue = -log10( dat$pvalue )
dat$pvalue = dat$pvalue * dat$group
  
dat = dat[ order( dat$pvalue, decreasing = F ), ]
save(dat,file = 'KEGG_enrich_results.Rdata')
load(file = 'KEGG_enrich_results.Rdata')  
ggplot(dat, aes(x = reorder( Description, order( pvalue, decreasing=F ) ), y = pvalue, fill = group)) + 
    geom_bar( stat = "identity" ) + 
    scale_fill_gradient( low = "blue", high = "red", guide = FALSE ) +
    scale_x_discrete( name = "Pathway names" ) +
    scale_y_continuous( name = "log10P-value" ) +
    coord_flip() + theme_bw() + theme( plot.title = element_text( hjust = 0.5 ) ) +
    ggtitle( "Pathway Enrichment" ) 
unnamed-chunk-17-1.png

## GO database analysis
g_list=list(gene_up=gene_up,
              gene_down=gene_down,
              gene_diff=gene_diff)
go_enrich_results <- lapply(g_list , function(gene) {
      lapply( c('BP','MF','CC') , function(ont) {
        cat(paste('Now process ',ont ))
        ego <- enrichGO(gene= gene,
                        universe      = gene_all,
                        OrgDb         = org.Mm.eg.db,
                        ont           = ont ,
                        pAdjustMethod = "BH",
                        pvalueCutoff  = 0.99,
                        qvalueCutoff  = 0.99,
                        readable      = TRUE)
        
        print( head(ego) )
        return(ego)
      })
  })
save(go_enrich_results,file = 'go_enrich_results.Rdata')
# OrgDb: 物種注釋數(shù)據(jù)庫
# ont: 是BP(Biological Process), CC(Cellular Component), MF(Molecular Function),一個基因的功能可以從生物學(xué)過程漫拭,所屬細(xì)胞部分亚兄,和分子功能三個角度定義。
load(file = 'go_enrich_results.Rdata')
n1= c('gene_up','gene_down','gene_diff')
n2= c('BP','MF','CC') 
for (i in 1:3){
  for (j in 1:3){
    fn=paste0('GO_dotplot_',n1[i],'_',n2[j],'.png')
    #cat(paste0(fn,'\n'))
    png(fn,res=150,width = 1080)
    print(dotplot(go_enrich_results[[i]][[j]]))
    dev.off()}}

dotplot(go_enrich_results[[1]][[1]],showCategory = 10,font.size = 6)
unnamed-chunk-19-1.png

8采驻、直接對任取6個樣本表達(dá)矩陣做GSVA分析
閱讀了一些關(guān)于GSVA算法和R包的介紹:http://www.reibang.com/p/6337e0dadc17
基因集合變異分析(Gene Set Variation Analysis审胚,GSVA)
參考代碼:https://github.com/jmzeng1314/GEO

gsva函數(shù)接受的是一個純粹的表達(dá)矩陣matrix和一個純粹基因集合list。

rm(list = ls())
options(stringsAsFactors = F)
# BiocManager::install('GSVA')
suppressMessages(library(GSVA))

load("expr.Rdata")
exprSet <- countMatrix
dim(exprSet)
## [1] 55450     6
group_list <- c('A','A','A','B','B','B')
table(group_list)

suppressMessages(library("org.Mm.eg.db"))
suppressMessages(library("clusterProfiler"))

## step1: 整理檢查數(shù)據(jù)
g2s = toTable(org.Mm.egSYMBOL)
g2e = toTable(org.Mm.egENSEMBL)
a = as.data.frame(exprSet)


# 基因集下載:http://software.broadinstitute.org/gsea/msigdb/index.jsp    
d = 'MsigDB/'
gmts <- list.files(d,pattern = 'all')

# 感覺自己對這部分理解還不夠礼旅,留個問題

自己沒有可研究的數(shù)據(jù)膳叨,從文獻(xiàn)中挑選任意6個樣本研究,過程中可能有理解偏差痘系,如有問題菲嘴,懇請大佬指導(dǎo)~~

更多學(xué)習(xí)資源:
生信技能樹公益視頻合輯
生信技能樹簡書賬號
生信工程師入門最佳指南
生信技能樹全球公益巡講
招學(xué)徒
...
你的宣傳能讓數(shù)以萬計的初學(xué)者找到他們的家,技能樹平臺一定不會辜負(fù)每一個熱愛學(xué)習(xí)和分享的同道中人 ??

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末汰翠,一起剝皮案震驚了整個濱河市龄坪,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌复唤,老刑警劉巖健田,帶你破解...
    沈念sama閱讀 216,372評論 6 498
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場離奇詭異佛纫,居然都是意外死亡妓局,警方通過查閱死者的電腦和手機(jī),發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 92,368評論 3 392
  • 文/潘曉璐 我一進(jìn)店門呈宇,熙熙樓的掌柜王于貴愁眉苦臉地迎上來好爬,“玉大人,你說我怎么就攤上這事攒盈〉志校” “怎么了?”我有些...
    開封第一講書人閱讀 162,415評論 0 353
  • 文/不壞的土叔 我叫張陵型豁,是天一觀的道長僵蛛。 經(jīng)常有香客問我尚蝌,道長,這世上最難降的妖魔是什么充尉? 我笑而不...
    開封第一講書人閱讀 58,157評論 1 292
  • 正文 為了忘掉前任飘言,我火速辦了婚禮,結(jié)果婚禮上驼侠,老公的妹妹穿的比我還像新娘姿鸿。我一直安慰自己,他們只是感情好倒源,可當(dāng)我...
    茶點故事閱讀 67,171評論 6 388
  • 文/花漫 我一把揭開白布苛预。 她就那樣靜靜地躺著,像睡著了一般笋熬。 火紅的嫁衣襯著肌膚如雪热某。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 51,125評論 1 297
  • 那天胳螟,我揣著相機(jī)與錄音昔馋,去河邊找鬼。 笑死糖耸,一個胖子當(dāng)著我的面吹牛秘遏,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播嘉竟,決...
    沈念sama閱讀 40,028評論 3 417
  • 文/蒼蘭香墨 我猛地睜開眼邦危,長吁一口氣:“原來是場噩夢啊……” “哼!你這毒婦竟也來了周拐?” 一聲冷哼從身側(cè)響起铡俐,我...
    開封第一講書人閱讀 38,887評論 0 274
  • 序言:老撾萬榮一對情侶失蹤,失蹤者是張志新(化名)和其女友劉穎妥粟,沒想到半個月后审丘,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 45,310評論 1 310
  • 正文 獨居荒郊野嶺守林人離奇死亡勾给,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 37,533評論 2 332
  • 正文 我和宋清朗相戀三年滩报,在試婚紗的時候發(fā)現(xiàn)自己被綠了。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片播急。...
    茶點故事閱讀 39,690評論 1 348
  • 序言:一個原本活蹦亂跳的男人離奇死亡脓钾,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出桩警,到底是詐尸還是另有隱情可训,我是刑警寧澤,帶...
    沈念sama閱讀 35,411評論 5 343
  • 正文 年R本政府宣布,位于F島的核電站握截,受9級特大地震影響飞崖,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜谨胞,卻給世界環(huán)境...
    茶點故事閱讀 41,004評論 3 325
  • 文/蒙蒙 一固歪、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧胯努,春花似錦牢裳、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,659評論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽。三九已至恬汁,卻和暖如春伶椿,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背氓侧。 一陣腳步聲響...
    開封第一講書人閱讀 32,812評論 1 268
  • 我被黑心中介騙來泰國打工, 沒想到剛下飛機(jī)就差點兒被人妖公主榨干…… 1. 我叫王不留导狡,地道東北人约巷。 一個月前我還...
    沈念sama閱讀 47,693評論 2 368
  • 正文 我出身青樓,卻偏偏與公主長得像旱捧,于是被迫代替她去往敵國和親独郎。 傳聞我的和親對象是個殘疾皇子,可洞房花燭夜當(dāng)晚...
    茶點故事閱讀 44,577評論 2 353

推薦閱讀更多精彩內(nèi)容