教程對應(yīng)B站:【生信技能樹】轉(zhuǎn)錄組測序數(shù)據(jù)分析
鏈接:https://www.bilibili.com/video/av28453557?from=search&seid=17370292426064271948
前提了解Linux基本操作和R語言
題目
鏈接:http://www.bio-info-trainee.com/3920.html
準(zhǔn)備工作
1、參考基因組及注釋文件下載地址
參考基因組存儲網(wǎng)站三個:ENSEMBL、UCSC倍踪、NCBI
基因組注釋文件:gencode數(shù)據(jù)庫烦秩、ENSEMBL
注釋文件可以告訴我們基因組每條染色體上哪些序列是編碼蛋白的基因,哪些是非編碼基因倔毙,外顯子、內(nèi)含子、UTR位置等等瘫拣。
閱讀文獻(xiàn)
2、找到文章的測序數(shù)據(jù)
2018年12月的NC文章:Spatially and functionally distinct subclasses of breast cancer-associated fibroblasts revealed by single cell RNA sequencing()
文章研究了乳腺癌小鼠模型的768個間充質(zhì)細(xì)胞轉(zhuǎn)錄組的單細(xì)胞RNA測序(scRNA-seq)告喊,定義了三個不同的CAF(癌癥相關(guān)成纖維細(xì)胞)亞群麸拄。
3派昧、下載測序數(shù)據(jù)
GEO鏈接:GSE111229
- GEO Platform (GPL)
- GEO Sample (GSM)
- GEO Series (GSE)
- GEO Dataset (GDS)
關(guān)于GEO數(shù)據(jù)庫:一篇文章可以有一個或者多個GSE數(shù)據(jù)集,一個GSE里面可以有一個或者多個GSM樣本拢切。多個研究的GSM樣本可以根據(jù)研究目的整合為一個GDS蒂萎,不過GDS本身用的很少。而每個數(shù)據(jù)集都有著自己對應(yīng)的芯片/平臺淮椰,就是GPL五慈。
原始數(shù)據(jù)鏈接:SRP133642
關(guān)于SRA數(shù)據(jù)庫:用于存儲二代測序的原始數(shù)據(jù),包括454/illumina/SOLiD/IonTorrent/Helicos/CompleteGenomics
- Studies 研究課題 SRP
- Experiments 實驗設(shè)計 SRX
- Runs 測序結(jié)果集 SRR
- Samples 樣本信息 SRS
# 以下練習(xí)需要6個樣本主穗,我們從 SRR_Acc_List.txt 中選擇6個保存為 list.txt
cat list.txt |while read id ;do (prefetch ${id} &);done
sra->fastq->bam->counts
4泻拦、任意挑選6個樣本走標(biāo)準(zhǔn)的RNA-SEQ上游流程
# source activate rna 創(chuàng)建小環(huán)境運行
(rna) yxu 21:16:47 ~/project/rnatest
$ tree -L 1
.
├── 1.sra_data
├── 2.raw_fq
├── 3.fastq_qc
├── 4.mapping
└── 5.counts
5 directories, 0 files
sra->fastq
(rna) yxu 21:29:34 ~/project/rnatest/1.sra_data
$ ll
total 29M
-rw-rw-r-- 1 yxu yxu 4.3M Mar 14 09:39 SRR6791080.sra
-rw-rw-r-- 1 yxu yxu 8.8M Mar 14 09:39 SRR6791081.sra
-rw-rw-r-- 1 yxu yxu 5.1M Mar 14 09:39 SRR6791082.sra
-rw-rw-r-- 1 yxu yxu 5.9M Mar 14 09:39 SRR6791083.sra
-rw-rw-r-- 1 yxu yxu 1.8M Mar 14 09:39 SRR6791084.sra
-rw-rw-r-- 1 yxu yxu 3.4M Mar 14 09:39 SRR6791085.sra
# 查看文件大小和網(wǎng)站上的大小是否一致
## step1: fastq-dump
ls ~/public/project/RNA/airway/sra/* |while read id;do ( nohup fastq-dump --gzip --split-3 -O ./ ${id} & );done
(rna) yxu 21:35:46 ~/project/rnatest/2.raw_fq
$ ll
total 47M
-rw-rw-r-- 1 yxu yxu 6.8M May 14 23:50 SRR6791080.fastq.gz
-rw-rw-r-- 1 yxu yxu 15M May 14 23:51 SRR6791081.fastq.gz
-rw-rw-r-- 1 yxu yxu 8.2M May 14 23:51 SRR6791082.fastq.gz
-rw-rw-r-- 1 yxu yxu 9.5M May 14 23:51 SRR6791083.fastq.gz
-rw-rw-r-- 1 yxu yxu 2.4M May 14 23:52 SRR6791084.fastq.gz
-rw-rw-r-- 1 yxu yxu 5.0M May 14 23:52 SRR6791085.fastq.gz
## step2: check quality of sequence reads
ls *gz |xargs fastqc -t 10
multiqc ./
## step3:filter the bad quality reads and remove adaptors
(rna) yxu 21:35:46 ~/project/rnatest/2.raw_fq
$ cat qc.sh
cat id | while read id; do (trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 -o ./ ${id});done
# id為6個樣本的id號文件,自己創(chuàng)建
(rna) yxu 21:35:46 ~/project/rnatest/2.raw_fq
$ nohup bash qc.sh &
(rna) yxu 21:53:50 ~/project/rnatest/3.fastq_qc/clean
$ ll *gz
-rw-rw-r-- 1 yxu yxu 6.7M May 14 00:33 SRR6791080_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu 15M May 14 00:33 SRR6791081_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu 8.2M May 14 00:33 SRR6791082_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu 9.5M May 14 00:33 SRR6791083_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu 2.4M May 14 00:33 SRR6791084_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu 5.0M May 14 00:33 SRR6791085_trimmed.fq.gz
fastq->bam
(rna) yxu 21:59:51 ~/project/rnatest/3.fastq_qc/clean
$ cat mapping.sh
ls *gz|cut -d"_" -f 1 |sort -u |while read id;do hisat2 -p 10 -x /home/yxu/refernce/mm10/index/mm10/genome -U ${id}_trimmed.fq.gz |samtools sort -@ 2 -o ~/project/rnatest/4.mapping/${id}.bam;done
(rna) yxu 22:01:25 ~/project/rnatest/4.mapping
$ ls *bam
SRR6791080.bam SRR6791082.bam SRR6791084.bam
SRR6791081.bam SRR6791083.bam SRR6791085.bam
(rna) yxu 22:05:40 ~/project/rnatest/4.mapping
$ samtools view SRR6791080.bam | less -SN | head -n 10
# 查看bam文件
(rna) yxu 22:08:56 ~/project/rnatest/4.mapping
$ cat stat.sh
# 處理bam文件
ls *.bam |xargs -i samtools index {}
ls *.bam |while read id ;do ( nohup samtools flagstat -@ 1 $id > $(basename ${id} ".bam").flagstat & );done
(rna) yxu 23:00:05 ~/project/rnatest/4.mapping/stat
$ cat SRR6791080.flagstat
198223 + 0 in total (QC-passed reads + QC-failed reads)
268 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
134843 + 0 mapped (68.03% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
# 統(tǒng)計輸入文件的相關(guān)數(shù)據(jù)并將這些數(shù)據(jù)輸出至屏幕顯示
bam->counts
(rna) yxu 23:02:30 ~/project/rnatest/5.counts
$ cat counts.sh
featureCounts -T 5 -p -t exon -g gene_id -a /home/yxu/refernce/mm10/Mus_musculus.GRCm38.96.chr.gtf -o ~/project/rnatest/5.counts/counts /home/yxu/project/rnatest/4.mapping/*.bam
# 得到counts矩陣
表達(dá)矩陣的多種形式
5忽媒、理解RNA-SEQ上游流程得到的表達(dá)矩陣的多種形式
轉(zhuǎn)錄組數(shù)據(jù)定量的方式争拐,都是對表達(dá)量進(jìn)行標(biāo)準(zhǔn)化的方法。
標(biāo)準(zhǔn)化的對象就是基因長度與測序深度晦雨。
- counts矩陣
每個基因比對到 reads數(shù)量
rpm矩陣
去除了每個細(xì)胞測序數(shù)據(jù)量(文庫大小)差異rpkm矩陣
去除了基因長度效益tpm矩陣
參考資料:A short script to calculate RPKM and TPM from featureCounts output
差異分析
6架曹、任取6個樣本表達(dá)矩陣隨意分成2組走差異分析代碼
代碼參考:https://github.com/jmzeng1314/GEO/tree/master/airway_RNAseq
需要匯總PCA,heatmap,火山圖,MA圖金赦,CV圖等等
rm(list = ls())
options(stringAsFactors = F)
# 測試數(shù)據(jù)表達(dá)矩陣下載
suppressMessages(library(GEOquery))
gse111229 <- getGEO("GSE111229", destdir=".", AnnotGPL = F, getGPL = F)
Gset <- gse111229[[1]]
pdata <- pData(Gset)
gse111229 <- gse111229$GSE111229_series_matrix.txt.gz
exprSet = as.data.frame(exprs(gse111229))
# 讀取counts矩陣
mydata <- read.table("counts", header = TRUE,quote = '\t',skip = 1)
sampleNames <- c('SRR6791080','SRR6791081','SRR6791082','SRR6791083','SRR6791084','SRR6791085')
names(mydata)[7:12] <- sampleNames
head(mydata)
# 取Geneid和后6列樣本
countMatrix <- as.matrix(mydata[7:12])
rownames(countMatrix) <- mydata$Geneid
head(countMatrix)
## SRR6791080 SRR6791081 SRR6791082 SRR6791083 SRR6791084
## ENSMUSG00000102693 0 0 0 0 0
## ENSMUSG00000064842 0 0 0 0 0
## ENSMUSG00000051951 0 0 0 0 0
## ENSMUSG00000102851 0 0 0 0 0
## ENSMUSG00000103377 0 0 0 0 0
## ENSMUSG00000104017 0 0 0 0 0
## SRR6791085
## ENSMUSG00000102693 0
## ENSMUSG00000064842 0
## ENSMUSG00000051951 0
## ENSMUSG00000102851 0
## ENSMUSG00000103377 0
## ENSMUSG00000104017 0
save(countMatrix,file = "expr.Rdata")
# 將六個樣本分兩組
group_list <- c('A','A','A','B','B','B')
condition <- factor(group_list)
- DEGseq
## step1: 把count矩陣轉(zhuǎn)化為 DESeq2 的數(shù)據(jù)格式
suppressMessages(library(DESeq2))
dds <- DESeqDataSetFromMatrix(countMatrix, DataFrame(condition), design= ~ condition)
# 過濾掉那些 count 結(jié)果為0的數(shù)據(jù)音瓷,這些基因沒有表達(dá)
dds <- dds[rowSums(counts(dds)) > 1,]
## step2: 歸一化
suppressMessages(dds2 <- DESeq(dds))
rld <- rlogTransformation(dds2)
exprSet_new=assay(rld)
resultsNames(dds2)
## [1] "Intercept" "condition_B_vs_A"
## step3:提取差異分析結(jié)果
res <- results(dds2)
write.table(res,"result.csv", sep = ",", row.names = TRUE)
summary(res)
##
## out of 9251 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 4, 0.043%
## LFC < 0 (down) : 0, 0%
## outliers [1] : 478, 5.2%
## low counts [2] : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# MA圖
# M表示log fold change,衡量基因表達(dá)量變化夹抗,上調(diào)還是下調(diào)绳慎。A表示每個基因的count的均值
suppressMessages(library(geneplotter))
plotMA(res, main="DESeq2", ylim=c(-1,1))
# Heatmap圖
sum(res$padj < 0.1, na.rm=TRUE)
suppressMessages(library("pheatmap"))
select <- order(rowMeans(counts(dds2,normalized=TRUE)),decreasing=TRUE)[1:1000]
nt <- normTransform(dds2) # defaults to log2(x+1)
log2.norm.counts <- assay(nt)[select,]
df <- as.data.frame(colData(dds2))
# pdf('heatmap1000.pdf',width = 6, height = 7)
pheatmap(log2.norm.counts, cluster_rows=TRUE, show_rownames=FALSE,
cluster_cols=TRUE, annotation_col=df)
# dev.off()
P-value:p值,是統(tǒng)計學(xué)檢驗變量漠烧,代表差異顯著性杏愤,一般認(rèn)為P < 0.05 為顯著, P <0.01為非常顯著已脓。其含義為:由抽樣誤差導(dǎo)致樣本間差異的概率小于0.05 或0.01珊楼。
Padj:p adjust,轉(zhuǎn)錄組測序的差異表達(dá)分析是對大量的基因表達(dá)值進(jìn)行的獨立統(tǒng)計假設(shè)檢驗度液,存在假陽性問題厕宗,因此引入Padj對顯著性P值(P adjust)進(jìn)行校正。Padj是對P-value的再判斷堕担,篩選更為嚴(yán)格已慢。
Fold Change:適用于兩分組分析,表示兩樣本(組)間表達(dá)量的比值霹购。
log2FoldChange:對Fold Change取log2佑惠,一般默認(rèn)表達(dá)相差2倍以上是有意義的,可以根據(jù)情況適當(dāng)放寬至1.5/1.2,但最好不要低于1.2倍膜楷。
# 火山圖
suppressMessages(library(ggplot2))
res$color <- ifelse(res$padj<0.05 & abs(res$log2FoldChange)>= 2,ifelse(res$log2FoldChange > 2,'red','blue'),'gray')
color <- c(red = "red",gray = "gray",blue = "blue")
res = as.data.frame(res)
#pdf('火山圖.pdf',width = 6, height = 7)
ggplot(res, aes(log2FoldChange, -log10(padj), col = color)) +
geom_point() +
theme_bw() +
scale_color_manual(values = color) +
labs(x="log2 (fold change)",y="-log10 (q-value)") +
geom_hline(yintercept = -log10(0.05), lty=4,col="grey",lwd=0.6) +
geom_vline(xintercept = c(-1, 1), lty=4,col="grey",lwd=0.6) +
theme(legend.position = "none",
panel.grid=element_blank(),
axis.title = element_text(size = 16),
axis.text = element_text(size = 14))
#dev.off()
# 聚類分析
clust = t(countMatrix)
rownames(clust) <- colnames(countMatrix)
clust_dist <- dist(clust,method="euclidean")
hc <-hclust(clust_dist,"ward.D")
suppressMessages(library(factoextra))
#pdf('hc.pdf',width = 6, height = 7)
fviz_dend(hc, k = 4,
cex = 0.5,
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
color_labels_by_k = TRUE, rect = TRUE)
#dev.off()
- edge2
rm (list = ls())
load("expr.Rdata")
exprSet <- countMatrix
group_list <- c('A','A','A','B','B','B')
table(group_list)
## group_list
## A B
## 3 3
suppressMessages(library(edgeR))
d <- DGEList(counts=exprSet,group=factor(group_list))
keep <- rowSums(cpm(d)>1) >= 2
table(keep)
## keep
## FALSE TRUE
## 49082 6368
d <- d[keep, , keep.lib.sizes=FALSE]
d$samples$lib.size <- colSums(d$counts)
## 歸一化旭咽,使用edgeR中的calcNormFactor函數(shù),用TMMf方法
d <- calcNormFactors(d)
d$samples
# 增加一列$norm.factors
## group lib.size norm.factors
## SRR6791080 A 80451 0.9744241
## SRR6791081 A 271535 0.6072700
## SRR6791082 A 128607 1.0075648
## SRR6791083 B 148799 1.0608086
## SRR6791084 B 22442 1.0858409
## SRR6791085 B 55050 1.4561091
dge=d
design <- model.matrix(~0+factor(group_list))
rownames(design)<-colnames(dge)
colnames(design)<-levels(factor(group_list))
dge=d
dge <- estimateGLMCommonDisp(dge,design)
dge <- estimateGLMTrendedDisp(dge, design)
dge <- estimateGLMTagwiseDisp(dge, design)
fit <- glmFit(dge, design)
lrt <- glmLRT(fit, contrast=c(1,-1))
nrDEG=topTags(lrt, n=nrow(dge))
nrDEG=as.data.frame(nrDEG)
head(nrDEG)
## logFC logCPM LR PValue FDR
## ENSMUSG00000031740 8.686641 10.805348 13.45857 0.0002438890 0.5066751
## ENSMUSG00000064080 10.544813 9.816362 12.28010 0.0004578135 0.5066751
## ENSMUSG00000019929 9.785441 10.890102 12.26568 0.0004613642 0.5066751
## ENSMUSG00000022360 -10.732952 9.994101 12.01815 0.0005268487 0.5066751
## ENSMUSG00000024909 9.312234 8.625223 11.83857 0.0005801650 0.5066751
## ENSMUSG00000031328 6.021571 10.642223 11.11854 0.0008546883 0.5066751
edgeR_DEG =nrDEG
nrDEG=edgeR_DEG[,c(1,5)]
colnames(nrDEG)=c('log2FoldChange','pvalue')
save(nrDEG, edgeR_DEG, file = "edgeR.Rdata")
- limma
rm (list = ls())
load("expr.Rdata")
exprSet <- countMatrix
group_list <- factor(c('A','A','A','B','B','B'))
table(group_list)
## step1: 處理group_list
suppressMessages(library(limma))
design <- model.matrix(~0+factor(group_list))
colnames(design)=levels(factor(group_list))
rownames(design)=colnames(exprSet)
design
## A B
## SRR6791080 1 0
## SRR6791081 1 0
## SRR6791082 1 0
## SRR6791083 0 1
## SRR6791084 0 1
## SRR6791085 0 1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$`factor(group_list)`
## [1] "contr.treatment"
## step2: 處理表達(dá)矩陣赌厅,過濾掉矩陣表達(dá)量太低或為0的
dge <- DGEList(counts=exprSet)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log=TRUE, prior.count=3)
logCPM[1:4,1:4]
comp='A-B'
v <- voom(dge,design,plot=TRUE, normalize="quantile")
fit <- lmFit(v, design)
cont.matrix=makeContrasts(contrasts=c(comp),levels = design)
fit2=contrasts.fit(fit,cont.matrix)
fit2=eBayes(fit2)
## step3: 處理結(jié)束穷绵,拿數(shù)據(jù)
tempOutput = topTable(fit2, coef=comp, n=Inf)
DEG_limma_voom = na.omit(tempOutput)
head(DEG_limma_voom)
## logFC AveExpr t P.Value adj.P.Val
## ENSMUSG00000053907 4.405659 4.572770 119.25502 2.268083e-08 0.0008054679
## ENSMUSG00000025403 4.911621 4.829758 112.21979 2.905204e-08 0.0008054679
## ENSMUSG00000032714 5.335097 5.027842 70.25404 1.955055e-07 0.0011633706
## ENSMUSG00000033066 1.822959 3.285393 51.92294 6.690066e-07 0.0011633706
## ENSMUSG00000020520 4.507704 4.635493 44.98478 1.198824e-06 0.0011633706
## ENSMUSG00000032387 -6.577639 5.681617 -38.75511 2.197276e-06 0.0011633706
## B
## ENSMUSG00000053907 -23.06391
## ENSMUSG00000025403 -23.44447
## ENSMUSG00000032714 -25.87765
## ENSMUSG00000033066 -27.17210
## ENSMUSG00000020520 -28.02201
## ENSMUSG00000032387 -28.98618
nrDEG=DEG_limma_voom[,c(1,4)]
colnames(nrDEG)=c('log2FoldChange','pvalue')
save(nrDEG, DEG_limma_voom, file = "limma.Rdata")
7、挑選差異分析結(jié)果的統(tǒng)計學(xué)顯著上調(diào)下調(diào)基因集
在R里面察蹲,對統(tǒng)計學(xué)顯著上調(diào)下調(diào)基因集请垛,進(jìn)行GO/KEGG等數(shù)據(jù)庫的超幾何分布檢驗分析催训,原理參考:https://mp.weixin.qq.com/s/M6CRe39xmQ_lSQqeM99kow
基因富集分析(gene set enrichment analysis)是在一組基因或蛋白中找到一類過表達(dá)的基因或蛋白洽议。
# 跳躍學(xué)習(xí)一下如何用Bioconductor對基因組注釋
# 參考鏈接: http://www.reibang.com/p/ae94178918bc
# 乖巧的安裝:BiocManager::install("org.Mm.eg.db")
rm(list = ls())
options(stringsAsFactors = F)
suppressMessages(library("org.Mm.eg.db"))
suppressMessages(library("clusterProfiler"))
load("limma.Rdata")
colnames(DEG_limma_voom)
## [1] "logFC" "AveExpr" "t" "P.Value" "adj.P.Val" "B"
nrDEG = DEG_limma_voom
## step1: nrDEG添加一列change, 顯示上下調(diào)
logFC_cutoff <- with( nrDEG, mean( abs( logFC ) ) + 2 * sd( abs( logFC ) ) )
logFC_cutoff
## [1] 1.994444
logFC_cutoff = 1.2
nrDEG$change = as.factor(ifelse( nrDEG$P.Value < 0.05 & abs(nrDEG$logFC) > logFC_cutoff,ifelse( nrDEG$logFC > logFC_cutoff , 'UP', 'DOWN' ), 'NOT' ))
table(nrDEG$change)
## DOWN NOT UP
## 741 54063 646
## step2: 提取ENSG的ID和GENE
keytypes(org.Mm.eg.db)
nrDEG$ENSEMBL <- rownames(nrDEG)
df <- bitr(rownames(nrDEG), fromType = "ENSEMBL", toType = c( "ENTREZID" ), OrgDb = org.Mm.eg.db )
head(df)
## step3: 將GENE合并到nrDEG數(shù)據(jù)框內(nèi)
nrDEG$SYMBOL = rownames(nrDEG)
nrDEG = merge(nrDEG, df, by='ENSEMBL')
head(nrDEG)
## step4: 提取上調(diào)和下調(diào)的基因集
gene_up = nrDEG[ nrDEG$change == 'UP', 'ENTREZID' ]
gene_down = nrDEG[ nrDEG$change == 'DOWN', 'ENTREZID' ]
gene_diff = c(gene_up, gene_down)
gene_all = as.character(nrDEG[ ,'ENTREZID'] )
geneList = nrDEG$logFC
names(geneList) = nrDEG$ENTREZID
geneList = sort(geneList, decreasing = T )
gene_list = as.list(geneList)
save(gene_list,file = "gene_list.txt")
## KEGG pathway analysis
kk.up <- enrichKEGG(gene = gene_up,
organism = 'mmu',
universe = gene_all,
pvalueCutoff = 0.8,
qvalueCutoff = 0.8)
kk.down <- enrichKEGG(gene = gene_down,
organism = 'mmu',
universe = gene_all,
pvalueCutoff = 0.05)
suppressMessages(library(ggplot2))
kegg_down_dt <- as.data.frame( kk.down )
kegg_up_dt <- as.data.frame( kk.up )
down_kegg <- kegg_down_dt[ kegg_down_dt$pvalue < 0.05, ]
down_kegg$group = -1
up_kegg <- kegg_up_dt[ kegg_up_dt$pvalue < 0.05, ]
up_kegg$group = 1
dat = rbind( up_kegg, down_kegg )
dat$pvalue = -log10( dat$pvalue )
dat$pvalue = dat$pvalue * dat$group
dat = dat[ order( dat$pvalue, decreasing = F ), ]
save(dat,file = 'KEGG_enrich_results.Rdata')
load(file = 'KEGG_enrich_results.Rdata')
ggplot(dat, aes(x = reorder( Description, order( pvalue, decreasing=F ) ), y = pvalue, fill = group)) +
geom_bar( stat = "identity" ) +
scale_fill_gradient( low = "blue", high = "red", guide = FALSE ) +
scale_x_discrete( name = "Pathway names" ) +
scale_y_continuous( name = "log10P-value" ) +
coord_flip() + theme_bw() + theme( plot.title = element_text( hjust = 0.5 ) ) +
ggtitle( "Pathway Enrichment" )
## GO database analysis
g_list=list(gene_up=gene_up,
gene_down=gene_down,
gene_diff=gene_diff)
go_enrich_results <- lapply(g_list , function(gene) {
lapply( c('BP','MF','CC') , function(ont) {
cat(paste('Now process ',ont ))
ego <- enrichGO(gene= gene,
universe = gene_all,
OrgDb = org.Mm.eg.db,
ont = ont ,
pAdjustMethod = "BH",
pvalueCutoff = 0.99,
qvalueCutoff = 0.99,
readable = TRUE)
print( head(ego) )
return(ego)
})
})
save(go_enrich_results,file = 'go_enrich_results.Rdata')
# OrgDb: 物種注釋數(shù)據(jù)庫
# ont: 是BP(Biological Process), CC(Cellular Component), MF(Molecular Function),一個基因的功能可以從生物學(xué)過程漫拭,所屬細(xì)胞部分亚兄,和分子功能三個角度定義。
load(file = 'go_enrich_results.Rdata')
n1= c('gene_up','gene_down','gene_diff')
n2= c('BP','MF','CC')
for (i in 1:3){
for (j in 1:3){
fn=paste0('GO_dotplot_',n1[i],'_',n2[j],'.png')
#cat(paste0(fn,'\n'))
png(fn,res=150,width = 1080)
print(dotplot(go_enrich_results[[i]][[j]]))
dev.off()}}
dotplot(go_enrich_results[[1]][[1]],showCategory = 10,font.size = 6)
8采驻、直接對任取6個樣本表達(dá)矩陣做GSVA分析
閱讀了一些關(guān)于GSVA算法和R包的介紹:http://www.reibang.com/p/6337e0dadc17
基因集合變異分析(Gene Set Variation Analysis审胚,GSVA)
參考代碼:https://github.com/jmzeng1314/GEO
gsva函數(shù)接受的是一個純粹的表達(dá)矩陣matrix和一個純粹基因集合list。
rm(list = ls())
options(stringsAsFactors = F)
# BiocManager::install('GSVA')
suppressMessages(library(GSVA))
load("expr.Rdata")
exprSet <- countMatrix
dim(exprSet)
## [1] 55450 6
group_list <- c('A','A','A','B','B','B')
table(group_list)
suppressMessages(library("org.Mm.eg.db"))
suppressMessages(library("clusterProfiler"))
## step1: 整理檢查數(shù)據(jù)
g2s = toTable(org.Mm.egSYMBOL)
g2e = toTable(org.Mm.egENSEMBL)
a = as.data.frame(exprSet)
# 基因集下載:http://software.broadinstitute.org/gsea/msigdb/index.jsp
d = 'MsigDB/'
gmts <- list.files(d,pattern = 'all')
# 感覺自己對這部分理解還不夠礼旅,留個問題
自己沒有可研究的數(shù)據(jù)膳叨,從文獻(xiàn)中挑選任意6個樣本研究,過程中可能有理解偏差痘系,如有問題菲嘴,懇請大佬指導(dǎo)~~
更多學(xué)習(xí)資源:
生信技能樹公益視頻合輯
生信技能樹簡書賬號
生信工程師入門最佳指南
生信技能樹全球公益巡講
招學(xué)徒
...
你的宣傳能讓數(shù)以萬計的初學(xué)者找到他們的家,技能樹平臺一定不會辜負(fù)每一個熱愛學(xué)習(xí)和分享的同道中人 ??