這一章來介紹NGS的下游分析
富集分析
簡單的講廉白,富集分析就是一種統(tǒng)計分析的手段垂睬,用來篩選功能相類的一組基因是否富集中差異表達(dá)的基因中鉴竭。當(dāng)人們拿到了差異表達(dá)的基因時干毅,很多時候因為差異表達(dá)的基因數(shù)量很多,面對這么多的基因涂圆,人們不知道如何找到合適的突破口進(jìn)行下游的驗證實驗们镜。于是從一堆差異表達(dá)的基因中找出有意義的基因進(jìn)行RT-PCR驗證以及基因敲除驗證就需要使用到富集分析了
選擇正確的庫(library, 或者說是universe gene id)。正確的庫的大小的選擇決定著分析的結(jié)果是否可靠
1.GO富集
示例代碼:
library(clusterProfiler)
#加載人類的庫
library(org.Hs.eg.db)
data(geneList, package="DOSE")
#篩選genelist
gene <- names(geneList)[abs(geneList) > 2]
#轉(zhuǎn)換geneID
gene.df <- bitr(gene, fromType = "ENTREZID",
toType = c("ENSEMBL", "SYMBOL"),
OrgDb = org.Hs.eg.db)
#功能富集
ego <- enrichGO(gene = gene,
OrgDb = org.Hs.eg.db,
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
readable = TRUE)
2.KEGG
#kegg富集
kk <- enrichKEGG(gene = gene,
organism = 'hsa',
pvalueCutoff = 0.05)
熱圖分析
熱圖(heatmap)润歉。熱圖是一種三維數(shù)據(jù)轉(zhuǎn)二維的表達(dá)手段模狭,它通過使用不同的顏色來表達(dá)本應(yīng)該在第三維上顯示的數(shù)據(jù)。熱圖的方便的于踩衩,它可以在一張二維圖像上快速的通過顏色的差異來了解組內(nèi)的一致性及差異組間的差別
熱圖的軟件包有很多嚼鹉,常用的有pheatmap, ggplot2以及ComplexHeatmap
以ComplexHeatmap:
## 首先, 我們先生成一個隨機(jī)數(shù)據(jù)。
set.seed(1)
mat <- matrix(runif(36), nrow=6,
dimnames=list(paste("row", letters[1:6], sep="_"),
paste("col", LETTERS[1:6], sep="_")))
library(ComplexHeatmap)
當(dāng)然ComplexHeatmap還有一些高級用法
生成兩個熱圖驱富,并對行列進(jìn)行注釋:
ha_column <-
HeatmapAnnotation(
df = data.frame(## data.frame, 行數(shù)與mat的列數(shù)一致锚赤。
type1 = sample(c("a", "b"),
ncol(mat),
replace = TRUE)),
col = list(type1 = c("a" = "red", "b" = "blue"))
## 顏色list, 名字與df中的列字一致,
## 元素名與df列中的出現(xiàn)的元素一致褐鸥。
)
ha_row <- rowAnnotation( ## 對行進(jìn)行注釋
df = data.frame(type2 = sample(c("A", "B"),
nrow(mat),
replace = TRUE)),
col = list(type2 = c("A" = "green", "B" = "cyan")),
width = unit(1, "cm") ##指定寬度线脚,必須是unit,參見grid::unit
)
## 生成兩個Heatmap實例叫榕。這里我們使用了相同的數(shù)據(jù)浑侥。
## 在實際應(yīng)用中,它應(yīng)該是行數(shù)一致的不同數(shù)據(jù)翠霍。
ht1 <- Heatmap(mat, name = "ht1", #name將出現(xiàn)的圖例中
column_title = "Heatmap 1", #title將出現(xiàn)在標(biāo)題中
top_annotation = ha_column) #ha_column將出現(xiàn)在熱圖上方(也可以是下方)
ht2 <- Heatmap(mat, name = "ht2",
column_title = "Heatmap 2")
ht_list <- ht1 + ht2 + ha_row
draw(ht_list)
#把圖例調(diào)來左邊
draw(ht_list, heatmap_legend_side = "left",
show_annotation_legend = FALSE)
當(dāng)然利用HeatmapAnnotation()函數(shù)來進(jìn)行加工也可以锭吨,可以在熱圖的基礎(chǔ)上加點圖,條形圖
1.row_anno_points():點圖
2.row_anno_barplot():條形圖
3.row_anno_boxplot():箱線圖
4.row_anno_histogram():直方圖
5.row_anno_density():密度圖
ha_mix_top <- HeatmapAnnotation(
points = anno_points(runif(ncol(mat))), ## points
barplot = anno_barplot(runif(ncol(mat)), axis=TRUE), ## barplot
histogram = anno_histogram(mat,
gp = gpar(fill = 1:6)), ## histogram
density_line = anno_density(mat, type = "line",
gp = gpar(col = 1:6)), ## density
violin = anno_density(mat, type = "violin", gp = gpar(fill = 6:1)), ## violin
box = anno_boxplot(mat), ## boxplot
heatmap = anno_density(mat, type = "heatmap"), ## heatmap
annotation_height = unit(8, "cm") ## annotation height
)
Heatmap(mat, name = "mat",
top_annotation = ha_mix_top)
如果想只對個別的行或者列標(biāo)示名稱寒匙,其它的不標(biāo)注
mat = matrix(rnorm(10000), nr = 1000)
rownames(mat) = sprintf("%.2f", rowMeans(mat))
subset = sample(1000, 20)
labels = rownames(mat)[subset]
Heatmap(mat, show_row_names = FALSE, show_row_dend = FALSE, show_column_dend = FALSE) +
rowAnnotation(link = row_anno_link(at = subset, labels = labels),
width = unit(1, "cm") + max_text_width(labels))
突變分析
常用于TCGA分析數(shù)據(jù)
我們先構(gòu)建對象
mat <- read.table(textConnection(
",s1,s2,s3
g1,snv1;indel,snv1;snv2,indel
g2,,snv2;indel,snv1
g3,snv1,,indel;snv1;snv2"),
row.names = 1, header = TRUE,
sep = ",", stringsAsFactors = FALSE)
mat <- as.matrix(mat)
mat
數(shù)據(jù)結(jié)構(gòu)長這樣
畫熱圖來查看各種突變的占比:
library(ComplexHeatmap)
col <- c(snv1 = "#008000", snv2 = "#008000", indel = "blue")
oncoPrint(mat, get_type = function(x) strsplit(x, ";")[[1]],
alter_fun = list(
indel = function(x, y, w, h)
grid.rect(x, y, w*0.9, h*0.9,
gp = gpar(fill = col["indel"],
col = NA)) , #must be draw first
snv1 = function(x, y, w, h)
grid.rect(x, y-.15*h, w*0.9, h*0.2,
gp = gpar(fill = col["snv1"],
col = NA)),
snv2 = function(x, y, w, h)
grid.rect(x, y+.15*h, w*0.9, h*0.2,
gp = gpar(fill = col["snv2"],
col = NA))
), col = col)
上面的例子有兩個重要的概念,一個是get_type躏将,一個是alter_fun锄弱。兩者都接受一個函數(shù),其中g(shù)et_type是如何將mat中的每個單元如何分解成突變類型祸憋,alter_fun是如何對每一種突變類型進(jìn)行繪圖会宪。 get_type接受一個參數(shù)x,這個x將會是mat中的每一個值蚯窥。比如這里的mat[g1, s1]就是"snv1;indel"掸鹅。而alter_fun將會是一個list of function塞帐,list中的每個元素名稱都將是可能出現(xiàn)在get_type得到的值中。list中的每一個函數(shù)有四個參數(shù)巍沙,x,y,w,h葵姥,它們分別指heatmap中對應(yīng)mat中的每個單元格的x, y的坐標(biāo)以及單元格的高度和寬度。
接下來我們看下如何畫瀑布圖:
#建立對象
mat <- read.delim(file.path(system.file("extdata", "tcga_lung_adenocarcinoma_provisional_ras_raf_mek_jnk_signalling.txt",
package = "ComplexHeatmap")),
stringsAsFactors=FALSE, row.names=1)
mat[is.na(mat)] <- " "
mat <- mat[, -ncol(mat)]
mat = t(as.matrix(mat))
##假設(shè)有三種突變類型“MUT”句携,“AMP”榔幸,“HOMDEL”
#寫出類型
alter_fun <- list(
background = function(x, y, w, h) {
grid.rect(x, y, w-unit(0.5, "mm"), h-unit(0.5, "mm"), gp = gpar(fill = "#CCCCCC", col = NA))
},
HOMDEL = function(x, y, w, h) {
grid.rect(x, y, w-unit(0.5, "mm"), h-unit(0.5, "mm"), gp = gpar(fill = "blue", col = NA))
},
AMP = function(x, y, w, h) {
grid.rect(x, y, w-unit(0.5, "mm"), h-unit(0.5, "mm"), gp = gpar(fill = "red", col = NA))
},
MUT = function(x, y, w, h) {
grid.rect(x, y, w-unit(0.5, "mm"), h*0.33, gp = gpar(fill = "#008000", col = NA))
}
)
#定義顏色
col = c("MUT" = "#008000", "AMP" = "red", "HOMDEL" = "blue")
#讀取行名信息
sample_order <- scan(system.file("extdata", "sample_order.txt",
package = "ComplexHeatmap"),
what = "character")
#構(gòu)建對象
ht <- oncoPrint(mat, get_type = function(x) strsplit(x, ";")[[1]],
alter_fun = alter_fun, col = col,
row_order = NULL, column_order = sample_order,
remove_empty_columns = TRUE,
column_title = "OncoPrint for TCGA Lung Adenocarcinoma, genes in Ras Raf MEK JNK signalling",
heatmap_legend_param =
list(title = "Alternations",
at = c("AMP", "HOMDEL", "MUT"),
labels = c("Amplification",
"Deep deletion",
"Mutation"),
nrow = 1, title_position = "leftcenter"))
#畫圖
draw(ht, heatmap_legend_side = "bottom")
這里有另外一位大佬寫的,挺不錯的矮嫉,歡迎大家參閱
http://www.reibang.com/p/deb17d0bcd58
基因網(wǎng)絡(luò)
在實驗中削咆,我們常常會將RNAseq的結(jié)果和ChIPseq的結(jié)果結(jié)合起來,以利于提高下游的驗證率蠢笋,在這個過程中拨齐,可以使用GeneNetworkBuilder來查找高驗證率的目標(biāo)基因。GeneNetworkBuilder需要兩個實驗輸入昨寞,一個是全基因 組的表達(dá)圖譜奏黑,一個是目標(biāo)蛋白的目標(biāo)基因。GeneNetworkbuilder還需要該物種的調(diào)控網(wǎng)絡(luò)數(shù)據(jù)编矾,該數(shù)據(jù)是為了聯(lián)接實驗數(shù)據(jù)迷失的目標(biāo)基因熟史。
library(GeneNetworkBuilder)
library(simpIntLists)
i <- findInteractionList("human", "Official")
i <- lapply(i, function(.ele) cbind(from=as.character(.ele$name), to=as.character(.ele$interactors)))
i <- do.call(rbind, i)
set.seed(123)
## generate a random ChIP-seq binding table
rootgene <- sample(i[, 1], 1)
TFbindingTable <- i[i[, 1] == rootgene, ]
interactionmap <- i
# build network
sifNetwork<-buildNetwork(TFbindingTable=TFbindingTable,
interactionmap=interactionmap, level=2)
ID=unique(as.character(sifNetwork))
## create a random expression data
expressionData <- data.frame(ID=ID,
logFC=sample(-3:3, length(ID), replace=TRUE),
P.Value=runif(n=length(ID), max=0.25))
## filter network
cifNetwork<-filterNetwork(rootgene=rootgene, sifNetwork=sifNetwork,
exprsData=expressionData, mergeBy="ID",
miRNAlist=character(0),
tolerance=1, cutoffPVal=0.01, cutoffLFC=1)
## polish network
gR<-polishNetwork(cifNetwork)
## browse network
# browseNetwork(gR) # interactive htmlwidget;
## or plot by Rgraphviz
library(Rgraphviz)
plotNetwork<-function(gR, layouttype="dot", ...){
if(!is(gR,"graphNEL")) stop("gR must be a graphNEL object")
if(!(GeneNetworkBuilder:::inList(layouttype, c("dot", "neato", "twopi", "circo", "fdp")))){
stop("layouttype must be dot, neato, twopi, circo or fdp")
}
g1<-Rgraphviz::layoutGraph(gR, layoutType=layouttype, ...)
nodeRenderInfo(g1)$col <- nodeRenderInfo(gR)$col
nodeRenderInfo(g1)$fill <- nodeRenderInfo(gR)$fill
renderGraph(g1)
}
plotNetwork(gR)
查看Chip-seq信號強(qiáng)度
當(dāng)拿到ChIPseq的結(jié)果后,我們可以使用眾多手段來查看reads的真實情況窄俏,比如說使用IGV, UCSC genome browser等蹂匹。但是,當(dāng)大家需要把這個track生成圖片發(fā)表時凹蜈,這些工具提供的圖片輸出有時候無法達(dá)到發(fā)表的要求限寞,于是很多軟件包就因此而生。最常使用到的軟件包是Gviz仰坦。然而履植,我想大家也看出來了,本文夾私的現(xiàn)象非常嚴(yán)重悄晃,所以玫霎,我重點介紹一下trackViewer軟件包。
trackViewer軟件包不但可以輸出高品質(zhì)的track圖妈橄,還可以生成高顏值的lollipop plot庶近。甚至,還在推廣一種名為Dandelion plot的表示SNP以及Indel數(shù)據(jù)的可視化圖形眷蚓。
library(trackViewer)
gr <- parse2GRanges("chr3:108,476,000-108,485,000")
gr # interesting genomic locations
library(GenomicFeatures)# load GenomicFeatures can create TxDb from UCSC
if(interactive()){
mm8KG <- makeTxDbFromUCSC(genome="mm8", tablename="knownGene")
saveDb(mm8KG, "mm8KG.sqlite")
}else{## mm8KG was saved as sqlite file
mm8KG <- loadDb("data/mm8KG.sqlite")
}
library(org.Mm.eg.db) # load annotation database
## create the gene model tracks information
trs <- geneModelFromTxdb(mm8KG, org.Mm.eg.db, gr=gr)
## import data from bedGraph/bigWig/BED ... files, see ?importScore for details
CLIP <- importScore("data/CLIP.bedGraph", format="bedGraph", ranges=gr)
#bedGraph文件格式
control <- importScore("data/control.bedGraph", format="bedGraph", ranges=gr)
knockdown <- importScore("data/knockdown.bedGraph", format="bedGraph", ranges=gr)
## create styles by preset theme
optSty <- optimizeStyle(trackList(trs, knockdown, control, CLIP), theme="col")
trackList <- optSty$tracks
viewerStyle <- optSty$style
## adjust the styles for this track
### rename the trackList for each track
names(trackList)[1:2] <- paste0("Sort1: ", names(trackList)[1:2])
names(trackList)[3] <- "RNA-seq TDP-43 KD"
names(trackList)[4] <- "RNA-seq control"
### change the lab positions for gene model track to bottomleft
setTrackStyleParam(trackList[[1]], "ylabpos", "bottomleft")
setTrackStyleParam(trackList[[2]], "ylabpos", "bottomleft")
### change the color of gene model track
setTrackStyleParam(trackList[[1]], "ylabgp", list(cex=1, col="red"))
setTrackStyleParam(trackList[[2]], "ylabgp", list(cex=1, col="green"))
### remove the xaxis
setTrackViewerStyleParam(viewerStyle, "xaxis", FALSE)
### add a scale bar in CLIP track
setTrackXscaleParam(trackList[[5]], "draw", TRUE)
## plot the tracks
vp <- viewTracks(trackList, gr=gr, viewerStyle=viewerStyle)
### add guide lines to show the range of CLIP-seq signal
addGuideLine(c(108481252, 108481887), vp=vp)
### add arrow mark to show the alternative splicing event
addArrowMark(list(x=c(108483570, 108483570),
y=c(3, 4)),
label=c("Inclusive\nexon", ""),
col=c("blue", "cyan"),
vp=vp, quadrant=1)
trackViewer官方文檔:
http://www.bioconductor.org/packages/release/bioc/vignettes/trackViewer/inst/doc/trackViewer.html
motif分析
motif一般利用MEME Suite或者Homer來進(jìn)行motif搜索鼻种,然后利用motifStack包進(jìn)行可視化
motifStack官方文檔:
http://www.bioconductor.org/packages/release/bioc/vignettes/motifStack/inst/doc/motifStack_HTML.html
library(motifStack)
pcm <- read.table(file.path(find.package("motifStack"),
"extdata", "bin_SOLEXA.pcm"))
pcm <- pcm[,3:ncol(pcm)]
rownames(pcm) <- c("A","C","G","T")
motif <- new("pcm", mat=as.matrix(pcm), name="bin_SOLEXA")
plot(motif)
## plot a RNA sequence logo
rna <- pcm
rownames(rna)[4] <- "U"
motif <- new("pcm", mat=as.matrix(rna), name="RNA_motif")
plot(motif)
其中,這個包:https://www.plob.org/article/15953.html也有提到用法
或者我之前推送的:http://www.reibang.com/p/e7983c461635也有部分講解
其中pcm結(jié)尾的文件是個打分矩陣沙热,可在JASPAR數(shù)據(jù)庫 (http://jaspar.genereg.net/) 下載