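Introduction
In the previous section, we showed how to draw Venn diagrams to display the overlaps between sets.
However, as the number of sets grows, the relationships shown in a Venn diagram become more and more complex, and it is hard to read the information at a glance.
Today we look at how to draw the overlaps when there are many sets.
We will use the UpSetR package to draw the kind of plot described below.
The plot consists of three subplots:
- a bar chart of intersection sizes (top)
- a horizontal bar chart of set sizes (bottom left)
- a matrix of the overlaps between sets (bottom right); each column of the matrix is one intersection combination and lines up with the x-axis of the intersection bar chart; each row is one set and lines up with the y-axis of the set-size bar chart
A single plot like this can show the overlaps among many sets, and the intersection information is easy to read off.
So how do we draw such a plot?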
Basics
1. Installation and loading
install.packages("UpSetR")
library(UpSetR)
We use the example data that ships with the package
movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"),
header = T, sep = ";")
2. Data
Before plotting, we need to know the format of the input data.
UpSetR provides two conversion functions, fromList and fromExpression, for formatting data
- fromList takes a list (each element represents one set) and converts it to a data frame, for example
listInput <- list(
one = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
two = c(1, 2, 4, 5, 10),
three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))
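- fromExpression takes a named vector in which each name is a set, or a combination of sets joined by the & symbol, and each value is the number of elements that belong to exactly that combination, for example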
expressionInput <- c(
one = 2, two = 1, three = 2,
`one&two` = 1, `one&three` = 4,
`two&three` = 1, `one&two&three` = 2)
With the data above, we can draw the following plot
upset(fromList(listInput), order.by = "freq")
# upset(fromExpression(expressionInput), order.by = "freq")
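To make the two input formats concrete, here is a small base-R sketch that does not call UpSetR at all. It builds a 0/1 membership table from listInput by hand (one row per element, one column per set), then counts how many elements fall in exactly each combination of sets. The names membership, combo and exclusive_counts are illustrative only. Under the assumption that the values in expressionInput are such exclusive counts, the two inputs describe the same data.

```r
# Same sets as listInput in the text
listInput <- list(
  one   = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
  two   = c(1, 2, 4, 5, 10),
  three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))

# One row per element, one 0/1 column per set (built by hand for illustration)
elements <- sort(unique(unlist(listInput)))
membership <- as.data.frame(
  lapply(listInput, function(s) as.integer(elements %in% s)))
rownames(membership) <- elements

# Label each element with the exact combination of sets it belongs to
combo <- apply(membership, 1, function(r) {
  paste(names(listInput)[r == 1], collapse = "&")
})

# Number of elements per exact combination; compare with expressionInput
exclusive_counts <- table(combo)
print(exclusive_counts)
```

Comparing the printed counts with expressionInput is a quick way to check that a hand-written expression matches the underlying sets.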
3. Plotting a subset of the sets
Here we set nsets = 6 to restrict the plot to the 6 largest sets
upset(movies, nsets = 6,
number.angles = 30,
point.size = 3.5,
line.size = 2,
mainbar.y.label = "Genre Intersections",
sets.x.label = "Movies Per Genre",
text.scale = c(1.3, 1.3, 1, 1, 2, 0.75))
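We can also pass parameters to adjust the appearance: number.angles sets the angle of the numbers above the bars; point.size and line.size set the size of the points and lines in the matrix; mainbar.y.label and sets.x.label set the axis labels of the intersection bar chart and the set-size bar chart; text.scale takes 6 values that set the size of every text label in the plot.
The values of text.scale are given in this order:
- axis title of the intersection bar chart
- tick labels of the intersection bar chart
- axis title of the set-size bar chart
- tick labels of the set-size bar chart
- set names
- numbers above the bars showing the intersection sizes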
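Because text.scale is positional, it is easy to lose track of which number controls which label. One possible habit, sketched below, is to build the vector with names and strip them when calling upset. The element names here are my own labels for the six positions, not anything UpSetR reads.

```r
# Name each position so the order is visible; the names are for the reader only
text_scale <- c(
  intersection_size_title = 1.3,
  intersection_size_ticks = 1.3,
  set_size_title          = 1,
  set_size_ticks          = 1,
  set_names               = 2,
  numbers_above_bars      = 0.75)

# Pass the bare numbers to upset(), for example:
# upset(movies, nsets = 6, text.scale = unname(text_scale))
print(text_scale)
```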
We can also specify which sets to show
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45)
)
mb.ratio controls the height ratio between the upper bar chart and the lower matrix
4. Sorting
We can set the order.by parameter to sort the intersections.
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = "freq",
decreasing = TRUE
)
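With order.by = "freq", intersections are sorted by size, and decreasing = TRUE puts the largest intersections first. The documented default is decreasing = c(TRUE, FALSE), so it is safest to state the direction explicitly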
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = "degree",
decreasing = FALSE
)
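With order.by = "degree", intersections are sorted by the number of sets they involve, and decreasing = FALSE puts the lowest degree first
We can also specify both values at once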
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = c("degree", "freq"),
decreasing = c(TRUE, FALSE)
)
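As a rough mental model of what sorting by degree and then by freq means, the base-R sketch below orders a hand-written table of intersections with order(). This only illustrates the idea of a two-key sort. It is not UpSetR's internal code, and how the package breaks ties may differ. The table uses the exclusive counts of the small listInput example from earlier.

```r
# Exclusive intersections of the small listInput example (written by hand)
intersections <- data.frame(
  name   = c("one", "two", "three", "one&two", "one&three",
             "two&three", "one&two&three"),
  degree = c(1, 1, 1, 2, 2, 2, 3),   # number of sets in the combination
  freq   = c(2, 1, 2, 1, 4, 1, 2),   # number of elements in exactly that combination
  stringsAsFactors = FALSE)

# Degree decreasing first, then freq increasing within the same degree
ordered <- intersections[order(-intersections$degree, intersections$freq), ]
print(ordered$name)
```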
To keep the sets in the order given in the sets parameter, set keep.order = TRUE
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = c("degree", "freq"),
decreasing = c(TRUE, FALSE),
keep.order = TRUE
)
To show combinations whose intersection is empty, set the empty.intersections parameter
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
empty.intersections = "on"
)
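Queries
Queries are run through the queries parameter, which takes a nested list describing one or more queries. Each query has four fields:
- query: the query to run
- params: a list of parameters for the query
- color: the color used in the plot for elements that match the query
- active: if TRUE, the query overrides the color of the bar; if FALSE, jittered points are drawn on the bar instead
For example
1. Built-in intersection query
We use the built-in intersection query, intersects, to find a specific intersection and color it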
upset(movies, queries = list(
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange",
active = T),
list(
query = intersects,
params = list("Drama"),
color = "red",
active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T)
)
)
2. Built-in element query
We use elements to run an element query, which shows how the matching elements are distributed across the intersections
upset(movies,
queries = list(
list(
query = elements,
params = list("AvgRating", 3.5, 4.1),
color = "blue",
active = T),
list(
query = elements,
params = list("ReleaseDate", 1980, 1990, 2000),
color = "red",
active = F)
)
)
3. Using an expression
We can pass a filter expression to the expression parameter to take a subset of the query results.
upset(movies,
queries = list(
list(
query = intersects,
params = list("Action", "Drama"),
active = T),
list(
query = elements,
params = list("ReleaseDate", 1980, 1990, 2000),
color = "red",
active = F)),
expression = "AvgRating > 3 & Watches > 100"
)
4. Custom queries
A query function is applied to each row of the data. We can define a query function like this
Myfunc <- function(row, release, rating) {
data <- (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}
It selects movies whose release year is in release and whose average rating is above a given value
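Before handing such a function to upset, it can help to try it on a few rows by hand. The sketch below applies the same row-wise test to a tiny made-up data frame. The toy data and the name in_years_and_rated are hypothetical and are not part of the movies dataset or of UpSetR.

```r
# Same logic as Myfunc: TRUE when the year is listed and the rating is high enough
in_years_and_rated <- function(row, release, rating) {
  (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}

# Three made-up movies, for illustration only
toy <- data.frame(
  ReleaseDate = c(1980, 1985, 1999),
  AvgRating   = c(3.1, 4.0, 2.0))

# Apply the test to every row, as a query function is applied per row
keep <- apply(toy, 1, in_years_and_rated,
              release = c(1970, 1980, 1990, 1999, 2000), rating = 2.5)
print(unname(keep))
```

Only the first toy row passes both conditions: the second has a year that is not listed, and the third has a rating that is too low.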
Run the query
upset(movies,
queries = list(
list(
query = Myfunc,
params = list(c(1970, 1980, 1990, 1999, 2000), 2.5),
color = "blue",
active = T)
)
)
5. Adding a query legend
The query.legend parameter sets the position of the query legend, top or bottom
Inside a query, query.name sets the name of the query; if it is not set, a name is generated automatically
upset(movies,
query.legend = "top",
queries = list(
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange", active = T,
query.name = "Funny action"),
list(
query = intersects,
params = list("Drama"),
color = "red", active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T,
query.name = "Emotional action")
)
)
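Attribute plots
The attribute.plots parameter draws attribute plots. It has 3 fields:
- gridrows: the amount of space given to the attribute plots. The UpSet plot is laid out on a 100 X 100 grid by default; with gridrows set to 50, the whole figure becomes 150 X 100
- plots: a list of plots, each with 4 parameters:
  - plot: a function that returns a ggplot object
  - x: the x-axis variable of the plot
  - y: the y-axis variable of the plot
  - queries: whether to overlay the plot with the data from existing queries
- ncols: the number of columns
1. Built-in plot functions
We use the histogram function that ships with the package to draw histograms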
upset(movies,
main.bar.color = "black",
queries = list(
list(
query = intersects,
params = list("Drama"),
active = T)
),
attribute.plots = list(
gridrows = 50,
plots = list(
list(
plot = histogram,
x = "ReleaseDate",
queries = F),
list(
plot = histogram,
x = "AvgRating",
queries = T)
),
ncols = 2
)
)
Use the scatter_plot function to draw scatter plots
upset(movies,
main.bar.color = "black",
queries = list(
list(
query = intersects,
params = list("Drama"),
color = "red",
active = F),
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange",
active = T)
),
attribute.plots = list(
gridrows = 45,
plots = list(
list(
plot = scatter_plot,
x = "ReleaseDate",
y = "AvgRating",
queries = T),
list(plot = scatter_plot,
x = "AvgRating",
y = "Watches",
queries = F)
),
ncols = 2),
query.legend = "bottom"
)
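2. Custom plot functions
We first define two ggplot2-based functions, one for a scatter plot and one for a density plot
library(ggplot2)  # provides ggplot(), aes_string(), theme() and unit()
my_scatter <- function(data, x, y) {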
p <- ggplot(data, aes_string(x, y, colour = "color")) +
geom_point() +
scale_colour_identity() +
theme(
plot.margin = unit(c(0, 0, 0, 0), "cm")
)
p
}
my_density <- function(data, x, y) {
data$decades <- data[, y] %/% 10 * 10
data <- data[which(data$decades >= 1970), ]
p <- ggplot(data, aes_string(x)) +
geom_density(aes(fill = factor(decades)), alpha = 0.3) +
theme(
plot.margin = unit(c(0, 0, 0, 0), "cm"),
legend.key.size = unit(0.4, "cm")
)
p
}
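The first two lines of my_density bin each year into its decade with integer division and then drop everything before 1970. A short sketch with made-up years shows what that arithmetic does. The vector years is illustrative only.

```r
# Made-up release years, for illustration only
years <- c(1969, 1971, 1989, 1999, 2000)

# Integer-divide by 10, then multiply back: 1989 %/% 10 is 198, times 10 is 1980
decades <- years %/% 10 * 10

# Same filter as in my_density: keep the 1970s and later
kept <- decades[decades >= 1970]
print(decades)
print(kept)
```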
Then use them as attribute plots
upset(movies,
main.bar.color = "black",
queries = list(
list(
query = intersects,
params = list("Drama"),
color = "red", active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T),
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange", active = T)
),
attribute.plots = list(
gridrows = 45,
plots = list(
list(
plot = my_scatter,
x = "ReleaseDate",
y = "AvgRating",
queries = T),
list(
plot = my_density,
x = "AvgRating",
y = "ReleaseDate",
queries = F)
),
ncols = 2)
)
3. Box plots
To draw box plots, use the boxplot.summary parameter. At most two variables can be summarized at a time.
upset(movies, boxplot.summary = c("AvgRating", "ReleaseDate"))
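Of course, the same can be done with a custom plot function
Set metadata
The set.metadata parameter attaches metadata to the sets. It has 3 fields:
- data: a data frame whose first column holds the set names and whose remaining columns hold attributes of the sets
- ncols: the number of columns
- plots: also a list, where each element has 4 fields: column, type, assign and colors
  - column: the name of the column in data to plot
  - type: the kind of plot to draw. For a numeric column it can be hist or heat; for a boolean column, a bool heat map; for a categorical (character) column, heat or text; to draw inside the matrix, use matrix_rows
  - assign: the number of grid columns assigned to the metadata plot. If two metadata plots are assigned 20 and 10 columns, the UpSet figure becomes 100 X 130
  - colors: the colors of the metadata plot. For a bar chart, the color is applied to the whole plot; for heat or bool, a vector of colors can be given; for a factor column there is no colors parameter and the plot uses a gradient; for text, a color can be set for each unique string, and colors are assigned automatically if none is given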
1. Bar chart
We add a metadata attribute to each set by giving every set (genre) a random average Rotten Tomatoes score
sets <- names(movies[3:19])
avgRottenTomatoesScore <- round(runif(17, min = 0, max = 90))
metadata <- as.data.frame(cbind(sets, avgRottenTomatoesScore))
names(metadata) <- c("sets", "avgRottenTomatoesScore")
To draw a bar chart, the column must be numeric
> str(metadata)
'data.frame': 17 obs. of 2 variables:
$ sets : Factor w/ 17 levels "Action","Adventure",..: 1 2 3 4 5 6 7 8 12 9 ...
$ avgRottenTomatoesScore: Factor w/ 12 levels "13","16","21",..: 6 10 12 5 1 1 3 2 11 11 ...
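We can see that the score column is a factor (under R 4.0 or later, cbind followed by as.data.frame yields a character column instead), so it has to be converted first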
metadata$avgRottenTomatoesScore <- as.numeric(as.character(metadata$avgRottenTomatoesScore))
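The detour through as.character matters when the column is a factor. Calling as.numeric directly on a factor returns the internal level codes, not the numbers printed on screen. The toy factor below is made up for illustration.

```r
# A made-up factor of scores stored as text
scores <- factor(c("13", "21", "16"))

# Direct conversion gives the positions of the levels ("13", "16", "21")
codes <- as.numeric(scores)

# Converting to character first recovers the actual values
values <- as.numeric(as.character(scores))
print(codes)
print(values)
```

The same two-step conversion also works on a plain character column, so it is safe to keep in either case.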
Now we can draw the metadata plot
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "hist",
column = "avgRottenTomatoesScore",
assign = 20)
)
)
)
2. Heat map
Next we add a city attribute to the set metadata, making sure the column is character rather than factor
Cities <- sample(c("Boston", "NYC", "LA"), 17, replace = T)
metadata <- cbind(metadata, Cities)
metadata$Cities <- as.character(metadata$Cities)
We draw two heat maps, one with colors specified and one without
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "heat",
column = "Cities",
assign = 10,
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple")
),
list(
type = "heat",
column = "avgRottenTomatoesScore",
assign = 10)
)
)
)
As shown, the heat map without specified colors uses a gray gradient
Boolean heat map
We add an accepted column to the set metadata, with values 0 and 1
accepted <- round(runif(17, min = 0, max = 1))
metadata <- cbind(metadata, accepted)
The setup is similar to the one above
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "bool",
column = "accepted",
assign = 5,
colors = c("#FF3333", "#006400")
)
)
)
)
If we replace bool with heat
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "heat",
column = "accepted",
assign = 5,
colors = c("#FF3333", "#006400")
)
)
)
)
the 0/1 boolean data is treated as numeric and drawn with a gradient
3. Text
For the city metadata, showing text may be more suitable than a heat map
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "text",
column = "Cities",
assign = 10,
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple")
)
)
)
)
4. Applying metadata in the matrix
Sometimes we want the metadata to show up directly in the UpSet plot. Setting type = "matrix_rows" colors the rows of the matrix by city
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "hist",
column = "avgRottenTomatoesScore",
assign = 20),
list(
type = "matrix_rows",
column = "Cities",
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple"),
alpha = 0.5)
)
)
)
Putting it together
Finally, we combine all of these plots into one
upset(movies,
# Queries
queries = list(
list(
query = intersects,
params = list("Drama"),
color = "red",
active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T),
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange",
active = T)),
# Set metadata plots
set.metadata = list(
data = metadata,
plots = list(
list(
type = "hist",
column = "avgRottenTomatoesScore",
assign = 20),
list(
type = "bool",
column = "accepted",
assign = 5,
colors = c("#FF3333", "#006400")),
list(
type = "text",
column = "Cities",
assign = 5,
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple")),
list(
type = "matrix_rows",
column = "Cities",
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple"),
alpha = 0.5)
)
),
# Attribute plots
attribute.plots = list(
gridrows = 45,
plots = list(
list(
plot = my_scatter,
x = "ReleaseDate",
y = "AvgRating",
queries = T),
list(plot = my_density,
x = "AvgRating",
y = "ReleaseDate",
queries = F)),
ncols = 2),
query.legend = "bottom"
)
Code:
https://github.com/dxsbiocc/learn/blob/main/R/plot/upset_plot.R
Parameter details