這是英文版的第6,7,8章
6 Workflow: scripts
- 這章節(jié)其實(shí)沒啥可說的酿傍,只是想起一篇文章池颈,其中有一些小tips
# 看了你安裝的包 installed.packages() # 看下你加載的包 (.packages()) # 批量加載 load_package <- c("ComplexHeatmap","circlize") if (!all(load_package %in% (.packages()))) lapply(load_package, library, character.only = TRUE)
7.1 Introduction
-
Exploratory Data Analysis(EDA)是一個(gè)不斷迭代的過程
- Generate questions about your data.
- Search for answers by visualising, transforming, and modelling your data.
- Use what you learn to refine your questions and/or generate new questions.
關(guān)于EDA的探索些己,《Bioinformatics Data skills》里面有一章節(jié)也講的也挺好(應(yīng)該是Chapter 8: A Rapid Introduction to the R Language)锄禽,而且是關(guān)于生信的。
7.2 Questions
EDA是一個(gè)不斷探索你數(shù)據(jù)的過程磺平,而其中最簡(jiǎn)單的方法就是不斷地問問題
-
You can loosely word these questions as:
- What type of variation occurs within my variables?
- What type of covariation occurs between my variables?
-
To make the discussion easier, let’s define some terms:
A variable is a quantity, quality, or property that you can measure.
A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.
-
Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.(這應(yīng)該是tidyverse生態(tài)的核心數(shù)據(jù)格式魂仍,行為觀測(cè),列為變量)
但我們的生信數(shù)據(jù)一般是反過來的拣挪,行為變量(基因)擦酌,列為觀測(cè)(各個(gè)樣本)
7.3 Variation
- Variation is the tendency of the values of a variable to change from measurement to measurement. Every variable has its own pattern of variation, which can reveal interesting information(其實(shí)這話在一定程度上,也就意味著每個(gè)變量都有其特有的分布)媒吗。
像RNA-Seq的count值的分布就是屬于伽馬-泊松混合分布(即二項(xiàng)分布),而RNA-Seq得到的p-value值的分布就是零假設(shè)為真的基因p-value與有差異的基因的p-value的混合分布
The best way to understand that pattern is to visualise the distribution of the variable's values
你也可以用vcd包的goodfit來看下你的數(shù)據(jù)和某一類分布的擬合程度
norm_data <- rpois(1000,lambda = 5) ofit = goodfit(norm_data, "poisson") plot(ofit, xlab = "")
參考:
常見分布的擬合方法
Modern Statistics for Modern Biology_4.4.3
- 可視化你變量的分布取決于你的變量是連續(xù)型變量還是離散型變量
如果是離散型變量的話乙埃,一般就是用geom_bar
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
高度可以用
diamonds %>% count(cut) #> # A tibble: 5 x 2 #> cut n #> <ord> <int> #> 1 Fair 1610 #> 2 Good 4906 #> 3 Very Good 12082 #> 4 Premium 13791 #> 5 Ideal 21551
計(jì)算出來闸英,然后
library(tidyverse) diamonds %>% count(cut) %>% ggplot(aes(x = cut, y = n)) + geom_bar(stat = "identity")
如果是連續(xù)型變量的話,用geom_histogram
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
You can compute this by hand by combining dplyr::count() and ggplot2::cut_width():
diamonds %>%
count(cut_width(carat, 0.5))
#> # A tibble: 11 x 2
#> `cut_width(carat, 0.5)` n
#> <fct> <int>
#> 1 [-0.25,0.25] 785
#> 2 (0.25,0.75] 29498
#> 3 (0.75,1.25] 15977
#> 4 (1.25,1.75] 5313
#> 5 (1.75,2.25] 2002
#> 6 (2.25,2.75] 322
#> # … with 5 more rows
# 其實(shí)就是cut_width就是把連續(xù)型變量切割成區(qū)間
> cut_width(diamonds$carat, 0.5) %>%
+ head()
[1] [-0.25,0.25] [-0.25,0.25] [-0.25,0.25] (0.25,0.75]
[5] (0.25,0.75] [-0.25,0.25]
11 Levels: [-0.25,0.25] (0.25,0.75] ... (4.75,5.25]
cut_interval makes n groups with equal range
cut_number makes n groups with (approximately) equal numbers of observations
cut_width makes groups of width width.
- 在用geom_histogram的時(shí)候介袜,你需要不斷調(diào)整你的binwidth或則bins數(shù)目甫何,才能得到一個(gè)合適的圖像。
binwidth太小遇伞,你的圖像就會(huì)很sharp辙喂,很難看到趨勢(shì)。太大的話鸠珠,你的圖像就會(huì)過于smooth巍耗,也很難看到趨勢(shì)。
# 這里是選擇了一部分?jǐn)?shù)據(jù)集渐排,binwidth也變小了
smaller <- diamonds %>%
filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
除了用geom_hist之外炬太,你還可以用geom_freqpoly
# 用這個(gè)圖的話,你就會(huì)發(fā)現(xiàn)其實(shí)geom_freqpoly就是hist的頂點(diǎn)連線 ggplot(data = smaller, mapping = aes(x = carat)) + geom_histogram(binwidth = 0.1) + geom_freqpoly(binwidth = 0.1)
# 這里是用不同cut了
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
geom_freqpoly(binwidth = 0.1)
- 我們可以從可視化中發(fā)現(xiàn)一些趨勢(shì)驯耻,比如我們從下圖中就可以看到有一些亞群數(shù)據(jù)存在亲族。
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.01)
還有像這個(gè)數(shù)據(jù)
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_histogram(binwidth = 0.25)
在生信中,最常見的一個(gè)亞群現(xiàn)象可能就是如果你的樣本污染的話可缚,你的fastQC的GC含量那里霎迫,就會(huì)出現(xiàn)雙峰。這是因?yàn)橐粋€(gè)物種的GC含量會(huì)呈現(xiàn)于類似正態(tài)分布的分布帘靡,但如果你的樣本里面混入了其他物種知给,比如說菌等,那么你的GC含量就會(huì)有雙峰描姚,甚至幾個(gè)峰炼鞠。
另一個(gè)我能想起來的例子缘滥,就是“邪惡的鳶尾花”
# 你會(huì)發(fā)現(xiàn)Width會(huì)出現(xiàn)雙峰
# 其實(shí)是因?yàn)镻etal.width是好幾種花的數(shù)據(jù)
hist(iris$Petal.Width)
Many of the questions above will prompt you to explore a relationship between variables, for example, to see if the values of one variable can explain the behavior of another variable.
- 文中提到了異常數(shù)據(jù)的發(fā)現(xiàn),這讓我想起breakdown point這個(gè)概念谒主。其代表的是一個(gè)估計(jì)器對(duì)于臟數(shù)據(jù)的最大容忍程度朝扼。比如median對(duì)于異常值并不敏感,breakdown point是1/2霎肯。而mean則對(duì)異常值很敏感擎颖,其值是 1/N,N代表了你的數(shù)據(jù)量观游。我個(gè)人理解就是你要改變整個(gè)數(shù)據(jù)的median搂捧,你需要改變一半數(shù)據(jù),而如果你要改變整個(gè)數(shù)據(jù)的mean懂缕,你只需要改變一個(gè)數(shù)據(jù)允跑。這也就是為啥有時(shí)候我們對(duì)于異常值影響很大的相關(guān)性數(shù)據(jù),會(huì)考慮使用Sperman搪柑,而不是Pearson聋丝。或者我們?cè)谧鼍€性回歸的時(shí)候工碾,為了考慮模型的穩(wěn)健性弱睦,會(huì)考慮用median,而不是mean渊额。
純屬瞎想……
參考:
穩(wěn)健回歸(Robustness regression)
Modern Statistics for Modern Biology_8.7.4 Robustness
- 當(dāng)你有很多數(shù)據(jù)的時(shí)候况木,你很難在柱狀圖中看到異常數(shù)據(jù)。比如下圖
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
這時(shí)候你可以考慮Zoom in你的數(shù)據(jù)
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
(coord_cartesian() also has an xlim() argument for when you need to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.) 注意這個(gè)區(qū)別
Solution的Exercise 7.3.4會(huì)對(duì)此有些解釋
讓我想起了ggforce的縮放
library(ggforce) #> Loading required package: ggplot2 ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) + geom_point() + facet_zoom(x = Species == "versicolor")
- It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to replace them with missing values, and move on. However, if they have a substantial effect on your results, you shouldn't drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.
- Exercise 7.3.4
Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
The coord_cartesian() function zooms in on the area specified by the limits, after having calculated and drawn the geoms. Since the histogram bins have already been calculated, it is unaffected.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = price)) +
coord_cartesian(xlim = c(100, 5000), ylim = c(0, 3000))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
However, the xlim() and ylim() functions influence actions before the calculation of the stats related to the histogram. Thus, any values outside the x- and y-limits are dropped before calculating bin widths and counts. This can influence how the histogram looks.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = price)) +
xlim(100, 5000) +
ylim(0, 3000)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 14714 rows containing non-finite values (stat_bin).
#> Warning: Removed 6 rows containing missing values (geom_bar).
7.4 Missing values
- 如果你遇到了一些異常值旬迹,但不想費(fèi)心地處理火惊,可以考慮兩種對(duì)待異常值的方法
一種是把含有異常值的整行都丟棄
diamonds2 <- diamonds %>%
filter(between(y, 3, 20))
I don’t recommend this option because just because one measurement is invalid, doesn’t mean all the measurements are. Additionally, if you have low quality data, by time that you’ve applied this approach to every variable you might find that you don’t have any data left!
另一種是把異常值取代為NA
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
ifelse()
has three arguments. The first argumenttest
should be a logical vector. The result will contain the value of the second argument,yes
, whentest
isTRUE
, and the value of the third argument,no
, when it is false. Alternatively to ifelse, usedplyr::case_when()
.case_when()
is particularly useful inside mutate when you want to create a new variable that relies on a complex combination of existing variables.
- Exercise 7.4.1
What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
這個(gè)solution推薦看一下
Missing values are removed when the number of observations in each bin are calculated. See the warning message: Removed 9 rows containing non-finite values (stat_bin)
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
ggplot(diamonds2, aes(x = y)) +
geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 9 rows containing non-finite values (stat_bin).
In the geom_bar()
function, NA
is treated as another category. The x
aesthetic in geom_bar()
requires a discrete (categorical) variable, and missing values act like another category.
diamonds %>%
mutate(cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut))) %>%
ggplot() +
geom_bar(mapping = aes(x = cut))
In a histogram, the x
aesthetic variable needs to be numeric, and stat_bin()
groups the observations by ranges into bins. Since the numeric value of the NA
observations is unknown, they cannot be placed in a particular bin, and are dropped.
7.5 Covariation
- If variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on the type of variables involved.
注意這跟協(xié)變量(covariance)是不一樣的
下面的
- 一個(gè)分類變量、一個(gè)連續(xù)變量
- 兩個(gè)分類變量
- 兩個(gè)連續(xù)變量
都值得去看一看
7.5.1 A categorical and continuous variable
It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of
geom_freqpoly()
is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in shape. For example, let’s explore how the price of a diamond varies with its quality:geom_freqpoly()的縱坐標(biāo)是數(shù)目奔垦,如果你的分類變量對(duì)應(yīng)的數(shù)目差距很大的話矗晃,你如果放一個(gè)圖里面,那么數(shù)目少那組的自身變化就很難看出來了宴倍。
For example, let’s explore how the price of a diamond varies with its quality:
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
這里紫色的Fair組你就很難看出趨勢(shì)來
It’s hard to see the difference in distribution because the overall counts differ so much:
ggplot(diamonds) +
geom_bar(mapping = aes(x = cut))
To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display density, which is the count standardised so that the area under each frequency polygon is one.
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
我們可以看到张症,似乎質(zhì)量差的鉆石反而價(jià)格更高
這跟直接geom_density得到的密度圖還是不一樣的。但我也不知道density是怎么出來的
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_density(mapping = aes(colour = cut))
- Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of(這個(gè)解釋甚好):
- A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.
- Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual so are plotted individually.
-
A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
看起來上面這個(gè)圖是個(gè)不對(duì)稱的分布
- 從boxplot同樣發(fā)現(xiàn)了質(zhì)量相對(duì)較低的鉆石價(jià)格更高鸵贬,而高質(zhì)量的鉆石價(jià)格偏低
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()
- 有個(gè)小插曲俗他,怎么對(duì)連續(xù)型變量在圖中根據(jù)某一順序進(jìn)行排序
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
If you have long variable names, geom_boxplot()
will work better if you flip it 90°. You can do that with coord_flip()
.
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
coord_flip()
這個(gè)比較有用的應(yīng)該是你畫GO plot的時(shí)候,可以按某個(gè)順序排下來阔逼,這樣圖會(huì)很清爽兆衅。
-
Exercise 7.5.1.2
What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
這里算是前面那個(gè)質(zhì)量越差反而價(jià)格越高的解釋吧
這個(gè)來自solution 核心意思就是因?yàn)橘|(zhì)量差的那些鉆石,其個(gè)頭比較大,所以相對(duì)來說羡亩,價(jià)格比較大
如果不想看的話摩疑,可以跳過。
What are the general relationships of each variable with the price of the diamonds? I will consider the variables: carat
, clarity
, color
, and cut
. I ignore the dimensions(就是所謂的diamond數(shù)據(jù)里面的x畏铆、y雷袋、z) of the diamond since carat
measures size, and thus incorporates most of the information contained in these variables.
Since both price
and carat
are continuous variables, I use a scatter plot to visualize their relationship.
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point()
However, since there is a large number of points in the data, I will use a boxplot by binning carat
(as suggested in the chapter).
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
一個(gè)小問題就是如果你是ggplot2 3.3.0的話,是跑不出這樣的結(jié)果的
Note that the choice of the binning width is important, as if it were too large it would obscure any relationship, and if it were too small, the values in the bins could be too variable to reveal underlying trends.
The variables color
and clarity
are ordered categorical variables. The chapter suggests visualizing a categorical and continuous variable using frequency polygons or boxplots. In this case, I will use a box plot since it will better show a relationship between the variables.
There is a weak negative relationship between color
and price
. The scale of diamond color goes from D (best) to J (worst). Currently, the levels of diamonds$color
are in the wrong order. Before plotting, I will reverse the order of the color
levels so they will be in increasing order of quality on the x-axis. The color
column is an example of a factor variable, which is covered in the “Factors” chapter of R4DS.
diamonds %>%
mutate(color = fct_rev(color)) %>%
ggplot(aes(x = color, y = price)) +
geom_boxplot()
fct_rev這個(gè)命令很奇怪
# 變不回去了 > f <- factor(c("a", "b", "c")) > fct_rev(f) [1] a b c Levels: c b a > fct_rev(f) [1] a b c Levels: c b a
There is also weak negative relationship between clarity
and price
. The scale of clarity goes from I1 (worst) to IF (best).
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = clarity, y = price))
For both clarity
and color
, there is a much larger amount of variation within each category than between categories. Carat is clearly the single best predictor of diamond prices.
Now that we have established that carat appears to be the best predictor of price, what is the relationship between it and cut? Since this is an example of a continuous (carat) and categorical (cut) variable, it can be visualized with a box plot.
ggplot(diamonds, aes(x = cut, y = carat)) +
geom_boxplot()
There is a lot of variability in the distribution of carat sizes within each cut category. There is a slight negative relationship between carat and cut. Noticeably, the largest carat diamonds have a cut of “Fair” (the lowest).
This negative relationship can be due to the way in which diamonds are selected for sale. A larger diamond can be profitably sold with a lower quality cut, while a smaller diamond requires a better cut.
Y叔介紹過的翻云覆雨圖
-
Exercise 7.5.1.4
One problem with box plots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using
geom_lv()
to display the distribution of price vs cut. What do you learn?How do you interpret the plots?
基因組數(shù)據(jù)一般就是大量數(shù)據(jù)辞居,個(gè)人感覺如果對(duì)于一些基因組數(shù)據(jù)使用小提琴圖楷怒、箱式圖、點(diǎn)圖等效果都不好瓦灶,因?yàn)闀?huì)是密密麻麻的一團(tuán)鸠删。一般都得過濾之后才會(huì)展現(xiàn)出一些趨勢(shì)。
ggplot(diamonds, aes(x = cut, y = price)) +
geom_lv()
-
Exercise 7.5.1.6
If you have a small dataset, it’s sometimes useful to use
geom_jitter()
to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar togeom_jitter()
. List them and briefly describe what each one does.
這個(gè)很適合實(shí)驗(yàn)上的表型數(shù)據(jù)展現(xiàn)贼陶。ggbeeswarm
There are two methods:
-
geom_quasirandom()
produces plots that are a mix of jitter and violin plots. There are several different methods that determine exactly how the random location of the points is generated. -
geom_beeswarm()
produces a plot similar to a violin plot, but by offsetting the points.
I’ll use the mpg
box plot example since these methods display individual points, they are better suited for smaller datasets.
ggplot(data = mpg) +
geom_quasirandom(mapping = aes(
x = reorder(class, hwy, FUN = median),
y = hwy
))
ggplot(data = mpg) +
geom_quasirandom(
mapping = aes(
x = reorder(class, hwy, FUN = median),
y = hwy
),
method = "tukey"
)
ggplot(data = mpg) +
geom_quasirandom(
mapping = aes(
x = reorder(class, hwy, FUN = median),
y = hwy
),
method = "tukeyDense"
)
ggplot(data = mpg) +
geom_quasirandom(
mapping = aes(
x = reorder(class, hwy, FUN = median),
y = hwy
),
method = "frowney"
)
ggplot(data = mpg) +
geom_quasirandom(
mapping = aes(
x = reorder(class, hwy, FUN = median),
y = hwy
),
method = "smiley"
)
ggplot(data = mpg) +
geom_beeswarm(mapping = aes(
x = reorder(class, hwy, FUN = median),
y = hwy
))
7.5.2 Two categorical variables
- To visualise the covariation between categorical variables, you’ll need to count the number of observations for each combination(把所有組合個(gè)數(shù)頭統(tǒng)計(jì)出來). One way to do that is to rely on the built-in
geom_count()
:
# geom_count的確是個(gè)好函數(shù)唉
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
# 應(yīng)該是等價(jià)于
diamonds %>%
group_by(cut, color) %>%
mutate(n = n()) %>%
ggplot(mapping = aes(x = cut, y = color, size = n)) +
geom_point()
上面的圖讓我想起了Motif的圖
這個(gè)圖Y叔有講過一點(diǎn):怎樣用HOMER算出的P-value畫出CNS級(jí)別的泡泡圖刃泡?
- 還有一種展現(xiàn)方法:visualise with
geom_tile()
and the fill aesthetic:
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.
關(guān)于seriation包,搜到了這個(gè)【ComplexHeatmap包學(xué)習(xí)筆記】2.5 Seriation
-
Exercise 7.5.2.1
How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?
其實(shí)核心就是用百分比碉怔,類似于之前的密度圖烘贴。我覺得還是很值得我們學(xué)習(xí)的
To clearly show the distribution of cut
within color
, calculate a new variable prop
which is the proportion of each cut within a color
. This is done using a grouped mutate.
diamonds %>%
count(color, cut) %>%
group_by(color) %>%
mutate(prop = n / sum(n)) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop)) +
scale_fill_viridis(limits = c(0, 1)) # from the viridis colour palette library
Similarly, to scale by the distribution of color
within cut
,
diamonds %>%
count(color, cut) %>%
group_by(cut) %>%
mutate(prop = n / sum(n)) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop)) +
scale_fill_viridis(limits = c(0, 1))
I add limit = c(0, 1)
to put the color scale between (0, 1). These are the logical boundaries of proportions. This makes it possible to compare each cell to its actual value, and would improve comparisons across multiple plots. However, it ends up limiting the colors and makes it harder to compare within the dataset. However, using the default limits of the minimum and maximum values makes it easier to compare within the dataset the emphasizing relative differences, but harder to compare across datasets.(數(shù)據(jù)集內(nèi)部比較與數(shù)據(jù)集之間的比較)
# 如果你不限定顏色范圍的話,圖例顏色標(biāo)度會(huì)自適應(yīng)你的顏色范圍 diamonds %>% count(color, cut) %>% group_by(cut) %>% mutate(prop = n / sum(n)) %>% ggplot(mapping = aes(x = color, y = cut)) + geom_tile(mapping = aes(fill = prop)) + scale_fill_viridis()
這樣子看起來非常的清楚眨层,但有一個(gè)問題是這個(gè)顏色標(biāo)度只適應(yīng)于你現(xiàn)在的數(shù)據(jù)集庙楚,如果你還要跟其他的同類型數(shù)據(jù)集進(jìn)行比較的話上荡,就要scale_fill_viridis(limits = c(0, 1))了趴樱。這樣不同數(shù)據(jù)集之間才可以比較
geom_tile有個(gè)問題是不能聚類。但Y叔的ggtree卻可以做到聚類酪捡,這一點(diǎn)比patchwork還要好(patchwork可以拼圖叁征,但其只是對(duì)準(zhǔn)坐標(biāo)軸)
-
Exercise 7.5.2.3
Why is it slightly better to use
aes(x = color, y = cut)
rather thanaes(x = cut, y = color)
in the example above?
It’s usually better to use the categorical variable with a larger number of categories(這個(gè)變量里面有很多種) or the longer labels(標(biāo)簽名字很長(zhǎng)) on the y axis. If at all possible, labels should be horizontal because that is easier to read(y軸上的標(biāo)簽也應(yīng)該是垂直,更易讀).
However, switching the order doesn’t result in overlapping labels.
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(y = color, x = cut)) +
geom_tile(mapping = aes(fill = n))
Another justification, for switching the order is that the larger numbers are at the top when x = color
and y = cut
, and that lowers the cognitive burden of interpreting the plot.
7.5.3 Two continuous variables
Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above).(尤其是你比如說一個(gè)甲基化數(shù)據(jù)逛薇,一個(gè)RNA-Seq數(shù)據(jù)捺疼,畫個(gè)點(diǎn)圖,如果不經(jīng)過篩選過濾永罚,你的數(shù)據(jù)點(diǎn)會(huì)鋪滿整個(gè)頁面啤呼,毫無規(guī)律可言) You’ve already seen one way to fix the problem: using the alpha aesthetic to add transparency(透明度)
面對(duì)這種情況一個(gè)辦法就是用:
geom_bin2d()
andgeom_hex()
to bin in two dimensions. -》use a fill color to display how many points fall into each bin
ggplot(data = smaller) +
geom_bin2d(mapping = aes(x = carat, y = price))
# install.packages("hexbin")
ggplot(data = smaller) +
geom_hex(mapping = aes(x = carat, y = price))
當(dāng)然,我覺得用二維核密度估計(jì)圖也是可以的
m <- ggplot(faithful, aes(x = eruptions, y = waiting)) + geom_point() + xlim(0.5, 6) + ylim(40, 110) m + stat_density_2d(aes(fill = after_stat(level)), geom = "polygon")
- Another option is to bin one continuous variable so it acts like a categorical variable(把連續(xù)型變量分bin呢袱,我們之前遇到過了)
# 下面代碼出不來下圖的結(jié)果
# 可能是版本的問題
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
根據(jù)的geom_boxplot的reference來看官扣,ggplot2 3.3.0版本的確不是下圖的結(jié)果。
但我在服務(wù)器的ggplot2 3.2.1跑出來的就是跟書里面一樣的圖
不過這個(gè)問題應(yīng)該只存在于你的x變量是連續(xù)型變量的時(shí)候
One way to show that is to make the width of the boxplot proportional to the number of points with varwidth = TRUE (如果設(shè)置了這個(gè)參數(shù)羞福,那么你boxplot的寬度就會(huì)和你的數(shù)據(jù)數(shù)目成正比)
- Another approach is to display approximately the same number of points in each bin. That’s the job of cut_number()
# 同樣地惕蹄,下面代碼出不來下圖的結(jié)果
# 可能是版本的問題
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
-
Exercise 7.5.3.1
Instead of summarizing the conditional distribution with a box plot, you could use a frequency polygon. What do you need to consider when using
cut_width()
vscut_number()
? How does that impact a visualization of the 2d distribution ofcarat
andprice
?
Both cut_width()
and cut_number()
split a variable into groups. When using cut_width()
, we need to choose the width, and the number of bins will be calculated automatically. When using cut_number()
, we need to specify the number of bins, and the widths will be calculated automatically.
In either case, we want to choose the bin widths and number to be large enough to aggregate observations to remove noise, but not so large as to remove all the signal.
If categorical colors are used, no more than eight colors(之前ggplot2那章提到過,超過8個(gè)顏色就不會(huì)顯示了) should be used in order to keep them distinct. Using cut_number
, I will split carats into quantiles (five groups).
ggplot(
data = diamonds,
mapping = aes(color = cut_number(carat, 5), x = price)
) +
geom_freqpoly() +
labs(x = "Price", y = "Count", color = "Carat")
Alternatively, I could use cut_width
to specify widths at which to cut. I will choose 1-carat widths. Since there are very few diamonds larger than 2-carats, this is not as informative. However, using a width of 0.5 carats creates too many groups, and splitting at non-whole numbers is unappealing.
# boundary = 0稍微可以注意下,如果你不設(shè)置這個(gè)卖陵,似乎就不會(huì)是0-1,1-2了
ggplot(
data = diamonds,
mapping = aes(color = cut_width(carat, 1, boundary = 0), x = price)
) +
geom_freqpoly() +
labs(x = "Price", y = "Count", color = "Carat")
-
Exercise 7.5.3.2
Visualize the distribution of
carat
, partitioned byprice
.
Plotted with a box plot with 10 bins with an equal number of observations, and the width determined by the number of observations.
ggplot(diamonds, aes(x = cut_number(price, 10), y = carat)) +
geom_boxplot() +
coord_flip() +
xlab("Price")
Plotted with a box plot with 10 equal-width bins of 2,000. The argument boundary = 0
ensures that first bin is $0–2,000.
ggplot(diamonds, aes(x = cut_width(price, 2000, boundary = 0), y = carat)) +
geom_boxplot(varwidth = TRUE) +
coord_flip() +
xlab("Price")
-
Exercise 7.5.3.4
Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.
There are many options to try, so your solutions may vary from mine. Here are a few options that I tried.
唯一需要注意的是遭顶,用離散型變量去分組
# scale_fill_viridis需要預(yù)先加載library(viridis)
ggplot(diamonds, aes(x = carat, y = price)) +
geom_hex() +
facet_wrap(~cut, ncol = 1) +
scale_fill_viridis()
ggplot(diamonds, aes(x = cut_number(carat, 5), y = price, colour = cut)) +
geom_boxplot()
ggplot(diamonds, aes(colour = cut_number(carat, 5), y = price, x = cut)) +
geom_boxplot()
-
Exercise 7.5.3.5
Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of
x
andy
values, which makes the points outliers even though theirx
andy
values appear normal when examined separately.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
Why is a scatterplot a better display than a binned plot for this case?
In this case, there is a strong relationship between xx and yy. The outliers in this case are not extreme in either xx or yy. A binned plot would not reveal these outliers, and may lead us to conclude that the largest value of xx was an outlier even though it appears to fit the bivariate pattern well.
個(gè)人感覺,geom_hex之類的是展現(xiàn)宏觀趨勢(shì)的泪蔫,對(duì)于少數(shù)點(diǎn)的效果并不好棒旗,當(dāng)然你也可以把bin變得很多,bin_width變得很小鸥滨,但這樣其實(shí)跟散點(diǎn)圖是一樣的了嗦哆。
7.6 Patterns and models
- Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:
- Could this pattern be due to coincidence (i.e. random chance)?
- How can you describe the relationship implied by the pattern?
- How strong is the relationship implied by the pattern?
- What other variables might affect the relationship?
- Does the relationship change if you look at individual subgroups of the data?
- 想到一個(gè)有意思的圖
- A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters that we noticed above.
ggplot(data = faithful) +
geom_point(mapping = aes(x = eruptions, y = waiting))
這讓我想起對(duì)于這種出現(xiàn)subgroup的散點(diǎn)圖,如果你做相關(guān)性婿滓,很容易出現(xiàn)相關(guān)性很大的情況
下圖來自elife文章《Ten common statistical mistakes to watch out for when writing or reviewing a manuscript》老速,也有翻譯的中文版《論文寫作或?qū)徃鍟r(shí)的十種常見統(tǒng)計(jì)錯(cuò)誤》
其中A-F都是不相關(guān)的數(shù)據(jù)
所以說,我個(gè)人感覺用散點(diǎn)圖做相關(guān)性的時(shí)候凸主,是不能把幾個(gè)重復(fù)放一塊上去的橘券,因?yàn)橹貜?fù)自身就很容易出現(xiàn)subgroup
所以ggstatsplot的上面和右邊的barplot還是有必要的,他可以標(biāo)識(shí)出你的數(shù)據(jù)是否出現(xiàn)了subgroup
- Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
- 這里的模型給我提供了一個(gè)想法
Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts price
from carat
and then computes the residuals(剩下的部分應(yīng)該就是carat所不能解釋的) (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed.
library(modelr)
# 之所以這里要log卿吐,我猜是因?yàn)槭且驗(yàn)閏arat和price的量級(jí)不一樣
mod <- lm(log(price) ~ log(carat), data = diamonds)
diamonds2 <- diamonds %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
ggplot(data = diamonds2) +
geom_point(mapping = aes(x = carat, y = resid))
Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.
ggplot(data = diamonds2) +
geom_boxplot(mapping = aes(x = cut, y = resid))
You’ll learn how models, and the modelr package, work in the final part of the book, model. We’re saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.
上面描述的就是在你有三個(gè)變量下(price(這個(gè)可以說是響應(yīng)變量了)旁舰,carat,cut(這兩個(gè)是自變量))的建模過程嗡官。其實(shí)可以類比到你有RNA-Seq箭窜、甲基化、H3K27me3數(shù)據(jù)衍腥。我們就可以首先把甲基化和RNA-Seq的數(shù)據(jù)建模磺樱,然后用殘差來對(duì)H3K27me3和RNA-Seq建模。
7.7 ggplot2 calls
- 這里也沒啥好說的婆咸,就是說了些小技巧
# 可以簡(jiǎn)寫
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_freqpoly(binwidth = 0.25)
ggplot(faithful, aes(eruptions)) +
geom_freqpoly(binwidth = 0.25)
--------------------------------------------------------------
# tidyvese整理好的數(shù)據(jù)直接可以導(dǎo)入到ggplot2函數(shù)里面畫圖
# 值得一提的是竹捉,H神在自己的tidyverse style guide可不是這么說的,hhhh
diamonds %>%
count(cut, clarity) %>%
ggplot(aes(clarity, cut, fill = n)) +
geom_tile()
# 他提到準(zhǔn)確的寫法應(yīng)該是
# ggplot2的+號(hào)的規(guī)則類似于 %>%尚骄。同時(shí)要注意块差,如果你ggplot2的數(shù)據(jù)來源于dplyr的管道結(jié)果,則只能有一個(gè)層次的縮進(jìn)倔丈。
# 這也是R studio的默認(rèn)寫法
diamonds %>%
count(cut, clarity) %>%
ggplot(aes(clarity, cut, fill = n)) +
geom_tile()
7.8 Learning more
-
推薦了三本書
ggplot2: Elegant Graphics for Data Analysis (Use R!) 2nd ed 這書有g(shù)ithub倉(cāng)庫(kù)憨闰,如果你足夠厲害的話,可以自己編譯出來pdf版本(反正我編譯了好多次都失敗了)需五。當(dāng)然鹉动,網(wǎng)上也有pdf版本的。中文版的已經(jīng)在2019年出版了警儒,大家感興趣可以去看看训裆。
-
R Graphics Cookbook 中文版叫《R數(shù)據(jù)可視化手冊(cè)》眶根。此書英文版有了最近剛有了第二版R Graphics Cookbook, 2nd edition。而且作者也有一本R的cookbook边琉,也是不錯(cuò)的属百。cookbook有一個(gè)中文網(wǎng)絡(luò)翻譯版:Cookbook for R 中文版
作者的cookbook for R注意和 R Cookbook區(qū)分,雖然后者也是本很好的書
Graphical Data Analysis with R (Chapman & Hall/CRC The R Series) 這書我沒看過变姨,不知道……
8 Workflow: projects
- 這一章也沒啥好說的族扰,我個(gè)人感覺最重要的就是你設(shè)置下圖這個(gè)東西!6ㄅ贰S婧恰!
我個(gè)人感覺砍鸠,一定要設(shè)置這個(gè)@┣狻!爷辱!
如果你有一些很耗時(shí)間跑出來的結(jié)果(比如Seurat跑出來的一些中間結(jié)果)录豺,不舍得退出的時(shí)候自動(dòng)刪除。你可以考慮用save或者saveRDS饭弓。
- 兩個(gè)快捷鍵
- Press Cmd/Ctrl + Shift + F10 to restart RStudio.
- Press Cmd/Ctrl + Shift + S to rerun the current script.
- The most important difference is how you separate the components of the path. Mac and Linux uses slashes (e.g.
plots/diamonds.pdf
) and Windows uses backslashes (e.g.plots\diamonds.pdf
). R can work with either type (no matter what platform you’re currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes! That makes life frustrating, so I recommend always using the Linux/Mac style with forward slashes.
- 我個(gè)人習(xí)慣每次一個(gè)任務(wù)就建一個(gè)Project双饥,而不是setwd, getwd搞來搞去的…… 然后就是我會(huì)在建完project之后就立刻建下面5個(gè)文件夾
# plot放圖
# rawdata放原始數(shù)據(jù)
# result放文本結(jié)果
# script放腳本和函數(shù)
# temp_code放我暫時(shí)丟失的腳本和函數(shù)
.
├── plot
├── rawdata
├── result
├── script
└── temp_code