R_for_Data_Science_Script&EDA&projects

這是英文版的第6,7,8章

6 Workflow: scripts

這章節(jié)其實(shí)沒啥可說的酿傍，只是想起一篇文章池颈，其中有一些小tips

# 看了你安裝的包
installed.packages()

# 看下你加載的包
(.packages())

# 批量加載
load_package <- c("ComplexHeatmap","circlize")
if (!all(load_package %in% (.packages()))) lapply(load_package, library, character.only = TRUE)

來自An efficient way to install and load R packages
原文還有很多東西

7.1 Introduction

Exploratory Data Analysis（EDA）是一個(gè)不斷迭代的過程
1. Generate questions about your data.

Search for answers by visualising, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.

關(guān)于EDA的探索些己，《Bioinformatics Data skills》里面有一章節(jié)也講的也挺好（應(yīng)該是Chapter 8: A Rapid Introduction to the R Language）锄禽，而且是關(guān)于生信的。

7.2 Questions

EDA是一個(gè)不斷探索你數(shù)據(jù)的過程磺平，而其中最簡(jiǎn)單的方法就是不斷地問問題
You can loosely word these questions as:
- What type of variation occurs within my variables?
- What type of covariation occurs between my variables?
To make the discussion easier, let’s define some terms:
- A variable is a quantity, quality, or property that you can measure.
- A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
- An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.
- Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.（這應(yīng)該是tidyverse生態(tài)的核心數(shù)據(jù)格式魂仍，行為觀測(cè)，列為變量）
  
  但我們的生信數(shù)據(jù)一般是反過來的拣挪，行為變量（基因）擦酌，列為觀測(cè)（各個(gè)樣本）

7.3 Variation

Variation is the tendency of the values of a variable to change from measurement to measurement. Every variable has its own pattern of variation, which can reveal interesting information（其實(shí)這話在一定程度上，也就意味著每個(gè)變量都有其特有的分布）媒吗。

像RNA-Seq的count值的分布就是屬于伽馬-泊松混合分布（即二項(xiàng)分布），而RNA-Seq得到的p-value值的分布就是零假設(shè)為真的基因p-value與有差異的基因的p-value的混合分布

The best way to understand that pattern is to visualise the distribution of the variable's values

你也可以用vcd包的goodfit來看下你的數(shù)據(jù)和某一類分布的擬合程度
norm_data <- rpois(1000,lambda = 5)
ofit = goodfit(norm_data, "poisson")
plot(ofit, xlab = "")
參考：
常見分布的擬合方法
Modern Statistics for Modern Biology_4.4.3

可視化你變量的分布取決于你的變量是連續(xù)型變量還是離散型變量

如果是離散型變量的話乙埃，一般就是用geom_bar

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

img

高度可以用

diamonds %>% 
  count(cut)
#> # A tibble: 5 x 2
#>   cut           n
#>   <ord>     <int>
#> 1 Fair       1610
#> 2 Good       4906
#> 3 Very Good 12082
#> 4 Premium   13791
#> 5 Ideal     21551

計(jì)算出來闸英，然后

library(tidyverse)
diamonds %>% 
  count(cut) %>% 
  ggplot(aes(x = cut, y = n)) +
  geom_bar(stat = "identity")

如果是連續(xù)型變量的話，用geom_histogram

ggplot(data = diamonds) +
   geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

image

You can compute this by hand by combining dplyr::count() and ggplot2::cut_width():

diamonds %>% 
  count(cut_width(carat, 0.5))
#> # A tibble: 11 x 2
#>   `cut_width(carat, 0.5)`     n
#>   <fct>                   <int>
#> 1 [-0.25,0.25]              785
#> 2 (0.25,0.75]             29498
#> 3 (0.75,1.25]             15977
#> 4 (1.25,1.75]              5313
#> 5 (1.75,2.25]              2002
#> 6 (2.25,2.75]               322
#> # … with 5 more rows

# 其實(shí)就是cut_width就是把連續(xù)型變量切割成區(qū)間
> cut_width(diamonds$carat, 0.5) %>% 
+   head()
[1] [-0.25,0.25] [-0.25,0.25] [-0.25,0.25] (0.25,0.75] 
[5] (0.25,0.75]  [-0.25,0.25]
11 Levels: [-0.25,0.25] (0.25,0.75] ... (4.75,5.25]

cut_interval makes n groups with equal range

cut_number makes n groups with (approximately) equal numbers of observations

cut_width makes groups of width width.

在用geom_histogram的時(shí)候介袜，你需要不斷調(diào)整你的binwidth或則bins數(shù)目甫何，才能得到一個(gè)合適的圖像。

binwidth太小遇伞，你的圖像就會(huì)很sharp辙喂，很難看到趨勢(shì)。太大的話鸠珠，你的圖像就會(huì)過于smooth巍耗，也很難看到趨勢(shì)。

# 這里是選擇了一部分?jǐn)?shù)據(jù)集渐排，binwidth也變小了
smaller <- diamonds %>% 
  filter(carat < 3)

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

image

除了用geom_hist之外炬太，你還可以用geom_freqpoly

# 用這個(gè)圖的話，你就會(huì)發(fā)現(xiàn)其實(shí)geom_freqpoly就是hist的頂點(diǎn)連線
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)  +
  geom_freqpoly(binwidth = 0.1)

# 這里是用不同cut了
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

image

我們可以從可視化中發(fā)現(xiàn)一些趨勢(shì)驯耻，比如我們從下圖中就可以看到有一些亞群數(shù)據(jù)存在亲族。

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

image

還有像這個(gè)數(shù)據(jù)

ggplot(data = faithful, mapping = aes(x = eruptions)) + 
  geom_histogram(binwidth = 0.25)

image

在生信中，最常見的一個(gè)亞群現(xiàn)象可能就是如果你的樣本污染的話可缚，你的fastQC的GC含量那里霎迫，就會(huì)出現(xiàn)雙峰。這是因?yàn)橐粋€(gè)物種的GC含量會(huì)呈現(xiàn)于類似正態(tài)分布的分布帘靡，但如果你的樣本里面混入了其他物種知给，比如說菌等，那么你的GC含量就會(huì)有雙峰描姚，甚至幾個(gè)峰炼鞠。

另一個(gè)我能想起來的例子缘滥，就是“邪惡的鳶尾花”

# 你會(huì)發(fā)現(xiàn)Width會(huì)出現(xiàn)雙峰
# 其實(shí)是因?yàn)镻etal.width是好幾種花的數(shù)據(jù)
hist(iris$Petal.Width)

image

Many of the questions above will prompt you to explore a relationship between variables, for example, to see if the values of one variable can explain the behavior of another variable.

文中提到了異常數(shù)據(jù)的發(fā)現(xiàn)，這讓我想起breakdown point這個(gè)概念谒主。其代表的是一個(gè)估計(jì)器對(duì)于臟數(shù)據(jù)的最大容忍程度朝扼。比如median對(duì)于異常值并不敏感，breakdown point是1/2霎肯。而mean則對(duì)異常值很敏感擎颖，其值是 1/N，N代表了你的數(shù)據(jù)量观游。我個(gè)人理解就是你要改變整個(gè)數(shù)據(jù)的median搂捧，你需要改變一半數(shù)據(jù)，而如果你要改變整個(gè)數(shù)據(jù)的mean懂缕，你只需要改變一個(gè)數(shù)據(jù)允跑。這也就是為啥有時(shí)候我們對(duì)于異常值影響很大的相關(guān)性數(shù)據(jù)，會(huì)考慮使用Sperman搪柑，而不是Pearson聋丝。或者我們?cè)谧鼍€性回歸的時(shí)候工碾，為了考慮模型的穩(wěn)健性弱睦，會(huì)考慮用median，而不是mean渊额。

純屬瞎想……

參考：

穩(wěn)健回歸（Robustness regression）

Modern Statistics for Modern Biology_8.7.4 Robustness

當(dāng)你有很多數(shù)據(jù)的時(shí)候况木，你很難在柱狀圖中看到異常數(shù)據(jù)。比如下圖

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

image

這時(shí)候你可以考慮Zoom in你的數(shù)據(jù)

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

image

(coord_cartesian() also has an xlim() argument for when you need to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.) 注意這個(gè)區(qū)別

Solution的Exercise 7.3.4會(huì)對(duì)此有些解釋

讓我想起了ggforce的縮放

library(ggforce)
#> Loading required package: ggplot2
ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
  geom_point() +
  facet_zoom(x = Species == "versicolor")

image

It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to replace them with missing values, and move on. However, if they have a substantial effect on your results, you shouldn't drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.

Exercise 7.3.4

Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

The coord_cartesian() function zooms in on the area specified by the limits, after having calculated and drawn the geoms. Since the histogram bins have already been calculated, it is unaffected.

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  coord_cartesian(xlim = c(100, 5000), ylim = c(0, 3000))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

image

However, the xlim() and ylim() functions influence actions before the calculation of the stats related to the histogram. Thus, any values outside the x- and y-limits are dropped before calculating bin widths and counts. This can influence how the histogram looks.

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  xlim(100, 5000) +
  ylim(0, 3000)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 14714 rows containing non-finite values (stat_bin).
#> Warning: Removed 6 rows containing missing values (geom_bar).

image

7.4 Missing values

如果你遇到了一些異常值旬迹，但不想費(fèi)心地處理火惊，可以考慮兩種對(duì)待異常值的方法

一種是把含有異常值的整行都丟棄

diamonds2 <- diamonds %>% 
  filter(between(y, 3, 20))

I don’t recommend this option because just because one measurement is invalid, doesn’t mean all the measurements are. Additionally, if you have low quality data, by time that you’ve applied this approach to every variable you might find that you don’t have any data left!

另一種是把異常值取代為NA

diamonds2 <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

ifelse() has three arguments. The first argument test should be a logical vector. The result will contain the value of the second argument, yes, when test is TRUE, and the value of the third argument, no, when it is false. Alternatively to ifelse, use dplyr::case_when(). case_when() is particularly useful inside mutate when you want to create a new variable that relies on a complex combination of existing variables.

Exercise 7.4.1

What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

這個(gè)solution推薦看一下

Missing values are removed when the number of observations in each bin are calculated. See the warning message: Removed 9 rows containing non-finite values (stat_bin)

diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

ggplot(diamonds2, aes(x = y)) +
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 9 rows containing non-finite values (stat_bin).

img

In the geom_bar() function, NA is treated as another category. The x aesthetic in geom_bar() requires a discrete (categorical) variable, and missing values act like another category.

diamonds %>%
  mutate(cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut))) %>%
  ggplot() +
  geom_bar(mapping = aes(x = cut))

img

In a histogram, the x aesthetic variable needs to be numeric, and stat_bin() groups the observations by ranges into bins. Since the numeric value of the NA observations is unknown, they cannot be placed in a particular bin, and are dropped.

7.5 Covariation

If variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on the type of variables involved.

注意這跟協(xié)變量（covariance）是不一樣的

下面的

一個(gè)分類變量、一個(gè)連續(xù)變量

兩個(gè)分類變量

兩個(gè)連續(xù)變量

都值得去看一看

7.5.1 A categorical and continuous variable

It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of geom_freqpoly() is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in shape. For example, let’s explore how the price of a diamond varies with its quality:

geom_freqpoly()的縱坐標(biāo)是數(shù)目奔垦，如果你的分類變量對(duì)應(yīng)的數(shù)目差距很大的話矗晃，你如果放一個(gè)圖里面，那么數(shù)目少那組的自身變化就很難看出來了宴倍。

For example, let’s explore how the price of a diamond varies with its quality:

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

img

這里紫色的Fair組你就很難看出趨勢(shì)來

It’s hard to see the difference in distribution because the overall counts differ so much:

ggplot(diamonds) + 
  geom_bar(mapping = aes(x = cut))

img

To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display density, which is the count standardised so that the area under each frequency polygon is one.

ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

img

我們可以看到张症，似乎質(zhì)量差的鉆石反而價(jià)格更高

這跟直接geom_density得到的密度圖還是不一樣的。但我也不知道density是怎么出來的

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_density(mapping = aes(colour = cut))

image

Smoothed density estimates

Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of（這個(gè)解釋甚好）:
- A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.
- Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual so are plotted individually.
- A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
  
  image

看起來上面這個(gè)圖是個(gè)不對(duì)稱的分布

從boxplot同樣發(fā)現(xiàn)了質(zhì)量相對(duì)較低的鉆石價(jià)格更高鸵贬，而高質(zhì)量的鉆石價(jià)格偏低

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()

image

有個(gè)小插曲俗他，怎么對(duì)連續(xù)型變量在圖中根據(jù)某一順序進(jìn)行排序

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))

img

If you have long variable names, geom_boxplot() will work better if you flip it 90°. You can do that with coord_flip().

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()

img

這個(gè)比較有用的應(yīng)該是你畫GO plot的時(shí)候，可以按某個(gè)順序排下來阔逼，這樣圖會(huì)很清爽兆衅。

Exercise 7.5.1.2

What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

這里算是前面那個(gè)質(zhì)量越差反而價(jià)格越高的解釋吧

這個(gè)來自solution 核心意思就是因?yàn)橘|(zhì)量差的那些鉆石，其個(gè)頭比較大，所以相對(duì)來說羡亩，價(jià)格比較大

如果不想看的話摩疑，可以跳過。

What are the general relationships of each variable with the price of the diamonds? I will consider the variables: carat, clarity, color, and cut. I ignore the dimensions（就是所謂的diamond數(shù)據(jù)里面的x畏铆、y雷袋、z） of the diamond since carat measures size, and thus incorporates most of the information contained in these variables.

Since both price and carat are continuous variables, I use a scatter plot to visualize their relationship.

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

img

However, since there is a large number of points in the data, I will use a boxplot by binning carat (as suggested in the chapter).

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

一個(gè)小問題就是如果你是ggplot2 3.3.0的話，是跑不出這樣的結(jié)果的

img

Note that the choice of the binning width is important, as if it were too large it would obscure any relationship, and if it were too small, the values in the bins could be too variable to reveal underlying trends.

The variables color and clarity are ordered categorical variables. The chapter suggests visualizing a categorical and continuous variable using frequency polygons or boxplots. In this case, I will use a box plot since it will better show a relationship between the variables.

There is a weak negative relationship between color and price. The scale of diamond color goes from D (best) to J (worst). Currently, the levels of diamonds$color are in the wrong order. Before plotting, I will reverse the order of the color levels so they will be in increasing order of quality on the x-axis. The color column is an example of a factor variable, which is covered in the “Factors” chapter of R4DS.

diamonds %>%
  mutate(color = fct_rev(color)) %>%
  ggplot(aes(x = color, y = price)) +
  geom_boxplot()

img

fct_rev這個(gè)命令很奇怪

# 變不回去了
> f <- factor(c("a", "b", "c"))
> fct_rev(f)
[1] a b c
Levels: c b a
> fct_rev(f)
[1] a b c
Levels: c b a

There is also weak negative relationship between clarity and price. The scale of clarity goes from I1 (worst) to IF (best).

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = clarity, y = price))

img

For both clarity and color, there is a much larger amount of variation within each category than between categories. Carat is clearly the single best predictor of diamond prices.

Now that we have established that carat appears to be the best predictor of price, what is the relationship between it and cut? Since this is an example of a continuous (carat) and categorical (cut) variable, it can be visualized with a box plot.

ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_boxplot()

img

There is a lot of variability in the distribution of carat sizes within each cut category. There is a slight negative relationship between carat and cut. Noticeably, the largest carat diamonds have a cut of “Fair” (the lowest).

This negative relationship can be due to the way in which diamonds are selected for sale. A larger diamond can be profitably sold with a lower quality cut, while a smaller diamond requires a better cut.

Y叔介紹過的翻云覆雨圖

Exercise 7.5.1.4

One problem with box plots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn?

How do you interpret the plots?

基因組數(shù)據(jù)一般就是大量數(shù)據(jù)辞居，個(gè)人感覺如果對(duì)于一些基因組數(shù)據(jù)使用小提琴圖楷怒、箱式圖、點(diǎn)圖等效果都不好瓦灶，因?yàn)闀?huì)是密密麻麻的一團(tuán)鸠删。一般都得過濾之后才會(huì)展現(xiàn)出一些趨勢(shì)。

ggplot(diamonds, aes(x = cut, y = price)) +
  geom_lv()

img

Exercise 7.5.1.6

If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

這個(gè)很適合實(shí)驗(yàn)上的表型數(shù)據(jù)展現(xiàn)贼陶。ggbeeswarm

There are two methods:

geom_quasirandom() produces plots that are a mix of jitter and violin plots. There are several different methods that determine exactly how the random location of the points is generated.
geom_beeswarm() produces a plot similar to a violin plot, but by offsetting the points.

I’ll use the mpg box plot example since these methods display individual points, they are better suited for smaller datasets.

ggplot(data = mpg) +
  geom_quasirandom(mapping = aes(
    x = reorder(class, hwy, FUN = median),
    y = hwy
  ))

img

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "tukey"
  )

img

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "tukeyDense"
  )

img

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "frowney"
  )

img

ggplot(data = mpg) +
  geom_quasirandom(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    ),
    method = "smiley"
  )

img

ggplot(data = mpg) +
  geom_beeswarm(mapping = aes(
    x = reorder(class, hwy, FUN = median),
    y = hwy
  ))

img

7.5.2 Two categorical variables

To visualise the covariation between categorical variables, you’ll need to count the number of observations for each combination（把所有組合個(gè)數(shù)頭統(tǒng)計(jì)出來）. One way to do that is to rely on the built-in geom_count():

# geom_count的確是個(gè)好函數(shù)唉
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

# 應(yīng)該是等價(jià)于
diamonds %>% 
  group_by(cut, color) %>% 
  mutate(n = n()) %>% 
  ggplot(mapping = aes(x = cut, y = color, size = n)) +
  geom_point()

img

上面的圖讓我想起了Motif的圖

image

這個(gè)圖Y叔有講過一點(diǎn)：怎樣用HOMER算出的P-value畫出CNS級(jí)別的泡泡圖刃泡？

還有一種展現(xiàn)方法：visualise with geom_tile() and the fill aesthetic:

diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))

img

If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.

關(guān)于seriation包，搜到了這個(gè)【ComplexHeatmap包學(xué)習(xí)筆記】2.5 Seriation

Exercise 7.5.2.1

How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?

其實(shí)核心就是用百分比碉怔，類似于之前的密度圖烘贴。我覺得還是很值得我們學(xué)習(xí)的

To clearly show the distribution of cut within color, calculate a new variable prop which is the proportion of each cut within a color. This is done using a grouped mutate.

diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop)) +
  scale_fill_viridis(limits = c(0, 1)) # from the viridis colour palette library

img

Similarly, to scale by the distribution of color within cut,

diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop)) +
  scale_fill_viridis(limits = c(0, 1))

img

I add limit = c(0, 1) to put the color scale between (0, 1). These are the logical boundaries of proportions. This makes it possible to compare each cell to its actual value, and would improve comparisons across multiple plots. However, it ends up limiting the colors and makes it harder to compare within the dataset. However, using the default limits of the minimum and maximum values makes it easier to compare within the dataset the emphasizing relative differences, but harder to compare across datasets.（數(shù)據(jù)集內(nèi)部比較與數(shù)據(jù)集之間的比較）

# 如果你不限定顏色范圍的話，圖例顏色標(biāo)度會(huì)自適應(yīng)你的顏色范圍
diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop)) +
  scale_fill_viridis()
image

這樣子看起來非常的清楚眨层，但有一個(gè)問題是這個(gè)顏色標(biāo)度只適應(yīng)于你現(xiàn)在的數(shù)據(jù)集庙楚，如果你還要跟其他的同類型數(shù)據(jù)集進(jìn)行比較的話上荡，就要scale_fill_viridis(limits = c(0, 1))了趴樱。這樣不同數(shù)據(jù)集之間才可以比較

geom_tile有個(gè)問題是不能聚類。但Y叔的ggtree卻可以做到聚類酪捡，這一點(diǎn)比patchwork還要好（patchwork可以拼圖叁征，但其只是對(duì)準(zhǔn)坐標(biāo)軸）

Example of aligning tree with data side by side by creating composite plot.

Exercise 7.5.2.3

Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?

It’s usually better to use the categorical variable with a larger number of categories（這個(gè)變量里面有很多種） or the longer labels（標(biāo)簽名字很長(zhǎng)） on the y axis. If at all possible, labels should be horizontal because that is easier to read（y軸上的標(biāo)簽也應(yīng)該是垂直，更易讀）.

However, switching the order doesn’t result in overlapping labels.

diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(y = color, x = cut)) +
  geom_tile(mapping = aes(fill = n))

img

Another justification, for switching the order is that the larger numbers are at the top when x = color and y = cut, and that lowers the cognitive burden of interpreting the plot.

7.5.3 Two continuous variables

Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above).（尤其是你比如說一個(gè)甲基化數(shù)據(jù)逛薇，一個(gè)RNA-Seq數(shù)據(jù)捺疼，畫個(gè)點(diǎn)圖，如果不經(jīng)過篩選過濾永罚，你的數(shù)據(jù)點(diǎn)會(huì)鋪滿整個(gè)頁面啤呼，毫無規(guī)律可言） You’ve already seen one way to fix the problem: using the alpha aesthetic to add transparency（透明度）
面對(duì)這種情況一個(gè)辦法就是用：geom_bin2d() and geom_hex() to bin in two dimensions. -》use a fill color to display how many points fall into each bin

ggplot(data = smaller) +
  geom_bin2d(mapping = aes(x = carat, y = price))

# install.packages("hexbin")
ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price))

img

當(dāng)然，我覺得用二維核密度估計(jì)圖也是可以的
m <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
 geom_point() +
 xlim(0.5, 6) +
 ylim(40, 110)

m + stat_density_2d(aes(fill = after_stat(level)), geom = "polygon")
img

來自：geom_density_2d

Another option is to bin one continuous variable so it acts like a categorical variable（把連續(xù)型變量分bin呢袱，我們之前遇到過了）

# 下面代碼出不來下圖的結(jié)果
# 可能是版本的問題
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

根據(jù)的geom_boxplot的reference來看官扣，ggplot2 3.3.0版本的確不是下圖的結(jié)果。

但我在服務(wù)器的ggplot2 3.2.1跑出來的就是跟書里面一樣的圖

不過這個(gè)問題應(yīng)該只存在于你的x變量是連續(xù)型變量的時(shí)候

img

One way to show that is to make the width of the boxplot proportional to the number of points with varwidth = TRUE （如果設(shè)置了這個(gè)參數(shù)羞福，那么你boxplot的寬度就會(huì)和你的數(shù)據(jù)數(shù)目成正比）

Another approach is to display approximately the same number of points in each bin. That’s the job of cut_number()

# 同樣地惕蹄，下面代碼出不來下圖的結(jié)果
# 可能是版本的問題
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))

img

Exercise 7.5.3.1

Instead of summarizing the conditional distribution with a box plot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualization of the 2d distribution of carat and price?

Both cut_width() and cut_number() split a variable into groups. When using cut_width(), we need to choose the width, and the number of bins will be calculated automatically. When using cut_number(), we need to specify the number of bins, and the widths will be calculated automatically.

In either case, we want to choose the bin widths and number to be large enough to aggregate observations to remove noise, but not so large as to remove all the signal.

If categorical colors are used, no more than eight colors（之前ggplot2那章提到過，超過8個(gè)顏色就不會(huì)顯示了） should be used in order to keep them distinct. Using cut_number, I will split carats into quantiles (five groups).

ggplot(
  data = diamonds,
  mapping = aes(color = cut_number(carat, 5), x = price)
) +
  geom_freqpoly() +
  labs(x = "Price", y = "Count", color = "Carat")

img

Alternatively, I could use cut_width to specify widths at which to cut. I will choose 1-carat widths. Since there are very few diamonds larger than 2-carats, this is not as informative. However, using a width of 0.5 carats creates too many groups, and splitting at non-whole numbers is unappealing.

# boundary = 0稍微可以注意下，如果你不設(shè)置這個(gè)卖陵，似乎就不會(huì)是0-1,1-2了
ggplot(
  data = diamonds,
  mapping = aes(color = cut_width(carat, 1, boundary = 0), x = price)
) +
  geom_freqpoly() +
  labs(x = "Price", y = "Count", color = "Carat")

img

Exercise 7.5.3.2

Visualize the distribution of carat, partitioned by price.

Plotted with a box plot with 10 bins with an equal number of observations, and the width determined by the number of observations.

ggplot(diamonds, aes(x = cut_number(price, 10), y = carat)) +
  geom_boxplot() +
  coord_flip() +
  xlab("Price")

img

Plotted with a box plot with 10 equal-width bins of 2,000. The argument boundary = 0 ensures that first bin is $0–2,000.

ggplot(diamonds, aes(x = cut_width(price, 2000, boundary = 0), y = carat)) +
  geom_boxplot(varwidth = TRUE) +
  coord_flip() +
  xlab("Price")

img

Exercise 7.5.3.4

Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.

There are many options to try, so your solutions may vary from mine. Here are a few options that I tried.

唯一需要注意的是遭顶，用離散型變量去分組

# scale_fill_viridis需要預(yù)先加載library(viridis)
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex() +
  facet_wrap(~cut, ncol = 1) +
  scale_fill_viridis()

img

ggplot(diamonds, aes(x = cut_number(carat, 5), y = price, colour = cut)) +
  geom_boxplot()

img

ggplot(diamonds, aes(colour = cut_number(carat, 5), y = price, x = cut)) +
  geom_boxplot()

img

Exercise 7.5.3.5

Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

img

Why is a scatterplot a better display than a binned plot for this case?

In this case, there is a strong relationship between xx and yy. The outliers in this case are not extreme in either xx or yy. A binned plot would not reveal these outliers, and may lead us to conclude that the largest value of xx was an outlier even though it appears to fit the bivariate pattern well.

個(gè)人感覺，geom_hex之類的是展現(xiàn)宏觀趨勢(shì)的泪蔫，對(duì)于少數(shù)點(diǎn)的效果并不好棒旗，當(dāng)然你也可以把bin變得很多，bin_width變得很小鸥滨，但這樣其實(shí)跟散點(diǎn)圖是一樣的了嗦哆。

7.6 Patterns and models

Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:
- Could this pattern be due to coincidence (i.e. random chance)?
- How can you describe the relationship implied by the pattern?
- How strong is the relationship implied by the pattern?
- What other variables might affect the relationship?
- Does the relationship change if you look at individual subgroups of the data?
想到一個(gè)有意思的圖

A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters that we noticed above.

ggplot(data = faithful) + 
  geom_point(mapping = aes(x = eruptions, y = waiting))

img

這讓我想起對(duì)于這種出現(xiàn)subgroup的散點(diǎn)圖，如果你做相關(guān)性婿滓，很容易出現(xiàn)相關(guān)性很大的情況

下圖來自elife文章《Ten common statistical mistakes to watch out for when writing or reviewing a manuscript》老速，也有翻譯的中文版《論文寫作或?qū)徃鍟r(shí)的十種常見統(tǒng)計(jì)錯(cuò)誤》

其中A-F都是不相關(guān)的數(shù)據(jù)

image

所以說，我個(gè)人感覺用散點(diǎn)圖做相關(guān)性的時(shí)候凸主，是不能把幾個(gè)重復(fù)放一塊上去的橘券，因?yàn)橹貜?fù)自身就很容易出現(xiàn)subgroup

所以ggstatsplot的上面和右邊的barplot還是有必要的，他可以標(biāo)識(shí)出你的數(shù)據(jù)是否出現(xiàn)了subgroup

image

Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
這里的模型給我提供了一個(gè)想法

Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts price from carat and then computes the residuals（剩下的部分應(yīng)該就是carat所不能解釋的） (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed.

library(modelr)

# 之所以這里要log卿吐，我猜是因?yàn)槭且驗(yàn)閏arat和price的量級(jí)不一樣
mod <- lm(log(price) ~ log(carat), data = diamonds)

diamonds2 <- diamonds %>% 
  add_residuals(mod) %>% 
  mutate(resid = exp(resid))

ggplot(data = diamonds2) + 
  geom_point(mapping = aes(x = carat, y = resid))

img

Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.

ggplot(data = diamonds2) + 
  geom_boxplot(mapping = aes(x = cut, y = resid))

img

You’ll learn how models, and the modelr package, work in the final part of the book, model. We’re saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.

上面描述的就是在你有三個(gè)變量下（price（這個(gè)可以說是響應(yīng)變量了）旁舰，carat，cut（這兩個(gè)是自變量））的建模過程嗡官。其實(shí)可以類比到你有RNA-Seq箭窜、甲基化、H3K27me3數(shù)據(jù)衍腥。我們就可以首先把甲基化和RNA-Seq的數(shù)據(jù)建模磺樱，然后用殘差來對(duì)H3K27me3和RNA-Seq建模。

7.7 ggplot2 calls

這里也沒啥好說的婆咸，就是說了些小技巧

# 可以簡(jiǎn)寫
ggplot(data = faithful, mapping = aes(x = eruptions)) + 
  geom_freqpoly(binwidth = 0.25)

ggplot(faithful, aes(eruptions)) + 
  geom_freqpoly(binwidth = 0.25)

--------------------------------------------------------------

# tidyvese整理好的數(shù)據(jù)直接可以導(dǎo)入到ggplot2函數(shù)里面畫圖

# 值得一提的是竹捉，H神在自己的tidyverse style guide可不是這么說的，hhhh
diamonds %>% 
  count(cut, clarity) %>% 
  ggplot(aes(clarity, cut, fill = n)) + 
    geom_tile()

# 他提到準(zhǔn)確的寫法應(yīng)該是
# ggplot2的+號(hào)的規(guī)則類似于 %>%尚骄。同時(shí)要注意块差，如果你ggplot2的數(shù)據(jù)來源于dplyr的管道結(jié)果，則只能有一個(gè)層次的縮進(jìn)倔丈。
# 這也是R studio的默認(rèn)寫法
diamonds %>% 
  count(cut, clarity) %>% 
  ggplot(aes(clarity, cut, fill = n)) + 
  geom_tile()

7.8 Learning more

推薦了三本書
- ggplot2: Elegant Graphics for Data Analysis (Use R!) 2nd ed 這書有g(shù)ithub倉(cāng)庫(kù)憨闰，如果你足夠厲害的話，可以自己編譯出來pdf版本（反正我編譯了好多次都失敗了）需五。當(dāng)然鹉动，網(wǎng)上也有pdf版本的。中文版的已經(jīng)在2019年出版了警儒，大家感興趣可以去看看训裆。
- R Graphics Cookbook 中文版叫《R數(shù)據(jù)可視化手冊(cè)》眶根。此書英文版有了最近剛有了第二版R Graphics Cookbook, 2nd edition。而且作者也有一本R的cookbook边琉，也是不錯(cuò)的属百。cookbook有一個(gè)中文網(wǎng)絡(luò)翻譯版：Cookbook for R 中文版
  
  作者的cookbook for R注意和 R Cookbook區(qū)分，雖然后者也是本很好的書
- Graphical Data Analysis with R (Chapman & Hall/CRC The R Series) 這書我沒看過变姨，不知道……

8 Workflow: projects

這一章也沒啥好說的族扰，我個(gè)人感覺最重要的就是你設(shè)置下圖這個(gè)東西！６ㄅ贰Ｓ婧恰！

我個(gè)人感覺砍鸠，一定要設(shè)置這個(gè)＠┣狻！爷辱！

如果你有一些很耗時(shí)間跑出來的結(jié)果（比如Seurat跑出來的一些中間結(jié)果）录豺，不舍得退出的時(shí)候自動(dòng)刪除。你可以考慮用save或者saveRDS饭弓。

img

兩個(gè)快捷鍵

Press Cmd/Ctrl + Shift + F10 to restart RStudio.
Press Cmd/Ctrl + Shift + S to rerun the current script.

The most important difference is how you separate the components of the path. Mac and Linux uses slashes (e.g. plots/diamonds.pdf) and Windows uses backslashes (e.g. plots\diamonds.pdf). R can work with either type (no matter what platform you’re currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes! That makes life frustrating, so I recommend always using the Linux/Mac style with forward slashes.

我個(gè)人習(xí)慣每次一個(gè)任務(wù)就建一個(gè)Project双饥，而不是setwd, getwd搞來搞去的…… 然后就是我會(huì)在建完project之后就立刻建下面5個(gè)文件夾

# plot放圖
# rawdata放原始數(shù)據(jù)
# result放文本結(jié)果
# script放腳本和函數(shù)
# temp_code放我暫時(shí)丟失的腳本和函數(shù)
.
├── plot
├── rawdata
├── result
├── script
└── temp_code

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

人面猴
序言：七十年代末，一起剝皮案震驚了整個(gè)濱河市弟断，隨后出現(xiàn)的幾起案子咏花，更是在濱河造成了極大的恐慌，老刑警劉巖阀趴，帶你破解...
沈念sama閱讀 216,372評(píng)論 6贊 498
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件昏翰，死亡現(xiàn)場(chǎng)離奇詭異，居然都是意外死亡舍咖，警方通過查閱死者的電腦和手機(jī)矩父，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 92,368評(píng)論 3贊 392
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門锉桑，熙熙樓的掌柜王于貴愁眉苦臉地迎上來排霉，“玉大人，你說我怎么就攤上這事民轴」ツ” “怎么了？”我有些...
開封第一講書人閱讀 162,415評(píng)論 0贊 353
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵后裸，是天一觀的道長(zhǎng)瑰钮。經(jīng)常有香客問我，道長(zhǎng)微驶，這世上最難降的妖魔是什么浪谴？我笑而不...
開封第一講書人閱讀 58,157評(píng)論 1贊 292
?港島之戀（遺憾婚禮）
正文為了忘掉前任开睡，我火速辦了婚禮，結(jié)果婚禮上苟耻，老公的妹妹穿的比我還像新娘篇恒。我一直安慰自己，他們只是感情好凶杖，可當(dāng)我...
茶點(diǎn)故事閱讀 67,171評(píng)論 6贊 388
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布胁艰。她就那樣靜靜地躺著，像睡著了一般智蝠。火紅的嫁衣襯著肌膚如雪腾么。梳的紋絲不亂的頭發(fā)上，一...
開封第一講書人閱讀 51,125評(píng)論 1贊 297
城市分裂傳說
那天杈湾，我揣著相機(jī)與錄音解虱，去河邊找鬼。笑死漆撞，一個(gè)胖子當(dāng)著我的面吹牛饭寺，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播叫挟，決...
沈念sama閱讀 40,028評(píng)論 3贊 417
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼艰匙，長(zhǎng)吁一口氣：“原來是場(chǎng)噩夢(mèng)啊……” “哼！你這毒婦竟也來了抹恳？” 一聲冷哼從身側(cè)響起员凝，我...
開封第一講書人閱讀 38,887評(píng)論 0贊 274
萬榮殺人案實(shí)錄
序言：老撾萬榮一對(duì)情侶失蹤，失蹤者是張志新（化名）和其女友劉穎奋献，沒想到半個(gè)月后健霹，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體，經(jīng)...
沈念sama閱讀 45,310評(píng)論 1贊 310
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡瓶蚂，尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 37,533評(píng)論 2贊 332
?白月光啟示錄
正文我和宋清朗相戀三年糖埋，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片窃这。...
茶點(diǎn)故事閱讀 39,690評(píng)論 1贊 348
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡瞳别，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出杭攻，到底是詐尸還是另有隱情祟敛，我是刑警寧澤，帶...
沈念sama閱讀 35,411評(píng)論 5贊 343
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布兆解，位于F島的核電站馆铁，受9級(jí)特大地震影響，放射性物質(zhì)發(fā)生泄漏锅睛。R本人自食惡果不足惜埠巨，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 41,004評(píng)論 3贊 325
男人毒藥：我在死后第九天來索命
文/蒙蒙一历谍、第九天我趴在偏房一處隱蔽的房頂上張望。院中可真熱鬧辣垒，春花似錦扮饶、人聲如沸。這莊子的主人今日做“春日...
開封第一講書人閱讀 31,659評(píng)論 0贊 22
一樁弒父案甜无，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽。三九已至哥遮，卻和暖如春岂丘，著一層夾襖步出監(jiān)牢的瞬間，已是汗流浹背眠饮。一陣腳步聲響...
開封第一講書人閱讀 32,812評(píng)論 1贊 268
情欲美人皮
我被黑心中介騙來泰國(guó)打工奥帘，沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留，地道東北人仪召。一個(gè)月前我還...
沈念sama閱讀 47,693評(píng)論 2贊 368
代替公主和親
正文我出身青樓寨蹋，卻偏偏與公主長(zhǎng)得像，于是被迫代替她去往敵國(guó)和親扔茅。傳聞我的和親對(duì)象是個(gè)殘疾皇子已旧，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 44,577評(píng)論 2贊 353

R_for_Data_Science_Script&EDA&projects

6 Workflow: scripts

7.1 Introduction

7.2 Questions

7.3 Variation

7.4 Missing values

7.5 Covariation

7.5.1 A categorical and continuous variable

7.5.2 Two categorical variables

7.5.3 Two continuous variables

7.6 Patterns and models

7.7 ggplot2 calls

7.8 Learning more

8 Workflow: projects

推薦閱讀更多精彩內(nèi)容