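Bioinformatics Learning Basics_R Language 07_visualization data with R_Data visualization with ggplot2 in R

Original source: https://hbctraining.github.io/Intro-to-R/lessons/07_ggplot2.html

An expert's Chinese-language summary: http://www.reibang.com/p/26e4a948df3c

This post is a copy of the original text, with my own notes and exercise answers added.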

Learning Objectives

  • Plot graphs using the external package “ggplot2”.
  • Use the “map” function for iterative tasks on data structures.
  • Export plots for use outside of the R environment.

Setting up a data frame for visualization

In this lesson we want to make various plots related to the average expression in each sample. When we make the plots, we also want to use all the metadata available to appropriately annotate the plots.

Let’s take a closer look at our counts data. Each column represents a sample in our experiment, and each sample has ~38K values corresponding to the expression of different transcripts. Eventually, we want to compute the average expression value for each sample. Taking this one step at a time, what would we do if we just wanted the average expression for Sample 1 (across all transcripts)? We can use the base R function mean():

mean(rpkm_ordered[,"sample1"])

That is great, if we only wanted the average from one of the samples (1 column in a data frame), but we need to get this information from all 12 samples, so all 12 columns. What is the best way to do this?

Programming languages typically have a way to allow the execution of a single line of code or several lines of code multiple times, or in a “loop”. While “loops” are possible in R, there are functions that more directly achieve this purpose, such as the apply() family of functions and the map() family of functions. The map() family is a bit more intuitive to use than apply(), so we will explore this family in more detail. However, we have similar materials available using the apply() function if you would like to explore more on your own.
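To make the comparison concrete, here is a minimal sketch on a small made-up data frame (toy_counts and its values are hypothetical, standing in for rpkm_ordered). It computes the per-column mean first with an explicit for loop and then with sapply() from the apply() family; both approaches should give the same result.

```r
# Hypothetical toy data: 3 samples (columns), 4 transcripts (rows)
toy_counts <- data.frame(sample1 = c(1, 2, 3, 4),
                         sample2 = c(10, 20, 30, 40),
                         sample3 = c(0, 0, 5, 5))

# Option 1: an explicit loop over the column names
loop_means <- numeric(ncol(toy_counts))      # pre-allocate the result vector
names(loop_means) <- colnames(toy_counts)
for (s in colnames(toy_counts)) {
  loop_means[s] <- mean(toy_counts[, s])
}

# Option 2: a single call from the apply() family
apply_means <- sapply(toy_counts, mean)

loop_means
apply_means
```

The loop needs bookkeeping (pre-allocation, indexing) that the functional version handles for us, which is the main reason the lesson prefers the latter style.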

The map family of functions

The map() family of functions is available from the purrr package, which is part of the tidyverse suite of packages. More detailed information is available in the R for Data Science book. This family includes several functions, each taking a vector as input and outputting a vector of a specified type. For example, we can use these functions to execute some task/function on every element in a vector, or every column in a dataframe, or every component of a list, and so on.

  • map() creates a list.
  • map_lgl() creates a logical vector.
  • map_int() creates an integer vector.
  • map_dbl() creates a “double” or numeric vector.
  • map_chr() creates a character vector.

The syntax for the map() family of functions is:

## DO NOT RUN
map(object, function_to_apply)

If you would like to practice with the map() family of functions, we have additional materials available.
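As a small illustration of how the variants listed above differ only in the type of output they return, the sketch below applies each one to a made-up data frame (toy is hypothetical and not part of the lesson data).

```r
library(purrr)

# Hypothetical toy data frame: each column is one element that map() visits
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

means_list <- map(toy, mean)            # list with elements a and b
means_dbl  <- map_dbl(toy, mean)        # named numeric vector
n_values   <- map_int(toy, length)      # named integer vector
col_class  <- map_chr(toy, class)       # named character vector
is_num     <- map_lgl(toy, is.numeric)  # named logical vector

means_dbl
```

Choosing the typed variant up front means the call fails loudly if the function returns something of the wrong type, rather than silently producing a list.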

Wrangling our data with map_dbl()

To obtain mean values for all samples we can use the map_dbl() function which generates a numeric vector.

library(purrr)  # Load the purrr package

samplemeans <- map_dbl(rpkm_ordered, mean) 

We can add this 12-element vector as a column to our metadata data frame, thus combining the average expression with experimental metadata. The cbind() or “column bind” function allows us to do this very easily.

new_metadata <- cbind(metadata, samplemeans)
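Note that cbind() joins by position, not by name, so the result is only meaningful if the vector and the metadata rows are in the same sample order. The sketch below shows one way to check this before binding; toy_counts, toy_meta and their values are hypothetical stand-ins for rpkm_ordered and metadata.

```r
library(purrr)

# Hypothetical toy data: columns of toy_counts correspond to rows of toy_meta
toy_counts <- data.frame(sample1 = c(1, 3), sample2 = c(2, 6), sample3 = c(0, 4))
toy_meta <- data.frame(genotype = c("Wt", "KO", "Wt"),
                       row.names = c("sample1", "sample2", "sample3"))

toy_means <- map_dbl(toy_counts, mean)   # named vector, one mean per sample

# Stop early if the sample order differs between the two objects
stopifnot(identical(names(toy_means), rownames(toy_meta)))

toy_new <- cbind(toy_meta, samplemeans = toy_means)
toy_new
```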

Before we start to plot, we also want to add an additional metadata column to new_metadata. This new column lists the age of each of the mouse samples in days.

age_in_days <- c(40, 32, 38, 35, 41, 32, 34, 26, 28, 28, 30, 32)    
# Create a numeric vector with ages. Note that there are 12 elements here.

new_metadata <- cbind(new_metadata, age_in_days)    
# add the new vector as the last column to the new_metadata dataframe

We are now ready for plotting and data visualization!

Data Visualization with ggplot2

When we are working with large sets of numbers it can be useful to display that information graphically to gain more insight. Visualization deserves an entire course of its own (there is that much to know!). If you are interested in learning about plotting with base R functions, the original lesson links to a short lesson on that topic. In this lesson we will be plotting with the popular package ggplot2, which is distributed on CRAN as part of the tidyverse.

More recently, R users have moved away from base graphic options towards ggplot2 since it offers a lot more functionality as compared to the base R plotting functions. The ggplot2 syntax takes some getting used to, but once you get it, you will find it’s extremely powerful and flexible. We will start with drawing a simple x-y scatterplot of samplemeans versus age_in_days from the new_metadata data frame. ggplot2 assumes that the input is a data frame.

Let’s start by loading the ggplot2 library. You downloaded and installed it as part of the tidyverse package.

library(ggplot2)

The ggplot() function is used to initialize the basic graph structure, then we add to it. The basic idea is that you specify different parts of the plot, and add them together using the + operator. These parts are often referred to as layers.

Let’s start:

ggplot(new_metadata) # what happens? 

You get a blank plot, because you need to specify layers using the + operator.

One type of layer is geometric objects. These are the actual marks we put on a plot. Examples include:

  • points (geom_point, geom_jitter for scatter plots, dot plots, etc)
  • lines (geom_line, for time series, trend lines, etc)
  • boxplot (geom_boxplot, for, well, boxplots!)

For a more exhaustive list on all possible geometric objects and when to use them check out Hadley Wickham’s RPubs or the RStudio cheatsheet.

A plot must have at least one geom; there is no upper limit. You can add a geom to a plot using the + operator:

ggplot(new_metadata) +
  geom_point() # note what happens here

You will find that even though we have added a layer by specifying geom_point, we get an error. This is because each type of geom usually has a required set of aesthetics to be set. Aesthetic mappings are set with the aes() function and can be set inside geom_point() to be specifically applied to that layer. If we supply aesthetics within ggplot() instead, they will be used as defaults for every layer. Examples of aesthetics include:

  • position (i.e., on the x and y axes)
  • color (“outside” color)
  • fill (“inside” color)
  • shape (of points)
  • linetype
  • size

To start, we will add position for the x- and y-axis since geom_point requires the most basic information about a scatterplot, i.e. what you want to plot on the x and y axes. All of the others mentioned above are optional.

ggplot(new_metadata) +
     geom_point(aes(x = age_in_days, y= samplemeans))

[Figure: ggscatter1]
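The two places where aesthetics can be supplied can be compared side by side. The sketch below uses a made-up three-row data frame (toy is hypothetical, with the same column names as new_metadata) and builds the same scatterplot twice: once with aes() inside the geom, and once with aes() inside ggplot() so that it acts as the default for every layer.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as new_metadata
toy <- data.frame(age_in_days = c(26, 32, 40),
                  samplemeans = c(10.2, 12.5, 9.8))

# Aesthetics set inside the layer: they apply to this geom only
p_layer <- ggplot(toy) +
  geom_point(aes(x = age_in_days, y = samplemeans))

# Aesthetics set inside ggplot(): they become defaults for all layers
p_global <- ggplot(toy, aes(x = age_in_days, y = samplemeans)) +
  geom_point()

p_layer
p_global
```

With a single layer the two plots look the same; the difference starts to matter once a second geom is added, since only the global version passes x and y on to it automatically.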

Now that we have the required aesthetics, let’s add some extras like color to the plot. We can color the points on the plot based on genotype, by specifying the column header. You will notice that there is a default set of colors that will be used, so we do not have to specify them. Also, the legend has been conveniently plotted for us!

ggplot(new_metadata) +
  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype)) 

[Figure: ggscatter1.1]

Alternatively, we could color based on celltype by changing it to color = celltype. Let’s try something different and have both celltype and genotype identified on the plot. To do this we can assign the shape aesthetic the column header, so that each celltype is plotted with a different shaped data point. Add shape = celltype to your aesthetic and see how it changes your plot:

ggplot(new_metadata) +
  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,
            shape=celltype)) 

[Figure: ggscatter3]

The data points are quite small. We can adjust that within the geom_point() layer, but the size does not need to be included in aes() since we are specifying how large we want the data points to be, rather than mapping size to a variable. Add the size argument by specifying a number for the size of the data point:

ggplot(new_metadata) +
  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,
            shape=celltype), size=3.0) 

[Figure: ggscatter4]
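The distinction between setting an aesthetic to a fixed value and mapping it to a variable can be seen in a minimal sketch. The data frame toy below is hypothetical; the first plot sets size outside aes(), the second maps size to a column inside aes().

```r
library(ggplot2)

# Hypothetical toy data with the same column names as new_metadata
toy <- data.frame(age_in_days = c(26, 32, 40),
                  samplemeans = c(10.2, 12.5, 9.8))

# Fixed value, outside aes(): every point is drawn at size 3
p_fixed <- ggplot(toy) +
  geom_point(aes(x = age_in_days, y = samplemeans), size = 3)

# Mapped, inside aes(): point size varies with age_in_days
p_mapped <- ggplot(toy) +
  geom_point(aes(x = age_in_days, y = samplemeans, size = age_in_days))

p_fixed
p_mapped
```

Only the mapped version gets a size legend, because only there does size carry information about the data.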

The labels on the x- and y-axis are also quite small and hard to read. To change their size, we need to add an additional theme layer. The ggplot2 theme system handles non-data plot elements such as:

  • Axis label aesthetics
  • Plot background
  • Facet label background
  • Legend appearance

There are built-in themes we can use (e.g. theme_bw()) that mostly change the background/foreground colours, by adding them as an additional layer. Alternatively, we can adjust specific elements of the current default theme by adding a theme() layer and passing in arguments for the things we wish to change. Or we can use both.

Let’s add a layer theme_bw(). Do the axis labels or the tick labels get any larger by changing themes?

ggplot(new_metadata) +
  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,
            shape=celltype), size=3.0) +
  theme_bw() 

Not in this case. But we can add arguments using theme() to change it ourselves. Since we are adding this layer on top (i.e. later in sequence), any features we change will override what is set in theme_bw(). Here we’ll increase the size of the axis labels and axis tick labels to be 1.5 times the default size. When modifying the size of text we often use the rel() function, so that the size we specify is relative to the default (similar to cex for base plotting). We could also provide an absolute number, as we did with the data point size, but that can be cumbersome if you don’t know what the default font size is to begin with.

ggplot(new_metadata) +
  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,
            shape=celltype), size=3.0) +
  theme_bw() +
  theme(axis.text = element_text(size=rel(1.5)),
        axis.title = element_text(size=rel(1.5)))

[Figure: ggscatter5]
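For comparison, here is a minimal sketch of three ways to enlarge text, using a hypothetical data frame toy. The first uses rel() as above, the second gives an absolute size in points, and the third uses the base_size argument of theme_bw() to scale all text in the theme at once. The specific numbers are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as new_metadata
toy <- data.frame(age_in_days = c(26, 32, 40),
                  samplemeans = c(10.2, 12.5, 9.8))

p <- ggplot(toy) +
  geom_point(aes(x = age_in_days, y = samplemeans), size = 3)

# Relative: axis titles become 1.5 times the size they would otherwise have
p_rel <- p + theme_bw() +
  theme(axis.title = element_text(size = rel(1.5)))

# Absolute: axis titles are set to 16 points, whatever the theme default is
p_abs <- p + theme_bw() +
  theme(axis.title = element_text(size = 16))

# Whole theme: raise the base font size so that all text scales together
p_base <- p + theme_bw(base_size = 16)

p_rel
```

The relative form keeps working if you later switch themes or base sizes, which is why it is usually the more convenient choice.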

NOTE: You can use the example("geom_point") function here to explore a multitude of different aesthetics and layers that can be added to your plot. As you scroll through the different plots, take note of how the code is modified. You can use this with any of the different geometric object layers available in ggplot2 to learn how you can easily modify your plots!

NOTE: RStudio provides a very useful cheatsheet for plotting using ggplot2. It shows different example plots along with the associated code (i.e. which geom or theme to use in the appropriate situation).


Exercise

  1. The current axis label text defaults to what we gave as input to geom_point (i.e. the column headers). We can change this by adding additional layers called xlab() and ylab() for the x- and y-axis, respectively. Add these layers to the current plot such that the x-axis is labeled “Age (days)” and the y-axis is labeled “Mean expression”.

ggplot(new_metadata) +
  geom_point(aes(x = age_in_days, y = samplemeans, color = genotype,
                 shape = celltype), size = 3.0) +
  theme(axis.text = element_text(size = rel(1.5)),
        axis.title = element_text(size = rel(1.5))) +
  xlab("Age (days)") +
  ylab("Mean expression")

  2. Use the ggtitle layer to add a title to your plot. NOTE: You can center the title over the plot using theme(plot.title = element_text(hjust = 0.5)).

ggplot(new_metadata) +
  geom_point(aes(x = age_in_days, y = samplemeans, color = genotype,
                 shape = celltype), size = 3.0) +
  theme(axis.text = element_text(size = rel(1.5)),
        axis.title = element_text(size = rel(1.5)),
        plot.title = element_text(hjust = 0.5)) +
  xlab("Age (days)") +
  ylab("Mean expression") +
  ggtitle("Practice of KuRui")

Consistent formatting using custom functions

When publishing, it is helpful to ensure all plots have similar formatting. To do this we can create a custom function with our preferences for the theme. Remember the structure of a function is:

name_of_function <- function(arguments) {
    statements or code that does something
}

Now, let’s suppose we always wanted our theme to include the following:

theme_bw() +
    theme(axis.text=element_text(size=rel(1.5)),
          axis.title=element_text(size=rel(1.5)),
          plot.title=element_text(hjust=0.5))

If there is nothing that we want to change when we run this, then we do not need to specify any arguments. Creating the function is simple; we can just put the code inside the {}:

personal_theme <- function(){
  theme_bw() +
    theme(axis.text=element_text(size=rel(1.5)),
          axis.title=element_text(size=rel(1.5)),
          plot.title=element_text(hjust=0.5)) 
}

Now to run our personal theme with any plot, we can use this function in place of the theme() code:

ggplot(new_metadata) +
  geom_point(aes(x=age_in_days, y=samplemeans, color=genotype, shape=celltype), size=rel(3.0)) +
  personal_theme() +
  xlab("Age (days)") +
  ylab("Mean expression") +
  ggtitle("Expression with Age")
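If we did want something to be adjustable, the same pattern extends naturally to a function with an argument. The sketch below is a hypothetical variant, not part of the original lesson: scaled_theme() and its text_scale argument are made-up names, and toy is a made-up data frame.

```r
library(ggplot2)

# Hypothetical variant of personal_theme() with one adjustable argument
scaled_theme <- function(text_scale = 1.5) {
  theme_bw() +
    theme(axis.text = element_text(size = rel(text_scale)),
          axis.title = element_text(size = rel(text_scale)),
          plot.title = element_text(hjust = 0.5))
}

# Hypothetical toy data with the same column names as new_metadata
toy <- data.frame(age_in_days = c(26, 32, 40),
                  samplemeans = c(10.2, 12.5, 9.8))

base_plot <- ggplot(toy) +
  geom_point(aes(x = age_in_days, y = samplemeans), size = 3)

p_default <- base_plot + scaled_theme() + ggtitle("Default scaling")
p_large   <- base_plot + scaled_theme(text_scale = 2) + ggtitle("Larger text")

p_default
p_large
```

Giving the argument a default value means existing calls without arguments keep behaving exactly like the fixed version.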

Boxplot

Now that we have all the required information for plotting with ggplot2, let’s try plotting a boxplot. A boxplot provides a graphical view of the distribution of data based on a five-number summary. The bottom and top of the box represent the (1) first and (2) third quartiles (25th and 75th percentiles, respectively). The line inside the box represents the (3) median (50th percentile). The whiskers extending above and below the box represent the (4) maximum and (5) minimum of the data set, excluding outliers: they reach the largest and smallest values that are not outliers.

Outliers are determined using the interquartile range (IQR), which is defined as: Q3 - Q1. Any value that lies more than 1.5 x IQR below Q1 or above Q3 is considered an outlier and is represented as a point above or below the whiskers. These outliers are useful for identifying unexpected observations.
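The outlier rule can be worked through by hand in a few lines of R. The vector x below is made up for illustration, and the quartiles are computed with R's default quantile() definition; other quartile definitions can give slightly different cut-offs.

```r
# Hypothetical values: nine ordinary observations and one extreme one
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1  <- quantile(x, 0.25, names = FALSE)   # first quartile
q3  <- quantile(x, 0.75, names = FALSE)   # third quartile
iqr <- q3 - q1                            # interquartile range

lower_fence <- q1 - 1.5 * iqr             # values below this are outliers
upper_fence <- q3 + 1.5 * iqr             # values above this are outliers

outliers <- x[x < lower_fence | x > upper_fence]

# Whiskers reach the most extreme values that are still inside the fences
whisker_ends <- range(x[x >= lower_fence & x <= upper_fence])

outliers
whisker_ends
```

Here Q1 = 3.25 and Q3 = 7.75, so the IQR is 4.5 and the fences sit at -3.5 and 14.5. The value 30 is the only outlier, and the whiskers end at 1 and 9.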

  1. Use the geom_boxplot() layer to plot the differences in sample means between the Wt and KO genotypes.
  2. Add a title to your plot.
  3. Add ‘Genotype’ as your x-axis label and ‘Mean expression’ as your y-axis label.
  4. Change the size of your axes labels to 1.5x larger than the default.
  5. Change the size of your axes text (the labels on the tick marks) to 1.25x larger than the default.
  6. Change the size of your plot title in the same way that you change the size of the axes text but use plot.title.

BONUS: Use the fill aesthetic to look at differences in sample means between celltypes within each genotype.

Our final figure should look something like that provided below.

[Figure: ggbox]

NOTE: If you wanted to change the colors of these boxplots you would add another layer scale_fill_manual() to the code, and within the function specify which colors you want to use using the values argument. For example, if the factor column you are coloring with has 2 levels, you will need to give 2 values as follows scale_fill_manual(values=c("purple","orange")).

NOTE: You are not restricted to colors specified as above; you can choose from many colors using their hexadecimal codes. The original lesson links to more information about color palettes in R.
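To illustrate the two notes above without giving away the exercise, the sketch below uses a made-up data frame (toy, with invented celltype labels typeA and typeB) and fills the boxes with one named color and one hexadecimal code. The particular colors are arbitrary.

```r
library(ggplot2)

# Hypothetical toy data: 2 genotypes x 2 cell types, 3 values per group
toy <- data.frame(
  genotype    = rep(c("Wt", "KO"), each = 6),
  celltype    = rep(c("typeA", "typeB"), times = 6),
  samplemeans = c(10, 12, 11, 13, 9, 14, 20, 22, 21, 23, 19, 24)
)

p_box <- ggplot(toy) +
  geom_boxplot(aes(x = genotype, y = samplemeans, fill = celltype)) +
  scale_fill_manual(values = c("purple", "#FFA500"))  # a color name and a hex code

p_box
```

The fill column has two levels, so exactly two values are supplied; they are assigned to the levels in order (typeA, then typeB).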

Exporting figures to file

There are two ways in which figures and plots can be output to a file (rather than simply displayed on screen). The first (and easiest) is to export directly from the RStudio ‘Plots’ panel, by clicking on Export when the image is plotted. This will give you the option of png or pdf and let you select the directory in which to save it. It will also give you options to dictate the size and resolution of the output image.

The second option is to use R functions and have the write-to-file step hard-coded into your script. This would allow you to run the script from start to finish and automate the process (not requiring human point-and-click actions to save). In R’s terminology, output is directed to a particular output device and that dictates the output format that will be produced. A device must be created or “opened” in order to receive graphical output and, for devices that create a file on disk, the device must also be closed in order to complete the output.

Let’s print our scatterplot to a pdf file. First you need to initialize a plot using a function which specifies the graphical format you intend to create, e.g. pdf(), png(), tiff() etc. Within the function you will need to specify a name for your image, and optionally the width and height. This will open up the device that you wish to write to:

pdf("figures/scatterplot.pdf")

If you wish to modify the size and resolution of the image you will need to add in the appropriate parameters as arguments to the function when you initialize. Then we plot the image to the device, using the ggplot scatterplot that we just created.

ggplot(new_metadata) +
  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,
            shape=celltype), size=rel(3.0)) 

Finally, close the “device”, or file, using the dev.off() function. There are also bmp, tiff, and jpeg functions, though the jpeg function has proven less stable than the others.

dev.off()

Note 1: You will not be able to open and look at your file using standard methods (Adobe Acrobat or Preview etc.) until you execute the dev.off() function.

Note 2: If you had made any additional plots before closing the device, they will all be stored in the same file; each plot usually gets its own page, unless you specify otherwise.
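The whole open, plot, close sequence can be sketched in one self-contained script. Everything specific here is an assumption for illustration: toy is a made-up data frame, the files are written to a temporary directory rather than figures/, and the 7 x 5 inch size is arbitrary. The last step uses ggsave() from ggplot2, which opens and closes the device for you.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as new_metadata
toy <- data.frame(age_in_days = c(26, 32, 40),
                  samplemeans = c(10.2, 12.5, 9.8))

p1 <- ggplot(toy) +
  geom_point(aes(x = age_in_days, y = samplemeans), size = 3)
p2 <- p1 + theme_bw()

# Device approach: open the file, draw, then close
out_file <- file.path(tempdir(), "toy_scatterplots.pdf")
pdf(out_file, width = 7, height = 5)   # width and height are in inches
print(p1)                              # first page
print(p2)                              # second page of the same file
dev.off()                              # the file is complete only after this call

# Alternative: ggsave() handles opening and closing the device itself
out_file2 <- file.path(tempdir(), "toy_scatterplot_ggsave.pdf")
ggsave(out_file2, plot = p1, width = 7, height = 5, units = "in")
```

The explicit print() calls make the script behave the same inside loops and functions, where ggplot objects are not displayed automatically.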
