I (Jenny Bryan) was honored to speak this week at the IASC-ARS/NZSA Conference, hosted by the Stats Department at The University of Auckland. One of the conference themes is to celebrate the accomplishments of Ross Ihaka, who got R started back in 1992, along with Robert Gentleman. My talk included advice on setting up your R life to maximize effectiveness and reduce frustration.
Two specific slides generated much discussion and consternation in #rstats Twitter:
If the first line of your R script is
setwd("C:\Users\jenny\path\that\only\I\have")
I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.
If the first line of your R script is
rm(list = ls())
I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.
I stand by these strong opinions, but on their own, threats to commit arson aren’t terribly helpful! Here I explain why these habits can be harmful and may be indicative of an awkward workflow. Feel free to discuss more on community.rstudio.com.
Caveat: only you can decide how much you care about this. The importance of these practices has a lot to do with whether your code will be run by other people, on other machines, and in the future. If your current practices serve your purposes, then go forth and be happy.
Workflow versus Product
Let’s make a distinction between things you do because of personal taste and habits (“workflow”) versus the logic and output that is the essence of your project (“product”). These are part of your workflow:
- The editor you use to write your R code.
- The name of your home directory.
- The R code you ran before lunch.
I consider these to be clearly product:
- The raw data.
- The R code someone needs to run on your raw data to get your results, including the explicit library() calls to load necessary packages.
Ideally, you don’t hardwire anything about your workflow into your product. Workflow-related operations should be executed by you interactively, using whatever means is appropriate to your setup, but not built into the scripts themselves.
Self-contained projects
I suggest organizing each data analysis into a project: a folder on your computer that holds all the files relevant to that particular piece of work. I’m not assuming this is an RStudio Project, though this is a nice implementation discussed below.
Any resident R script is written assuming that it will be run from a fresh R process with working directory set to the project directory. It creates everything it needs, in its own workspace or folder, and it touches nothing it did not create. For example, it does not install additional packages (another pet peeve of mine).
This convention guarantees that the project can be moved around on your computer or onto other computers and will still “just work”. I argue that this is the only practical convention that creates reliable, polite behavior across different computers or users and over time. This convention is neither new, nor unique to R.
It’s like agreeing that we will all drive on the left or the right. A hallmark of civilization is following conventions that constrain your behavior a little, in the name of public safety.
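One polite alternative to installing packages from inside a script is to check for them up front and stop with a clear message. The sketch below is only an illustration: the helper name check_packages() is hypothetical (it does not come from any package), and the package names passed to it are placeholders for whatever your script actually needs.

```r
# Hypothetical helper: report missing packages instead of installing them.
check_packages <- function(pkgs) {
  # requireNamespace() returns FALSE (rather than erroring) when a package
  # is not installed, and it does not attach anything to the search path.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing <- pkgs[!installed]
  if (length(missing) > 0) {
    stop("Please install these packages first: ",
         paste(missing, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

# Placeholder package names; replace with the ones your script uses.
check_packages(c("stats", "utils"))
```

The decision to install something stays with the person running the script, which is where a workflow choice belongs.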
Use of a development environment
You will notice that the workflow recommendations given here are easier to implement if you use an IDE (integrated development environment). RStudio is a great example (what I use today), but there are many others, including: Emacs + ESS (what I used for ~15 years before RStudio), vim + Nvim-R, Visual Studio + RTVS.
Direction of causality: long-time coders don’t organize their work into self-contained projects and use relative paths because they use an IDE. They use an IDE because it makes it easier to follow standard practices, such as these.
What’s wrong with setwd()?
I run a lot of student code in STAT 545 and, at the start, I see a lot of R scripts that look like this:
library(ggplot2)
setwd("/Users/jenny/cuddly_broccoli/verbose_funicular/foofy/data")
df <- read.delim("raw_foofy_data.csv")
p <- ggplot(df, aes(x, y)) + geom_point()
ggsave("../figs/foofy_scatterplot.png")
The chance of the setwd() command having the desired effect (making the file paths work) for anyone besides its author is 0%. It’s also unlikely to work for the author one or two years or computers from now. The project is not self-contained and portable. To recreate and perhaps extend this plot, the lucky recipient will need to hand edit one or more paths to reflect where the project has landed on their machine. When you do this for the 73rd time in 2 days, while marking an assignment, you start to fantasize about lighting the perpetrator’s computer on fire.
This use of setwd() is also highly suggestive that the useR does all of their work in one R process and manually switches gears when they shift from one project to another. That sort of workflow makes it unpleasant to work on more than one project at a time and also makes it easy for work done on one project to accidentally leak into subsequent work on another (e.g., objects, loaded packages, session options).
Use projects and the here package
How can you avoid setwd() at the top of every script?
- Organize each logical project into a folder on your computer.
- Make sure the top-level folder advertises itself as such. This can be as simple as having an empty file named .here. Or, if you use RStudio and/or Git, those both leave characteristic files behind that will get the job done.
- Use the here() function from the here package to build the path when you read or write a file. Create paths relative to the top-level directory.
- Whenever you work on this project, launch the R process from the project’s top-level directory. If you launch R from the shell, cd to the correct folder first.
To continue our example, start R in the foofy directory, wherever that may be. Now the code looks like so:
library(ggplot2)
library(here)
df <- read.delim(here("data", "raw_foofy_data.csv"))
p <- ggplot(df, aes(x, y)) + geom_point()
ggsave(here("figs", "foofy_scatterplot.png"))
This will run, with no edits, for anyone who follows the convention about launching R in the project folder. In fact, it will even work if R’s working directory is anywhere inside the project, i.e. it will work from sub-folders. This plays well with knitr/rmarkdown’s default behavior around working directory and in package development/checking workflows.
Read up on the here package to learn about more features, such as additional ways to mark the top directory and troubleshooting with dr_here(). I have also written a more detailed paean to this package before.
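To make the idea of a self-advertising top-level folder concrete, here is a rough base R sketch of how a root folder could be located by walking upward from a starting directory. This is not how the here package is actually implemented. The function find_project_root() and its marker list are assumptions made for illustration, and the demo uses a temporary folder as a stand-in for a real project.

```r
# Hypothetical helper: walk up the directory tree until a marker is found.
find_project_root <- function(start = getwd(), markers = c(".here", ".git")) {
  dir <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    # A folder counts as the root if it holds a marker file/folder
    # or any RStudio project file (*.Rproj).
    has_marker <- any(file.exists(file.path(dir, markers))) ||
      length(list.files(dir, pattern = "\\.Rproj$")) > 0
    if (has_marker) return(dir)
    parent <- dirname(dir)
    if (identical(parent, dir)) {
      # Reached the filesystem root without finding a marker.
      stop("No project root found above ", start, call. = FALSE)
    }
    dir <- parent
  }
}

# Demo: a throwaway "foofy" project in the session's temporary directory.
project <- file.path(tempdir(), "foofy")
dir.create(file.path(project, "data", "deep"), recursive = TRUE,
           showWarnings = FALSE)
invisible(file.create(file.path(project, ".here")))

# Start from a nested sub-folder; the root is still found.
root <- find_project_root(file.path(project, "data", "deep"))
file.path(root, "figs", "foofy_scatterplot.png")
```

In real work I would reach for here() rather than a hand-rolled helper like this; the sketch only shows why a marker file is enough to make paths independent of where the project lives.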
RStudio Projects
This work style is so crucial that RStudio has an official notion of a Project (with a capital “P”). You can designate a new or existing folder as a Project. All this means is that RStudio leaves a file, e.g., foofy.Rproj, in the folder, which is used to store settings specific to that project.
Double-click on a .Rproj file to open a fresh instance of RStudio, with the working directory and file browser pointed at the project folder. The here package is aware of this and the presence of an .Rproj file is one of the ways it recognizes the top-level folder for a project.
RStudio fully supports Project-based workflows, making it easy to switch from one to another, have many projects open at once, re-launch recently used Projects, etc.
What’s wrong with rm(list = ls())?
It’s also fairly common to see data analysis scripts that begin with this object-nuking command:
rm(list = ls())
Just like hard-wiring the working directory, this is highly suggestive that the useR works in one R process and manually switches gears when they shift from one project to another. That, in turn, suggests that development frequently happens in a long-running, already-used R process rather than a fresh and clean one.
The problem is that rm(list = ls()) does NOT, in fact, create a fresh R process. All it does is delete user-created objects from the global workspace.
Many other changes to the R landscape persist invisibly and can have profound effects on subsequent development. Any packages that have been loaded are still available. Any options that have been set to non-default values remain that way. Working directory is not affected (which is, of course, why we see setwd() so often here too!).
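A small experiment makes this visible. The sketch below is meant for a throwaway session, since it really does clear the workspace. The object name foofy_df and the option name foofy.demo are made up for illustration, and tools is just a convenient package that ships with R.

```r
foofy_df <- data.frame(x = 1:3)   # a user-created object
options(foofy.demo = TRUE)        # hypothetical option, set to a non-default value
library(tools)                    # attach a package

rm(list = ls())                   # the "object-nuking" command

exists("foofy_df")                # FALSE: the object is gone
getOption("foofy.demo")           # TRUE: the option survives
"package:tools" %in% search()     # TRUE: the package is still attached
```

Only the first of the three pieces of session state is reset, which is why restarting R is the more reliable way to get a clean slate.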
Why does this matter? It makes your script vulnerable to hidden dependencies on things you ran in this R process before you executed rm(list = ls()).
- You might use functions from a package without including the necessary library() call. Your collaborator won’t be able to run this script.
- You might code up an analysis assuming that stringsAsFactors = FALSE but next week, when you have restarted R, everything will inexplicably be broken.
- You might write paths relative to some random working directory, then be puzzled next month when nothing can be found or results don’t appear where you expect.
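One way to defuse the stringsAsFactors kind of hidden dependency is to state the assumption in the call itself instead of relying on session state. The sketch below uses a made-up two-row CSV written to a temporary file. R 4.0.0 changed the default to FALSE, but spelling it out keeps the script behaving the same on older installations.

```r
# Write a tiny illustrative CSV to a temporary file.
csv <- tempfile(fileext = ".csv")
writeLines(c("id,label", "1,a", "2,b"), csv)

# The assumption lives in the script, not in a global option or .Rprofile.
dat <- read.csv(csv, stringsAsFactors = FALSE)

class(dat$label)  # "character"
```

The same principle applies to the other two bullets: put the library() calls and the path construction in the script, so nothing depends on what happened to be true in your session.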
The solution is to write every script assuming it will be run in a fresh R process. How do you adopt this style? Key steps:
- User-level setup: Do not save .RData when you quit R and don’t load .RData when you fire up R.
  - In RStudio, this behavior can be requested in the General tab of Preferences.
  - If you run R from the shell, put something like this in your .bash_profile: alias R='R --no-save --no-restore-data'.
- Don’t do things in your .Rprofile that affect how R code runs, such as loading a package like dplyr or ggplot2 or setting an option such as stringsAsFactors = FALSE.
- Daily work habit: Restart R very often and re-run your under-development script from the top.
  - If you use RStudio, use the menu item Session > Restart R or the associated keyboard shortcut Ctrl+Shift+F10 (Windows and Linux) or Command+Shift+F10 (Mac OS). You can re-run all code up to the current line with Ctrl+Alt+B (Windows and Linux) or Command+Option+B (Mac OS).
  - If you run R from the shell, use Ctrl+D to quit, then R to restart.
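If you want to check from inside R that a script really does run in a fresh process, one option (not part of the steps above, so treat it as an assumption about your setup) is to launch it with Rscript --vanilla via system2(). The script here is a one-line placeholder that reports how many objects it can see when it starts.

```r
# Placeholder script: prints the number of objects in its global workspace.
script <- tempfile(fileext = ".R")
writeLines("cat(length(ls()))", script)

# This object exists only in the current session.
leftover <- "I live only in the current session"

# Run the script in a brand new R process; --vanilla skips .RData,
# .Rprofile, and site/environment files.
rscript <- file.path(R.home("bin"), "Rscript")
out <- system2(rscript, c("--vanilla", shQuote(script)), stdout = TRUE)
out
```

The child process should report zero objects, because nothing from the parent session, including leftover, is carried over.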
This requires that you fully embrace the idea that source is real:
“The source code is real. The objects are realizations of the source code. Source for EVERY user modified object is placed in a particular directory or directories, for later editing and retrieval.” (from the ESS manual)
This doesn’t mean that your scripts need to be perfectly polished and ready to run unattended on a remote server. Scripts can be messy, anticipating interactive execution, but still be complete. Clean them up when and if you need to.
What about objects that take a long time to create? Isolate that bit in its own script and write the precious object to file with saveRDS(my_precious, here("results", "my_precious.rds")). Now you can develop scripts to do downstream work that reload the precious object via my_precious <- readRDS(here("results", "my_precious.rds")). It is a good idea to break data analysis into logical, isolated pieces anyway.
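Here is a minimal sketch of that save-then-reload pattern. To keep it runnable anywhere, it writes to a temporary folder instead of here("results"), and a quick lm() fit on the built-in mtcars data stands in for the genuinely slow computation; both are substitutions made for illustration.

```r
# Stand-in for here("results"): a folder inside the session's temp directory.
results_dir <- file.path(tempdir(), "results")
dir.create(results_dir, showWarnings = FALSE)
rds_path <- file.path(results_dir, "my_precious.rds")

# --- upstream script: create the expensive object once and save it ---
my_precious <- lm(mpg ~ wt, data = mtcars)  # pretend this takes hours
saveRDS(my_precious, rds_path)

# --- downstream script: start clean and reload instead of recomputing ---
rm(my_precious)
my_precious <- readRDS(rds_path)
coef(my_precious)
```

Unlike a saved workspace, the .rds file holds exactly one object and the downstream script chooses the name it is assigned to, so the dependency between the two scripts is written down in code.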
Lastly, rm(list = ls()) is hostile to anyone that you ask to help you with your R problems. If they take a short break from their own work to help debug your code, their generosity is rewarded by losing all of their previous work. Now granted, if your helper has bought into all the practices recommended here, this is easy to recover from, but it’s still irritating. When this happens for the 100th time in a semester, it rekindles the computer arson fantasies triggered by last week’s fiascos with setwd().