Course notes for the JHK Data Science course series. These cover the first two courses, "The Data Scientist's Toolbox" and "R Programming".
The Data Scientist's Toolbox
Goals
Types of Questions
descriptive analyses
Describe a set of data
exploratory analysis
Find relationships you didn't know about
inferential analysis
use a relatively small sample of data to say something about a bigger population
predictive analysis
To use the data on some objects to predict values for another object
causal analysis
To find out what happens to one variable when you make another variable change
mechanistic analysis
Understand the exact changes in variables that lead to changes in other variables for individual objects
Data is only the second most important thing; the most important thing in data science is the question.
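The R Language
Functions
Argument matching
Positional matching
Matching by name
Partial matching
Given a supplied argument, R tries to match it in this order: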
- Check for exact match for a named argument
- Check for a partial match
- Check for a positional match
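The three kinds of matching can be tried out on a small function. This is only a sketch: `describe` and its arguments are hypothetical names made up for the illustration.

```r
# Hypothetical function, used only to illustrate argument matching.
describe <- function(value, verbose = FALSE, digits = 2) {
    paste0("value=", round(value, digits), " verbose=", verbose)
}

# Positional matching: arguments are taken in the order they are declared.
r1 <- describe(3.14159, TRUE, 1)

# Exact name matching: named arguments can appear in any order.
r2 <- describe(digits = 1, value = 3.14159)

# Partial matching: "verb" and "dig" are unique prefixes of "verbose" and "digits".
r3 <- describe(3.14159, verb = TRUE, dig = 3)

print(c(r1, r2, r3))
```

Partial matching is convenient at the console, but spelling names out in full is usually safer in scripts, because an abbreviation can become ambiguous once a function gains new arguments.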
Lazy Evaluation
Arguments passed to a function are evaluated only when they are actually used.
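A minimal sketch of what this means in practice. The functions `f` and `g` are hypothetical examples, not taken from the course material.

```r
# f never uses b, so b is never evaluated.
f <- function(a, b) {
    a^2
}

res_missing <- f(2)                          # b is missing, but no error is raised
res_unused  <- f(2, stop("never evaluated")) # the stop() call is never run

# g does use b, so the error only appears at the point where b is needed.
g <- function(a, b) {
    a + b
}
res_error <- tryCatch(g(2), error = function(e) "error raised when b was needed")

print(list(res_missing, res_unused, res_error))
```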
The "..." argument (variable number of arguments)
- Used to extend another function when you do not want to copy its entire argument list
  myplot <- function(x, y, type = "l", ...) {
      plot(x, y, type = type, ...)
  }
- Used by generic functions to pass extra arguments on to methods
  > mean
  function(x, ...)
  UseMethod("mean")
- Used when the number of arguments cannot be known in advance
  > args(paste)
  function(..., sep = " ", collapse = NULL)
  > paste("a", "b", sep = ":")
  [1] "a:b"
Coding standards
- Always use text files / text editor
- Indent your code
- Limit the width of your code (80 columns?)
- Limit the length of individual functions
Lexical Scoping
This part is important; see the Scoping Rules lecture slides for details.
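As a quick reminder of the idea: under lexical scoping, a free variable inside a function is looked up in the environment where the function was defined, not where it is called. The sketch below uses made-up names (`make_power`, `f`, `g`) and is not a substitute for the slides.

```r
# A function that returns a function. The inner function remembers n
# from the environment in which it was created (a closure).
make_power <- function(n) {
    function(x) x^n
}
square <- make_power(2)
cube   <- make_power(3)

# Free variables are resolved where a function is defined.
y <- 10
f <- function(x) {
    y <- 2        # local to f, invisible to g
    y^2 + g(x)
}
g <- function(x) {
    x * y         # y is found where g was defined, so y is 10 here
}

print(c(square(3), cube(2), f(3)))
```

Here `f(3)` is 4 + 3 * 10 = 34. Under dynamic scoping `g` would have seen the `y` inside `f` and the result would have been 4 + 3 * 2 = 10.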
Loop Functions
apply
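Evaluates a function (often an anonymous one) over the margins of an array.
- Most often used to apply a function to the rows or columns of a matrix
- Also works with general arrays, e.g. taking the average of an array of matrices
- Not really faster than writing a loop, but it fits in one line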
apply(X, MARGIN, FUN, ...)
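A small sketch of how MARGIN selects what the function is applied over. The matrices here are made-up toy data.

```r
m <- matrix(1:6, nrow = 2)          # 2 x 3 matrix, filled column by column

row_sums  <- apply(m, 1, sum)       # MARGIN = 1: one result per row
col_means <- apply(m, 2, mean)      # MARGIN = 2: one result per column

# An array of two 2 x 2 matrices; averaging over the third dimension
# keeps margins 1 and 2 and collapses the rest.
a <- array(1:8, dim = c(2, 2, 2))
avg_matrix <- apply(a, c(1, 2), mean)

print(row_sums)     # 9 12
print(col_means)    # 1.5 3.5 5.5
print(avg_matrix)
```

For plain row and column sums or means, `rowSums`, `colSums`, `rowMeans` and `colMeans` do the same job and are considerably faster on large matrices.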
lapply
Loops over a list and calls a function on each element
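For example, with a small made-up list:

```r
x <- list(a = 1:5, b = c(10, 20, 30))

# lapply always returns a list, with one element per input element
# and the input names preserved.
means <- lapply(x, mean)

# Anonymous functions work as well.
firsts <- lapply(x, function(v) v[1])

str(means)
str(firsts)
```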
sapply
Same as lapply, but tries to simplify the result when possible
- If the result is a list where every element has length 1, a vector is returned
- If the result is a list where every element is a vector of the same length (> 1), a matrix is returned
- Otherwise, a list is returned
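The three cases above can be seen side by side with toy inputs:

```r
x <- list(a = 1:4, b = 5:8)

# Every result has length 1, so a (named) vector comes back.
as_vector <- sapply(x, mean)

# Every result has length 2, so a matrix comes back, one column per element.
as_matrix <- sapply(x, range)

# Results have different lengths, so no simplification: a list comes back.
as_list <- sapply(list(1:2, 1:3), function(v) v * 2)

print(as_vector)
print(as_matrix)
str(as_list)
```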
tapply
Applies a function over subsets of a vector. It is not clear why it is called tapply.
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
Evaluates FUN on the elements of X within each level given by the INDEX argument.
> x <- c(rnorm(10), runif(10), rnorm(10, 1))
> f <- gl(3, 10)
> f
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3
[24] 3 3 3 3 3 3 3
Levels: 1 2 3
> tapply(x, f, mean)
1 2 3
0.1144464 0.5163468 1.2463678
mapply
A multivariate version of lapply: it applies a function in parallel over several sets of arguments
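A short sketch with toy inputs, showing how the i-th elements of each argument are passed together:

```r
# Equivalent to list(rep(1, 3), rep(2, 2), rep(3, 1)).
reps <- mapply(rep, 1:3, 3:1)

# With scalar results the output is simplified to a vector, as in sapply.
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

str(reps)
print(sums)   # 5 7 9
```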
split
split takes a vector or other object and splits it into groups determined by a factor or a list of factors.
> str(split)
function(x, f, drop = FALSE, ...)
x is a vector (or list) or data frame
f is a factor (or coerced to one) or a list of factors
drop indicates whether empty factors levels should be dropped
split is often used together with lapply.
> lapply(split(x, f), mean)
$`1`
[1] 0.1144464
$`2`
[1] 0.5163468
$`3`
[1] 1.246368
Debugging
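- traceback: prints the function call stack
- debug: flags a function for "debug" mode, which lets you step through it one line at a time
- browser: suspends the execution of a function and enters debug mode
- trace: lets you insert debugging code at specific places in a function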
- recover: allows you to modify the error behavior so that you can browse the function call stack
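Most of these tools are interactive, so the sketch below only shows how the debug flag is set and cleared without actually entering the browser. `buggy` is a hypothetical function made up for the example.

```r
# Hypothetical function we might want to step through.
buggy <- function(x) {
    y <- x^2
    y + 1
}

debug(buggy)                    # flag the function; the next call would open the browser
flag_on <- isdebugged(buggy)    # TRUE while the flag is set

undebug(buggy)                  # clear the flag again
flag_off <- isdebugged(buggy)   # FALSE

print(c(flag_on, flag_off))

# In an interactive session one would typically also use:
#   traceback()               right after an error, to see the call stack
#   options(error = recover)  to browse the call stack whenever an error occurs
```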
Generating random numbers
Probability functions
They take the form [dpqr]distribution_abbreviation()
where the first letter indicates which aspect of the distribution is meant:
d = density function
p = (cumulative) distribution function
q = quantile function
r = random number generation
The set.seed() function sets the random number seed, which ensures reproducibility
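Using the normal distribution (abbreviation `norm`) as an example, the four prefixes and the effect of the seed look roughly like this:

```r
dens <- dnorm(0)        # density of N(0, 1) at 0, i.e. 1 / sqrt(2 * pi)
prob <- pnorm(0)        # P(X <= 0) = 0.5
quan <- qnorm(0.5)      # the median, 0; qnorm is the inverse of pnorm

# Resetting the same seed reproduces exactly the same random draws.
set.seed(1)
first_draw <- rnorm(5)
set.seed(1)
second_draw <- rnorm(5)

print(c(dens, prob, quan))
print(identical(first_draw, second_draw))   # TRUE
```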
Random sampling
The sample function draws randomly from a specified set of objects
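A few typical calls, as a sketch. The seed value 42 is arbitrary, and the actual numbers drawn depend on the R version, so none are shown here.

```r
set.seed(42)                              # arbitrary seed, for reproducibility

perm <- sample(1:10)                      # a random permutation of 1..10
draw <- sample(1:10, 4)                   # 4 values, without replacement
boot <- sample(1:10, replace = TRUE)      # 10 values, with replacement

print(perm)
print(draw)
print(boot)
```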
Profiling
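Profiling is a systematic way to examine how much time is spent in different parts of a program; it is especially useful when optimizing code
General principles of optimization
- Design first, then optimize
- Remember: premature optimization is the root of all evil
- Measure (collect data), don't guess
Using system.time()
Takes an arbitrary R expression as input and returns the time (in seconds) needed to evaluate it
Returns an object of class proc_time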
user time: time charged to the CPU(s) for this expression
elapsed time: "wall clock" time
## Elapsed time > user time
system.time(readLines("http://www.jhsph.edu"))
user system elapsed
0.004 0.002 0.431
## Elapsed time < user time
hilbert <- function(n) {
    i <- 1:n
    1 / outer(i - 1, i, "+")
}
x <- hilbert(1000)
system.time(svd(x))
user system elapsed
1.605 0.094 0.742
- Usually, the user time and elapsed time are relatively close for straight computing tasks
- Elapsed time may be greater than user time if the CPU spends a lot of time waiting around
- Elapsed time may be smaller than user time if your machine has multiple processors (cores) and is able to use them
  - Multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL)
  - Parallel processing via the parallel package
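Longer expressions can be timed by wrapping them in curly braces, and the individual timings can be pulled out of the returned proc_time object by name. A minimal sketch with a made-up workload:

```r
# Time a multi-line expression by wrapping it in { }.
timing <- system.time({
    total <- 0
    for (i in 1:1e6) {
        total <- total + i
    }
})

print(class(timing))           # "proc_time"
print(timing[["user.self"]])   # CPU time charged to this R process
print(timing[["elapsed"]])     # wall-clock time
```

The actual timings vary from machine to machine and from run to run, so no values are shown.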
The R Profiler
- The Rprof() function starts the profiler in R
- The summaryRprof() function summarizes the output of Rprof()
- Note: Rprof() records the function call stack at regular sampling intervals; the default interval is 0.02 seconds
Example:
## lm(y ~ x)
sample.interval=10000
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"lm.fit" "lm"
"lm.fit" "lm"
"lm.fit" "lm"
summaryRprof has two ways of normalizing the data:
- "by.total" divides the time spent in each function by the total run time
- "by.self" does the same, but first subtracts the time spent in the functions that it calls
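Putting the pieces together, a profiling session might look like the sketch below. The data are simulated and the sizes are arbitrary; the only purpose is to give the profiler enough work to collect samples. The interval of 0.01 seconds is an assumption chosen to match the `sample.interval=10000` header in the raw output above, which appears to be given in microseconds.

```r
# Simulated data (made up for this example).
set.seed(1)
n <- 2e5
x <- rnorm(n)
y <- 2 * x + rnorm(n)

prof_file <- tempfile(fileext = ".out")

Rprof(prof_file, interval = 0.01)   # start profiling, writing samples to prof_file
for (i in 1:10) {
    fit <- lm(y ~ x)                # the code being profiled
}
Rprof(NULL)                         # stop profiling

prof <- summaryRprof(prof_file)
print(head(prof$by.total))          # time in each function, including its callees
print(head(prof$by.self))           # time in each function itself
```

`by.self` is usually the more useful table when looking for bottlenecks, since `by.total` always attributes nearly all of the time to the top-level function.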