Workflow: Basics
4.Practice
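1. The "i"? (The name typed on the second line uses a different "i" character from the one that was assigned, so R cannot find the object.)
2. ??? (I don't understand this one.)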
`ggplot(data=mpg)+geom_point(mapping=aes(x=displ,y=hwy),data=filter(mpg,cyl==8))`
`ggplot(data=diamonds)+geom_bar(mapping=aes(x=cut),data=filter(diamonds,carat>3))`
3.Press Alt + Shift + K. What happens? How can you get to the same place
using the menus?
It opens the keyboard shortcut reference.
Menus: Tools -> Keyboard Shortcuts Help
Data: Transformation
1.Introduction
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
# This tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you'll need to use their full names: `stats::filter()` and `stats::lag()`.
2.Filter rows with filter()
2.4 Exercises
- Find all flights that
  - Had an arrival delay of two or more hours
  - Flew to Houston (IAH or HOU)
  - Were operated by United, American, or Delta
  - Departed in summer (July, August, and September)
  - Arrived more than two hours late, but didn't leave late
  - Were delayed by at least an hour, but made up over 30 minutes in flight
  - Departed between midnight and 6am (inclusive)
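filter(flights, arr_delay >= 120)
filter(flights, dest %in% c("IAH", "HOU"))
filter(flights, dest == "IAH" | dest == "HOU")  # same result
filter(flights, carrier %in% c("UA", "AA", "DL"))
filter(flights, month %in% 7:9)
filter(flights, arr_delay > 120 & dep_delay <= 0)
filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30)  # time made up in flight is dep_delay - arr_delay
filter(flights, dep_time <= 600 | dep_time == 2400)  # midnight is recorded as 2400
My first attempt at the last one was midnight1 <- filter(flights, hour %in% c(0:5) | (hour == 6 & minute == 0)), and I was not sure about it.
Is something wrong with the data in this dataset? Why don't the hour and minute values line up with the time in time_hour?
Got it: hour and minute are the scheduled times, no wonder they look so regular. That is why the filter above uses dep_time instead.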
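The "made up over 30 minutes in flight" condition is the easiest one to get wrong, so here is a minimal sketch on a made-up three-row table. The tibble `toy` and its values are invented for illustration and are not taken from flights; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data: three flights with delays in minutes
toy <- tibble(
  id        = c("a", "b", "c"),
  dep_delay = c(60, 90, 70),
  arr_delay = c(20, 80, 45)
)

# Minutes made up in the air = dep_delay - arr_delay
made_up <- filter(toy, dep_delay >= 60, dep_delay - arr_delay > 30)
made_up$id
```

Only flight "a" is kept: it made up 40 minutes, while "b" and "c" made up 10 and 25.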
- Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
between(x, left, right)  # a shortcut for x >= left & x <= right
# For whole-number x this gives the same result as x %in% c(left:right), when left and right are numeric.
filter(flights, between(month, 7, 9))  # the summer filter, simplified
%in% is more general: the vector it matches against does not have to be numeric.
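A small sketch of where between() and %in% agree and where they differ. The vector `x` is made up for illustration; the sketch assumes dplyr is installed.

```r
library(dplyr)

x <- c(1, 2.5, 7, 10)

in_range <- between(x, 2, 7)  # same as x >= 2 & x <= 7
in_set   <- x %in% 2:7        # exact matches against 2, 3, ..., 7 only

in_range
in_set
```

The two differ at 2.5: it lies inside the range, but it is not one of the integers 2 to 7.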
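- How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
filter(flights, is.na(dep_time))  # the rows with a missing dep_time
dep_delay, arr_time and arr_delay are missing in these rows as well.
They probably represent cancelled flights (flights that never took off).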
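- Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
Is it about the order of evaluation? The rules of the logical and arithmetic operators seem to take precedence: any number to the power 0 is 1; | is TRUE as soon as either side is TRUE; & is FALSE as soon as either side is FALSE. So the result is not missing whenever it would be the same for every possible value of the NA. NA * 0 stays NA because the unknown value could be Inf, and Inf * 0 is NaN rather than 0.
(Not completely sure about the order-of-evaluation part.)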
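The rule can be checked directly in base R. This is only a quick sketch; the list name `res` is made up for illustration.

```r
# Results that do not depend on the missing value are not NA
res <- list(
  pow   = NA^0,        # anything to the power 0 is 1
  or    = NA | TRUE,   # TRUE whatever the other side is
  and   = FALSE & NA,  # FALSE whatever the other side is
  times = NA * 0,      # stays NA: the answer is not the same for every number
  inf   = Inf * 0      # NaN, which is why NA * 0 cannot safely be 0
)
res
```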
3.Arrange rows with arrange()
arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
arrange(flights, year, month, day)  # sort by year first, then by month within the same year, then by day within the same month
Use desc() to re-order by a column in descending order:
arrange(flights, desc(dep_delay))  # sort by dep_delay, largest first
Missing values are always sorted at the end.
Exercises:
- How could you use arrange() to sort all missing values to the start? (Hint: use is.na().)
arrange(df, desc(is.na(x)))  # is.na() returns TRUE (1) for missing values and FALSE (0) otherwise, so sorting it in descending order puts the NA rows first
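A minimal sketch of the is.na() trick on a made-up data frame (`df` and its values are hypothetical; dplyr is assumed to be installed). Adding `x` as a second sort key keeps the non-missing values in ascending order after the NAs.

```r
library(dplyr)

# Hypothetical data frame with one missing value
df <- tibble(x = c(5, 2, NA))

na_last  <- arrange(df, x)                   # default: NA sorted to the end
na_first <- arrange(df, desc(is.na(x)), x)   # NA first, then ascending x

na_last$x
na_first$x
```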
- Sort flights to find the most delayed flights. Find the flights that left earliest.
arrange(flights, desc(dep_delay), desc(arr_delay))  # hmm, so which column decides "most delayed"? The flight it finds seems to be the largest on both
arrange(flights, dep_time)  # not sure; this gives the earliest clock time, while arrange(flights, dep_delay) gives the flights that left furthest ahead of schedule
- Sort flights to find the fastest flights.
arrange(flights, desc(distance / air_time))
- Which flights travelled the longest? Which travelled the shortest?
arrange(flights, is.na(dep_time), desc(distance))
arrange(flights, is.na(dep_time), distance)  # without is.na(dep_time), the result would include flights that never actually took off
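4.Select columns with select()
select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))
select(flights, starts_with("dep"))
select(flights, ends_with("delay"))
select(flights, matches("(.)\\1"))
rename(flights, tail_num = tailnum)  # what if I cannot get the old name back after rename()? rename() returns a new data frame, so flights itself is unchanged unless the result is assigned back to it
select(flights, time_hour, air_time, everything())  # move the chosen columns to the front and keep all the other columns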
#Exercises
select(flights, starts_with("dep"), starts_with("arr"))
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, year, year)  # the column appears only once, it is not duplicated
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
#one_of(): select variables in character vector.
select(flights, one_of(vars))  # all five columns come out, so this seems to be equivalent to
select(flights, year, month, day, dep_delay, arr_delay)
#contains(match, ignore.case = TRUE, vars = peek_vars())
select(flights, contains("TIME", ignore.case = FALSE))  # override the default; with case-sensitive matching no column name contains "TIME"
Exercises
- Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
As above.
- What happens if you include the name of a variable multiple times in a select() call?
- What does the one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
- Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
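A small sketch of selecting by a character vector and of the case-sensitivity default, on a made-up one-row table. The tibble `toy` and the name "not_a_column" are hypothetical. The sketch uses any_of(), which assumes a reasonably recent dplyr/tidyselect; it is used here in place of one_of() because it silently skips names that are not in the data.

```r
library(dplyr)

# Hypothetical stand-in for a few flights columns
toy <- tibble(year = 2013, month = 1, dep_time = 517, arr_time = 830)

vars <- c("year", "month", "not_a_column")

# any_of() selects the names that exist and ignores the rest
picked <- names(select(toy, any_of(vars)))

# contains() ignores case by default ...
ci <- names(select(toy, contains("TIME")))
# ... unless ignore.case = FALSE is passed
cs <- names(select(toy, contains("TIME", ignore.case = FALSE)))

picked
ci
cs
```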
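5.Add new variables with mutate()
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)  # the narrower data frame the book builds before these examples
mutate(flights_sml, gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours)
transmute(flights_sml, gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours)  # the output keeps only the explicitly mentioned variables and the newly created ones
transmute(flights, dep_time, hour = dep_time %/% 100, minute = dep_time %% 100)  # %/% is the integer quotient, %% is the remainder
# I did not understand what lead() and lag() are for?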
1.Useful creation functions
Each of these operates on a vector and returns a vector of the same length.
1.Arithmetic operators: +, -, *, /, ^.
2.Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y).
3.Logs: log(), log2(), log10().
4.Offsets: lead() and lag() allow you to refer to leading or lagging values. They find the "next" or "previous" values in a vector, which is useful for comparing values ahead of or behind the current values.
> x<-runif(5)
> cbind(ahead=lead(x),x,behind=lag(x))
         ahead          x     behind
[1,] 0.3001377 0.01974997         NA
[2,] 0.2235623 0.30013771 0.01974997
[3,] 0.2873173 0.22356229 0.30013771
[4,] 0.2258159 0.28731729 0.22356229
[5,]        NA 0.22581594 0.28731729
> #roughly: for each position, lead() gives the next value in the vector and lag() gives the previous one
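One common use of lag() is computing the change from the previous element. A minimal sketch on a made-up vector, assuming dplyr is installed; the functions are called as dplyr::lag() and dplyr::lead() so the example does not depend on which lag() is currently on the search path.

```r
library(dplyr)

x <- c(1, 3, 6, 10)

nxt    <- dplyr::lead(x)     # next value at each position
prev   <- dplyr::lag(x)      # previous value at each position
change <- x - dplyr::lag(x)  # difference from the previous element

nxt
prev
change
```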
5.Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means.
> x<-c(1:10)
> roll_mean(x)#roll_mean() and roll_sum() come from the RcppRoll package; with the default window of n = 1 they return x unchanged
 [1]  1  2  3  4  5  6  7  8  9 10
> roll_sum(x)
 [1]  1  2  3  4  5  6  7  8  9 10
> cumsum(x)
 [1]  1  3  6 10 15 21 28 36 45 55
> cummean(x)
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
#note the difference between rolling and cumulative aggregates
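To make the rolling versus cumulative distinction visible, the window has to be wider than one element. The sketch below uses only base R: the cumulative mean is built from cumsum(), and a 3-element rolling mean is built with stats::filter(). The window width of 3 is an arbitrary choice for illustration, and this is a substitute for roll_mean(), not how RcppRoll implements it.

```r
x <- 1:10

# Cumulative mean: average of everything seen so far
cum_mean <- cumsum(x) / seq_along(x)

# Rolling mean over the current and two previous values.
# stats::filter() is spelled out in full because dplyr masks filter().
roll_mean3 <- as.numeric(stats::filter(x, rep(1 / 3, 3), sides = 1))

cum_mean
roll_mean3
```

The first two rolling values are NA because a full window of three is not available yet.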
6.Logical comparisons: <, <=, >, >=, !=, and ==.
7.Ranking:
> y<-c(1,2,2,NA,4,5)
> min_rank(y)
[1]  1  2  2 NA  4  5
#returns the rank of the value at each position (1st, 2nd, ...)
> min_rank(desc(y))
[1]  5  3  3 NA  2  1
> z<-c(5,4,NA,2,2,1)
> min_rank(z)
[1]  5  4 NA  2  2  1
#desc() does not simply reverse the vector; it negates the values, so the underlying ordering is reversed
> desc(y)
[1] -1 -2 -2 NA -4 -5
> min_rank(desc(y))
[1]  5  3  3 NA  2  1
If min_rank() doesn't do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile().
> y<-c(1,1,3,NA,5,5,7)
> min_rank(y)
[1]  1  1  3 NA  4  4  6
#ties get the same rank, and the next rank skips ahead (1, 1, 3)
> min_rank(desc(y))
[1]  5  5  4 NA  2  2  1
> row_number(y)
[1]  1  2  3 NA  4  5  6
#ties get different ranks, so no rank is repeated
> dense_rank(y)
[1]  1  1  2 NA  3  3  4
#"dense" presumably means dense ranking: ties share a rank and the next rank follows immediately (1, 1, 2)
> percent_rank(y)
[1] 0.0 0.0 0.4  NA 0.6 0.6 1.0
#same ranking rule as min_rank(), rescaled so that rank 1 becomes 0 and the largest becomes 1
> cume_dist(y)
[1] 0.3333333 0.3333333 0.5000000        NA 0.8333333 0.8333333 1.0000000
#the proportion of non-missing values that are less than or equal to the current value
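The two rescaled ranks can be reproduced from base R's rank(), which makes the definitions explicit. This is a sketch of the arithmetic on the same vector y, not a claim about how dplyr implements these functions; the variable names are made up.

```r
y <- c(1, 1, 3, NA, 5, 5, 7)
n <- sum(!is.na(y))  # number of non-missing values

# Ties share the smallest / largest rank in their group; NA stays NA
r_min <- rank(y, ties.method = "min", na.last = "keep")
r_max <- rank(y, ties.method = "max", na.last = "keep")

pct  <- (r_min - 1) / (n - 1)  # matches the percent_rank(y) output
cume <- r_max / n              # matches the cume_dist(y) output

pct
cume
```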
2.Exercises
- Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they're not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
> transmute(flights,dep_time,deptime=dep_time%/%100*60+dep_time%%100,arr_time,arrtime=arr_time%/%100*60+arr_time%%100)
# A tibble: 336,776 x 4
   dep_time deptime arr_time arrtime
      <int>   <dbl>    <int>   <dbl>
 1      517     317      830     510
 2      533     333      850     530
 3      542     342      923     563
 4      544     344     1004     604
 5      554     354      812     492
 6      554     354      740     460
 7      555     355      913     553
 8      557     357      709     429
 9      557     357      838     518
10      558     358      753     473
# ... with 336,766 more rows
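The conversion can be wrapped in a small helper. The function name `to_minutes` is made up for illustration. The extra `%% 1440` is an addition to the formula above: it maps a clock time of 2400 (midnight) to 0 instead of 1440.

```r
# Convert HHMM clock times to minutes since midnight
to_minutes <- function(t) {
  (t %/% 100 * 60 + t %% 100) %% 1440
}

mins <- to_minutes(c(517, 1004, 2400))
mins
```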
- Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
> transmute(flights,dep_time,deptime=dep_time%/%100*60+dep_time%%100,arr_time,arrtime=arr_time%/%100*60+arr_time%%100,air_time,airtime=arrtime-deptime)
# A tibble: 336,776 x 6
   dep_time deptime arr_time arrtime air_time airtime
      <int>   <dbl>    <int>   <dbl>    <dbl>   <dbl>
 1      517     317      830     510      227     193
 2      533     333      850     530      227     197
 3      542     342      923     563      160     221
 4      544     344     1004     604      183     260
 5      554     354      812     492      116     138
 6      554     354      740     460      150     106
 7      555     355      913     553      158     198
 8      557     357      709     429       53      72
 9      557     357      838     518      140     161
10      558     358      753     473      138     115
# ... with 336,766 more rows
#so why do they still not match? How is air_time computed?
#Most likely: dep_time and arr_time are local clock times, so flights that cross time zones are off by whole hours, and arr_time - dep_time also includes time on the ground, while air_time counts only the minutes in the air.
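One part of the mismatch that can be fixed with arithmetic alone is flights that land after midnight. A minimal sketch on two made-up flights (the times and the helper `to_minutes` are hypothetical); it does not address time zones or taxi time.

```r
# Convert HHMM clock times to minutes since midnight
to_minutes <- function(t) (t %/% 100 * 60 + t %% 100) %% 1440

dep <- to_minutes(c(517, 2330))  # second flight departs at 23:30
arr <- to_minutes(c(830, 115))   # ... and lands at 01:15 the next day

naive   <- arr - dep             # negative for the overnight flight
wrapped <- (arr - dep) %% 1440   # wraps around midnight

naive
wrapped
```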
- Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
> transmute(flights,dep_time,deptime=dep_time%/%100*60+dep_time%%100,sched_dep_time,schedtime=sched_dep_time%/%100*60+sched_dep_time%%100,dep_delay,pseudo=dep_time-sched_dep_time,delay=deptime-schedtime)
#subtracting the raw HHMM values directly (pseudo) is wrong; the difference in minutes since midnight (delay) matches dep_delay
# A tibble: 336,776 x 7
   dep_time deptime sched_dep_time schedtime dep_delay pseudo delay
      <int>   <dbl>          <int>     <dbl>     <dbl>  <int> <dbl>
 1      517     317            515       315         2      2     2
 2      533     333            529       329         4      4     4
 3      542     342            540       340         2      2     2
 4      544     344            545       345        -1     -1    -1
 5      554     354            600       360        -6    -46    -6
 6      554     354            558       358        -4     -4    -4
 7      555     355            600       360        -5    -45    -5
 8      557     357            600       360        -3    -43    -3
 9      557     357            600       360        -3    -43    -3
10      558     358            600       360        -2    -42    -2
# ... with 336,766 more rows
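- Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().
filter(flights, min_rank(desc(dep_delay)) <= 10)  # ties share a rank, so this can return more than 10 rows
arrange(flights, min_rank(desc(dep_delay)))  # my original attempt: sorts by rank but keeps every row
arrange(flights, min_rank(desc(arr_delay)))  # the same, by arrival delay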
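How ties are handled depends on the ranking function. A minimal sketch on a made-up table (`toy` and its values are hypothetical; dplyr is assumed to be installed), asking for the single most delayed row:

```r
library(dplyr)

# Hypothetical data: ids 2 and 3 are tied for the largest delay
toy <- tibble(id = 1:5, delay = c(10, 50, 50, 5, 30))

# min_rank() gives tied values the same rank, so both tied rows are kept
top_min <- filter(toy, min_rank(desc(delay)) <= 1)

# row_number() breaks ties by order of appearance, so exactly one row is kept
top_row <- filter(toy, row_number(desc(delay)) <= 1)

top_min$id
top_row$id
```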
- What does 1:3 + 1:10 return? Why?
> 1:3+1:10
 [1]  2  4  6  5  7  9  8 10 12 11
Warning message:
In 1:3 + 1:10 :
  longer object length is not a multiple of shorter object length
#= c(1,2,3,1,2,3,1,2,3,1) + (1:10): the shorter vector is recycled to the length of the longer one
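The recycling can be spelled out by hand with rep_len() in base R and compared with what R does automatically. The variable names are made up for illustration.

```r
short <- 1:3
long  <- 1:10

# Repeat the shorter vector until it is as long as the longer one
recycled <- rep_len(short, length(long))

manual <- recycled + long
auto   <- suppressWarnings(short + long)  # same sum, warning silenced

recycled
identical(manual, auto)
```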
- What trigonometric functions does R provide?
cos(x) sin(x) tan(x)
acos(x) asin(x) atan(x)
atan2(y, x)
cospi(x) sinpi(x) tanpi(x) (these compute cos(pi*x), sin(pi*x), and tan(pi*x))
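A short base R sketch of why the *pi variants exist and what atan2() returns. The variable names are made up for illustration.

```r
s_plain <- sin(pi)       # tiny non-zero number, because pi is stored approximately
s_exact <- sinpi(1)      # exactly 0
c_exact <- cospi(1)      # exactly -1

# atan2(y, x) is the angle of the point (x, y), here 45 degrees
angle <- atan2(1, 1)

c(s_plain, s_exact, c_exact, angle)
```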