R for Data Science notes: data transformation 1

Workflow:Basics

4.Practice

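1.The "i"? The second time, the variable name is typed with a different character (a dotless "ı"), so R cannot find the object.

2.??? (not sure)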

`ggplot(data=mpg)+geom_point(mapping=aes(x=displ,y=hwy),data=filter(mpg,cyl==8))`

`ggplot(data=diamonds)+geom_bar(mapping=aes(x=cut),data=filter(diamonds,carat>3))`

3.Press Alt + Shift + K. What happens? How can you get to the same place
using the menus?

It opens the keyboard shortcut reference.

Menu: Tools -> Keyboard Shortcuts Help

Data: Transformation

1.Introduction

-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

This tells you that dplyr masks some functions in base R. If you want to use the base versions of these functions after loading dplyr, you’ll need to use their full names: `stats::filter()` and `stats::lag()`.
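A small sketch of what the masked function does, using only base R. The vector `x` is a made-up example. Calling `stats::filter()` with its full name works whether or not dplyr is loaded:

```r
# stats::filter() is a linear filter for time series,
# unrelated to dplyr::filter(), which picks rows of a data frame.
x <- c(1, 2, 3, 4, 5)

# Centered 3-point moving average; the two ends have no full window, so they are NA.
moving_avg <- stats::filter(x, rep(1 / 3, 3))
print(moving_avg)
```
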

2.Filter rows with filter()

2.4 Exercise

  1. Find all flights that

    1. Had an arrival delay of two or more hours

    2. Flew to Houston (IAH or HOU)

    3. Were operated by United, American, or Delta

    4. Departed in summer (July, August, and September)

    5. Arrived more than two hours late, but didn’t leave late

    6. Were delayed by at least an hour, but made up over 30 minutes in flight

    7. Departed between midnight and 6am (inclusive)

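      filter(flights,arr_delay>=120)
      filter(flights,dest %in% c("IAH","HOU"))
      filter(flights,dest=="IAH"|dest=="HOU")# same result
      filter(flights,carrier %in% c("UA","AA","DL"))
      filter(flights,month %in% 7:9)# month is an integer column, so compare with numbers rather than strings
      filter(flights,arr_delay>120&dep_delay<=0)
      filter(flights,dep_delay>=60&dep_delay-arr_delay>30)# left at least an hour late, but recovered more than 30 minutes in the air
      midnight1<-filter(flights,dep_time<=600|dep_time==2400)# dep_time is the actual departure time; midnight is recorded as 2400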
      

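Is there something wrong with the data set itself? Why do the hour and minute values not match the actual times?
Got it: hour and minute are the scheduled departure time, no wonder they look so regular. That is also why item 7 should filter on dep_time rather than on hour and minute.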

  1. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

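    between(x, left, right)
    # This is a shortcut for x >= left & x <= right.
    # When x holds whole numbers it gives the same result as:
    x %in% left:right
    # e.g. filter(flights, between(month, 7, 9)) for the summer months


    %in% is more general in one respect: the values it matches against need not be numeric. It only matches exact values, though, not a range.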

  2. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

    1. Count them with sum(is.na(flights$dep_time)). In those rows dep_delay, arr_time, arr_delay and air_time are missing as well.
      These rows probably represent cancelled flights (they never took off).
  3. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

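Is it about the order of evaluation? Not quite. The general rule: if the result would be the same for every value the NA could stand for, the result is not missing. Any number to the power 0 is 1; | is TRUE as soon as either side is TRUE; & is FALSE as soon as either side is FALSE. NA * 0 is the counterexample because the unknown value could be Inf, and Inf * 0 is NaN rather than 0, so the result cannot be determined and stays NA.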
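The rule can be checked directly in the console. A minimal base R sketch (the variable names are only for illustration):

```r
power_zero <- NA ^ 0       # 1: x^0 is 1 for every x
or_true    <- NA | TRUE    # TRUE: the other side already decides the result
and_false  <- FALSE & NA   # FALSE: the other side already decides the result
times_zero <- NA * 0       # NA: the answer depends on the unknown value
inf_case   <- Inf * 0      # NaN: the case that keeps NA * 0 undetermined

print(list(power_zero, or_true, and_false, times_zero, inf_case))
```
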

3.Arrange rows with arrange()

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

arrange(flights, year, month, day)# sort by year, then by month within each year, then by day within each month
arrange(flights, desc(dep_delay))# sort by dep_delay in descending order

Use desc() to re-order by a column in descending order

Missing values are always sorted at the end:
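A minimal sketch of that behaviour, assuming dplyr is installed. The data frame `df` and its values are made up for illustration:

```r
library(dplyr)

# Toy data frame (illustrative values, not taken from flights)
df <- tibble(x = c(5, 2, NA))

ascending  <- arrange(df, x)        # NA is placed last
descending <- arrange(df, desc(x))  # NA is still placed last

print(ascending)
print(descending)
```
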

Exercise:

  1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

    arrange(df,desc(is.na(x)))# is.na() returns TRUE (1) for missing values and FALSE (0) otherwise, so sorting it in descending order puts the NA rows first
    
  2. Sort flights to find the most delayed flights. Find the flights that left earliest.

    arrange(flights,desc(dep_delay),desc(arr_delay))# which column defines "most delayed" is a judgment call; the top flight here seems to be the largest on both
    arrange(flights,dep_time)# earliest by clock time; arrange(flights,dep_delay) gives the flights that left furthest ahead of schedule
    
  3. Sort flights to find the fastest flights.

    arrange(flights,desc(distance/air_time))

  4. Which flights travelled the longest? Which travelled the shortest?

    arrange(flights,is.na(dep_time),desc(distance))
    arrange(flights,is.na(dep_time),distance)# without is.na(dep_time), flights that never actually took off would be included

4.Select columns with select()

select(flights,year,month,day)
select(flights,year:day)
select(flights,-(year:day))
select(flights,starts_with("dep"))
select(flights,ends_with("delay"))
select(flights,matches("(.)\\1"))
rename(flights,tail_num=tailnum)# rename() returns a new data frame; flights itself is unchanged unless the result is assigned back
select(flights,time_hour,air_time,everything())# moves the selected columns to the front and keeps all the others
# Exercise
select(flights,starts_with("dep"),starts_with("arr"))
select(flights,dep_time,dep_delay,arr_time,arr_delay)
select(flights,year,year)# the column appears only once; the repeat is ignored
vars<-c("year","month","day","dep_delay","arr_delay")
select(flights,one_of(vars))# all five columns are returned, so this is equivalent to
# one_of(): select the variables named in a character vector (superseded by any_of() and all_of() in current dplyr)
select(flights,year,month,day,dep_delay,arr_delay)
# contains(match, ignore.case = TRUE, vars = peek_vars())
select(flights,contains("TIME",ignore.case=FALSE))# override the default so the match becomes case-sensitive

Exercise

  1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
    See above.

  2. What happens if you include the name of a variable multiple times in a select() call?

  3. What does the one_of() function do? Why might it be helpful in conjunction with this vector?

    vars <- c("year", "month", "day", "dep_delay", "arr_delay")
    
  4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

    select(flights, contains("TIME"))
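A small sketch of the case handling and of selecting from a character vector, assuming a reasonably recent dplyr (1.0 or later, for `any_of()`). The tibble `toy` and the name `not_a_column` are made up for illustration:

```r
library(dplyr)

# Toy data with flights-like column names (values are illustrative)
toy <- tibble(year = 2013, dep_time = 517, arr_time = 830, air_time = 227)

# contains() ignores case by default, so "TIME" matches the lower-case names
loose <- names(select(toy, contains("TIME")))

# With ignore.case = FALSE nothing matches
strict <- names(select(toy, contains("TIME", ignore.case = FALSE)))

# any_of() selects whichever of the named columns exist and skips the rest
wanted  <- c("year", "dep_time", "not_a_column")
present <- names(select(toy, any_of(wanted)))

print(loose)
print(strict)
print(present)
```

Keeping the column names in a vector like `wanted` is handy when the same set is reused in several calls or built programmatically.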
    

5.Add new variables with mutate()

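flights_sml<-select(flights,year:day,ends_with("delay"),distance,air_time)# narrower table, as in the book, so the new columns are visible
mutate(flights_sml,gain=dep_delay-arr_delay,hours=air_time/60,gain_per_hour=gain/hours)
transmute(flights_sml,gain=dep_delay-arr_delay,hours=air_time/60,gain_per_hour=gain/hours)# keeps only the variables mentioned explicitly and the newly created ones
transmute(flights,dep_time,hour=dep_time%/%100,minute=dep_time%%100)# %/% is integer division, %% is the remainder
# lead() and lag(): not clear at this point; see Offsets below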

1.Useful creation functions

These functions operate on a vector and return a vector of the same length.

1.Arithmetic operators: +, -, *, /, ^.

2.Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y).

3.Logs: log(), log2(), log10().

4.Offsets: lead() and lag() allow you to refer to leading or lagging values.

Find the "next" or "previous" values in a vector. Useful for comparing values ahead of or behind the current values.

> x<-runif(5)
> cbind(ahead=lead(x),x,behind=lag(x))
         ahead          x     behind
[1,] 0.3001377 0.01974997         NA
[2,] 0.2235623 0.30013771 0.01974997
[3,] 0.2873173 0.22356229 0.30013771
[4,] 0.2258159 0.28731729 0.22356229
[5,]        NA 0.22581594 0.28731729
> # lead() returns the next value in the vector and lag() the previous one

5.Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means.

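> x<-c(1:10)
> roll_mean(x)# not a dplyr function; assuming RcppRoll, whose default window is n = 1, so x comes back unchanged
 [1]  1  2  3  4  5  6  7  8  9 10
> roll_sum(x)
 [1]  1  2  3  4  5  6  7  8  9 10
> cumsum(x)
 [1]  1  3  6 10 15 21 28 36 45 55
> cummean(x)
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
> # keep rolling (fixed-size window) and cumulative (everything so far) apart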
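To make the rolling versus cumulative distinction visible, here is a base R sketch with a window of 3. Using `stats::filter()` for the rolling sum is my own substitution so that no extra package is needed; it is not what the notes ran:

```r
x <- 1:10

# Cumulative: each element is the sum of everything up to that position
cumulative_sum <- cumsum(x)

# Rolling: each element is the sum of the current value and the two before it;
# sides = 1 means the window only looks backwards, so the first two are NA
rolling_sum <- as.numeric(stats::filter(x, rep(1, 3), sides = 1))

print(cumulative_sum)
print(rolling_sum)
```
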

6.Logical comparisons: <, <=, >, >=, !=, ==.

7.Ranking:

> y<-c(1,2,2,NA,4,5)
> min_rank(y)
[1]  1  2  2 NA  4  5
> # each value is replaced by its rank (1st, 2nd, ...); ties share the lowest rank and NA stays NA
> min_rank(desc(y))
[1]  5  3  3 NA  2  1
> z<-c(5,4,NA,2,2,1)
> min_rank(z)
[1]  5  4 NA  2  2  1
> # desc() does not simply reverse the vector; it negates the values so that their ordering is reversed
> desc(y)
[1] -1 -2 -2 NA -4 -5

If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile().

> y<-c(1,1,3,NA,5,5,7)
> min_rank(y)
[1]  1  1  3 NA  4  4  6
> # ties share a rank and the ranks after them are skipped (1, 1, 3)
> min_rank(desc(y))
[1]  5  5  4 NA  2  2  1
> row_number(y)
[1]  1  2  3 NA  4  5  6
> # ties get different ranks, so no rank is repeated
> dense_rank(y)
[1]  1  1  2 NA  3  3  4
> # "dense" ranking: ties share a rank and the next rank follows immediately (1, 1, 2)
> percent_rank(y)
[1] 0.0 0.0 0.4  NA 0.6 0.6 1.0
> # same ranking as min_rank(), rescaled to [0, 1]: (min_rank - 1) / (n - 1), with n the number of non-missing values
> cume_dist(y)
[1] 0.3333333 0.3333333 0.5000000        NA 0.8333333 0.8333333 1.0000000
> # proportion of non-missing values less than or equal to each value (2/6, 3/6, 5/6, 6/6)

2.Exercise

  1. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

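    # the times are stored as HHMM, so split them into hours and minutes; note that midnight (2400) becomes 1440, not 0
    transmute(flights,dep_time,deptime=dep_time%/%100*60+dep_time%%100,arr_time,arrtime=arr_time%/%100*60+arr_time%%100)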
    # A tibble: 336,776 x 4
       dep_time deptime arr_time arrtime
          <int>   <dbl>    <int>   <dbl>
     1      517     317      830     510
     2      533     333      850     530
     3      542     342      923     563
     4      544     344     1004     604
     5      554     354      812     492
     6      554     354      740     460
     7      555     355      913     553
     8      557     357      709     429
     9      557     357      838     518
    10      558     358      753     473
    # ... with 336,766 more rows
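The same conversion can be wrapped in a small helper. This is a sketch in base R; the name `to_minutes` is hypothetical, and the extra `%% 1440` is my addition so that midnight, stored as 2400, maps to 0:

```r
# Convert clock times stored as HHMM integers to minutes since midnight.
to_minutes <- function(hhmm) {
  (hhmm %/% 100 * 60 + hhmm %% 100) %% 1440
}

minutes <- to_minutes(c(517, 600, 2400, NA))
print(minutes)
```

Such a helper could then be applied to both columns, for example `mutate(flights, dep_min = to_minutes(dep_time), sched_dep_min = to_minutes(sched_dep_time))`, where the new column names are again only illustrative.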
    
  2. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

    >transmute(flights,dep_time,deptime=dep_time%/%100*60+dep_time%%100,arr_time,arrtime=arr_time%/%100*60+arr_time%%100,air_time,airtime=arrtime-deptime)
    # A tibble: 336,776 x 6
       dep_time deptime arr_time arrtime air_time airtime
          <int>   <dbl>    <int>   <dbl>    <dbl>   <dbl>
     1      517     317      830     510      227     193
     2      533     333      850     530      227     197
     3      542     342      923     563      160     221
     4      544     344     1004     604      183     260
     5      554     354      812     492      116     138
     6      554     354      740     460      150     106
     7      555     355      913     553      158     198
     8      557     357      709     429       53      72
     9      557     357      838     518      140     161
    10      558     358      753     473      138     115
    # ... with 336,766 more rows
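    # So why do they still not match? air_time only counts time in the air (no taxiing), arr_time and dep_time are local clock times that can be in different time zones, and overnight flights cross midnight, so arr_time - dep_time is not expected to equal air_time.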
    
  3. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

    >transmute(flights,dep_time,deptime=dep_time%/%100*60+dep_time%%100,sched_dep_time,schedtime=sched_dep_time%/%100*60+sched_dep_time%%100,dep_delay,pseudo=dep_time-sched_dep_time,delay=deptime-schedtime)# subtracting the raw HHMM values directly (pseudo) is wrong; convert to minutes first (delay)
    # A tibble: 336,776 x 7
       dep_time deptime sched_dep_time schedtime dep_delay pseudo delay
          <int>   <dbl>          <int>     <dbl>     <dbl>  <int> <dbl>
     1      517     317            515       315         2      2     2
     2      533     333            529       329         4      4     4
     3      542     342            540       340         2      2     2
     4      544     344            545       345        -1     -1    -1
     5      554     354            600       360        -6    -46    -6
     6      554     354            558       358        -4     -4    -4
     7      555     355            600       360        -5    -45    -5
     8      557     357            600       360        -3    -43    -3
     9      557     357            600       360        -3    -43    -3
    10      558     358            600       360        -2    -42    -2
    # ... with 336,766 more rows
    
  4. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

    filter(flights,min_rank(desc(dep_delay))<=10)# ties share a rank, so more than 10 rows can come back
    filter(flights,min_rank(desc(arr_delay))<=10)# the same, ranked by arrival delay
    
  5. What does 1:3 + 1:10 return? Why?

    > 1:3+1:10
     [1]  2  4  6  5  7  9  8 10 12 11
    Warning message:
    In 1:3 + 1:10 :
      longer object length is not a multiple of shorter object length
    # = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1) + 1:10: the shorter vector is recycled to the length of the longer one
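The recycling can be written out by hand to confirm it. A minimal base R sketch (variable names are illustrative):

```r
short <- 1:3
long  <- 1:10

# Recycle the shorter vector explicitly to the length of the longer one
recycled <- rep(short, length.out = length(long))
explicit <- recycled + long

# R does the same implicitly, with a warning because 10 is not a multiple of 3
implicit <- suppressWarnings(short + long)

print(recycled)
print(identical(explicit, implicit))
```
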
    
  6. What trigonometric functions does R provide?

    cos(x) sin(x) tan(x)

    acos(x) asin(x) atan(x)
    atan2(y, x)

    cospi(x) sinpi(x) tanpi(x) (compute cos(pi*x), sin(pi*x), and tan(pi*x))
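A short base R sketch of why the `*pi` variants and `atan2()` exist. The inputs are chosen only for illustration:

```r
# sinpi(x) evaluates sin(pi * x) without first rounding pi * x,
# so exact multiples of pi give exact results.
exact   <- sinpi(1)
rounded <- sin(pi * 1)  # tiny non-zero value caused by the rounding of pi

# atan2(y, x) returns the angle of the point (x, y), taking the quadrant into account
angle <- atan2(1, -1)   # point (-1, 1) lies in the second quadrant

print(c(exact = exact, rounded = rounded))
print(angle / pi)
```
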
