[R語(yǔ)言] Join 連接《R for data science》 7

《R for Data Science》第十三章 Relational data 啃書知識(shí)點(diǎn)積累
參考鏈接:

  1. R for Data Science
  2. R語(yǔ)言中dplyr包join函數(shù)之目前我看到過(guò)的最形象的教程

Introduction

本章節(jié)的應(yīng)用前提需要了解nycflights13各數(shù)據(jù)集的內(nèi)容

library(nycflights13)

?flights # dplyr用的很熟悉就不查了
?airlines
?airports
?planes
?weather

畫模式圖

涉及了datamodelrDiagrammeR

library(datamodelr,DiagrammeR)

dm <- dm_from_data_frames(list(
  #  指定數(shù)據(jù)集來(lái)源
  Master = Lahman::Master,
  Managers = Lahman::Managers,
  AwardsManagers = Lahman::AwardsManagers
)) %>%
  # 設(shè)置主鍵
  dm_set_key("Master", "playerID") %>%
  dm_set_key("Managers", c("yearID", "teamID", "inseason")) %>%
  dm_set_key("AwardsManagers", c("playerID", "awardID", "yearID")) %>%
  # 設(shè)置外鍵
  dm_add_references(
    Managers$playerID == Master$playerID,
    AwardsManagers$playerID == Master$playerID
  )

# 繪圖
# rankdir可以指定'BT', 'RL'等
# columnArrows默認(rèn)為F,只是數(shù)據(jù)集間粗略箭頭
dm_create_graph(dm, rankdir = "LR", columnArrows = TRUE) %>%
  dm_render_graph()

There are three families of verbs designed to work with relational data:

  • Mutating joins, which add new variables to one data frame from matching observations in another.
  • Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.
  • Set operations, which treat observations as if they were set elements.

Keys

There are two types of keys:

  • A primary key uniquely identifies an observation in its own table.
  • A foreign key uniquely identifies an observation in another table.
  • 確認(rèn)是不是主鍵:分組后各組僅一個(gè)觀測(cè)
planes %>% 
  count(tailnum) %>% 
  filter(n > 1) %>% 
  nrow()
# [1] 0
  • 代理鍵的概念:

If a table lacks a primary key, it’s sometimes useful to add one with mutate() and row_number(). That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a surrogate key.

flights %>% 
  mutate(id = row_number()) %>% 
  select(id,everything())

Mutating joins

flights %>% 
  select(year:day, hour,tailnum, carrier) %>% 
  left_join(airlines, by = "carrier")


# 也可以利用mutate和match完成left_join相同操作摔竿,但麻煩很多
flights %>% 
  select(year:day, hour,tailnum, carrier) %>% 
  mutate(name = airlines$name[match(carrier, airlines$carrier)])

- inner_join

The most important property of an inner join is that unmatched rows are not included in the result.

- Outer joins

  1. A left join keeps all observations in x.
  2. A right join keeps all observations in y.
  3. A full join keeps all observations in x and y.

The left join should be your default join

  • base::merge()有類似join的功能

- Defining the key columns

  1. The default, by = NULL, uses all variables that appear in both tables, the so called natural join
  2. A character vector,by = "x". This is like a natural join, but uses only some of the common variables.
flights %>% 
  left_join(planes, by = "tailnum")
#> # A tibble: 336,776 x 16
#>   year.x month   day  hour origin dest  tailnum carrier year.y type 
#>    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>
#> 1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixe…
#> 2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixe…
#> 3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixe…
#> 4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixe…
#> 5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixe…
#> 6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixe…
#> # … with 3.368e+05 more rows, and 6 more variables: manufacturer <chr>,
#> #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
  1. A named character vector: by = c("a" = "b"). This will match variable a in table x to variable b in table y. The variables from x will be used in the output.
flights %>% 
  left_join(airports, c("origin" = "faa"))
#> # A tibble: 336,776 x 15
#>    year month   day  hour origin dest  tailnum carrier name    lat   lon   alt
#>   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
#> 1  2013     1     1     5 EWR    IAH   N14228  UA      Newa…  40.7 -74.2    18
#> 2  2013     1     1     5 LGA    IAH   N24211  UA      La G…  40.8 -73.9    22
#> 3  2013     1     1     5 JFK    MIA   N619AA  AA      John…  40.6 -73.8    13
#> 4  2013     1     1     5 JFK    BQN   N804JB  B6      John…  40.6 -73.8    13
#> 5  2013     1     1     6 LGA    ATL   N668DN  DL      La G…  40.8 -73.9    22
#> 6  2013     1     1     5 EWR    ORD   N39463  UA      Newa…  40.7 -74.2    18
#> # … with 3.368e+05 more rows, and 3 more variables: tz <dbl>, dst <chr>,
#> #   tzone <chr>

- Q: 畫出航班的大致起落

flights %>%
  inner_join(select(airports, origin = faa, ori_lat = lat, ori_lon = lon),
             by = "origin") %>% 
  inner_join(select(airports, dest = faa, des_lat = lat, des_lon = lon),
             by = "dest") %>% 
  # 取子集,類似filter(row_number() <= 100)
  slice(1:100) %>%
  ggplot(aes(
    x = ori_lon, xend = des_lon,
    y = ori_lat, yend = des_lat
  )) +
  # 美國(guó)地圖
  borders("state") +
  # 調(diào)整地圖到合適比例
  coord_quickmap() +
  geom_segment(arrow = arrow(length = unit(0.1, "cm"))) +
  labs(y = "Latitude", x = "Longitude")

Filtering joins

  1. semi_join(x, y) keeps all observations in x that have a match in y.
    Semi-joins are useful for matching filtered summary tables back to the original rows.
  2. anti_join(x, y) drops all observations in x that have a match in y.叉庐、
    Anti-joins are useful for diagnosing join mismatches.
x <- tribble(
  ~key, ~val_x,
  # --/---
    1, "x1",
    2, "x2",
    3, "x3"
)

y <- tribble(
  ~key, ~val_y,
  # --/---
    1, "y1",
    2, "y2",
    4, "y3"
)

semi_join(x,y,by='key')
# # A tibble: 2 x 2
#     key val_x
#   <dbl> <chr>
# 1     1 x1   
# 2     2 x2 

semi_join(y,x,by='key')
# # A tibble: 2 x 2
#     key val_x
#   <dbl> <chr>
# 1     1 y1   
# 2     2 y2

anti_join(x,y,by='key')
# # A tibble: 1 x 2
#     key val_x
#   <dbl> <chr>
# 1     3 x3 

anti_join(y,x,by='key')
# # A tibble: 1 x 2
#     key val_x
#   <dbl> <chr>
# 1     4 y3 


# 另一個(gè)例子,獲取飛往受歡迎目的地的所有航班
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  slice(1:10)

flights %>% 
  filter(dest %in% top_dest$dest)

# 用半連接
flights %>% 
  semi_join(top_dest)

Set operations

  1. intersect(x, y): return only observations in both x and y.
  2. union(x, y): return unique observations in x and y.
  3. setdiff(x, y): return observations in x, but not in y.

These expect the x and y inputs to have the same variables

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)

intersect(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1

union(df1, df2)
#> # A tibble: 3 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     1
#> 3     1     2

setdiff(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     2     1

setdiff(df2, df1)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     2
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末若专,一起剝皮案震驚了整個(gè)濱河市冷冗,隨后出現(xiàn)的幾起案子吱窝,更是在濱河造成了極大的恐慌,老刑警劉巖太颤,帶你破解...
    沈念sama閱讀 221,635評(píng)論 6 515
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件苞俘,死亡現(xiàn)場(chǎng)離奇詭異,居然都是意外死亡龄章,警方通過(guò)查閱死者的電腦和手機(jī)吃谣,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 94,543評(píng)論 3 399
  • 文/潘曉璐 我一進(jìn)店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)做裙,“玉大人岗憋,你說(shuō)我怎么就攤上這事∶” “怎么了仔戈?”我有些...
    開(kāi)封第一講書人閱讀 168,083評(píng)論 0 360
  • 文/不壞的土叔 我叫張陵,是天一觀的道長(zhǎng)拧廊。 經(jīng)常有香客問(wèn)我监徘,道長(zhǎng),這世上最難降的妖魔是什么吧碾? 我笑而不...
    開(kāi)封第一講書人閱讀 59,640評(píng)論 1 296
  • 正文 為了忘掉前任凰盔,我火速辦了婚禮,結(jié)果婚禮上倦春,老公的妹妹穿的比我還像新娘户敬。我一直安慰自己落剪,他們只是感情好,可當(dāng)我...
    茶點(diǎn)故事閱讀 68,640評(píng)論 6 397
  • 文/花漫 我一把揭開(kāi)白布尿庐。 她就那樣靜靜地躺著忠怖,像睡著了一般。 火紅的嫁衣襯著肌膚如雪抄瑟。 梳的紋絲不亂的頭發(fā)上凡泣,一...
    開(kāi)封第一講書人閱讀 52,262評(píng)論 1 308
  • 那天,我揣著相機(jī)與錄音锐借,去河邊找鬼问麸。 笑死,一個(gè)胖子當(dāng)著我的面吹牛钞翔,可吹牛的內(nèi)容都是我干的严卖。 我是一名探鬼主播,決...
    沈念sama閱讀 40,833評(píng)論 3 421
  • 文/蒼蘭香墨 我猛地睜開(kāi)眼布轿,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼哮笆!你這毒婦竟也來(lái)了?” 一聲冷哼從身側(cè)響起汰扭,我...
    開(kāi)封第一講書人閱讀 39,736評(píng)論 0 276
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤稠肘,失蹤者是張志新(化名)和其女友劉穎,沒(méi)想到半個(gè)月后萝毛,有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體项阴,經(jīng)...
    沈念sama閱讀 46,280評(píng)論 1 319
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 38,369評(píng)論 3 340
  • 正文 我和宋清朗相戀三年笆包,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了环揽。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點(diǎn)故事閱讀 40,503評(píng)論 1 352
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡庵佣,死狀恐怖歉胶,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情巴粪,我是刑警寧澤通今,帶...
    沈念sama閱讀 36,185評(píng)論 5 350
  • 正文 年R本政府宣布,位于F島的核電站肛根,受9級(jí)特大地震影響辫塌,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜派哲,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,870評(píng)論 3 333
  • 文/蒙蒙 一臼氨、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧狮辽,春花似錦一也、人聲如沸。這莊子的主人今日做“春日...
    開(kāi)封第一講書人閱讀 32,340評(píng)論 0 24
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)。三九已至树叽,卻和暖如春舆蝴,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背题诵。 一陣腳步聲響...
    開(kāi)封第一講書人閱讀 33,460評(píng)論 1 272
  • 我被黑心中介騙來(lái)泰國(guó)打工洁仗, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留,地道東北人性锭。 一個(gè)月前我還...
    沈念sama閱讀 48,909評(píng)論 3 376
  • 正文 我出身青樓赠潦,卻偏偏與公主長(zhǎng)得像,于是被迫代替她去往敵國(guó)和親草冈。 傳聞我的和親對(duì)象是個(gè)殘疾皇子她奥,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 45,512評(píng)論 2 359

推薦閱讀更多精彩內(nèi)容