R for Data Science || Data import with readr

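Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. In this chapter, you'll learn how to read plain-text rectangular files into R. Here, we only scratch the surface of data import, but many of the principles translate to other forms of data.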

library(tidyverse)
setwd("D:\\Users\\Administrator\\Desktop\\RStudio\\R-Programming")
heights <- read_csv("heights.csv")

Parsed with column specification:
cols(
  earn = col_double(),
  height = col_double(),
  sex = col_character(),
  ed = col_double(),
  age = col_double(),
  race = col_character()
)

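?read_csv    # comma-delimited files
?read_csv2   # semicolon-delimited files (common where "," is the decimal mark)
?read_tsv    # tab-delimited files
?read_delim  # files with any delimiter
?read_fwf    # fixed-width files
?read_log    # Apache-style log files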
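
As a rough illustration of how these readers differ, the sketch below feeds small inline strings to three of them. The sample data is made up for this example, and `col_types = cols()` is used only to silence the column-specification message.

```r
library(readr)

# Semicolon-delimited data with "," as the decimal mark (read_csv2's convention)
semi <- read_csv2(I("a;b\n1,5;2"), col_types = cols())

# Tab-delimited data
tabbed <- read_tsv(I("a\tb\n1\t2"), col_types = cols())

# Any single-character delimiter, here a pipe
piped <- read_delim(I("a|b\n1|2"), delim = "|", col_types = cols())

semi
tabbed
piped
```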

You can also supply an inline CSV file directly.

read_csv("a,b,c
          1,2,3
         4,5,6")


# A tibble: 2 x 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

Use skip = n to skip the first n lines, or use comment = "#" to drop all lines that start with (for example) #.

read_csv("The first line of metadata
  The second line of metadata
         x,y,z
         1,2,3", skip = 2)

# A tibble: 1 x 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     3

read_csv("# A comment I want to skip
  x,y,z
  1,2,3", comment = "#")

# A tibble: 1 x 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     3

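No column names: pass col_names = FALSE to have the columns labelled X1 to Xn, or pass a character vector to name them yourself.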

read_csv("1,2,3\n4,5,6", col_names = FALSE)

# A tibble: 2 x 3
     X1    X2    X3
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
# A tibble: 2 x 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6
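
These arguments can be combined. The sketch below, using made-up inline data, replaces a header that is already in the file by skipping it and supplying new names. It also shows `na`, an additional `read_csv()` argument not discussed above, which names the strings that should be read as missing values.

```r
library(readr)

# Replace the header found in the data: skip it, then supply new names
renamed <- read_csv(
  I("a,b,c\n1,2,3\n4,5,6"),
  skip = 1,
  col_names = c("x", "y", "z"),
  col_types = cols()
)

# Treat "." as a missing value
with_na <- read_csv(I("a,b,c\n1,2,."), na = ".", col_types = cols())

renamed
with_na
```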

Compared to base R

They are faster.
They produce tibbles; they don't convert character vectors to factors, use row names, or munge the column names.
They are more reproducible.
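
A minimal sketch of the difference in column-name handling, assuming a made-up inline CSV whose first column name contains a space. Note that since R 4.0.0 `read.csv()` no longer converts strings to factors by default, so that particular difference only appears on older versions of R.

```r
library(readr)

csv_text <- "my col,label\n1,a\n2,b"

base_df  <- read.csv(text = csv_text)
readr_df <- read_csv(I(csv_text), col_types = cols())

names(base_df)   # base R rewrites "my col" into a syntactic name
names(readr_df)  # readr keeps the name as written
class(readr_df)  # includes "tbl_df", i.e. a tibble
```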

Parsing a vector

str(parse_logical(c("TRUE", "FALSE", "NA")))
#>  logi [1:3] TRUE FALSE NA
str(parse_integer(c("1", "2", "3")))
#>  int [1:3] 1 2 3
str(parse_date(c("2010-01-01", "1979-10-14")))
#>  Date[1:2], format: "2010-01-01" "1979-10-14"


str(parse_integer(c("1", "2", "a")))
Warning: 1 parsing failure.
row col   expected actual
  3  -- an integer      a

 int [1:3] 1 2 NA
 - attr(*, "problems")=Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    1 obs. of  4 variables:
  ..$ row     : int 3
  ..$ col     : int NA
  ..$ expected: chr "an integer"
  ..$ actual  : chr "a"
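
When there are many failures, the printed attribute is hard to read. One way to inspect them, sketched here with the same input as above, is to pull them out with `problems()`, which returns a tibble. `suppressWarnings()` is used only to keep the example quiet.

```r
library(readr)

# The third value cannot be parsed as an integer, so it becomes NA
x <- suppressWarnings(parse_integer(c("1", "2", "a")))

failures <- problems(x)
failures$row     # position of the value that failed
failures$actual  # the text that could not be parsed
```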

Numbers

parse_double("1.23")
#> [1] 1.23
parse_double("1,23", locale = locale(decimal_mark = ","))
#> [1] 1.23

parse_number("$100")
#> [1] 100
parse_number("20%")
#> [1] 20
parse_number("It cost $123.45")
#> [1] 123.45


# Used in America
parse_number("$123,456,789")
#> [1] 1.23e+08

# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
#> [1] 1.23e+08

# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
#> [1] 1.23e+08
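
The decimal mark and the grouping mark can be set together. The sketch below assumes a made-up number written in the style common in much of continental Europe; the variable name `european` is only illustrative.

```r
library(readr)

# "." groups thousands and "," marks the decimals
european <- locale(decimal_mark = ",", grouping_mark = ".")

plain <- parse_number("1.234,56", locale = european)

# Surrounding non-numeric text is ignored, as in the "$100" and "20%" examples
embedded <- parse_number("Total: 1.234,56 EUR", locale = european)

plain
embedded
```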

Strings

#In R, we can get at the underlying representation of a string using charToRaw():

charToRaw("Hadley")
#> [1] 48 61 64 6c 65 79

x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"

x1
#> [1] "El Ni\xf1o was particularly bad this year"
x2
#> [1] "\x82\xb1\x82\xf1\x82?\xbf\x82\xcd"

parse_character(x1, locale = locale(encoding = "Latin1"))
#> [1] "El Niño was particularly bad this year"
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
#> [1] "こんにちは"

Detecting the encoding

guess_encoding(charToRaw(x1))
#> # A tibble: 2 x 2
#>   encoding   confidence
#>   <chr>           <dbl>
#> 1 ISO-8859-1       0.46
#> 2 ISO-8859-9       0.23
guess_encoding(charToRaw(x2))
#> # A tibble: 1 x 2
#>   encoding confidence
#>   <chr>         <dbl>
#> 1 KOI8-R         0.42
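
One way to check whether a re-encoding worked is base R's `validUTF8()`. The sketch below reuses the `x1` string from above and assumes, as the text does, that it is Latin1.

```r
library(readr)

x1 <- "El Ni\xf1o was particularly bad this year"

# The raw byte 0xf1 followed by "o" is not a valid UTF-8 sequence
validUTF8(x1)

# Re-encode from Latin1; readr returns UTF-8 strings
fixed <- parse_character(x1, locale = locale(encoding = "Latin1"))
validUTF8(fixed)
fixed
```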

Factors

fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
#> Warning: 1 parsing failure.
#> row col           expected   actual
#>   3  -- value in level set bananana
#> [1] apple  banana <NA>  
#> attr(,"problems")
#> # A tibble: 1 x 4
#>     row   col expected           actual  
#>   <int> <int> <chr>              <chr>   
#> 1     3    NA value in level set bananana
#> Levels: apple banana
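
If the set of valid values is not known in advance, `parse_factor()` can also be called with `levels = NULL`, in which case the levels are taken from the data in order of first appearance. The sketch below contrasts the two modes with made-up input.

```r
library(readr)

fruit <- c("apple", "banana")

# Known levels: the unexpected value becomes NA (with a warning)
checked <- suppressWarnings(
  parse_factor(c("apple", "banana", "bananana"), levels = fruit)
)
is.na(checked)

# Unknown levels: discovered from the data, in order of appearance
inferred <- parse_factor(c("banana", "apple", "banana"), levels = NULL)
levels(inferred)
```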

Dates, date-times, and times

parse_datetime("2010-10-01T2010")
#> [1] "2010-10-01 20:10:00 UTC"
# If time is omitted, it will be set to midnight
parse_datetime("20101010")
#> [1] "2010-10-10 UTC"

parse_date("2010-10-01")
#> [1] "2010-10-01"

library(hms)
parse_time("01:10 am")
#> 01:10:00
parse_time("20:10:01")
#> 20:10:01

parse_date("01/02/15", "%m/%d/%y")
#> [1] "2015-01-02"
parse_date("01/02/15", "%d/%m/%y")
#> [1] "2015-02-01"
parse_date("01/02/15", "%y/%m/%d")
#> [1] "2001-02-15"

parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
#> [1] "2015-01-01"
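
The same format pieces work for date-times and times. The sketch below uses made-up inputs: `%b` matches an abbreviated month name, `%H:%M` a 24-hour time, and `%I` with `%p` a 12-hour time with an AM/PM marker.

```r
library(readr)

# Abbreviated month name (default English locale)
d <- parse_date("2015-Feb-01", "%Y-%b-%d")

# Day-first date followed by a 24-hour time; the result is in UTC
dt <- parse_datetime("01/10/2010 20:10", "%d/%m/%Y %H:%M")

# 12-hour clock with AM/PM
tm <- parse_time("8:05 PM", "%I:%M %p")

d
dt
tm
```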

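Parsing a file

Now that you've learned how to parse an individual vector, it's time to return to the beginning and explore how readr parses a file. There are two new things that you'll learn about in this section:

How readr automatically guesses the type of each column.
How to override the default specification.

Strategy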

guess_parser("2010-10-01")
#> [1] "date"
guess_parser("15:01")
#> [1] "time"
guess_parser(c("TRUE", "FALSE"))
#> [1] "logical"
guess_parser(c("1", "5", "9"))
#> [1] "double"
guess_parser(c("12,352,561"))
#> [1] "number"

str(parse_guess("2010-10-10"))
#>  Date[1:1], format: "2010-10-10"
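
The guess has to hold for every value it is shown. As a small sketch with made-up inputs, a single value that does not fit is enough to make the guess fall back to character:

```r
library(readr)

mixed_numbers <- guess_parser(c("1", "2", "x"))              # one non-numeric value
decimals      <- guess_parser(c("1.5", "2"))                 # all values are numeric
mixed_dates   <- guess_parser(c("2010-10-01", "not a date")) # one value is not a date

mixed_numbers
decimals
mixed_dates
```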

challenge <- read_csv(readr_example("challenge.csv"))
Parsed with column specification:
cols(
  x = col_double(),
  y = col_logical()
)
Warning: 1000 parsing failures.
 row col           expected     actual                                             file
1001   y 1/0/T/F/TRUE/FALSE 2015-01-16 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
1002   y 1/0/T/F/TRUE/FALSE 2018-05-18 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
1003   y 1/0/T/F/TRUE/FALSE 2015-09-05 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
1004   y 1/0/T/F/TRUE/FALSE 2012-11-28 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
1005   y 1/0/T/F/TRUE/FALSE 2020-01-13 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
.... ... .................. .......... ................................................
See problems(...) for more details.

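There are two printed outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures. It's always a good idea to explicitly pull out the problems(), so you can explore them in more depth: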

 problems(challenge)
# A tibble: 1,000 x 5
     row col   expected           actual     file                                            
   <int> <chr> <chr>              <chr>      <chr>                                           
 1  1001 y     1/0/T/F/TRUE/FALSE 2015-01-16 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
 2  1002 y     1/0/T/F/TRUE/FALSE 2018-05-18 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
 3  1003 y     1/0/T/F/TRUE/FALSE 2015-09-05 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
 4  1004 y     1/0/T/F/TRUE/FALSE 2012-11-28 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
 5  1005 y     1/0/T/F/TRUE/FALSE 2020-01-13 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
 6  1006 y     1/0/T/F/TRUE/FALSE 2016-04-17 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
 7  1007 y     1/0/T/F/TRUE/FALSE 2011-05-14 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
 8  1008 y     1/0/T/F/TRUE/FALSE 2020-07-18 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
 9  1009 y     1/0/T/F/TRUE/FALSE 2011-04-30 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
10  1010 y     1/0/T/F/TRUE/FALSE 2010-05-11 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
# ... with 990 more rows

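A good strategy is to work column by column until there are no problems remaining. Reading the x column with col_integer() produces many parsing problems: there are trailing characters after the integer value. That suggests we need to use a double parser instead.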

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_integer(),
    y = col_character()
  )
)

Warning: 1000 parsing failures.
 row col               expected             actual                                             file
1001   x no trailing characters .23837975086644292 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
1002   x no trailing characters .41167997173033655 'D:/R-3.5.1/library/readr/extdata/challenge.csv'
1003   x no trailing characters .7460716762579978  'D:/R-3.5.1/library/readr/extdata/challenge.csv'
1004   x no trailing characters .723450553836301   'D:/R-3.5.1/library/readr/extdata/challenge.csv'
1005   x no trailing characters .614524137461558   'D:/R-3.5.1/library/readr/extdata/challenge.csv'
.... ... ...................... .................. ................................................
See problems(...) for more details.

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_double(),
    y = col_character()
  )
)

tail(challenge)
# A tibble: 6 x 2
      x y         
  <dbl> <chr>     
1 0.805 2019-11-21
2 0.164 2018-03-29
3 0.472 2014-08-04
4 0.718 2015-08-16
5 0.270 2020-02-04
6 0.608 2019-01-06

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
tail(challenge)
#> # A tibble: 6 x 2
#>       x y         
#>   <dbl> <date>    
#> 1 0.805 2019-11-21
#> 2 0.164 2018-03-29
#> 3 0.472 2014-08-04
#> 4 0.718 2015-08-16
#> 5 0.270 2020-02-04
#> 6 0.608 2019-01-06

challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )
challenge2
#> # A tibble: 2,000 x 2
#>       x y         
#>   <dbl> <date>    
#> 1   404 NA        
#> 2  4172 NA        
#> 3  3004 NA        
#> 4   787 NA        
#> 5    37 NA        
#> 6  2332 NA        
#> # … with 1,994 more rows

challenge2 <- read_csv(readr_example("challenge.csv"), 
                       col_types = cols(.default = col_character())
)

challenge2
# A tibble: 2,000 x 2
   x     y    
   <chr> <chr>
 1 404   NA   
 2 4172  NA   
 3 3004  NA   
 4 787   NA   
 5 37    NA   
 6 2332  NA   
 7 2489  NA   
 8 1449  NA   
 9 3665  NA   
10 3863  NA   
# ... with 1,990 more rows

df <- tribble(
  ~x,  ~y,
  "1", "1.21",
  "2", "2.32",
  "3", "4.56"
)
df
#> # A tibble: 3 x 2
#>   x     y    
#>   <chr> <chr>
#> 1 1     1.21 
#> 2 2     2.32 
#> 3 3     4.56

# Note the column types
type_convert(df)
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_double()
#> )
#> # A tibble: 3 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1  1.21
#> 2     2  2.32
#> 3     3  4.56

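Writing to a file

readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv(). Both functions increase the chances of the output file being read back in correctly by:

Always encoding strings in UTF-8.
Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.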

write_csv(challenge, "challenge.csv")

challenge
#> # A tibble: 2,000 x 2
#>       x y         
#>   <dbl> <date>    
#> 1   404 NA        
#> 2  4172 NA        
#> 3  3004 NA        
#> 4   787 NA        
#> 5    37 NA        
#> 6  2332 NA        
#> # … with 1,994 more rows
write_csv(challenge, "challenge-2.csv")
read_csv("challenge-2.csv")
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_logical()
#> )
#> # A tibble: 2,000 x 2
#>       x y    
#>   <dbl> <lgl>
#> 1   404 NA   
#> 2  4172 NA   
#> 3  3004 NA   
#> 4   787 NA   
#> 5    37 NA   
#> 6  2332 NA   
#> # … with 1,994 more rows

write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")
#> # A tibble: 2,000 x 2
#>       x y         
#>   <dbl> <date>    
#> 1   404 NA        
#> 2  4172 NA        
#> 3  3004 NA        
#> 4   787 NA        
#> 5    37 NA        
#> 6  2332 NA        
#> # … with 1,994 more rows
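
The outputs above show that column types are guessed again when a CSV is read back, while an RDS file keeps them. The sketch below makes the same point with a small made-up data frame containing a factor. The temporary file paths and variable names are illustrative only.

```r
library(readr)

df <- data.frame(
  f = factor(c("low", "high", "low"), levels = c("low", "high")),
  n = c(1.5, 2, 3)
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# CSV round trip: the factor comes back as a plain character column
write_csv(df, csv_path)
from_csv <- read_csv(csv_path, col_types = cols())

# RDS round trip: the factor and its level order are preserved
write_rds(df, rds_path)
from_rds <- read_rds(rds_path)

class(from_csv$f)
class(from_rds$f)
levels(from_rds$f)
```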

The feather package implements a fast binary file format that can be shared across programming languages:

library(feather)
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")
#> # A tibble: 2,000 x 2
#>       x      y
#>   <dbl> <date>
#> 1   404   <NA>
#> 2  4172   <NA>
#> 3  3004   <NA>
#> 4   787   <NA>
#> 5    37   <NA>
#> 6  2332   <NA>
#> # ... with 1,994 more rows
