Update 2020-02-23:
When import() errors on certain input formats, the format argument solves it, e.g. format = "\t". The help page describes the argument as:
An optional character string code of file format, which can be used to override the format inferred from file. Shortcuts include: "," (for comma-separated values), ";" (for semicolon-separated values), and "|" (for pipe-separated values).
I also found that import() accepts fread()'s arguments, e.g. skip = "ID", meaning: start reading from the line containing "ID".
So, import still deserves all the praise~
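A minimal sketch of both points on a throwaway file (the file contents and object names are made up; rio must be installed, so the call is guarded):

```r
# Toy file: a free-text banner line followed by a tab-separated table
if (requireNamespace("rio", quietly = TRUE)) {
  tmp <- tempfile(fileext = ".txt")
  writeLines(c("some free-text banner line",
               "ID\tsymbol",
               "1\tTP53",
               "2\tEGFR"), tmp)

  # format = "\t" forces tab-separated parsing; skip = "ID" is handed
  # straight through to fread(): start at the line containing "ID"
  d <- rio::import(tmp, format = "\t", skip = "ID")
}
```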
read.delim() is a handy substitute for read.table(); they differ only in a few default arguments.
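Concretely, read.delim() is read.table() with header = TRUE, sep = "\t", and fill = TRUE preset, so the two agree once those arguments are spelled out (a base-R sketch on a temp file):

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tvalue", "a\t1", "b\t2"), tmp)

# read.table needs the arguments spelled out ...
df1 <- read.table(tmp, header = TRUE, sep = "\t")

# ... read.delim presets header = TRUE, sep = "\t", fill = TRUE
df2 <- read.delim(tmp)

identical(df1, df2)  # TRUE
```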
Written by Huahua on 2019-08-09
To flesh out the "reading and writing files" part of my national tour courseware (the handout is about three-quarters done), I learned several new functions for reading and exporting files from Efficient R Programming, recorded here.
The English e-book is free online: https://csgillespie.github.io/efficientR/input-output.html#prerequisites-4
(Yesterday I forgot to change the title of the post. Unbelievable.)
1. Preparation
A few packages are needed:
if(!require("rio")) install.packages("rio")
if(!require("readr")) install.packages("readr")
if(!require("data.table")) install.packages("data.table")
if(!require("WDI")) install.packages("WDI")
library(rio)
library(readr)
library(data.table)
library(WDI)
All of the example data is available from the 生信星球 WeChat account: reply "讀取" in the backend to get it.
2. Efficient reading and writing: the rio package
The rio package is a genuinely versatile data I/O package, supporting many formats: .csv, .feather, .json, .dta, .xls, .xlsx, and more.
It is almost outrageously simple to use. No extra arguments are needed: any supported format is read in with one line of code, and to export to a given format you just put that extension on the file name.
co2 = import("CO2.csv")
head(co2)
#> V1 time value
#> 1 1 1959.000 315.42
#> 2 2 1959.083 316.31
#> 3 3 1959.167 316.50
#> 4 4 1959.250 317.56
#> 5 5 1959.333 318.13
#> 6 6 1959.417 318.00
export(co2,"CO2.txt")
voyages = import("voc_voyages.tsv")
head(voyages)
#> number number_sup trip trip_sup boatname master
#> 1 1 1 AMSTERDAM Jan Jakobsz. Schellinger
#> 2 2 1 DUIFJE Simon Lambrechtsz. Mau
#> 3 3 1 HOLLANDIA Jan Dignumsz. van Kwadijk+
#> 4 4 1 MAURITIUS Jan Jansz. Molenaar
#> 5 5 1 LANGEBARK Hans Huibrechtsz. Tonneman
#> 6 6 1 MAAN
#> tonnage type_of_boat built bought hired yard chamber departure_date
#> 1 260 1594 A 1595-04-02
#> 2 50 pinas 1594 A 1595-04-02
#> 3 460 1594 A 1595-04-02
#> 4 460 1594 A 1595-04-02
#> 5 300 1598-03-25
#> 6 NA 1598-03-25
#> departure_harbour cape_arrival cape_departure cape_call arrival_date
#> 1 Texel TRUE 1596-06-06
#> 2 Texel TRUE 1596-06-06
#> 3 Texel TRUE 1596-06-06
#> 4 Texel TRUE 1596-06-06
#> 5 Zeeland TRUE 1599-03-01
#> 6 Zeeland TRUE
#> arrival_harbour next_voyage
#> 1 Engano NA
#> 2 Engano 5001
#> 3 Engano 5002
#> 4 Engano 5003
#> 5 Bantam 5010
#> 6 NA
#> particulars
#> 1 from 04-08 till 11-08 in the Mosselbaai; from 13-09 till 07-10 in the Ampalazabaai; from 09-10 till 13-12 in S. Augustins Bay, where before departure 127 of the 249 men were still alive; 11-01 till 21-01 at Ste. Marie I.; from 23-01 till 12-02 in the Bay of Antongil. The AMSTERDAM was set on fire near Bawean, 11-01-1597.
#> 2 HOLLANDIA on 26-10-1595; he was succeeded by Hendrik Jansz.
#> 3 Jan Dignumsz. died on 29-05-1595 and Mau was his successor.
#> 4 Jan Jansz. died on 25-12-1596 and Hendrik Jansz. was his successor.
#> 5 other.
#> 6
export(voyages, "voc_voyages.xlsx")
#> Note: zip::zip() is deprecated, please use zip::zipr() instead
# reading an online file is no problem either
capitals = import("https://github.com/mledoze/countries/raw/master/countries.json")
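rio also wraps the import/export pair into a one-line format converter, convert() (a sketch on a temp file; guarded in case rio is not installed):

```r
if (requireNamespace("rio", quietly = TRUE)) {
  # Write a small CSV, then convert it straight to RDS in one call
  tmp_csv <- tempfile(fileext = ".csv")
  rio::export(data.frame(x = 1:3, y = c("a", "b", "c")), tmp_csv)

  tmp_rds <- sub("\\.csv$", ".rds", tmp_csv)
  rio::convert(tmp_csv, tmp_rds)  # imports the CSV, exports it as RDS

  back <- rio::import(tmp_rds)    # round-trips with the same columns
}
```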
3. Other approaches
When I first saw import I was stunned: if it can read any data, why bother learning read.table and read.csv? Wouldn't they be superfluous? Well... not quite.
Reading on, the other approaches besides import are:
(1) read.table and its relatives read.csv and read.delim
(2) the readr package's read_table, read_csv, etc.
(3) the data.table package's fread function
df_co2 = read.csv("CO2.csv")
df_co2_readr = readr::read_csv("CO2.csv")
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#> X1 = col_double(),
#> time = col_double(),
#> value = col_double()
#> )
df_co2_dt = data.table::fread("CO2.csv")
voyages_base = read.delim("voc_voyages.tsv")
voyages_readr = readr::read_tsv("voc_voyages.tsv")
#> Parsed with column specification:
#> cols(
#> .default = col_character(),
#> number = col_double(),
#> number_sup = col_logical(),
#> trip = col_double(),
#> tonnage = col_double(),
#> hired = col_logical(),
#> departure_date = col_date(format = ""),
#> cape_arrival = col_date(format = ""),
#> cape_departure = col_date(format = ""),
#> cape_call = col_logical(),
#> arrival_date = col_date(format = ""),
#> next_voyage = col_double()
#> )
#> See spec(...) for full column specifications.
#> Warning: 77 parsing failures.
#> row col expected actual file
#> 1023 hired 1/0/T/F/TRUE/FALSE 1664 'voc_voyages.tsv'
#> 1025 hired 1/0/T/F/TRUE/FALSE 1664 'voc_voyages.tsv'
#> 1030 hired 1/0/T/F/TRUE/FALSE 1664 'voc_voyages.tsv'
#> 1034 hired 1/0/T/F/TRUE/FALSE 1664/5 'voc_voyages.tsv'
#> 1035 hired 1/0/T/F/TRUE/FALSE 1665 'voc_voyages.tsv'
#> .... ..... .................. ...... .................
#> See problems(...) for more details.
voyages_dt = data.table::fread("voc_voyages.tsv")
4. Where the differences lie
The book notes that fread and the read_* family handle numeric columns containing anomalous values differently:
For columns in which the first 1000 rows are of one type but which contain anomalies, such as ‘built’ and ‘departure_data’ in the shipping example, fread() coerces the result to characters. read_csv() and siblings, by contrast, keep the class that is correct for the first 1000 rows and sets the anomalous records to NA. This is illustrated in 5.1, where read_tsv() produces a numeric class for the ‘built’ variable, ignoring the non-numeric text in row 2841.
Strangely, when I ran the code, the 'built' column the book talks about showed no difference: both are character.
class(voyages_dt$built)
#> [1] "character"
class(voyages_readr$built)
#> [1] "character"
That point hardly matters, but it left me curious about where the differences really are, so I tried a few things.
First, the dimensions:
dim(voyages)
#> [1] 8131 22
dim(voyages_base)
#> [1] 8131 22
dim(voyages_dt)
#> [1] 8131 22
dim(voyages_readr)
#> [1] 8131 22
The row and column counts are identical.
Next, the column names:
# the four calls return identical results
colnames(voyages)
colnames(voyages_base)
colnames(voyages_dt)
colnames(voyages_readr)
#> [1] "number" "number_sup" "trip"
#> [4] "trip_sup" "boatname" "master"
#> [7] "tonnage" "type_of_boat" "built"
#> [10] "bought" "hired" "yard"
#> [13] "chamber" "departure_date" "departure_harbour"
#> [16] "cape_arrival" "cape_departure" "cape_call"
#> [19] "arrival_date" "arrival_harbour" "next_voyage"
#> [22] "particulars"
identical() can be used to check any pair for equality.
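For instance, a pairwise identical() check looks like this (a self-contained sketch on a temp file rather than the voyages objects):

```r
# Read the same file through two different functions
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:2, value = c("a", "b")), tmp, row.names = FALSE)

x <- read.csv(tmp)
y <- read.delim(tmp, sep = ",")

identical(colnames(x), colnames(y))  # TRUE
identical(x, y)                      # TRUE here, since the defaults coincide
```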
Third, the class of each column.
This one is trickier: str() shows it, but for this example it is inconvenient for side-by-side comparison, so I wrote a function:
dfc = function(kk, name = FALSE){
  kk = data.frame(kk)
  x = vector()
  n = 0
  for (i in colnames(kk)) {
    n = n + 1
    x[[n]] = class(kk[, i])
  }
  if (name) names(x) = colnames(kk)
  x
}
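The loop above can also be written in one line with sapply(), which applies class() column by column (a base-R sketch; dfc2 is a hypothetical name for the compact rewrite):

```r
# Compact rewrite: sapply() iterates over the columns of a data.frame
# and collects class() for each one
dfc2 <- function(kk, name = FALSE) {
  x <- sapply(as.data.frame(kk), class)
  if (!name) names(x) <- NULL
  x
}

df <- data.frame(a = 1:3, b = letters[1:3], c = Sys.Date() + 1:3,
                 stringsAsFactors = FALSE)
dfc2(df)               # "integer" "character" "Date"
dfc2(df, name = TRUE)  # the same, with column names attached
```

One caveat: for columns with more than one class (e.g. POSIXct), sapply() returns a list rather than a character vector.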
# bind them together for comparison
cn = cbind(colnames(voyages_dt),
dfc(voyages),
dfc(voyages_dt),
dfc(voyages_readr),
dfc(voyages_base));cn
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] "number" "integer" "integer" "numeric" "integer"
#> [2,] "number_sup" "character" "character" "logical" "character"
#> [3,] "trip" "integer" "integer" "numeric" "integer"
#> [4,] "trip_sup" "character" "character" "character" "character"
#> [5,] "boatname" "character" "character" "character" "character"
#> [6,] "master" "character" "character" "character" "character"
#> [7,] "tonnage" "integer" "integer" "numeric" "integer"
#> [8,] "type_of_boat" "character" "character" "character" "character"
#> [9,] "built" "character" "character" "character" "character"
#> [10,] "bought" "character" "character" "character" "character"
#> [11,] "hired" "character" "character" "logical" "character"
#> [12,] "yard" "character" "character" "character" "character"
#> [13,] "chamber" "character" "character" "character" "character"
#> [14,] "departure_date" "character" "character" "Date" "character"
#> [15,] "departure_harbour" "character" "character" "character" "character"
#> [16,] "cape_arrival" "character" "character" "Date" "character"
#> [17,] "cape_departure" "character" "character" "Date" "character"
#> [18,] "cape_call" "logical" "logical" "logical" "character"
#> [19,] "arrival_date" "character" "character" "Date" "character"
#> [20,] "arrival_harbour" "character" "character" "character" "character"
#> [21,] "next_voyage" "integer" "integer" "numeric" "integer"
#> [22,] "particulars" "character" "character" "character" "character"
Column 18 differs! And several columns are cleverly recognized by readr as "Date".
head(voyages[,18])
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE
head(voyages_dt[,18])
#> cape_call
#> 1: TRUE
#> 2: TRUE
#> 3: TRUE
#> 4: TRUE
#> 5: TRUE
#> 6: TRUE
head(voyages_base[,18])
#> [1] "true" "true" "true" "true" "true" "true"
head(voyages_readr[,18])
#> # A tibble: 6 x 1
#> cape_call
#> <lgl>
#> 1 TRUE
#> 2 TRUE
#> 3 TRUE
#> 4 TRUE
#> 5 TRUE
#> 6 TRUE
On the differences in reading time, the book provides a chart (not reproduced here).
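Instead of the chart, the comparison is easy to rerun locally with system.time(); absolute numbers depend on the machine and file, but the ordering the book reports (fread fastest, base slowest) usually holds. A sketch on a generated file:

```r
# Generate a medium-sized CSV, then time the three readers on it
tmp <- tempfile(fileext = ".csv")
n <- 2e5
write.csv(data.frame(id = seq_len(n), value = rnorm(n)), tmp, row.names = FALSE)

t_base <- system.time(read.csv(tmp))              # base R, usually the slowest
if (requireNamespace("readr", quietly = TRUE))
  t_readr <- system.time(readr::read_csv(tmp))
if (requireNamespace("data.table", quietly = TRUE))
  t_fread <- system.time(data.table::fread(tmp))  # usually the fastest
t_base["elapsed"]
```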
5. Real bioinformatics examples
The book's examples are all tidy, regular data frames, so nothing errored or read incorrectly. I borrowed a few example files from Guozi (果子) to see whether everything reads successfully in real-world use.
(1) Plain txt and csv
## .csv
#file.show("B cell receptor signaling pathway.csv")
csv1 = read.csv(file = "B cell receptor signaling pathway.csv");dim(csv1)
#> [1] 18 169
csv2 = data.table::fread(file = "B cell receptor signaling pathway.csv");dim(csv2)
#> [1] 18 169
csv3 = import("B cell receptor signaling pathway.csv");dim(csv3)
#> [1] 18 169
csv4 <- read_csv("B cell receptor signaling pathway.csv");dim(csv4)
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> X1 = col_character()
#> )
#> See spec(...) for full column specifications.
#> [1] 18 169
## .txt
txt1 <- read.table("platformMap.txt",header = T,sep = "\t");dim(txt1)
#> [1] 74 6
txt2 <- data.table::fread("platformMap.txt");dim(txt2)
#> [1] 74 6
txt3 <- import("platformMap.txt");dim(txt3)
#> [1] 74 6
txt4 <- read_tsv("platformMap.txt");dim(txt4)
#> Parsed with column specification:
#> cols(
#> title = col_character(),
#> gpl = col_character(),
#> bioc_package = col_character(),
#> manufacturer = col_character(),
#> organism = col_character(),
#> data_row_count = col_double()
#> )
#> [1] 74 6
All four handle the csv without trouble; for the txt, three read it directly and one (read.table) needs a couple of extra arguments.
(2) A GPL annotation table
Its quirk: the first 12 lines are comment lines starting with '#', as shown in the figure.
geo1 <- read.table("GPL6244-17930.txt",header = T,sep="\t",fill = T);dim(geo1)
#> [1] 33297 12
geo2 <- data.table::fread("GPL6244-17930.txt");dim(geo2)
#> [1] 33297 12
geo3 <- import("GPL6244-17930.txt");dim(geo3)
#> [1] 33297 12
geo4 <- read_tsv("GPL6244-17930.txt",skip = 12)
#> Parsed with column specification:
#> cols(
#> ID = col_double(),
#> GB_LIST = col_character(),
#> SPOT_ID = col_character(),
#> seqname = col_character(),
#> RANGE_GB = col_character(),
#> RANGE_STRAND = col_character(),
#> RANGE_START = col_double(),
#> RANGE_STOP = col_double(),
#> total_probes = col_double(),
#> gene_assignment = col_character(),
#> mrna_assignment = col_character(),
#> category = col_character()
#> )
#> Warning: 8936 parsing failures.
#> row col expected actual file
#> 3169 RANGE_START a double --- 'GPL6244-17930.txt'
#> 3169 RANGE_STOP a double --- 'GPL6244-17930.txt'
#> 3752 RANGE_START a double --- 'GPL6244-17930.txt'
#> 3752 RANGE_STOP a double --- 'GPL6244-17930.txt'
#> 9529 RANGE_START a double --- 'GPL6244-17930.txt'
#> .... ........... ........ ...... ...................
#> See problems(...) for more details.
The two smart ones still read it correctly, while the other two need extra arguments to get it right: read.table needs fill = TRUE, which pads rows with missing fields using empty strings, and read_tsv needs skip = 12 to jump over the '#' comment lines at the top.
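Incidentally, the reason read.table gets away without a skip argument is that it treats '#' lines as comments by default (comment.char = "#"); readr's read_tsv has a comment argument that avoids counting the lines to skip. A toy sketch (the file contents are made up):

```r
# Toy GPL-style file: '#' comment lines before the header
tmp <- tempfile(fileext = ".txt")
writeLines(c("#platform = GPLxxxx",
             "#date = 2019-08-09",
             "ID\tsymbol",
             "1\tTP53",
             "2\tEGFR"), tmp)

# base R skips the '#' lines automatically (comment.char = "#")
geo_base <- read.table(tmp, header = TRUE, sep = "\t")

# readr must be told: either skip = 2 or, more robustly, comment = "#"
if (requireNamespace("readr", quietly = TRUE))
  geo_readr <- readr::read_tsv(tmp, comment = "#")

dim(geo_base)  # 2 2
```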
(3) A GEO expression matrix
## 3. Read GEO data (a txt file with annotation lines)
exp1 <- read.table("GSE42872_series_matrix.txt",comment.char="!",header=T);dim(exp1)
#> [1] 33297 7
exp2 <- data.table::fread("GSE42872_series_matrix.txt",skip = 59);dim(exp2)
#> Warning in data.table::fread("GSE42872_series_matrix.txt", skip = 59):
#> Discarded single-line footer: <<!series_matrix_table_end>>
#> [1] 33297 7
exp3 <- import("GSE42872_series_matrix.txt") ;dim(exp3)
#> Warning in fread(dec = ".", input = structure("GSE42872_series_matrix.txt",
#> class = c("rio_tsv", : Stopped early on line 26. Expected 2 fields but
#> found 0. Consider fill=TRUE and comment.char=. First discarded non-empty
#> line: <<!Sample_title "A375 cells 24h Control rep1" "A375 cells 24h Control
#> rep2" "A375 cells 24h Control rep3" "A375 cells 24h Vemurafenib rep1" "A375
#> cells 24h Vemurafenib rep2" "A375 cells 24h Vemurafenib rep3">>
#> [1] 24 2
exp4 <- read_tsv("GSE42872_series_matrix.txt",skip = 59);dim(exp4)
#> Parsed with column specification:
#> cols(
#> ID_REF = col_double(),
#> GSM1052615 = col_double(),
#> GSM1052616 = col_double(),
#> GSM1052617 = col_double(),
#> GSM1052618 = col_double(),
#> GSM1052619 = col_double(),
#> GSM1052620 = col_double()
#> )
#> Warning: 2 parsing failures.
#> row col expected actual file
#> 33298 ID_REF a double !series_matrix_table_end 'GSE42872_series_matrix.txt'
#> 33298 NA 7 columns 1 columns 'GSE42872_series_matrix.txt'
#> [1] 33298 7
More arguments are needed this time. import has no argument that saves it here: it throws no error, but what it reads is wrong.
(4) Reading a TCGA methylation file
tcga1 <- read.table("jhu-usc.edu_BRCA.HumanMethylation450.9.lvl-3.TCGA-BH-A1EV-11A-24D-A138-05.gdc_hg38.txt",header=T,fill = T,sep = "\t")
tcga2 <- data.table::fread("jhu-usc.edu_BRCA.HumanMethylation450.9.lvl-3.TCGA-BH-A1EV-11A-24D-A138-05.gdc_hg38.txt")
tcga3 <- import("jhu-usc.edu_BRCA.HumanMethylation450.9.lvl-3.TCGA-BH-A1EV-11A-24D-A138-05.gdc_hg38.txt")
tcga4 <- read_tsv("jhu-usc.edu_BRCA.HumanMethylation450.9.lvl-3.TCGA-BH-A1EV-11A-24D-A138-05.gdc_hg38.txt")
#> Parsed with column specification:
#> cols(
#> `Composite Element REF` = col_character(),
#> Beta_value = col_double(),
#> Chromosome = col_character(),
#> Start = col_double(),
#> End = col_double(),
#> Gene_Symbol = col_character(),
#> Gene_Type = col_character(),
#> Transcript_ID = col_character(),
#> Position_to_TSS = col_character(),
#> CGI_Coordinate = col_character(),
#> Feature_Type = col_character()
#> )
Apart from read.table needing fill = TRUE, none had any trouble.
(5) Reading a TCGA RNA-seq counts file
rna1 <- read.table("0e30bd18-8e8b-4c52-aace-b5587c6df51a.htseq.counts",header = F,sep = "\t")
rna2 <- data.table::fread("0e30bd18-8e8b-4c52-aace-b5587c6df51a.htseq.counts")
rna3 <- import("0e30bd18-8e8b-4c52-aace-b5587c6df51a.htseq.counts") # errors
#> Unrecognized file format. Try specifying with the format argument.
#> Error in .import.default(file = file, ...): Format not supported
rna4 <- read_tsv("0e30bd18-8e8b-4c52-aace-b5587c6df51a.htseq.counts",col_names =F)
#> Parsed with column specification:
#> cols(
#> X1 = col_character(),
#> X2 = col_double()
#> )
import only supports the handful of extensions it recognizes! Everything else errors with an unsupported format.
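As the 2020 update at the top of this post notes, the format argument works around the unrecognized extension (a sketch on a made-up two-line counts file; guarded in case rio is not installed):

```r
# The ".counts" extension is unknown to rio, but format = "\t" tells it
# how to parse the file anyway
if (requireNamespace("rio", quietly = TRUE)) {
  tmp <- tempfile(fileext = ".counts")
  writeLines(c("ENSG00000000003\t2311", "ENSG00000000005\t0"), tmp)
  rna <- rio::import(tmp, format = "\t")
}
```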
(6) The final boss: a GEO soft file
This file is complicated: its comment markers are inconsistent, and while the platform table itself should be 33297 rows, it is followed by over a hundred thousand lines of other content.
(The two peculiar spots are shown as screenshots in the original post.)
## read the platform annotation from a GEO soft file
soft1 <-data.table::fread("GSE42872_family.soft",skip ="ID")
#> Warning in data.table::fread("GSE42872_family.soft", skip = "ID"): Stopped
#> early on line 33404. Expected 12 fields but found 1. Consider fill=TRUE and
#> comment.char=. First discarded non-empty line: <<!platform_table_end>>
soft2 <- import("GSE42872_family.soft") # errors outright
#> Unrecognized file format. Try specifying with the format argument.
#> Error in .import.default(file = file, ...): Format not supported
soft3 <-read.table("GSE42872_family.soft",skip = 105,sep = "\t",header = T,fill = T)
soft4 <- read_tsv("GSE42872_family.soft",skip = 105)
#> Parsed with column specification:
#> cols(
#> ID = col_double(),
#> GB_LIST = col_character(),
#> SPOT_ID = col_character(),
#> seqname = col_character(),
#> RANGE_GB = col_character(),
#> RANGE_STRAND = col_character(),
#> RANGE_START = col_double(),
#> RANGE_STOP = col_double(),
#> total_probes = col_double(),
#> gene_assignment = col_character(),
#> mrna_assignment = col_character(),
#> category = col_character()
#> )
#> Warning: 209188 parsing failures.
#> row col expected actual file
#> 3169 RANGE_START a double --- 'GSE42872_family.soft'
#> 3169 RANGE_STOP a double --- 'GSE42872_family.soft'
#> 3752 RANGE_START a double --- 'GSE42872_family.soft'
#> 3752 RANGE_STOP a double --- 'GSE42872_family.soft'
#> 9529 RANGE_START a double --- 'GSE42872_family.soft'
#> .... ........... ........ ...... ......................
#> See problems(...) for more details.
fread's skip = "ID" means: start reading at the line containing "ID" (here, the column-header line).
The help page says:
If 0 (default) start on the first line and from there finds the first row with a consistent number of columns. This automatically avoids irregular header information before the column names row. skip>0 means ignore the first skip rows manually. skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).
fread smartly took just the first part, the tens of thousands of platform rows, while read.table and read_tsv read everything. On this problem, fread is unstoppable.
To sum up
Guozi says: read.table is the most commonly used, and fread the smartest.
Guozi also says: give fread the argument data.table = FALSE and the data comes in as a plain data.frame.
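A minimal sketch of that argument (guarded in case data.table is not installed):

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  tmp <- tempfile(fileext = ".csv")
  write.csv(data.frame(x = 1:3), tmp, row.names = FALSE)

  dt <- data.table::fread(tmp)                      # class: data.table, data.frame
  df <- data.table::fread(tmp, data.table = FALSE)  # class: data.frame only
}
```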
After today's exercise, I think he is right. For common formats, try import first (import is in fact built on fread);
if that fails, fall back to fread; at most you add an argument or two, and in theory it succeeds;
if that still fails, Hadley's read_* family is no slouch either, give it a try.
Base R is a bit clumsy, but it has more arguments and more flexibility, so it remains an option. After all, back when I was learning R for Data Science, I had no idea what 'base' even was...
Wait! One more trick
readLines() is also excellent: it reads each line as one string, so the result is one long character vector.
Whatever the file is, read it in first, turn it into a one-column data frame, and split it afterwards.
For the splitting, tidyr::separate() is recommended.
An example:
if(!require(tidyr)) install.packages("tidyr")
if(!require("stringr")) install.packages("stringr")
library(tidyr)
library(stringr)
x = readLines("B cell receptor signaling pathway.csv")
n=str_count(x[1],",")
dfx=data.frame(x=x) %>%
separate(x,into = paste0("V",1: (n+1)),sep = ",");dim(dfx)
#> [1] 19 169
These posts are synced to the 生信星球 WeChat public account, which also hosts a study group for bioinformatics beginners and a Q&A board; you are welcome to follow there.