比較數(shù)據(jù)框
問題
你想要比較兩個或多個數(shù)據(jù)框并找到在超過一個數(shù)據(jù)框中出現(xiàn)的行,或者僅在一個數(shù)據(jù)框中出現(xiàn)的行甲抖。
方案
一個例子
假設(shè)你有下面三個數(shù)據(jù)框,你想要知道那些至少在兩個數(shù)據(jù)框中出現(xiàn)的行心铃。
dfA <- data.frame(Subject=c(1,1,2,2), Response=c("X","X","X","X"))
dfA
#> Subject Response
#> 1 1 X
#> 2 1 X
#> 3 2 X
#> 4 2 X
dfB <- data.frame(Subject=c(1,2,3), Response=c("X","Y","X"))
dfB
#> Subject Response
#> 1 1 X
#> 2 2 Y
#> 3 3 X
dfC <- data.frame(Subject=c(1,2,3), Response=c("Z","Y","Z"))
dfC
#> Subject Response
#> 1 1 Z
#> 2 2 Y
#> 3 3 Z
在dfA中准谚,包括(1,X)的行同樣出現(xiàn)在了dfB,但是包含(2,X)的行沒有出現(xiàn)在任何其他的數(shù)據(jù)框去扣。相似地柱衔,df包含的(1,X)出現(xiàn)在了dfA,(2,Y)出現(xiàn)在了dfC,但是(3,X)沒有出現(xiàn)在其他數(shù)據(jù)框秀存。
你可能想要標(biāo)記在其他數(shù)據(jù)框中出現(xiàn)了的行捶码,或者沒有數(shù)據(jù)框中都是唯一的行。
連接數(shù)據(jù)框
進(jìn)一步地或链,我們首先用一個可以識別每一行來自哪里的列來連接數(shù)據(jù)框惫恼。這里稱為Coder
變量因為它可能是由三個不同的人編碼的數(shù)據(jù)。在這個例子中澳盐,你可能想要找到編碼者同意之處(至少出現(xiàn)在兩個數(shù)據(jù)框中的行)祈纯,或者它們不同意之處。
dfA$Coder <- "A"
dfB$Coder <- "B"
dfC$Coder <- "C"
df <- rbind(dfA, dfB, dfC) # 把它們粘在一起
df <- df[,c("Coder", "Subject", "Response")] # 重新排序
df
#> Coder Subject Response
#> 1 A 1 X
#> 2 A 1 X
#> 3 A 2 X
#> 4 A 2 X
#> 5 B 1 X
#> 6 B 2 Y
#> 7 B 3 X
#> 8 C 1 Z
#> 9 C 2 Y
#> 10 C 3 Z
如果你的數(shù)據(jù)一開始就是這種格式叼耙,那就不要將它們連接到一起啦腕窥。
尋找重復(fù)行
使用在文末定義的函數(shù)dupsBetweenGroups
,我們可以找出在不同組別中重復(fù)的行筛婉。
# 找出在不同組別中重復(fù)的行
dupRows <- dupsBetweenGroups(df, "Coder")
# 在數(shù)據(jù)框的旁邊打印出來
cbind(df, dup=dupRows)
#> Coder Subject Response dup
#> 1 A 1 X TRUE
#> 2 A 1 X TRUE
#> 3 A 2 X FALSE
#> 4 A 2 X FALSE
#> 5 B 1 X TRUE
#> 6 B 2 Y TRUE
#> 7 B 3 X FALSE
#> 8 C 1 Z FALSE
#> 9 C 2 Y TRUE
#> 10 C 3 Z FALSE
注意這不會標(biāo)記在同一組中的重復(fù)行簇爆,比如Coder=A時,有兩行Subject=2以及Response=X爽撒,但沒有標(biāo)記出來入蛆。
尋找唯一行
同樣可以找出在每一組中唯一出現(xiàn)的行。
cbind(df, unique=!dupRows)
#> Coder Subject Response unique
#> 1 A 1 X FALSE
#> 2 A 1 X FALSE
#> 3 A 2 X TRUE
#> 4 A 2 X TRUE
#> 5 B 1 X FALSE
#> 6 B 2 Y FALSE
#> 7 B 3 X TRUE
#> 8 C 1 Z TRUE
#> 9 C 2 Y FALSE
#> 10 C 3 Z TRUE
拆分?jǐn)?shù)據(jù)框
如果你想要把連接的數(shù)據(jù)框拆分為三個原始的數(shù)據(jù)框
# 保存df的結(jié)果
dfDup <- cbind(df, dup=dupRows)
dfA <- subset(dfDup, Coder=="A", select=-Coder)
dfA
#> Subject Response dup
#> 1 1 X TRUE
#> 2 1 X TRUE
#> 3 2 X FALSE
#> 4 2 X FALSE
dfB <- subset(dfDup, Coder=="B", select=-Coder)
dfB
#> Subject Response dup
#> 5 1 X TRUE
#> 6 2 Y TRUE
#> 7 3 X FALSE
dfC <- subset(dfDup, Coder=="C", select=-Coder)
dfC
#> Subject Response dup
#> 8 1 Z FALSE
#> 9 2 Y TRUE
#> 10 3 Z FALSE
忽略列
有可能需要通過移除數(shù)據(jù)框的列來忽略一個或者多個列硕勿,結(jié)果又可以把原始完整的數(shù)據(jù)框連接起來哨毁。
# 忽略Subject列——僅使用Response列
dfNoSub <- subset(df, select=-Subject)
dfNoSub
#> Coder Response
#> 1 A X
#> 2 A X
#> 3 A X
#> 4 A X
#> 5 B X
#> 6 B Y
#> 7 B X
#> 8 C Z
#> 9 C Y
#> 10 C Z
# 檢查重復(fù)行
dupRows <- dupsBetweenGroups(dfNoSub, "Coder")
# 把結(jié)果連接起來
cbind(df, dup=dupRows)
#> Coder Subject Response dup
#> 1 A 1 X TRUE
#> 2 A 1 X TRUE
#> 3 A 2 X TRUE
#> 4 A 2 X TRUE
#> 5 B 1 X TRUE
#> 6 B 2 Y TRUE
#> 7 B 3 X TRUE
#> 8 C 1 Z FALSE
#> 9 C 2 Y TRUE
#> 10 C 3 Z FALSE
dupsBetweenGroups 函數(shù)
該函數(shù)用來尋找不同組別的重復(fù)行:
dupsBetweenGroups <- function (df, idcol) {
# df: the data frame
# idcol: the column which identifies the group each row belongs to
# Get the data columns to use for finding matches
datacols <- setdiff(names(df), idcol)
# Sort by idcol, then datacols. Save order so we can undo the sorting later.
sortorder <- do.call(order, df)
df <- df[sortorder,]
# Find duplicates within each id group (first copy not marked)
dupWithin <- duplicated(df)
# With duplicates within each group filtered out, find duplicates between groups.
# Need to scan up and down with duplicated() because first copy is not marked.
dupBetween = rep(NA, nrow(df))
dupBetween[!dupWithin] <- duplicated(df[!dupWithin,datacols])
dupBetween[!dupWithin] <- duplicated(df[!dupWithin,datacols], fromLast=TRUE) | dupBetween[!dupWithin]
# ============= Replace NA's with previous non-NA value ==============
# This is why we sorted earlier - it was necessary to do this part efficiently
# Get indexes of non-NA's
goodIdx <- !is.na(dupBetween)
# These are the non-NA values from x only
# Add a leading NA for later use when we index into this vector
goodVals <- c(NA, dupBetween[goodIdx])
# Fill the indices of the output vector with the indices pulled from
# these offsets of goodVals. Add 1 to avoid indexing to zero.
fillIdx <- cumsum(goodIdx)+1
# The original vector, now with gaps filled
dupBetween <- goodVals[fillIdx]
# Undo the original sort
dupBetween[sortorder] <- dupBetween
# Return the vector of which entries are duplicated across groups
return(dupBetween)
}
注意
想要尋找單個數(shù)據(jù)框中的重復(fù)行,參看../Finding and removing duplicate records.