Characteristics of a tidy dataset:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
Characteristics of untidy datasets:
- Column headers are values, not variable names; the actual variables are religion, income and frequency.
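As a minimal sketch of this first problem, the toy data frame below stores income brackets as column headers. The counts and bracket labels are made up for illustration and are not the real survey figures. gather() collects the headers into an income column and the cell values into a frequency column.

```r
library(tidyr)

# Hypothetical counts, for illustration only
pew <- data.frame(
  religion = c("Agnostic", "Atheist"),
  `<$10k` = c(1, 2),
  `$10-20k` = c(3, 4),
  check.names = FALSE,      # keep the non-syntactic column names as-is
  stringsAsFactors = FALSE
)

# Gather every column except religion into key/value pairs
pew_tidy <- gather(pew, income, frequency, -religion)
pew_tidy
```

Each row of pew_tidy now holds one religion/income combination with its frequency.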
Demographic groups are broken down by sex (m, f) and age (0-14, 15-24, 25-34, 35-44, 45-54, 55-64).
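When sex and age are packed into a single column name, one way to untangle them is gather() followed by separate(). The sketch below assumes column names such as m014 and f1524 (one letter for sex, then the age range); both the names and the counts are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical case counts; each column name encodes sex and age together
tb <- data.frame(
  country = c("AD", "AE"),
  year = c(2000, 2000),
  m014 = c(0, 2),
  m1524 = c(0, 4),
  f014 = c(1, 3),
  stringsAsFactors = FALSE
)

tb_tidy <- tb %>%
  gather(sex_age, cases, -country, -year) %>%   # column names become values
  separate(sex_age, c("sex", "age"), sep = 1)   # split after the first character

tb_tidy
```

Passing a number to sep splits at a character position, which suits these names because the sex code is always exactly one letter.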
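The weather dataset has variables in individual columns (id, year, month), spread across columns (day, d1-d31) and across rows (tmin, tmax: minimum and maximum temperature).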
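A layout like this usually needs two steps: gather the day columns into rows, then spread the tmin/tmax rows into columns. The sketch below uses only two day columns and made-up readings. The column name element (holding "tmin"/"tmax") and the station id are assumptions, not taken from the text above.

```r
library(dplyr)
library(tidyr)

# Hypothetical readings; the real layout would have d1-d31
weather <- data.frame(
  id = "MX17004",
  year = 2010,
  month = 1,
  element = c("tmax", "tmin"),   # assumed name for the tmin/tmax indicator column
  d1 = c(27.8, 14.5),
  d2 = c(NA_real_, NA_real_),
  stringsAsFactors = FALSE
)

weather_tidy <- weather %>%
  gather(day, value, d1:d2, na.rm = TRUE) %>%       # days move from columns to rows
  mutate(day = as.integer(sub("d", "", day))) %>%   # "d1" -> 1
  spread(element, value)                            # tmin/tmax move from rows to columns

weather_tidy
```

With na.rm = TRUE, days without a reading are dropped rather than kept as rows of missing values.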
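The billboard dataset actually contains observations on two types of observational units: the song and its rank in each week. As a result, artist, year and time are repeated many times. The dataset needs to be split into two parts: a song dataset, which stores the artist, song name and time, and a ranking dataset, which gives the rank of each song in each week.

PRACTICE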
- data: sat.csv
- resource: The 2013 SAT Report on College & Career Readiness
# Approach
# 1. select() all columns that do NOT contain the word "total",
# since if we have the male and female data, we can always
# recreate the total count in a separate column, if we want it.
# Hint: Use the contains() function, which you'll
# find detailed in 'Special functions' section of ?select.
#
# 2. gather() all columns EXCEPT score_range, using
# key = part_sex and value = count.
#
# 3. separate() part_sex into two separate variables (columns),
# called "part" and "sex", respectively. You may need to check
# the 'Examples' section of ?separate to remember how the 'into'
# argument should be phrased.
#
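library(dplyr)
library(tidyr)

# Assumes `sat` has already been read from sat.csv.
# sat[2:11] keeps columns 2 through 11 of that data frame.
sat1 <- sat[2:11] %>%
  select(-contains("total")) %>%
  gather(part_sex, count, -score_range) %>%
  separate(part_sex, c("part", "sex")) %>%
  # For each part/sex group, add the group total and each row's share of it
  group_by(part, sex) %>%
  mutate(total = sum(count),
         prop = count / total) %>%
  print()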
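
To try the pipeline without sat.csv, here is a sketch on a small made-up data frame. The column names (read_male, read_fem, read_total) are an assumption about how the real file is laid out, inferred from the steps above, and the counts are invented.

```r
library(dplyr)
library(tidyr)

# Made-up counts; column names are assumed to follow <part>_<sex>
sat_toy <- data.frame(
  score_range = c("700-800", "600-690"),
  read_male = c(10, 30),
  read_fem = c(20, 20),
  read_total = c(30, 50),
  stringsAsFactors = FALSE
)

sat_toy_tidy <- sat_toy %>%
  select(-contains("total")) %>%              # totals can be recomputed later
  gather(part_sex, count, -score_range) %>%   # one row per score range and column
  separate(part_sex, c("part", "sex")) %>%    # splits on the underscore by default
  group_by(part, sex) %>%
  mutate(total = sum(count),
         prop = count / total) %>%
  ungroup()

sat_toy_tidy
```

Within each part/sex group the prop values sum to 1, which is a quick sanity check on the result.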