STATA
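1. The egen command with the cut() function
This method sorts the values of the variable of interest and then splits the variable's observed values into five groups. Because the cut points are placed on the values themselves, observations with the same income always land in the same group, so the groups need not be exactly equal in size.
// Split income into quintiles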
egen income2=cut(income), group(5)
tab income2
// Check the mean of income within each quintile
mean income, over(income2)
income2 | Freq. | Percent | Cum. |
---|---|---|---|
0 | 1,947 | 19.64 | 19.64 |
1 | 1,586 | 16 | 35.65 |
2 | 1,754 | 17.7 | 53.34 |
3 | 2,566 | 25.89 | 79.24 |
4 | 2,058 | 20.76 | 100 |
Total | 9,911 | 100 |
Over | Mean | Std. Err. | [95% Conf. | Interval] | |
---|---|---|---|---|---|
income | |||||
0 | 645.1798 | 19.65844 | 606.6452 | 683.7143 | |
1 | 5340.14 | 46.23396 | 5249.512 | 5430.768 | |
2 | 13028.83 | 71.89518 | 12887.9 | 13169.76 | |
3 | 25492.33 | 93.73953 | 25308.58 | 25676.08 | |
4 | 67655.16 | 1357.951 | 64993.3 | 70317.02 |
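2. The gen command with the group() function
This method sorts the observations by the variable of interest and then splits the sample itself into five equal-sized parts.
// Sort by income, then split the dataset into five equal parts.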
sort income
gen income3=group(5)
tab income3
mean income, over(income3)
income3 | Freq. | Percent | Cum. |
---|---|---|---|
1 | 1,983 | 20.01 | 20.01 |
2 | 1,982 | 20 | 40.01 |
3 | 1,982 | 20 | 60 |
4 | 1,982 | 20 | 80 |
5 | 1,982 | 20 | 100 |
Total | 9,911 | 100 |
Over | Mean | Std. Err. | [95% Conf. | Interval] | |
---|---|---|---|---|---|
income | |||||
1 | 687.9299 | 20.55276 | 647.6423 | 728.2175 | |
2 | 6398.316 | 56.05585 | 6288.435 | 6508.197 | |
3 | 16010.37 | 81.97568 | 15849.68 | 16171.06 | |
4 | 27724.18 | 96.34285 | 27535.33 | 27913.03 | |
5 | 68868.98 | 1402.875 | 66119.06 | 71618.9 |
R
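The cut() function
> library(dplyr)  # provides %>%, mutate(), group_by(), summarise()
# Use cut() to split income, creating a new factor variable income2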
> cgss2 <- cgss %>%
+ mutate(income2=cut(income, breaks=5))
# Mean of income within each group
> avg_income2 <- cgss2 %>%
+ group_by(income2) %>%
+ summarise(avg=mean(income))
> avg_income2
row | income2 | avg |
---|---|---|
1 | (-1e+03,2e+05] | 22143. |
2 | (2e+05,4e+05] | 315116. |
3 | (4e+05,6e+05] | 525000 |
4 | (6e+05,8e+05] | 750000 |
5 | (8e+05,1e+06] | 1000000 |
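This result differs sharply from the STATA results, mainly because R's cut() function, when given a single number of breaks, first finds the minimum and maximum and then divides that range of values into five intervals of equal width.
That is, cut(income, breaks=5) in the example above is roughly equivalent to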
cut(income, breaks=c(0, 200000,400000,600000,800000, 1000000))
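cut() also moves the outermost limits out by 0.1% of the range, which is why the first interval is labelled (-1e+03,2e+05]. See the examples below, especially the equivalent hist() call.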
> Z <- stats::rnorm(10000)
> table(cut(Z, breaks = -6:6))
(-6,-5] (-5,-4] (-4,-3] (-3,-2] (-2,-1] (-1,0] (0,1] (1,2] (2,3] (3,4] (4,5] (5,6]
0 2 7 220 1335 3510 3356 1335 225 10 0 0
> table(cut(cgss$income, breaks=5))
(-1e+03,2e+05] (2e+05,4e+05] (4e+05,6e+05] (6e+05,8e+05] (8e+05,1e+06]
9860 43 4 3 1
> hist(cgss$income, breaks=5, plot=F)
$breaks
[1] 0e+00 2e+05 4e+05 6e+05 8e+05 1e+06
$counts
[1] 9860 43 4 3 1
$density
[1] 4.974271e-06 2.169307e-08 2.017960e-09 1.513470e-09 5.044900e-10
$mids
[1] 1e+05 3e+05 5e+05 7e+05 9e+05
$xname
[1] "cgss$income"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
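The equal-width behaviour can be reproduced by hand. The following is a minimal sketch on made-up data (the vector `x` is hypothetical, not the CGSS income variable). It relies on the documented behaviour of `cut()`: when `breaks` is a single number, the data range is divided into pieces of equal length and the outer limits are then moved out by 0.1% of the range.

```r
# Hypothetical, heavily right-skewed toy data (not the CGSS data)
x <- c(0, 500, 1200, 3000, 8000, 15000, 26000, 70000, 300000, 1000000)

# What cut(x, breaks = 5) does: five intervals of equal width
auto_groups <- cut(x, breaks = 5, labels = FALSE)

# The same breaks built by hand from the range of x
rng <- range(x)
manual_breaks <- seq(rng[1], rng[2], length.out = 6)

# Move the outer limits out by 0.1% of the range so min and max are included
pad <- diff(rng) / 1000
manual_breaks[1] <- manual_breaks[1] - pad
manual_breaks[6] <- manual_breaks[6] + pad
manual_groups <- cut(x, breaks = manual_breaks, labels = FALSE)

# Both approaches assign every value to the same interval
print(identical(auto_groups, manual_groups))

# Almost all values pile up in the first interval, as with the income data
print(table(auto_groups))
```

With skewed data, nearly every observation ends up in the lowest interval, which is the same pattern seen in the table(cut(cgss$income, breaks=5)) output.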
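Clearly, cut() with a fixed number of breaks cannot meet the need for a quintile split in many situations.
The case_when() function in the dplyr package groups observations through if_else-style conditions, so it still requires the cut points to be known in advance.
If the quintile cut points are known, the split can be done perfectly well without any of these functions.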
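For instance, when the four interior cut points are already known, counting how many of them each value exceeds gives the group number directly. This is a minimal sketch; the `income` vector and the `cut_points` values below are made up for illustration and are not taken from the CGSS data.

```r
# Hypothetical toy incomes and hypothetical, already-known cut points
income <- c(0, 800, 4000, 9000, 12000, 20000, 30000, 45000, 90000, 500000)
cut_points <- c(2000, 10000, 20000, 40000)

# outer() compares every income with every cut point (a 10 x 4 logical matrix).
# Group = 1 + number of cut points strictly below the value,
# so a value equal to a cut point stays in the lower group.
income_group <- 1 + rowSums(outer(income, cut_points, ">"))

print(data.frame(income, income_group))
```

Whether a value that sits exactly on a cut point belongs to the lower or the upper group is a design choice; swapping `">"` for `">="` moves it to the upper group.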
So far I have not found a better, more convenient R function for this.
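One possible route, offered as a sketch rather than a definitive answer, is to compute the cut points from the data with base R's `quantile()` and pass them to `cut()`, which roughly mirrors the value-based split. Ranking the rows instead gives near-equal group sizes, like the sample-based split. The toy `income` vector and the names `by_value` and `by_rank` are hypothetical. Results on real data will not necessarily match STATA's exactly, since quantile definitions and tie handling differ between the two programs.

```r
# Hypothetical toy incomes with tied values (three observations at 1000)
income <- c(100, 500, 1000, 1000, 1000, 3000, 8000, 15000, 26000, 70000)

# (a) Value-based split: cut points are the 0%, 20%, ..., 100% quantiles.
# unique() guards against duplicated cut points when ties are heavy,
# in which case fewer than five groups are produced.
qs <- quantile(income, probs = seq(0, 1, by = 0.2))
by_value <- cut(income, breaks = unique(qs),
                include.lowest = TRUE, labels = FALSE)

# (b) Row-based split: rank the observations (ties broken by order of
# appearance) and chop the ranks into five near-equal blocks.
n <- length(income)
by_rank <- ceiling(rank(income, ties.method = "first") * 5 / n)

print(table(by_value))  # ties stay together, so group sizes can differ
print(table(by_rank))   # sizes are as equal as possible, ties may be split
```

In (a) the three tied incomes stay in one group, while in (b) they are spread over two groups so that every group has the same size. dplyr also offers `ntile()`, which performs a rank-based split similar in spirit to (b).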