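I am working through the book Machine Learning with R by rewriting its code in tidyverse style.
The first example in the book uses the kNN algorithm to diagnose breast cancer.
First, load the required packages: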
library(tidyverse)   # data wrangling
library(here)        # build the path to the data file
library(knitr)       # nicer-looking tables
library(kableExtra)  # same as above
library(class)       # provides knn()
library(gmodels)     # provides CrossTable()
Then import and clean the data:
wbcd <- read_csv(here('data', '01-wisc_bc_data.csv')) %>%
  select(-id) %>%
  mutate(diagnosis = factor(diagnosis, levels = c('B', 'M'),
                            labels = c('Benign', 'Malignant'))) %>%
  mutate_if(is.numeric, ~ (.x - min(.x)) / (max(.x) - min(.x)))
First, the here function locates the data file, and read_csv reads it into R. Next, select drops the id variable, and mutate converts the diagnosis variable to a factor. Finally, mutate_if applies min-max normalization to every numeric variable. This step uses a formula-style anonymous function (~ with .x), which keeps the code concise. The data now look like this:
wbcd %>% head()
## # A tibble: 6 x 31
## diagnosis radius_mean texture_mean perimeter_mean area_mean
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Malignant 0.521 0.0227 0.546 0.364
## 2 Malignant 0.643 0.273 0.616 0.502
## 3 Malignant 0.601 0.390 0.596 0.449
## 4 Malignant 0.210 0.361 0.234 0.103
## 5 Malignant 0.630 0.157 0.631 0.489
## 6 Malignant 0.259 0.203 0.268 0.142
## # ... with 26 more variables: smoothness_mean <dbl>,
## # compactness_mean <dbl>, concavity_mean <dbl>, `concave
## # points_mean` <dbl>, symmetry_mean <dbl>, fractal_dimension_mean <dbl>,
## # radius_se <dbl>, texture_se <dbl>, perimeter_se <dbl>, area_se <dbl>,
## # smoothness_se <dbl>, compactness_se <dbl>, concavity_se <dbl>,
## # `concave points_se` <dbl>, symmetry_se <dbl>,
## # fractal_dimension_se <dbl>, radius_worst <dbl>, texture_worst <dbl>,
## # perimeter_worst <dbl>, area_worst <dbl>, smoothness_worst <dbl>,
## # compactness_worst <dbl>, concavity_worst <dbl>, `concave
## # points_worst` <dbl>, symmetry_worst <dbl>,
## # fractal_dimension_worst <dbl>
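To make the min-max step concrete, here is a minimal sketch on a made-up three-row tibble. The helper name normalize_minmax and the toy data are assumptions for illustration only; the formula is the same one used in the mutate_if call above.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: min-max normalization, (x - min) / (max - min)
normalize_minmax <- function(x) (x - min(x)) / (max(x) - min(x))

# Toy data, made up for illustration
toy <- tibble(label = c("a", "b", "c"),
              x = c(10, 20, 30),
              y = c(1, 3, 5))

# Only the numeric columns are rescaled; label is left untouched
toy_norm <- toy %>% mutate_if(is.numeric, normalize_minmax)
toy_norm
```

Each numeric column now runs from 0 to 1, so no single feature dominates the distance calculation in kNN just because of its units.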
The book also covers z-score standardization. Because R has a ready-made scale function for this, the code is slightly simpler:
wbcd <- read_csv(here('data', '01-wisc_bc_data.csv')) %>%
  select(-id) %>%
  mutate(diagnosis = factor(diagnosis, levels = c('B', 'M'),
                            labels = c('Benign', 'Malignant'))) %>%
  # scale() returns a one-column matrix; as.numeric() flattens it to a plain vector
  mutate_if(is.numeric, ~ as.numeric(scale(.x)))
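As a quick check of what scale does, the sketch below compares it with the z-score formula written out by hand on a made-up vector. The variable names are illustrative assumptions. It also shows that scale returns a one-column matrix, not a plain vector, which is worth knowing when it is used inside mutate_if.

```r
# Toy vector, made up for illustration
x <- c(2, 4, 6, 8)

# z-score by hand: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() does the same, but returns a one-column matrix with attributes
z_scale <- scale(x)
is.matrix(z_scale)

# Flatten to a plain numeric vector before comparing
z_vec <- as.numeric(z_scale)
all.equal(z_manual, z_vec)
```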
The next step is to create the training and test datasets. First set a random seed so that the result is reproducible. Then use sample_n to randomly pick 469 rows from the full data as the training set, and setdiff to take the complement of the training set as the test set. Finally, use pull to extract the labels:
set.seed(0412)
wbcd_train <- wbcd %>% sample_n(469)
wbcd_test <- wbcd %>% setdiff(wbcd_train)
wbcd_train_labels <- wbcd_train %>% pull(1)
wbcd_test_labels <- wbcd_test %>% pull(1)
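The setdiff approach relies on the rows being distinct, since, as far as I know, dplyr's set operations on data frames compare whole rows and work on unique rows. A possible alternative, sketched here on made-up data, is to tag each row with an id and take the complement with anti_join. The row_id column and the toy tibble are assumptions for illustration, not part of the original code.

```r
library(dplyr)
library(tibble)

set.seed(1)

# Toy data, made up for illustration: 10 rows
toy <- tibble(label = rep(c("B", "M"), 5), x = 1:10)

# Hypothetical id column so every row can be identified unambiguously
toy_id <- toy %>% mutate(row_id = row_number())

# Sample 7 rows for training; the rest become the test set
train <- toy_id %>% sample_n(7)
test  <- toy_id %>% anti_join(train, by = "row_id")

# The two parts should be disjoint and together cover all rows
nrow(train) + nrow(test)
```

The id column would need to be dropped again before the data are passed to knn().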
The data are now ready for modelling. I did not see a step in the book that removes the label variable from the datasets, so in the model here I dropped the label column from both datasets:
wbcd_test_pred <- knn(train = wbcd_train[, -1], test = wbcd_test[, -1],
                      cl = wbcd_train_labels, k = 21)
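To show how the arguments of knn() fit together, here is a small sketch on made-up two-dimensional points. All data and names are assumptions for illustration. The train and test arguments contain only features, and cl carries the training labels separately, which is why the label column is dropped above.

```r
library(class)

# Toy training features: two well-separated clusters (made up for illustration)
train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6,
                    6, 5),
                  ncol = 2, byrow = TRUE)

# Training labels, kept separate from the features
train_cl <- factor(c("a", "a", "a", "b", "b", "b"))

# Two new points, one near each cluster
test_x <- matrix(c(0.2, 0.1,
                   5.5, 5.2),
                 ncol = 2, byrow = TRUE)

# Each test point gets the majority label among its 3 nearest training points
pred <- knn(train = train_x, test = test_x, cl = train_cl, k = 3)
pred
```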
Check the model's performance:
CrossTable(wbcd_test_labels, wbcd_test_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 68 | 0 | 68 |
## | 1.000 | 0.000 | 0.680 |
## | 0.986 | 0.000 | |
## | 0.680 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 1 | 31 | 32 |
## | 0.031 | 0.969 | 0.320 |
## | 0.014 | 1.000 | |
## | 0.010 | 0.310 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 69 | 31 | 100 |
## | 0.690 | 0.310 | |
## -----------------|-----------|-----------|-----------|
##
##
The result differs from the one in the book, but it is still good: 99 of the 100 test cases are classified correctly.
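The usual summary metrics can be read off the cross table. The sketch below copies the four cell counts from the CrossTable output above into a matrix and computes accuracy, sensitivity, and specificity. Treating Malignant as the positive class is an assumption made here for illustration.

```r
# Counts copied from the CrossTable output (rows = actual, columns = predicted)
confusion <- matrix(c(68,  0,
                       1, 31),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual    = c("Benign", "Malignant"),
                                    predicted = c("Benign", "Malignant")))

# Share of all test cases on the diagonal (correctly classified)
accuracy <- sum(diag(confusion)) / sum(confusion)

# Assumption: Malignant is the positive class
# Sensitivity: share of actual malignant cases predicted as malignant
sensitivity <- confusion["Malignant", "Malignant"] / sum(confusion["Malignant", ])

# Specificity: share of actual benign cases predicted as benign
specificity <- confusion["Benign", "Benign"] / sum(confusion["Benign", ])

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

The single error is a malignant case predicted as benign, which is the more costly kind of mistake in a diagnostic setting.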
Finally, the book also evaluates the model with different values of k but does not give the corresponding code, so I have added it here:
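k <- map(1:30, ~ knn(train = wbcd_train[, -1], test = wbcd_test[, -1],
                     cl = wbcd_train_labels, k = .x)) %>%
  enframe(name = 'k', value = 'prediction') %>%  # one row per k, predictions in a list-column
  unnest(prediction) %>%                         # one row per (k, test case)
  mutate(label = rep(wbcd_test_labels, 30),
         # As defined here, 'Benign' is treated as the positive class; with
         # 'Malignant' as positive, the FN and FP names would be swapped.
         FN = prediction == 'Malignant' & label == 'Benign',
         FP = prediction == 'Benign' & label == 'Malignant') %>%
  group_by(k) %>%
  summarise(FN = sum(FN),
            FP = sum(FP),
            total = FN + FP)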
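First, map passes each of the values 1 to 30 to the model's k parameter, which produces a list of length 30. Next, enframe turns the list into a data frame with 30 rows, in which each element of the prediction column (the value column of enframe) holds 100 predicted labels. Then unnest expands the predictions in that column, so the data frame grows to 3000 rows. The rest of the code is fairly simple and needs no further description.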
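The map, enframe, unnest chain is easier to follow on a small scale. In the sketch below, a made-up vector of three labels stands in for each knn() result, and only two values of k are used. Everything here is an assumption for illustration.

```r
library(purrr)
library(tibble)
library(tidyr)

# Toy stand-in for the knn() calls: each "model" returns 3 made-up predictions
preds <- map(1:2, ~ c("Benign", "Malignant", "Benign"))
length(preds)   # a list with one element per value of k

# One row per list element; the predictions sit in a list-column
nested <- enframe(preds, name = "k", value = "prediction")
nrow(nested)

# Expand the list-column: one row per (k, prediction) pair
flat <- unnest(nested, prediction)
nrow(flat)
```

With 30 values of k and 100 test cases, the same steps give 30 rows after enframe and 3000 rows after unnest.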
The resulting k data frame looks like this:
print(k, n = nrow(k))
## # A tibble: 30 x 4
## k FN FP total
## <int> <int> <int> <int>
## 1 1 2 1 3
## 2 2 4 0 4
## 3 3 2 0 2
## 4 4 4 0 4
## 5 5 2 0 2
## 6 6 2 0 2
## 7 7 3 0 3
## 8 8 3 0 3
## 9 9 2 0 2
## 10 10 3 1 4
## 11 11 0 0 0
## 12 12 0 0 0
## 13 13 0 0 0
## 14 14 0 0 0
## 15 15 0 0 0
## 16 16 0 0 0
## 17 17 0 0 0
## 18 18 0 1 1
## 19 19 0 1 1
## 20 20 0 1 1
## 21 21 0 1 1
## 22 22 1 1 2
## 23 23 1 1 2
## 24 24 1 1 2
## 25 25 1 1 2
## 26 26 1 1 2
## 27 27 1 0 1
## 28 28 1 1 2
## 29 29 1 1 2
## 30 30 1 0 1
As the table shows, the results for k from 11 to 17 are all "perfect": there are no misclassifications on this test set.
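Picking out the best rows can also be done in code. The sketch below rebuilds a few rows of the table by hand and filters for the smallest total. The k_subset name is an assumption for illustration, and the same filter could be applied to the full k data frame.

```r
library(dplyr)
library(tibble)

# A few rows copied from the table above
k_subset <- tribble(
  ~k, ~FN, ~FP, ~total,
   9,   2,   0,      2,
  10,   3,   1,      4,
  11,   0,   0,      0,
  17,   0,   0,      0,
  18,   0,   1,      1,
  21,   0,   1,      1
)

# Keep the values of k with the fewest total errors
best <- k_subset %>% filter(total == min(total))
best$k
```

Because k is chosen on the same 100 test cases that are used to report the error, the zero-error result is likely optimistic; a separate validation set or cross-validation would give a fairer estimate.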