Julia Version 1.5.0-rc1.0 (2020-06-26), official https://julialang.org/ release. Documentation: https://docs.julialang.org
This is a standard classification problem: given various attributes of a user, predict whether that user will buy vehicle insurance (Response). The dataset can be downloaded from Kaggle.
- Load the data
using Queryverse, MLJ, StatsKit, PrettyPrinting, LossFunctions, Plots
train_data = Queryverse.load("D:\\data\\archive\\train.csv") |> DataFrame
test_data = Queryverse.load("D:\\data\\archive\\test.csv") |> DataFrame
"|>" is Julia's pipe operator, equivalent to "%>%" in R. It passes the result on its left as the argument to the function on its right. In the statements above, it converts the loaded data to the DataFrame type.
- Inspect the scientific types (scitypes) of the data
train_data |> MLJ.schema
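The schema reports two kinds of type:
1. types (machine types)
2. scitypes (scientific types)
Machine types are easy to understand: as in R, Python, and SQL, they describe how the data is stored. Scientific types are defined by the MLJ library to describe how a model should interpret the data. Different models accept different scitypes, so check this before using a model.
The MLJ documentation explains scitypes in detail.
- View summary statistics for the training set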
train_data |> describe |> print
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Symbol │ Union… │ Any │ Union… │ Any │ Union… │ Nothing │ DataType │
├─────┼──────────────────────┼──────────┼──────────┼──────────┼───────────┼─────────┼──────────┼──────────┤
│ 1 │ id │ 190555.0 │ 1 │ 190555.0 │ 381109 │ │ │ Int64 │
│ 2 │ Gender │ │ Female │ │ Male │ 2 │ │ String │
│ 3 │ Age │ 38.8226 │ 20 │ 36.0 │ 85 │ │ │ Int64 │
│ 4 │ Driving_License │ 0.997869 │ 0 │ 1.0 │ 1 │ │ │ Int64 │
│ 5 │ Region_Code │ 26.3888 │ 0.0 │ 28.0 │ 52.0 │ │ │ Float64 │
│ 6 │ Previously_Insured │ 0.45821 │ 0 │ 0.0 │ 1 │ │ │ Int64 │
│ 7 │ Vehicle_Age │ │ 1-2 Year │ │ > 2 Years │ 3 │ │ String │
│ 8 │ Vehicle_Damage │ │ No │ │ Yes │ 2 │ │ String │
│ 9 │ Annual_Premium │ 30564.4 │ 2630.0 │ 31669.0 │ 540165.0 │ │ │ Float64 │
│ 10 │ Policy_Sales_Channel │ 112.034 │ 1.0 │ 133.0 │ 163.0 │ │ │ Float64 │
│ 11 │ Vintage │ 154.347 │ 10 │ 154.0 │ 299 │ │ │ Int64 │
│ 12 │ Response │ 0.122563 │ 0 │ 0.0 │ 1 │ │ │ Int64 │
id: does not help train the model and needs to be removed
Gender, Driving_License, Region_Code, Previously_Insured, Vehicle_Age, Vehicle_Damage, and Response: categorical variables, to be one-hot encoded
- Check whether the positive and negative samples are balanced
train_data.Response |> StatsKit.countmap
The positive and negative samples are imbalanced. We handle this later, in the model. (Undersampling the training set is another option.)
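As a rough illustration of what the count shows, the sketch below tallies classes in base Julia. The toy `labels` vector is made up for the example and is not the Kaggle data. On the real training set, the mean of Response in the summary table (0.122563) already indicates that only about 12% of the rows are positive.

```julia
# Hypothetical toy labels standing in for train_data.Response.
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

# Count occurrences of each class, similar in spirit to countmap.
counts = Dict{Int,Int}()
for label in labels
    counts[label] = get(counts, label, 0) + 1
end

# Share of positive samples among all samples.
positive_share = counts[1] / length(labels)
println(counts, "  positive share = ", positive_share)
```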
- Remove the id variable from the training set
train_data = train_data[:, Not(:id)]
- Unpack: split the data into predictors and the target variable
y, X = unpack(train_data, ==(:Response), colname -> true)
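- First, use automatic scitype conversion to coerce the predictors to scitypes the model can accept
X = coerce(X, autotype(X)) # automatically coerce the training set's scitypes to types the learner supports
The predictors are now of three scitypes: Multiclass (unordered categorical), OrderedFactor (ordered categorical), and Continuous.
- Convert everything to continuous values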
X = MLJ.transform(fit!(machine(ContinuousEncoder(drop_last = true), X)), X)
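To make the effect of drop_last = true concrete, here is a minimal base-Julia sketch of one-hot encoding a two-level column and dropping the indicator for the last level. The toy `gender` column and the variable names are assumptions for illustration; ContinuousEncoder's internals are not reproduced here.

```julia
# Hypothetical toy column; levels are taken in sorted order.
gender = ["Male", "Female", "Female", "Male"]
levels = sort(unique(gender))          # "Female" sorts before "Male"

# One indicator column per level, with the last level's column dropped.
kept_levels = levels[1:end-1]
encoded = [Float64(value == level) for value in gender, level in kept_levels]

println(kept_levels)
println(encoded)
```

With two levels, a single 0/1 column is enough: the dropped column would always equal one minus the kept one, so it carries no extra information.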
- Standardize
X = MLJ.transform(fit!(machine(Standardizer(), X)), X)
To make gradient descent more efficient, the data is standardized to mean 0 and standard deviation 1.
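The transformation itself is the usual z-score. A minimal sketch in base Julia, using made-up values and the sample standard deviation (whether Standardizer uses exactly this estimator is an assumption here):

```julia
using Statistics

# Hypothetical values standing in for one numeric column.
premium = [2630.0, 31669.0, 30564.0, 40000.0]

# Subtract the mean, then divide by the standard deviation.
mu = mean(premium)
sigma = std(premium)
z = (premium .- mu) ./ sigma

println("mean = ", mean(z), "  std = ", std(z))
```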
- Coerce the target variable's scitype to OrderedFactor
y = coerce(y, OrderedFactor)
- Inspect the hyperparameters of the logistic regression learner
info("LogisticClassifier", pkg = "ScikitLearn") |> pprint
[ Info: Training Machine{ContinuousEncoder} @192.
name = "LogisticClassifier",
package_name = "ScikitLearn",
is_supervised = true,
docstring = "Logistic regression classifier.\n→ based on [ScikitLearn](https://github.com/cstjean/ScikitLearn.jl).\n→ do `@load LogisticClassifier pkg=\"ScikitLearn\"` to use the model.\n→ do `?LogisticClassifier` for documentation.",
hyperparameter_ranges = (nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing),
hyperparameter_types = ("String", "Bool", "Float64", "Float64", "Bool", "Float64", "Any", "Any", "String", "Int64", "String", "Int64", "Bool", "Union{Nothing, Int64}", "Union{Nothing, Float64}"),
hyperparameters = (:penalty, :dual, :tol, :C, :fit_intercept, :intercept_scaling, :class_weight, :random_state, :solver, :max_iter, :multi_class, :verbose, :warm_start, :n_jobs, :l1_ratio),
implemented_methods = [:clean!, :fit, :fitted_params, :predict],
is_pure_julia = false,
is_wrapper = true,
load_path = "MLJScikitLearnInterface.LogisticClassifier",
package_license = "BSD",
package_url = "https://github.com/cstjean/ScikitLearn.jl",
package_uuid = "3646fa90-6ef7-5e7e-9f22-8aca16db6324",
prediction_type = :probabilistic,
supports_online = false,
supports_weights = false,
input_scitype = Table{_s24} where _s24<:(AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous),
target_scitype = AbstractArray{_s267,1} where _s267<:Finite,
output_scitype = Unknown)
- Load the model
@load LogisticClassifier pkg="ScikitLearn"
lc = LogisticClassifier(class_weight = "balanced", # the samples are imbalanced, so let the model compute class weights
                        solver = "sag")             # optimizer: stochastic average gradient (SAG)
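scikit-learn documents the "balanced" setting as weighting each class by n_samples / (n_classes * count of that class). A minimal base-Julia sketch of that formula, on made-up labels rather than the real data:

```julia
# Hypothetical labels: 8 negatives and 2 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

classes = sort(unique(labels))
n_samples = length(labels)
n_classes = length(classes)

# Weight of class c = n_samples / (n_classes * number of samples in c).
weights = Dict(c => n_samples / (n_classes * count(==(c), labels)) for c in classes)

println(weights)
```

The rarer class gets the larger weight, so mistakes on it cost more during training.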
- Train the model
r = range(lc, :max_iter, lower = 100, upper = 500) # range of max_iter values to try
tm = TunedModel(model = lc,
                tuning = Grid(),                        # search strategy over the parameter range
                resampling = CV(rng = 11, nfolds = 10), # 10-fold cross-validation
                range = [r],                            # parameter range(s)
                measure = area_under_curve              # metric for picking the best model: area under the ROC curve
                )
mtm = machine(tm, X, y) # construct the machine (the learner bound to the data)
fit!(mtm)               # fit the tuned model
[ Info: Training Machine{ProbabilisticTunedModel{Grid,…}} @931.
[ Info: Attempting to evaluate 10 models.
Evaluating over 10 metamodels: 100%[=========================] Time: 0:07:00
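For intuition about the measure, the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as one half. The sketch below computes it that way in base Julia; `pairwise_auc` is a hypothetical helper on made-up scores, not MLJ's area_under_curve implementation.

```julia
# Hypothetical predicted probabilities of class 1, with the true labels.
scores = [0.1, 0.4, 0.35, 0.8]
truth  = [0, 0, 1, 1]

# Fraction of positive/negative pairs ranked correctly (ties count as 0.5).
function pairwise_auc(scores, truth)
    pos = scores[truth .== 1]
    neg = scores[truth .== 0]
    wins = 0.0
    for p in pos, n in neg
        wins += p > n ? 1.0 : (p == n ? 0.5 : 0.0)
    end
    return wins / (length(pos) * length(neg))
end

println(pairwise_auc(scores, truth))
```

This pairwise form is quadratic in the number of samples, so it is only meant to show what the number measures.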
- Visualize the tuning results
res = report(mtm).plotting
scatter(res.parameter_values[:,1],
res.measurements)
best_model = fitted_params(mtm).best_model # view the hyperparameters of the best model
AUC (area under the ROC curve) is highest at max_iter = 278.
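The value 278 is consistent with a grid of 10 evenly spaced points between 100 and 500, rounded to integers. The sketch below builds such a grid in base Julia. That Grid() generated its candidates in exactly this way is an assumption; the log only confirms that 10 models were evaluated.

```julia
# Ten evenly spaced candidates between the range bounds, rounded to integers.
lower, upper, n_points = 100, 500, 10
grid = round.(Int, range(lower, stop = upper, length = n_points))

println(grid)
```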
- Process the prediction set with the same transformations
test_data |> describe |> pprint
id = test_data[:, :id]
test_data = select(test_data, Not(:id))
test_data = coerce(test_data, autotype(test_data)) # automatic scitype coercion
test_data = MLJ.transform(fit!(machine(ContinuousEncoder(drop_last = true), test_data)), test_data) # encode as Continuous
test_data = MLJ.transform(fit!(machine(Standardizer(), test_data)), test_data) # standardize
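The lines above fit a fresh encoder and standardizer on the prediction set. A common alternative, sketched here in base Julia on the assumption that the training statistics were kept, is to scale new data with the mean and standard deviation learned from the training set, so that both sets share one scale. The values and the helper name `fit_scaler` are hypothetical.

```julia
using Statistics

# Hypothetical helper: learn mean/std from training data and return a
# function that applies the same scaling to any vector.
function fit_scaler(train)
    mu, sigma = mean(train), std(train)
    return v -> (v .- mu) ./ sigma
end

train_age = [20.0, 36.0, 45.0, 85.0]   # made-up training values
new_age   = [30.0, 50.0]               # made-up prediction-set values

scale = fit_scaler(train_age)
new_scaled = scale(new_age)
println(new_scaled)
```

In MLJ terms this would correspond to keeping the fitted machine from the training step and calling MLJ.transform with it on the new table.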
- Predict with the trained model
result = predict_mode(mtm, test_data)
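Because the classifier is probabilistic, predict_mode reduces each predicted distribution to its most probable class. A minimal base-Julia sketch of that reduction, on a made-up probability matrix (how MLJ itself breaks ties is not shown here; in this sketch a tie goes to the first class):

```julia
# Hypothetical per-class probabilities: one row per sample, one column per class.
classes = [0, 1]
probabilities = [0.7 0.3;
                 0.2 0.8;
                 0.5 0.5]

# For each row, pick the class with the highest probability.
modes = [classes[argmax(probabilities[i, :])] for i in 1:size(probabilities, 1)]

println(modes)
```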
- Check the class proportions in the predictions
result |> countmap
- Combine id and the predictions into a DataFrame
result_data = DataFrame(id = id, Response = result)