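[Machine Learning Experiments] Main Modules and Basic Usage of scikit-learn

Introduction

For newcomers who want to work with machine learning algorithms but are nervous about getting started, figuring out how to get up to speed quickly can be a real struggle.
Among people working in data science, the most commonly used tools are R and Python. Each has its pros and cons, but Python comes out somewhat ahead in most respects, largely because the scikit-learn library implements a wide range of machine learning algorithms.

Data Loading

We assume the input is a feature matrix or a CSV file.
First, the data must be loaded into memory.
scikit-learn is built on NumPy arrays, so we use NumPy to load the CSV file.
The data below is downloaded from the UCI Machine Learning Repository.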

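import numpy as np
import urllib.request
# URL of the data set
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file (urllib.request.urlopen is the Python 3 form of urllib.urlopen)
raw_data = urllib.request.urlopen(url)
# load the CSV file as a NumPy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the features from the target attribute
# note: 0:7 keeps the first seven of the eight feature columns; use 0:8 to keep all eight
X = dataset[:,0:7]
# column 8 (the last one) is the class label
y = dataset[:,8]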

We will use this data set as the running example, with the feature matrix as X and the target variable as y.
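For experimenting without a network connection, the same load-and-split steps can be sketched against an in-memory CSV. The three rows below are invented values in a nine-column layout (eight features plus one label), not records from the UCI data set, and the names `csv_text`, `X_demo` and `y_demo` are illustrative only.

```python
import io

import numpy as np

# Three invented rows: 8 feature columns followed by 1 label column.
csv_text = """2,120,70,30,0,28.5,0.400,35,0
5,160,80,25,90,31.2,0.650,48,1
0,95,60,18,40,22.1,0.200,24,0
"""

# np.loadtxt accepts any file-like object, not only a path or a URL stream.
dataset_demo = np.loadtxt(io.StringIO(csv_text), delimiter=",")

# Every column except the last is a feature; the last column is the label.
X_demo = dataset_demo[:, :-1]
y_demo = dataset_demo[:, -1]

print(X_demo.shape, y_demo.shape)
```

Slicing with `:-1` and `-1` avoids hard-coding the number of feature columns, which is one way to guard against off-by-one mistakes when the layout changes.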

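Data Normalization

Most gradient-based methods in machine learning are sensitive to the scale of the data, so before running an algorithm we should normalize or standardize the features. Normalization in the usual sense rescales values into the 0-1 range, while standardization rescales each feature to zero mean and unit variance. Note that preprocessing.normalize rescales each sample (row) to unit norm rather than mapping each feature into 0-1. scikit-learn provides functions for both steps: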

from sklearn import preprocessing
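# normalize the data attributes (each sample row is rescaled to unit L2 norm)
normalized_X = preprocessing.normalize(X)
# standardize the data attributes (each feature column gets zero mean and unit variance)
standardized_X = preprocessing.scale(X)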
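To make the difference between these transformations concrete, here is a small sketch on a made-up 3x2 matrix. `MinMaxScaler` is not used in the original text; it is included as an assumption about what "scaling to 0-1" would look like per feature in scikit-learn.

```python
import numpy as np
from sklearn import preprocessing

# Tiny illustrative matrix: 3 samples (rows) x 2 features (columns).
X_small = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 60.0]])

# normalize(): rescales each ROW to unit L2 norm.
row_normalized = preprocessing.normalize(X_small)

# scale(): rescales each COLUMN to zero mean and unit variance.
standardized = preprocessing.scale(X_small)

# MinMaxScaler: rescales each COLUMN into the [0, 1] range.
min_max_scaled = preprocessing.MinMaxScaler().fit_transform(X_small)

print(np.linalg.norm(row_normalized, axis=1))
print(standardized.mean(axis=0), standardized.std(axis=0))
print(min_max_scaled.min(axis=0), min_max_scaled.max(axis=0))
```

Which transformation is appropriate depends on the algorithm: row normalization suits methods that compare directions of sample vectors, while column-wise scaling is the usual choice before gradient-based or distance-based models.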

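Feature Selection

When solving a real problem, the ability to choose or construct the right features is especially important. This is called feature selection or feature engineering.
Feature selection is a highly creative process that relies heavily on intuition and domain knowledge, though there are also many ready-made algorithms for it.
The tree-based algorithm below computes how informative each feature is: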

from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)
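The importances can also drive the selection itself. The following is a minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the diabetes data and using `SelectFromModel`, which the original text does not mention, to keep the features whose importance is at least the mean.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 8 features, of which 3 are informative.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_syn, y_syn)

# Importances are non-negative and sum to 1 across all features.
importances = forest.feature_importances_

# Keep only the features whose importance is at least the mean importance.
selector = SelectFromModel(forest, prefit=True, threshold="mean")
X_selected = selector.transform(X_syn)

print(importances.round(3))
print(X_syn.shape, "->", X_selected.shape)
```

The `"mean"` threshold is just one possible cut-off; a fixed number or a different rule may fit a given problem better.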

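Using the Algorithms

scikit-learn implements most of the basic machine learning algorithms. Let's take a quick look at them.

Logistic Regression

Many problems can be framed as binary classification. An advantage of this algorithm is that it outputs the probability that a sample belongs to each class.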

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Output:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, penalty=l2, random_state=None, tol=0.0001)
             precision    recall  f1-score   support

        0.0       0.79      0.89      0.84       500
        1.0       0.74      0.55      0.63       268

avg / total       0.77      0.77      0.77       768

[[447  53]
 [120 148]]
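The class probabilities mentioned above are available through `predict_proba`. A short sketch on synthetic data follows; the data set and the `max_iter=1000` setting are assumptions made for illustration, not part of the original experiment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_syn, y_syn)

# One row per sample, one column per class (ordered as in clf.classes_).
probabilities = clf.predict_proba(X_syn[:5])
# predict() returns the class with the highest probability.
labels = clf.predict(X_syn[:5])

print(clf.classes_)
print(probabilities.round(3))
print(labels)
```

Having probabilities rather than hard labels makes it possible to choose a decision threshold other than 0.5 when the two kinds of error have different costs.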

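Naive Bayes

This is another well-known machine learning algorithm. Its main task is to recover the distribution density of the training data, and it often performs well on multiclass classification.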

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Output:

GaussianNB()
             precision    recall  f1-score   support

        0.0       0.80      0.86      0.83       500
        1.0       0.69      0.60      0.64       268

avg / total       0.76      0.77      0.76       768

[[429  71]
 [108 160]]

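k-Nearest Neighbors

The k-nearest neighbors algorithm is often used as one component of a larger classification algorithm. For example, it can be used to evaluate features, which makes it useful in feature selection.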

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Output:

KNeighborsClassifier(algorithm=auto, leaf_size=30, metric=minkowski,
n_neighbors=5, p=2, weights=uniform)
             precision    recall  f1-score   support

        0.0       0.82      0.90      0.86       500
        1.0       0.77      0.63      0.69       268

avg / total       0.80      0.80      0.80       768

[[448  52]
 [ 98 170]]
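Because k-nearest neighbors works on distances between samples, it is one of the algorithms where the scaling step from the normalization section matters most. A hedged sketch of combining the two in a pipeline, evaluated with cross-validation on synthetic stand-in data (the pipeline and the data are illustrative assumptions, not the original setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is fitted inside each cross-validation fold, so no information
# from the held-out part leaks into the scaling statistics.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validated accuracy (the default scorer for classifiers).
scores = cross_val_score(knn_pipeline, X_syn, y_syn, cv=5)
print(scores.round(3), scores.mean().round(3))
```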

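Decision Trees

Classification and Regression Trees (CART) are often used for classification or regression problems in which the features carry categorical information, and they are well suited to multiclass classification.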

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Output:

DecisionTreeClassifier(compute_importances=None, criterion=gini,
max_depth=None, max_features=None, min_density=None,
min_samples_leaf=1, min_samples_split=2, random_state=None,
splitter=best)
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
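A score of 1.00 here is measured on the same rows the tree was fitted on, and an unrestricted tree can usually memorize its training set. The sketch below, on synthetic stand-in data, shows one way to check generalization by holding out a test set with `train_test_split`; the 25% split and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# Accuracy on the rows used for fitting versus accuracy on unseen rows.
train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print(train_accuracy, test_accuracy)
```

The gap between the two numbers is a rough indicator of overfitting; limiting `max_depth` or `min_samples_leaf` is a common way to narrow it.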

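Support Vector Machines

SVM is a very popular machine learning algorithm used mainly for classification. As with logistic regression, it can handle multiclass classification with the one-vs-rest approach.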

from sklearn import metrics
from sklearn.svm import SVC
# fit a SVM model to the data
model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Output:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel=rbf, max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]

Besides classification and regression algorithms, scikit-learn offers more complex algorithms such as clustering, and it also implements ensemble techniques such as Bagging and Boosting.
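As a rough illustration of these families, the sketch below fits one bagging model, one boosting model and one clustering model on synthetic data. The specific classes (`BaggingClassifier`, `GradientBoostingClassifier`, `KMeans`) and all parameter values are choices made for this example; the original text does not name particular estimators.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# Bagging: many trees, each fitted on a bootstrap sample, predictions combined.
bagging = BaggingClassifier(n_estimators=20, random_state=0)
# Boosting: trees added one after another, each correcting earlier errors.
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)

bagging_score = cross_val_score(bagging, X_syn, y_syn, cv=5).mean()
boosting_score = cross_val_score(boosting, X_syn, y_syn, cv=5).mean()

# Clustering ignores the labels and groups rows by feature similarity.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_syn)

print(round(bagging_score, 3), round(boosting_score, 3))
print(sorted(set(cluster_ids.tolist())))
```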

How to Optimize Algorithm Parameters

A harder task is building an effective procedure for choosing the right parameters; we need a search method to determine them. scikit-learn provides functions for this purpose.
The following example selects a regularization parameter:

import numpy as np
from sklearn.linear_model import Ridge
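# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed)
from sklearn.model_selection import GridSearchCV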
# prepare a range of alpha values to test
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

Output:

GridSearchCV(cv=None,
estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, solver=auto, tol=0.001),
estimator__alpha=1.0, estimator__copy_X=True,
estimator__fit_intercept=True, estimator__max_iter=None,
estimator__normalize=False, estimator__solver=auto,
estimator__tol=0.001, fit_params={}, iid=True, loss_func=None,
n_jobs=1,
param_grid={'alpha': array([ 1.00000e+00, 1.00000e-01, 1.00000e-02, 1.00000e-03,
1.00000e-04, 0.00000e+00])},
pre_dispatch=2*n_jobs, refit=True, score_func=None, scoring=None,
verbose=0)
0.282118955686
1.0
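Since `Ridge` is a regression model, the `best_score_` of about 0.28 printed above is its default score, the R² coefficient, rather than a classification accuracy. When the target is a class label, the same grid search can be run over a classifier instead. A minimal sketch, assuming synthetic data and a grid over the `C` parameter of `LogisticRegression` chosen purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# C is the inverse regularization strength of LogisticRegression.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

grid_clf = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
grid_clf.fit(X_syn, y_syn)

# Mean cross-validated accuracy for every candidate, in grid order.
for params, mean_score in zip(
    grid_clf.cv_results_["params"], grid_clf.cv_results_["mean_test_score"]
):
    print(params, round(mean_score, 3))

print(grid_clf.best_params_, round(grid_clf.best_score_, 3))
```

Inspecting `cv_results_` rather than only `best_score_` shows how sensitive the model is to the parameter, which helps decide whether a finer grid is worth the cost.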

Sometimes it is more effective to sample parameters at random from a given range, evaluate the algorithm with each sampled setting, and pick the best one.

import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
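# RandomizedSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed)
from sklearn.model_selection import RandomizedSearchCV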
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

Output:

RandomizedSearchCV(cv=None,
estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, solver=auto, tol=0.001),
estimator__alpha=1.0, estimator__copy_X=True,
estimator__fit_intercept=True, estimator__max_iter=None,
estimator__normalize=False, estimator__solver=auto,
estimator__tol=0.001, fit_params={}, iid=True, n_iter=100,
n_jobs=1,
param_distributions={'alpha': <scipy.stats.distributions.rv_frozen object at 0x04B86DD0>},
pre_dispatch=2*n_jobs, random_state=None, refit=True,
scoring=None, verbose=0)
0.282118643885
0.988443794636
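The `sp_rand()` distribution used above samples `alpha` uniformly from the interval 0 to 1. When a regularization strength may span several orders of magnitude, sampling on a logarithmic scale is a common alternative. The sketch below assumes `scipy.stats.loguniform`, a synthetic regression problem, and a search range of 1e-4 to 1e2, none of which come from the original text.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X_reg, y_reg = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=0
)

# Sample alpha log-uniformly between 1e-4 and 1e2.
param_distributions = {"alpha": loguniform(1e-4, 1e2)}

rsearch_log = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    random_state=0,  # makes the sampled candidates reproducible
)
rsearch_log.fit(X_reg, y_reg)

best_alpha = rsearch_log.best_params_["alpha"]
print(best_alpha, round(rsearch_log.best_score_, 3))
```

Fixing `random_state` keeps the sampled candidates the same between runs, which makes results easier to compare while experimenting.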

Summary

We have walked through the overall workflow of using the scikit-learn library. Hopefully this summary helps beginners settle in and learn, step by step, how to solve concrete machine learning problems.

When reposting, please credit the author, Jason Ding, and cite the source.
GitCafe blog homepage (http://jasonding1354.gitcafe.io/)
GitHub blog homepage (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.reibang.com/users/2bd9b48f6ea8/latest_articles)
Search Baidu for jasonding1354 to reach my blog homepage

Last edited: 2017.11.27 02:22:37
