A handy scikit-learn cheat sheet for machine learning with Python, including code examples.
Most of those who learn data science with Python have surely heard of scikit-learn, the open-source Python library that implements a wide range of machine learning, preprocessing, cross-validation, and visualization algorithms with the help of a unified interface.
If you're still new to the field, you should be aware that machine learning, and thus this Python library, belong to the must-knows of every aspiring data scientist.
That's why DataCamp has created a scikit-learn cheat sheet for those of you who have already started learning the Python package but still want a handy reference sheet. Or, if you still have no idea how scikit-learn works, this machine learning cheat sheet might come in handy to get a quick first glance at the basics you need to know to get started. Either way, we're sure you'll find it useful when you're tackling machine learning problems!
This scikit-learn cheat sheet will introduce you to the basic steps needed to successfully implement machine learning algorithms: you'll see how to load in your data, how to preprocess it, how to create your own model to fit to your data and predict target labels, how to validate your model, and how to tune it further to improve its performance.
In short, this cheat sheet will kickstart your data science projects: with the help of code examples, you can get started creating, validating, and tuning your machine learning models right away.
What are you waiting for? Time to get started!
**(Click above to download a printable version or read the online version below.)**
Python For Data Science Cheat Sheet: Scikit-learn
Scikit-learn is an open-source Python library that implements a range of machine learning, preprocessing, cross-validation, and visualization algorithms using a unified interface.
A Basic Example
>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, :2], iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred)
Loading The Data
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays, such as Pandas DataFrames, are also acceptable.
>>> import numpy as np
>>> X = np.random.random((10,5))
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])
>>> X[X < 0.7] = 0
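The note above says Pandas DataFrames are also acceptable; a minimal sketch of converting one to the NumPy array scikit-learn expects (the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with two numeric feature columns
df = pd.DataFrame({'height': [1.6, 1.7, 1.8],
                   'weight': [55.0, 70.0, 80.0]})

# .to_numpy() yields a plain numeric NumPy array
X = df.to_numpy()
print(X.shape)  # (3, 2)
```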
Preprocessing The Data
Standardization
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(X_train)
>>> standardized_X = scaler.transform(X_train)
>>> standardized_X_test = scaler.transform(X_test)
Normalization
>>> from sklearn.preprocessing import Normalizer
>>> scaler = Normalizer().fit(X_train)
>>> normalized_X = scaler.transform(X_train)
>>> normalized_X_test = scaler.transform(X_test)
Binarization
>>> from sklearn.preprocessing import Binarizer
>>> binarizer = Binarizer(threshold=0.0).fit(X)
>>> binary_X = binarizer.transform(X)
Encoding Categorical Features
>>> from sklearn.preprocessing import LabelEncoder
>>> enc = LabelEncoder()
>>> y = enc.fit_transform(y)
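As a quick sanity check (the labels here are illustrative), LabelEncoder maps string classes to integers in sorted class order, and the mapping can be inverted:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
# Classes are sorted alphabetically, so 'F' -> 0 and 'M' -> 1
y = enc.fit_transform(['M', 'M', 'F'])
print(y)                         # [1 1 0]
print(enc.inverse_transform(y))  # ['M' 'M' 'F']
```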
Imputing Missing Values
>>> from sklearn.impute import SimpleImputer
>>> imp = SimpleImputer(missing_values=0, strategy='mean')
>>> imp.fit_transform(X_train)
Generating Polynomial Features
>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(5)
>>> poly.fit_transform(X)
Training And Test Data
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)
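By default, train_test_split holds out 25% of the samples; a small sketch (toy data, made up for illustration) of the commonly used test_size and stratify options:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features, balanced binary labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# test_size=0.3 holds out 30% of samples for testing;
# stratify=y keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```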
Create Your Model
Supervised Learning Estimators
Linear Regression
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
Support Vector Machines (SVM)
>>> from sklearn.svm import SVC
>>> svc = SVC(kernel='linear')
Naive Bayes
>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()
K Nearest Neighbors (KNN)
>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
Unsupervised Learning Estimators
Principal Component Analysis (PCA)
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=0.95)
K-Means
>>> from sklearn.cluster import KMeans
>>> k_means = KMeans(n_clusters=3, random_state=0)
Model Fitting
Supervised Learning
>>> lr.fit(X, y)
>>> knn.fit(X_train, y_train)
>>> svc.fit(X_train, y_train)
Unsupervised Learning
>>> k_means.fit(X_train)
>>> pca_model = pca.fit_transform(X_train)
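After fitting, a PCA created with a fractional n_components (as in the model above) keeps just enough components to reach that variance threshold; a sketch on random data (shapes are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.random_sample((50, 5))

# n_components=0.95 keeps the smallest number of components
# whose cumulative explained variance reaches 95%
pca = PCA(n_components=0.95)
pca_model = pca.fit_transform(X_train)

# The retained components together explain at least 95% of the variance
print(pca.explained_variance_ratio_.sum())
```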
Prediction
Supervised Estimators
>>> y_pred = svc.predict(np.random.random((2,5)))
>>> y_pred = lr.predict(X_test)
>>> y_pred = knn.predict_proba(X_test)
Unsupervised Estimators
>>> y_pred = k_means.predict(X_test)
Evaluate Your Model's Performance
Classification Metrics
Accuracy Score
>>> knn.score(X_test, y_test)
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_test, y_pred)
Classification Report
>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test, y_pred))
Confusion Matrix
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))
Regression Metrics
Mean Absolute Error
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2]
>>> mean_absolute_error(y_true, y_pred)
Mean Squared Error
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(y_test, y_pred)
R2 Score
>>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)
Clustering Metrics
Adjusted Rand Index
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)
Homogeneity
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)
V-Measure
>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_pred)
Cross-Validation
>>> from sklearn.model_selection import cross_val_score
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(lr, X, y, cv=2))
Tune Your Model
Grid Search
>>> from sklearn.model_selection import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3), "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn,param_grid=params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)
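Putting the grid search together end to end on the iris data from the basic example (a sketch; the parameter grid mirrors the one above), best_params_ collects all of the winning settings at once, and the refit best estimator can score held-out data directly:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

params = {"n_neighbors": np.arange(1, 3),
          "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=params)
grid.fit(X_train, y_train)

# best_params_ is a dict of the winning settings;
# grid.score() uses the refit best estimator
print(grid.best_params_)
print(grid.score(X_test, y_test))
```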
Randomized Parameter Optimization
>>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn,
param_distributions=params,
cv=4,
n_iter=8,
random_state=5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
Going Further
Begin with our scikit-learn tutorial for beginners, in which you'll learn in an easy, step-by-step way how to explore handwritten digits data, how to create a model for it, how to fit your data to your model, and how to predict target values. In addition, you'll make use of Python's data visualization library matplotlib to visualize your results.
PS: Don't miss our Bokeh cheat sheet, pandas cheat sheet, or the Python cheat sheet for data science.
Original: https://www.datacamp.com/community/blog/scikit-learn-cheat-sheet
Author: Karlijn Willems