Google Developers' video series on decision tree basics >>> Youku
1. Telling apples from oranges
from sklearn import tree
from sklearn.datasets import load_iris
import numpy as np
from io import StringIO  # sklearn.externals.six has been removed; io.StringIO replaces it
import pydot
import matplotlib.pyplot as plt
# Decision tree
features = [[140, 1], [130, 1], [150, 0], [170, 0]] # second feature: 1 = smooth, 0 = bumpy
labels = [0, 0, 1, 1] # 0 = apple, 1 = orange
clf = tree.DecisionTreeClassifier() # decision tree classifier
clf = clf.fit(features, labels)
print(clf.predict([[150, 0]]))
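scikit-learn can also print the rules a trained tree has learned, which makes the toy example easier to inspect. A minimal sketch using `tree.export_text` on the data above (the feature names "weight" and "texture" are my own labels, not from the notes):

```python
from sklearn import tree

# toy data from the notes: [weight, texture]; texture 1 = smooth, 0 = bumpy
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]  # 0 = apple, 1 = orange

clf = tree.DecisionTreeClassifier().fit(features, labels)
rules = tree.export_text(clf, feature_names=["weight", "texture"])
print(rules)  # prints the split(s) the tree learned, one node per line
```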
2. Iris
This is a very classic dataset; I worked with it in an earlier assignment as well, just not with a decision tree.
# 1. Load and print the dataset
iris = load_iris()
print(iris.feature_names) # feature names
print(iris.target_names) # class names
print(iris.data[0])
for i in range(len(iris.target)):
    print(i, iris.target[i], iris.data[i])
# 2. Train the classifier: split the data into training and test sets
test_idx = [0, 50, 100]
# training set
train_target = np.delete(iris.target, test_idx) # remove 3 rows, i.e. one sample of each species
train_data = np.delete(iris.data, test_idx, axis=0)
# test set
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]
# create and train the decision tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(train_data, train_target)
print(clf.predict(test_data)) # predict
The basic workflow for decision-tree classification in this example:
1. Load the training data and the test data
2. Create the classifier: tree.DecisionTreeClassifier()
3. Train it: .fit(train_data, train_target)
4. Predict: predictions = clf.predict(test_data)
5. Check the overall accuracy (note the arguments are the true labels and the predictions, not the raw feature data):
from sklearn.metrics import accuracy_score
predictions = clf.predict(test_data)
print(accuracy_score(test_target, predictions))
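To see what the iris tree actually learned, recent scikit-learn versions can draw it with `tree.plot_tree`. A sketch (the Agg backend and the output file name `iris_tree.png` are my own choices, so it renders without a display):

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen, no display needed
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)

fig = plt.figure(figsize=(12, 8))
tree.plot_tree(clf, feature_names=iris.feature_names,
               class_names=list(iris.target_names), filled=True)
fig.savefig("iris_tree.png")  # file name is illustrative
```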
3. On choosing features
The features directly determine the classifier's accuracy. For example, when estimating the travel time to some destination, distance in kilometres works far better than the difference in latitude and longitude, and combining several features sharpens the result.
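A small sketch (not from the notes) of that point: a feature correlated with the class gives near-perfect accuracy, while a pure-noise feature leaves the tree guessing. The variable names `good` and `noise` and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)                              # two classes
good = (y * 10 + rng.normal(0, 1, n)).reshape(-1, 1)   # correlated with the class
noise = rng.normal(0, 1, (n, 1))                       # carries no signal at all

acc_good = cross_val_score(DecisionTreeClassifier(random_state=0), good, y).mean()
acc_noise = cross_val_score(DecisionTreeClassifier(random_state=0), noise, y).mean()
print(acc_good, acc_noise)  # the informative feature scores far higher
```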
4. A more complete iris example
# Trying out different classifiers
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import tree
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5)
my_classifier = tree.DecisionTreeClassifier()
# Instead of a decision tree you can use KNeighborsClassifier; accuracy ≈ 0.96
# from sklearn.neighbors import KNeighborsClassifier
# my_classifier = KNeighborsClassifier()
my_classifier.fit(X_train, y_train)
predictions = my_classifier.predict(X_test)
# check accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))
The dataset is named X and y because, to a classifier, the input and output behave like a function mapping from x to y:
y = mx + b
Start with initial values for m and b, then keep adjusting them with each pass of training data, iterating until they fit.
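The "adjust m and b on each iteration" idea can be sketched with gradient descent on a line (the names `m`, `b`, `lr` and the synthetic data are illustrative, a minimal sketch rather than what any classifier actually does internally):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 3 * x + 2 + rng.normal(0, 0.1, 100)   # true m = 3, b = 2, plus a little noise

m, b, lr = 0.0, 0.0, 0.01                 # initial guesses and learning rate
for _ in range(2000):
    pred = m * x + b
    # gradients of the mean squared error with respect to m and b
    grad_m = 2 * np.mean((pred - y) * x)
    grad_b = 2 * np.mean(pred - y)
    m -= lr * grad_m                      # correct m and b a little each pass
    b -= lr * grad_b

print(round(m, 1), round(b, 1))           # converges close to 3 and 2
```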
5. Rewriting the classifier with a simple KNN algorithm
KNN assigns a class by distance. Take the iris dataset: with 4 features, the data can be viewed as a four-dimensional space in which every record is a point. For each test point, compute its Euclidean distance to every training point and take the nearest point's class as the prediction; collect these into a result set, then compare it with the true test labels to get the accuracy.
# A hand-rolled classifier
from scipy.spatial import distance

def euc(a, b):
    return distance.euclidean(a, b)

class ScrappyKNN():
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        predictions = []
        for row in X_test:
            label = self.closest(row)
            predictions.append(label)
        return predictions

    def closest(self, row):
        best_dist = euc(row, self.X_train[0])
        best_index = 0
        for i in range(1, len(self.X_train)):
            dist = euc(row, self.X_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        return self.y_train[best_index]
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5)
my_classifier = ScrappyKNN()
my_classifier.fit(X_train, y_train)
predictions = my_classifier.predict(X_test)
# check accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))
Overall workflow:
1. Load the data: X is the feature set, y is the label set.
2. Split the data in half into a training set and a test set.
3. my_classifier.fit(X_train, y_train) stores the data; my_classifier.predict(X_test) computes the distance from each test point to every training point, takes the y_train value (the class) at the index of the shortest distance, and appends it to the predictions list — loosely like mapping a vector back to a label, as in vec2word.
4. Compare the resulting predictions with y_test to get the accuracy.
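The ScrappyKNN above only ever looks at the single nearest neighbor (k = 1). A sketch of generalizing it to k neighbors with a majority vote — the class name `ScrappyKNNK` and the parameter `k` are my own, not from the notes:

```python
from collections import Counter
import numpy as np

class ScrappyKNNK:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X_train, y_train):
        self.X_train = np.asarray(X_train)
        self.y_train = np.asarray(y_train)

    def predict(self, X_test):
        predictions = []
        for row in X_test:
            # distances from this test point to every training point
            dists = np.linalg.norm(self.X_train - row, axis=1)
            nearest = np.argsort(dists)[:self.k]
            # majority vote among the k nearest labels
            vote = Counter(self.y_train[nearest]).most_common(1)[0][0]
            predictions.append(vote)
        return predictions
```

Used the same way as ScrappyKNN (fit on X_train/y_train, then predict on X_test); with the iris split above, accuracy typically lands in the same ~0.9+ range as the k = 1 version.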