The Iris dataset is a classic dataset that is frequently used as an example in both statistical learning and machine learning. It contains 150 records in 3 classes, 50 records per class. Each record has 4 features: sepal length, sepal width, petal length, and petal width. These 4 features can be used to predict which of three species (iris-setosa, iris-versicolour, iris-virginica) a flower belongs to.
Reportedly, in practice the basic criterion for telling these three species apart is actually the seed (because the petals wither very easily).
0 Preparing the data
Below we run an exploratory analysis of iris. First, import the required packages and the dataset:
# Import the required packages
import numpy as np
import pandas as pd
from pandas import plotting
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn')  # on Matplotlib 3.6+ this style is named 'seaborn-v0_8'
import seaborn as sns
sns.set_style("whitegrid")
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# Load the dataset (raw string, so backslashes in the Windows path are not read as escapes)
iris = pd.read_csv(r'F:\pydata\dataset\kaggle\iris.csv', usecols=[1, 2, 3, 4, 5])
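If the Kaggle CSV is not available at that local path, a roughly equivalent DataFrame can be built from the copy of Iris bundled with scikit-learn. This is only a sketch: the name `iris_alt`, the renamed columns, and the `Iris-` prefix on the species labels are assumptions chosen to resemble the CSV used in this post, and the bundled values may differ slightly from the Kaggle file.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the Kaggle CSV.
bunch = load_iris(as_frame=True)

# Rename the feature columns to the names used throughout this post.
iris_alt = bunch.frame.rename(columns={
    'sepal length (cm)': 'SepalLengthCm',
    'sepal width (cm)': 'SepalWidthCm',
    'petal length (cm)': 'PetalLengthCm',
    'petal width (cm)': 'PetalWidthCm',
})

# Turn the integer target into species names; the 'Iris-' prefix is an
# assumption made to mimic the labels in the CSV.
iris_alt['Species'] = ['Iris-' + bunch.target_names[i] for i in bunch.target]
iris_alt = iris_alt.drop(columns='target')

print(iris_alt.shape)
```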
View the dataset information:
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
SepalLengthCm 150 non-null float64
SepalWidthCm 150 non-null float64
PetalLengthCm 150 non-null float64
PetalWidthCm 150 non-null float64
Species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
View the first 5 records of the dataset:
iris.head()
1 Exploratory analysis
First, look at the summary statistics for each feature column:
iris.describe()
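The overall summary mixes the three species together. As a complement, the sketch below computes per-species means, which already hints at how well each feature separates the classes. It assumes scikit-learn's bundled copy of Iris as a stand-in for the CSV, so the column names and species labels differ from those in this post; `df` and `per_species` are illustrative names.

```python
from sklearn.datasets import load_iris

# Assumption: the bundled Iris data replaces the local CSV.
bunch = load_iris(as_frame=True)
df = bunch.frame.drop(columns='target')
df['Species'] = bunch.target_names[bunch.target.to_numpy()]

# Mean of every feature within each species.
per_species = df.groupby('Species').mean()
print(per_species.round(2))
```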
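Violin plots and point plots show how each feature relates to species: the violin plots show the distribution of the values, and the point plots show the slope of the per-species means.
# Set the color theme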
antV = ['#1890FF', '#2FC25B', '#FACC14', '#223273', '#8543E0', '#13C2C2', '#3436c7', '#F04864']
# Draw the violin plots
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)
sns.despine(left=True)
sns.violinplot(x='Species', y='SepalLengthCm', data=iris, palette=antV, ax=axes[0, 0])
sns.violinplot(x='Species', y='SepalWidthCm', data=iris, palette=antV, ax=axes[0, 1])
sns.violinplot(x='Species', y='PetalLengthCm', data=iris, palette=antV, ax=axes[1, 0])
sns.violinplot(x='Species', y='PetalWidthCm', data=iris, palette=antV, ax=axes[1, 1])
plt.show()
# Draw the point plots
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)
sns.despine(left=True)
sns.pointplot(x='Species', y='SepalLengthCm', data=iris, color=antV[0], ax=axes[0, 0])
sns.pointplot(x='Species', y='SepalWidthCm', data=iris, color=antV[0], ax=axes[0, 1])
sns.pointplot(x='Species', y='PetalLengthCm', data=iris, color=antV[0], ax=axes[1, 0])
sns.pointplot(x='Species', y='PetalWidthCm', data=iris, color=antV[0], ax=axes[1, 1])
plt.show()
Generate a matrix plot of the pairwise relationships between the features:
g = sns.pairplot(data=iris, palette=antV, hue= 'Species')
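Andrews curves visualize multidimensional data by mapping each observation to a function: the feature values of an observation are used as the coefficients of a Fourier series, so every row becomes one curve. This is useful for spotting outliers and for seeing whether the classes form distinct groups of curves.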
plt.subplots(figsize = (10,8))
plotting.andrews_curves(iris, 'Species', colormap='cool')
plt.show()
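To make the mapping concrete, here is a minimal sketch of the function behind an Andrews curve, written from the standard definition f_x(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ... rather than taken from the pandas source. The helper name `andrews_curve` and the sample measurements are illustrative assumptions.

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate f_x(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ..."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    # Constant term from the first feature.
    result = np.full_like(t, x[0] / np.sqrt(2.0))
    # Remaining features alternate sin/cos, with the frequency rising every two terms.
    for i, coef in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2                  # 1, 1, 2, 2, 3, ...
        basis = np.sin if i % 2 == 1 else np.cos
        result = result + coef * basis(harmonic * t)
    return result

# One curve for a single four-feature observation (example values).
t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve([5.1, 3.5, 1.4, 0.2], t)
print(curve[:3])
```

Each row of the dataset yields one such curve, and the plot colors the curves by species.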
Next, visualize linear regressions for the sepal and for the petal measurements:
g = sns.lmplot(data=iris, x='SepalWidthCm', y='SepalLengthCm', palette=antV, hue='Species')
g = sns.lmplot(data=iris, x='PetalWidthCm', y='PetalLengthCm', palette=antV, hue='Species')
Finally, use a heatmap to find the correlations between the features in the dataset. A large positive or negative value indicates that two features are highly correlated:
fig=plt.gcf()
fig.set_size_inches(12, 8)
# Drop the non-numeric Species column before computing the correlations
ax = sns.heatmap(iris.drop(columns='Species').corr(), annot=True, cmap='GnBu', linewidths=1, linecolor='k', square=True, mask=False, vmin=-1, vmax=1, cbar_kws={"orientation": "vertical"}, cbar=True)
The heatmap shows that sepal width and sepal length are nearly uncorrelated, while petal width and petal length are highly correlated.
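The numbers in a correlation heatmap are Pearson coefficients. As a sketch of what is being computed, the snippet below evaluates the coefficient by hand for the two pairs discussed here. It assumes scikit-learn's bundled copy of Iris in place of the CSV, and `pearson` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import load_iris

# Columns: sepal length, sepal width, petal length, petal width.
X = load_iris().data

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_sepal = pearson(X[:, 0], X[:, 1])   # sepal length vs sepal width
r_petal = pearson(X[:, 2], X[:, 3])   # petal length vs petal width
print(round(r_sepal, 2), round(r_petal, 2))
```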
2 Machine learning
Next, we use machine learning to predict the species from the sepal and petal measurements.
Before training, the dataset has to be split into training and test sets. First, label encoding converts the names of the 3 iris species into categorical values (0, 1, 2).
# Load the feature set and the label set
X = iris[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = iris['Species']
# Encode the label set
encoder = LabelEncoder()
y = encoder.fit_transform(y)
print(y)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
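The integer codes are only useful if they can be mapped back to species names. The sketch below shows, on a small hand-written list of labels (an illustrative assumption, not rows from the dataset), that `LabelEncoder` assigns codes in sorted order of the class names and that `inverse_transform` recovers the originals.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels, deliberately out of order.
labels = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

enc = LabelEncoder()
codes = enc.fit_transform(labels)

print(list(enc.classes_))                    # class names, sorted; the index is the code
print(codes)                                 # integer code of each label
print(list(enc.inverse_transform(codes)))    # back to the original names
```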
Then split the dataset into training data and test data at a ratio of 7:3:
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.3, random_state = 101)
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
(105, 4) (105,) (45, 4) (45,)
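A plain random split does not guarantee that each species is equally represented in the test set. One possible variation, not used in this post, is to pass `stratify=y` so that both parts keep the 1:1:1 class balance. The sketch assumes scikit-learn's bundled Iris data in place of the CSV.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y)

train_counts = np.bincount(y_tr)   # samples per class in the training set
test_counts = np.bincount(y_te)    # samples per class in the test set
print(train_counts, test_counts)
```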
Check the accuracy of different models:
# Support Vector Machine
model = svm.SVC()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the SVM is: {0}'.format(metrics.accuracy_score(prediction,test_y)))
The accuracy of the SVM is: 1.0
# Logistic Regression
model = LogisticRegression()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the Logistic Regression is: {0}'.format(metrics.accuracy_score(prediction,test_y)))
The accuracy of the Logistic Regression is: 0.9555555555555556
# Decision Tree
model=DecisionTreeClassifier()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the Decision Tree is: {0}'.format(metrics.accuracy_score(prediction,test_y)))
The accuracy of the Decision Tree is: 0.9555555555555556
# K-Nearest Neighbours
model=KNeighborsClassifier(n_neighbors=3)
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the KNN is: {0}'.format(metrics.accuracy_score(prediction,test_y)))
The accuracy of the KNN is: 1.0
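With only 45 test samples, a single misclassified flower changes the accuracy by about 2 percentage points, so scores from one split should be read with care. As a complementary sketch, the same four model types can be compared with 5-fold cross-validation. It assumes scikit-learn's bundled Iris data; `max_iter=1000` and `random_state=0` are settings added here, not taken from this post.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    'SVM': SVC(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'KNN': KNeighborsClassifier(n_neighbors=3),
}

# Mean accuracy over 5 folds for each model.
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```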
The models above used all of the features in the dataset. Below, the petal measurements and the sepal measurements are used separately:
petal = iris[['PetalLengthCm', 'PetalWidthCm', 'Species']]
train_p,test_p=train_test_split(petal,test_size=0.3,random_state=0)
train_x_p=train_p[['PetalWidthCm','PetalLengthCm']]
train_y_p=train_p.Species
test_x_p=test_p[['PetalWidthCm','PetalLengthCm']]
test_y_p=test_p.Species
sepal = iris[['SepalLengthCm', 'SepalWidthCm', 'Species']]
train_s,test_s=train_test_split(sepal,test_size=0.3,random_state=0)
train_x_s=train_s[['SepalWidthCm','SepalLengthCm']]
train_y_s=train_s.Species
test_x_s=test_s[['SepalWidthCm','SepalLengthCm']]
test_y_s=test_s.Species
model=svm.SVC()
model.fit(train_x_p,train_y_p)
prediction=model.predict(test_x_p)
print('The accuracy of the SVM using Petals is: {0}'.format(metrics.accuracy_score(prediction,test_y_p)))
model.fit(train_x_s,train_y_s)
prediction=model.predict(test_x_s)
print('The accuracy of the SVM using Sepal is: {0}'.format(metrics.accuracy_score(prediction,test_y_s)))
The accuracy of the SVM using Petals is: 0.9777777777777777
The accuracy of the SVM using Sepal is: 0.8
model = LogisticRegression()
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the Logistic Regression using Petals is: {0}'.format(metrics.accuracy_score(prediction,test_y_p)))
model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the Logistic Regression using Sepals is: {0}'.format(metrics.accuracy_score(prediction,test_y_s)))
The accuracy of the Logistic Regression using Petals is: 0.6888888888888889
The accuracy of the Logistic Regression using Sepals is: 0.6444444444444445
model=DecisionTreeClassifier()
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the Decision Tree using Petals is: {0}'.format(metrics.accuracy_score(prediction,test_y_p)))
model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the Decision Tree using Sepals is: {0}'.format(metrics.accuracy_score(prediction,test_y_s)))
The accuracy of the Decision Tree using Petals is: 0.9555555555555556
The accuracy of the Decision Tree using Sepals is: 0.6666666666666666
model=KNeighborsClassifier(n_neighbors=3)
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the KNN using Petals is: {0}'.format(metrics.accuracy_score(prediction,test_y_p)))
model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the KNN using Sepals is: {0}'.format(metrics.accuracy_score(prediction,test_y_s)))
The accuracy of the KNN using Petals is: 0.9777777777777777
The accuracy of the KNN using Sepals is: 0.7333333333333333
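It is easy to see that models trained on the petal measurements are more accurate than those trained on the sepal measurements. As the heatmap in the exploratory analysis showed, the correlation between sepal width and sepal length is very low, while the correlation between petal width and petal length is very high.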
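The petal-versus-sepal comparison can also be checked with cross-validation instead of a single split. The sketch below does this for KNN only, assuming scikit-learn's bundled Iris data; `subsets` and `subset_acc` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# The first two columns are sepal measurements, the last two are petal measurements.
subsets = {'sepal': X[:, :2], 'petal': X[:, 2:]}

# Mean 5-fold accuracy of a 3-nearest-neighbours model on each feature subset.
subset_acc = {
    name: cross_val_score(KNeighborsClassifier(n_neighbors=3), feats, y, cv=5).mean()
    for name, feats in subsets.items()
}
print(subset_acc)
```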