ML11 - Applying the sklearn library

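This topic uses the sklearn library and machine learning algorithms to implement a classic face recognition experiment. The main contents are:
  1. processing the face data;
  2. dimensionality reduction;
  3. applying the SVM algorithm;
  4. cross-validation and classification analysis.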

Face databases: introduction and download

sklearn's online face dataset

sklearn provides many offline and online datasets; its face dataset is LFW (Labeled Faces in the Wild).

Loading function

from sklearn.datasets import fetch_lfw_people
# faces = fetch_lfw_people()
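
As a rough sketch of what the commented-out call does once enabled: fetch_lfw_people downloads LFW on first use, caches it, and returns the images together with flattened data and labels. The wrapper name load_lfw and the values min_faces_per_person=70 and resize=0.4 are illustrative assumptions, not settings taken from this article.

```python
from sklearn.datasets import fetch_lfw_people

def load_lfw(min_faces=70, resize=0.4):
    """Download (on first call) and return the LFW faces.

    load_lfw and its default values are illustrative assumptions.
    """
    faces = fetch_lfw_people(min_faces_per_person=min_faces, resize=resize)
    # faces.images: (n_samples, h, w); faces.data: (n_samples, h*w)
    # faces.target: integer person ids; faces.target_names: person names
    return faces.images, faces.data, faces.target, faces.target_names

# Not called here because the first call triggers a download of roughly 200 MB:
# images, data, target, names = load_lfw()
```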

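Image loading process
- The images total about 200 MB, so loading takes a while.
- Figure: the online download in progress
Where the downloaded images are stored
- On macOS the location is $HOME/scikit_learn_data/lfw_home/
- The download is very slow.
Loading sklearn's online datasets on macOS can fail with an SSL connection error. Running the certificate installation tool that ships with Python installs the certificates and resolves the problem.
- Figure: the SSL certificate installation tool

The installation looks like this:
- Figure: the SSL certificate installation process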

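The Cambridge face database

A very well-known face database. Official download page:
- https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
- Figure: the face database home page

The downloaded images:
- 400 face images in total: 40 people, 10 images per person;
- each face image is $92 \times 112$ pixels, grayscale with 256 gray levels;
- Figure: the downloaded face database

Note:
- Because it is easier to download, we use the Cambridge face database.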

人臉數(shù)據(jù)加載與加載格式

加載格式說明

人臉圖像格式pgm格式：
- 一種用于Unix平臺的數(shù)據(jù)格式太颤。
圖像讀取方法：
- cv2模塊讀取
  - 返回三維圖像苞俘，包含圖像的顏色深度
- matplotlib.pyplot模塊讀取
  - 返回二維或者三維數(shù)據(jù)，如果是灰度圖龄章，就范圍二維吃谣，如果是彩色圖就返回三維，包含顏色深度
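
To make the format concrete, here is a minimal sketch that writes and reads a binary (P5) PGM file by hand. It assumes the simplest header layout (magic number, width and height, maximum gray value, each on its own line, no comment lines); the helper names write_pgm and read_pgm and the random stand-in image are hypothetical.

```python
import os
import tempfile
import numpy as np

def write_pgm(path, img):
    """Write a 2-D uint8 array as a binary (P5) PGM file."""
    height, width = img.shape
    with open(path, "wb") as f:
        f.write(f"P5\n{width} {height}\n255\n".encode("ascii"))
        f.write(img.astype(np.uint8).tobytes())

def read_pgm(path):
    """Read a binary PGM with a plain three-line header (no comment lines)."""
    with open(path, "rb") as f:
        if f.readline().strip() != b"P5":
            raise ValueError("not a binary PGM file")
        width, height = map(int, f.readline().split())
        maxval = int(f.readline())
        if maxval > 255:
            raise ValueError("only 8-bit PGM is handled in this sketch")
        pixels = np.frombuffer(f.read(width * height), dtype=np.uint8)
    return pixels.reshape(height, width)   # rows = height, columns = width

rng = np.random.default_rng(0)
fake_face = rng.integers(0, 256, size=(112, 92), dtype=np.uint8)  # synthetic stand-in image
with tempfile.TemporaryDirectory() as tmp:
    pgm_path = os.path.join(tmp, "1.pgm")
    write_pgm(pgm_path, fake_face)
    loaded = read_pgm(pgm_path)
print(loaded.shape, loaded.dtype)
```

The result is a 2-D array with 112 rows and 92 columns, which is the shape a grayscale reader returns. With cv2, passing the cv2.IMREAD_GRAYSCALE flag to cv2.imread also yields a 2-D array directly.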

import matplotlib.pyplot as plt
import cv2 
img_cv2 = cv2.imread('./att_faces/s1/1.pgm')
print(img_cv2.shape)
img_plt = plt.imread('timg-3.jpeg')
print(img_plt.shape)        

(112, 92, 3)
(400, 600, 3)

Loading code

Implementation with matplotlib.pyplot

import matplotlib.pyplot as plt
import numpy as np
ONE_PERSON_FACE_NUM = 10
PERSON_NUM = 40
SAMPLES_NUM= ONE_PERSON_FACE_NUM * PERSON_NUM
IMAGE_W = 92
IMAGE_H = 112

def load_faces(face_path_):
    data_faces_ = np.zeros(shape=(SAMPLES_NUM, IMAGE_H*IMAGE_W), dtype=np.int32)
    label_faces_ = np.zeros(shape=(SAMPLES_NUM, 1), dtype=np.int32)
    idx = 0 
    for i in range(1, PERSON_NUM + 1):    # 40 directories (s1 - s40), one per person
        for j in range(1,ONE_PERSON_FACE_NUM + 1):   # 10 face images per person (1.pgm - 10.pgm)
            path_ = face_path_ + "/s" + str(i) + "/"+ str(j) + ".pgm"
            img_ = plt.imread(path_)
            data_faces_[idx, :] = img_.reshape(IMAGE_H*IMAGE_W)
            label_faces_[idx,:]= i    # the label is the person index from the directory name
            idx += 1
    return data_faces_, label_faces_

data,target = load_faces('./att_faces')

# data.shape,target.shape

Implementation with cv2

import cv2
import numpy as np
ONE_PERSON_FACE_NUM = 10
PERSON_NUM = 40
SAMPLES_NUM= ONE_PERSON_FACE_NUM * PERSON_NUM
IMAGE_W = 92
IMAGE_H = 112

def load_faces(face_path_):
    data_faces_ = np.zeros(shape=(SAMPLES_NUM, IMAGE_H*IMAGE_W), dtype=np.uint8)
    label_faces_ = np.zeros(shape=(SAMPLES_NUM, 1), dtype=np.uint8)
    idx = 0 
    for i in range(1, PERSON_NUM + 1):    # 40 directories (s1 - s40), one per person
        for j in range(1,ONE_PERSON_FACE_NUM + 1):   # 10 face images per person (1.pgm - 10.pgm)
            path_ = face_path_ + "/s" + str(i) + "/"+ str(j) + ".pgm"
            img_ = cv2.imread(path_)    # 3-channel BGR array by default
            img_gray_ = cv2.cvtColor(img_, cv2.COLOR_BGR2GRAY)    # collapse to a 2-D grayscale image
            data_faces_[idx, :] = img_gray_.reshape(IMAGE_H*IMAGE_W)
            label_faces_[idx,:]= i    # the label is the person index from the directory name
            idx += 1
    return data_faces_, label_faces_

data,target = load_faces('./att_faces')

data.shape,target.shape

((400, 10304), (400, 1))

# Display with matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.imshow(data[6].reshape((IMAGE_H, IMAGE_W)), cmap=plt.cm.gray)   # the colormap can be changed
plt.show()

Figure: a sample face image

Splitting the data for validation

Use sklearn's cross-validation module (sklearn.model_selection) to split the dataset:
- training set: 80%
- test set: 20%

from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.2, random_state=42)
data_train.shape, target_train.shape

((320, 10304), (320, 1))
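
A plain random split does not guarantee that every person appears in the test set (the confusion matrix later in this article covers only 36 classes). As a hedged alternative, the sketch below passes stratify= to train_test_split so each person contributes the same share of test images. The data here is a synthetic stand-in with 40 people and 10 samples each; the variable names and the 16 random features are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 40 people x 10 images, 16 random features per image
labels = np.repeat(np.arange(1, 41), 10)
features = np.random.default_rng(0).normal(size=(labels.size, 16))

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)

# Count how many test images each person contributes
people, test_counts = np.unique(y_test, return_counts=True)
print(len(people), test_counts.min(), test_counts.max())
```

With 10 images per person and a 20% test share, stratification puts 2 images of every person in the test set and 8 in the training set.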

數(shù)據(jù)預(yù)處理-降維

由于像特征比較多做裙，一共10304個特征岗憋，建議降維處理。
降維的思想采用PCA方法锚贱。
降維的核心是保留多少特征仔戈，我們可以暫時(shí)保留20個。

降維訓(xùn)練

降維完畢我們做白化處理（規(guī)范化處理）
- 設(shè)置whiten=True

from sklearn.decomposition import PCA
n_components = 20
pca = PCA(n_components=n_components,  whiten=True, svd_solver='randomized')
pca = pca.fit(data_train)
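
To illustrate what whiten=True changes, the following sketch fits PCA with and without whitening on synthetic data and compares the spread of each projected component. The data, the choice of 5 components, and svd_solver='full' (chosen so the result is deterministic) are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data whose features have very different scales
X = rng.normal(size=(200, 30)) * np.linspace(1.0, 50.0, 30)

plain = PCA(n_components=5, svd_solver='full').fit(X)
white = PCA(n_components=5, whiten=True, svd_solver='full').fit(X)

Z_plain = plain.transform(X)
Z_white = white.transform(X)

print(Z_plain.std(axis=0, ddof=1).round(2))  # spread shrinks from component to component
print(Z_white.std(axis=0, ddof=1).round(2))  # roughly 1 for every component after whitening
```

Whitening divides each projected component by the square root of its explained variance, so every component reaches the classifier on the same scale. That matters for an RBF-kernel SVM, whose distances would otherwise be dominated by the first few components.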

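Eigenfaces

Fitting PCA amounts to a singular value decomposition, which yields eigenvectors and eigenvalues:
- An eigenvalue measures how much the data varies along its direction; the larger it is, the more that direction separates the images.
- The eigenvectors span the feature space of the data and together represent the facial features of all training samples.
  - Displaying them as eigenfaces shows that the eigenvectors retain the characteristics of the original faces.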
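
The link between PCA and the singular value decomposition can be checked numerically. This is a small sketch on synthetic data (the array sizes and k = 4 are arbitrary assumptions): it centers the data, runs numpy's SVD, and compares the result with a fitted PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 12))          # synthetic stand-in for flattened face images
k = 4

Xc = X - X.mean(axis=0)                # PCA centers the data first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

pca = PCA(n_components=k, svd_solver='full').fit(X)

# Principal axes are the leading right singular vectors (the sign of each is arbitrary)
same_axes = np.allclose(np.abs(pca.components_), np.abs(Vt[:k]))
# Eigenvalues of the covariance matrix follow from the singular values
same_variance = np.allclose(pca.explained_variance_, S[:k] ** 2 / (X.shape[0] - 1))
print(same_axes, same_variance)
```

Each row of pca.components_ has as many entries as an image has pixels, which is why a row can be reshaped to the image size and displayed as an eigenface.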

# Keep the eigenvectors that correspond to the principal components (20 were requested above).
eigenfaces = pca.components_.reshape((n_components, IMAGE_H, IMAGE_W))
pca.components_[0]

array([-0.00516626, -0.00519973, -0.00516024, ...,  0.00238201,
        0.00192719,  0.00229558])

# Display the eigenfaces
%matplotlib inline
import matplotlib.pyplot as plt

rows = 5
cols = 4
plt.figure(figsize=(1.8 * cols,  2.4 * rows))
for i in range(cols * rows):
    ax = plt.subplot(rows, cols, i + 1)
    plt.imshow(eigenfaces[i], cmap=plt.cm.gray)
    plt.xticks(())
    plt.yticks(())

plt.show()

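Figure: eigenfaces

Reducing the faces

Any face can be expressed as a combination of these eigenfaces, so face data described with the chosen number of components is reasonably reliable.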

pca_train = pca.transform(data_train)
pca_test = pca.transform(data_test)
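
One way to see how much of an image survives the reduction is to map the reduced codes back with inverse_transform and measure the reconstruction error. The sketch below does this on synthetic data built from 10 hidden patterns plus noise; the data, the component counts, and svd_solver='full' are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic stand-in: 100 "images" of 64 pixels built from 10 hidden patterns plus noise
patterns = rng.normal(size=(10, 64))
X = rng.normal(size=(100, 10)) @ patterns + 0.1 * rng.normal(size=(100, 64))

errors = {}
for k in (2, 5, 10):
    pca = PCA(n_components=k, whiten=True, svd_solver='full').fit(X)
    codes = pca.transform(X)                  # k numbers per image
    rebuilt = pca.inverse_transform(codes)    # back to 64 pixels (also undoes the whitening)
    errors[k] = float(np.mean((X - rebuilt) ** 2))

print(errors)
```

The mean squared error falls as more components are kept, which is the trade-off behind choosing the number of eigenfaces.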

Training and testing with sklearn's machine learning algorithms

Training

from sklearn.svm import SVC
classifier = SVC(kernel='rbf', C=1000, gamma=0.1)
classifier = classifier.fit(pca_train, target_train[:,0])

Testing

# Note: target_test was (awkwardly) built as a 2-D array, so take column 0 to compare it as 1-D.
pre = classifier.predict(pca_test)
correct_num=(pre == target_test[:,0]).sum()

print(F'''
Correct predictions: {correct_num},
Test samples: {len(target_test)},
Accuracy: {(100.0 * correct_num /len(target_test)): 5.2f}''')

Correct predictions: 78,
Test samples: 80,
Accuracy:  97.50
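
The reason the labels must be flattened before the comparison is NumPy broadcasting: comparing a 1-D prediction array with an (n, 1) column silently produces an n-by-n table instead of n element-wise results. A tiny sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import accuracy_score

pre = np.array([1, 2, 3, 3])                 # predictions, shape (4,)
target_2d = np.array([[1], [2], [3], [4]])   # labels stored as a column, shape (4, 1)

wrong = (pre == target_2d)          # broadcasts to a (4, 4) table of all pairs
right = (pre == target_2d[:, 0])    # element-wise comparison, shape (4,)

print(wrong.shape, wrong.sum())     # (4, 4) 4
print(right.shape, right.sum())     # (4,) 3
print(accuracy_score(target_2d.ravel(), pre))   # 0.75
```

accuracy_score returns the same ratio as dividing the count of correct predictions by the number of test samples.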

參數(shù)選擇與交叉驗(yàn)證

SVM的參數(shù)選擇

sklearn提供了SVM的參數(shù)選擇
- sklearn.model_selection.GridSearchCV

from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [1000, 5000, 10000, 50000, 100000],
    'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
}

classifier = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=4) # cv is set explicitly; the old iid argument was removed in scikit-learn 0.24
classifier = classifier.fit(pca_train, target_train[:,0])
pre = classifier.predict(pca_test)
correct_num=(pre == target_test[:,0]).sum()

print(F'''
Correct predictions: {correct_num},
Test samples: {len(target_test)},
Accuracy: {(100.0 * correct_num /len(target_test)): 5.2f}''')

Correct predictions: 77,
Test samples: 80,
Accuracy:  96.25
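
After fitting, a GridSearchCV object exposes what the search found. The sketch below shows the attributes most often inspected; it uses the small iris dataset bundled with sklearn purely as a stand-in, and the grid values are assumptions rather than the ones used for the faces.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # small bundled dataset, used only as a stand-in

grid = {'C': [1, 10, 100], 'gamma': [0.01, 0.1]}
search = GridSearchCV(SVC(kernel='rbf'), grid, cv=4).fit(X, y)

print(search.best_params_)                          # the winning parameter combination
print(round(search.best_score_, 3))                 # its mean cross-validated accuracy
print(len(search.cv_results_['mean_test_score']))   # one score per combination in the grid
```

best_score_ is measured on the validation folds inside the training data, so it can differ from the accuracy obtained on the separate test set.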

數(shù)據(jù)降維的特征數(shù)選擇

采用枚舉的方式分析特征個數(shù)與識別率的關(guān)系

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
import seaborn as sns
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Global constants
ONE_PERSON_FACE_NUM = 10
PERSON_NUM = 40
SAMPLES_NUM= ONE_PERSON_FACE_NUM * PERSON_NUM
IMAGE_W = 92
IMAGE_H = 112

# Load the image data
def load_faces(face_path_):
    data_faces_ = np.zeros(shape=(SAMPLES_NUM, IMAGE_H*IMAGE_W), dtype=np.int32)
    label_faces_ = np.zeros(shape=(SAMPLES_NUM, 1), dtype=np.int32)
    idx = 0 
    for i in range(1, PERSON_NUM + 1):    # 40 directories (s1 - s40), one per person
        for j in range(1,ONE_PERSON_FACE_NUM + 1):   # 10 face images per person (1.pgm - 10.pgm)
            path_ = face_path_ + "/s" + str(i) + "/"+ str(j) + ".pgm"
            img_ = plt.imread(path_)
            data_faces_[idx, :] = img_.reshape(IMAGE_H*IMAGE_W)
            label_faces_[idx,:]= i    # the label is the person index from the directory name
            idx += 1
    return data_faces_, label_faces_
print('Loading data......')
data,target = load_faces('./att_faces')

# Split the data
print('Splitting data......')
data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.2, random_state=42)

# Dimensionality reduction
print('Fitting PCA.......')
from sklearn.decomposition import PCA
n_components = 20
pca = PCA(n_components=n_components,  whiten=True, svd_solver='randomized')
pca = pca.fit(data_train)
print('Reducing dimensionality......')
pca_train = pca.transform(data_train)
pca_test = pca.transform(data_test)

# Try different numbers of components
print('Testing different numbers of components.....')
result_rate = {}
for  n_components in range(3, 100+1):
    # Fit the PCA
    pca = PCA(n_components=n_components,  whiten=True, svd_solver='randomized')
    pca = pca.fit(data_train)
    # Reduce the data
    pca_train = pca.transform(data_train)
    pca_test = pca.transform(data_test)
    # Train on the reduced data
    param_grid = {
        'C': [1000, 5000, 10000, 50000, 100000],
        'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
    }

    classifier = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=4) # cv is set explicitly; the old iid argument was removed in scikit-learn 0.24
    classifier = classifier.fit(pca_train, target_train[:,0])
    pre = classifier.predict(pca_test)
    correct_num=(pre == target_test[:,0]).sum()
    #print(n_components, ':', correct_num)
    result_rate[n_components] = correct_num
    

# Visualization
print('Visualizing......')
data_rate = pd.DataFrame(data={
    'n_components': list(result_rate.keys()),
    'correct_num': list(result_rate.values())
})

ax = sns.lineplot(data=data_rate, x='n_components', y='correct_num')
ax.figure.set_size_inches((12, 6))
plt.show()

Loading data......
Splitting data......
Fitting PCA.......
Reducing dimensionality......
Testing different numbers of components.....
Visualizing......

Figure: visualization of the cross-validation results
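
The enumeration above compares component counts by their score on the test set. A hedged alternative is to put PCA and the SVM into one Pipeline and let GridSearchCV choose the number of components by cross-validation on the training data, so the test set is used only once at the end. The sketch uses sklearn's bundled digits dataset as a stand-in for the faces; the grid values and step names are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)   # bundled 8x8 digit images, a stand-in for the faces
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

pipe = Pipeline([
    ('pca', PCA(whiten=True, svd_solver='full')),
    ('svc', SVC(kernel='rbf')),
])
grid = {
    'pca__n_components': [5, 10, 20],   # searched together with the SVM parameters
    'svc__C': [1, 10],
    'svc__gamma': [0.01, 0.1],
}
search = GridSearchCV(pipe, grid, cv=4).fit(X_train, y_train)

print(search.best_params_)
print(round(search.score(X_test, y_test), 3))   # accuracy on the held-out test set
```

Because the PCA step sits inside the pipeline, it is refitted on the training folds of every split, so the validation folds never influence the projection.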

# 根據(jù)上面圖形汰扭，基本上可以評估出性價(jià)比最佳特征數(shù)10的樣子稠肘，這樣可以提升速度，并且得到較好的識別率萝毛。
# 選擇svc最優(yōu)參數(shù)是：
classifier.best_estimator_

SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.005, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

# Fit the PCA
pca = PCA(n_components=10,  whiten=True, svd_solver='randomized')
pca = pca.fit(data_train)

# Reduce the data
pca_train = pca.transform(data_train)
pca_test = pca.transform(data_test)

# Classification
best_classifier = SVC(kernel='rbf', C=1000, gamma=0.001)
best_classifier = best_classifier.fit(pca_train, target_train[:,0])
pre = best_classifier.predict(pca_test)
correct_num=(pre == target_test[:,0]).sum()
print(pre.shape)
correct_num

(80,)

78

Model evaluation

Classification report

metrics.classification_report

from sklearn.metrics import classification_report, confusion_matrix
cls_report = classification_report(target_test[:,0], pre)
print(cls_report)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         4
           1       1.00      0.67      0.80         3
           2       1.00      1.00      1.00         1
           3       1.00      1.00      1.00         2
           4       1.00      1.00      1.00         4
           5       1.00      1.00      1.00         3
           6       1.00      1.00      1.00         3
           8       1.00      1.00      1.00         6
           9       1.00      1.00      1.00         2
          10       1.00      1.00      1.00         2
          11       1.00      1.00      1.00         2
          12       1.00      1.00      1.00         3
          13       1.00      1.00      1.00         2
          14       1.00      1.00      1.00         1
          15       1.00      1.00      1.00         3
          16       0.67      1.00      0.80         2
          18       1.00      1.00      1.00         3
          19       1.00      1.00      1.00         1
          20       1.00      1.00      1.00         1
          21       1.00      1.00      1.00         1
          22       1.00      1.00      1.00         1
          23       1.00      1.00      1.00         3
          24       1.00      1.00      1.00         2
          25       1.00      1.00      1.00         1
          26       1.00      1.00      1.00         1
          27       1.00      1.00      1.00         4
          28       1.00      0.50      0.67         2
          29       1.00      1.00      1.00         2
          30       1.00      1.00      1.00         1
          33       1.00      1.00      1.00         3
          34       1.00      1.00      1.00         1
          35       1.00      1.00      1.00         1
          36       1.00      1.00      1.00         1
          37       0.67      1.00      0.80         2
          38       1.00      1.00      1.00         2
          39       1.00      1.00      1.00         4

    accuracy                           0.97        80
   macro avg       0.98      0.98      0.97        80
weighted avg       0.98      0.97      0.97        80

Confusion matrix

metrics.confusion_matrix

con_matrix = confusion_matrix(target_test[:,0], pre)
print(con_matrix.shape)
print(con_matrix)

(36, 36)
[[4 0 0 ... 0 0 0]
 [0 2 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 ...
 [0 0 0 ... 2 0 0]
 [0 0 0 ... 0 2 0]
 [0 0 0 ... 0 0 4]]
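
The precision and recall columns of the classification report can be recovered from the confusion matrix, and its off-diagonal cells show which classes were confused. A small sketch with made-up labels (three classes, one mistake) illustrates this; the values are not taken from the face experiment.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 2, 2])   # made-up labels for illustration
y_pred = np.array([0, 0, 1, 2, 2, 2])

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])   # rows: true class, columns: predicted class
recall = cm.diagonal() / cm.sum(axis=1)       # correct / all samples of that true class
precision = cm.diagonal() / cm.sum(axis=0)    # correct / all predictions of that class

# Off-diagonal cells are the mistakes, as (true label, predicted label) pairs
mistakes = [(int(t), int(p)) for t, p in np.argwhere(cm > 0) if t != p]

print(cm)
print(recall, precision.round(2), mistakes)
```

Here one sample of class 1 is predicted as class 2, so class 1 has recall 0.5 and class 2 has precision 0.67. The same pattern explains rows in the report above such as recall 0.50 for class 28 and precision 0.67 for class 16.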
