1. Data Dimensionality
PCA (principal component analysis)
PCA is a general-purpose technique used across many kinds of data analysis, including feature set compression.
Whenever you are visualizing data, you can apply principal component analysis.
Two-dimensional data
One-dimensional data
This is not strictly one-dimensional data; there are small deviations here and there, but to make sense of the data I am happy to treat those deviations as noise and regard the data set as one-dimensional:
PCA is particularly good at handling shifts and rotations of the coordinate system.
6. PCA for Data Transformation
Given data of any shape,
PCA finds a new coordinate system that's obtained from the old one by translation and rotation only
PCA moves the center of the coordinate system to the center of the data
PCA moves the x-axis onto the principal axis of variation, where you see the most variation relative to all the data points
PCA moves the y-axis into an orthogonal, less important direction of variation
PCA finds these axes for you and tells you how important each of them is.
7. Center of the New Coordinate System
(2,3)
8. Principal Axis of the New Coordinate System
△x=1
△y=1
9. Second Principal Component of the New System
△x=-1
△y=1
When PCA actually outputs these vectors, it normalizes them so that each has length 1.
After normalizing the PCA component vectors:
△x (black) = 1/√2
△y (black) = 1/√2  # the new x-axis
The vectors for the new x-axis and the new y-axis are orthogonal to each other.
△x (red) = -1/√2
△y (red) = 1/√2  # the new y-axis
11. Practice Finding the New Axes
PCA also gives you an important number for each axis: its spread value.
For data like this, where the scatter off the main axis is small, the spread value tends to be large for the principal axis and much smaller for the second principal component axis.
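To make those numbers concrete, here is a minimal sketch (the toy data below is made up for illustration) showing that sklearn's PCA reports the center of the new coordinate system in pca.mean_, the unit-length axis vectors in pca.components_, and the spread value along each axis in pca.explained_variance_:
import numpy as np
from sklearn.decomposition import PCA

# made-up data: points scattered along the direction (1, 1), centered near (2, 3)
rng = np.random.RandomState(0)
t = rng.normal(scale=2.0, size=200)       # spread along the main axis
noise = rng.normal(scale=0.1, size=200)   # small spread along the second axis
X = np.column_stack([2 + (t - noise) / np.sqrt(2),
                     3 + (t + noise) / np.sqrt(2)])

pca = PCA(n_components=2)
pca.fit(X)
print(pca.mean_)                # roughly [2, 3], the center of the new coordinate system
print(pca.components_)          # unit vectors, roughly [0.707, 0.707] and [-0.707, 0.707] (signs may flip)
print(pca.explained_variance_)  # the spread value along each axis: large for the first, small for the second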
12. Which Data Can Be Used with PCA
Part of the beauty of PCA is that the data doesn't have to be perfectly 1D in order to find the principal axis!
13. When Does an Axis Dominate
Does the long axis dominate?
The long axis dominating means that its importance value, i.e. the eigenvalue of the long axis, is larger than the eigenvalue of the short axis.
14. Measurable vs. Latent Features Practice
Given some parameters of a house, which of the following algorithms would you use to predict its price?
□ decision tree classifier
□ SVC
□ √ linear regression
Because the output we want to predict is continuous, a classifier is not the right tool.
15. From Four Features to Two
Given some parameters of a house, predict its price.
Measurable features:
square footage
no. of rooms
school ranking
neighborhood safety
Latent features:
size
neighborhood
16. Compressing While Preserving Information
What is the best way to condense the four features into two, so that we really capture the core information?
What we actually want to probe are the two features size and neighborhood.
Which is the most appropriate feature selection tool?
□ √ SelectKBest (K = the number of features to keep)
□ SelectPercentile (specify the percentage of features you want to keep)
Because we already know we want two features, we use SelectKBest: it keeps the two strongest features and throws away all the rest.
If we also knew how many candidate features there were to begin with and how many we needed at the end, we could use SelectPercentile as well.
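As a minimal sketch of the two selectors (the regression data below is a made-up stand-in for the four housing features; f_regression is one reasonable scoring function for a continuous target like price):
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# hypothetical stand-in for the four housing features and the price target
features, prices = make_regression(n_samples=100, n_features=4, n_informative=2, random_state=0)

# keep exactly the 2 strongest features
k_best = SelectKBest(f_regression, k=2)
features_k = k_best.fit_transform(features, prices)
print(features_k.shape)  # (100, 2)

# keep the top 50% of features (2 out of 4)
pct = SelectPercentile(f_regression, percentile=50)
features_p = pct.fit_transform(features, prices)
print(features_p.shape)  # (100, 2)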
17. Composite Features
I have many features available, but suppose only a small subset of them is driving the patterns in the data; I will use that to construct a composite feature that helps me get at the underlying phenomenon.
This composite (combined) feature is also called a principal component. PCA is a very powerful algorithm; in this lesson we discuss it mainly for dimensionality reduction, i.e. shrinking a large set of features down to just a few.
PCA is also a very powerful standalone algorithm in unsupervised learning.
Example: turn square footage and no. of rooms into size.
The picture above looks a bit like linear regression, but PCA is not regression: linear regression tries to predict an output value from the inputs, whereas PCA does not predict anything; it finds the dominant direction of the data so that the data can be projected onto that direction while losing as little information as possible.
Once I have found the principal component, i.e. the direction of this vector, I apply a step called projection to every data point: the data starts out two-dimensional, but after I project it onto the principal component it becomes one-dimensional.
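A minimal sketch of that projection step, using made-up (square footage, no. of rooms) pairs; PCA with n_components=1 maps each two-dimensional point to a single composite coordinate:
import numpy as np
from sklearn.decomposition import PCA

# made-up (square footage, number of rooms) pairs
X = np.array([[1200, 3], [1500, 3], [1800, 4], [2100, 4], [2500, 5], [3000, 6]], dtype=float)

pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)   # project each 2-D point onto the principal component

print(X.shape)             # (6, 2) -- original two features
print(X_1d.shape)          # (6, 1) -- one composite "size"-like feature
print(pca.components_[0])  # direction of the principal component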
18. Maximal Variance
variance
- the willingness/flexibility of an algorithm to learn
- technical term in statistics -- roughly the 'spread' of a data distribution (similar to standard deviation)
A feature with large variance has samples spread over a wide range of values; if the variance is small, the samples tend to be tightly clustered together.
In the figure above, draw an ellipse around the data so that it contains most of the points. The ellipse can be parameterized by two numbers: the distance along its long axis and the distance along its short axis. Of these two lines, which one points in the direction of the data's maximal variance, i.e. along which direction are the data more spread out?
The long axis is the direction of maximal variance of the data.
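A quick numpy check of that claim on a made-up elliptical cloud: the variance of the data projected onto the long-axis direction is much larger than onto the short-axis direction.
import numpy as np

rng = np.random.RandomState(1)
# made-up elliptical cloud: wide along (1, 1), narrow along (-1, 1)
long_axis = np.array([1.0, 1.0]) / np.sqrt(2)
short_axis = np.array([-1.0, 1.0]) / np.sqrt(2)
X = rng.normal(scale=3.0, size=(500, 1)) * long_axis + rng.normal(scale=0.5, size=(500, 1)) * short_axis

# variance of the projections onto each candidate direction
print(np.var(X.dot(long_axis)))   # around 9: the long axis carries the maximal variance
print(np.var(X.dot(short_axis)))  # around 0.25: much smaller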
19. 最大方差的優(yōu)點
principal component of a data set is the direction that has the largest variance because ?
why do you think we define the principle component this way?
what's the advantage of looking for the direction that has the largest variance?
when we are doing our project of these two dimension feature space down on to one dimension,why do we project all the data points down onto this heavy red line instead of projecting them onto this shorter line?
□ 計算復(fù)雜度低
□ √可以最大程度保留來自原始數(shù)據(jù)的信息量
□ 只是一種慣例,并沒有什么實際的原因
當(dāng)我們沿著最大方差的維度進行映射時葬馋,它能夠保留原始數(shù)據(jù)中最多的信息
20. Maximal Variance and Information Loss
safety problems
+ school ranking
→(PCA) neighborhood quality
find the direction of maximal variance
The direction of maximal variance is the direction that minimizes the loss of information.
When I project these two-dimensional points onto this one-dimensional line, information is lost; the amount of information lost for a particular point equals the distance between that point and its new position on the line.
21. Information Loss and Principal Components
Information loss: the sum of the distances between each point and its newly projected position on the line.
When we maximize the variance, we are actually minimizing the total distance between the points and their projections onto the line.
projection onto direction of maximal variance minimizes distance from old(higher-dimensional) data point to its new transformed value
→ minimizes information loss
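A minimal sketch of measuring that information loss with sklearn (made-up data again): project onto the first principal component, map back with inverse_transform, and sum the squared distances between the original points and their projections.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(2)
# made-up 2-D data with one dominant direction
t = rng.normal(scale=2.0, size=(300, 1))
X = np.hstack([t, 0.5 * t]) + rng.normal(scale=0.3, size=(300, 2))

pca = PCA(n_components=1).fit(X)
X_projected = pca.inverse_transform(pca.transform(X))  # each point moved onto the principal axis

# total information loss: summed squared distance from each point to its projection
loss = np.sum((X - X_projected) ** 2)
print(loss)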
23. PCA for Feature Transformation
PCA as a general algorithm for feature transformation
We put all four features into PCA together; it automatically combines them into new features and ranks the relative power of those new features. If, as in our case, there are two hidden features driving most of the variation in the data, PCA will pick them out and make them the first and second principal components, the first principal component being the most influential one.
Because the first principal component is a mixture, it may contain a little of every feature, but this unsupervised algorithm is powerful enough to give you real insight into the hidden features in your data. Even if you knew nothing about housing prices, PCA would still let you draw conclusions such as: overall, two factors drive the variation in price. Whether those two factors are neighborhood and size is up to you to interpret. So besides reducing dimensionality, you also learn something important about the patterns of variation in your data.
25. Review / Definition of PCA
review/definition of PCA
- systematized way to transform input features into principal components
- use principal components as new features in regression/classification (see the sketch after this list)
- you can also rank the principal components: the more variance the data has along a given principal component, the higher that principal component is ranked. so the one with the most variance is the first principal component, the one with the second most is the second principal component, and so on.
- the principal components are all perpendicular to each other, so the second principal component is mathematically guaranteed not to overlap at all with the first, the third will not overlap with the first or the second, and so on. so you can treat them as independent features in a sense.
- there is a maximum number of principal components you can find: it equals the number of input features in your data set. usually you will only use the first handful of principal components, but you could go all the way out and use the maximum number. in that case, though, you are not really gaining anything; you are just representing your features in a different way. so PCA won't give you the wrong answer, but it gives you no advantage over the original input features if you use all of the principal components together in a regression or classification task.
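Here is the sketch referred to above: a minimal example (using the iris data purely as a stand-in for a generic labeled data set, and assuming a recent sklearn) of ranking components by explained_variance_ratio_ and feeding the transformed features to a classifier.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 4 input features, so at most 4 principal components
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pca = PCA(n_components=2).fit(X_train)
print(pca.explained_variance_ratio_)  # ranked: the first component explains the most variance

# use the principal components as the new features for a classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(pca.transform(X_train), y_train)
print(clf.score(pca.transform(X_test), y_test))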
26. Applying PCA to Real Data
In the next few videos, Katie and Sebastian examine some of Enron's financial data and look at how PCA applies to it.
Remember, to get the repository with the project code and this data set, visit:
https://github.com/udacity/ud120-projects
The Enron data is located in: final_project/
28. PCA in sklearn
def doPCA():
    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    pca.fit(data)
    return pca

pca = doPCA()
print pca.explained_variance_ratio_  # explained variance ratio: the eigenvalues in concrete form; it tells you what fraction of the data's variation the first/second principal component accounts for
first_pc = pca.components_[0]
second_pc = pca.components_[1]
transformed_data = pca.transform(data)
for ii, jj in zip(transformed_data, data):
    plt.scatter(first_pc[0] * ii[0], first_pc[1] * ii[0], color='r')
    plt.scatter(second_pc[0] * ii[1], second_pc[1] * ii[1], color='c')
    plt.scatter(jj[0], jj[1], color='b')
29. When to Use PCA
- latent features driving the patterns in data (big shots at Enron)
if you want access to latent features that you think might be showing up in the patterns in your data; maybe the entire point of what you're trying to do is to figure out whether there is a latent feature. in other words, you just want to know the size of the first principal component, for example to measure who the big shots are at Enron.
- dimensionality reduction
-- visualize high dimensional data
sometimes you have more than two features, so you need to represent three or four or many numbers about a data point, but you only have two dimensions in which to draw. what you can do is project the data down onto the first two principal components, plot just those, and draw that scatter plot.
-- reduce noise
the hope is that your strongest principal components, the first or the second, capture the actual patterns in the data, and the smaller principal components just represent noisy variations about those patterns; by throwing away the less important principal components, you get rid of that noise.
-- make other algorithms (regression, classification) work better with fewer inputs (eigenfaces)
using PCA as pre-processing before another algorithm, say a regression or classification task: if you have very high dimensionality and a complex classification algorithm, the algorithm can have very high variance, it can end up fitting to noise in the data, or it can end up running really slowly. lots of things can go wrong when some of these algorithms get very high input dimensionality, even though the algorithm might work really well for the problem at hand. so one thing you can do is use PCA to reduce the dimensionality of your input features so that your classification algorithm works better.
in the example of eigenfaces, a method of applying PCA to pictures of people, the input space has very high dimensionality: many, many pixels per picture. say you want to identify who is pictured in an image, i.e. you are running some kind of facial identification. with PCA you can reduce the very high input dimensionality to something maybe a factor of ten lower and feed that into an SVM, which then does the actual classification of figuring out who is pictured. the inputs are now the principal components instead of the original pixels of the images.
30. PCA for Facial Recognition
PCA for facial recognition
what makes facial recognition in pictures good for PCA?
□ √ pictures of faces generally have high input dimensionality (many pixels)
In this case dimensionality reduction matters a great deal, because an SVM would struggle with something like a million features.
□ √ faces have general patterns that could be captured in a smaller number of dimensions (two eyes on top, mouth/chin on bottom, etc.)
Between two portraits it is not that all million pixels differ; there are only a few main points of difference, and PCA may be able to pick those out and make the most of them.
□ × facial recognition is simple using machine learning (humans do it easily)
It is not: it would be very hard, for example, to implement facial recognition with a decision tree.
31. Eigenfaces Code
Combining PCA with an SVM is very powerful for facial recognition.
"""
===================================================
Faces recognition example using eigenfaces and SVMs
===================================================
The dataset used in this example is a preprocessed excerpt of the
"Labeled Faces in the Wild", aka LFW_:
http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)
.. _LFW: http://vis-www.cs.umass.edu/lfw/
original source: http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html
"""
print __doc__
from time import time
import logging
import pylab as pl
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.datasets import fetch_lfw_people
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import RandomizedPCA
from sklearn.svm import SVC
# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
###############################################################################
# Download the data, if not already on disk and load it as numpy arrays
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape
np.random.seed(42)
# for machine learning we use the data directly (as relative pixel
# position info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]
# the label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]
print "Total dataset size:"
print "n_samples: %d" % n_samples
print "n_features: %d" % n_features
print "n_classes: %d" % n_classes
###############################################################################
# Split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
###############################################################################
# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction
n_components = 150
print "Extracting the top %d eigenfaces from %d faces" % (n_components, X_train.shape[0])
t0 = time()
pca = RandomizedPCA(n_components=n_components, whiten=True).fit(X_train)  # figure out what the principal components are
print "the ratio is ", pca.explained_variance_ratio_  # explained variance of each principal component, e.g. 0.19346534, 0.15116844
print "done in %0.3fs" % (time() - t0)
eigenfaces = pca.components_.reshape((n_components, h, w)) #asks for the eigenfaces
print "Projecting the input data on the eigenfaces orthonormal basis"
t0 = time()
X_train_pca = pca.transform(X_train)  # transform the data into the principal components representation
X_test_pca = pca.transform(X_test)
print "done in %0.3fs" % (time() - t0)
###############################################################################
# Train a SVM classification model
print "Fitting the classifier to the training set"
t0 = time()
param_grid = {
'C': [1e3, 5e3, 1e4, 5e4, 1e5],
'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
}
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)  # fit the SVC using the principal components as the features
print "done in %0.3fs" % (time() - t0)
print "Best estimator found by grid search:"
print clf.best_estimator_
###############################################################################
# Quantitative evaluation of the model quality on the test set
print "Predicting the people names on the testing set"
t0 = time()
y_pred = clf.predict(X_test_pca)  # the SVC tries to identify who appears in each picture of the test set
print "done in %0.3fs" % (time() - t0)
print classification_report(y_test, y_pred, target_names=target_names)
print confusion_matrix(y_test, y_pred, labels=range(n_classes))
###############################################################################
# Qualitative evaluation of the predictions using matplotlib
def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
"""Helper function to plot a gallery of portraits"""
pl.figure(figsize=(1.8 * n_col, 2.4 * n_row))
pl.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
for i in range(n_row * n_col):
pl.subplot(n_row, n_col, i + 1)
pl.imshow(images[i].reshape((h, w)), cmap=pl.cm.gray)
pl.title(titles[i], size=12)
pl.xticks(())
pl.yticks(())
# plot the result of the prediction on a portion of the test set
def title(y_pred, y_test, target_names, i):
pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
return 'predicted: %s\ntrue: %s' % (pred_name, true_name)
prediction_titles = [title(y_pred, y_test, target_names, i)
for i in range(y_pred.shape[0])]
plot_gallery(X_test, prediction_titles, h, w)
# plot the gallery of the most significative eigenfaces
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
pl.show()
The eigenfaces are basically the principal components of the face data.
At the end, the script also shows you the eigenfaces.
Using the composite images produced by PCA as features for the SVM turns out to be very useful for predicting whose face appears in a picture.
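One practical note: the script above targets Python 2 and an older sklearn. If you are on a newer sklearn (roughly 0.18 or later, an assumption about your environment rather than part of the lesson), the deprecated modules it imports have moved, and the equivalents look roughly like this:
# equivalents in newer sklearn versions (assumed >= 0.18)
from sklearn.model_selection import train_test_split, GridSearchCV  # replaces sklearn.cross_validation / sklearn.grid_search
from sklearn.decomposition import PCA  # RandomizedPCA has been deprecated and later removed

# RandomizedPCA(n_components=150, whiten=True) becomes:
pca = PCA(n_components=150, whiten=True, svd_solver='randomized')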
33. PCA Mini-Project
We spent a lot of time on theory while discussing PCA, so in this mini-project we will ask you to write some sklearn code. The eigenfaces code is interesting and rich enough to serve as the testbed for this entire mini-project.
The starter code can be found in pca/eigenfaces.py. It is mostly taken from the example in the sklearn documentation.
Note that when running the code, one parameter of the SVC function called on line 94 of pca/eigenfaces.py has changed. For the 'class_weight' parameter, the argument string 'auto' is a valid value for sklearn version 0.16 and earlier, but it is deprecated by 0.19. If you are running sklearn version 0.17 or later, the expected argument string is 'balanced'. If you get an error or warning when running pca/eigenfaces.py, make sure line 98 contains the correct argument for the sklearn version you have installed.
sklearn 0.16 or earlier: class_weight='auto'
sklearn 0.17 or later: class_weight='balanced'
34. Explained Variance of Each Principal Component
We mentioned that PCA orders the principal components, with the first principal component having the largest variance, the second the second-largest variance, and so on. How much of the variance is explained by the first principal component? By the second?
print "the ratio is ", pca.explained_variance_ratio_  # explained variance of each principal component: 0.19346534 0.15116844
How much of the variation does the first principal component explain? 0.19346534
And the second? 0.15116844
We have found that the Pillow module (used in this example) can sometimes cause trouble. If you get an error related to the fetch_lfw_people() command, try the following command:
pip install --upgrade PILLOW
35. How Many Principal Components Should You Use?
Now you will experiment with keeping different numbers of principal components. In a multi-class classification problem like this one (more than two labels to apply), accuracy is a less intuitive metric than in the two-class case. Instead, a more popular metric is the F1 score.
We will learn about the F1 score properly in the lesson on evaluation metrics, but for now work out for yourself whether a good classifier is characterized by a high or a low F1 score. You will determine this by varying the number of principal components and watching how the F1 score changes in response.
as you add more principal components as features for training your classifier, do you expect it to get better or worse performance?
□ √ could go either way
While ideally, adding components should provide us additional signal to improve our performance, it is possible that we end up at a complexity where we overfit.
36. F1 Score vs. Number of Principal Components Used
Change n_components to the following values: [10, 15, 25, 50, 100, 250]. For each number of principal components, note the F1 score for Ariel Sharon. (With 10 principal components the plotting functionality in the code will break, but you should still be able to see the F1 scores.)
If you see a higher F1 score, does that mean the classifier is doing better or worse?
Ariel Sharon f-score
n_components = 150 f-score=0.65
n_components = 10 f-score=0.11
n_components = 15 f-score=0.33
n_components = 50 f-score=0.67
n_components = 100 f-score=0.67
n_components = 250 f-score=0.62
□ √ better
37. Dimensionality Reduction and Overfitting
did you see any evidence of overfitting when using a large number of PCs? does PCA dimensionality reduction help improve performance?
□ √ yes, performance starts to drop with many PCs.
38. Selecting the Number of Principal Components
selecting a number of principal components
think about how many principal components you should look at.
there is no cut-and-dried answer for how many principal components you should use; you kind of have to figure it out.
what's a good way to figure out how many PCs to use?
□ × just take the top 10%
□ √ train on different numbers of PCs and see how accuracy responds; cut off when it becomes apparent that adding more PCs doesn't buy you much more discrimination (see the sketch below)
□ × perform feature selection on the input features before putting them into PCA, then use as many PCs as you have input features
PCA is going to find a way to combine information from potentially many different input features, so if you throw out input features before you do PCA, you are throwing away information that PCA might be able to rescue, in a sense. it's fine to do feature selection on the principal components after you have made them, but you want to be very careful about throwing out information before performing PCA.
PCA can be fairly computationally expensive, so if you have a very large input feature space and you know that a lot of the features are potentially completely irrelevant, go ahead and try tossing them out; just proceed with caution.
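A minimal sketch of that "train on different numbers of PCs" approach, in the spirit of the eigenfaces script; it assumes X_train, X_test, y_train, y_test already exist as in that script, and uses a fixed, non-grid-searched SVC plus a macro-averaged F1 purely for illustration:
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.svm import SVC

# assumes X_train, X_test, y_train, y_test are defined as in the eigenfaces script above
for n_components in [10, 15, 25, 50, 100, 250]:
    pca = PCA(n_components=n_components, whiten=True, svd_solver='randomized').fit(X_train)
    clf = SVC(kernel='rbf', class_weight='balanced', C=1000, gamma=0.005)  # fixed params just for this sketch
    clf.fit(pca.transform(X_train), y_train)
    pred = clf.predict(pca.transform(X_test))
    # macro-averaged F1 over all people; watch where adding more PCs stops helping
    print("n_components=%d  macro F1=%.2f" % (n_components, f1_score(y_test, pred, average='macro')))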