1. Data dimensionality
PCA (principal component analysis)
PCA is a family of analysis techniques applied broadly across many kinds of data analysis, including feature set compression
6. PCA for data transformation
PCA finds a new coordinate system that's obtained from the old one by translation and rotation only
PCA moves the center of the coordinate system to the center of the data
PCA moves the x-axis onto the principal axis of variation, where you see the most variation relative to all the data points
PCA moves the y-axis onto an orthogonal, less important direction of variation
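The translation and rotation described above can be sketched for 2-D data in plain Python. This is only an illustration under assumptions: the function names and sample points are made up, and the rotation angle comes from the closed-form formula for a 2x2 covariance matrix rather than from sklearn.

```python
import math

def principal_angle(points):
    """Return (center, angle) of the main axis of variation of 2-D points."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix of the centered data.
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n
    # Closed-form orientation of the direction of largest variance.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (cx, cy), angle

def to_pca_coordinates(points):
    """Translate to the data center, then rotate onto the principal axes."""
    (cx, cy), angle = principal_angle(points)
    c, s = math.cos(angle), math.sin(angle)
    return [((x - cx) * c + (y - cy) * s, -(x - cx) * s + (y - cy) * c)
            for x, y in points]

# Made-up points lying exactly on the line y = x + 1.
points = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0)]
center, angle = principal_angle(points)
new_points = to_pca_coordinates(points)
print(center)               # new origin: the center of the data
print(math.degrees(angle))  # how far the new x-axis is rotated
print(new_points)           # second coordinate is ~0: all variation is on the new x-axis
```

Because these points are perfectly 1-D, the second coordinate vanishes after the rotation; with noisy data it would be small but nonzero.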
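7. Center of the new coordinate system
8. Principal axes of the new coordinate system
After normalizing the PCA component vectors:
Δx (black) = 1/√2
Δy (black) = 1/√2   # new x-axis
Δx (red) = -1/√2
Δy (red) = 1/√2   # new y-axis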
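A quick check of the normalization, assuming the un-normalized arrows were (1, 1) for the new x-axis and (-1, 1) for the new y-axis (the notes only record the normalized values, so these starting vectors are an assumption):

```python
import math

def normalize(v):
    """Scale a 2-D vector to unit length."""
    length = math.hypot(v[0], v[1])
    return (v[0] / length, v[1] / length)

new_x_axis = normalize((1.0, 1.0))    # black arrow: direction of most variation
new_y_axis = normalize((-1.0, 1.0))   # red arrow: perpendicular direction

print(new_x_axis)
print(new_y_axis)
# The dot product of perpendicular axes is zero.
print(new_x_axis[0] * new_y_axis[0] + new_x_axis[1] * new_y_axis[1])
```

Each component comes out as plus or minus 1/√2, matching the values listed above.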
11. Practice finding the new axes
PCA also yields another important value: the spread value of each axis
12. Which data can be used for PCA
Part of the beauty of PCA is that the data doesn't have to be perfectly 1D in order to find the principal axis!
13. When does an axis dominate
The major axis dominates when its importance value, that is, its eigenvalue, is larger than the eigenvalue of the minor axis
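To make the eigenvalue comparison concrete, here is a minimal sketch that computes both eigenvalues of the 2x2 covariance matrix from its trace and determinant. The function name and both point sets are invented for illustration.

```python
import math

def covariance_eigenvalues(points):
    """Return (major, minor) eigenvalues of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] in closed form.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return half_trace + root, half_trace - root

elongated = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9)]
round_cloud = [(1, 0), (0, 1), (-1, 0), (0, -1)]

print(covariance_eigenvalues(elongated))    # major eigenvalue far larger than minor
print(covariance_eigenvalues(round_cloud))  # equal eigenvalues: no axis dominates
```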
□ decision tree classifier
□ √ linear regression
15. From four features to two
square footage
no. of rooms
school ranking
neighborhood safety
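16. Compressing while preserving information
□ SelectKBest (K is the number of features to keep)
□ √ SelectPercentile (specify the percentage of features you want to keep)
If we know how many candidate features there are to begin with and how many we need at the end, we can also use SelectPercentile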
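The link between "keep K features" and "keep a percentage of features" can be illustrated in plain Python. The helpers `select_k_best` and `select_percentile` below are hypothetical stand-ins, not sklearn's classes, and the relevance scores are made up; they only mirror what the K and percentile parameters mean.

```python
def select_k_best(scores, k):
    """Names of the k highest-scoring features."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

def select_percentile(scores, percentile):
    """Names of roughly the top `percentile` percent of features."""
    k = int(round(len(scores) * percentile / 100.0))
    return select_k_best(scores, k)

# Made-up relevance scores for the four housing features.
scores = {
    "square footage": 0.9,
    "no. of rooms": 0.7,
    "school ranking": 0.4,
    "neighborhood safety": 0.3,
}
print(select_k_best(scores, 2))
print(select_percentile(scores, 50))  # 2 of 4 features is 50 percent
```

Going from four features to two is therefore K = 2 or, equivalently here, percentile = 50.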
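The composite feature here is also called a principal component. PCA is a very powerful algorithm; in this lesson we mainly discuss it in the context of dimensionality reduction: lowering the dimensionality of the features so that a large pile of features shrinks to just a few
Example: taking square footage
18. Maximal variance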
variance has two meanings:
- the willingness/flexibility of an algorithm to learn
- technical term in statistics -- roughly the 'spread' of a data distribution (similar to standard deviation)
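For the statistical meaning, a small sketch with the standard library shows how variance and standard deviation both measure spread. The two data lists are made up.

```python
import statistics

narrow = [4.9, 5.0, 5.1, 5.0, 5.0]
wide = [1.0, 3.0, 5.0, 7.0, 9.0]

for name, data in (("narrow", narrow), ("wide", wide)):
    var = statistics.pvariance(data)   # population variance
    std = statistics.pstdev(data)      # standard deviation = sqrt(variance)
    print(name, var, std)
```

Both lists have the same mean, but the second is far more spread out, so its variance is much larger.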
19. Advantages of maximal variance
the principal component of a data set is the direction that has the largest variance because?
why do you think we define the principal component this way?
what's the advantage of looking for the direction that has the largest variance?
when we project this two-dimensional feature space down onto one dimension, why do we project all the data points onto this heavy red line instead of onto this shorter line?
□ low computational complexity
□ √ retains the maximum amount of information from the original data
□ it's just a convention, with no real reason behind it
20. Maximal variance and information loss
safety problems
+ school ranking
→(PCA) neighborhood quality
find the direction of maximal variance
21. Information loss and principal components
projection onto the direction of maximal variance minimizes the distance from the old (higher-dimensional) data point to its new transformed value
→ minimizes information loss
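The claim can be checked numerically. The sketch below, with made-up points and a hypothetical `projection_loss` helper, scans candidate directions and measures the summed squared distance between each centered point and its projection; the direction with the least loss should coincide with the direction of maximal variance.

```python
import math

def projection_loss(points, angle):
    """Sum of squared distances from centered points to a line at `angle`."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    ux, uy = math.cos(angle), math.sin(angle)   # unit vector along the line
    loss = 0.0
    for x, y in points:
        dx, dy = x - cx, y - cy
        along = dx * ux + dy * uy               # coordinate kept by the projection
        loss += dx * dx + dy * dy - along ** 2  # squared distance that is lost
    return loss

points = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9)]
# Scan directions from 0 to 179 degrees and keep the one with the least loss.
best_deg = min(range(180), key=lambda d: projection_loss(points, math.radians(d)))
print(best_deg)
print(projection_loss(points, math.radians(best_deg)))
print(projection_loss(points, math.radians(best_deg + 90)))  # perpendicular line
```

Projecting onto the perpendicular line instead throws away almost all of the spread, which is the information loss the notes refer to.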
23. PCA for feature transformation
PCA as a general algorithm for feature transformation
25. Review/definition of PCA
- systematized way to transform input features into principal components
- use principal components as new features in regression/classification
- you can also rank the principal components: the more variance the data has along a given principal component, the higher that principal component is ranked. So the one with the most variance is the first principal component, the one with the second-most variance is the second principal component, and so on.
- the principal components are all perpendicular to each other in a sense, so the second principal component is mathematically guaranteed not to overlap at all with the first, the third will not overlap with the first or the second, and so on. So you can treat them as independent features in a sense.
- there is a maximum number of principal components you can find: it's equal to the number of input features in your data set. Usually you'll only use the first handful of principal components, but you could go all the way and use the maximum number. In that case, though, you are not really gaining anything; you're just representing your features in a different way. PCA won't give you the wrong answer, but it gives you no advantage over the original input features if you use all of the principal components together in a regression or classification task.
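Three of the properties above (ranking by variance, perpendicular components, and nothing gained or lost when all components are kept) can be verified on a 2-D toy example. This is a plain-Python sketch with made-up points and a hypothetical `pca_2d` helper, not the sklearn implementation.

```python
import math

def pca_2d(points):
    """Return (components, variances, total variance) for 2-D points."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)  # direction of most variance
    components = [(math.cos(angle), math.sin(angle)),
                  (-math.sin(angle), math.cos(angle))]
    variances = []
    for ux, uy in components:
        coords = [(x - cx) * ux + (y - cy) * uy for x, y in points]
        variances.append(sum(c * c for c in coords) / n)
    return components, variances, sxx + syy

points = [(2.0, 1.0), (3.0, 4.0), (5.0, 4.5), (7.0, 8.0), (8.0, 7.5)]
(pc1, pc2), variances, total_variance = pca_2d(points)
print(variances)                           # ranked: first >= second
print(pc1[0] * pc2[0] + pc1[1] * pc2[1])   # perpendicular: dot product is 0
print(sum(variances), total_variance)      # all components together keep all the variance
```

With two input features there are at most two components, and keeping both simply re-expresses the same data in rotated coordinates.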
26. Applying PCA to real data
In the next few videos, Katie and Sebastian look at some of Enron's financial data with an eye toward applying PCA.
28. PCA in sklearn
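def doPCA():
    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    pca.fit(data)  # fit on the same `data` array that is passed to transform() below
    return pca

pca = doPCA()
# variance ratio: a concrete expression of the eigenvalues; it shows what fraction
# of the variation in the data the first/second principal component accounts for
print pca.explained_variance_ratio_
first_pc = pca.components_[0]
second_pc = pca.components_[1]
transformed_data = pca.transform(data)
for ii, jj in zip(transformed_data, data):
    pass  # the loop body is not recorded in these notes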
29. When to use PCA
- latent features driving the patterns in data (big shots at Enron)
If you want access to latent features that you think might be showing up in the patterns in your data. Maybe the entire point of what you're trying to do is figure out whether there's a latent feature; in other words, you just want to know the size of the first principal component, then measure who the big shots are at Enron.
- dimensionality reduction
-- visualize high-dimensional data
Sometimes you will have more than two features, so you have to represent three or four or many numbers about a data point when you only have two dimensions in which to draw. What you can do is project the data down to the first two principal components and draw that scatter plot.
-- reduce noise
The hope is that your strongest principal components, the first or the second, are capturing the actual patterns in the data, and the smaller principal components are just representing noisy variations about those patterns. So by throwing away the less important principal components, you're getting rid of that noise.
-- make other algorithms (regression, classification) work better with fewer inputs (eigenfaces)
Using PCA as pre-processing before you use another algorithm, such as a regression or a classification task. If you have very high dimensionality and, say, a complex classification algorithm, the algorithm can be very high variance: it can end up fitting to noise in the data, and it can end up running really slowly. Lots of things can happen when you have very high input dimensionality with some of these algorithms. But, of course, the algorithm might work really well for the problem at hand. So one of the things you can do is use PCA to reduce the dimensionality of your input features, so that your classification algorithm, say, works better.
In the example of eigenfaces, a method of applying PCA to pictures of people, the input space has very high dimensionality: there are many, many pixels in each picture. Say you want to identify who is pictured in the image, so you are running some kind of facial identification. With PCA you can reduce the very high input dimensionality to something maybe a factor of ten lower and feed this into an SVM, which then does the actual classification of figuring out who is pictured. The inputs are now the principal components instead of the original pixels of the images.
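Dimensionality reduction as a pre-processing step can be sketched without any libraries. The example below swaps in power iteration, a simple way to approximate the first principal component, for sklearn's PCA; the helper names and the three-feature data set are made up, and a real project would use sklearn instead.

```python
import math

def first_principal_component(rows, iterations=200):
    """Approximate the first principal component by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(rows, means, v):
    """Reduce each row to a single number: its coordinate along v."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]

# Three features that mostly move together (made-up numbers).
rows = [
    [1.0, 2.1, 2.9],
    [2.0, 3.9, 6.2],
    [3.0, 6.0, 9.1],
    [4.0, 8.1, 11.8],
    [5.0, 9.9, 15.0],
]
means, pc1 = first_principal_component(rows)
reduced = project(rows, means, pc1)
print(pc1)      # unit vector, roughly proportional to (1, 2, 3)
print(reduced)  # one number per row instead of three
```

The single reduced feature could then be fed to a downstream regression or classifier in place of the three original inputs.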
30. PCA for facial recognition
what makes facial recognition in pictures good for PCA?
□ √ pictures of faces generally have high input dimensionality (many pixels)
□ √ faces have general patterns that could be captured in a smaller number of dimensions (two eyes on top, mouth/chin on bottom, etc.)
□ × facial recognition is simple using machine learning (humans do it easily)
31. Eigenfaces code
"""
Faces recognition example using eigenfaces and SVMs
The dataset used in this example is a preprocessed excerpt of the
"Labeled Faces in the Wild", aka LFW_:
  http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)
.. _LFW: http://vis-www.cs.umass.edu/lfw/
original source: http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html
"""
print __doc__
from time import time
import logging
import pylab as pl
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.datasets import fetch_lfw_people
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import RandomizedPCA
from sklearn.svm import SVC
# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
# Download the data, if not already on disk and load it as numpy arrays
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape
# for machine learning we use the data directly (as relative pixel
# position info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]
# the label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]
print "Total dataset size:"
print "n_samples: %d" % n_samples
print "n_features: %d" % n_features
print "n_classes: %d" % n_classes
# Split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction
n_components = 150
print "Extracting the top %d eigenfaces from %d faces" % (n_components, X_train.shape[0])
t0 = time()
pca = RandomizedPCA(n_components=n_components, whiten=True).fit(X_train)  # figure out what the principal components are
print "the ratio is ", pca.explained_variance_ratio_  # explained variance of each principal component (first two: 0.19346534 0.15116844)
print "done in %0.3fs" % (time() - t0)
eigenfaces = pca.components_.reshape((n_components, h, w))  # ask for the eigenfaces
print "Projecting the input data on the eigenfaces orthonormal basis"
t0 = time()
X_train_pca = pca.transform(X_train)  # transform the data into the principal components representation
X_test_pca = pca.transform(X_test)
print "done in %0.3fs" % (time() - t0)
# Train a SVM classification model
print "Fitting the classifier to the training set"
t0 = time()
param_grid = {
    'C': [1e3, 5e3, 1e4, 5e4, 1e5],
    'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
}
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)  # SVC using the principal components as the features
print "done in %0.3fs" % (time() - t0)
print "Best estimator found by grid search:"
print clf.best_estimator_
# Quantitative evaluation of the model quality on the test set
print "Predicting the people names on the testing set"
t0 = time()
y_pred = clf.predict(X_test_pca)  # the SVC tries to identify who appears in each test-set picture
print "done in %0.3fs" % (time() - t0)
print classification_report(y_test, y_pred, target_names=target_names)
print confusion_matrix(y_test, y_pred, labels=range(n_classes))
# Qualitative evaluation of the predictions using matplotlib
def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits"""
    pl.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    pl.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        pl.subplot(n_row, n_col, i + 1)
        pl.imshow(images[i].reshape((h, w)), cmap=pl.cm.gray)
        pl.title(titles[i], size=12)
# plot the result of the prediction on a portion of the test set
def title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue: %s' % (pred_name, true_name)
prediction_titles = [title(y_pred, y_test, target_names, i)
                     for i in range(y_pred.shape[0])]
plot_gallery(X_test, prediction_titles, h, w)
# plot the gallery of the most significative eigenfaces
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
The eigenfaces are basically the principal components of the face data.
At the end, the script shows you the eigenfaces.
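33. PCA mini-project
We spent a lot of time on theory when discussing PCA, so in this mini-project we'll ask you to write some sklearn code. The eigenfaces code is interesting and rich enough to serve as the testbed for this entire mini-project.
The starter code can be found in pca/eigenfaces.py. It was mostly taken from the example in the sklearn documentation.
Note that when running the code, one parameter has changed for the SVC function called on line 94 of pca/eigenfaces.py. For the "class_weight" parameter, the argument string "auto" is a valid value for sklearn version 0.16 and earlier, but will be deprecated by 0.19. If you are running sklearn version 0.17 or later, the expected argument string is "balanced". If you get an error or warning when running pca/eigenfaces.py, make sure that line 98 contains the correct argument for your installed version of sklearn.
sklearn 0.16 or earlier: class_weight='auto'
sklearn 0.17 or later: class_weight='balanced'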
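If the same script has to run under several sklearn versions, the choice could be made in code. The helper below is hypothetical (it is not part of pca/eigenfaces.py) and assumes a plain "major.minor[.patch]" version string, such as the one sklearn exposes as `sklearn.__version__`.

```python
def class_weight_for(sklearn_version):
    """Pick the class_weight string for a given sklearn version string."""
    major, minor = (int(part) for part in sklearn_version.split(".")[:2])
    return "balanced" if (major, minor) >= (0, 17) else "auto"

for version in ("0.16.1", "0.17", "0.19.2"):
    print(version, class_weight_for(version))
```

Version strings with suffixes in the first two fields (for example a release candidate tag) would need extra parsing that this sketch does not attempt.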
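We mentioned that PCA orders the principal components: the first principal component has the largest variance, the second has the second-largest variance, and so on. How much of the variance is explained by the first principal component? The second?
print "the ratio is ", pca.explained_variance_ratio_  # explained variance of each principal component: 0.19346534 0.15116844
How much of the variance does the first principal component explain? 0.19346534
And the second? 0.15116844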
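As a rough illustration of where such ratios come from: each one is an eigenvalue divided by the total variance over all components, including the ones that are not kept. The eigenvalues below are made up; only the last two numbers are the ones recorded in these notes.

```python
def explained_variance_ratio(eigenvalues):
    """Fraction of the total variance carried by each principal component."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]

# Made-up eigenvalues, already sorted from largest to smallest.
eigenvalues = [4.0, 3.0, 2.0, 1.0]
ratios = explained_variance_ratio(eigenvalues)
print(ratios)
print(sum(ratios[:2]))  # share explained by the first two components together

# The two ratios recorded in these notes for the eigenfaces data.
first_two = 0.19346534 + 0.15116844
print(first_two)        # about 34 percent of the variance
```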
We have found that the Pillow module (used in this example) can sometimes cause trouble. If you get an error related to the fetch_lfw_people() command, try the following:
pip install --upgrade PILLOW
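Now you'll experiment with keeping different numbers of principal components. In a multiclass classification problem like this one (more than two labels to apply), accuracy is a less intuitive metric than in the two-class case. Instead, a more common metric is the F1 score (f1-score). We'll learn about the F1 score in the evaluation metrics lesson, but you'll figure out for yourself whether a good classifier is characterized by a high or a low F1 score. You'll do this by varying the number of principal components and watching the F1 score.
as you add more principal components as features for training your classifier, do you expect it to get better or worse performance?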
□ √ could go either way
While ideally, adding components should provide us additional signal to improve our performance, it is possible that we end up at a complexity where we overfit.
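36. F1 score vs. number of principal components used
Change n_components to the following values: [10, 15, 25, 50, 100, 250]. For each number of principal components, note the F1 score for Ariel Sharon. (For 10 principal components, the plotting functions in the code will break, but you should still be able to see the F1 scores.)
If you see a higher F1 score, does it mean the classifier is doing better or worse?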
Ariel Sharon f-score by n_components:
n_components = 10    f-score = 0.11
n_components = 15    f-score = 0.33
n_components = 50    f-score = 0.67
n_components = 100   f-score = 0.67
n_components = 150   f-score = 0.65
n_components = 250   f-score = 0.62
if you see a higher f1-score, does it mean the classifier is doing better, or worse?
□ √ better
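To see why higher is better, it helps to compute the score by hand: F1 is the harmonic mean of precision and recall for one class. The sketch below uses made-up labels and a hypothetical `f1_score_for` helper; in the mini-project the numbers come from sklearn's classification_report instead.

```python
def f1_score_for(label, y_true, y_pred):
    """Precision, recall and F1 for one class in a multiclass problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Made-up labels: "A" is the class of interest.
y_true = ["A", "A", "A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "B", "C", "A", "B", "C", "C"]
print(f1_score_for("A", y_true, y_pred))
```

A perfect classifier would have precision and recall of 1 and therefore an F1 of 1; every missed or wrongly assigned "A" pulls the score down.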
37. Dimensionality reduction and overfitting
Do you see any evidence of overfitting when using a large number of principal components? Does PCA's dimensionality reduction help performance?
□ √ yes,performance starts to drop with many PCs.
38. Selecting principal components
selecting a number of principal components
think about how many principal components you should look at.
there is no cut-and-dried answer for how many principal components you should use; you kind of have to figure it out
what's a good way to figure out how many PCs to use?
□ × just take top 10%
□ √ train on different numbers of PCs and see how accuracy responds; cut off when it becomes apparent that adding more PCs doesn't buy you much more discrimination
□ × perform feature selection on input features before putting them into PCA, then use as many PCs as you have input features.
PCA is going to find a way to combine information from potentially many different input features, so if you throw out input features before you do PCA, you are throwing away information that PCA might be able to rescue, in a sense. It's fine to do feature selection on the principal components after you have made them, but you want to be very careful about throwing out information before performing PCA.
PCA can be fairly computationally expensive, so if you have a very large input feature space and you know that a lot of the features are potentially completely irrelevant, go ahead and try tossing them out, but proceed with caution.
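The "train on different numbers of PCs and stop when the gain levels off" rule can be written down as a short sketch. It reuses the Ariel Sharon F1 scores recorded earlier in these notes; the `pick_n_components` helper and its `min_gain` threshold are assumptions made for illustration.

```python
# F1 scores for Ariel Sharon recorded earlier in these notes, keyed by n_components.
f1_by_components = {10: 0.11, 15: 0.33, 50: 0.67, 100: 0.67, 150: 0.65, 250: 0.62}

def pick_n_components(scores, min_gain=0.01):
    """Smallest n after which adding more PCs gains less than min_gain."""
    ns = sorted(scores)
    for current, following in zip(ns, ns[1:]):
        if scores[following] - scores[current] < min_gain:
            return current
    return ns[-1]

print(pick_n_components(f1_by_components))
```

With these scores the rule stops at 50 components: going to 100 adds nothing, and beyond that the score starts to drop, which matches the overfitting seen with many PCs.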