- Dimensionality reduction methods:
- principal component analysis
- canonical correlation analysis
- singular value decomposition
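
A minimal sketch of these three methods, assuming scikit-learn and random placeholder data (all shapes and component counts are illustrative, not from the notes):

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 100 samples, 10 features
Y = rng.normal(size=(100, 4))    # a second view of the data, for CCA

# PCA: project onto the directions of maximum variance
X_pca = PCA(n_components=3).fit_transform(X)

# CCA: find projections of X and Y that are maximally correlated
X_cca, Y_cca = CCA(n_components=2).fit_transform(X, Y)

# Truncated SVD: low-rank factorization (also works on sparse matrices)
X_svd = TruncatedSVD(n_components=3).fit_transform(X)

print(X_pca.shape, X_cca.shape, X_svd.shape)  # (100, 3) (100, 2) (100, 3)
```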
- Raw data preparation, three steps:
- data preprocessing
- feature engineering
- feature selection; feature selection itself has three methods (see the sketch after this list):
- filter: select the best subset directly
- wrapper: generate a subset ----> learning algorithm, repeated in a loop
- embedded method: generate a subset ----> learning algorithm + performance, repeated in a loop
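
A hedged sketch of the three feature-selection families; the concrete estimators (f_classif scoring, logistic regression, lasso) are common representatives chosen for illustration, not prescribed by these notes:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression, LassoCV

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Filter: score each feature independently, keep the best subset
X_filter = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Wrapper: repeatedly fit a learning algorithm, dropping weak features each round
X_wrapper = RFE(LogisticRegression(max_iter=1000),
                n_features_to_select=5).fit_transform(X, y)

# Embedded: selection happens inside training (the L1 penalty zeroes out features)
X_embedded = SelectFromModel(LassoCV()).fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```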
The process of machine learning
Some classification algorithms
- nearest neighbour
- linear SVM
- RBF SVM
- Gaussian process
- decision tree
- random forest
- neural net
- AdaBoost
- naive Bayes
- QDA
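
An illustrative comparison loop over several of the listed classifiers, assuming a toy two-moons dataset and scikit-learn defaults (no tuning):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

classifiers = {
    "nearest neighbour": KNeighborsClassifier(),
    "linear SVM": SVC(kernel="linear"),
    "RBF SVM": SVC(kernel="rbf"),
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "naive Bayes": GaussianNB(),
    "QDA": QuadraticDiscriminantAnalysis(),
}
for name, clf in classifiers.items():
    # fit on the training split, report accuracy on the held-out split
    print(name, clf.fit(X_tr, y_tr).score(X_te, y_te))
```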
Several algorithms
A. Regression
- Ordinal regression: data in rank-ordered categories
- Poisson regression: predicts event counts
- Fast forest quantile regression: predicts a distribution
- Linear regression: fast training, linear model
- Bayesian linear regression: linear model, small data sets
- neural network regression: accurate, long training times
- decision forest regression: accurate, fast training times
- boosted decision tree regression: accurate, fast training times, large memory footprint
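
An illustrative pass over scikit-learn counterparts of a few of these regressors; the mapping from the names above to these estimators is approximate, and the synthetic data is a placeholder:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

for model in (LinearRegression(),            # fast training, linear model
              BayesianRidge(),               # linear model, small data sets
              RandomForestRegressor(),       # "decision forest" analogue
              GradientBoostingRegressor()):  # "boosted decision tree" analogue
    # print each model's training R-squared (illustration only, not a benchmark)
    print(type(model).__name__, model.fit(X, y).score(X, y))
```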
B. Clustering
- K-means: unsupervised learning
C. Anomaly detection
- PCA-based anomaly detection: fast training times
- one-class SVM: under 100 features, aggressive boundary
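
A combined sketch of B and C, assuming scikit-learn: K-means for clustering, plus a PCA reconstruction-error score as one simple way to realize PCA-based anomaly detection (the score-only, threshold-free form is an illustrative choice):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))

# B. Clustering: K-means partitions unlabeled data into k groups
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# C. Anomaly detection: points poorly reconstructed from the top
# principal components receive a high anomaly score
pca = PCA(n_components=2).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))
scores = np.linalg.norm(X - X_rec, axis=1)

print(labels[:10], scores.max())
```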
D. Two-class classification
- two-class SVM: under 100 features, linear model
- two-class averaged perceptron: fast training, linear model
- two-class Bayes point machine: fast training, linear model
- two-class decision forest
- two-class logistic regression
- two-class boosted decision tree
- two-class decision jungle
- two-class locally deep SVM
- two-class neural network
E. Multiclass classification
- multiclass logistic regression
- multiclass neural network
- multiclass decision forest
- multiclass decision jungle
- one-vs-all multiclass: depends on the underlying two-class classifier
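
A small sketch of the one-vs-all idea: any two-class learner (here logistic regression, an assumed stand-in) is wrapped so that one binary model is trained per class:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# one binary classifier is fitted for each of the 3 iris classes
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))  # 3 underlying two-class models
print(ovr.predict(X[:5]))
```

At prediction time the wrapper picks the class whose binary model gives the highest score, which is why the quality of the scheme depends on the two-class classifier chosen.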
Semi-supervised learning
Sits between supervised and unsupervised learning; a small fraction of the data is labeled while most of it is unlabeled. It can reach high accuracy, and compared with supervised learning its training cost is much lower.
Reinforcement learning
From a sequence of actions, the agent learns to maximize a reward function, where the feedback can mark actions as "bad" or "good". Reinforcement learning is often used in autonomous driving, where decisions are made from a stream of feedback from the surrounding environment.
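
A minimal tabular Q-learning sketch of this idea; the two-state toy environment, rewards, and hyperparameters are all made up for illustration and have nothing to do with a real driving system:

```python
import random

n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration

def step(state, action):
    # Toy dynamics: action 1 taken in state 1 is the "good action" (+1),
    # everything else is mildly "bad" (-0.1); the action chosen also
    # determines the next state.
    reward = 1.0 if (state == 1 and action == 1) else -0.1
    return action, reward

state = 0
for _ in range(5000):
    # epsilon-greedy: mostly exploit the current Q table, sometimes explore
    action = random.randrange(n_actions) if random.random() < epsilon \
             else max(range(n_actions), key=lambda a: Q[state][a])
    next_state, reward = step(state, action)
    # Q-learning update: move Q toward reward + discounted best future value
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = next_state

print(Q)  # action 1 should dominate in both states
```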
Machine learning algorithms: a classification chart
A tip
If results are very good during training but poor at evaluation time, the model is most likely overfitting.
Three commonly used validation methods
hold-out validation: set aside a validation split; suited to large data samples
k-fold cross-validation: split the training set into k equal folds; suited to small data samples
leave-one-out validation (LOOCV): a special case of k-fold cross-validation, repeated until every observation has served once as the validation data.
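
A brief sketch of the three schemes, assuming scikit-learn, the iris dataset, and a logistic-regression model as placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     KFold, LeaveOneOut)

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Hold-out: set aside one fixed validation split (good for large samples)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
print(model.fit(X_tr, y_tr).score(X_val, y_val))

# k-fold: split the data into k equal folds (good for small samples)
print(cross_val_score(model, X, y,
                      cv=KFold(n_splits=5, shuffle=True, random_state=0)).mean())

# LOOCV: k-fold with k equal to the number of observations
print(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())
```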
Several ways to evaluate a model
A. accuracy, precision, recall
To judge which model performs best, use the F score; the defining equations are as follows:
The larger the F score, the better.
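
The equations themselves appear to be missing from these notes (they were likely images in the original); the standard definitions in terms of true/false positives and negatives (TP, FP, TN, FN), which is presumably what was intended, are:

```latex
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}

F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```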
B. ROC curves
An advantage of ROC curves is that they are unaffected by the class distribution (e.g., imbalanced classes).
C. AUC (area under the curve)
The higher the AUC, the better.
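
A short sketch computing an ROC curve and its AUC with scikit-learn; the breast-cancer dataset and logistic model are stand-ins chosen for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# predicted probability of the positive class drives the ROC analysis
scores = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)  # points of the ROC curve
print("AUC:", roc_auc_score(y_te, scores))       # higher is better
```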
D. R², the coefficient of determination, with values in [0, 1]
It is a standard way of measuring how well the model fits the data.
The drawbacks: R² only ever increases as predictors are added and never decreases, so a model with more variables always has a larger R² and would wrongly be judged better; moreover, with a higher-order fit, noise is easily mistaken for the signal to be learned, i.e., the noise gets fitted into the model.
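
A sketch of R² alongside adjusted R², the usual correction for the first drawback: with n samples and p predictors, adjusted R² penalizes added features. The synthetic data and linear model here are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=100, n_features=8, noise=10.0, random_state=0)
y_pred = LinearRegression().fit(X, y).predict(X)

n, p = X.shape
r2 = r2_score(y, y_pred)
# adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1): shrinks as p grows
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, adj_r2)
```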
A tip
A model with very high accuracy is not necessarily useful. For example, a model that says 99% of cases have no cancer and 1% have cancer reflects an imbalanced sample distribution; in such a case two models may need to be built, model A to determine the presence of cancer and model B to determine its absence.
The bias and variance problem
Underfitting corresponds to high bias.
Overfitting corresponds to high variance.
When diagnosing a model: if the training set performs well but the validation set performs poorly, it is a high-variance problem (i.e., overfitting); if both the training set and the validation set perform poorly, it is a high-bias problem (i.e., underfitting).
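
A sketch of this diagnosis rule, assuming scikit-learn and a synthetic dataset: compare training and validation scores for a very shallow tree (high bias) versus a fully grown tree (high variance):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for depth in (1, None):  # a depth-1 stump vs. an unrestricted tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # both scores low -> high bias; train high but val low -> high variance
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"val={tree.score(X_val, y_val):.2f}")
```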
Solutions: