1 Look at the big picture
2 Get the data
3 Discover and visualise the data to gain insights
4 Prepare the data for machine learning algorithms
5 Select a model and train it
6 Fine-tune your model
7 Present the solution
8 Launch, monitor, and maintain your system
1 Look at the big picture
(2020.04.04)
Machine learning pipeline: a sequence of data processing components. Components in a pipeline typically run asynchronously.
Once the problem is understood, design the system: 1) unsupervised/supervised/reinforcement learning, 2) classification/regression/others, 3) batch learning/online learning.
Frame the problem: the checklist:
1) Define the objective in business terms.
2) How would your solution be used?
3) What are the current solutions/workarounds (if any)?
4) How should you frame the problem (supervised/unsupervised, classification/regression/others, batch/online learning, etc.)?
5) How should performance be measured?
6) Is the performance measure aligned with the business objective?
7) What is the minimum performance needed to reach the business objective?
8) What are comparable problems? Can you reuse experience or tools?
9) Is human expertise available?
10) How would you solve the problem manually?
11) List the assumptions you (or others) have made so far.
12) Verify assumptions if possible.
(2020.03.29 Sun)
Select a performance measure
A typical performance measure for regression: Root Mean Square Error (RMSE), the standard deviation of the system's prediction errors.
Mean Absolute Error (MAE): the mean of the absolute errors, used when the data contains many outliers.
Both RMSE and MAE measure the distance between two vectors: the vector of predictions and the vector of target values.
RMSE corresponds to the Euclidean norm, also called the l2 norm.
MAE corresponds to the Manhattan norm, also called the l1 norm.
More generally, the lk norm of a vector v with n elements is ||v||_k = (|v_1|^k + |v_2|^k + ... + |v_n|^k)^(1/k).
The higher the norm index k, the more it focuses on large values and neglects small ones. RMSE is therefore more sensitive to outliers than MAE, but when outliers are exponentially rare (as in a bell-shaped distribution) RMSE performs very well and is generally preferred.
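A minimal sketch of the two measures with NumPy (y_true and y_pred are hypothetical target and prediction vectors):
```
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical targets
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # hypothetical predictions

rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))  # related to the l2 norm of the error vector
mae = np.mean(np.abs(y_pred - y_true))           # related to the l1 norm of the error vector
print(rmse, mae)
```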
2 Get the data
(2020.04.04)
Checklist (automate as much as possible so you can easily get fresh data):
1) List the data you need and how much you need.
2) Find and document where you can get the data.
3) Check how much space it will take.
4) Check legal obligations, and get authorisation if necessary.
5) Get access authorisation.
6) Create the workspace (with enough storage space).
7) Get the data.
8) Convert the data to a format you can easily manipulate (without changing the data itself).
9) Ensure sensitive information is deleted or protected (e.g., anonymised).
10) Check the size and type of data (time series, sample, geographical, etc.).
11) Sample a test set, put it aside, and never look at it (no data snooping).
PS:
1) Data snooping bias: when you estimate the generalisation error using the test set, your estimate will be too optimistic and you will launch a system that will not perform as well as expected.
3 Discover and visualise the data to gain insights
(2020.04.04)
Checklist (try to get insights from a field expert for these steps):
1) Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
2) Create a Jupyter notebook to keep a record of your data exploration.
3) Study each attribute and its characteristics: name / type (categorical, int/float, bounded/unbounded, text, structured, etc.) / % of missing values / noisiness and types of noise (stochastic, outliers, rounding errors, etc.) / possible usefulness for the task / type of distribution (Gaussian, uniform, logarithmic, etc.).
4) For supervised learning tasks, identify the target attribute(s).
5) Visualise the data.
6) Study the correlations between attributes.
7) Study how you would solve the problem manually.
8) Identify the promising transformations you may want to apply.
9) Identify extra data that would be useful.
10) Document what you have learned.
(2020.04.06)
Routine checks: a) find data anomalies (e.g., missing values, outliers, etc.) and clean/handle them; b) look for correlations between attributes, especially between each attribute and the target; c) attributes with a long-tail distribution may benefit from a logarithmic transformation; d) consider combining attributes or reducing dimensionality (PCA/SVD/etc.).
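A minimal pandas sketch of checks a)-c) (df, the file name and the column names are hypothetical):
```
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')                              # hypothetical dataset
print(df.isnull().sum())                                  # a) missing values per column
print(df.corr()['target'].sort_values(ascending=False))   # b) correlations with the target
df['long_tail_log'] = np.log1p(df['long_tail'])           # c) log-transform a long-tailed attribute
```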
4 Prepare the data for machine learning algorithms
(2020.04.04)
Notes:
a) Work on copies of the data (keep the original dataset intact).
b) Write functions for all data transformations you apply, for 5 reasons:
    - So you can easily prepare the data the next time you get a fresh dataset
    - So you can apply these transformations in future projects
    - To clean and prepare the test set
    - To clean and prepare new data instances once your solution is live
    - To make it easy to treat your preparation choices as hyperparameters
1) Data cleaning:
    * Fix or remove outliers (optional)
    * Fill in missing values (e.g., with 0, mean, median...) or drop their rows (or columns)
2) Feature selection (optional):
    * Drop the attributes that provide no useful information for the task
3) Feature engineering, where appropriate:
    * Discretise continuous features
    * Decompose features (e.g., categorical, date/time, etc.)
    * Add promising transformations of features (e.g., log(x), sqrt(x), x^n, etc. For attributes with a long-tail distribution you may want a logarithm. You may find interesting correlations between attributes, in particular with the target attribute. Try out various attribute combinations.)
    * Aggregate features into promising new features
4) Feature scaling: standardise or normalise features
5 Select a model and train it
(2020.04.05 Sun)
Notes: a) If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware this penalises complex models such as large neural nets or Random forests). b) Once again, try to automate these steps as much as possible.
1) Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forest, neural net, etc.) using standard parameters
2) Measure and compare their performance
    - For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds
3) Analyse the most significant variables for each algorithm
4) Analyse the types of errors the models make
    - What data would a human have used to avoid these errors?
5) Have a quick round of feature selection and engineering
6) Have one or two more quick iterations of the five previous steps
7) Short-list the top 3 to 5 most promising models, preferring models that make different types of errors.
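A minimal sketch of steps 1) and 2) with scikit-learn (X_train and y_train are hypothetical prepared training data; the model list is illustrative):
```
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

models = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(),
    'forest': RandomForestRegressor(n_estimators=30),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train,
                             scoring='neg_mean_squared_error', cv=5)
    rmse = np.sqrt(-scores)
    print(name, rmse.mean(), rmse.std())
```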
6 Fine-tune your model
Training a model involves both learning the model parameters (the weights) and choosing the hyperparameters (parameters that are not changed by the training process itself).
Fine-tuning the model means tuning the hyperparameters. Three approaches: grid search / random search / ensemble methods.
(2020.04.05 Sun)
Notes:
a) You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning
b) As always automate what you can
1) Fine-tune the hyperparameters using cross-validation
    - Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with 0 or with median values? Or just drop the rows?)
    - Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimisation approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams)
2) Try ensemble methods. Combining your best models will often perform better than running them individually.
3) Once you are confident about your final model, measure its performance on the test set to estimate the generalisation error.
PS: Don't tweak your model after measuring the generalisation error; you would just start overfitting the test set.
(2020.03.29 Sat)
Grid search: tries every combination of the hyperparameter values you specify. Scikit-Learn's GridSearchCV implements this: pass it the parameter names and all the values to try, and it evaluates each combination with cross-validation.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

para_grid = [{'n_estimators': [3, 10, 30], 'max_features': [2, 3, 4, 6]},
             {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [3, 4, 5]}]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, para_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
In this example the first dict in para_grid gives 3*4 = 12 combinations and the second gives 1*2*3 = 6, so 18 parameter combinations in total. With cv=5, each combination is evaluated with 5-fold cross-validation, giving 5*18 = 90 training rounds in total.
After fitting, inspect grid_search.best_params_ / grid_search.best_estimator_ / grid_search.cv_results_.
Randomised search:
As the name suggests, it samples the hyperparameter search space at random for a given number of iterations.
from sklearn.model_selection import RandomizedSearchCV
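A minimal sketch of how it might be used, reusing the forest_reg example above (the parameter ranges and n_iter are illustrative, not from the book):
```
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

forest_reg = RandomForestRegressor()
para_dist = {'n_estimators': randint(3, 50), 'max_features': randint(2, 8)}
rnd_search = RandomizedSearchCV(forest_reg, para_dist, n_iter=10, cv=5,
                                scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
print(rnd_search.best_params_)
```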
Ensemble methods
e.g., Random Forest.
Evaluate models (especially the hyperparameters)
The goal of machine learning is a model that generalises, i.e. performs well on data it has never seen before; overfitting is the central difficulty.
Split the training data further into a training set and a validation set.
Three ways to evaluate models (and tune hyperparameters): hold-out validation, K-fold validation, and iterated K-fold validation with shuffling.
Hold-out validation:
The data is first split into a training set (which itself contains training and validation sets) and a test set. Having a validation set avoids using the test set to tune the model.
Workflow: shuffle the data --> define the validation/training sets --> train on the training set and evaluate on the validation set --> once the hyperparameters are tuned, it is common to retrain the final model from scratch on all non-test data.
```
model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)
```
# Once the hyperparameters are tuned, retrain from scratch on all non-test data
```
model = get_model()
model.train(np.concatenate([training_data, validation_data]))
test_score = model.evaluate(test_data)
```
This is the simplest method, but when data is scarce each subset ends up with too few samples.
K-fold validation:
Split the whole dataset into K partitions of equal size. For each partition, train on the remaining K-1 partitions and validate on it. Repeat the process K times (so each partition is used once as the validation set); each run yields a performance score, and the average of the K scores is the model's final score. This method is useful when the model's performance varies a lot depending on the split. A separate test set is still needed for final testing. Once the average score is obtained, the final model is trained on training + validation data.
```
k = 4
num_valid_samples = len(data) // k
v_scores = []
for fold in range(k):
    # select the validation data partition for this fold
    validation_data = data[num_valid_samples * fold : num_valid_samples * (fold + 1)]
    # use the remaining data for training
    training_data = np.concatenate([data[:num_valid_samples * fold], data[num_valid_samples * (fold + 1):]])
    # create a brand-new (untrained) instance of the model
    model = get_model()
    model.train(training_data)
    v_scores.append(model.evaluate(validation_data))
# the final validation score is the average of the K values in v_scores
validation_score = np.mean(v_scores)
# train the final model on all non-test data
model = get_model()
model.train(data)
```
(Note: why take the average of the K values? Because a single split's score depends on which samples happen to land in the validation partition; averaging over the K folds gives a more reliable estimate.)
Iterated K-fold validation with shuffling:
Unlike the previous method, the data is shuffled before every split into K partitions, and the whole procedure is repeated P times, so P*K models are trained in total. The computational cost is high. Commonly used in Kaggle competitions.
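scikit-learn offers a ready-made iterator for this scheme; a minimal sketch (the estimator and the random data are placeholders):
```
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = np.random.rand(100, 3), np.random.rand(100)              # placeholder data
rkf = RepeatedKFold(n_splits=4, n_repeats=3, random_state=42)   # P=3 repeats of K=4 folds
scores = cross_val_score(LinearRegression(), X, y, cv=rkf,
                         scoring='neg_mean_squared_error')
print(np.sqrt(-scores).mean())
```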
7 Present your solution
1) Document what you have done.
2) Create a nice presentation
    - Make sure you highlight the big picture first
3) Explain why your solution achieves the business objective.
4) Don't forget to present the interesting points you noticed along the way.
    - Describe what worked and what did not
    - List your assumptions and your system's limitations.
5) Ensure your key findings are communicated through beautiful visualisations or easy-to-remember statements (e.g., 'the median income is the number-one predictor of housing prices').
8 Launch, monitor, and maintain your system
1) Get your solution ready for production (plug into production data inputs, write unit tests, etc.)
2) Write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops.
    - Beware of slow degradation too: models tend to 'rot' as data evolves
    - Measuring performance may require a human pipeline (e.g., via a crowdsourcing service)
    - Also monitor your inputs' quality (e.g., a malfunctioning sensor sending random values, or another team's output becoming stale). This is particularly important for online learning systems.
3) Retrain your models on a regular basis on fresh data (automate as much as possible).
Basic operations
(2020.04.04)
1 Inspect basic information about a DataFrame
import pandas as pd; import matplotlib.pyplot as plt
df.head(n): first n rows, default n=5
df.info(): quick description of the data
df['some_field'].value_counts(): count of each value
df.hist(bins=50, figsize=(20,15))
plt.show(): used together, these two commands plot a histogram for each numerical attribute; bins is the number of bars, i.e. the granularity
(2020.04.05)
np.random.permutation(n): generates a random permutation of the integers 0 to n-1, used to shuffle indices.
The simplest way to shuffle and split the data:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size = 0.2, random_state = 42)
where housing is the DataFrame containing all the attributes, and random_state allows you to set the random generator seed.
A sampling method, stratified sampling:
the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population. This method avoids sampling bias.
from sklearn.model_selection import StratifiedShuffleSplit as SSS
split = SSS(n_splits= 1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
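The income_cat column used above is not created in this snippet; in the book it is a discretised version of median_income. A hedged sketch of how it might be built (the bin edges are illustrative):
```
import numpy as np
import pandas as pd

# create an income-category attribute to stratify on
housing['income_cat'] = pd.cut(housing['median_income'],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
```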
Scatter plots
1) housing.plot(kind='scatter', x='longitude', y='latitude')  # optionally add alpha=0.1
2) from pandas.plotting import scatter_matrix  # pandas.tools.plotting in old pandas versions
attributes = ['median_house_value', 'median_income', 'total_rooms', 'housing_median_age']
scatter_matrix(housing[attributes], figsize=(12,8))
or
housing.plot(kind='scatter', x='median_income', y = 'median_house_value', alpha=0.1)
Correlations
corr_matrix = housing.corr()  # correlation matrix
>> corr_matrix['median_house_value'].sort_values(ascending=False)  # correlations between median_house_value and the other attributes
Data cleaning
Ways to handle missing values:
1) housing.dropna(subset=['total_bedrooms'])  # drop the rows with missing values
2) housing.drop('total_bedrooms', axis=1)  # drop the whole attribute
3) median = housing['total_bedrooms'].median(); housing['total_bedrooms'].fillna(median, inplace=True)  # fill with the median
For numerical (non-text) attributes you can also use Imputer:
from sklearn.preprocessing import Imputer
imputer = Imputer(strategy = 'median')
housing_num = housing.drop('ocean_proximity', axis = 1)
imputer.fit(housing_num)
>> imputer.statistics_  # same values as housing_num.median().values
Then use the fitted imputer to replace the missing values with the median:
x = imputer.transform(housing_num) # type(x) = np.array
housing_tr = pd.DataFrame(x, columns = housing_num.columns)
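For reference, Imputer has been replaced by SimpleImputer (in sklearn.impute) in recent scikit-learn versions; a minimal equivalent sketch:
```
import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
housing_num = housing.drop('ocean_proximity', axis=1)   # keep only numerical attributes
x = imputer.fit_transform(housing_num)
housing_tr = pd.DataFrame(x, columns=housing_num.columns)
```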
Handle text and categorical attributes
Convert text labels to numbers:
>> from sklearn.preprocessing import LabelEncoder as LE
>> encoder = LE()  # encoder.classes_ shows the learned classes
>> housing_cat = housing['ocean_proximity']
>> housing_cat_encoded = encoder.fit_transform(housing_cat) # type(ho.._c_e..) = array
Problem with this encoding: ML algorithms will assume that two nearby values are more similar than two distant values.
Fix: one-hot encoding, i.e., only one attribute will be equal to 1 (hot), while the others will be 0 (cold).
>> from sklearn.preprocessing import OneHotEncoder as ohe
>> encoder = ohe()
>> housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
>> housing_cat_1hot  # a SciPy sparse matrix of type numpy.float64
>> housing_cat_1hot.toarray()  # converts it to a dense NumPy array
Convert text categories to one-hot binary values in one step:
>> from sklearn.preprocessing import LabelBinarizer as lb
>> encoder = lb()
>> housing_cat_1hot = encoder.fit_transform(housing_cat) # type(housing_cat_1hot) = array
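Note that LabelEncoder/LabelBinarizer are designed for labels; for input features, recent scikit-learn versions (0.20+) let OneHotEncoder handle string categories directly. A minimal sketch:
```
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing[['ocean_proximity']])  # note the 2-D input
print(encoder.categories_)
```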
Feature scaling
Two kinds: min-max scaling (normalisation) and standardisation.
Min-max, a.k.a. normalisation: values are shifted and rescaled so that they end up ranging from 0 to 1. Formula: (x - min) / (max - min). See MinMaxScaler in sklearn.
Standardisation: first subtract the mean value, then divide by the standard deviation so that the resulting distribution has unit variance. Standardised values have zero mean. Advantage: standardisation is much less affected by outliers. Formula: (x - mean) / std. See StandardScaler in sklearn.
Transformation pipelines
>> from sklearn.pipeline import Pipeline
>> from sklearn.preprocessing import StandardScaler
...
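The pipeline itself is elided above; a minimal sketch of what a numerical pipeline might look like (the step names and the use of SimpleImputer are assumptions, not the book's exact code):
```
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # fill missing values
    ('std_scaler', StandardScaler()),               # standardise the features
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
```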
Underfitting
If the error is large, the model may be underfitting: the features do not provide enough information to make good predictions, or the model is not powerful enough.
Remedies:
1) select a more powerful model
2) feed the training algorithm with better features
3) reduce the constraints on the model
(2020.04.06-08)
Overfitting (more content later)
Good performance on the training set but poor performance on unseen data. Techniques that reduce overfitting are called regularisation.
1) L1/L2 regularisation. In supervised learning the goal is to minimise an error (cost) function, i.e. the discrepancy between predictions and true values; regularisation adds a penalty term, the regularisation term, to this error function.
J' = J + \lambda ||w||_1      (L1 regularisation)
J' = J + \lambda ||w||_2^2    (L2 regularisation)
(Why does regularisation reduce overfitting? Differentiate the cost function with respect to the parameters: the regularisation term contributes to the gradient of the weights w but not of the bias b, so regularisation leaves b unaffected. With L2 regularisation, the weight update multiplies w by a factor slightly smaller than 1 instead of 1, i.e. weight decay, so the weights shrink during training. Overfitted models tend to have large weights: to pass close to every training point, the fitted function has to oscillate sharply within small intervals, which requires large derivatives and hence large weights. By constraining the weights, regularisation (L2 in particular) avoids such large derivatives and therefore reduces overfitting.)
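As an illustration (not from the book), this is how an L2 penalty might be attached to a layer's weights with the Keras API; the layer sizes and the factor 0.01 are arbitrary:
```
from tensorflow import keras
from tensorflow.keras import regularizers

# each Dense layer adds 0.01 * sum(w^2) of its kernel weights to the loss
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(20,),
                       kernel_regularizer=regularizers.l2(0.01)),
    keras.layers.Dense(1, kernel_regularizer=regularizers.l2(0.01)),
])
model.compile(optimizer='adam', loss='mse')
```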
2) For decision trees, use pruning.
3) Data augmentation: artificially generate new training instances from the existing data to enlarge the training set; this can reduce overfitting. In computer vision new samples are obtained by rotating/resizing images and the like, in NLP by expanding the dataset with synonyms, and in speech processing by adding white noise. The instances can also be generated on the fly during training.
4) Dropout in neural networks. At each training step only part of the neurons take part in forward and backward propagation while the rest are left unchanged. Each resulting sub-network effectively sees only part of the training samples, which amounts to sampling the training set, i.e. bagging; the final result is a combination of many neural networks. During training every neuron, including input neurons but excluding output neurons, has a probability p of being temporarily dropped out, e.g. it will be entirely ignored during this training step but may be active during the next step. The probability p is called the dropout rate and is often set to 50%. After training, neurons are no longer dropped. Since at every training step each neuron is either kept or dropped, there are 2^N possible neuron combinations (N is the total number of neurons), so training for 1000 steps can be seen as training 1000 different neural networks (assuming 2^N >> 1000). These networks are not independent, however, because they share weights. The final result can be viewed as an averaging ensemble of all the neural networks encountered during training. If you observe overfitting during training, increasing the dropout rate reduces it; for underfitting, decrease it. It also helps to raise the dropout rate for large layers and lower it for small layers. The drawback is that dropout slows down convergence and lengthens training time, but it usually yields a better model.
from tensorflow.contrib.layers import dropout  # TensorFlow 1.x
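tf.contrib was removed in TensorFlow 2; a minimal sketch of the same idea with the Keras API (the architecture is illustrative only):
```
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(20,)),
    keras.layers.Dropout(0.5),   # dropout rate p = 50%, active only during training
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
```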
5) Max-norm regularisation
Used in neural networks: the weight vector w of each neuron's incoming connections is constrained to satisfy ||w||_2 <= r, where r is the max-norm hyperparameter. To implement it, compute ||w||_2 after each training step and, if necessary, clip w, i.e. rescale it as w <- w * r / ||w||_2.
Lowering r increases the amount of regularisation and thus reduces overfitting. Max-norm regularisation can also help alleviate the vanishing/exploding gradients problem (if you are not using Batch Normalisation).
6) Early stopping: stop training as soon as the validation-set error starts to increase, i.e. when performance on the validation set starts dropping. One implementation in TensorFlow: evaluate the model on the validation set at regular intervals (e.g., every 50 steps) and save a 'winner' snapshot if it outperforms all previous winner snapshots; count the steps since the last winner snapshot and set an upper limit (e.g., 2000 steps); once that limit is reached, stop training and restore the last winner snapshot. Early stopping works even better combined with other regularisation techniques.
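With the Keras API the same idea is available as a callback; a minimal sketch (the patience value and the commented-out data are placeholders):
```
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',          # watch the validation error
    patience=20,                 # stop after 20 epochs without improvement
    restore_best_weights=True,   # roll back to the best ('winner') snapshot
)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1000, callbacks=[early_stop])
```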
7) Ensemble methods: bagging reduces the model's variance by combining the results of several models; boosting can reduce both bias and variance.
Cross-validation
K-fold cross-validation
>> from sklearn.model_selection import cross_val_score
>> scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring = 'neg_mean_squared_error', cv= 10)
>> rmse_scores = np.sqrt(-scores)
Notes:
(2020.04.08)
1 The l2 norm of a vector x: the square root of the sum of the squared elements.
Correspondingly, the l1 norm is the sum of the absolute values of the elements,
and the l0 norm is the number of non-zero elements in the vector. The l1/l2 norm can also be read as the distance from x to the origin.
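A quick check of these definitions with NumPy (the vector is arbitrary):
```
import numpy as np

x = np.array([3.0, -4.0, 0.0])
l2 = np.linalg.norm(x)           # sqrt(3^2 + 4^2) = 5.0
l1 = np.linalg.norm(x, ord=1)    # |3| + |-4| + |0| = 7.0
l0 = np.count_nonzero(x)         # 2 non-zero elements
print(l2, l1, l0)
```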
References:
1 A. Géron, Hands-on Machine Learning with Scikit-Learn & TensorFlow
2 F. Chollet, Deep Learning with Python (Chinese translation by 張亮)