Chapter 2 End-to-End Machine Learning Project

  1. Machine Learning Open Data Sources
    • Popular open data repositories:
    — UC Irvine Machine Learning Repository
    — Kaggle datasets
    — Amazon’s AWS datasets
    • Meta portals (they list open data repositories):
    — http://dataportals.org/
    — http://opendatamonitor.eu/
    — http://quandl.com/
    • Other pages listing many popular open data repositories:
    — Wikipedia’s list of Machine Learning datasets
    — Quora.com question
    — Datasets subreddit

  2. If there is no continuous flow of data coming into the system, no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, then plain batch learning should do just fine.

If the data was huge, you could either split your batch learning work across multiple servers (using the MapReduce technique, as we will see later), or you could use an online learning technique instead.

  1. Select a Performance Measure:
    A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It measures the standard deviation of the errors the system makes in its predictions.
$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^{2}}$$

where $m$ is the number of instances in the dataset, $\mathbf{x}^{(i)}$ is the feature vector of the $i$-th instance, $y^{(i)}$ is its label, and $h$ is the system’s prediction function (its hypothesis).

Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function. For example, suppose that there are many outlier districts. In that case, you may consider using the Mean Absolute Error (MAE).

$$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(\mathbf{x}^{(i)}) - y^{(i)}\right|$$

When a feature has a bell-shaped normal distribution (also called a Gaussian distribution), which is very common, the “68-95-99.7” rule applies: about 68% of the values fall within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ.

  1. The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs
    very well and is generally preferred.

  2. When using pandas, the info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values.
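For instance (the name housing for the chapter’s DataFrame is assumed from the running example, not shown in this note):

```python
housing.info()   # prints the row count plus each column's dtype and non-null count
```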

You can find out what categories exist and how many districts belong to each category by using the value_counts() method:

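Presumably something like this, with ocean_proximity being the dataset’s categorical attribute:

```python
housing["ocean_proximity"].value_counts()
```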

Let’s look at the other fields. The describe() method shows a summary of the numerical attributes.
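Along the same lines:

```python
housing.describe()   # count, mean, std, min, 25%/50%/75% quartiles, and max per numerical column
```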

  1. If you look at the test set and let what you see there guide your modeling choices, then when you estimate the generalization error using the test set your estimate will be too optimistic, and you will launch a system that will not perform as well as expected. This is called data snooping bias.

Creating a test set is theoretically quite simple: just pick some instances randomly, typically 20% of the dataset, and set them aside:

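A sketch of such a function (names like split_train_test and housing follow the chapter’s running example rather than anything shown in this note):

```python
import numpy as np

def split_train_test(data, test_ratio):
    # shuffle the row positions, then carve off the first test_ratio fraction as the test set
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(housing, 0.2)
```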

Well, this works, but it is not perfect: if you run the program again, it will generate a different test set! Over time, you (or your Machine Learning algorithms) will get to see the whole dataset, which is what you want to avoid. One solution is to save the test set on the first run and then load it in subsequent runs. Another option is to set the random number generator’s seed (e.g., np.random.seed(42)) before calling np.random.permutation(), so that it always generates the same shuffled indices.

But both these solutions will break next time you fetch an updated dataset.
A common solution is to use each instance’s identifier to decide whether or not it should go in the test set (assuming instances have a unique and immutable identifier). For example, you could compute a hash of each instance’s identifier, keep only the last byte of the hash, and put the instance in the test set if this value is lower than or equal to 51 (~20% of 256). This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set. Here is a possible implementation:

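One possible version along the lines just described (keeping the last byte of an MD5 hash):

```python
import hashlib
import numpy as np

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    # put the instance in the test set if the last byte of its hash is below test_ratio * 256
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]
```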

Unfortunately, the housing dataset does not have an identifier column. The simplest solution is to use the row index as the ID:

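For example (reusing split_train_test_by_id from the sketch above):

```python
housing_with_id = housing.reset_index()   # adds an `index` column to use as the ID
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
```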

If you use the row index as a unique identifier, you need to make sure that new data gets appended to the end of the dataset, and no row ever gets deleted. If this is not possible, then you can try to use the most stable features to build a unique identifier. For example, a district’s latitude and longitude are guaranteed to be stable for a few million years, so you could combine them into an ID like so:

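Something like:

```python
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")
```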

Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest function is train_test_split, which does pretty much the same thing as the function split_train_test defined earlier, with a couple of additional features. First there is a random_state parameter that allows you to set the random generator seed as explained previously, and second you can pass it multiple datasets with an identical number of rows, and it will split them on the same indices (this is very useful, for example, if you have a separate DataFrame for labels):

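For example, assuming the sklearn.model_selection module layout of recent Scikit-Learn versions:

```python
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
```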

So far we have considered purely random sampling methods. This is generally fine if your dataset is large enough (especially relative to the number of attributes), but if it is not, you run the risk of introducing a significant sampling bias.

For example, the US population is composed of 51.3% female and 48.7% male, so a well-conducted survey in the US would try to maintain this ratio in a sample of, say, 1,000 people: 513 female and 487 male. This is called stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population. With purely random sampling, there would be about a 12% chance of sampling a skewed test set with either less than 49% female or more than 54% female. Either way, the survey results would be significantly biased.

It is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of the stratum’s importance may be biased. This means that you should not have too many strata, and each stratum should be large enough. The following code creates an income category attribute by dividing the median income by 1.5 (to limit the number of income categories), rounding up using ceil (to have discrete categories), and then merging all the categories greater than 5 into category 5:

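A sketch of that code:

```python
import numpy as np

housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)                # divide and round up
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)      # cap everything above 5 at category 5
```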

Now you are ready to do stratified sampling based on the income category. For this you can use Scikit-Learn’s StratifiedShuffleSplit class:


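A sketch, assuming the income_cat column created above:

```python
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
```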
  1. Looking for Correlations
    Since the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson’s r) between every pair of attributes using the corr() method:
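For instance, to see how much each attribute correlates with the target (median_house_value is the label in this dataset):

```python
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
```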
  1. The correlation coefficient only measures linear correlations (“if x goes up, then y generally goes up/down”). It may completely miss out on nonlinear relationships.
    Another way to check for correlation between attributes is to use pandas’ scatter_matrix() function, which plots every numerical attribute against every other numerical attribute, as sketched below.
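A sketch (the attribute list is just an illustrative subset; older pandas versions expose scatter_matrix under pandas.tools.plotting instead):

```python
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
```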

  2. Data Cleaning
    Scikit-Learn provides a handy class to take care of missing values: Imputer.
    Here is how to use it.
    First, you need to create an Imputer instance, specifying that you want to replace each attribute’s missing values with the median of that attribute:

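A sketch (Imputer lives in sklearn.preprocessing in the Scikit-Learn versions the book targets; newer releases call it SimpleImputer in sklearn.impute):

```python
from sklearn.preprocessing import Imputer   # newer scikit-learn: from sklearn.impute import SimpleImputer

imputer = Imputer(strategy="median")
```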

Since the median can only be computed on numerical attributes, we need to create a copy of the data without the text attribute ocean_proximity:

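Something like:

```python
housing_num = housing.drop("ocean_proximity", axis=1)   # numerical attributes only
```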

Now you can fit the imputer instance to the training data using the fit() method:

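Continuing the same sketch:

```python
imputer.fit(housing_num)   # learns the median of each numerical attribute
```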

The imputer has simply computed the median of each attribute and stored the result in its statistics_ instance variable. Only the total_bedrooms attribute had missing values, but we cannot be sure that there won’t be any missing values in new data after the system goes live, so it is safer to apply the imputer to all the numerical attributes:

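For example:

```python
imputer.statistics_           # the medians learned from housing_num
housing_num.median().values   # the same values, computed directly
```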

Now you can use this “trained” imputer to transform the training set by replacing missing values by the learned medians:

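Continuing:

```python
X = imputer.transform(housing_num)   # NumPy array with missing values filled in
```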

The result is a plain NumPy array containing the transformed features. If you want to put it back into a Pandas DataFrame, it’s simple:

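For example:

```python
import pandas as pd

housing_tr = pd.DataFrame(X, columns=housing_num.columns)
```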
  1. Handling Text and Categorical Attributes
    Scikit-Learn provides a transformer for this task called LabelEncoder:
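A sketch:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)   # integer codes, one per category
```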

This is better: now we can use this numerical data in any ML algorithm. You can look at the mapping that this encoder has learned using the classes_ attribute (“<1H OCEAN” is mapped to 0, “INLAND” is mapped to 1, etc.):

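Continuing the sketch (the exact list below is what the California housing data contains and may differ for other data):

```python
encoder.classes_   # e.g. array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype=object)
```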

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. Obviously this is not the case.

To fix this issue, a common solution is to create one binary attribute per category: one attribute equal to 1 when the category is “<1H OCEAN” (and 0 otherwise), another attribute equal to 1 when the category is “INLAND” (and 0 otherwise), and so on. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). Scikit-Learn provides a OneHotEncoder encoder to convert integer categorical values into one-hot vectors. Let’s encode the categories as one-hot vectors. Note that fit_transform() expects a 2D array, but housing_cat_encoded is a 1D array, so we need to reshape it:

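A sketch (this follows the integer-input style of the Scikit-Learn version the book targets; recent OneHotEncoder versions can also encode string columns directly):

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))   # SciPy sparse matrix
```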

Notice that the output is a SciPy sparse matrix, instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding we get a matrix with thousands of columns, and the matrix is full of zeros except for one 1 per row. Using up tons of memory mostly to store zeros would be very wasteful, so instead a sparse matrix only stores the location of the nonzero elements. You can use it mostly like a normal 2D array, but if you really want to convert it to a (dense) NumPy array, just call the toarray() method:

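Continuing:

```python
housing_cat_1hot.toarray()   # dense NumPy array of 0s and 1s
```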

We can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:

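A sketch:

```python
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)   # dense NumPy array by default
```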

Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.

  1. Custom Transformers
    Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. You will want your transformer to work seamlessly with Scikit-Learn functionalities (such as pipelines), and since Scikit-Learn relies on duck typing (not inheritance), all you need is to create a class and implement three methods: fit() (returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstimator as a base class (and avoid *args and **kwargs in your constructor) you will get two extra methods (get_params() and set_params()) that will be useful for automatic hyperparameter tuning. For example, here is a small transformer class that adds the combined attributes we discussed earlier:
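One way to write it; the combined attributes are rooms per household, population per household, and optionally bedrooms per room, and the column indices are illustrative (they depend on the column order of your data):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# positions of total_rooms, total_bedrooms, population, households in the NumPy array
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):   # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self                                    # nothing to learn
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
```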
  1. Feature Scaling
    There are two common ways to get all attributes to have the same scale: min-max scaling and standardization. Min-max scaling (many people call this normalization) is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. Scikit-Learn provides a transformer called MinMaxScaler for this. It has a feature_range hyperparameter that lets you change the range if you don’t want 0–1 for some reason.

Standardization is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However, standardization is much less affected by outliers. For example, suppose a district had a median income equal to 100 (by mistake). Min-max scaling would then crush all the other values from 0–15 down to 0–0.15, whereas standardization would not be much affected. Scikit-Learn provides a transformer called StandardScaler for standardization.
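Both scalers follow the usual fit/transform API; a minimal sketch, reusing the imputed numerical DataFrame housing_tr from the Data Cleaning sketches above:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

minmax_scaled = MinMaxScaler().fit_transform(housing_tr)   # every column rescaled to the 0-1 range
std_scaled = StandardScaler().fit_transform(housing_tr)    # every column with zero mean and unit variance
```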

  1. Transformation Pipelines
    As you can see, there are many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides the Pipeline class to help with such sequences of transformations. Here is a small pipeline for the numerical attributes:
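A sketch, reusing the CombinedAttributesAdder defined earlier:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, StandardScaler

num_pipeline = Pipeline([
    ("imputer", Imputer(strategy="median")),
    ("attribs_adder", CombinedAttributesAdder()),
    ("std_scaler", StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)
```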

The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a fit_transform() method). The names can be anything you like. When you call the pipeline’s fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call, until it reaches the final estimator, for which it just calls the fit() method. The pipeline exposes the same methods as the final estimator. In this example, the last estimator is a StandardScaler, which is a transformer, so the pipeline has a transform() method that applies all the transforms to the data in sequence (it also has a fit_transform() method that we could have used instead of calling fit() and then transform()).

You now have a pipeline for numerical values, and you also need to apply the LabelBinarizer on the categorical values: how can you join these transformations into a single pipeline? Scikit-Learn provides a FeatureUnion class for this. You give it a list of transformers (which can be entire transformer pipelines), and when its transform() method is called it runs each transformer’s transform() method in parallel, waits for their output, and then concatenates them and returns the result (and of course calling its fit() method calls each transformer’s fit() method). A full pipeline handling both numerical and categorical attributes may look like this:

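A sketch (DataFrameSelector is defined just below; note that recent Scikit-Learn releases no longer accept LabelBinarizer inside a Pipeline, so this follows the book-era API):

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer, StandardScaler, LabelBinarizer

num_attribs = list(housing_num)          # names of the numerical columns
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ("selector", DataFrameSelector(num_attribs)),
    ("imputer", Imputer(strategy="median")),
    ("attribs_adder", CombinedAttributesAdder()),
    ("std_scaler", StandardScaler()),
])

cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(cat_attribs)),
    ("label_binarizer", LabelBinarizer()),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

housing_prepared = full_pipeline.fit_transform(housing)
```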

Each subpipeline starts with a selector transformer: it simply transforms the data by selecting the desired attributes (numerical or categorical), dropping the rest, and converting the resulting DataFrame to a NumPy array. There is nothing in Scikit-Learn to handle Pandas DataFrames, so we need to write a simple custom transformer for this task:

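A minimal version:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values   # keep the chosen columns, return a NumPy array
```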
  1. Better Evaluation Using Cross-Validation
    A great alternative is to use Scikit-Learn’s cross-validation feature. The following code performs K-fold cross-validation: it randomly splits the training set into 10 distinct subsets called folds, then it trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores:
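A sketch (housing_prepared and housing_labels are the prepared features and training labels from earlier in the chapter, neither shown in this note):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)   # scores are negative MSEs, so negate before taking the square root
```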

You can easily save Scikit-Learn models by using Python’s pickle module, or using sklearn.externals.joblib, which is more efficient at serializing large NumPy arrays:

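For example (my_model stands for whatever trained estimator you want to persist):

```python
from sklearn.externals import joblib   # in newer scikit-learn versions, simply `import joblib`

joblib.dump(my_model, "my_model.pkl")
# ... later ...
my_model_loaded = joblib.load("my_model.pkl")
```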
  1. Fine-Tune Your Model: Grid Search
    One way to do that would be to fiddle with the hyperparameters manually, until you
    find a great combination of hyperparameter values. This would be very tedious work,
    and you may not have time to explore many combinations.

Instead you should get Scikit-Learn’s GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation. For example, the following code searches for the best combination of hyperparameter values for the RandomForestRegressor:

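A sketch matching the combination counts discussed in the next paragraph:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(housing_prepared, housing_labels)
```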

When you have no idea what value a hyperparameter should have, a simple approach is to try out consecutive powers of 10 (or a smaller number if you want a more fine-grained search, as shown in this example with the n_estimators hyperparameter).
This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of n_estimators and max_features hyperparameter values specified in the first dict. Then try all 2 × 3 = 6 combinations of hyperparameter values in the second dict, but this time with the bootstrap hyperparameter set to False instead of True (which is the default value for this hyperparameter).
When it is done you can get the best combination of parameters like this:
When it is done you can get the best combination of parameters like this:

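For example:

```python
grid_search.best_params_
```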

You can also get the best estimator directly:

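Likewise:

```python
grid_search.best_estimator_
```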

And of course the evaluation scores are also available:

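For example, listing the RMSE for every combination tried:

```python
import numpy as np

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)   # RMSE for each hyperparameter combination
```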
  1. Fine-Tune Your Model: Randomized Search
    The grid search approach is fine when you are exploring relatively few combinations, like in the previous example, but when the hyperparameter search space is large, it is often preferable to use RandomizedSearchCV instead. This class can be used in much the same way as the GridSearchCV class, but instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration (see the sketch after this list). This approach has two main benefits:
    • If you let the randomized search run for, say, 1,000 iterations, this approach will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the grid search approach).
    • You have more control over the computing budget you want to allocate to hyperparameter search, simply by setting the number of iterations.
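A sketch, sampling both hyperparameters from integer distributions (the ranges are illustrative):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_estimators": randint(low=1, high=200),
    "max_features": randint(low=1, high=8),
}

forest_reg = RandomForestRegressor()
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
```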