Best Practices for Feature Engineering

Feature engineering, the process of creating new input features for machine learning, is one of the most effective ways to improve predictive models.

“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.” ~ Andrew Ng

Through feature engineering, you can isolate key information, highlight patterns, and bring in domain expertise.

Unsurprisingly, it can be easy to get stuck because feature engineering is so open-ended.

In this guide, we’ll discuss 20 best practices and heuristics that will help you navigate feature engineering.

What is Feature Engineering?

Feature engineering is an informal topic, and there are many possible definitions. The machine learning workflow is fluid and iterative, so there’s no one “right answer.”

We explain our approach in more detail in our free 7-day email crash course.

In a nutshell, we define feature engineering as creating new features from your existing ones to improve model performance.

A typical data science process might look like this:

  1. Project Scoping / Data Collection
  2. Exploratory Analysis
  3. Data Cleaning
  4. Feature Engineering
  5. Model Training (including cross-validation to tune hyper-parameters)
  6. Project Delivery / Insights

What is Not Feature Engineering?

That means there are certain steps we do not consider to be feature engineering:

  • We do not consider initial data collection to be feature engineering.
  • Similarly, we do not consider creating the target variable to be feature engineering.
  • We do not consider removing duplicates, handling missing values, or fixing mislabeled classes to be feature engineering. We put these under data cleaning.
  • We do not consider scaling or normalization to be feature engineering because these steps belong inside the cross-validation loop (i.e. after you’ve already built your analytical base table).
  • Finally, we do not consider feature selection or PCA to be feature engineering. These steps also belong inside your cross-validation loop.
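
To make the cross-validation point above concrete, here is a minimal scikit-learn sketch (the dataset and model are placeholders, not from the article): because the scaler sits inside a Pipeline, it is re-fit on each training fold rather than on the full dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder dataset; any tabular X, y would do.
X, y = load_breast_cancer(return_X_y=True)

# Because the scaler lives inside the pipeline, each cross-validation fold
# re-fits it on that fold's training data only, so no information leaks in
# from the held-out fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```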

Again, this is simply our categorization. Reasonable data scientists may disagree, and that’s perfectly fine.

With those disclaimers out of the way, let’s dive into the best practices and heuristics!

Indicator Variables

The first type of feature engineering involves using indicator variables to isolate key information.

Now, some of you may be wondering, “Shouldn’t a good algorithm learn the key information on its own?”

Well, not always. It depends on the amount of data you have and the strength of competing signals. You can help your algorithm “focus” on what’s important by highlighting it beforehand.

  • Indicator variable from thresholds: Let’s say you’re studying alcohol preferences by U.S. consumers and your dataset has an age feature. You can create an indicator variable for age >= 21 to distinguish subjects who were over the legal drinking age.
  • Indicator variable from multiple features: You’re predicting real-estate prices and you have the features n_bedrooms and n_bathrooms. If houses with 2 beds and 2 baths command a premium as rental properties, you can create an indicator variable to flag them.
  • Indicator variable for special events: You’re modeling weekly sales for an e-commerce site. You can create two indicator variables for the weeks of Black Friday and Christmas.
  • Indicator variable for groups of classes: You’re analyzing website conversions and your dataset has the categorical feature traffic_source. You could create an indicator variable for paid_traffic by flagging observations with traffic source values of "Facebook Ads" or "Google Adwords" (see the pandas sketch after this list).
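
Here’s a minimal pandas sketch of three of these patterns. The toy DataFrame mixes columns from the different examples just to keep the sketch short, and all values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [19, 25, 34, 20],
    "n_bedrooms": [2, 3, 2, 1],
    "n_bathrooms": [2, 1, 2, 1],
    "traffic_source": ["Facebook Ads", "Organic", "Google Adwords", "Email"],
})

# Indicator variable from a threshold: at or above the U.S. legal drinking age
df["over_21"] = (df["age"] >= 21).astype(int)

# Indicator variable from multiple features: flag 2-bed / 2-bath properties
df["two_bed_two_bath"] = (
    (df["n_bedrooms"] == 2) & (df["n_bathrooms"] == 2)
).astype(int)

# Indicator variable for a group of classes: flag paid traffic sources
df["paid_traffic"] = df["traffic_source"].isin(
    ["Facebook Ads", "Google Adwords"]
).astype(int)

print(df)
```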

Interaction Features

The next type of feature engineering involves highlighting interactions between two or more features.

Have you ever heard the phrase, “the whole is greater than the sum of its parts”? Well, some features can be combined to provide more information than they would individually.

Specifically, look for opportunities to take the sum, difference, product, or quotient of multiple features.

Note: We don’t recommend using an automated loop to create interactions for all your features. This leads to “feature explosion.”

  • Sum of two features: Let’s say you wish to predict revenue based on preliminary sales data. You have the features sales_blue_pens and sales_black_pens. You could sum those features if you only care about overall sales_pens.
  • Difference between two features: You have the features house_built_date and house_purchase_date. You can take their difference to create the feature house_age_at_purchase.
  • Product of two features: You’re running a pricing test, and you have the feature price and an indicator variable conversion. You can take their product to create the feature earnings.
  • Quotient of two features: You have a dataset of marketing campaigns with the features n_clicks and n_impressions. You can divide clicks by impressions to create click_through_rate, allowing you to compare across campaigns of different volume (see the sketch after this list).
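
A quick pandas sketch of the difference and quotient examples, with made-up values for the column names used above:

```python
import pandas as pd

df = pd.DataFrame({
    "house_built_date": pd.to_datetime(["1990-05-01", "2005-08-15"]),
    "house_purchase_date": pd.to_datetime(["2015-06-01", "2020-01-10"]),
    "n_clicks": [120, 45],
    "n_impressions": [10000, 1500],
})

# Difference of two features: house age (in years) at the time of purchase
df["house_age_at_purchase"] = (
    (df["house_purchase_date"] - df["house_built_date"]).dt.days / 365.25
)

# Quotient of two features: click-through rate, comparable across campaign sizes
df["click_through_rate"] = df["n_clicks"] / df["n_impressions"]

print(df[["house_age_at_purchase", "click_through_rate"]])
```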

Feature Representation

This next type of feature engineering is simple yet impactful. It’s called feature representation.

Your data won’t always come in the ideal format. You should consider if you’d gain information by representing the same feature in a different way.

  • Date and time features: Let’s say you have the feature purchase_datetime. It might be more useful to extract purchase_day_of_week and purchase_hour_of_day. You can also aggregate observations to create features such as purchases_over_last_30_days.
  • Numeric to categorical mappings: You have the feature years_in_school. You might create a new feature grade with classes such as "Elementary School", "Middle School", and "High School".
  • Grouping sparse classes: You have a feature with many classes that have low sample counts. You can try grouping similar classes and then grouping the remaining ones into a single "Other" class.
  • Creating dummy variables: Depending on your machine learning implementation, you may need to manually transform categorical features into dummy variables. You should always do this after grouping sparse classes.
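
Here’s a minimal pandas sketch of the date/time, sparse-class, and dummy-variable ideas above. The column names, values, and sparse-class threshold are all illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "purchase_datetime": pd.to_datetime([
        "2023-03-03 14:20", "2023-03-04 09:05", "2023-03-05 22:40",
        "2023-03-06 11:15", "2023-03-07 18:30",
    ]),
    "traffic_source": ["Google Adwords", "Google Adwords", "Google Adwords",
                       "Bing Ads", "Facebook Ads"],
})

# Date and time features: extract the day of week and hour of day
df["purchase_day_of_week"] = df["purchase_datetime"].dt.day_name()
df["purchase_hour_of_day"] = df["purchase_datetime"].dt.hour

# Grouping sparse classes: lump rarely seen sources into a single "Other" class
counts = df["traffic_source"].value_counts()
rare = counts[counts < 2].index              # the threshold here is arbitrary
df["traffic_source"] = df["traffic_source"].replace(list(rare), "Other")

# Creating dummy variables, after the sparse classes have been grouped
df = pd.get_dummies(df, columns=["traffic_source"])

print(df)
```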

External Data

An underused type of feature engineering is bringing in external data. This can lead to some of the biggest breakthroughs in performance.

For example, one way quantitative hedge funds perform research is by layering together different streams of financial data.

Many machine learning problems can benefit from bringing in external data. Here are some examples:

  • Time series data: The nice thing about time series data is that you only need one feature, some form of date, to layer in features from another dataset.
  • External APIs: There are plenty of APIs that can help you create features. For example, the Microsoft Computer Vision API can return the number of faces detected in an image.
  • Geocoding: Let’s say you have street_address, city, and state. Well, you can geocode them into latitude and longitude. This will allow you to calculate features such as local demographics (e.g. median_income_within_2_miles) with the help of another dataset.
  • Other sources of the same data: How many ways could you track a Facebook ad campaign? You might have Facebook’s own tracking pixel, Google Analytics, and possibly another third-party software. Each source can provide information that the others don’t track. Plus, any differences between the datasets could be informative (e.g. bot traffic that one source ignores while another source keeps).
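
To illustrate the time series point, here is a minimal sketch of layering an external dataset onto your own table by joining on a shared date key. Both tables below are made up.

```python
import pandas as pd

# Your own weekly sales data
sales = pd.DataFrame({
    "week": pd.to_datetime(["2023-11-20", "2023-11-27", "2023-12-25"]),
    "weekly_sales": [12000, 18500, 22000],
})

# External data keyed on the same dates (e.g. holidays, weather, market indices)
external = pd.DataFrame({
    "week": pd.to_datetime(["2023-11-20", "2023-11-27", "2023-12-25"]),
    "is_black_friday_week": [0, 1, 0],
    "is_christmas_week": [0, 0, 1],
})

# A left join keeps every observation in your base table
sales = sales.merge(external, on="week", how="left")
print(sales)
```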

Error Analysis (Post-Modeling)

The final type of feature engineering we’ll cover falls under a process called error analysis. This is performed after training your first model.

Error analysis is a broad term that refers to analyzing the misclassified or high error observations from your model and deciding on your next steps for improvement.

Possible next steps include collecting more data, splitting the problem apart, or engineering new features that address the errors. To use error analysis for feature engineering, you’ll need to understand why your model missed its mark.

Here’s how:

  • Start with larger errors: Error analysis is typically a manual process. You won’t have time to scrutinize every observation. We recommend starting with those that had higher error scores. Look for patterns that you can formalize into new features.
  • Segment by classes: Another technique is to segment your observations and compare the average error within each segment. You can try creating indicator variables for the segments with the highest errors.
  • Unsupervised clustering: If you have trouble spotting patterns, you can run an unsupervised clustering algorithm on the misclassified observations. We don’t recommend blindly using those clusters as a new feature, but they can make it easier to spot patterns. Remember, the goal is to understand why observations were misclassified.
  • Ask colleagues or domain experts: This is a great complement to any of the other three techniques. Asking a domain expert is especially useful if you’ve identified a pattern of poor performance (e.g. through segmentations) but don’t yet understand why.
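
Here’s a brief sketch of the first two techniques, using a made-up validation frame with true values, model predictions, and a segment label:

```python
import pandas as pd

# Made-up validation results: true values, predictions, and a segment label
results = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "C", "C"],
    "y_true": [10.0, 12.0, 30.0, 28.0, 50.0, 55.0],
    "y_pred": [11.0, 12.5, 35.0, 20.0, 49.0, 54.0],
})
results["abs_error"] = (results["y_true"] - results["y_pred"]).abs()

# Start with larger errors: inspect the worst observations first
print(results.sort_values("abs_error", ascending=False).head(10))

# Segment by classes: compare the average error within each segment
print(results.groupby("segment")["abs_error"].mean().sort_values(ascending=False))
```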

Conclusion

As you can see, there are many possibilities for feature engineering. We’ve covered 20 best practices and heuristics, but they are by no means exhaustive!

Remember these general guidelines as you start to experiment on your own:

Good features to engineer…

  • Can be computed for future observations.
  • Are usually intuitive to explain.
  • Are informed by domain knowledge or exploratory analysis.
  • Must have the potential to be predictive. Don’t just create features for the sake of it.
  • Never touch the target variable. This is a trap that beginners sometimes fall into. Whether you’re creating indicator variables or interaction features, never use your target variable. That’s like “cheating” and it would give you very misleading results.

Finally, don’t worry if this feels overwhelming right now! You’ll naturally get better at feature engineering through practice and experience.

In fact, if this is your first exposure to some of these tactics, we highly recommend picking up a dataset and solidifying what you’ve learned.

Have any questions about feature engineering? Did we miss one of your favorite heuristics? Let us know in the comments!

Source: https://elitedatascience.com/feature-engineering-best-practices
