Data and features determine the upper bound of machine learning; models and algorithms merely approximate that bound. So what exactly is feature engineering? As the name suggests, it is essentially an engineering activity whose goal is to extract as many useful features as possible from raw data for algorithms and models to consume. By common consensus, feature engineering includes the following aspects:
Reposted from the blog:
[ http://blog.csdn.net/u010472823/article/details/53509658 ]
Data Preprocessing in Practice
Official documentation: [ http://scikit-learn.org/stable/modules/preprocessing.html ]
Country | Age | Salary | Purchased |
---|---|---|---|
France | 44 | 72000 | No |
Spain | 27 | 48000 | Yes |
Germany | 30 | 54000 | No |
Spain | 38 | 61000 | No |
Germany | 40 | | Yes |
France | 35 | 58000 | Yes |
Spain | | 52000 | No |
France | 48 | 79000 | Yes |
Germany | 50 | 83000 | No |
France | 37 | 67000 | Yes |
First, the table above shows that the sample data contains missing values. The usual options are to delete the affected rows or to fill in the missing entries. Before filling in missing values, three concepts need to be understood: mode, mean, and median.
Mode: the value that occurs most often in the data.
Median: the middle value after sorting the data.
Mean: the sum of the data divided by the count.
Which statistic to fill with depends on the scenario.
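As a quick sanity check, all three statistics can be computed with Python's standard `statistics` module. Run over the nine observed salaries in the table, the mean comes out to 63777.78, which is exactly the value mean imputation fills in below:

```python
import statistics

# The nine observed salaries from the table (the missing one excluded)
salaries = [72000, 48000, 54000, 61000, 58000, 52000, 79000, 83000, 67000]

mean_val = statistics.mean(salaries)      # sum divided by count
median_val = statistics.median(salaries)  # middle value after sorting
mode_val = statistics.mode([48000, 52000, 52000, 61000])  # most frequent value

print(mean_val)    # 63777.77777777778
print(median_val)  # 61000
print(mode_val)    # 52000
```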
First, we need to import the Imputer class from sklearn's preprocessing package:
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
strategy='mean' fills columns 2 and 3 of the data above with their column means; axis=0 means the statistic is computed along each column.
strategy : string, optional (default="mean")
The imputation strategy.
- If "mean", then replace missing values using the mean along
the axis.
- If "median", then replace missing values using the median along
the axis.
- If "most_frequent", then replace missing using the most frequent
value along the axis.
>>> print(X)
[['France' 44.0 72000.0]
['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 63777.77777777778]
['France' 35.0 58000.0]
['Spain' 38.77777777777778 52000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]
Here the mean strategy was used to fill in the missing data.
Since X[:, 0] and y are categorical, they need label encoding, using LabelEncoder from the sklearn.preprocessing package:
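Note that the Imputer class used above was removed in scikit-learn 0.22 in favor of sklearn.impute.SimpleImputer, which does the same job. Conceptually, the mean strategy just replaces each NaN with its column mean; a NumPy-only sketch of that behavior:

```python
import numpy as np

# Age / Salary columns from the table, with np.nan marking the missing cells
X_num = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

col_means = np.nanmean(X_num, axis=0)   # per-column mean, NaNs ignored
rows, cols = np.where(np.isnan(X_num))  # positions of the missing cells
X_num[rows, cols] = col_means[cols]     # fill each NaN with its column mean

print(X_num[4, 1])  # 63777.77777777778
print(X_num[6, 0])  # 38.77777777777778
```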
from sklearn.preprocessing import LabelEncoder
>>> X[:,0] = LabelEncoder().fit_transform(X[:,0])
>>> y = LabelEncoder().fit_transform(y)
>>> print(y)
[0 1 0 0 1 1 0 1 0 1]
Label encoding is fine for the target values. For categorical features, however, it is not enough: the Country column is now encoded as ordered integers 0-2, which imposes a spurious ordering. The data also needs dummy coding, i.e. one-hot encoding, using OneHotEncoder from the sklearn.preprocessing package:
from sklearn.preprocessing import OneHotEncoder
>>> X = OneHotEncoder(categorical_features=[0]).fit_transform(X).toarray()
>>> print(X)
[[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
7.20000000e+04]
[ 0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
4.80000000e+04]
[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
5.40000000e+04]
[ 0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
6.10000000e+04]
[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
6.37777778e+04]
[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
5.80000000e+04]
[ 0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
5.20000000e+04]
[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
7.90000000e+04]
[ 0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
8.30000000e+04]
[ 1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
6.70000000e+04]]
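The categorical_features argument used above was deprecated and later removed from OneHotEncoder (newer scikit-learn versions route per-column encoding through a ColumnTransformer instead). The transformation itself is easy to see in plain NumPy: label-encode the column, then index an identity matrix:

```python
import numpy as np

countries = np.array(['France', 'Spain', 'Germany', 'Spain', 'Germany',
                      'France', 'Spain', 'France', 'Germany', 'France'])

# Label encoding: np.unique returns the sorted classes and each row's code
classes, codes = np.unique(countries, return_inverse=True)

# One-hot / dummy coding: one indicator column per class
one_hot = np.eye(len(classes))[codes]

print(classes)     # ['France' 'Germany' 'Spain'] -> the three dummy columns
print(one_hot[0])  # [1. 0. 0.]  (France)
```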
For regression we also need to scale the features, since age and salary live on very different scales.
from sklearn.preprocessing import StandardScaler
sd = StandardScaler().fit(X)
X = sd.transform(X)
print(X)
[[ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01 7.58874362e-01
7.49473254e-01]
[ -8.16496581e-01 -6.54653671e-01 1.52752523e+00 -1.71150388e+00
-1.43817841e+00]
[ -8.16496581e-01 1.52752523e+00 -6.54653671e-01 -1.27555478e+00
-8.91265492e-01]
[ -8.16496581e-01 -6.54653671e-01 1.52752523e+00 -1.13023841e-01
-2.53200424e-01]
[ -8.16496581e-01 1.52752523e+00 -6.54653671e-01 1.77608893e-01
6.63219199e-16]
[ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01 -5.48972942e-01
-5.26656882e-01]
[ -8.16496581e-01 -6.54653671e-01 1.52752523e+00 0.00000000e+00
-1.07356980e+00]
[ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01 1.34013983e+00
1.38753832e+00]
[ -8.16496581e-01 1.52752523e+00 -6.54653671e-01 1.63077256e+00
1.75214693e+00]
[ 1.22474487e+00 -6.54653671e-01 -6.54653671e-01 -2.58340208e-01
2.93712492e-01]]
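StandardScaler computes z = (x - mean) / std per column, using the population standard deviation. The first scaled age above (7.58874362e-01) can be verified by hand on the imputed Age column:

```python
import numpy as np

# Age column after mean imputation (the missing age filled with 38.777...)
ages = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0,
                 38.77777777777778, 48.0, 50.0, 37.0])

# z-score: subtract the column mean, divide by the population std (ddof=0)
z = (ages - ages.mean()) / ages.std()

print(z[0])  # ~0.7589, matching the first row of the scaled output
```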
Other scaling methods include MinMaxScaler(), which rescales each feature to a given range (by default [0, 1]):
>>> from sklearn import preprocessing
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X = min_max_scaler.fit_transform(X)
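MinMaxScaler maps each column to [0, 1] via (x - min) / (max - min); a hand-computed sketch on the imputed Age column:

```python
import numpy as np

ages = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0,
                 38.77777777777778, 48.0, 50.0, 37.0])

# Range scaling: the column minimum maps to 0, the maximum to 1
scaled = (ages - ages.min()) / (ages.max() - ages.min())

print(scaled.min(), scaled.max())  # 0.0 1.0
```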
Normalization with Normalizer() rescales each feature vector (each row) to unit length:
>>> normalizer = preprocessing.Normalizer().fit(X) # fit does nothing
>>> normalizer.transform(X)
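Unlike the scalers above, which operate column-wise, Normalizer operates row-wise: each sample is divided by its own (by default L2) norm, so every row ends up with unit length:

```python
import numpy as np

row = np.array([3.0, 4.0])

# L2 normalization: divide the sample by its Euclidean norm (5.0 here)
unit = row / np.linalg.norm(row)

print(unit)                  # [0.6 0.8]
print(np.linalg.norm(unit))  # 1.0
```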
That completes basic data preprocessing; the next steps are model training and prediction.
Some other common operations:
#Concatenate DataFrames
df = pd.concat([df1, df2])
#Inspect DataFrame info
df.info()
#Summary statistics for each column; list the object (string) columns
df.describe()
df.describe(include='object').columns
#Count the occurrences of each value in a column
print(train_df['column_name'].value_counts())
#Drop a column
df = df.drop(['Name'], axis=1)
#Drop duplicate rows; drop rows duplicated in a given column
df = df.drop_duplicates()
df = df.drop_duplicates('columns_name')
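The missing values discussed at the start can also be handled directly in pandas, without sklearn; a minimal sketch (the column names here are illustrative, not from the dataset above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [44.0, np.nan, 30.0],
                   'Country': ['France', 'Spain', 'Spain']})

dropped = df.dropna()                          # delete rows containing NaN
filled = df.fillna({'Age': df['Age'].mean()})  # fill NaN with the column mean

print(len(dropped))           # 2
print(filled['Age'].tolist()) # [44.0, 37.0, 30.0]
```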