Almost every machine learning algorithm requires some data preparation, and different algorithms, depending on their assumptions, may call for different data transforms. The original author's advice: take a data-driven approach. Combine multiple data preparation methods with multiple algorithms, compare their performance, and build up a mapping between data transforms and the algorithms they benefit.
1. Rescale Data
When your attributes are on different scales, rescaling them all onto one common scale can noticeably help many ML algorithms. This kind of rescaling is often called normalization (min-max scaling) and typically maps each attribute into the range 0 to 1. Useful for: optimization algorithms such as gradient descent, algorithms that weight inputs such as regression and neural networks, and algorithms that use distance measures such as k-nearest neighbors. Implemented with the MinMaxScaler class in scikit-learn. Code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data_link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(data_link, names=names)
data = df.values
X = data[:, 0:8]  # the eight input attributes
Y = data[:, 8]    # the class label
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
np.set_printoptions(precision=3)
print(X)
print(rescaledX)
Output:
[[ 6. 148. 72. ..., 33.6 0.627 50. ]
[ 1. 85. 66. ..., 26.6 0.351 31. ]
[ 8. 183. 64. ..., 23.3 0.672 32. ]
...,
[ 5. 121. 72. ..., 26.2 0.245 30. ]
[ 1. 126. 60. ..., 30.1 0.349 47. ]
[ 1. 93. 70. ..., 30.4 0.315 23. ]]
[[ 0.353 0.744 0.59 ..., 0.501 0.234 0.483]
[ 0.059 0.427 0.541 ..., 0.396 0.117 0.167]
[ 0.471 0.92 0.525 ..., 0.347 0.254 0.183]
...,
[ 0.294 0.608 0.59 ..., 0.39 0.071 0.15 ]
[ 0.059 0.633 0.492 ..., 0.449 0.116 0.433]
[ 0.059 0.467 0.574 ..., 0.453 0.101 0.033]]
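As a quick sanity check, the transform is just the per-column formula (x - min) / (max - min). A minimal sketch on a small made-up matrix (not the Pima data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical two-feature data on very different scales
X = np.array([[1.0, 200.0],
              [5.0, 400.0],
              [9.0, 300.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(X)

# Manual min-max formula, applied per column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(scaled, manual))  # True
```

After the transform, each column's minimum is exactly 0 and its maximum exactly 1, regardless of the original units.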
2. Standardize Data
Standardization assumes each attribute follows a Gaussian distribution and transforms attributes with differing means and standard deviations into a standard Gaussian (mean 0, standard deviation 1). Useful for techniques that assume Gaussian inputs, such as linear regression, logistic regression, and linear discriminant analysis. Implemented with the StandardScaler class in scikit-learn. Code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

data_link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(data_link, names=names)
data = df.values
X = data[:, 0:8]
Y = data[:, 8]
scaler = StandardScaler().fit(X)  # learn each column's mean and standard deviation
rescaledX = scaler.transform(X)
np.set_printoptions(precision=3)
print(X)
print(rescaledX)
Output:
[[ 6. 148. 72. ..., 33.6 0.627 50. ]
[ 1. 85. 66. ..., 26.6 0.351 31. ]
[ 8. 183. 64. ..., 23.3 0.672 32. ]
...,
[ 5. 121. 72. ..., 26.2 0.245 30. ]
[ 1. 126. 60. ..., 30.1 0.349 47. ]
[ 1. 93. 70. ..., 30.4 0.315 23. ]]
[[ 0.64 0.848 0.15 ..., 0.204 0.468 1.426]
[-0.845 -1.123 -0.161 ..., -0.684 -0.365 -0.191]
[ 1.234 1.944 -0.264 ..., -1.103 0.604 -0.106]
...,
[ 0.343 0.003 0.15 ..., -0.735 -0.685 -0.276]
[-0.845 0.16 -0.471 ..., -0.24 -0.371 1.171]
[-0.845 -0.873 0.046 ..., -0.202 -0.474 -0.871]]
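The same kind of check works here: StandardScaler computes the per-column z-score (x - mean) / std, using the population standard deviation. A small made-up example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data; columns have different means and spreads
X = np.array([[1.0, 10.0],
              [3.0, 20.0],
              [5.0, 60.0]])

scaler = StandardScaler().fit(X)
standardized = scaler.transform(X)

# Manual z-score per column (ddof=0, the population standard deviation)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(standardized, manual))  # True
```

After the transform, each column has mean 0 and standard deviation 1.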
3. Normalize Data
Normalizing rescales each row (each observation) so that it has a length of 1 (a unit norm in linear algebra: the row vector's Euclidean length equals 1). It can be useful for sparse datasets with attributes of varying scales. Useful for: neural networks that weight input values and distance-based algorithms such as k-nearest neighbors. Implemented with the Normalizer class in scikit-learn. Code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import Normalizer

data_link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(data_link, names=names)
data = df.values
X = data[:, 0:8]
Y = data[:, 8]
scaler = Normalizer().fit(X)  # L2 norm by default; scales each row, not each column
normalizedX = scaler.transform(X)
np.set_printoptions(precision=3)
print(X)
print(normalizedX)
Output:
[[ 6. 148. 72. ..., 33.6 0.627 50. ]
[ 1. 85. 66. ..., 26.6 0.351 31. ]
[ 8. 183. 64. ..., 23.3 0.672 32. ]
...,
[ 5. 121. 72. ..., 26.2 0.245 30. ]
[ 1. 126. 60. ..., 30.1 0.349 47. ]
[ 1. 93. 70. ..., 30.4 0.315 23. ]]
[[ 0.034 0.828 0.403 ..., 0.188 0.004 0.28 ]
[ 0.008 0.716 0.556 ..., 0.224 0.003 0.261]
[ 0.04 0.924 0.323 ..., 0.118 0.003 0.162]
...,
[ 0.027 0.651 0.388 ..., 0.141 0.001 0.161]
[ 0.007 0.838 0.399 ..., 0.2 0.002 0.313]
[ 0.008 0.736 0.554 ..., 0.241 0.002 0.182]]
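A unit norm simply means each row is divided by its own Euclidean length, so every row ends up with length 1. A tiny made-up example makes this concrete; note that rows pointing in the same direction become identical, which is exactly why this transform suits direction-sensitive, distance-based methods:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two hypothetical rows; the second is the first scaled by 2
X = np.array([[3.0, 4.0],
              [6.0, 8.0]])

normalized = Normalizer(norm='l2').fit_transform(X)
print(normalized)  # both rows become [0.6 0.8]
print(np.linalg.norm(normalized, axis=1))  # every row now has length 1
```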
4. Binarize Data
Binarizing transforms data using a threshold: values above the threshold become 1, values at or below it become 0. Useful when you want crisp values, i.e. hard 0/1 values rather than probabilities, and when adding new binary attributes during feature engineering. Implemented with the Binarizer class in scikit-learn. Code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import Binarizer

data_link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(data_link, names=names)
data = df.values
X = data[:, 0:8]
Y = data[:, 8]
binarizer = Binarizer(threshold=0.0).fit(X)  # values > 0 become 1, the rest 0
binaryX = binarizer.transform(X)
np.set_printoptions(precision=3)
print(X)
print(binaryX)
Output:
[[ 6. 148. 72. ..., 33.6 0.627 50. ]
[ 1. 85. 66. ..., 26.6 0.351 31. ]
[ 8. 183. 64. ..., 23.3 0.672 32. ]
...,
[ 5. 121. 72. ..., 26.2 0.245 30. ]
[ 1. 126. 60. ..., 30.1 0.349 47. ]
[ 1. 93. 70. ..., 30.4 0.315 23. ]]
[[ 1. 1. 1. ..., 1. 1. 1.]
[ 1. 1. 1. ..., 1. 1. 1.]
[ 1. 1. 1. ..., 1. 1. 1.]
...,
[ 1. 1. 1. ..., 1. 1. 1.]
[ 1. 1. 1. ..., 1. 1. 1.]
[ 1. 1. 1. ..., 1. 1. 1.]]
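With threshold=0.0 on this mostly positive dataset, nearly everything maps to 1, which makes the printed output uninformative. A small made-up example with a nonzero threshold shows the behavior more clearly:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical values straddling a threshold of 1.0
X = np.array([[0.2, 1.5, 3.0],
              [0.0, 2.5, 0.7]])

# Values strictly greater than the threshold map to 1, the rest to 0
binary = Binarizer(threshold=1.0).fit_transform(X)
print(binary)
# [[0. 1. 1.]
#  [0. 1. 0.]]
```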
5. Summary
Four data preparation methods were covered: rescale, standardize, normalize, and binarize. More important, in my view, is to get hands-on familiarity with the scenarios where each method applies, and through that practice build both theory and intuition for how each transform improves a given algorithm.
6. Key points:
- MinMaxScaler
- StandardScaler
- Normalizer
- Binarizer
- scaler.fit(X)
- scaler.transform(X)
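The fit/transform split in the last two points is the part worth internalizing: fit() learns statistics from data, transform() applies them. A minimal sketch (hypothetical numbers) of why you fit on the training data only, then reuse the fitted scaler on new data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical 1-feature data: fit() learns the min/max from the
# training set only; transform() reuses those statistics later.
X_train = np.array([[0.0], [10.0]])
X_test = np.array([[5.0], [20.0]])

scaler = MinMaxScaler().fit(X_train)
print(scaler.transform(X_train).ravel())  # [0. 1.]
# New values outside the training range can map outside [0, 1]:
print(scaler.transform(X_test).ravel())   # [0.5 2. ]
```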
Original article: How To Prepare Your Data For Machine Learning in Python with Scikit-Learn