Numeric Mapping and One-Hot Encoding
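In machine learning and deep learning, categorical input features must be converted to numbers before they can take part in a network's weighted sum and the activation that follows. Depending on whether the categorical feature carries an inherent magnitude, it can be handled in one of two ways:
If the feature has a magnitude and should enter later computation as numbers of different sizes (for example, a size or a model grade), assign it numeric codes directly through a mapping.
If the feature has no magnitude, expand it into as many new features as there are categories, and set the input to 1 in the feature that matches its category and 0 in the others. This handles the categorical feature and lets it take part in computation without implying an order. The method is called one-hot encoding.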
one-hot is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0).[1] A similar implementation in which all bits are '1' except one '0' is sometimes called one-cold. - Wiki
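To make the two options concrete, here is a minimal sketch in plain Python that encodes one value each way. The helper names `ordinal_encode` and `one_hot_encode` and the category lists are hypothetical and chosen only for illustration.

```python
def ordinal_encode(value, ordered_categories):
    """Map a category to its 1-based rank in an ordered list."""
    return ordered_categories.index(value) + 1


def one_hot_encode(value, categories):
    """Return a list with a single 1 at the position of `value`."""
    return [1 if category == value else 0 for category in categories]


# Sizes have a natural order, so a single number preserves it.
sizes = ['M', 'L', 'XL']
size_code = ordinal_encode('L', sizes)

# Colors have no order, so each color becomes its own 0/1 feature.
colors = ['blue', 'green', 'red']
color_code = one_hot_encode('green', colors)

print(size_code)
print(color_code)
```

The ordinal code keeps the relation M < L < XL, while the one-hot vector treats every color as equally distant from the others.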
Both approaches are easy to carry out with Pandas:
import pandas as pd
df = pd.DataFrame([
['green', 'M', 10.1, 'class1'],
['red', 'L', 13.5, 'class2'],
['blue', 'XL', 15.3, 'class1']],
columns=['color', 'size', 'prize', 'class_label'])
df
Out[2]:
   color size  prize class_label
0  green    M   10.1      class1
1    red    L   13.5      class2
2   blue   XL   15.3      class1
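Map size, which does carry a magnitude, to numeric values. The class label carries no magnitude, but because it has only two categories it is also encoded with a mapping here for demonstration. Note that it could equally be handled the way color is handled later: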
In [3]:
# mapping the size
size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)
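# mapping the class
# note: set() has no guaranteed order, so which integer each label receives may differ between runs
class_mapping = {label: index for index, label in enumerate(set(df['class_label']))}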
df['class_label'] = df['class_label'].map(class_mapping)
df
Out[3]:
   color  size  prize  class_label
0  green     1   10.1            1
1    red     2   13.5            0
2   blue     3   15.3            1
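Because the set-based mapping above does not fix which integer each label gets, a reproducible variant can be useful. The following is a minimal sketch, assuming the labels can be sorted; the names `labels`, `inverse_mapping`, and `decoded` are introduced here for illustration. With sorted labels, class1 maps to 0, which may differ from the output shown above.

```python
import pandas as pd

labels = pd.Series(['class1', 'class2', 'class1'], name='class_label')

# Sorting the unique labels fixes the label-to-integer assignment.
class_mapping = {label: index for index, label in enumerate(sorted(labels.unique()))}
encoded = labels.map(class_mapping)

# Invert the dictionary to translate integer codes back to labels.
inverse_mapping = {index: label for label, index in class_mapping.items()}
decoded = encoded.map(inverse_mapping)

print(class_mapping)
print(encoded.tolist())
print(decoded.tolist())
```

Keeping the inverse mapping around makes it possible to report predictions in terms of the original label names.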
One-hot encode the color column, which has no magnitude, with pd.get_dummies(), then drop the original color column from the result:
In [8]:
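# dtype=int keeps the 0/1 columns shown below; recent pandas versions return booleans by default
one_hot_encoded = pd.concat([df, pd.get_dummies(df['color'], prefix='color', dtype=int)], axis=1)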
one_hot_encoded
Out[8]:
   color  size  prize  class_label  color_blue  color_green  color_red
0  green     1   10.1            1           0            1          0
1    red     2   13.5            0           0            0          1
2   blue     3   15.3            1           1            0          0
In [10]:
one_hot_encoded.drop('color', axis=1)
Out[10]:
   size  prize  class_label  color_blue  color_green  color_red
0     1   10.1            1           0            1          0
1     2   13.5            0           0            0          1
2     3   15.3            1           1            0          0
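The concat-then-drop sequence can also be collapsed into a single call. This is a minimal sketch, assuming a small DataFrame rebuilt here to mirror the example (the variable name `encoded` is illustrative): passing the whole frame to `pd.get_dummies` with `columns=['color']` encodes only that column and removes the original.

```python
import pandas as pd

df = pd.DataFrame([
    ['green', 1, 10.1],
    ['red', 2, 13.5],
    ['blue', 3, 15.3]],
    columns=['color', 'size', 'prize'])

# columns=['color'] encodes only that column and removes the original.
# dtype=int requests 0/1 integers instead of booleans.
encoded = pd.get_dummies(df, columns=['color'], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The untouched columns stay in place and the new indicator columns are appended, prefixed with the source column name.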
In Keras, one-hot encoding can be done with np_utils.to_categorical() as follows:
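In [1]:
# older Keras releases expose to_categorical through np_utils; newer releases provide keras.utils.to_categorical directly
from keras.utils import np_utils
# y_train and y_test are assumed to already hold integer class labels in the range 0-9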
# print first ten (integer-valued) training labels
print('Integer-valued labels:')
print(y_train[:10])
# one-hot encode the labels
y_train = np_utils.to_categorical(y_train, 10)
y_test = np_utils.to_categorical(y_test, 10)
# print first ten (one-hot) training labels
print('One-hot labels:')
print(y_train[:10])
Out[1]:
Integer-valued labels:
[5 0 4 1 9 2 1 3 1 4]
One-hot labels:
[[ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]
 [ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.]]
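The same transformation can be reproduced without Keras, which helps show what the encoding does. The sketch below uses only NumPy and reuses the ten integer labels printed above. It is an illustration of the idea, not how Keras implements it, and the names `one_hot` and `recovered` are introduced here.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4])
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix by the labels encodes them all at once.
one_hot = np.eye(num_classes)[labels]

# argmax along each row recovers the integer label.
recovered = one_hot.argmax(axis=1)

print(one_hot.shape)
print(recovered)
```

Taking the argmax of each row is also the usual way to turn a network's one-hot-style output back into a class index.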