Data Processing in Deep Learning - Building a Dataset

The importance of the dataset in building a deep learning model goes without saying. Building one involves the following steps:

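Reading in the data and using visualization tools to assist with data analysis
Selecting features based on the results of that basic analysis and discarding the unimportant ones
One-hot encoding or mapping the categorical data
Standardizing the features and splitting the dataset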

In the projects studied so far, the actual processing differed from one data source to another. Here I put a few classic examples side by side to compare and summarize them.

The code shown in these notes comes from the Udacity Deep Learning Nanodegree and is copyrighted by Udacity. For the complete Jupyter notebook code, see my GitHub.

Data processing example from student admission project

In [1]:

# Importing pandas and numpy
import pandas as pd
import numpy as np

# Reading the csv file into a pandas DataFrame
data = pd.read_csv('student_data.csv')

# Printing out the first 3 rows of our data
data[:3]

Out[1]:

    admit   gre gpa rank
0   0       380 3.61    3
1   1       660 3.67    3
2   1       800 4.00    1

In [2]:

# Importing matplotlib
import matplotlib.pyplot as plt

# Function to help us plot
def plot_points(data):
    X = np.array(data[['gre','gpa']])
    y = np.array(data['admit'])
    admitted = X[y == 1]
    rejected = X[y == 0]
    plt.scatter(rejected[:, 0], rejected[:, 1], s=25, color='red', edgecolor='k')
    plt.scatter(admitted[:, 0], admitted[:, 1], s=25, color='cyan', edgecolor='k')
    plt.xlabel('Test (GRE)')
    plt.ylabel('Grades (GPA)')

# Plotting the points
plot_points(data)
plt.show()

Figure: Student admission (GRE score vs. GPA; admitted in cyan, rejected in red)

In [3]:

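# Make dummy variables for rank
# dtype=int keeps the dummy columns as 0/1 integers (newer pandas versions default to booleans)
one_hot_data = pd.concat([data, pd.get_dummies(data['rank'], prefix='rank', dtype=int)], axis=1)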

# Drop the previous rank column
one_hot_data = one_hot_data.drop('rank', axis=1)

# Print the first 3 rows of our data
one_hot_data[:3]

Out[3]:

    admit   gre gpa rank_1  rank_2  rank_3  rank_4
0   0       380 3.61    0   0       1       0
1   1       660 3.67    0   0       1       0
2   1       800 4.00    1   0       0       0
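
The rank column above was one-hot encoded. The other option named in the list of steps, mapping, replaces each category with a single number instead. The following is a minimal sketch on a made-up toy frame; the `size` column and the `size_order` dictionary are illustrative assumptions and are not part of the Udacity data.

```python
import pandas as pd

# Toy data: a hypothetical categorical column, not taken from student_data.csv
toy = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# Mapping: each category becomes one integer; the order is something we impose
size_order = {'small': 0, 'medium': 1, 'large': 2}
toy['size_mapped'] = toy['size'].map(size_order)

# One-hot encoding of the same column for comparison (0/1 integer columns)
dummies = pd.get_dummies(toy['size'], prefix='size', dtype=int)

print(toy['size_mapped'].tolist())
print(dummies.columns.tolist())
```

Mapping tends to suit categories with a natural order, while one-hot encoding avoids implying any order between categories.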

In [4]:

# Copy the data so that scaling does not modify one_hot_data
processed_data = one_hot_data.copy()

# Scaling the columns
processed_data['gre'] = processed_data['gre'] / 800
processed_data['gpa'] = processed_data['gpa'] / 4.0
processed_data[:3]


Out[4]:
    admit   gre     gpa     rank_1  rank_2  rank_3  rank_4
0   0       0.475   0.9025  0       0       1       0
1   1       0.825   0.9175  0       0       1       0
2   1       1.000   1.0000  1       0       0       0
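
The scaling in In [4] divides each column by its largest possible value (800 for GRE, 4.0 for GPA), so no scaled value exceeds 1. As a quick check of the arithmetic on the first three rows, here is a small sketch; the helper name `scale_by_max` is hypothetical and only used for illustration.

```python
import numpy as np

def scale_by_max(values, max_value):
    """Divide by a known maximum so that no result exceeds 1."""
    return np.asarray(values, dtype=float) / max_value

# First three rows of the gre and gpa columns from Out[1]
gre = scale_by_max([380, 660, 800], 800)
gpa = scale_by_max([3.61, 3.67, 4.00], 4.0)

print(gre)
print(gpa)
```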

In [5]:

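# Randomly choose 90% of the rows for training; the remaining 10% form the test set
sample = np.random.choice(processed_data.index, size=int(len(processed_data)*0.9), replace=False)
# sample holds index labels, so select with .loc rather than .iloc
train_data, test_data = processed_data.loc[sample], processed_data.drop(sample)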

print("Number of training samples is", len(train_data))
print("Number of testing samples is", len(test_data))
print(train_data[:3])
print(test_data[:3])

Out[5]:

Number of training samples is 360
Number of testing samples is 40
     admit  gre     gpa  rank_1  rank_2  rank_3  rank_4
302      1  0.5  0.7875       0       1       0       0
121      1  0.6  0.6675       0       1       0       0
249      0  0.8  0.9325       0       0       1       0
    admit    gre     gpa  rank_1  rank_2  rank_3  rank_4
3       1  0.800  0.7975       0       0       0       1
12      1  0.950  1.0000       1       0       0       0
13      0  0.875  0.7700       0       1       0       0
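
Because the split above is random, the rows that land in each set change on every run. If a repeatable split is wanted, one option is a seeded random generator. This is a sketch on a toy frame standing in for `processed_data`; the seed value 42 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Toy frame with 400 rows, the same size as the data in the output above
toy = pd.DataFrame({'x': np.arange(400)})

# A seeded generator makes the random choice repeatable
rng = np.random.default_rng(seed=42)
train_idx = rng.choice(toy.index.to_numpy(), size=int(len(toy) * 0.9), replace=False)

train = toy.loc[train_idx]   # rows whose labels were sampled
test = toy.drop(train_idx)   # every remaining row

print(len(train), len(test))
```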


In [6]:

import keras

# Separate the features from the targets and one-hot encode the targets
# Note: we also convert the data to NumPy arrays so the model can be trained in Keras
# keras.utils.to_categorical one-hot encodes the targets
features = np.array(train_data.drop('admit', axis=1))
targets = np.array(keras.utils.to_categorical(train_data['admit'], 2))
features_test = np.array(test_data.drop('admit', axis=1))
targets_test = np.array(keras.utils.to_categorical(test_data['admit'], 2))

print(features[:3])
print(targets[:3])

Out[6]:

[[ 0.5     0.7875  0.      1.      0.      0.    ]
 [ 0.6     0.6675  0.      1.      0.      0.    ]
 [ 0.8     0.9325  0.      0.      1.      0.    ]]
[[ 0.  1.]
 [ 0.  1.]
 [ 1.  0.]]
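
For integer class labels, the result of `keras.utils.to_categorical(labels, 2)` can be reproduced with plain NumPy by indexing the rows of an identity matrix. This is a sketch of the same idea, not Keras's own implementation.

```python
import numpy as np

# The admit values of the first three training rows printed above
labels = np.array([1, 1, 0])
num_classes = 2

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes)[labels]

print(one_hot)
```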

Data processing example from bike rental project

In [7]:

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [8]:

data_path = 'Bike-Sharing-Dataset/hour.csv'
rides = pd.read_csv(data_path)


In [9]:

# Check which styles are available in your current working environment
plt.style.available

Out[9]:

['seaborn-deep',
 'seaborn-talk',
 'seaborn-paper',
 'bmh',
 'grayscale',
 'seaborn-bright',
 'seaborn-colorblind',
 'ggplot',
 'seaborn-notebook',
 'seaborn-muted',
 'dark_background',
 'seaborn-dark-palette',
 'seaborn-white',
 'seaborn-darkgrid',
 'classic',
 'seaborn-poster',
 'seaborn-pastel',
 'fivethirtyeight',
 'seaborn-ticks',
 '_classic_test',
 'seaborn-whitegrid',
 'seaborn-dark',
 'seaborn']

In [10]:

# Choose the style you like
plt.style.use('ggplot')

# Create the figure and axes explicitly so they can be configured afterwards
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 5))
rides[:24 * 10].plot(x='dteday', y='cnt', ax=ax)  # pass ax=ax to draw on our own axes
ax.legend().set_visible(False)
# The trailing semicolon keeps the notebook from printing the return value of ax.set
ax.set(title='Rental counts in the first 10 days', ylabel='Rental Counts', xlabel='Date');

Figure: Bike rental (rental counts in the first 10 days)

In [11]:

# This demonstrates how to one-hot encode more than one column using pandas
dummy_fields = ['season', 'weathersit', 'mnth', 'hr', 'weekday']
for each in dummy_fields:
    # dtype=int keeps the dummy columns as 0/1 integers (newer pandas versions default to booleans)
    dummies = pd.get_dummies(rides[each], prefix=each, drop_first=False, dtype=int)
    rides = pd.concat([rides, dummies], axis=1)

fields_to_drop = ['instant', 'dteday', 'season', 'weathersit', 
                  'weekday', 'atemp', 'mnth', 'workingday', 'hr']
data = rides.drop(fields_to_drop, axis=1)
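
As an alternative to the loop, `pd.get_dummies` also accepts a whole DataFrame together with a `columns` argument. It then encodes the listed columns in one call and drops the originals itself. A minimal sketch on a made-up toy frame (the values are assumptions, not rows from hour.csv):

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    'season': [1, 2, 1],
    'weathersit': [1, 1, 3],
    'cnt': [16, 40, 32],
})

# Encode both categorical columns at once; the original columns are replaced
encoded = pd.get_dummies(toy, columns=['season', 'weathersit'], dtype=int)

print(encoded.columns.tolist())
```

With this form the encoded source columns no longer need to appear in a later drop list, although unrelated columns still have to be dropped separately.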

In [12]:

# Standardize the continuous features to zero mean and unit standard deviation
quant_features = ['casual', 'registered', 'cnt', 'temp', 'hum', 'windspeed']
# Store scalings in a dictionary so we can convert back later
scaled_features = {}
for each in quant_features:
    mean, std = data[each].mean(), data[each].std()
    scaled_features[each] = [mean, std]
    # Plain column assignment is used here for simplicity's sake
    data[each] = (data[each] - mean) / std
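
The stored mean and standard deviation are what make it possible to turn standardized values back into the original units: multiply by the standard deviation and add the mean. A short sketch on toy rental counts (the numbers are made up for illustration):

```python
import pandas as pd

# Toy rental counts standing in for the cnt column
toy = pd.DataFrame({'cnt': [16.0, 40.0, 32.0, 13.0]})

# Forward step: standardize and remember the scaling
scaled_features = {}
mean, std = toy['cnt'].mean(), toy['cnt'].std()
scaled_features['cnt'] = [mean, std]
scaled = (toy['cnt'] - mean) / std

# Inverse step: look the scaling up again and undo it
mean, std = scaled_features['cnt']
restored = scaled * std + mean

print(restored.tolist())
```

The same inverse step would apply to a model's predictions for `cnt`, since the targets were standardized along with the features.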

In [13]:

# Save data for approximately the last 21 days 
test_data = data[-21*24:]

# Now remove the test data from the data set 
data = data[:-21*24]

# Separate the data into features and targets
target_fields = ['cnt', 'casual', 'registered']
features, targets = data.drop(target_fields, axis=1), data[target_fields]
test_features, test_targets = test_data.drop(target_fields, axis=1), test_data[target_fields]


In [14]:

# Hold out the last 60 days or so of the remaining data as a validation set
train_features, train_targets = features[:-60*24], targets[:-60*24]
val_features, val_targets = features[-60*24:], targets[-60*24:]
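
Unlike the random split in the student admission example, the bike data is split by position: the last rows become the test set and the rows just before them the validation set, so the time order is preserved. The slicing can be checked on a toy frame; the 100 days of hourly rows below are an assumption chosen only to make the sizes easy to verify.

```python
import numpy as np
import pandas as pd

hours_per_day = 24
# Toy hourly series covering 100 days
toy = pd.DataFrame({'cnt': np.arange(100 * hours_per_day)})

# The last 21 days are held out for testing
test = toy[-21 * hours_per_day:]
rest = toy[:-21 * hours_per_day]

# The last 60 days of what remains are used for validation
val = rest[-60 * hours_per_day:]
train = rest[:-60 * hours_per_day]

print(len(train), len(val), len(test))
```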

Last edited: 2018.01.25 03:27:23

© Copyright belongs to the author. For reprints or content collaboration, please contact the author.
