Data Processing in Deep Learning - Building the Dataset

The importance of the dataset when building a deep learning model goes without saying. Building one involves the following steps:

  • Read in the data and analyze it with the help of visualization tools

  • Select features based on the results of this basic analysis and discard the unimportant ones

  • One-hot encode or map the categorical data

  • Standardize the features and split the dataset

The earlier projects drew on different data sources, and the actual processing differed in each case. This note puts a few classic examples side by side to compare and summarize them.

The code shown in this note comes from the Udacity Deep Learning Nano Degree, and its copyright belongs to Udacity. For the complete Jupyter notebook code, see my GitHub.

Data processing example from student admission project

In [1]:

# Importing pandas and numpy
import pandas as pd
import numpy as np

# Reading the csv file into a pandas DataFrame
data = pd.read_csv('student_data.csv')

# Printing out the first 3 rows of our data
data[:3]

Out[1]:

    admit   gre gpa rank
0   0       380 3.61    3
1   1       660 3.67    3
2   1       800 4.00    1

In [2]:

# Importing matplotlib
import matplotlib.pyplot as plt

# Function to help us plot
def plot_points(data):
    X = np.array(data[['gre','gpa']])
    y = np.array(data['admit'])
    admitted = X[np.argwhere(y==1)]
    rejected = X[np.argwhere(y==0)]
    plt.scatter([s[0][0] for s in rejected], [s[0][1] for s in rejected], s = 25, color = 'red', edgecolor = 'k')
    plt.scatter([s[0][0] for s in admitted], [s[0][1] for s in admitted], s = 25, color = 'cyan', edgecolor = 'k')
    plt.xlabel('Test (GRE)')
    plt.ylabel('Grades (GPA)')

# Plotting the points
plot_points(data)
plt.show()

Figure: Student admission (GRE score vs. GPA; admitted in cyan, rejected in red)

In [3]:

# Make dummy variables for rank
one_hot_data = pd.concat([data, pd.get_dummies(data['rank'], prefix='rank')], axis=1)

# Drop the previous rank column
one_hot_data = one_hot_data.drop('rank', axis=1)

# Print the first 3 rows of our data
one_hot_data[:3]

Out[3]:

    admit   gre gpa rank_1  rank_2  rank_3  rank_4
0   0       380 3.61    0   0       1       0
1   1       660 3.67    0   0       1       0
2   1       800 4.00    1   0       0       0
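
The overview lists mapping as an alternative to one-hot encoding, but the admission example only shows one-hot encoding. A minimal sketch of the mapping approach follows. The `size` column and the order chosen for its values are hypothetical and are not part of the student data.

```python
import pandas as pd

# Hypothetical ordinal column; not part of student_data.csv
df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# Map each category to an integer that preserves its natural order
size_order = {'small': 0, 'medium': 1, 'large': 2}
df['size_code'] = df['size'].map(size_order)

print(df)
```

Mapping tends to suit categories with a natural order. For unordered categories such as `rank` treated as a label, one-hot encoding avoids implying an order that the model might pick up on.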

In [4]:

# Scaling the data
# Use an explicit copy so that the scaling below does not touch one_hot_data
processed_data = one_hot_data.copy()

# Scaling the columns
processed_data['gre'] = processed_data['gre'] / 800
processed_data['gpa'] = processed_data['gpa'] / 4.0
processed_data[:3]


Out[4]:
    admit   gre     gpa     rank_1  rank_2  rank_3  rank_4
0   0       0.475   0.9025  0       0       1       0
1   1       0.825   0.9175  0       0       1       0
2   1       1.000   1.0000  1       0       0       0
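
The cell above divides by the fixed constants 800 and 4.0. As a rough sketch of the same idea without hard-coded numbers, the helper below divides each column by its observed maximum. The name `scale_by_max` and the toy frame are assumptions made for illustration; the result matches the cell above only when the data actually contains a GRE of 800 and a GPA of 4.0.

```python
import pandas as pd

def scale_by_max(df, columns):
    """Return a copy of df with each listed column divided by its own maximum."""
    scaled = df.copy()
    for col in columns:
        scaled[col] = scaled[col] / scaled[col].max()
    return scaled

# Toy frame reusing the first three rows printed earlier
toy = pd.DataFrame({'gre': [380, 660, 800], 'gpa': [3.61, 3.67, 4.00]})
scaled_toy = scale_by_max(toy, ['gre', 'gpa'])
print(scaled_toy)
```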

In [5]:

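# Randomly pick 90% of the rows for training; the rest become the test set
sample = np.random.choice(processed_data.index, size=int(len(processed_data)*0.9), replace=False)
# sample holds index labels, so select with .loc (label-based) to match drop(sample)
train_data, test_data = processed_data.loc[sample], processed_data.drop(sample)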

print("Number of training samples is", len(train_data))
print("Number of testing samples is", len(test_data))
print(train_data[:3])
print(test_data[:3])

Out[5]:

Number of training samples is 360
Number of testing samples is 40
     admit  gre     gpa  rank_1  rank_2  rank_3  rank_4
302      1  0.5  0.7875       0       1       0       0
121      1  0.6  0.6675       0       1       0       0
249      0  0.8  0.9325       0       0       1       0
    admit    gre     gpa  rank_1  rank_2  rank_3  rank_4
3       1  0.800  0.7975       0       0       0       1
12      1  0.950  1.0000       1       0       0       0
13      0  0.875  0.7700       0       1       0       0
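
The random split above gives different rows on every run. A minimal sketch of a repeatable version is shown below, assuming NumPy's `default_rng` generator is available (NumPy 1.17 or later). The helper name `split_train_test` and the toy frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def split_train_test(df, train_fraction=0.9, seed=0):
    """Randomly split df into train and test parts; a fixed seed makes it repeatable."""
    rng = np.random.default_rng(seed)
    n_train = int(len(df) * train_fraction)
    # Sample index labels without replacement, then select by label
    train_idx = rng.choice(df.index, size=n_train, replace=False)
    return df.loc[train_idx], df.drop(train_idx)

toy = pd.DataFrame({'x': range(10)})
train, test = split_train_test(toy, train_fraction=0.8, seed=42)
print(len(train), len(test))
```

Calling the helper twice with the same seed returns the same rows, which makes later results easier to compare.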


In [6]:

import keras

# Separate data and one-hot encode the output
# Note: We're also turning the data into numpy arrays, in order to train the model in Keras
# Use keras.utils.to_categorical to one-hot encode the targets
features = np.array(train_data.drop('admit', axis=1))
targets = np.array(keras.utils.to_categorical(train_data['admit'], 2))
features_test = np.array(test_data.drop('admit', axis=1))
targets_test = np.array(keras.utils.to_categorical(test_data['admit'], 2))

print(features[:3])
print(targets[:3])

Out[6]:

[[ 0.5     0.7875  0.      1.      0.      0.    ]
 [ 0.6     0.6675  0.      1.      0.      0.    ]
 [ 0.8     0.9325  0.      0.      1.      0.    ]]
[[ 0.  1.]
 [ 0.  1.]
 [ 1.  0.]]
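
To make the target encoding less of a black box, here is a small NumPy sketch that produces the same kind of output for integer labels. It is not how Keras implements `to_categorical`; the helper name `one_hot` is an assumption made for this note.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows by indexing an identity matrix."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes)[labels]

# The first three admit values of the training data printed above: 1, 1, 0
encoded = one_hot([1, 1, 0], 2)
print(encoded)
```

Row `i` of the identity matrix is exactly the one-hot vector for class `i`, so indexing with the label array builds the whole target matrix at once.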

Data processing example from bike rental project

In [7]:

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [8]:

data_path = 'Bike-Sharing-Dataset/hour.csv'
rides = pd.read_csv(data_path)


In [9]:

# List the plot styles available in the current environment
plt.style.available

Out[9]:

['seaborn-deep',
 'seaborn-talk',
 'seaborn-paper',
 'bmh',
 'grayscale',
 'seaborn-bright',
 'seaborn-colorblind',
 'ggplot',
 'seaborn-notebook',
 'seaborn-muted',
 'dark_background',
 'seaborn-dark-palette',
 'seaborn-white',
 'seaborn-darkgrid',
 'classic',
 'seaborn-poster',
 'seaborn-pastel',
 'fivethirtyeight',
 'seaborn-ticks',
 '_classic_test',
 'seaborn-whitegrid',
 'seaborn-dark',
 'seaborn']

In [10]:

# choose the style you like
plt.style.use('ggplot')

fig, ax = plt.subplots(nrows=1, ncols=1) # add this line to take control of the figure configuration later
rides[:24 * 10].plot(x='dteday', y='cnt', ax=ax, figsize=(10, 5)) #set ax=ax to take control of the figure
ax.legend().set_visible(False)
ax.set(title='Rental counts in the first 10 days', ylabel='Rental Counts', xlabel='Date'); 
# the trailing semicolon keeps the notebook from printing the return value of ax.set

Figure: Bike rental (rental counts in the first 10 days)

In [11]:

# This demonstrates how to one-hot encode more than one column using pandas
dummy_fields = ['season', 'weathersit', 'mnth', 'hr', 'weekday']
for each in dummy_fields:
    dummies = pd.get_dummies(rides[each], prefix=each, drop_first=False)
    rides = pd.concat([rides, dummies], axis=1)

fields_to_drop = ['instant', 'dteday', 'season', 'weathersit', 
                  'weekday', 'atemp', 'mnth', 'workingday', 'hr']
data = rides.drop(fields_to_drop, axis=1)
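
The loop above can also be collapsed into a single call, since `pd.get_dummies` accepts a `columns` argument. The sketch below uses a made-up toy frame with two of the categorical fields. Unlike the loop, this form drops the original columns automatically.

```python
import pandas as pd

# Toy frame with two of the categorical columns used above; the values are made up
toy = pd.DataFrame({'season': [1, 2, 1], 'weathersit': [1, 1, 3], 'cnt': [16, 40, 32]})

# columns= encodes the listed columns and removes the originals in one step
encoded = pd.get_dummies(toy, columns=['season', 'weathersit'])
print(encoded.columns.tolist())
```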

In [12]:

# Standardize the continuous features to zero mean and unit standard deviation
quant_features = ['casual', 'registered', 'cnt', 'temp', 'hum', 'windspeed']
# Store scalings in a dictionary so we can convert back later
scaled_features = {}
for each in quant_features:
    mean, std = data[each].mean(), data[each].std()
    scaled_features[each] = [mean, std]
    data[each] = (data[each] - mean) / std 
    # written as a plain column assignment for simplicity's sake
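
The dictionary is kept so that scaled values can be converted back later. A minimal sketch of that round trip is shown below on a made-up `cnt` column; the column names `cnt_scaled` and `cnt_restored` are only for illustration.

```python
import pandas as pd

# Toy stand-in for the rental counts; the numbers are made up
toy = pd.DataFrame({'cnt': [10.0, 20.0, 30.0]})

# Standardize and remember the statistics, as in the cell above
scaled_features = {}
mean, std = toy['cnt'].mean(), toy['cnt'].std()
scaled_features['cnt'] = [mean, std]
toy['cnt_scaled'] = (toy['cnt'] - mean) / std

# Convert back: multiply by the stored std, then add the stored mean
mean, std = scaled_features['cnt']
toy['cnt_restored'] = toy['cnt_scaled'] * std + mean
print(toy)
```

The same two stored numbers can be applied to model predictions, which come out in standardized units, to express them as rental counts again.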

In [13]:

# Save data for approximately the last 21 days 
test_data = data[-21*24:]

# Now remove the test data from the data set 
data = data[:-21*24]

# Separate the data into features and targets
target_fields = ['cnt', 'casual', 'registered']
features, targets = data.drop(target_fields, axis=1), data[target_fields]
test_features, test_targets = test_data.drop(target_fields, axis=1), test_data[target_fields]


In [14]:

# Hold out the last 60 days or so of the remaining data as a validation set
train_features, train_targets = features[:-60*24], targets[:-60*24]
val_features, val_targets = features[-60*24:], targets[-60*24:]
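
Unlike the admission data, the bike data is split by position rather than at random, because the rows are ordered in time. The two slicing steps above can be sketched as one helper; the name `chronological_split` and the toy frame are assumptions made for illustration.

```python
import pandas as pd

def chronological_split(df, test_hours, val_hours):
    """Split time-ordered rows into train, validation and test without shuffling.

    Both test_hours and val_hours must be greater than zero,
    because df[-0:] would return every row.
    """
    test = df[-test_hours:]
    remaining = df[:-test_hours]
    val = remaining[-val_hours:]
    train = remaining[:-val_hours]
    return train, val, test

# 100 fake hourly rows, already in time order
toy = pd.DataFrame({'hour_index': range(100)})
train, val, test = chronological_split(toy, test_hours=10, val_hours=20)
print(len(train), len(val), len(test))
```

Keeping the latest rows for testing means the model is always evaluated on data that comes after everything it was trained on.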