Data Preprocessing
The task for Day 1 is data preprocessing. Let's get started!
Step 1: Import the libs
NumPy provides the mathematical functions we need; Pandas is used to import datasets and manipulate them.
If, like me, you know nothing about Pandas, this article is a good place to start: pandas tutorial
The code:
#Step 1: Import the libs
import numpy as np
import pandas as pd
Step 2: Import the dataset
Datasets usually come as CSV files, with one record per row. read_csv loads the CSV data into a DataFrame.
We then separate the independent variables and the dependent variable out of the DataFrame, as a matrix and a vector respectively.
The code:
#Step 2: Import dataset
dataset = pd.read_csv('../datasets/Data.csv')
X = dataset.iloc[:, :-1].values   # every column except the last -> feature matrix
Y = dataset.iloc[:, 3].values     # the 4th column (index 3) -> label vector
print("Step 2: Importing dataset")
print("X")
print(X)
print("Y")
print(Y)
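To make the slicing concrete, here is a minimal sketch with a made-up DataFrame standing in for Data.csv (the real file's column names and values may differ):

```python
import pandas as pd

# Hypothetical stand-in for Data.csv
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X = dataset.iloc[:, :-1].values   # all rows, every column except the last
Y = dataset.iloc[:, 3].values     # all rows, the 4th column (index 3)

print(X.shape)  # (3, 3)
print(Y)        # ['No' 'Yes' 'No']
```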
Step 3: Handling the missing data
Real-world data is rarely clean, so we usually need to handle missing values; otherwise the model is skewed by bad data during training. A common strategy is to replace each missing value with the mean or median of its column. In this example, missing values appear as NaN. The original tutorial used sklearn's Imputer class, which was removed in scikit-learn 0.22; SimpleImputer is its modern replacement.
The code:
#Step 3: Handling the missing data
#Note: sklearn.preprocessing.Imputer was removed in scikit-learn 0.22;
#SimpleImputer is its replacement (it imputes column-wise by default).
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
print("---------------------")
print("Step 3: Handling the missing data")
print("X")
print(X)
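A minimal sketch of mean imputation in isolation, using SimpleImputer (the current scikit-learn API); the toy values below are invented so the column means are easy to check by hand:

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[40.0, 72000.0],
                 [27.0, np.nan],      # missing Salary
                 [np.nan, 54000.0]])  # missing Age

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(data)

# Each NaN is replaced by the mean of its column:
# Age mean = (40 + 27) / 2 = 33.5, Salary mean = (72000 + 54000) / 2 = 63000
print(filled)
```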
Step 4: Encoding categorical data
Categorical data must be encoded as numbers; models cannot work with text labels directly. In this example the dependent variable takes the values Yes and No, which we convert with the LabelEncoder class.
- LabelEncoder: encodes labels as integers between 0 and n_classes-1; it can also turn non-numeric labels (as long as they are comparable) into numeric ones.
- OneHotEncoder: encodes categorical integer features using a one-of-K (one-hot) scheme.
The code:
#Step 4: Encoding categorical data
#Note: OneHotEncoder's categorical_features argument was removed in
#scikit-learn 0.22; a ColumnTransformer now selects the column to encode,
#and OneHotEncoder handles string categories directly, so the extra
#LabelEncoder pass over X is no longer needed.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#Creating dummy variables for the categorical column (column 0)
ct = ColumnTransformer([("encoder", OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
print("---------------------")
print("Step 4: Encoding categorical data")
print("X")
print(X)
print("Y")
print(Y)
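To see what the two encoders do in isolation, here is a small sketch with made-up country labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

countries = np.array([["France"], ["Spain"], ["Germany"], ["France"]])

# One-hot: each category becomes its own 0/1 column (a one-of-K scheme).
onehot = OneHotEncoder()
encoded = onehot.fit_transform(countries).toarray()
print(onehot.categories_)   # categories are sorted: France, Germany, Spain
print(encoded[0])           # France -> [1. 0. 0.]

# LabelEncoder maps class labels to the integers 0..n_classes-1.
labels = LabelEncoder().fit_transform(["No", "Yes", "No"])
print(labels)               # [0 1 0]
```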
Step 5: Splitting the dataset into training and test sets
The dataset is split into two parts: a training set, used to fit the model, and a test set, used to evaluate the trained model's performance. An 80:20 split is the usual rule of thumb.
The code:
#Step 5: Splitting the datasets into training sets and Test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print("---------------------")
print("Step 5: Splitting the datasets into training sets and Test sets")
print("X_train")
print(X_train)
print("X_test")
print(X_test)
print("Y_train")
print(Y_train)
print("Y_test")
print(Y_test)
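A quick sketch of the split on synthetic data, showing that test_size=0.2 holds out 20% of the samples and that the X/Y rows stay paired through the shuffle:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features: row i is [2i, 2i+1]
Y = np.arange(10)                  # one label per sample: label i for row i

# test_size=0.2 holds out 2 of the 10 samples;
# random_state makes the shuffle reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```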
Step 6: Feature Scaling
In machine learning, features with a large numeric magnitude carry far more weight than features with a small one (for example, in algorithms that compute Euclidean distances).
We solve this with feature standardization (Z-score scaling). Note that the scaler is fitted on the training set only and then applied with the same parameters to the test set.
The code:
#Step 6: Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print("---------------------")
print("Step 6: Feature Scaling")
print("X_train")
print(X_train)
print("X_test")
print(X_test)
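A small sketch of what StandardScaler does to a toy feature matrix (invented values): after fitting, each column has mean 0 and unit variance, so both features end up on the same scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0]])

sc = StandardScaler()
X_scaled = sc.fit_transform(X_train)

# Each column is shifted to mean 0 and scaled to (population) std 1,
# regardless of its original magnitude.
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```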