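Machine Learning - Chapter 01

Preface

This case study comes from the GitHub project ageron/handson-ml. I am currently working through that project to get familiar with numpy, pandas, and matplotlib.

The project is in English. I will translate it step by step as I progress and add my own comments to the code, for study and later review. After I finish translating each chapter I will commit the notebook code, and I will publish the repository address at that point.

Chapter 1 - The Machine Learning Landscape

This code is used to generate some of the figures in Chapter 1.

Setup

First, let's make sure this notebook works in both Python 2 and Python 3 environments: import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures: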

# Import the following to support both Python 2 and Python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# Fix the random seed so this notebook produces the same output on every run
np.random.seed(42)

# Configure plotting
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where and how to save the figures
PROJECT_ROOT_DIR = '.'
CHAPTER_ID = 'fundamentals'

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, 'images', CHAPTER_ID, fig_id + '.png')
    print('Saving figure', fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)

# Ignore useless warnings
import warnings
warnings.filterwarnings(action='ignore', module='scipy', message='^internal gelsd')

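One thing to watch with save_fig as written: plt.savefig does not create missing folders, so images/fundamentals has to exist before the first call. The sketch below is not part of the original notebook. The helper name figure_path and the temporary demo folder are my own assumptions, and os.makedirs(..., exist_ok=True) needs Python 3.2 or later, so it would not run under Python 2 as is.

```python
import os
import tempfile


def figure_path(root_dir, chapter_id, fig_id, extension="png"):
    """Return the file path for a figure, creating the target folder if needed.

    Hypothetical helper (not in the original notebook).
    """
    folder = os.path.join(root_dir, "images", chapter_id)
    # exist_ok=True makes the call a no-op when the folder is already there
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, fig_id + "." + extension)


# Demo in a throwaway directory so nothing is written into the project
demo_root = tempfile.mkdtemp()
demo_path = figure_path(demo_root, "fundamentals", "money_happy_scatterplot")
print(demo_path)
```

Inside save_fig, the path returned by such a helper could be passed straight to plt.savefig.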
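Example 1-1

This function just merges the data from datasets/lifesat/oecd_bli_2015.csv and datasets/lifesat/gdp_per_capita.csv. It is long and tedious and it is not really machine learning, which is why I left it out of the book.

The code in the book expects the data files to be in the current directory. I just tweaked it here to fetch the files from datasets/lifesat.

# Prepare the data (explained in detail later in this chapter)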
def prepare_country_stats(oecd_bli, gdp_per_capita):
    oecd_bli = oecd_bli[oecd_bli['INEQUALITY'] == 'TOT']
    oecd_bli = oecd_bli.pivot(index='Country', columns='Indicator', values='Value')

    gdp_per_capita.rename(columns={'2015': 'GDP per capita'}, inplace=True)
    gdp_per_capita.set_index('Country', inplace=True)
    full_country_stats = pd.merge(left=oecd_bli, left_index=True, right=gdp_per_capita, right_index=True)
    full_country_stats.sort_values(by='GDP per capita', inplace=True)

    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = list(set(range(36)) - set(remove_indices))

    return full_country_stats[['GDP per capita', 'Life satisfaction']].iloc[keep_indices]

import os

# Path where the data is stored
datapath = os.path.join('datasets', 'lifesat', '')

# Code example
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Load the data
oecd_bli = pd.read_csv(datapath + 'oecd_bli_2015.csv', thousands=',')
gdp_per_capita = pd.read_csv(datapath + 'gdp_per_capita.csv', thousands=',', delimiter='\t', encoding='latin1', na_values='n/a')

# Prepare the data
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)

# Build the feature and target arrays
# np.r_ concatenates along the first axis (stacks rows), so the column counts must match
# np.c_ concatenates along the last axis (puts columns side by side), so the row counts must match
# Applied to a single column, np.c_ returns a 2-D array of shape (n, 1)
x = np.c_[country_stats['GDP per capita']]
y = np.c_[country_stats['Life satisfaction']]

# Visualize the data
country_stats.plot(kind='scatter', x='GDP per capita', y='Life satisfaction')
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(x, y)

# Make a prediction for Cyprus (GDP per capita of 22587)
x_new = [[22587]]
print(model.predict(x_new))

Predicted life satisfaction for Cyprus:

[[5.96242338]]
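The np.c_ and np.r_ helpers used in Example 1-1 are easy to mix up, so here is a small sketch on made-up arrays (the variable names are my own). It only shows the shapes each helper produces, which is what scikit-learn cares about: fit expects a 2-D feature array, and np.c_ on a single column gives exactly that.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# np.c_ on one 1-D array turns it into a column: shape (3, 1)
column = np.c_[a]

# np.c_ on two 1-D arrays puts them side by side as columns: shape (3, 2)
side_by_side = np.c_[a, b]

# np.r_ on two 1-D arrays joins them end to end: shape (6,)
end_to_end = np.r_[a, b]

m = np.array([[1, 2], [3, 4]])
# For 2-D inputs, np.r_ stacks rows and np.c_ stacks columns
stacked_rows = np.r_[m, m]      # shape (4, 2)
stacked_columns = np.c_[m, m]   # shape (2, 4)

print(column.shape, side_by_side.shape, end_to_end.shape)
print(stacked_rows.shape, stacked_columns.shape)
```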

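Note: you can ignore the rest of this notebook; it only generates some of this chapter's figures.

Load and prepare the Life satisfaction data

If you want, you can get the latest data from the OECD's website. Download the CSV file from http://stats.oecd.org/index.aspx?DataSetCode=BLI and save it to datasets/lifesat/.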

# Read the CSV file; thousands=',' treats commas inside numbers as thousands separators
oecd_bli = pd.read_csv(datapath + 'oecd_bli_2015.csv', thousands=',')

# Keep only the rows whose INEQUALITY field is 'TOT'
oecd_bli = oecd_bli[oecd_bli['INEQUALITY'] == 'TOT']

# Pivot so that each country is a row (the index) and each indicator is a column
oecd_bli = oecd_bli.pivot(index='Country', columns='Indicator', values='Value')

# Show the first two rows
oecd_bli.head(2)

# head() shows the first 5 rows by default
oecd_bli['Life satisfaction'].head()
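To see what the filter and pivot steps do, here is a minimal sketch on a toy table. The countries "A" and "B" and all values are made up for illustration; only the column names follow the OECD file. Filtering on 'TOT' first matters: without it, the same (Country, Indicator) pair appears more than once and pivot raises an error about duplicate entries.

```python
import pandas as pd

# Toy long-format table: one row per (country, indicator, inequality group)
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Indicator": ["Life satisfaction", "Life satisfaction", "Air pollution"] * 2,
    "INEQUALITY": ["TOT", "MN", "TOT", "TOT", "MN", "TOT"],
    "Value": [7.0, 6.9, 13.0, 5.0, 5.1, 20.0],  # made-up numbers
})

# Keep only the totals, so each (Country, Indicator) pair is unique
totals = toy[toy["INEQUALITY"] == "TOT"]

# Long to wide: one row per country, one column per indicator
wide = totals.pivot(index="Country", columns="Indicator", values="Value")
print(wide)
```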


Load and prepare the GDP per capita data

As above, you can update the GDP per capita data if you want. Download the latest data from http://goo.gl/j1MSKe (=> imf.org) and save it to datasets/lifesat/. (Readers in mainland China will need their own proxy.)

# Load the data (tab-separated, commas as thousands separators, 'n/a' read as missing)
gdp_per_capita = pd.read_csv(datapath + 'gdp_per_capita.csv', thousands=',', delimiter='\t', encoding='latin1', na_values='n/a')

# Rename the "2015" column to "GDP per capita"
gdp_per_capita.rename(columns={'2015': 'GDP per capita'}, inplace=True)

# Use "Country" as the index
gdp_per_capita.set_index('Country', inplace=True)

gdp_per_capita.head(2)
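The read_csv options above are easier to follow on a tiny in-memory file. This is a sketch only: the country names and figures are invented, and io.StringIO stands in for the real gdp_per_capita.csv.

```python
import io

import pandas as pd

# Made-up tab-separated data with thousands separators and a missing value
raw = (
    "Country\t2015\n"
    "Aland\t1,234.5\n"
    "Borduria\t12,345.0\n"
    "Nowhere\tn/a\n"
)

toy_gdp = pd.read_csv(
    io.StringIO(raw),
    delimiter="\t",    # columns are separated by tabs
    thousands=",",     # "1,234.5" is parsed as the number 1234.5
    na_values="n/a",   # "n/a" becomes NaN
)

# Same clean-up steps as in the notebook
toy_gdp.rename(columns={"2015": "GDP per capita"}, inplace=True)
toy_gdp.set_index("Country", inplace=True)
print(toy_gdp)
```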


# Merge the two datasets on their indexes (the country names)
full_country_stats = pd.merge(left=oecd_bli, left_index=True, right=gdp_per_capita, right_index=True)

# Sort by "GDP per capita"
full_country_stats.sort_values(by='GDP per capita', inplace=True)

full_country_stats.head(2)


# Look up "GDP per capita" and "Life satisfaction" for "United States"
full_country_stats[['GDP per capita', 'Life satisfaction']].loc['United States']
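A detail that is easy to miss in the merge step: pd.merge defaults to an inner join, so a country that appears in only one of the two tables is dropped. The sketch below uses two made-up frames (countries "A" to "D" are placeholders) to show that behavior, followed by the same sort and label lookup.

```python
import pandas as pd

satisfaction = pd.DataFrame(
    {"Life satisfaction": [7.0, 5.0, 6.0]},  # made-up values
    index=pd.Index(["A", "B", "C"], name="Country"),
)
gdp = pd.DataFrame(
    {"GDP per capita": [50000.0, 10000.0, 30000.0]},  # made-up values
    index=pd.Index(["A", "B", "D"], name="Country"),
)

# Join on the indexes; the default how="inner" keeps only countries in both frames
merged = pd.merge(left=satisfaction, left_index=True, right=gdp, right_index=True)

# Sort ascending by GDP per capita
merged.sort_values(by="GDP per capita", inplace=True)
print(merged)

# Label-based lookup of one country, as done for "United States" in the notebook
print(merged[["GDP per capita", "Life satisfaction"]].loc["A"])
```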


# Row positions to exclude
remove_indices = [0, 1, 6, 8, 33, 34, 35]

# Row positions to keep (set difference of range(36) and remove_indices)
keep_indices = list(set(range(36)) - set(remove_indices))

# "GDP per capita" and "Life satisfaction" for the rows that are kept
sample_data = full_country_stats[['GDP per capita', 'Life satisfaction']].iloc[keep_indices]

# "GDP per capita" and "Life satisfaction" for the rows that are excluded
missing_data = full_country_stats[['GDP per capita', 'Life satisfaction']].iloc[remove_indices]
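The set difference used for keep_indices works, but Python does not guarantee the iteration order of a set, so the row order of sample_data relies on an implementation detail. If the sorted-by-GDP order matters to you, a list comprehension is one way to make it explicit. This is my own variation, not the notebook's code:

```python
remove_indices = [0, 1, 6, 8, 33, 34, 35]

# Use a set only for fast membership tests; the output order comes from range(36)
removed = set(remove_indices)
keep_indices = [i for i in range(36) if i not in removed]

print(len(keep_indices))   # 36 rows minus the 7 removed ones
print(keep_indices[:5])
```

The resulting list can be passed to .iloc exactly as before.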

# Draw the scatter plot (kind: plot type, x: column for the x-axis, y: column for the y-axis, figsize: figure size in inches)
sample_data.plot(kind='scatter', x='GDP per capita', y='Life satisfaction', figsize=(5, 3))

# Axis limits [x_min, x_max, y_min, y_max]
plt.axis([0, 60000, 0, 10])

# Text positions for the countries to annotate
position_text = {
    'Hungary': (5000, 1),
    'Korea': (18000, 1.7),
    'France': (29000, 2.4),
    'Australia': (40000, 3.0),
    'United States': (52000, 3.8),
}

# Annotate the selected countries
for country, pos_text in position_text.items():
    # pos_data_x is "GDP per capita", pos_data_y is "Life satisfaction"
    pos_data_x, pos_data_y = sample_data.loc[country]
    # Shorten "United States" to "U.S." for the label
    country = 'U.S.' if country == 'United States' else country
    # Draw the label and an arrow pointing from the text to the data point
    plt.annotate(
        country,
        xy=(pos_data_x, pos_data_y),
        xytext=pos_text,
        arrowprops=dict(facecolor='black', width=0.5, shrink=0.1, headwidth=5)
    )
    # Mark the data point (in "ro", "r" means red and "o" means a circle marker)
    plt.plot(pos_data_x, pos_data_y, 'ro')

# Save the figure
# save_fig('money_happy_scatterplot')
plt.show()

[Figure: money_happy_scatterplot]

# Save the data to datasets/lifesat/lifesat.csv
sample_data.to_csv(os.path.join('datasets', 'lifesat', 'lifesat.csv'))

# Look up the rows for the countries listed in position_text
sample_data.loc[list(position_text.keys())]

import numpy as np

# Draw the scatter plot (kind: plot type, x: column for the x-axis, y: column for the y-axis, figsize: figure size in inches)
sample_data.plot(kind='scatter', x='GDP per capita', y='Life satisfaction', figsize=(5, 3))

# Axis limits [x_min, x_max, y_min, y_max]
plt.axis([0, 60000, 0, 10])

# np.linspace(start, stop, num) returns num evenly spaced values from start to stop (inclusive),
# for example np.linspace(1, 10, 4) returns [1., 4., 7., 10.]
X = np.linspace(0, 60000, 1000)

# plt.plot(x, fx, color)
#   draws a line; arguments: independent variable, function values (dependent variable), line color
# plt.text(x, y, text)
#   adds a text label; arguments: x coordinate of the text start, y coordinate of the text start, text content
# r'string'
#   raw string; with the "r" prefix, "\" is not treated as an escape character (also common in regular expressions)

plt.plot(X, 2 * X / 100000, 'r')
plt.text(40000, 2.7, r'$\theta_0 = 0$', fontsize=14, color='r')
plt.text(40000, 1.8, r'$\theta_1 = 2 \times 10 ^ {-5}$', fontsize=14, color='r')

plt.plot(X, 8 - 5 * X / 100000, 'g')
plt.text(5000, 9.1, r'$\theta_0 = 8$', fontsize=14, color='g')
plt.text(5000, 8.2, r'$\theta_1 = -5 \times 10 ^ {-5}$', fontsize=14, color='g')

plt.plot(X, 4 + 5 * X / 100000, 'b')
plt.text(5000, 3.5, r'$\theta_0 = 4$', fontsize=14, color='b')
plt.text(5000, 2.6, r'$\theta_1 = 5 \times 10 ^ {-5}$', fontsize=14, color='b')

# save_fig('tweaking_model_params_plot')

plt.show()

[Figure: tweaking_model_params_plot]
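The three lines in this figure are the model life_satisfaction = theta_0 + theta_1 * GDP_per_capita with different parameter values. As a rough sketch of how one could compare such candidates numerically, the code below scores each parameter pair by mean squared error. The three data points are invented for illustration and are not taken from the dataset; the predict helper and the color keys are my own naming.

```python
import numpy as np


def predict(theta0, theta1, gdp):
    """Linear model: theta0 + theta1 * gdp."""
    return theta0 + theta1 * gdp


# Made-up sample points (GDP per capita, life satisfaction)
gdp = np.array([10000.0, 30000.0, 50000.0])
satisfaction = np.array([4.8, 5.9, 6.4])

# The three parameter pairs drawn in the figure, keyed by line color
candidates = {
    "red": (0.0, 2e-5),
    "green": (8.0, -5e-5),
    "blue": (4.0, 5e-5),
}

# Mean squared error of each candidate line on the sample points
errors = {
    color: float(np.mean((predict(t0, t1, gdp) - satisfaction) ** 2))
    for color, (t0, t1) in candidates.items()
}

best = min(errors, key=errors.get)
print(errors)
print("Lowest error:", best)
```

Training a linear regression amounts to searching for the parameter pair that minimizes this kind of error over the real data, instead of picking from a fixed list.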

To be continued...
