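By working through this handbook, you will learn:
- matplotlib and how to set up its environment
- the anatomy of a data plot and the matplotlib name for each part
- common types of data plots and how to draw them
You may need the following setup and background knowledge:
- A Python development environment with the matplotlib package
- Basic Python syntax
- Working knowledge of the Python numpy package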
1. Installing and configuring matplotlib
On Linux, matplotlib can be installed as follows:
sudo pip install numpy
sudo pip install scipy
sudo pip install matplotlib
On Windows, Anaconda is strongly recommended.
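2. Basic structure of a visualization
Typically, numpy is used to organize the data and the matplotlib API is used to draw it. A data plot basically consists of the following parts:
- Data: the data area, including the data points and the shapes drawn for them
- Axis: the coordinate axes, including the X axis, the Y axis and their labels, and the ticks and their labels
- Title: the title, a description of the plot
- Legend: the legend, which distinguishes the multiple curves or data categories in the plot
Other descriptive elements include figure text (Text) and annotations (Annotate).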
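3. Drawing
Using an ordinary plot as the example, this section records the plotting workflow and techniques in detail. Following the plot structure, drawing a data plot can be divided into these steps:
- Import matplotlib and the related packages
- Prepare the data, stored in numpy arrays
- Draw the raw curves
- Configure the title, axes, ticks, and legend
- Add text and annotations
- Show and save the result
3.1 Imports
We will use matplotlib.pyplot, pylab, and numpy.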
#coding:utf-8
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from pylab import *
3.2 Preparing the data
numpy is commonly used to organize the source data:
# Define the data
x = np.arange(0., 10, 0.2)
y1 = np.cos(x)
y2 = np.sin(x)
y3 = np.sqrt(x)
#x = all_df['house_age']
#y = all_df['house_price']
3.3 Drawing the basic curves
Use the plot function to draw the functions above directly. Its arguments control the style, width, color, and markers of each curve:
# Draw the 3 function curves
# $y=\sqrt{x}$
plt.rcParams["figure.figsize"] = (12,8)
plt.plot(x, y1, color='blue', linewidth=1.5, linestyle='-', marker='.', label=r'$y = \cos{x}$')
plt.plot(x, y2, color='green', linewidth=1.5, linestyle='-', marker='*', label=r'$y = \sin{x}$')
plt.plot(x, y3, color='m', linewidth=1.5, linestyle='-', marker='x', label=r'$y = \sqrt{x}$')
3.3.1 A note on colors
Colors are set mainly through the color parameter:
- r red
- g green
- b blue
- c cyan
- m magenta
- y yellow
- k black
- w white
3.3.2 The linestyle parameter
The linestyle parameter mainly covers dashed ('--'), dash-dot ('-.'), dotted (':'), and solid ('-') lines.
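As a rough illustration of the linestyle codes, the sketch below draws one shifted sine curve per style. The Agg backend line is an assumption made so the snippet runs without a display (drop it inside a notebook), and the variable names are made up for this example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
# The four basic linestyle codes accepted by plot()
styles = {"-": "solid", "--": "dashed", "-.": "dash-dot", ":": "dotted"}

fig, ax = plt.subplots()
lines = []
for offset, (code, name) in enumerate(styles.items()):
    # Shift each curve upward so the styles do not overlap
    (line,) = ax.plot(x, np.sin(x) + offset, linestyle=code, label=f"{name} ({code})")
    lines.append(line)
legend = ax.legend(loc="upper right")
```

Each code can also be passed by its long name, for example linestyle='dashed'.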
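3.3.3 The marker parameter
The marker parameter sets the symbol drawn at each data point on a curve, which helps tell different lines apart. Common shapes include the point ('.'), circle ('o'), star ('*'), cross ('x'), plus ('+'), square ('s'), and triangle ('^').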
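A small sketch of how marker codes are used in practice: one line per marker, each shifted vertically. The marker list is only a sample of what matplotlib accepts, and the Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.0, 10.0, 1.0)
markers = [".", "o", "*", "x", "+", "s", "^"]  # a sample of common marker codes

fig, ax = plt.subplots()
marked_lines = []
for offset, code in enumerate(markers):
    # Same data each time, shifted up, so only the marker differs
    (line,) = ax.plot(x, np.sqrt(x) + offset, marker=code, linestyle="-", label=f"marker={code!r}")
    marked_lines.append(line)
ax.legend(loc="upper left")
```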
3.4 Configuring the axes
The following code moves the axis spines:
# Move the axes
ax = plt.subplot(111)
#ax = plt.subplot(2,2,1)
ax.spines['right'].set_color('none') # hide the right border line
ax.spines['top'].set_color('none') # hide the top border line
# Move the bottom border line, which amounts to moving the X axis
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data', 0))
# Move the left border line, which amounts to moving the Y axis
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))
The following code sets the axis range (lim) and the tick labels (ticks):
# Set the value range of the x and y axes
plt.xlim(x.min()*1.1, x.max()*1.1)
plt.ylim(-1.5, 4.0)
# Set the tick labels of the x and y axes
plt.xticks([2, 4, 6, 8, 10], [r'two', r'four', r'6', r'8', r'10'])
plt.yticks([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0],
    [r'-1.0', r'0.0', r'1.0', r'2.0', r'3.0', r'4.0'])
The following code sets the X and Y axis labels and the title:
# Set the title, x-axis label, and y-axis label
plt.title(r'$the \ function \ figure \ of \ cos(), \ sin() \ and \ sqrt()$', fontsize=19)
plt.xlabel(r'$the \ input \ value \ of \ x$', fontsize=18, labelpad=88.8)
plt.ylabel(r'$y = f(x)$', fontsize=18, labelpad=12.5)
3.5 Adding text and annotations
The following code adds descriptive text to the plot with text:
plt.text(0.8, 0.9, r'$x \in [0.0, \ 10.0]$', color='k', fontsize=15)
plt.text(0.8, 0.8, r'$y \in [-1.0, \ 4.0]$', color='k', fontsize=15)
The following code annotates a special point with annotate:
# Annotate a special point
plt.scatter([8,],[np.sqrt(8),], 50, color ='m') # use a scatter point to enlarge the current point
plt.annotate(r'$2\sqrt{2}$', xy=(8, np.sqrt(8)), xytext=(8.5, 2.2), fontsize=16, color='#090909', arrowprops=dict(arrowstyle='->', connectionstyle='arc3, rad=0.1', color='#090909'))
3.6 Setting the legend
There are two ways to add a legend to the plot:
- 1: Pass the label parameter to plt.plot, then call plt.legend(loc='upper right')
- 2: Skip the label parameter and pass the names directly:
plt.legend(['cos(x)', 'sin(x)', 'sqrt(x)'], loc='upper right')
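The two legend approaches can be compared side by side. This is a minimal sketch using the object-oriented API on two separate figures; the Agg backend line and the variable names are assumptions for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.0, 10.0, 0.2)

# Approach 1: label each curve when plotting, then call legend()
fig_a, ax_a = plt.subplots()
ax_a.plot(x, np.cos(x), label="cos(x)")
ax_a.plot(x, np.sin(x), label="sin(x)")
legend_a = ax_a.legend(loc="upper right")
labels_a = [t.get_text() for t in legend_a.get_texts()]

# Approach 2: plot without labels and pass the names to legend() in plotting order
fig_b, ax_b = plt.subplots()
ax_b.plot(x, np.cos(x))
ax_b.plot(x, np.sin(x))
legend_b = ax_b.legend(["cos(x)", "sin(x)"], loc="upper right")
labels_b = [t.get_text() for t in legend_b.get_texts()]
```

Approach 1 tends to be more robust, because the names stay attached to their curves even if the plotting order changes.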
3.7 Toggling grid lines
The following code turns on grid lines for the plot:
# Show grid lines
plt.grid(True)
3.8 Showing and saving the figure
plt.show() # show the figure
#savefig('../figures/plot3d_ex.png',dpi=48) # save; the target directory must already exist
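Since saving requires the target directory to exist, one option is to create it first. The sketch below writes to a temporary location; the directory and file names are hypothetical, and the Agg backend line is an assumption for running without a display. In a script it is usually safer to call savefig before show, because the figure may already be closed once show returns.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.0, 10.0, 0.2)
fig, ax = plt.subplots()
ax.plot(x, np.sqrt(x), label="sqrt(x)")
ax.grid(True)

# Hypothetical output location; create the directory if it is missing
out_dir = os.path.join(tempfile.gettempdir(), "figures_demo")
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "sqrt_curve.png")

fig.savefig(out_path, dpi=48)  # save before showing or closing the figure
plt.close(fig)
```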
4. The complete plotting program
#coding:utf-8
import numpy as np
import matplotlib.pyplot as plt
from pylab import *
# Define the data
x = np.arange(0., 10, 0.2)
y1 = np.cos(x)
y2 = np.sin(x)
y3 = np.sqrt(x)
# Draw the 3 function curves
plt.plot(x, y1, color='blue', linewidth=1.5, linestyle='-', marker='.', label=r'$y = \cos{x}$')
plt.plot(x, y2, color='green', linewidth=1.5, linestyle='-', marker='*', label=r'$y = \sin{x}$')
plt.plot(x, y3, color='m', linewidth=1.5, linestyle='-', marker='x', label=r'$y = \sqrt{x}$')
# Move the axes
ax = plt.subplot(111)
ax.spines['right'].set_color('none') # hide the right border line
ax.spines['top'].set_color('none') # hide the top border line
# Move the bottom border line, which amounts to moving the X axis
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data', 0))
# Move the left border line, which amounts to moving the Y axis
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))
# Set the value range of the x and y axes
plt.xlim(x.min()*1.1, x.max()*1.1)
plt.ylim(-1.5, 4.0)
# Set the tick values of the x and y axes
plt.xticks([2, 4, 6, 8, 10], [r'2', r'4', r'6', r'8', r'10'])
plt.yticks([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0],
    [r'-1.0', r'0.0', r'1.0', r'2.0', r'3.0', r'4.0'])
# Add text
plt.text(0.8, 0.8, r'$x \in [0.0, \ 10.0]$', color='k', fontsize=15)
plt.text(0.8, 0.9, r'$y \in [-1.0, \ 4.0]$', color='k', fontsize=15)
# Annotate a special point
plt.scatter([8,],[np.sqrt(8),], 50, color ='m') # use a scatter point to enlarge the current point
plt.annotate(r'$2\sqrt{2}$', xy=(8, np.sqrt(8)), xytext=(8.5, 2.2), fontsize=16, color='#090909', arrowprops=dict(arrowstyle='->', connectionstyle='arc3, rad=0.1', color='#090909'))
# Set the title, x-axis label, and y-axis label
plt.title(r'$the \ function \ figure \ of \ cos(), \ sin() \ and \ sqrt()$', fontsize=19)
plt.xlabel(r'$the \ input \ value \ of \ x$', fontsize=18, labelpad=88.8)
plt.ylabel(r'$y = f(x)$', fontsize=18, labelpad=12.5)
# Set the legend and its position
plt.legend(loc='upper right')
# plt.legend(['cos(x)', 'sin(x)', 'sqrt(x)'], loc='upper right')
# Show grid lines
plt.grid(True)
# Show the plot
plt.show()
5. Common plot types
- Line plot: matplotlib.pyplot.plot(data)
- Histogram: matplotlib.pyplot.hist(data)
- Scatter plot: matplotlib.pyplot.scatter(data)
- Box plot: matplotlib.pyplot.boxplot(data)
x = np.arange(-5,5,0.1)
y = x ** 2
plt.plot(x,y)
x = np.random.normal(size=1000)
plt.hist(x, bins=10)
plt.rcParams["figure.figsize"] = (8,8)
x = np.random.normal(size=1000)
y = np.random.normal(size=1000)
plt.scatter(x,y)
plt.boxplot(x)
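Box plot primer
- Upper fence (Q3 + 1.5 IQR), lower fence (Q1 - 1.5 IQR), where IQR = Q3 - Q1
- Upper quartile (Q3), lower quartile (Q1)
- Median
- Outliers
- How this differs from the standard approach to handling outliers: whether the statistical bounds are themselves affected by outliers, and how tolerant they are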
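The quantities in the list above can be computed directly with numpy. The sketch below uses a small made-up sample with one obvious outlier; the variable names are illustrative only.

```python
import numpy as np

# Made-up sample: nine ordinary values and one extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # values below this count as outliers
upper_fence = q3 + 1.5 * iqr     # values above this count as outliers

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(q1, median, q3, iqr, lower_fence, upper_fence, outliers)
```

Because the fences are built from quartiles, the extreme value 100 barely moves them, whereas a mean and standard deviation computed from the same sample would be pulled strongly toward it.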
6. Case study: analyzing and visualizing bike-rental data
step1. Load the data and do some simple preprocessing
import pandas as pd # read the data into a DataFrame
import urllib.request # fetch data over the network (Python 3; Python 2 used urllib.urlretrieve)
import tempfile # create temporary files and directories
import shutil # file operations
import zipfile # compress and extract archives
temp_dir = tempfile.mkdtemp() # create a temporary directory
data_source = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip' # URL of the data
zipname = temp_dir + '/Bike-Sharing-Dataset.zip' # join the directory and the file name
urllib.request.urlretrieve(data_source, zipname) # download the data
zip_ref = zipfile.ZipFile(zipname, 'r') # create a ZipFile object to handle the archive
zip_ref.extractall(temp_dir) # extract
zip_ref.close()
daily_path = 'data/day.csv'
daily_data = pd.read_csv(daily_path) # read the csv file
daily_data['dteday'] = pd.to_datetime(daily_data['dteday']) # convert the string column to dates
drop_list = ['instant', 'season', 'yr', 'mnth', 'holiday', 'workingday', 'weathersit', 'atemp', 'hum'] # columns we do not care about
daily_data.drop(drop_list, inplace = True, axis = 1) # inplace=True modifies the object directly
shutil.rmtree(temp_dir) # delete the temporary directory
daily_data.head() # take a look at the data
step2. Configure parameters
from __future__ import division, print_function # use 3.x-style division and print (only needed on Python 2)
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
# Show plots inside the notebook
%matplotlib inline
# Set some global rc parameters; adjust them to taste
import matplotlib
# Set the figure size to 14" x 7"
# rc: runtime configuration
matplotlib.rc('figure', figsize = (14, 7))
# Set the font size to 14
matplotlib.rc('font', size = 14)
# Hide the top and right spines
matplotlib.rc('axes.spines', top = False, right = False)
# Hide the grid
matplotlib.rc('axes', grid = False)
# Set the axes background color to white
matplotlib.rc('axes', facecolor = 'white')
step3. Association analysis
Scatter plot
- Analyze the relationship between variables
from matplotlib import font_manager
fontP = font_manager.FontProperties()
fontP.set_family('SimHei')
fontP.set_size(14)
# Wrap the scatter plot in a function for reuse
def scatterplot(x_data, y_data, x_label, y_label, title):
    # Create a plotting object
    fig, ax = plt.subplots()
    # Set the data, point size, point color, and transparency
    ax.scatter(x_data, y_data, s = 10, color = '#539caf', alpha = 0.75) # http://www.114la.com/other/rgb.htm
    # Add the title and axis labels
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
# Draw the scatter plot
scatterplot(x_data = daily_data['temp']
            , y_data = daily_data['cnt']
            , x_label = 'Normalized temperature (C)'
            , y_label = 'Check outs'
            , title = 'Number of Check Outs vs Temperature')
Line plot
- Fit the relationship between variables
# Linear regression
import statsmodels.api as sm # least squares
from statsmodels.stats.outliers_influence import summary_table # get summary information
x = sm.add_constant(daily_data['temp']) # add the constant term for the linear regression y = kx + b
y = daily_data['cnt']
regr = sm.OLS(y, x) # ordinary least squares model
res = regr.fit()
# Get the fitted data from the model
st, data, ss2 = summary_table(res, alpha=0.05) # significance level alpha = 0.05 (95% confidence); st: summary table, data: detailed values, ss2: column names
fitted_values = data[:,2]
# Wrap the line plot in a function
def lineplot(x_data, y_data, x_label, y_label, title):
    # Create a plotting object
    _, ax = plt.subplots()
    # Draw the fitted line; lw = linewidth, alpha = transparency
    ax.plot(x_data, y_data, lw = 2, color = '#539caf', alpha = 1)
    # Add the title and axis labels
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
# Call the plotting function
lineplot(x_data = daily_data['temp']
         , y_data = fitted_values
         , x_label = 'Normalized temperature (C)'
         , y_label = 'Check outs'
         , title = 'Line of Best Fit for Number of Check Outs vs Temperature')
x.head()
type(regr)
st
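To make the y = kx + b fit more concrete, here is a numpy-only sketch of ordinary least squares on synthetic data. It is not the statsmodels code path used in this case study; the data, the true slope and intercept, and all variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for a normalized temperature column and a rental count column
temp = rng.uniform(0.0, 1.0, size=200)
cnt = 1200.0 + 6600.0 * temp + rng.normal(0.0, 300.0, size=200)

# Design matrix [1, x]: the column of ones plays the role of sm.add_constant
X = np.column_stack([np.ones_like(temp), temp])

# Solve min ||X @ coef - cnt||^2 for coef = (b, k)
coef, *_ = np.linalg.lstsq(X, cnt, rcond=None)
b, k = coef
fitted = X @ coef          # fitted values on the regression line
residuals = cnt - fitted   # with an intercept, these average to zero
print(b, k)
```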
Line plot with a confidence interval
- Evaluate the quality of the fit
# Get the lower and upper bounds of the 95% confidence interval (alpha = 0.05)
predict_mean_ci_low, predict_mean_ci_upp = data[:,4:6].T
# Build a DataFrame holding the lower and upper confidence bounds
CI_df = pd.DataFrame(columns = ['x_data', 'low_CI', 'upper_CI'])
CI_df['x_data'] = daily_data['temp']
CI_df['low_CI'] = predict_mean_ci_low
CI_df['upper_CI'] = predict_mean_ci_upp
CI_df.sort_values('x_data', inplace = True) # sort by x_data
# Draw the confidence interval
def lineplotCI(x_data, y_data, sorted_x, low_CI, upper_CI, x_label, y_label, title):
    # Create a plotting object
    _, ax = plt.subplots()
    # Draw the fitted line
    ax.plot(x_data, y_data, lw = 1, color = '#539caf', alpha = 1, label = 'Fit')
    # Draw the confidence band; x must be sorted so the fill is drawn in order
    ax.fill_between(sorted_x, low_CI, upper_CI, color = '#539caf', alpha = 0.4, label = '95% CI')
    # Add the title and axis labels
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    # Show the legend, built from the label arguments; loc='best' picks the position automatically
    ax.legend(loc = 'best')
# Call the function to create plot
lineplotCI(x_data = daily_data['temp']
           , y_data = fitted_values
           , sorted_x = CI_df['x_data']
           , low_CI = CI_df['low_CI']
           , upper_CI = CI_df['upper_CI']
           , x_label = 'Normalized temperature (C)'
           , y_label = 'Check outs'
           , title = 'Line of Best Fit for Number of Check Outs vs Temperature')
Dual-axis line plot
- When the fit does not meet the confidence threshold, consider adding independent variables
- Analyze the relationship between variables on different scales
# Plotting function with two y axes
def lineplot2y(x_data, x_label, y1_data, y1_color, y1_label, y2_data, y2_color, y2_label, title):
    _, ax1 = plt.subplots()
    ax1.plot(x_data, y1_data, color = y1_color)
    # Add the title and axis labels
    ax1.set_ylabel(y1_label, color = y1_color)
    ax1.set_xlabel(x_label)
    ax1.set_title(title)
    ax2 = ax1.twinx() # the two axes objects share the x axis
    ax2.plot(x_data, y2_data, color = y2_color)
    ax2.set_ylabel(y2_label, color = y2_color)
    # Make the right spine visible
    ax2.spines['right'].set_visible(True)
# Call the plotting function
lineplot2y(x_data = daily_data['dteday']
           , x_label = 'Day'
           , y1_data = daily_data['cnt']
           , y1_color = '#539caf'
           , y1_label = 'Check outs'
           , y2_data = daily_data['windspeed']
           , y2_color = '#7663b0'
           , y2_label = 'Normalized windspeed'
           , title = 'Check Outs and Windspeed Over Time')
step4. Distribution analysis
Histogram
- Rough counts per interval
# Function for drawing a histogram
def histogram(data, x_label, y_label, title):
    _, ax = plt.subplots()
    res = ax.hist(data, color = '#539caf', bins=10) # set the number of bins
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    return res
# Call the plotting function
res = histogram(data = daily_data['registered']
                , x_label = 'Check outs'
                , y_label = 'Frequency'
                , title = 'Distribution of Registered Check Outs')
res[0] # value of bins
res[1] # boundary of bins
Overlaid histogram
- Compare two distributions
# Draw two overlaid histograms
def overlaid_histogram(data1, data1_name, data1_color, data2, data2_name, data2_color, x_label, y_label, title):
    # Use one common data range so that both histograms share the same bins
    max_nbins = 10
    data_range = [min(min(data1), min(data2)), max(max(data1), max(data2))]
    binwidth = (data_range[1] - data_range[0]) / max_nbins
    bins = np.arange(data_range[0], data_range[1] + binwidth, binwidth) # generate the histogram bin edges
    # Create the plot
    _, ax = plt.subplots()
    ax.hist(data1, bins = bins, color = data1_color, alpha = 1, label = data1_name)
    ax.hist(data2, bins = bins, color = data2_color, alpha = 0.75, label = data2_name)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'best')
# Call the function to create plot
overlaid_histogram(data1 = daily_data['registered']
                   , data1_name = 'Registered'
                   , data1_color = '#539caf'
                   , data2 = daily_data['casual']
                   , data2_name = 'Casual'
                   , data2_color = '#7663b0'
                   , x_label = 'Check outs'
                   , y_label = 'Frequency'
                   , title = 'Distribution of Check Outs By Type')
- registered: the distribution for registered users looks normal. Why?
- casual: the distribution for casual users looks roughly exponential. Why?
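The key step in comparing the two distributions is that both histograms share the same bin edges. A minimal numpy sketch of that idea follows, using synthetic data that only imitates a roughly normal and a roughly exponential sample; np.linspace is used here instead of np.arange as one way to get exactly 10 bins regardless of floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins: one roughly normal sample, one roughly exponential sample
registered = rng.normal(3500.0, 1500.0, size=500)
casual = rng.exponential(800.0, size=500)

# Common range covering both samples
lo = min(registered.min(), casual.min())
hi = max(registered.max(), casual.max())

n_bins = 10
edges = np.linspace(lo, hi, n_bins + 1)  # n_bins equal-width bins need n_bins + 1 edges

# Counting both samples against the same edges makes the bars directly comparable
counts_registered, _ = np.histogram(registered, bins=edges)
counts_casual, _ = np.histogram(casual, bins=edges)
print(counts_registered, counts_casual)
```

The same edges array can be passed as the bins argument of ax.hist for each data set.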
Density plot
- A finer description of the probability distribution
- KDE: kernel density estimate
# Compute the probability density
from scipy.stats import gaussian_kde
data = daily_data['registered']
density_est = gaussian_kde(data) # kernel density estimate: https://en.wikipedia.org/wiki/Kernel_density_estimation
# Controls the smoothness: the larger the value, the smoother the curve
density_est.covariance_factor = lambda : .3
density_est._compute_covariance()
x_data = np.arange(min(data), max(data), 200)
# Draw the density estimate curve
def densityplot(x_data, density_est, x_label, y_label, title):
    _, ax = plt.subplots()
    ax.plot(x_data, density_est(x_data), color = '#539caf', lw = 2)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
# Call the plotting function
densityplot(x_data = x_data
            , density_est = density_est
            , x_label = 'Check outs'
            , y_label = 'Frequency'
            , title = 'Distribution of Registered Check Outs')
type(density_est)
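To show what a kernel density estimate does and why a larger bandwidth gives a smoother curve, here is a hand-written Gaussian KDE in numpy. It is a simplified sketch, not scipy's implementation: the function name is made up, and the bandwidth here is an absolute width, whereas scipy's covariance factor is scaled by the spread of the data.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at every point of grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # One Gaussian bump per sample, evaluated on the grid
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Average the bumps and rescale so that the estimate integrates to 1
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, size=300)   # synthetic data
grid = np.linspace(-6.0, 6.0, 601)
step = grid[1] - grid[0]

narrow = gaussian_kde_1d(samples, grid, bandwidth=0.1)  # wiggly estimate
wide = gaussian_kde_1d(samples, grid, bandwidth=0.8)    # smooth estimate

area_narrow = narrow.sum() * step
area_wide = wide.sum() * step
print(area_narrow, area_wide)
```

Both curves integrate to roughly 1, but the narrow bandwidth follows individual samples closely while the wide one flattens the peaks.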
step5. Between-group analysis
- Quantitative comparison between groups
- Grouping granularity
- Clustering between groups
Bar chart
- Compare means and variances across first-level categories
# Summary statistics by day of week
mean_total_co_day = daily_data[['weekday', 'cnt']].groupby('weekday').agg(['mean', 'std'])
mean_total_co_day.columns = mean_total_co_day.columns.droplevel()
# Define the bar chart function
def barplot(x_data, y_data, error_data, x_label, y_label, title):
    _, ax = plt.subplots()
    # Bars
    ax.bar(x_data, y_data, color = '#539caf', align = 'center')
    # Draw the standard deviation as error bars
    # ls='none' removes the line connecting the bars
    ax.errorbar(x_data, y_data, yerr = error_data, color = '#297083', ls = 'none', lw = 5)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
# Call the plotting function
barplot(x_data = mean_total_co_day.index.values
        , y_data = mean_total_co_day['mean']
        , error_data = mean_total_co_day['std']
        , x_label = 'Day of week'
        , y_label = 'Check outs'
        , title = 'Total Check Outs By Day of Week (0 = Sunday)')
mean_total_co_day.columns
daily_data[['weekday', 'cnt']].groupby('weekday').agg(['mean', 'std'])
Stacked bar chart
- Compare relative proportions across multi-level categories
mean_by_reg_co_day = daily_data[['weekday', 'registered', 'casual']].groupby('weekday').mean()
mean_by_reg_co_day
# Registered and casual usage by day of week
mean_by_reg_co_day = daily_data[['weekday', 'registered', 'casual']].groupby('weekday').mean()
# Share of registered and casual usage by day of week
mean_by_reg_co_day['total'] = mean_by_reg_co_day['registered'] + mean_by_reg_co_day['casual']
mean_by_reg_co_day['reg_prop'] = mean_by_reg_co_day['registered'] / mean_by_reg_co_day['total']
mean_by_reg_co_day['casual_prop'] = mean_by_reg_co_day['casual'] / mean_by_reg_co_day['total']
# Draw the stacked bar chart
def stackedbarplot(x_data, y_data_list, y_data_names, colors, x_label, y_label, title):
    _, ax = plt.subplots()
    # Draw the stacked bars in a loop
    for i in range(0, len(y_data_list)):
        if i == 0:
            ax.bar(x_data, y_data_list[i], color = colors[i], align = 'center', label = y_data_names[i])
        else:
            # Stack the bars: every category after the first starts where the previous one ends
            # The normalization ensures that the final cumulative total is 1
            ax.bar(x_data, y_data_list[i], color = colors[i], bottom = y_data_list[i - 1], align = 'center', label = y_data_names[i])
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'upper right') # set the legend position
# Call the plotting function
stackedbarplot(x_data = mean_by_reg_co_day.index.values
               , y_data_list = [mean_by_reg_co_day['reg_prop'], mean_by_reg_co_day['casual_prop']]
               , y_data_names = ['Registered', 'Casual']
               , colors = ['#539caf', '#7663b0']
               , x_label = 'Day of week'
               , y_label = 'Proportion of check outs'
               , title = 'Check Outs By Registration Status and Day of Week (0 = Sunday)')
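With two series, using the previous series as the bottom works. With three or more, each bar has to start at the running total of all earlier series. The sketch below computes those bottoms with a cumulative sum; the proportions are made up, and this is one possible generalization rather than the code used in this case study.

```python
import numpy as np

# Made-up proportions: 3 categories (rows) over 4 groups (columns); each column sums to 1
shares = np.array([
    [0.5, 0.6, 0.4, 0.7],
    [0.3, 0.2, 0.4, 0.2],
    [0.2, 0.2, 0.2, 0.1],
])

# The bottom of series i is the sum of series 0 .. i-1, so the first bottom is all zeros
running_total = np.cumsum(shares, axis=0)
bottoms = np.vstack([np.zeros(shares.shape[1]), running_total[:-1]])
tops = bottoms + shares

# In a plotting loop this would be used as: ax.bar(x, shares[i], bottom=bottoms[i])
print(bottoms)
print(tops[-1])
```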
Grouped bar chart
- Compare absolute values across multi-level categories
# Function for drawing a grouped bar chart
def groupedbarplot(x_data, y_data_list, y_data_names, colors, x_label, y_label, title):
    _, ax = plt.subplots()
    # Total width of each group of bars
    total_width = 0.8
    # Width of each individual bar
    ind_width = total_width / len(y_data_list)
    # Center offset of each bar within its group
    alteration = np.arange(-total_width/2+ind_width/2, total_width/2+ind_width/2, ind_width)
    # Draw each series in turn
    for i in range(0, len(y_data_list)):
        # Spread the bars out horizontally
        ax.bar(x_data + alteration[i], y_data_list[i], color = colors[i], label = y_data_names[i], width = ind_width)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'upper right')
# Call the plotting function
groupedbarplot(x_data = mean_by_reg_co_day.index.values
               , y_data_list = [mean_by_reg_co_day['registered'], mean_by_reg_co_day['casual']]
               , y_data_names = ['Registered', 'Casual']
               , colors = ['#539caf', '#7663b0']
               , x_label = 'Day of week'
               , y_label = 'Check outs'
               , title = 'Check Outs By Registration Status and Day of Week (0 = Sunday)')
- Before the shift: ind_width/2
- After the shift: total_width/2
- Shift amount: total_width/2-ind_width/2
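The offset arithmetic in the list above can be checked numerically. The helper below is a hypothetical rewrite that computes the bar centers from integer positions, which avoids relying on np.arange with a floating-point end value; it is a sketch for illustration, not a drop-in from the original code.

```python
import numpy as np

def bar_offsets(n_series, total_width=0.8):
    """Return the center offset of each bar in a group and the width of one bar."""
    ind_width = total_width / n_series
    # Positions 0..n-1, recentered around zero, then scaled by the bar width
    offsets = (np.arange(n_series) - (n_series - 1) / 2.0) * ind_width
    return offsets, ind_width

offsets_2, width_2 = bar_offsets(2)   # two series per group
offsets_3, width_3 = bar_offsets(3)   # three series per group
print(offsets_2, width_2)
print(offsets_3, width_3)
```

For two series the first bar is shifted left by total_width/2 - ind_width/2 = 0.2, which matches the shift amount given in the list.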
Box plot
- Compare data distributions across multi-level categories
- Bar chart + overlaid histogram
# Only the grouping key needs to be specified to draw the box plots
days = np.unique(daily_data['weekday'])
bp_data = []
for day in days:
    bp_data.append(daily_data[daily_data['weekday'] == day]['cnt'].values)
# Define the plotting function
def boxplot(x_data, y_data, base_color, median_color, x_label, y_label, title):
    _, ax = plt.subplots()
    # Set the style
    ax.boxplot(y_data
               # whether to fill the boxes with color
               , patch_artist = True
               # color of the median line
               , medianprops = {'color': base_color}
               # box colors; color: edge color, facecolor: fill color
               , boxprops = {'color': base_color, 'facecolor': median_color}
               # whisker color
               , whiskerprops = {'color': median_color}
               # whisker cap color
               , capprops = {'color': base_color})
    # Keep the box labels consistent with x_data
    ax.set_xticklabels(x_data)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
# Call the plotting function
boxplot(x_data = days
        , y_data = bp_data
        , base_color = 'b'
        , median_color = 'r'
        , x_label = 'Day of week'
        , y_label = 'Check outs'
        , title = 'Total Check Outs By Day of Week (0 = Sunday)')
7. Brief summary
- Association analysis and numeric comparison: scatter plot, line plot
- Distribution analysis: histogram, density plot
- Analysis involving categories: bar chart, box plot
8. Case study: analysis of the 2014 World Cup final
step1. Preprocessing
Prepare the data and import the required packages.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from footyscripts.footyviz import draw_events, draw_pitch, type_names
#plotting settings
%matplotlib inline
plt.style.use('ggplot') # stands in for pd.options.display.mpl_style = 'default', which newer pandas versions no longer provide
df = pd.read_csv("../datasets/germany-vs-argentina-731830.csv", encoding='utf-8', index_col=0)
df.head()
df.index = range(1,len(df) + 1)
df.head()
#standard dimensions
x_size = 105.0
y_size = 68.0
box_height = 16.5*2 + 7.32
box_width = 16.5
y_box_start = y_size/2-box_height/2
y_box_end = y_size/2+box_height/2
#scale of dataset is 100 by 100. Normalizing for a standard soccer pitch size
df['x']=df['x']/100*x_size
df['y']=df['y']/100*y_size
df['to_x']=df['to_x']/100*x_size
df['to_y']=df['to_y']/100*y_size
#creating some measures and classifiers from the original
df['count'] = 1
df['dx'] = df['to_x'] - df['x']
df['dy'] = df['to_y'] - df['y']
df['distance'] = np.sqrt(df['dx']**2+df['dy']**2)
df['fivemin'] = np.floor(df['min']/5)*5
df['type_name'] = df['type'].map(type_names.get)
df['to_box'] = (df['to_x'] > x_size - box_width) & (y_box_start < df['to_y']) & (df['to_y'] < y_box_end)
df['from_box'] = (df['x'] > x_size - box_width) & (y_box_start < df['y']) & (df['y'] < y_box_end)
df['on_offense'] = df['x']>x_size/2
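The rescaling and the penalty-box test can be tried on a few hand-picked points. This sketch reuses the pitch constants from the code above and applies them to three made-up pass targets given on the 100 x 100 scale; the arrays are invented for illustration.

```python
import numpy as np

# Pitch and penalty-box dimensions in meters, as defined above
x_size = 105.0
y_size = 68.0
box_height = 16.5 * 2 + 7.32
box_width = 16.5
y_box_start = y_size / 2 - box_height / 2
y_box_end = y_size / 2 + box_height / 2

# Made-up pass targets on the dataset's 100 x 100 scale:
# center circle, in front of goal, near the corner flag
to_x_pct = np.array([50.0, 90.0, 95.0])
to_y_pct = np.array([50.0, 50.0, 5.0])

# Rescale to meters
to_x = to_x_pct / 100 * x_size
to_y = to_y_pct / 100 * y_size

# A target is in the box if it is deep enough and between the box's side lines
to_box = (to_x > x_size - box_width) & (y_box_start < to_y) & (to_y < y_box_end)
print(to_x, to_y, to_box)
```

Only the second point lands in the box: the first is not deep enough, and the third is deep enough but outside the box's side lines.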
Add the team and player names to make the later statistics and evaluation easier.
df['team_name'] = np.where(df['team']==357, 'Germany', 'Argentina')
player_dic = {15207:"Philipp Lahm",44989:"Toni Kroos",15208:"Bastian Schweinsteiger",40691:"Jerome Boateng",37605:"Mesut Özil",32644:"Javier Mascherano",66842:"André Schürrle",41316:"Benedikt Höwedes",38392:"Mats Hummels",55634:"Thomas Müller",39462:"Lucas Biglia",28525:"Ezequiel Garay",15312:"Martín Demichelis",20658:"Pablo Zabaleta",19054:"Lionel Messi",58893:"Marcos Rojo",20388:"Manuel Neuer",55661:"Enzo Pérez",42899:"Sergio Agüero",37572:"Sergio Romero",5155:"Miroslav Klose",69600:"Fernando Gago",19975:"Mario Götze",40232:"Gonzalo Higuaín",45154:"Ezequiel Lavezzi",20153:"Rodrigo Palacio",100927:"Christoph Kramer",17127:"Per Mertesacker"}
def get_player_name(player_id):
    return player_dic[player_id]
df['player_name'] = df['player_id'].apply(get_player_name)
#preslicing of the main DataFrame in smaller DFs that will be reused along the notebook
dfPeriod1 = df[df['period']==1]
dfP1Shots = dfPeriod1[dfPeriod1['type'].isin([13, 14, 15, 16])]
dfPeriod2 = df[df['period']==2]
dfP2Shots = dfPeriod2[dfPeriod2['type'].isin([13, 14, 15, 16])]
dfExtraTime = df[df['period']>2]
dfETShots = dfExtraTime[dfExtraTime['type'].isin([13, 14, 15, 16])]
step2. First half
Let's go quickly through the first half. The chart below shows the balance of attack and defense (the part above 0 represents Germany attacking, the part below 0 represents Germany defending), with the shots marked on it.
fig = plt.figure(figsize=(12,4))
avg_x = (dfPeriod1[dfPeriod1['team_name']=='Germany'].groupby('min')['x'].mean() -
         dfPeriod1[dfPeriod1['team_name']=='Argentina'].groupby('min')['x'].mean())
plt.stackplot(list(avg_x.index.values), list([x if x>0 else 0 for x in avg_x]))
plt.stackplot(list(avg_x.index.values), list([x if x<0 else 0 for x in avg_x]))
for i, shot in dfP1Shots.iterrows():
    x = shot['min']
    y = avg_x.loc[shot['min']] # .loc replaces the removed .ix indexer
    signal = 1 if shot['team_name']=='Germany' else -1
    plt.annotate(shot['type_name']+' ('+shot['team_name'][0]+")", xy=(x, y), xytext=(x-5,y+30*signal), arrowprops=dict(facecolor='black'))
plt.gca().set_xlabel('minute')
plt.title("First Half Profile")
What is interesting about the first half is that Germany largely dominated the game, so Argentina spent most of the time passing in its own half. A visualization makes this clearer: let's look at Argentina's passing paths in the first half.
draw_pitch()
draw_events(dfPeriod1[(dfPeriod1['type']==1) & (dfPeriod1['outcome']==1) & (dfPeriod1['team_name']=='Argentina')], mirror_away=True)
plt.text(x_size/4, -3, "Germany's defense", color='black', bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')
plt.text(x_size*3/4, -3, "Argentina's defense", color='black', bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')
plt.title("Argentina's passes during the first half")
dfPeriod1.groupby('team_name').agg({'x': np.mean, 'on_offense': np.mean})
dfPeriod1[dfPeriod1.type==1].groupby('team_name').agg({'outcome': np.mean})
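The analysis above also shows that only about 28% of Argentina's passes were made in the attacking phase, against 61% for Germany. Even in the attacking phase, Germany kept a higher passing accuracy.
Looking at box entries and shots, however, Germany did not have it that easy. In fact, as the chart below shows, few of Germany's many attempts to get into the box and shoot were effective.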
draw_pitch()
draw_events(df[(df['to_box']==True) & (df['type']==1) & (df['from_box']==False) & (df['period']==1) & (df['outcome']==1)], mirror_away=True)
draw_events(df[(df['to_box']==True) & (df['type']==1) & (df['from_box']==False) & (df['period']==1) & (df['outcome']==0)], mirror_away=True, alpha=0.2)
draw_events(dfP1Shots, mirror_away=True, base_color='#a93e3e')
plt.text(x_size/4, -3, "Germany's defense", color='black', bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')
plt.text(x_size*3/4, -3, "Argentina's defense", color='black', bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')
dfPeriod1[(dfPeriod1['to_box']==True) & (dfPeriod1['from_box']==False) & (dfPeriod1['type']==1)].groupby(['team_name']).agg({'outcome': np.mean, 'count': np.sum})
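step3. Analysis of Kramer
Around the 19th minute Kramer was injured, but the substitute only came on 12 minutes later. That stretch turned out to be Germany's worst spell of the first half, which also shows in our earlier chart.
Reports say that he acted confused. The data indicates that between his injury and the substitution Kramer was essentially "non-functional": about all he did was receive the ball once, make one pass, and lose the ball once.
dfKramer = df[df['player_name']=='Christoph Kramer'].copy() # copy so that new columns can be added without a SettingWithCopyWarning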
pd.pivot_table(dfKramer, values='count', index='type_name', columns='min', aggfunc=sum, fill_value=0)
dfKramer['action']=dfKramer['outcome'].map(str) + '-' + dfKramer['type_name']
dfKramer['action'].unique()
score = {'1-LINEUP': 0, '1-RUN WITH BALL': 0.5, '1-RECEPTION': 0, '1-PASS': 1, '0-PASS': -1,
'0-TACKLE (NO CONTROL)': 0, '1-CLEAR BALL (OUT OF PITCH)': 0.5,
'0-LOST CONTROL OF BALL': -1, '1-SUBSTITUTION (OFF)': 0}
dfKramer['score'] = dfKramer['action'].map(score.get)
dfKramer.groupby('min')['score'].sum().reindex(range(32), fill_value=0).plot(kind='bar')
plt.annotate('Injury', (19,0.5), (14,1.1), arrowprops=dict(facecolor='black'))
plt.annotate('Substitution', (31,0), (22,1.6), arrowprops=dict(facecolor='black'))
plt.gca().set_xlabel('minute')
plt.gca().set_ylabel('no. events')
step4. Second half
By comparison, the second half was far more evenly matched. Plotting it the same way as the first half shows that possession was indeed comparable for both sides.
fig = plt.figure(figsize=(12,4))
avg_x = (dfPeriod2[dfPeriod2['team_name']=='Germany'].groupby('min')['x'].mean() -
         dfPeriod2[dfPeriod2['team_name']=='Argentina'].groupby('min')['x'].mean())
plt.stackplot(list(avg_x.index.values), list([x if x>0 else 0 for x in avg_x]))
plt.stackplot(list(avg_x.index.values), list([x if x<0 else 0 for x in avg_x]))
for i, shot in dfP2Shots.iterrows():
    x = shot['min']
    y = avg_x.loc[shot['min']] # .loc replaces the removed .ix indexer
    signal = 1 if shot['team_name']=='Germany' else -1
    plt.annotate(shot['type_name']+' ('+shot['team_name'][0]+")", xy=(x, y), xytext=(x-5,y+30*signal), arrowprops=dict(facecolor='black'))
plt.gca().set_xlabel('minute')
plt.title("Second Half Profile")
dfPeriod2.groupby('team_name').agg({'x': np.mean, 'on_offense': np.mean})
dfPeriod2[dfPeriod2['type']==1].groupby('team_name').agg({'outcome': np.mean})
draw_pitch()
draw_events(df[(df['to_box']==True) & (df['type']==1) & (df['from_box']==False) & (df['period']==2) & (df['outcome']==1)], mirror_away=True)
draw_events(df[(df['to_box']==True) & (df['type']==1) & (df['from_box']==False) & (df['period']==2) & (df['outcome']==0)], mirror_away=True, alpha=0.2)
draw_events(dfP2Shots, mirror_away=True, base_color='#a93e3e')
plt.text(x_size/4, -3, "Germany's defense", color='black', bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')
plt.text(x_size*3/4, -3, "Argentina's defense", color='black', bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')
dfPeriod2[(dfPeriod2['to_box']==True) & (dfPeriod2['from_box']==False) & (dfPeriod2['type']==1)].groupby(['team_name']).agg({'outcome': np.mean, 'count': np.sum})
step5. Extra time
fig = plt.figure(figsize=(12,4))
avg_x = (dfExtraTime[dfExtraTime['team_name']=='Germany'].groupby('min')['x'].mean() -
         dfExtraTime[dfExtraTime['team_name']=='Argentina'].groupby('min')['x'].mean().reindex(dfExtraTime['min'].unique(), fill_value=0))
plt.stackplot(list(avg_x.index.values), list([x if x>0 else 0 for x in avg_x]))
plt.stackplot(list(avg_x.index.values), list([x if x<0 else 0 for x in avg_x]))
for i, shot in dfETShots.iterrows():
    x = shot['min']
    y = avg_x.loc[shot['min']] # .loc replaces the removed .ix indexer
    signal = 1 if shot['team_name']=='Germany' else -1
    plt.annotate(shot['type_name']+' ('+shot['team_name'][0]+")", xy=(x, y), xytext=(x-5,y+20*signal), arrowprops=dict(facecolor='black'))
plt.gca().set_xlabel('minute')
plt.title("Extra Time Profile")
df.groupby(['team_name', 'period']).agg({'count': np.sum, 'x': np.mean, 'on_offense': np.mean})
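We find that Germany's fourth period differs markedly from the others: Germany clearly made fewer passes, trying to control the game and slow the pace down (perhaps with a hint of running down the clock?). The data after Germany's last shot shows this even more clearly.
goal_ix = df[df['type']==16].index[0]
df_after_shot = df.loc[goal_ix+1:] # .loc replaces the removed .ix indexer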
df_after_shot.groupby(['team_name', 'period']).agg({'count': np.sum, 'x': np.mean, 'on_offense': np.mean})
draw_pitch()
draw_events(df_after_shot[(df_after_shot['to_box']==True) & (df_after_shot['type']==1) & (df_after_shot['from_box']==False) & (df_after_shot['outcome']==1)], mirror_away=True)
draw_events(df_after_shot[(df_after_shot['to_box']==True) & (df_after_shot['type']==1) & (df_after_shot['from_box']==False) & (df_after_shot['outcome']==0)], mirror_away=True, alpha=0.2)
draw_events(df_after_shot[df_after_shot['type'].isin([13,14,15,16])], mirror_away=True, base_color='#a93e3e')
plt.text(x_size/4, -3, "Germany's defense", color='black', bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')
plt.text(x_size*3/4, -3, "Argentina's defense", color='black', bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')
df_after_shot[df_after_shot['type'].isin([13,14,15,16])][['min', 'player_name', 'team_name', 'type_name']]
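Germany essentially stopped trying to shoot, with only one attempt to pass the ball into the box. But their defensive strategy was so successful that Argentina could hardly get into their box. Both shots came from outside the box, and both from Messi's boot, though by then Messi too may have been in despair.
step6. The shot
goal = int(df[df['type']==16].index[0])
dfGoal = df.loc[goal-30:goal].copy() # .loc replaces the removed .ix indexer; copy so that new columns can be added safely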
#goal = np.where(df.type==16)[0][0]
#dfGoal = df.iloc[goal-30:goal+1]
draw_pitch()
draw_events(dfGoal[dfGoal.team_name=='Germany'], base_color='white')
draw_events(dfGoal[dfGoal.team_name=='Argentina'], base_color='cyan')
#Germany's players involved in the play
dfGoal['progression']=dfGoal['to_x']-dfGoal['x']
dfGoal[dfGoal['type'].isin([1, 101, 16])][['player_name', 'type_name', 'progression']]
step7. Some basic statistics
#passing accuracy
df.groupby(['player_name', 'team_name']).agg({'count': np.sum, 'outcome': np.mean}).sort_values('count', ascending=False)
#shots
pd.pivot_table(df[df['type'].isin([13,14,15,16])],
values='count',
aggfunc=sum,
index=['player_name', 'team_name'],
columns='type_name',
fill_value=0,
margins=True).sort_values('All', ascending=False)
#defensive play
pd.pivot_table(df[df['type'].isin([7, 8, 49])],
values='count',
aggfunc=np.sum,
index=['player_name', 'team_name'],
columns='type_name',
fill_value=0,
margins=True).sort_values('All', ascending=False)