討論的問題:綜合考慮勝率與薪資赃承,OKA球隊(duì)相比其他的球隊(duì)是否有競爭性優(yōu)勢妙黍。
數(shù)據(jù)來源:http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip
1、從網(wǎng)絡(luò)上下載需要的CSV文檔瞧剖,這里采用request,stringIO,zipfile進(jìn)行數(shù)據(jù)提仁眉蕖:
def getZIP(zipFileName):
#以字節(jié)的方式請(qǐng)求
r = requests.get(zipFileName).content
#創(chuàng)建內(nèi)存文件
s = StringIO.StringIO(r)
zf = zipfile.ZipFile(s,'r')
return zf
url = 'http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip'
zf = getZIP(url)
數(shù)據(jù)展示如下:
['SchoolsPlayers.csv', 'SeriesPost.csv', 'Teams.csv', 'TeamsFranchises.csv', 'TeamsHalf.csv', 'AllstarFull.csv', 'Appearances.csv', 'AwardsManagers.csv', 'AwardsPlayers.csv', 'AwardsShareManagers.csv', 'AwardsSharePlayers.csv', 'Batting.csv', 'BattingPost.csv', 'Fielding.csv', 'FieldingOF.csv', 'FieldingPost.csv', 'HallOfFame.csv', 'Managers.csv', 'ManagersHalf.csv', 'Master.csv', 'Pitching.csv', 'PitchingPost.csv', 'readme2013.txt', 'Salaries.csv', 'Schools.csv']
這里把需要的salaries和teams這兩個(gè)CSV文件讀取出來:
salaries = pd.read_csv(zf.open(tablenames[tablenames.index('Salaries.csv')]))
print salaries.head()
teams = pd.read_csv(zf.open(tablenames[tablenames.index('Teams.csv')]))
#這里只需要這幾列
teams = teams[['yearID', 'teamID', 'W']]
print teams.head()
yearID teamID lgID playerID salary
0 1985 BAL AL murraed02 1472819
1 1985 BAL AL lynnfr01 1090000
2 1985 BAL AL ripkeca01 800000
3 1985 BAL AL lacyle01 725000
4 1985 BAL AL flanami01 641667
yearID teamID W
0 1871 PH1 21
1 1871 CH1 19
2 1871 BS1 20
3 1871 WS3 15
4 1871 NY2 16
接下來計(jì)算各個(gè)隊(duì)每年的總工資,并把兩個(gè)列表合并起來抓于,W代表勝場:
#一般情況下做粤,聚合數(shù)據(jù)都需要唯一的分組鍵組成的索引,但也可以通過向groupby傳入as_index=False以禁用該功能
totleSalaries = salaries.groupby(['yearID','teamID'],as_index=False).sum()
print totleSalaries.head()
#how="inner"指當(dāng)左右兩個(gè)對(duì)象存在不重合的鍵時(shí)捉撮,inner 代表交集怕品;outer 代表并集;on指的是用于連接的列索引名稱,如果沒有指定且其他參數(shù)也未指定則以兩個(gè)DataFrame的列名交集做為連接鍵
joined = pd.merge(totleSalaries, teams, how="inner", on=['yearID', 'teamID'])
print joined.head()
yearID teamID salary
0 1985 ATL 14807000
1 1985 BAL 11560712
2 1985 BOS 10897560
3 1985 CAL 14427894
4 1985 CHA 9846178
yearID teamID salary W
0 1985 ATL 14807000 66
1 1985 BAL 11560712 83
2 1985 BOS 10897560 81
3 1985 CAL 14427894 90
4 1985 CHA 9846178 85
接下來畫出各個(gè)球隊(duì)每年總的薪水和獲勝次數(shù)的關(guān)系圖,并標(biāo)記處OKA這只球隊(duì):
teamName ='OAK'
years = np.arange(2000,2004)
for year in years:
df = joined[joined['yearID'] == year]
print df
#畫出薪資和勝場的散點(diǎn)圖
plt.scatter(df['salary'] / 1e6,df['W'])
plt.title(str(year)+'年'+'勝場與薪資')
plt.xlabel('總薪水(百萬)')
plt.ylabel('勝場')
plt.xlim(0, 180)
plt.ylim(30, 130)
plt.grid()
#標(biāo)記出OKA球隊(duì)
plt.annotate(teamName,
xy=(df['salary'][df['teamID'] == teamName] / 1e6, df['W'][df['teamID'] == teamName]),
xytext=(-20, 20), textcoords='offset points', ha='right', va='bottom',
bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
arrowprops=dict(arrowstyle='->', facecolor='black', connectionstyle='arc3,rad=0'))
plt.show()
可以看出OKA的在2000年到2004年間付出總薪水較少的情況下獲得了比較好的勝利場數(shù)巾遭,接下來用回歸分析證明這一點(diǎn)肉康,并看看更長時(shí)間內(nèi)的數(shù)據(jù)怎么樣,算出各支球隊(duì)的殘差灼舍,就能知道是否如上述推論:
teamName = 'OAK'
years = np.arange(1999, 2014)
def Residual(year):
residData = pd.DataFrame()
df = joined[joined['yearID'] == year]
#原始數(shù)據(jù)橫坐標(biāo)
x_list = df['salary'].values / 1e6
#縱坐標(biāo)
y_list = df['W'].values
#最小二乘估計(jì)
A = np.array([x_list, np.ones(len(x_list))])#構(gòu)造系數(shù)矩陣
y = y_list
w = np.linalg.lstsq(A.T,y)[0] #求出斜率以及縱截距吼和,w[0]斜率w[1]縱截距
yhat = (w[0]*x_list+w[1]) # 回歸線
residData['teamID'] = df['teamID']
residData[year] = y - yhat
residData.index = residData['teamID']
residData = residData.drop(residData.columns[0], axis=1)
#print residData
return residData
#將dataframe放入數(shù)組
Residuals = [Residual(year) for year in years]
#按照隊(duì)名合并
Residual_df = reduce(lambda left,right:pd.merge(left,right,how='outer',left_index=True, right_index=True),Residuals)
print Residual_df
Residual_df = Residual_df.T
Residual_df.plot(title = '各支球隊(duì)的殘差圖', figsize = (15, 8),
color=map(lambda x: 'blue' if x==teamName else 'gray',Residual_df.columns))
plt.xlabel('年')
plt.ylabel('殘差')
plt.show()
這里主要在于如何將多個(gè)將dataframe拆分成多個(gè)小的dataframe并重新按照不重合的主鍵名合并。
如圖可以看出片仿,在2000年到2003年間纹安,OKA球隊(duì)偏移回歸線較遠(yuǎn),且殘差為正,說明其能在付出較少薪水的情況下獲得較好的成績厢岂,特別是在2002與2003年光督,偏移最遠(yuǎn),此時(shí)球隊(duì)的性價(jià)比在聯(lián)盟中應(yīng)該是最高的塔粒。但在2004年后结借,殘差往負(fù)的方向走,并持續(xù)多年卒茬,說明此時(shí)球隊(duì)成績不太好船老,但在2010年后有復(fù)蘇的趨勢。