Overview
The data in this post comes from the Kaggle competition House Prices: Advanced Regression Techniques.
While working through it, I read many excellent notebooks on the subject and benefited greatly from them.
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
from scipy.stats import skew
from scipy.stats import norm
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# import warnings
# warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina' #set 'png' here when working on notebook
%matplotlib inline
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")
A First Look at the Data
After loading the data, let's get a general sense of it. There are 1460 training rows and 1459 test rows. Excluding Id and the target SalePrice, there are 79 feature columns; once MSSubClass is treated as categorical, 35 are numeric and 44 are categorical.
Reading the data description, we find that MSSubClass, OverallQual, and OverallCond are coded categories, so they could be converted to categorical type. On closer inspection, though, OverallQual and OverallCond are ordered quality ratings with no missing values, so they can safely be kept as ints.
all_df = pd.concat((train_df.loc[:,'MSSubClass':'SaleCondition'], test_df.loc[:,'MSSubClass':'SaleCondition']), axis=0,ignore_index=True)
all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)
quantitative = [f for f in all_df.columns if all_df.dtypes[f] != 'object']
qualitative = [f for f in all_df.columns if all_df.dtypes[f] == 'object']
print("quantitative: {}, qualitative: {}" .format (len(quantitative),len(qualitative)))
quantitative: 35, qualitative: 44
Handling Missing Data
Strategies for missing values (see the sketch after this list):
- If a column is missing in most of its rows, drop the column.
- If only a few values are missing, fill them with the mean.
- In between, for categorical columns, "missing" can be treated as its own category and one-hot encoded.
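A minimal sketch of these three strategies in pandas. The column choices (PoolQC, LotFrontage, GarageType) are plausible examples from this dataset, not what we actually do below:
# Mostly missing: drop the column entirely
df = all_df.copy()
df = df.drop(columns=['PoolQC'])
# Few missing: fill with the column mean
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())
# In between (categorical): let NaN become its own one-hot category
df = pd.get_dummies(df, columns=['GarageType'], dummy_na=True)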
missing = all_df.isnull().sum()
missing.sort_values(inplace=True,ascending=False)
missing = missing[missing > 0]
types = all_df[missing.index].dtypes
percent = (all_df[missing.index].isnull().sum()/all_df[missing.index].isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([missing, percent,types], axis=1, keys=['Total', 'Percent','Types'])
missing_data.sort_values('Total',ascending=False,inplace=True)
missing_data
missing.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x112096c88>
Six of the columns above have a missing rate over 15%; the rest mostly belong to the Bsmt* and Garage* families. Before deciding how to handle these columns, let's look at some properties of the target we want to predict, SalePrice.
Statistical Analysis
Univariate Analysis
First, some summary statistics of the price we want to predict:
train_df.describe()['SalePrice']
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
#skewness and kurtosis
print("Skewness: %f" % train_df['SalePrice'].skew())
print("Kurtosis: %f" % train_df['SalePrice'].kurt())
# In statistics, kurtosis measures the tailedness of a random variable's distribution.
# High kurtosis means the variance is driven by infrequent, extreme deviations from the mean.
Skewness: 1.882876
Kurtosis: 6.536282
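For calibration (my addition): pandas' skew() and kurt() report 0 for a normal distribution (kurt() is excess kurtosis), so SalePrice is clearly right-skewed and heavy-tailed:
# Sanity check on a large normal sample: both statistics should be near 0
rng = np.random.RandomState(0)
normal_sample = pd.Series(rng.normal(size=100000))
print(normal_sample.skew(), normal_sample.kurt())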
Correlations
We start by computing correlations between variables to get a rough view of which columns are most related to SalePrice.
corrmat = train_df.corr()
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train_df[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
## Columns that are both among the most correlated and have missing data
missing_data.index.intersection(cols)
Index(['GarageCars', 'GarageArea', 'TotalBsmtSF'], dtype='object')
missing_data.loc[missing_data.index.intersection(cols)]
From the correlation heatmap above, none of the missing-heavy columns are indispensable, so we start by dropping every column with more than one missing value.
#dealing with missing data
all_df = all_df.drop(columns=missing_data[missing_data['Total'] > 1].index)
# df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
all_df.isnull().sum().max() #just checking that there's no missing data missing...
# Columns with a single missing value will be filled with the mean later
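As a sketch of that deferred mean fill (my assumption of how it would be done; it runs on a copy so the analysis below is unaffected, and only numeric columns are touched):
filled = all_df.copy()
num_cols = filled.select_dtypes(include=[np.number]).columns
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].mean())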
Normal Probability Plot
#histogram and normal probability plot
sns.distplot(train_df['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(train_df['SalePrice'], plot=plt)
SalePrice is right-skewed and deviates from the diagonal in the probability plot; a good fix is a log transform.
train_df['SalePrice'] = np.log(train_df['SalePrice'])
#histogram and normal probability plot
sns.distplot(train_df['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(train_df['SalePrice'], plot=plt)
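One practical note (my addition): the target now lives in log space, so any model trained on it predicts log prices, and predictions must be mapped back with np.exp before they are interpretable as dollars:
# Hypothetical predictions in log space
preds_log = np.array([11.7, 12.1])
preds = np.exp(preds_log)  # back to dollar prices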
Now let's look at the distribution of each quantitative variable.
quantitative = [f for f in all_df.columns if all_df.dtypes[f] != 'object']
qualitative = [f for f in all_df.columns if all_df.dtypes[f] == 'object']
print("quantitative: {}, qualitative: {}" .format (len(quantitative),len(qualitative)))
quantitative: 30, qualitative: 26
f = pd.melt(all_df, value_vars=quantitative)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
Some of these distributions look roughly log-normal, and a log transform would improve them; others are not suitable. Good candidates are LotArea, BsmtUnfSF, 1stFlrSF, TotalBsmtSF, and KitchenAbvGr.
Let's compute the skewness of each quantitative feature:
all_df[quantitative].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
MiscVal 21.947195
PoolArea 16.898328
LotArea 12.822431
LowQualFinSF 12.088761
3SsnPorch 11.376065
KitchenAbvGr 4.302254
BsmtFinSF2 4.145323
EnclosedPorch 4.003891
ScreenPorch 3.946694
OpenPorchSF 2.535114
WoodDeckSF 1.842433
1stFlrSF 1.469604
BsmtFinSF1 1.424989
GrLivArea 1.269358
TotalBsmtSF 1.162285
BsmtUnfSF 0.919351
2ndFlrSF 0.861675
TotRmsAbvGrd 0.758367
Fireplaces 0.733495
HalfBath 0.694566
OverallCond 0.570312
BedroomAbvGr 0.326324
GarageArea 0.241176
OverallQual 0.197110
MoSold 0.195884
FullBath 0.167606
YrSold 0.132399
GarageCars -0.218260
YearRemodAdd -0.451020
YearBuilt -0.599806
dtype: float64
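As a sketch of the log transform suggested above (not applied in this post, and the 0.75 cutoff is a common but arbitrary choice of mine), np.log1p handles the zero-heavy columns gracefully:
# Log-transform a copy of the features whose skewness exceeds the threshold
skewness = all_df[quantitative].apply(lambda x: skew(x.dropna()))
skewed_feats = skewness[skewness > 0.75].index
transformed = all_df.copy()
transformed[skewed_feats] = np.log1p(transformed[skewed_feats])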
Qualitative Feature Analysis
Analysis of variance (ANOVA) is a common statistical model in data analysis; here we use it to test whether each qualitative feature has a significant effect on SalePrice.
# Rebuild the training frame (the first 1460 rows of all_df) and re-attach the target
train = all_df.loc[train_df.index]
train['SalePrice'] = train_df.SalePrice
def anova(frame):
    anv = pd.DataFrame()
    anv['feature'] = qualitative
    pvals = []
    for c in qualitative:
        # Collect the SalePrice samples for each category of the feature
        samples = []
        for cls in frame[c].unique():
            s = frame[frame[c] == cls]['SalePrice'].values
            samples.append(s)
        # One-way ANOVA across the category groups
        pval = stats.f_oneway(*samples)[1]
        pvals.append(pval)
    anv['pval'] = pvals
    return anv.sort_values('pval')
a = anova(train)
a['disparity'] = np.log(1./a['pval'].values)
sns.barplot(data=a, x='feature', y='disparity')
x=plt.xticks(rotation=90)
/Users/zhuanxu/anaconda/envs/linear_regression_demo/lib/python3.6/site-packages/scipy/stats/stats.py:2958: RuntimeWarning: invalid value encountered in double_scalars
ssbn += _square_of_sums(a - offset) / float(len(a))
Here stats.f_oneway measures how much a qualitative variable matters for SalePrice: if, say, the mean SalePrice is about the same in every GarageType category, the variable tells us little about price, and f_oneway returns pval > 0.05, which essentially means the groups are similar. For details, see the articles below (and the small example after the list):
- Statistical analysis made easy in Python with SciPy and pandas DataFrames
- Four ways to conduct one-way ANOVAs with Python
- Python for Data Analysis Part 26: Analysis of Variance (ANOVA)
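A tiny, self-contained illustration of how f_oneway behaves (toy numbers of my own):
g1 = [100, 110, 105, 95]
print(stats.f_oneway(g1, [101, 108, 103, 97])[1])   # large p-value: groups look alike
print(stats.f_oneway(g1, [200, 210, 205, 195])[1])  # tiny p-value: clearly different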
Next we process these qualitative variables, giving each category a numeric code ordered by its mean SalePrice, which turns them into quantitative columns.
def encode(frame, feature):
    # Rank the categories of `feature` by their mean SalePrice
    ordering = pd.DataFrame()
    ordering['val'] = frame[feature].unique()
    ordering.index = ordering.val
    ordering['spmean'] = frame[[feature, 'SalePrice']].groupby(feature).mean()['SalePrice']
    ordering = ordering.sort_values('spmean')
    ordering['ordering'] = range(1, ordering.shape[0]+1)
    ordering = ordering['ordering'].to_dict()
    # Write the rank back as a new numeric column feature+'_E'
    for cat, o in ordering.items():
        frame.loc[frame[feature] == cat, feature+'_E'] = o

qual_encoded = []
for q in qualitative:
    encode(train, q)
    qual_encoded.append(q+'_E')
print(qual_encoded)
['MSSubClass_E', 'Street_E', 'LotShape_E', 'LandContour_E', 'LotConfig_E', 'LandSlope_E', 'Neighborhood_E', 'Condition1_E', 'Condition2_E', 'BldgType_E', 'HouseStyle_E', 'RoofStyle_E', 'RoofMatl_E', 'Exterior1st_E', 'Exterior2nd_E', 'ExterQual_E', 'ExterCond_E', 'Foundation_E', 'Heating_E', 'HeatingQC_E', 'CentralAir_E', 'Electrical_E', 'KitchenQual_E', 'PavedDrive_E', 'SaleType_E', 'SaleCondition_E']
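To make the encoding concrete, a toy example of my own showing what encode produces:
toy = pd.DataFrame({'Qual': ['Low', 'High', 'Low', 'Mid'],
                    'SalePrice': [100, 300, 120, 200]})
encode(toy, 'Qual')
print(toy['Qual_E'].tolist())  # [1.0, 3.0, 1.0, 2.0]: ranks follow mean price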
# Pick out the rows that still contain missing data
missing_data = all_df.isnull().sum()
missing_data = missing_data[missing_data>0]
ids = all_df[missing_data.index].isnull()
# index (0), columns (1)
all_df.loc[ids[ids.any(axis=1)].index][missing_data.index]
# After encoding, values that were NaN remain NaN in the encoded column
train.loc[1379,'Electrical_E']
nan
Correlation Analysis
def spearman(frame, features):
    spr = pd.DataFrame()
    spr['feature'] = features
    #Signature: a.corr(other, method='pearson', min_periods=None)
    #Docstring:
    #Compute correlation with `other` Series, excluding missing values
    # Spearman rank correlation between each feature and SalePrice
    spr['spearman'] = [frame[f].corr(frame['SalePrice'], 'spearman') for f in features]
    spr = spr.sort_values('spearman')
    plt.figure(figsize=(6, 0.25*len(features)))  # width, height
    sns.barplot(data=spr, y='feature', x='spearman', orient='h')
features = quantitative + qual_encoded
spearman(train, features)
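Why Spearman rather than Pearson here (my note): it is rank-based, so any monotonic relationship scores 1 even when it is nonlinear:
x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3
print(x.corr(y, 'pearson'))   # about 0.94: not linear
print(x.corr(y, 'spearman'))  # 1.0: perfectly monotonic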
From the plot above, the features OverallQual, Neighborhood, and GrLivArea all have a strong influence on price.
Next, let's examine correlations between the feature columns themselves: if two features are strongly correlated, they will cause collinearity problems in a regression.
plt.figure(1)
corr = train[quantitative+['SalePrice']].corr()
sns.heatmap(corr)
plt.figure(2)
corr = train[qual_encoded+['SalePrice']].corr()
sns.heatmap(corr)
plt.figure(3)
# shape: (len(quantitative)+1) x (len(qual_encoded)+1)
corr = pd.DataFrame(np.zeros([len(quantitative)+1, len(qual_encoded)+1]), index=quantitative+['SalePrice'], columns=qual_encoded+['SalePrice'])
for q1 in quantitative+['SalePrice']:
for q2 in qual_encoded+['SalePrice']:
corr.loc[q1, q2] = train[q1].corr(train[q2])
sns.heatmap(corr)
<matplotlib.axes._subplots.AxesSubplot at 0x1172cb860>
Pairplots
def pairplot(x, y, **kwargs):
    # Plot the mean SalePrice at each value of the feature
    ax = plt.gca()
    ts = pd.DataFrame({'time': x, 'val': y})
    ts = ts.groupby('time').mean()
    ts.plot(ax=ax)
    plt.xticks(rotation=90)

f = pd.melt(train, id_vars=['SalePrice'], value_vars=quantitative+qual_encoded)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(pairplot, "value", "SalePrice")
(Plot output omitted: it exceeded the notebook's IOPub data rate limit.)
From these plots we can clearly see which variables relate to price roughly linearly, which are nonlinear, and a few where adding a quadratic term would likely bring out a stronger linear relationship.
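A sketch of that quadratic idea (illustrative only; the post does not add such features, and GrLivArea is just an example pick):
train_fe = train.copy()
train_fe['GrLivArea_sq'] = train_fe['GrLivArea'] ** 2
print(train_fe['GrLivArea_sq'].corr(train_fe['SalePrice']))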
Price Segmentation
Let's make a simple binary split on price and see how the features differ between the two groups. First, a histogram of SalePrice:
a = train['SalePrice']
a.plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x11ed529b0>
features = quantitative
# Split at $200,000 (remember SalePrice is already on a log scale)
standard = train[train['SalePrice'] < np.log(200000)]
pricey = train[train['SalePrice'] >= np.log(200000)]
# Relative difference in each feature's mean between expensive and standard houses
diff = pd.DataFrame()
diff['feature'] = features
diff['difference'] = [(pricey[f].fillna(0.).mean() - standard[f].fillna(0.).mean())
                      / (standard[f].fillna(0.).mean())
                      for f in features]
sns.barplot(data=diff, x='feature', y='difference')
x=plt.xticks(rotation=90)
The chart above shows that for expensive houses, the pool makes the biggest relative difference.
Clustering
Let's try a simple clustering of the data:
features = quantitative + qual_encoded
# t-SNE embedding of the (NaN-filled) features for 2-D visualization
model = TSNE(n_components=2, random_state=0, perplexity=50)
X = train[features].fillna(0.).values
tsne = model.fit_transform(X)
# Standardize, reduce to 30 principal components, then cluster with KMeans
std = StandardScaler()
s = std.fit_transform(X)
pca = PCA(n_components=30)
pca.fit(s)
pc = pca.transform(s)
kmeans = KMeans(n_clusters=5)
kmeans.fit(pc)
# Color the t-SNE embedding by cluster label
fr = pd.DataFrame({'tsne1': tsne[:,0], 'tsne2': tsne[:, 1], 'cluster': kmeans.labels_})
sns.lmplot(data=fr, x='tsne1', y='tsne2', hue='cluster', fit_reg=False)
print(np.sum(pca.explained_variance_ratio_))
0.838557886152
The 30 components explain about 84% of the variance. Overall, though, this clustering approach does not separate the data well.
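If one wanted to tune the number of clusters (a possible follow-up, not done here), the KMeans inertia curve is a quick check:
# Elbow plot: inertia for a range of cluster counts
inertias = [KMeans(n_clusters=k, random_state=0).fit(pc).inertia_ for k in range(2, 10)]
plt.plot(range(2, 10), inertias, marker='o')
plt.xlabel('k'); plt.ylabel('inertia')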
Summary
This post explored and analyzed the data; the next one will build models on top of this analysis.