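The day grows late and the road is long; what age of man is this?
Once the general has gone, the great tree sheds its leaves.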
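Overview
Having worked through the California housing price prediction model, I was itching to try it for real, so I found a Beijing housing price dataset on Kaggle to practice on.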
Experiment Workflow
Data
I picked the Beijing housing price dataset on Kaggle, which collects Beijing home transactions from the Lianjia website between 2011 and 2017.
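Download and Preview the Data
Download and unpack the data
Preview the data
Each row represents one home, and each home has 26 attributes. The following deserve a note:
DOM: active days on market
followers: number of people following the listing
totalPrice: total price of the home
price: price per square meter
floor: floor level; the values contain Chinese text, so handle them with care
buildingType: building type, covering tower, bungalow, duplex, and show flat
renovationCondition: renovation condition, covering other, bare shell, simple, and refined
buildingStructure: building structure, covering unknown, mixed, brick-wood, brick-concrete, steel, and steel-concrete
ladderRatio: number of stairways per resident
fiveYearsProperty: property ownership
district: district, discrete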
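Load and Take a First Look at the Data
- Load the data
Loading the data raised an error. Suspecting an encoding problem, I checked the file with `file new.csv`, which reported `new.csv: ISO-8859 text, with CRLF line terminators`.
The file is in an ISO-8859 encoding, so I re-saved it as UTF-8, after which it loaded successfully.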
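The re-save step can also be scripted instead of done by hand. This is a minimal sketch, not the method used in the post: the helper name `transcode_to_utf8` and the output file name are made up, and the default source encoding `gbk` is an assumption (the `file` utility only reports a generic "ISO-8859" guess, and the data contains Chinese text).

```python
from pathlib import Path


def transcode_to_utf8(src, dst, src_encoding="gbk"):
    """Re-encode a text file to UTF-8 and return the number of characters written.

    src_encoding="gbk" is an assumption about this dataset, not something the
    `file` output confirms; change it if decoding raises UnicodeDecodeError.
    """
    raw = Path(src).read_bytes()
    text = raw.decode(src_encoding)  # fails loudly on a wrong guess
    # newline="" keeps the original CRLF line terminators untouched
    with open(dst, "w", encoding="utf-8", newline="") as out:
        out.write(text)
    return len(text)


# Usage (hypothetical file names):
# transcode_to_utf8("new.csv", "new_utf8.csv")
```

Alternatively, `pd.read_csv` accepts an `encoding` argument, so the original file can be read directly once the right encoding is known.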
- Inspect the data structure and description
Unlike the California data, there is a lot of non-numeric data here. There are 318,851 instances in total, and DOM, buildingType, elevator, fiveYearsProperty, subway, and communityAverage have missing values. DOM is missing too often, so dropping that attribute is worth considering. url, id, and Cid have no bearing on the price and can simply be ignored. My prediction target is the total price, so the price per square meter can be dropped as well.
- Look at basic statistics
Look at the frequency histograms
This dataset turns out to have many discrete attributes. The continuous ones are DOM, Lat, Lng, communityAverage, followers, and square.
```python
import pandas as pd
import matplotlib.pyplot as plt


def load_housing_data(file_path):
    return pd.read_csv(file_path, sep=',', low_memory=False)


def check_attributes(housing):
    # print the value counts of every column to spot odd values
    attributes = list(housing)
    for attr in attributes:
        print(housing[attr].value_counts())


if __name__ == '__main__':
    housing = load_housing_data('new.csv')
    housing = housing.drop(['url', 'id', 'price'], axis=1)
    check_attributes(housing)
    print(housing.describe())
    housing.hist(bins=50, figsize=(20, 15))
    plt.savefig('housing_distribution.png')
```
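Create a Test Set
I hold out 20% of the dataset as the test set. Since there is a district attribute, it serves nicely as the basis for stratified sampling. After the split, I check whether the test set's distribution matches the original data.
```python
from sklearn.model_selection import StratifiedShuffleSplit

# split the train and test set
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in splitter.split(housing, housing['district']):
    # split() yields positional indices, so use iloc rather than loc
    train_set = housing.iloc[train_index]
    test_set = housing.iloc[test_index]
test_set.hist(bins=50, figsize=(20, 15))
plt.savefig('test.png')
```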
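Besides eyeballing the histograms, the "same distribution" check can be made numeric by comparing each district's share in the full data and in the test set. This is a dependency-free sketch with made-up helper names and toy labels standing in for `housing['district']` and `test_set['district']`; it is not part of the original code.

```python
from collections import Counter


def category_proportions(labels):
    """Map each category to its share of the rows."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def max_proportion_gap(full_labels, subset_labels):
    """Largest absolute difference in category share between two label lists."""
    full = category_proportions(full_labels)
    sub = category_proportions(subset_labels)
    return max(abs(full[cat] - sub.get(cat, 0.0)) for cat in full)


# Toy labels, not values from the dataset.
all_districts = [1] * 50 + [2] * 30 + [3] * 20
test_districts = [1] * 10 + [2] * 6 + [3] * 4
gap = max_proportion_gap(all_districts, test_districts)
print(f"largest share gap: {gap:.4f}")
```

A gap close to zero suggests the stratification worked; a purely random split would typically show larger gaps for small districts.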
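Data Exploration and Visualization
First set the test set aside and explore only the training set.
- Visualize the geographic data
Vary the alpha parameter to see how densely the instances are distributed.
I have to say, the Beijing housing market is formidable: every district has a huge number of transactions.
- Visualize district and price information
In the figure, the radius of each circle represents the price and the color represents the district, which gives a basic picture of where the listings are concentrated.
Prices are roughly level across Beijing's districts, concentrated below 25 million yuan (2500w), with few outrageously expensive homes.
```python
# explore the data
housing = train_set.copy()
housing.plot(kind='scatter', x='Lat', y='Lng')
plt.savefig('gregrophy.png')
housing.plot(kind='scatter', x='Lat', y='Lng', alpha=0.1)
plt.savefig('gregrophy_more.png')

# circle size = total price, color = district
plt.figure()  # start a fresh figure so the scatter is not drawn over the previous plot
fig = plt.scatter(x=housing['Lat'], y=housing['Lng'], alpha=0.4,
                  s=housing['totalPrice'] / 100, label='Price',
                  c=housing['district'], cmap=plt.get_cmap('jet'))
plt.colorbar(fig)
plt.legend()
plt.savefig('gregrophy_district_value.png')

# color = total price
plt.figure()
fig = plt.scatter(x=housing['Lat'], y=housing['Lng'], alpha=0.4,
                  c=housing['totalPrice'], cmap=plt.get_cmap('jet'))
plt.colorbar(fig)
plt.savefig('gregrophy_price_value.png')
```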
- Plot price over time
Beijing prices started surging in 2010, while 2018 shows a downward trend.
The box plot shows Beijing prices from 2002 to 2018. There are not many outliers, and the boxes are squeezed fairly small, which means each month's sale prices stay within a narrow range (around 5 million yuan, 500w).
```python
price_by_trade_time = pd.DataFrame()
price_by_trade_time['totalPrice'] = housing['totalPrice']
price_by_trade_time.index = housing['tradeTime'].astype('datetime64[ns]')
price_by_trade_time = price_by_trade_time.sort_index()

# monthly mean price ('M' is the month-end frequency; newer pandas spells it 'ME')
price_by_trade_month = price_by_trade_time.resample('M').mean().to_period('M').fillna(0)
price_by_trade_month.plot(kind='line')

# one box per month
price_stat_trade_month_index = [x.strftime('%Y-%m') for x in set(price_by_trade_time.to_period('M').index)]
price_stat_trade_month_index.sort()
price_stat_trade_month = []
for month in price_stat_trade_month_index:
    # .loc with a 'YYYY-MM' string selects every row in that month
    price_stat_trade_month.append(price_by_trade_time.loc[month, 'totalPrice'].values)
price_stat_trade_month = pd.DataFrame(price_stat_trade_month)
price_stat_trade_month.index = price_stat_trade_month_index
price_stat_trade_month = price_stat_trade_month.T
plt.figure()
price_stat_trade_month.boxplot(figsize=(15, 10))
plt.xticks(rotation=90, fontsize=7)
plt.savefig('price_stat_trade_time.png')
```
- Explore the relationship between building age and price
A look at the constructionTime value counts shows, among others: 未知 (unknown) 15475, 0 appearing 14 times, and 1 appearing 12 times.
These are noise, so I delete them and then plot mean price against building age.
A century-old house really is something else!
It turns out the century-old houses are isolated cases. Building ages cluster between roughly 0 and 65 years, so I zoom in for a closer look.
Most homes are still around 5 million yuan (500w). Half-century-old homes somehow sell for as much as new ones, which is hard to understand. Still, prices are not all in the tens of millions as rumor has it, so there is hope for staying in Beijing!
```python
# price and construction-time correlations
price_by_cons_time = housing[['totalPrice', 'constructionTime']].copy()
# drop the noisy values, then turn the construction year into an age
price_by_cons_time = price_by_cons_time[
    ~price_by_cons_time['constructionTime'].isin(['0', '1', '未知'])
].copy()
price_by_cons_time['constructionTime'] = 2018 - price_by_cons_time['constructionTime'].astype('int64')
price_by_cons_time_index = sorted(set(price_by_cons_time['constructionTime']))
price_by_cons_time = price_by_cons_time.set_index('constructionTime')

price_by_cons_time_line = []
price_by_cons_time_stat = []
for years in price_by_cons_time_index:
    # .loc[[years], ...] always returns a Series, even when only one row matches
    prices = price_by_cons_time.loc[[years], 'totalPrice']
    price_by_cons_time_line.append(prices.mean())
    price_by_cons_time_stat.append(prices.values)

plt.figure()
plt.plot(price_by_cons_time_index, price_by_cons_time_line)
plt.savefig('price_cons_line.png')

price_by_cons_time_stat = pd.DataFrame(price_by_cons_time_stat)
price_by_cons_time_stat.index = price_by_cons_time_index
price_by_cons_time_stat = price_by_cons_time_stat.T
plt.figure()
price_by_cons_time_stat.boxplot(figsize=(20, 15))
plt.ylim(0, 2500)
plt.savefig('price_stat_cons_time.png')
```
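The noise filter and the age conversion used above can be read as a single per-value rule. The sketch below is illustrative only: `building_age` is a made-up helper, and treating every non-digit string as noise goes slightly beyond the three values ('0', '1', '未知') that the post removes.

```python
def building_age(construction_time, base_year=2018, noise=("0", "1", "未知")):
    """Return the age in years, or None when the value is noise.

    Assumption: any value that is not a plain run of digits is also noise.
    """
    value = str(construction_time).strip()
    if value in noise or not value.isdigit():
        return None
    return base_year - int(value)


ages = [building_age(v) for v in ["2005", "1960", "未知", "0", "1"]]
print(ages)
```

With 2018 as the base year, a home built in 2005 gets an age of 13, and the three noise values map to `None` so they can be filtered out.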
- Explore price versus floor area
Prices soar for mansions above 1000 square meters, and 600 to 900 square meters is another rising range. Homes of 0 to 400 square meters are probably the must-buy segment, and prices from 400 to 600 square meters are basically flat. That could be a sample-size effect, though, so I decided to look at the overall picture.
The areas are heavily concentrated, so I narrow the range and look again.
Most homes that actually sell in Beijing are 100 square meters or smaller.
Now look at area against price.
Broadly, the larger the area, the higher the price.
Zoom in on the axes for a closer look.
```python
import numpy as np

# square and price
price_by_square = housing[['totalPrice', 'square']].copy()
# round the area up to a whole number, then down to a multiple of 10
price_by_square['square'] = np.ceil(price_by_square['square'])
price_by_square['square'] = price_by_square['square'] - (price_by_square['square'] % 10)
price_by_square_index = sorted(set(price_by_square['square']))
price_by_square.index = price_by_square['square']

price_by_square_line = []
price_by_square_stat = []
for squares in price_by_square_index:
    # .loc[[squares], ...] always returns a Series, even when only one row matches
    prices = price_by_square.loc[[squares], 'totalPrice']
    # the mean must be collected here, otherwise the line plot below has no y values
    price_by_square_line.append(prices.mean())
    price_by_square_stat.append(prices.values)

plt.figure()
plt.plot(price_by_square_index, price_by_square_line)
plt.savefig('price_square_mean.png')

price_by_square['square'].hist(bins=50, figsize=(20, 15))
plt.savefig('price_square.png')

price_by_square_stat = pd.DataFrame(price_by_square_stat).T
price_by_square_index = [int(x) for x in price_by_square_index]
price_by_square_stat.columns = price_by_square_index
plt.figure()
price_by_square_stat.boxplot(figsize=(20, 15))
plt.xticks(rotation=90)
plt.ylim(0, 5000)
plt.savefig('price_stat_square_time.png')
```
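The area bucketing in the code above (round up with `ceil`, then subtract the remainder modulo 10) is easier to follow on single numbers. The helper name `square_bucket` below is hypothetical and only mirrors that arithmetic.

```python
import math


def square_bucket(square, width=10):
    """Round the area up to a whole number, then down to a multiple of `width`."""
    rounded_up = math.ceil(square)
    return rounded_up - (rounded_up % width)


for area in (87.3, 90.0, 99.2, 9.5):
    print(area, "->", square_bucket(area))
```

Because the rounding up happens first, an area just under a multiple of 10 (such as 99.2) lands in the next bucket (100) rather than in 90.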
- Explore the relationship among trade time, area, and price
Most Beijing homes on the market fall between 0 and about 25 million yuan (2500w) and between 0 and 500 square meters.
Zoom in on the axes.
Zoom in once more.
Prices in 2017 left every other year in the dust, while 2011 looks like the best time to have bought in Beijing.
```python
import itertools

import matplotlib as mpl
import numpy as np


def get_mean(price_by_square):
    """Return (mean prices, area buckets) for one year's rows."""
    means = price_by_square.groupby('square')['totalPrice'].mean()
    return list(means.values), [int(x) for x in means.index]


# price and time, square correlations
price = housing[['totalPrice', 'square']].copy()
price.index = housing['tradeTime'].astype('datetime64[ns]')
price['square'] = np.ceil(price['square'])
price['square'] = price['square'] - (price['square'] % 10)
price = price.to_period('Y')
price_time_index = sorted({x.strftime('%Y') for x in price.index})

colormap = mpl.cm.Dark2.colors
m_styles = ['', '.', 'o', '^', '*']
plt.figure()
# one line per year, each with its own marker/color combination
for year, (marker, color) in zip(price_time_index, itertools.product(m_styles, colormap)):
    # a boolean mask always yields a DataFrame, even for a year with one row
    y, x = get_mean(price[price.index.year == int(year)])
    plt.plot(x, y, color=color, marker=marker, label=year)
plt.xticks(rotation=90)
plt.xlim(0, 750)
plt.ylim(0, 5000)
plt.legend()
plt.savefig('price_by_time_square.png')
```
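- Check for dirty data
livingRoom contains `#NAME?` values; consider deleting them
drawingRoom mixes Chinese text with numbers; the Chinese entries are few, so consider deleting them
bathRoom contains obvious errors; consider deleting the bad records
floor is very messy and needs special handling
buildingType also contains errors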
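One quick way to surface this kind of dirty value is to count the entries of a column that do not parse as numbers. This is a small stdlib sketch on a toy column; the helper name `non_numeric_values` and the sample values are illustrative, not taken from the original code.

```python
from collections import Counter


def non_numeric_values(values):
    """Count the entries that cannot be parsed as a number."""
    bad = Counter()
    for value in values:
        try:
            float(value)
        except (TypeError, ValueError):
            bad[value] += 1
    return bad


# Toy column standing in for something like housing['livingRoom'].
column = ["2", "3", "#NAME?", "1", "#NAME?", "中"]
print(non_numeric_values(column))
```

If the counter is small relative to the column, dropping those records (as done here) costs little data.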
After checking, the attributes that need processing are:
constructionTime
buildingType
floor
bathRoom
drawingRoom
livingRoom
The continuous attributes are:
communityAverage
ladderRatio
constructionTime
square
followers
Lat
Lng
The discrete attributes are:
district
subway
fiveYearsProperty
elevator
buildingStructure
renovationCondition
buildingType
floor
bathRoom
kitchen
drawingRoom
livingRoom
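Among the discrete attributes, the binary ones (subway, fiveYearsProperty, elevator) only take the values 0 and 1 and need no one-hot encoding. tradeTime is not an attribute of the home itself, so it is dropped.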
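Data Preparation
- Clean the data
- The data has too many dirty records, so clean it from scratch
- Remove the attributes that are not needed
- Convert constructionTime into a continuous building-age attribute (using 2018 as the base year)
- Remove the dirty records in buildingType
- Remove the dirty records in livingRoom, drawingRoom, and bathRoom, and convert them to numeric types
- The floor attribute is too complicated, so I decided to drop it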
```python
from sklearn.base import BaseEstimator, TransformerMixin


class DataNumCleaner(BaseEstimator, TransformerMixin):
    def __init__(self, clean=True):
        self.clean = clean

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if not self.clean:
            return X
        # drop noisy construction years, then convert the year to an age
        X = X[~X.constructionTime.isin(['0', '1', '未知'])].copy()
        X['constructionTime'] = 2018 - X['constructionTime'].astype('int64')
        # keep only the four valid building types
        X = X[X.buildingType.isin([1, 2, 3, 4])]
        # drop dirty room counts, then make the columns numeric
        X = X[X.livingRoom != '#NAME?']
        X = X[X.drawingRoom.isin(['0', '1', '2', '3', '4', '5'])]
        X = X[X.bathRoom.isin(['0', '1', '2', '3', '4', '5', '6', '7'])].copy()
        X['bathRoom'] = X['bathRoom'].astype('float64')
        X['drawingRoom'] = X['drawingRoom'].astype('float64')
        X['livingRoom'] = X['livingRoom'].astype('float64')
        return X
```
- The cleaning result is fairly satisfactory
Fill missing values with the mode (the most frequent value)
Convert buildingType, renovationCondition, buildingStructure, and district to one-hot encodings
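To make these two steps concrete, here is a tiny hand-rolled version of mode imputation and one-hot encoding. It only illustrates the idea: the pipeline in the post relies on scikit-learn's transformers, the helper names `fill_with_mode` and `one_hot` are made up, and `None` stands in for a missing value.

```python
from collections import Counter


def fill_with_mode(values, missing=None):
    """Replace missing entries with the most frequent non-missing value."""
    present = [v for v in values if v is not missing]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]


def one_hot(values):
    """Encode each value as a 0/1 row with one column per sorted category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return categories, rows


# Toy values, not taken from the dataset.
elevator = fill_with_mode([1, None, 1, 0, None])
categories, rows = one_hot([4, 1, 3, 4])
print(elevator)
print(categories, rows)
```

Each one-hot row contains exactly one 1, so a category code such as buildingType 4 never gets treated as "larger" than buildingType 1 by the model.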
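- Build the data-cleaning pipeline
```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# DataFrameSelector, num_attributes and cat_attributes are defined elsewhere in the full code
num_pipeline = Pipeline([
    ('cleaner', DataNumCleaner()),
    ('selector', DataFrameSelector(num_attributes)),
    # SimpleImputer replaces the Imputer class that was removed in scikit-learn 0.22
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('std_scaler', StandardScaler())
])
cat_pipeline = Pipeline([
    ('cleaner', DataNumCleaner()),
    ('selector', DataFrameSelector(cat_attributes)),
    ('encoder', OneHotEncoder())
])
label_pipeline = Pipeline([
    ('cleaner', DataNumCleaner()),
    ('selector', DataFrameSelector(['totalPrice']))
])
full_pipeline = FeatureUnion([
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])
```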
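Model Training
- Linear regression model
The results are satisfactory.
- Decision tree
The results are also reasonable, but training takes too long, so I consider removing some irrelevant features.
Look at the correlations between features
The features most correlated with price are still area and the community average price. But the goal is to predict the price of an individual home, so the features should preferably be attributes of the home itself. I therefore consider dropping followers and communityAverage.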
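For reference, the correlation in question is the Pearson coefficient, which can be written out directly. The numbers below are made up purely to show the call and are not taken from the dataset; `pearson` is an illustrative helper, not a function from the original code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up areas and prices, only to demonstrate the function.
square = [50, 80, 100, 120, 200]
total_price = [200, 330, 390, 500, 790]
print(round(pearson(square, total_price), 3))
```

In pandas, `DataFrame.corr()` computes the same coefficient by default, so sorting one column of its result by value ranks the features against the target.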
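With fewer features, the linear regression model's performance is still acceptable.
- Linear SVR
- Hyperparameter tuning
My computer really lacks the horsepower, so for now I can only practice with a linear model.
Get the best parameters for the linear SVR
Look at the RMSE of each run
The results are acceptable.
```python
import numpy as np
from sklearn.model_selection import GridSearchCV

# improve the linear SVR model (lin_svm_reg is the linear SVR from the previous step)
param_grid = [
    {'C': [0.5, 1, 2], 'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive']}
]
grid_search = GridSearchCV(lin_svm_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_label)
print(grid_search.best_params_)

# scores are negative MSE, so negate before taking the square root
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(np.sqrt(-mean_score), params)

# final model
final_model = grid_search.best_estimator_
```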
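The `np.sqrt(-mean_score)` step deserves a word: with `scoring='neg_mean_squared_error'`, scikit-learn reports the negated MSE so that a larger score is always better, and the RMSE is recovered by flipping the sign and taking the square root. A minimal sketch with hypothetical scores (not results from this experiment):

```python
import math


def rmse_from_neg_mse(neg_mse_scores):
    """Turn scores in the 'negative MSE' convention into RMSE values."""
    return [math.sqrt(-score) for score in neg_mse_scores]


# Hypothetical scores, not results from the post.
print(rmse_from_neg_mse([-10000.0, -22500.0]))
```

The RMSE is in the same unit as totalPrice, which makes it easier to judge than the squared error.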
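Model Validation
- Validate on the test set
The performance is about the same as on the training set, which is acceptable.
- Randomly pick 100 records from the test set, predict them, and look at the result
The predictions almost coincide with the labels, so the model is usable.
```python
from random import randint

# randint includes both endpoints, so the upper bound must be len(y_test) - 1
test_index = [randint(0, len(y_test) - 1) for i in range(100)]
y_label = [y_test[index] for index in test_index]
y_predict = [final_model.predict(X_test_prepared[index]) for index in test_index]
x = [i + 1 for i in range(100)]
plt.figure()
plt.plot(x, y_label, c='red', label='label')
plt.plot(x, y_predict, c='blue', label='predict')
plt.legend()
plt.savefig('result.png')
```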
- Export the model
```python
import joblib

joblib.dump(final_model, 'BeijingHousingPricePredicter.pkl')
```
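Summary
- Beijing housing prices really are high!
- Homes that actually change hands in Beijing are mostly around 5 million yuan (500w), about 100 square meters, and 0 to 40 years old. Larger homes are priced but rarely sell
- The best time to buy in Beijing was around 2011
- Around 2017 someone actually traded a home for an astronomical 175 million yuan (17500w); who knows what kind of immortals the buyer and seller were
- Data cleaning matters a lot; you can write your own transformer and add it to the Pipeline
- Some features can be dropped based on human experience, but feature engineering is very important!
- Machine learning needs a computer with decent computing power ORZ
- Full code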