Dataset: https://www.kaggle.com/ruiqurm/lianjia
This dataset contains second-hand housing listings from the Lianjia website, covering 2010 through January 2018.

Part 1: Data Preparation & Cleaning

1.1 Data Preparation

Packages used:

library (tidyverse)
library (psych)
library (xts)
library (tseries)
library (forecast)
library (dplyr)
library (ggplot2)
library (VIM)
library (ggmap)
library (xgboost)
library (dygraphs)
library (knitr)
library (Matrix)
library(RCurl)
library(geosphere)
library(viridis)
library(lubridate)

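A note: not all of the packages listed here are actually used. Some were loaded for experiments that did not work out, and I forgot to remove them at the time; a few days later I could no longer remember which ones were needed and which were not.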

First, take a look at the dataset:

beijing_house = read.csv ('D:/R語言/House Price Model/new.csv')

# First, take a look at the dataset
glimpse (beijing_house)

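The dataset is large: nearly 320,000 records with 26 variables in total.
url: Lianjia URL of the listing
id: listing id on Lianjia
Lng: longitude of the house
Lat: latitude of the house
Cid: community id
tradeTime: transaction date
DOM: Days on Market, how long the listing was active
followers: number of followers
totalPrice: total price of the property
price: price per square meter
square: floor area
livingRoom: number of living rooms
drawingRoom: number of studies
kitchen: number of kitchens
bathRoom: number of bathrooms
floor: floor
buildingType: building type, 1 = tower, 2 = bungalow, 3 = combined plate/tower, 4 = plate; look up the differences online if needed
constructionTime: year of construction
renovationCondition: renovation condition, 1 = other, 2 = rough (unfinished), 3 = simple, 4 = hardcover (fully finished)
buildingStructure: building structure, 1 = unknown, 2 = mixed, 3 = brick and wood, 4 = brick and concrete, 5 = steel, 6 = steel-concrete composite
ladderRatio: ratio of elevators to households
elevator: whether there is an elevator
fiveYearsProperty: whether the five-year ownership period has passed
subway: whether there is a subway nearby
communityAverage: average price in the community
district: district,
1 = Dongcheng,
2 = Fengtai,
3 = Yizhuang (Daxing),
4 = Daxing,
5 = Fangshan,
6 = Changping,
7 = Chaoyang,
8 = Haidian,
9 = Shijingshan,
10 = Xicheng,
11 = Tongzhou,
12 = Shunyi,
13 = Mentougou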

Part 2: Data Cleaning and Imputation

2.1 Data Cleaning

The url, id and Cid columns are of no use for predicting price, so drop them

house_price = select (beijing_house, -url, -id, -Cid)

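The floor variable mixes a Chinese character with a number, which is awkward for the later analysis, so split it into a position label and a floor count, and encode the position as a number

# The first character is the position label (底/低/中/高/顶/未)
floor = substring (house_price$floor, 1, 1)
house_price$floorpos = floor
# The remainder is the floor count; entries that are not numbers become NA
house_price$floor = as.numeric(trimws(sub('^.', '', house_price$floor)))
house_price = house_price %>% 
  mutate(floorpos = case_when(floorpos == "底" ~ 1,
                              floorpos == "低" ~ 2,
                              floorpos == "中" ~ 3,
                              floorpos == "高" ~ 4,
                              floorpos == "顶" ~ 5,
                              floorpos == "未" ~ 0))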
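
As a quick check of the splitting logic, here is a minimal sketch on a toy vector. The sample strings are my assumption about the `floor` format (a position character, a space, then the floor count), and the named lookup vector `pos_code` is just an illustrative alternative to `case_when`; only base R is used.

```r
# Toy values in the assumed "<position> <floor count>" format
floor_raw <- c("高 26", "底 6", "中 18", "未 5")

# The first character is the position label
floor_pos <- substring(floor_raw, 1, 1)

# Everything after the first character is the floor count
floor_num <- as.numeric(trimws(sub("^.", "", floor_raw)))

# Map the position label to an ordinal code with a named lookup vector
pos_code <- c("未" = 0, "底" = 1, "低" = 2, "中" = 3, "高" = 4, "顶" = 5)
floor_code <- unname(pos_code[floor_pos])

print(data.frame(floor_raw, floor_pos, floor_num, floor_code))
```

Any label that is not in the lookup vector comes back as NA, which makes unexpected values easy to spot before modeling.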

2.2 Data Imputation

First, see which values are missing

missing_data = data.frame(lapply(house_price,function(x) sum(is.na(x))))
missing_data
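
The same count can also be written with `colSums(is.na(...))`, and `colMeans` gives the missing share per column directly. A minimal sketch on a made-up data frame (the column values here are invented for illustration):

```r
# Made-up data frame with a few missing values
df <- data.frame(
  DOM    = c(NA, 5, NA, 12),
  subway = c(1, 0, NA, 1),
  price  = c(100, 200, 300, 400)
)

# Number of missing values per column
missing_count <- colSums(is.na(df))

# Share of missing values per column
missing_share <- round(colMeans(is.na(df)), 2)

print(rbind(count = missing_count, share = missing_share))
```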

DOM has a large number of missing values, and a few other variables have some as well. Now visualize the missing counts to make them easier to read

missing_price = select (house_price, DOM, buildingType, elevator, fiveYearsProperty, subway, communityAverage)
aggr (missing_price, prop = TRUE, numbers = TRUE)

MISSINGDATADISTRIBUTION.jpeg

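DOM is missing in more than half of the rows, which is too much to ignore. The other variables have only a tiny share of missing values, so those rows can simply be dropped. Start with DOM and look at its distribution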

qqnorm(house_price$DOM)
qqline(house_price$DOM)

NORMALLQQPLOT.jpeg

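The Q-Q plot shows that DOM departs from the normal reference line, i.e. the distribution is skewed, so I fill the missing values with the median rather than the mean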

house_price$DOM<-ifelse(is.na(house_price$DOM),median(house_price$DOM,na.rm=TRUE),house_price$DOM)
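
Why the median rather than the mean for a skewed variable: a few very large values pull the mean away from where most of the data sits. A minimal sketch with made-up numbers:

```r
# Toy right-skewed sample with missing values (made-up numbers)
dom <- c(1, 2, 2, 3, 4, 5, 150, NA, NA)

print(mean(dom, na.rm = TRUE))    # pulled up by the single large value
print(median(dom, na.rm = TRUE))  # stays with the bulk of the data

# Replace the missing entries with the median
dom_filled <- ifelse(is.na(dom), median(dom, na.rm = TRUE), dom)
print(dom_filled)
```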

Simply drop the rows with the remaining missing values

house_price1 <- na.omit(house_price)

Now check the state of the data after cleaning

dim (house_price1)

Just over two thousand rows were dropped, which is negligible against 300,000 records, so that is fine. Data cleaning and preparation are done for now.
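
To quantify how much `na.omit` removes, compare the row counts before and after. A minimal sketch on a made-up data frame:

```r
# Made-up data frame: two of the four rows contain a missing value
df <- data.frame(a = c(1, NA, 3, 4), b = c("x", "y", NA, "z"))

clean <- na.omit(df)

# Rows dropped, in absolute terms and as a share of the original
dropped <- nrow(df) - nrow(clean)
dropped_share <- dropped / nrow(df)

print(c(dropped = dropped, share = dropped_share))
```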

Part 3: Data Visualization

3.1 Price Heat Map

First, get a map of Beijing. Download link: https://www.kaggle.com/eraw0x/beijing-map/download/DtXJ8lv7gp5r6Mm3C6Zs%2Fversions%2FISfNUAnnDEEQIZbKxWlj%2Ffiles%2Fbeijing_map.RData?datasetVersionNumber=1

Having longitude and latitude makes this easy. Load the map in R

load(file = "D:/R語言/NTU/House Price Model/beijing_map.RData",verbose = TRUE)

Plot price as a heat map on top of the map

beijing + geom_point(data = house_price1, aes(x = Lng, y = Lat, color = price), size = 1.3, alpha = .5) + scale_color_viridis()

PRICEMAPPING.jpeg

The city shows a roughly radial pattern: prices are highest in the center and fall off ring by ring outward, with the northwest somewhat higher than other areas

3.2 Data Processing for Visualization

Process the data to be plotted a bit further, converting the numeric codes to text labels

house_price2 = house_price1 %>% 
  mutate(district = case_when(district == 1 ~ "DongCheng",
                              district == 2 ~ "FengTai",
                              district == 3 ~ "DaXing",
                              district == 4 ~ "YiZhuang",
                              district == 5 ~ "FangShan",
                              district == 6 ~ "ChangPing",
                              district == 7 ~ "ChaoYang",
                              district == 8 ~ "HaiDian",
                              district == 9 ~ "ShiJingShan",
                              district == 10 ~ "XiCheng",
                              district == 11 ~ "TongZhou",
                              district == 12 ~ "ShunYi",
                              district == 13 ~ "MenTouGou"))

house_price2 = house_price2 %>% 
  mutate(buildingType = case_when(buildingType == 1 ~ "Tower",
                                  buildingType == 2 ~ "Bungalow",
                                  buildingType == 3 ~ "Plate&Tower",
                                  buildingType == 4 ~ "Plate"))

house_price2 = house_price2 %>% 
  mutate(buildingStructure = case_when(buildingStructure == 1 ~ "Unavailable",
                                       buildingStructure == 2 ~ "Mixed",
                                       buildingStructure == 3 ~ "Brick/Wood",
                                       buildingStructure == 4 ~ "Brick/Concrete",
                                       buildingStructure == 5 ~ "Steel",
                                       buildingStructure == 6 ~ "Steel/Concrete"))

house_price2 = house_price2 %>% 
  mutate(renovationCondition = case_when(renovationCondition == 1 ~ "Other",
                                         renovationCondition == 2 ~ "Rough",
                                         renovationCondition == 3 ~ "Simplicity",
                                         renovationCondition == 4 ~ "Hardcover"))

house_price2 = house_price2 %>% 
  mutate(elevator = case_when(elevator == 1 ~ "Has_Elevator",
                              elevator != 1 ~ "No_elevator"))
house_price2 = house_price2 %>% 
  mutate(fiveYearsProperty = case_when(fiveYearsProperty == 1 ~ "Ownership < 5Yrs",
                                       fiveYearsProperty != 1 ~ "Ownership > 5Yrs"))

house_price2 = house_price2 %>% 
  mutate(subway = case_when(subway == 1 ~ "Has_Subway",
                            subway != 1 ~ "No_Subway"))
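
For reference, the same kind of recoding can be done in base R with `factor()`, which also fixes the order of the categories for plotting. This is only an alternative sketch on a made-up vector, not what the code above uses:

```r
# Made-up codes, following the buildingType coding from the variable list
building_type <- c(1, 4, 2, 3, 4, NA)

# Labels in the same order as the codes 1..4
type_labels <- c("Tower", "Bungalow", "Plate&Tower", "Plate")
building_label <- factor(building_type, levels = 1:4, labels = type_labels)

print(table(building_label, useNA = "ifany"))
```

Codes outside `levels` (and NA) stay NA, the same as an unmatched `case_when` branch.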

Next, visualize some of the variables. I chose district, building type, building structure, renovation condition, elevator, the five-year ownership restriction and subway

3.3 Price Box Plot by District

ggplot(house_price2, aes(x = reorder(district, -price), y = price, color = district)) + geom_boxplot() + labs(title = "Prices by District", x = "District", y = "Price per Square Meter") + coord_flip()

PRICE_DISTRICT.jpeg

Xicheng has the highest prices and Mentougou the lowest

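3.4 Price Box Plot by Building Type

ggplot(house_price2, aes(x = buildingType, y = price, color = buildingType)) + geom_boxplot() + labs(title = "Prices by Building Type", y = "Price per Square Meter")

PRICEinBUILDINGTYPE.jpeg

Bungalows are the most expensive. Land in Beijing is extremely scarce and bungalows have a low floor-area ratio, so most of them are presumably luxury villas, courtyard houses (siheyuan) and the like, hence the highest prices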

3.5 Price Box Plot by Building Structure

ggplot(house_price2, aes(x = buildingStructure, y = price, color = buildingStructure)) + geom_boxplot() + labs(title = "Prices by Building Structure", y = "Price per Square Meter")

PRICEinBUILDINGSTRUC.jpeg

Brick-and-wood structures are the most expensive. My guess is that most of them are old houses and courtyard homes in the central districts, hence the higher prices

3.6 Price Box Plot by Renovation Condition

ggplot(house_price2, aes(x = renovationCondition, y = price, color = renovationCondition)) + geom_boxplot() + labs(title = "Prices by Renovation Condition", y = "Price per Square Meter")

PRICEinRENOVATION.jpeg

This broadly matches common sense: hardcover and simple renovation are priced somewhat higher

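3.7 Price Box Plot by Elevator

ggplot(house_price2, aes(x = elevator, y = price, color = elevator)) + geom_boxplot() + labs(title = "Prices by Elevator", y = "Price per Square Meter")

PriceonELEVATOR.jpeg

Prices are about the same. As a historic city, Beijing has many old buildings, and they are concentrated in the central districts, so the small difference is understandable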

3.8 Price Box Plot by Subway

ggplot(house_price2, aes(x = subway, y = price, color = subway)) + geom_boxplot() + labs(title = "Prices by Subway Access", y = "Price per Square Meter")

subway.jpeg

Homes with subway access are indeed somewhat more expensive

3.9 Price Trend Over Time

house_price1$tradeTime = as.Date(house_price1$tradeTime)
# constructionTime holds a construction year; convert it to a number (non-numeric entries become NA)
house_price1$constructionTime = suppressWarnings(as.numeric(as.character(house_price1$constructionTime)))
house_price1$tradeTimeM = floor_date(house_price1$tradeTime, "month")
house_price1$tradeTimeY = floor_date(house_price1$tradeTime, "year")
# Mean price per square meter for each month
housePrice_time1 = house_price1 %>%  
  filter(tradeTimeM >= ymd("2010-01-01") & tradeTimeM < ymd("2018-12-31")) %>% 
  group_by(tradeTimeM) %>%  
  summarize(mean = mean(price))

HP_xts <- xts(housePrice_time1[,-1], order.by = housePrice_time1$tradeTimeM)
dygraph(HP_xts, main = "Average Monthly Price per Square Meter", 
        ylab = "Average Monthly Price") %>%
  dySeries("mean", label = "Mean Price per Square Meter") %>%
  dyOptions(stackedGraph = TRUE) %>%
  dyRangeSelector(height = 20)

TSERIES.jpeg

The price-over-time chart shows that prices rose roughly sevenfold from 2010 to 2018; real estate is still an investment that holds its value very well
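
A growth multiple like this comes from comparing the monthly means at the two ends of the series. Here is a minimal base-R sketch of that calculation; the prices are made-up numbers, so the resulting multiple is not the one from the real data:

```r
# Made-up trades at the start and end of the period
trades <- data.frame(
  tradeTime = as.Date(c("2010-01-05", "2010-01-20", "2010-02-11",
                        "2018-01-03", "2018-01-15")),
  price = c(10000, 12000, 11500, 60000, 64000)
)

# Truncate each date to the first day of its month
trades$month <- as.Date(format(trades$tradeTime, "%Y-%m-01"))

# Mean price per month, in chronological order
monthly <- aggregate(price ~ month, data = trades, FUN = mean)
monthly <- monthly[order(monthly$month), ]

# Growth multiple between the first and the last month
growth <- monthly$price[nrow(monthly)] / monthly$price[1]

print(monthly)
print(growth)
```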

Part 4: Prediction Model

XGBoost has recently produced standout results in many Kaggle competitions, and the model copes well with binary features, continuous features and time-related data, so I chose XGBoost for modeling

4.1 Choosing the Training and Test Sets

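# All 2018 records (the data ends in January 2018)
validation = house_price1 %>% filter (year (tradeTime)==2018)
training = house_price1 %>%
  filter ( year(tradeTime) != 2018)
# The first 100 records from 2018 go into the training set
training = rbind (training, validation[1:100, ])
# The remaining 2018 records form the test set
validation = validation[101:nrow(validation), ]

The training set consists of all data before January 2018 plus the first 100 records from January 2018 (roughly half of that month's data); the test set is the rest of the January 2018 data, 119 records in total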
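
Splitting by row position only gives an "earlier vs. later" split if the rows are in date order. A minimal sketch of a date-based split on made-up data, assuming a hypothetical cut-off date `cutoff` (not a value from the analysis above):

```r
set.seed(1)

# Made-up trades spread over December 2017 and January 2018
toy <- data.frame(
  tradeTime = as.Date("2017-12-01") + sample(0:60, 40, replace = TRUE),
  price = runif(40, 30000, 90000)
)

# Sort by date so that "earlier" and "later" rows are well defined
toy <- toy[order(toy$tradeTime), ]

# Hypothetical cut-off date for this illustration
cutoff <- as.Date("2018-01-16")
train <- toy[toy$tradeTime < cutoff, ]
test  <- toy[toy$tradeTime >= cutoff, ]

print(c(train = nrow(train), test = nrow(test)))
```

Splitting on the date itself means no test-set trade can be earlier than a training-set trade, which is what a forecast-style evaluation needs.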

4.2 Processing the Training and Test Sets

Convert the Chinese-character part to numbers

# floorpos may already be numeric if it was encoded during cleaning;
# re-running case_when on numbers would turn every value into NA
encode_floorpos = function(x) {
  if (is.numeric(x)) return(x)
  case_when(x == "底" ~ 1,
            x == "低" ~ 2,
            x == "中" ~ 3,
            x == "高" ~ 4,
            x == "顶" ~ 5,
            x == "未" ~ 0)
}

training = training %>% mutate(floorpos = encode_floorpos(floorpos))
validation = validation %>% mutate(floorpos = encode_floorpos(floorpos))
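
A side note on why text columns are worth converting explicitly before modeling: `data.matrix()` turns character columns into factor codes, and those codes depend on which values happen to be present in each data frame. A minimal sketch with made-up floor counts stored as text:

```r
# Made-up floor counts stored as text
train <- data.frame(floor = c("10", "2", "26"), stringsAsFactors = FALSE)
test  <- data.frame(floor = c("2", "26", "3"), stringsAsFactors = FALSE)

# data.matrix() converts character columns to factors, then to integer codes
train_codes <- as.vector(data.matrix(train))
test_codes  <- as.vector(data.matrix(test))

# Explicit conversion keeps the actual numbers
train_num <- as.numeric(train$floor)
test_num  <- as.numeric(test$floor)

print(rbind(train_codes, train_num))
print(rbind(test_codes, test_num))
```

The same text value can end up with different codes in the two sets, so a model trained on one encoding would misread the other.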

4.3 Adding a Distance Variable

Add a new 'distance' variable to the training and test sets. Following the ring-shaped, radial price pattern of Beijing, take the coordinates of Tiananmen Square as the city center and compute the distance between each house and the center

location = data.frame (Lng = training$Lng, Lat = training$Lat)
location = data.matrix (location)  
# Tiananmen is at about 116°23'E, 39°54'N; written here in decimal degrees
center = data.frame (Lng = 116 + 23/60, Lat = 39 + 54/60)
center = data.matrix (center)
distance = round (distm(location, center, fun=distVincentyEllipsoid), 0) # distance in meters from each house to the center
distance = data.frame (distance)
training = cbind(training, distance)
training1 = select (training, -totalPrice, -price)

location = data.frame (Lng = validation$Lng, Lat = validation$Lat)
location = data.matrix (location)  
distance = round (distm(location, center, fun=distVincentyEllipsoid), 0)
distance = data.frame (distance)
validation = cbind(validation, distance)
validation1 = select (validation, -totalPrice )
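
For a rough cross-check of the distances without geosphere, the haversine formula on a sphere is enough at city scale. This is a simplified stand-in for the ellipsoidal Vincenty calculation, not the same method; the function name `haversine_m` is my own and the Earth radius is the usual mean value of 6,371 km:

```r
# Great-circle distance in meters on a sphere (haversine formula)
haversine_m <- function(lng1, lat1, lng2, lat2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# One degree of latitude along a meridian
d <- haversine_m(116.4, 39.0, 116.4, 40.0)
print(round(d))

# The function is vectorized, so whole columns can be passed at once
print(round(haversine_m(c(116.3, 116.5), c(39.9, 40.0), 116.4, 39.9)))
```

The spherical result differs slightly from the ellipsoidal one, but the difference is small compared with distances of several kilometers.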

4.4 Training the XGBoost Model

set.seed(122)
# Use the same explicitly listed feature columns for training and prediction
features = c('Lng', 'Lat', 'tradeTime', 'DOM', 'followers', 'square',
             'livingRoom', 'drawingRoom', 'kitchen', 'bathRoom', 'floor',
             'buildingType', 'constructionTime', 'renovationCondition',
             'buildingStructure', 'ladderRatio', 'elevator',
             'fiveYearsProperty', 'subway', 'district', 'communityAverage',
             'floorpos', 'distance')
# "reg:squarederror" is the current name of the objective formerly called "reg:linear"
Price_xg = xgboost(data = data.matrix(training[, features]), label = training$price, max.depth = 6, eta = 0.55,  nrounds = 55, objective = "reg:squarederror")
pred_xgb = predict(Price_xg, data.matrix(validation1[, features]))

4.5 Model Analysis

Compute the RMSE

# Average the squared errors over the rows actually in the test set
RMSE = sqrt (mean((validation1$price - pred_xgb)^2))
RMSE

For a test set whose average price is close to 60,000 per square meter, an RMSE of 3,500+ is a very good result
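
To put an RMSE in context, divide it by the mean price to get a rough relative error; about 3,500 against about 60,000 is roughly 6%. A minimal sketch with made-up prices (the helper name `rmse` is my own):

```r
# Root mean squared error over however many rows are passed in
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Made-up actual and predicted prices per square meter
actual    <- c(60000, 55000, 70000)
predicted <- c(58000, 56000, 73000)

err <- rmse(actual, predicted)
print(err)

# Relative error: RMSE as a share of the mean actual price
print(err / mean(actual))
```

Using `mean()` inside the function keeps the divisor tied to the number of rows evaluated, so it cannot drift out of sync with the test set size.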

Importance of each variable:

XGBIMPORTANCE.jpeg

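The community average price, which reflects the property submarket a home belongs to, plays a very large role in the price. Transaction time is also a very important factor, which shows that housing prices are highly sensitive to time.

Future work: the effect of economic conditions and related policies on housing prices

Part 5: Code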

library (tidyverse)
library (psych)
library (xts)
library (tseries)
library (forecast)
library (dplyr)
library (ggplot2)
library (VIM)
library (ggmap)
library (xgboost)
library (dygraphs)
library (knitr)
library (Matrix)
library(RCurl)
library(geosphere)
library(viridis)
library(lubridate)

beijing_house = read.csv ('D:/R語言/NTU/House Price Model/new.csv')

# First, take a look at the dataset
glimpse (beijing_house)

# On inspection, url, id and Cid carry no useful information, so drop them
house_price = select (beijing_house, -url, -id, -Cid)

# floor mixes a Chinese position character with the floor count, so split it into two columns
floor = substring (house_price$floor, 1, 1)
house_price$floorpos = floor
# Entries that are not numbers become NA
house_price$floor = as.numeric(trimws(sub('^.', '', house_price$floor)))

#check the missing data of the dataset
missing_data = data.frame(lapply(house_price,function(x) sum(is.na(x))))
missing_data

#Visualize the missing data 
missing_price = select (house_price, DOM, buildingType, elevator, fiveYearsProperty, subway, communityAverage)
aggr (missing_price, prop = TRUE, numbers = TRUE)

qqnorm(house_price$DOM)
qqline(house_price$DOM)

house_price = as.data.frame(house_price)

# Fill missing DOM values with the median
house_price$DOM<- ifelse(is.na(house_price$DOM),median(house_price$DOM,na.rm=TRUE),house_price$DOM)
aggr (house_price, prop = TRUE, numbers = TRUE)

# Drop rows with any remaining missing values
house_price1 <- na.omit(house_price)
dim (house_price1)
dim (house_price)

glimpse (house_price1)
load(file = "D:/R語言/NTU/House Price Model/beijing_map.RData",verbose = TRUE)

beijing

house_price2 = house_price1 %>% 
  mutate(district = case_when(district == 1 ~ "DongCheng",
                              district == 2 ~ "FengTai",
                              district == 3 ~ "DaXing",
                              district == 4 ~ "YiZhuang",
                              district == 5 ~ "FangShan",
                              district == 6 ~ "ChangPing",
                              district == 7 ~ "ChaoYang",
                              district == 8 ~ "HaiDian",
                              district == 9 ~ "ShiJingShan",
                              district == 10 ~ "XiCheng",
                              district == 11 ~ "TongZhou",
                              district == 12 ~ "ShunYi",
                              district == 13 ~ "MenTouGou"))

house_price2 = house_price2 %>% 
  mutate(buildingType = case_when(buildingType == 1 ~ "Tower",
                                  buildingType == 2 ~ "Bungalow",
                                  buildingType == 3 ~ "Plate/Tower",
                                  buildingType == 4 ~ "Plate"))

house_price2 = house_price2 %>% 
  mutate(buildingStructure = case_when(buildingStructure == 1 ~ "Unavailable",
                                       buildingStructure == 2 ~ "Mixed",
                                       buildingStructure == 3 ~ "Brick/Wood",
                                       buildingStructure == 4 ~ "Brick/Concrete",
                                       buildingStructure == 5 ~ "Steel",
                                       buildingStructure == 6 ~ "Steel/Concrete"))

house_price2 = house_price2 %>% 
  mutate(renovationCondition = case_when(renovationCondition == 1 ~ "Other",
                                         renovationCondition == 2 ~ "Rough",
                                         renovationCondition == 3 ~ "Simplicity",
                                         renovationCondition == 4 ~ "Hardcover"))

house_price2 = house_price2 %>% 
  mutate(elevator = case_when(elevator == 1 ~ "Has_Elevator",
                              elevator != 1 ~ "No_elevator"))
house_price2 = house_price2 %>% 
  mutate(fiveYearsProperty = case_when(fiveYearsProperty == 1 ~ "Ownership < 5Yrs",
                                       fiveYearsProperty != 1 ~ "Ownership > 5Yrs"))

house_price2 = house_price2 %>% 
  mutate(subway = case_when(subway == 1 ~ "Has_Subway",
                            subway != 1 ~ "No_Subway"))

#Price & District
ggplot(house_price2, aes(x = reorder(district, -price), y = price, color = district)) + geom_boxplot() + labs(title = "Prices by District", x = "District", y = "Price per Square Meter") + coord_flip()

#Price Mapping
beijing + geom_point(data = house_price1, aes(x = Lng, y = Lat, color = price), size = 1.3, alpha = .5) + scale_color_viridis()

#Price Comparison on Different BuildingType
ggplot(house_price2, aes(x = buildingType, y = price, color = buildingType)) + geom_boxplot() + labs(title = "Prices by Building Type", y = "Price per Square Meter")

#Price Comparison on Different BuildingStructure
ggplot(house_price2, aes(x = buildingStructure, y = price, color = buildingStructure)) + geom_boxplot() + labs(title = "Prices by Building Structure", y = "Price per Square Meter")

#Price Comparison on Different RenovationCondition
ggplot(house_price2, aes(x = renovationCondition, y = price, color = renovationCondition)) + geom_boxplot() + labs(title = "Prices by Renovation Condition", y = "Price per Square Meter")

#Price Comparison on Elevator
ggplot(house_price2, aes(x = elevator, y = price, color = elevator)) + geom_boxplot() + labs(title = "Prices by Elevator", y = "Price per Square Meter")

#Price Comparison on Subway
ggplot(house_price2, aes(x = subway, y = price, color = subway)) + geom_boxplot() + labs(title = "Prices by Subway Access", y = "Price per Square Meter")

#Five Years Ownership
ggplot(house_price2, aes(x = fiveYearsProperty, y = price, color = fiveYearsProperty)) + geom_boxplot() + labs(title = "Prices by Five-Year Ownership", y = "Price per Square Meter")

house_price1$tradeTime = as.Date(house_price1$tradeTime)
# constructionTime holds a construction year; convert it to a number (non-numeric entries become NA)
house_price1$constructionTime = suppressWarnings(as.numeric(as.character(house_price1$constructionTime)))
validation = house_price1 %>% filter (year (tradeTime)==2018)

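training = house_price1 %>%
  filter ( year(tradeTime) != 2018)
# The first 100 records from 2018 go into the training set, the rest form the test set
training = rbind (training, validation[1:100, ])
validation = validation[101:nrow(validation), ]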

training = training %>% 
  mutate(floorpos = case_when(floorpos == "底" ~ 1,
                              floorpos == "低" ~ 2,
                              floorpos == "中" ~ 3,
                              floorpos == "高" ~ 4,
                              floorpos == "顶" ~ 5,
                              floorpos == "未" ~ 0))

validation = validation %>% 
  mutate(floorpos = case_when(floorpos == "底" ~ 1,
                              floorpos == "低" ~ 2,
                              floorpos == "中" ~ 3,
                              floorpos == "高" ~ 4,
                              floorpos == "顶" ~ 5,
                              floorpos == "未" ~ 0))

# Calculate the distance between the center of Beijing and the location of each house
location = data.frame (Lng = training$Lng, Lat = training$Lat)
location = data.matrix (location)  
# The center is taken at about 116°23'E, 39°54'N, written here in decimal degrees
center = data.frame (Lng = 116 + 23/60, Lat = 39 + 54/60)
center = data.matrix (center)
distance = round (distm(location, center, fun=distVincentyEllipsoid), 0)
distance = data.frame (distance)
training = cbind(training, distance)
training1 = select (training, -totalPrice, -price)

location = data.frame (Lng = validation$Lng, Lat = validation$Lat)
location = data.matrix (location)  
distance = round (distm(location, center, fun=distVincentyEllipsoid), 0)
distance = data.frame (distance)
validation = cbind(validation, distance)
validation1 = select (validation, -totalPrice )


set.seed(122)
#8 0.55 56
# Use the same explicitly listed feature columns for training and prediction
features = c('Lng', 'Lat', 'tradeTime', 'DOM', 'followers', 'square',
             'livingRoom', 'drawingRoom', 'kitchen', 'bathRoom', 'floor',
             'buildingType', 'constructionTime', 'renovationCondition',
             'buildingStructure', 'ladderRatio', 'elevator',
             'fiveYearsProperty', 'subway', 'district', 'communityAverage',
             'floorpos', 'distance')
# "reg:squarederror" is the current name of the objective formerly called "reg:linear"
Price_xg = xgboost(data = data.matrix(training[, features]), label = training$price, max.depth = 6, eta = 0.55,  nrounds = 55, objective = "reg:squarederror")
pred_xgb = predict(Price_xg, data.matrix(validation1[, features]))

# Average the squared errors over the rows actually in the test set
RMSE = sqrt (mean((validation1$price - pred_xgb)^2))
RMSE

# Get the variable importance
model = xgb.dump(Price_xg,with_stats = T) 
model 
names = dimnames(data.matrix(training1[,c(1:23)])) 
importance_matrix = xgb.importance(names[[2]],model=Price_xg) # compute variable importance
names
xgb.plot.importance(importance_matrix[,])

house_price1$tradeTimeM = floor_date(house_price1$tradeTime, "month")
house_price1$tradeTimeY = floor_date(house_price1$tradeTime, "year")


housePrice_time1 = house_price1 %>%  
  filter(tradeTimeM >= ymd("2010-01-01") & tradeTimeM < ymd("2018-12-31")) %>% 
  group_by(tradeTimeM) %>%  
  summarize(mean = mean(price))

HP_xts <- xts(housePrice_time1[,-1], order.by = housePrice_time1$tradeTimeM)
dygraph(HP_xts, main = "Average Monthly Price per Square Meter", 
        ylab = "Average Monthly Price") %>%
  dySeries("mean", label = "Mean Price per Square Meter") %>%
  dyOptions(stackedGraph = TRUE) %>%
  dyRangeSelector(height = 20)

R: Predicting Beijing Real Estate Prices with XGBoost
