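I recently started preparing a house-price analysis project, both to refresh my earlier web-scraping knowledge and to apply the Tableau charting techniques I have been learning. This project is for learning and exchange only and has no commercial purpose.
To make sure the data reflects actual regional house prices, this project uses Lianjia as the site to scrape for second-hand housing listings, starting with second-hand homes in the Qingdao area as an example.
Step 1: import the libraries and modules we need. This project uses the urllib library to fetch pages and XPath to parse them. Because I am used to working with data in DataFrame form, pandas is imported here as well.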
import urllib.request
from lxml import etree
import pandas as pd
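By default urllib identifies itself with a Python-urllib User-Agent, which some sites may refuse. Below is a minimal sketch of building the page request with an explicit header and a timeout. The names build_request, fetch, BASE_URL and HEADERS, as well as the header value itself, are assumptions made for illustration and are not part of the original code.

```python
import urllib.request

# Same URL prefix as used in this post; the page number is appended to it.
BASE_URL = 'https://qd.lianjia.com/ershoufang/pg'

# Hypothetical header value; any real browser User-Agent string would do.
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}


def build_request(page):
    """Return a Request object for one listing page without sending it."""
    return urllib.request.Request(BASE_URL + str(page), headers=HEADERS)


def fetch(page, timeout=10):
    """Download one listing page and return its HTML as text."""
    with urllib.request.urlopen(build_request(page), timeout=timeout) as resp:
        return resp.read().decode('utf-8', 'ignore')
```

With this in place, the download line in the loop could become html = fetch(page). Adding a short time.sleep between pages is also worth considering, to keep the load on the site low.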
Step 2: to make the later conversion to a DataFrame go smoothly, the page-parsing part is written in a rather fine-grained way. If you are not used to DataFrames, you can use a different data structure.
house_info = []
for page in range(1,101):
    url = 'https://qd.lianjia.com/ershoufang/pg'+str(page)
    html = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
    selector = etree.HTML(html)
    # each <li> with this class is one listing on the page
    page_info = selector.xpath('//li[@class="clear LOGCLICKDATA"]')
    print('Scraping page '+str(page))
    for i in range(len(page_info)):
        # one row per listing; '.' marks a field that was not found
        house_infor_one = []
        title = page_info[i].xpath('div[@class="info clear"]/div[@class="title"]/a/text()')
        house_infor_one.extend(title if title else ['.'])
        way = page_info[i].xpath('div[@class="info clear"]/div[@class="title"]/span/text()')
        house_infor_one.extend(way if way else ['.'])
        road = page_info[i].xpath('div[@class="info clear"]/div[@class="flood"]/div/a/text()')
        house_infor_one.extend(road if road else ['.'])
        community = page_info[i].xpath('div[@class="info clear"]/div[@class="address"]/div/a/text()')
        house_infor_one.extend(community if community else ['.'])
        house_des = page_info[i].xpath('div[@class="info clear"]/div[@class="address"]/div/text()')
        house_infor_one.extend(house_des if house_des else ['.'])
        floor = page_info[i].xpath('div[@class="info clear"]/div[@class="flood"]/div/text()')
        house_infor_one.extend(floor if floor else ['.'])
        popularity = page_info[i].xpath('div[@class="info clear"]/div[@class="followInfo"]/text()')
        house_infor_one.extend(popularity if popularity else ['.'])
        subway = page_info[i].xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="subway"]/text()')
        house_infor_one.extend(subway if subway else ['.'])
        taxfree = page_info[i].xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="taxfree"]/text()')
        house_infor_one.extend(taxfree if taxfree else ['.'])
        haskey = page_info[i].xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="haskey"]/text()')
        house_infor_one.extend(haskey if haskey else ['.'])
        total_price = page_info[i].xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[1]/span/text()')
        house_infor_one.extend(total_price if total_price else ['.'])
        price_unit = page_info[i].xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[1]/text()')
        house_infor_one.extend(price_unit if price_unit else ['.'])
        per_price = page_info[i].xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[2]/span/text()')
        house_infor_one.extend(per_price if per_price else ['.'])
        house_info.append(house_infor_one)
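One thing to keep in mind with the loop above: a text() query returns a list, and extend appends every item in it, so a row can end up with more entries than there were queries. A possible way to keep each row at exactly one value per query is sketched below. The helper name one_value and the sample inputs are made up for illustration; they stand in for what page_info[i].xpath(...) might return.

```python
def one_value(nodes, placeholder='.'):
    """Collapse an XPath text() result (a list of strings) into one string.

    Whitespace-only pieces are dropped, the remaining pieces are joined with
    a space, and the placeholder is returned when nothing is left.
    """
    parts = [s.strip() for s in nodes if s.strip()]
    return ' '.join(parts) if parts else placeholder


# Made-up stand-ins for XPath results: one hit, no hit, several text nodes.
row = [
    one_value(['Sample title']),
    one_value([]),
    one_value([' a ', '\n', 'b']),
]
```

In the loop this would be used as house_infor_one.append(one_value(title)) in place of the extend call. Note that every row would then have exactly as many entries as there are queries, so the list of column names would need to have the same length.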
Step 3: convert the formatted data into a DataFrame, name its columns, and save it to a local file. With that, the scrape is complete.
house_df = pd.DataFrame(house_info)
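house_df.columns = ['listing description', 'listing source', 'address (road)', 'community name', 'layout info', 'floor', 'popularity', 'distance to subway', 'ownership certificate status (tax)', 'viewing time (key)', 'total price', 'total price unit', 'unit price (per sqm)', 'notes']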
house_df.to_excel('D:/Tsingtao.xlsx')  # .xlsx instead of .xls: pandas 2.0 removed the xlwt engine that wrote .xls files
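For readers who prefer not to use a DataFrame, as mentioned in step 2, here is a minimal sketch of one alternative: keep each listing as a dict and write the rows with the standard csv module. The field names, the sample rows and the write_rows helper are hypothetical and only show the shape of the approach.

```python
import csv
import io

# Hypothetical field names and sample rows, for illustration only.
FIELDS = ['title', 'community', 'total_price']
rows = [
    {'title': 'Listing A', 'community': 'Community X', 'total_price': '100'},
    {'title': 'Listing B', 'community': 'Community Y'},  # price is missing
]


def write_rows(rows, out):
    """Write dict rows as CSV, filling any missing field with '.'."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval='.')
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained; a real run would
# pass a file opened with newline='' instead.
buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
```

When writing to disk, opening the file with encoding='utf-8-sig' helps Excel recognize the file as UTF-8, which matters here because the scraped values are Chinese text.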