I've been learning Python web scraping for a little over a week. While browsing practice projects, it struck me that Lianjia's second-hand housing transaction data would be well worth scraping, and a good excuse to look at price trends too. Since I've been learning Scrapy lately, I used Scrapy with XPath for the crawl and stored the results in a MySQL database so they're easy to query later.

Before scraping, the first step was to analyze Lianjia's transaction listing page. One look and... what's with these prices all showing up as "2**萬"? Honestly, my first thought was that you'd have to open the app to see the real numbers — how is this even scrapable? But after poking at the page with the developer tools, I found something nice: https://hz.lianjia.com/chengjiao/103102419296.html. Click through and there it is — the detail page shows the full price. Suddenly Lianjia seemed pretty low-effort, and I got a little excited!

Now to start for real. The plan: first extract the hidden detail-page links from the listing pages, then go into each detail page and scrape the specifics. The code is fairly simple:
import scrapy
from ..items import LianjiaItem  # relative import; adjust to match your project's items module

class Lianjiaspider(scrapy.Spider):
    name = 'lianjia1'
    allowed_domains = ['hz.lianjia.com']
    start_urls = []
    regions = {'xihu': '西湖',
               'xiacheng': '下城',
               'jianggan': '江干',
               'gongshu': '拱墅',
               'shangcheng': '上城',
               'binjiang': '濱江',
               'yuhang': '余杭',
               'xiaoshan': '蕭山',
               'xiasha': '下沙'}
    # build the start URLs at class-definition time: 10 pages for each district
    for region in list(regions.keys()):
        for i in range(1, 11):
            start_urls.append('https://hz.lianjia.com/chengjiao/' + region + '/pg' + str(i) + '/')

    def parse(self, response):
        # pull the hidden detail-page links out of the listing HTML
        li_item = response.xpath('//ul[@class="listContent"]')
        for li in li_item:
            # relative .// keeps the search inside this ul
            hrefs = li.xpath('.//a[@class="img"]/@href').extract()
            for href in hrefs:
                # follow each detail page and keep scraping there
                yield scrapy.Request(url=href, callback=self.more, dont_filter=True)
Then visit each of those hidden detail pages and scrape them one by one:
    def more(self, response):
        item = LianjiaItem()
        info1 = ''
        # district
        area = response.xpath('//section[1]/div[1]/a[3]/text()').extract()[0]
        item['region'] = area.replace("二手房成交價格", "")
        # community (residential complex) name, taken from the page title
        community = response.xpath('//title/text()').extract()[0]
        item['community'] = community[:community.find(" ", 1, len(community))]
        # transaction date
        deal_time = response.xpath('//div[@class="wrapper"]/span/text()').extract()[0]
        item['deal_time'] = deal_time.replace("鏈家成交", "").strip()
        # total price (in units of 10,000 CNY)
        item['total_price'] = response.xpath('//span[@class="dealTotalPrice"]/i/text()').extract()[0] + '萬'
        # unit price (CNY per square meter)
        item['unit_price'] = response.xpath('//div[@class="price"]/b/text()').extract()[0] + '元/平'
        # the basic-attributes list; fields below are picked out by position
        introContent = response.xpath('//div[@class="content"]/ul/li/text()').extract()
        # layout (rooms/halls)
        item['style'] = introContent[0].strip()
        # floor
        item['floor'] = introContent[1].strip()
        # floor area
        item['size'] = introContent[2].strip()
        # orientation
        item['orientation'] = introContent[6].strip()
        # year built
        item['build_year'] = introContent[7].strip()
        # decoration / renovation status
        item['decoration'] = introContent[8].strip()
        # property ownership term
        item['property_time'] = introContent[12].strip()
        # elevator availability
        item['elevator'] = introContent[13].strip()
        # other info such as the surroundings
        infos = response.xpath('//div[@class="content"]/text()').extract()
        if len(infos) != 0:
            for info in infos:
                # collapse all whitespace inside each fragment
                info = "".join(info.split())
                info1 += info
            item['info'] = info1
        else:
            item['info'] = '暫無信息'  # i.e. "no information available"
        return item
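With the spider in place, it can be run from the project root in the usual Scrapy way (assuming a standard layout created with scrapy startproject):

scrapy crawl lianjia1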
Here I only scraped pages 1 through 10. If you want the complete data, you first have to fetch the total page count before crawling — it is not always 100. The XPath for it is

//div[@class="page-box house-lst-page-box"]/@page-data

which returns something like {"totalPage":87,"curPage":1}; all that is left is to pull the 87 out of it.
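That attribute is plain JSON, so it can be parsed directly. A minimal sketch — the helper name and wiring are my own illustration, not code from the spider above:

import json

def total_pages(response):
    # page-data looks like: {"totalPage":87,"curPage":1}
    page_data = response.xpath(
        '//div[@class="page-box house-lst-page-box"]/@page-data').extract_first()
    return json.loads(page_data)['totalPage'] if page_data else 1

You could call this from a first-pass callback over page 1 of each district and then yield one request per page instead of hard-coding range(1, 11).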
You can follow the crawl's progress in the Scrapy log while it runs.
Next comes storing the scraped data in the database. To be honest, I'm an Android developer and don't know databases well (I did consider just dumping everything to a txt file — really just laziness on my part), so it took me half a day to get this working. For connecting Python to the database I used peewee, a simple, lightweight Python ORM. After studying the docs for a while, I suddenly wondered how this was supposed to work together with Scrapy; no matter, I kept digging, and it turned out to be simple too.
First, create a model:
# -*- coding: utf-8 -*-
from peewee import *

db = MySQLDatabase('lianjia', host='localhost', port=3306, user='root', passwd='12345678',
                   charset='utf8')

# define base model
class BaseModel(Model):
    class Meta:
        database = db

class LianjiaInfo(BaseModel):
    region = CharField()
    community = CharField()
    deal_time = CharField()
    total_price = CharField()
    unit_price = CharField()
    style = CharField()
    floor = CharField()
    size = CharField()
    orientation = CharField()
    build_year = CharField()
    decoration = CharField()
    property_time = CharField()
    elevator = CharField()
    info = TextField()

db.connect()
db.create_tables([LianjiaInfo], safe=True)
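Before wiring the model into Scrapy, it's worth a quick throwaway check that the connection and the table actually work; the values below are made up purely for illustration:

if __name__ == '__main__':
    LianjiaInfo.create(region='西湖', community='測試小區(qū)', deal_time='2017.10.01',
                       total_price='300萬', unit_price='30000元/平', style='2室1廳',
                       floor='中樓層', size='89㎡', orientation='南', build_year='2005',
                       decoration='精裝', property_time='70年', elevator='有', info='暫無信息')
    print(LianjiaInfo.select().count())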
Then insert each item straight into the table from pipelines.py. The create() call does the work; it just needs to live inside a standard process_item pipeline:
from .models import LianjiaInfo  # assumes the model above was saved as models.py in the project

class LianjiaPipeline(object):  # class name is illustrative
    def process_item(self, item, spider):
        LianjiaInfo.create(
            region=item['region'], community=item['community'], deal_time=item['deal_time'],
            total_price=item['total_price'], unit_price=item['unit_price'], style=item['style'],
            floor=item['floor'], size=item['size'], orientation=item['orientation'],
            build_year=item['build_year'], decoration=item['decoration'],
            property_time=item['property_time'], elevator=item['elevator'], info=item['info'])
        return item
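Scrapy only calls a pipeline that has been enabled, so it also needs an entry in settings.py — assuming the project module is named lianjia:

ITEM_PIPELINES = {
    'lianjia.pipelines.LianjiaPipeline': 300,
}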
OK, let's look at the result: 2,516 rows in total. By rights, at 30 listings per page, 10 pages, and 9 districts, there should be 2,700 rows, which means 184 rows went missing. Forgive me — I haven't been doing Python scraping for long, and I honestly don't understand why.
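To start tracking down where the missing rows went, a per-district count is a quick first check. This is a one-off peewee query, not part of the project code:

from peewee import fn
from models import LianjiaInfo  # adjust to wherever the model file lives

query = (LianjiaInfo
         .select(LianjiaInfo.region, fn.COUNT(LianjiaInfo.id).alias('n'))
         .group_by(LianjiaInfo.region))
for row in query:
    print(row.region, row.n)  # any district short of 300 is where rows were lost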
The full code is up on GitHub; anyone interested can clone it and take a look.