I've been learning Python web scraping for a little over a week. While browsing practice projects, it struck me that Lianjia's second-hand housing transaction data would be well worth scraping, and a good excuse to look at price trends too. Since I've been learning Scrapy lately, I used Scrapy with XPath for the crawl and stored the results in a MySQL database so they're easy to query later.

Before scraping, the first step was to analyze Lianjia's transaction listing page. One look and... what's with these prices all showing up as "2**萬"? Honestly, my first thought was that you'd have to open the app to see the real numbers — how is this even scrapable? But after poking at the page with the developer tools, I found something nice: https://hz.lianjia.com/chengjiao/103102419296.html. Click through and there it is — the detail page shows the full price. Suddenly Lianjia seemed pretty low-effort, and I got a little excited!

Now to start for real. The plan: first extract the hidden detail-page links from the listing pages, then go into each detail page and scrape the specifics. The code is fairly simple:
import scrapy
from ..items import LianjiaItem  # relative import; adjust to match your project's items module

class Lianjiaspider(scrapy.Spider):
    name = 'lianjia1'
    allowed_domains = ['hz.lianjia.com']
    start_urls = []
    regions = {'xihu': '西湖',
               'xiacheng': '下城',
               'jianggan': '江干',
               'gongshu': '拱墅',
               'shangcheng': '上城',
               'binjiang': '濱江',
               'yuhang': '余杭',
               'xiaoshan': '蕭山',
               'xiasha': '下沙'}
    # build the start URLs at class-definition time: 10 pages for each district
    for region in list(regions.keys()):
        for i in range(1, 11):
            start_urls.append('https://hz.lianjia.com/chengjiao/' + region + '/pg' + str(i) + '/')

    def parse(self, response):
        # pull the hidden detail-page links out of the listing HTML
        li_item = response.xpath('//ul[@class="listContent"]')
        for li in li_item:
            # relative .// keeps the search inside this ul
            hrefs = li.xpath('.//a[@class="img"]/@href').extract()
            for href in hrefs:
                # follow each detail page and keep scraping there
                yield scrapy.Request(url=href, callback=self.more, dont_filter=True)
Then visit each of those hidden detail pages and scrape them one by one:
    def more(self, response):
        item = LianjiaItem()
        info1 = ''
        # district
        area = response.xpath('//section[1]/div[1]/a[3]/text()').extract()[0]
        item['region'] = area.replace("二手房成交價格", "")
        # community (residential complex) name, taken from the page title
        community = response.xpath('//title/text()').extract()[0]
        item['community'] = community[:community.find(" ", 1, len(community))]
        # transaction date
        deal_time = response.xpath('//div[@class="wrapper"]/span/text()').extract()[0]
        item['deal_time'] = deal_time.replace("鏈家成交", "").strip()
        # total price (in units of 10,000 CNY)
        item['total_price'] = response.xpath('//span[@class="dealTotalPrice"]/i/text()').extract()[0] + '萬'
        # unit price (CNY per square meter)
        item['unit_price'] = response.xpath('//div[@class="price"]/b/text()').extract()[0] + '元/平'
        # the basic-attributes list; fields below are picked out by position
        introContent = response.xpath('//div[@class="content"]/ul/li/text()').extract()
        # layout (rooms/halls)
        item['style'] = introContent[0].strip()
        # floor
        item['floor'] = introContent[1].strip()
        # floor area
        item['size'] = introContent[2].strip()
        # orientation
        item['orientation'] = introContent[6].strip()
        # year built
        item['build_year'] = introContent[7].strip()
        # decoration / renovation status
        item['decoration'] = introContent[8].strip()
        # property ownership term
        item['property_time'] = introContent[12].strip()
        # elevator availability
        item['elevator'] = introContent[13].strip()
        # other info such as the surroundings
        infos = response.xpath('//div[@class="content"]/text()').extract()
        if len(infos) != 0:
            for info in infos:
                # collapse all whitespace inside each fragment
                info = "".join(info.split())
                info1 += info
            item['info'] = info1
        else:
            item['info'] = '暫無信息'  # i.e. "no information available"
        return item
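With the spider in place, it can be run from the project root in the usual Scrapy way (assuming a standard layout created with scrapy startproject):

scrapy crawl lianjia1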
Here I only scraped pages 1 through 10. If you want the complete data, you first have to fetch the total page count before crawling — it is not always 100. The XPath for it is

//div[@class="page-box house-lst-page-box"]/@page-data

which returns something like {"totalPage":87,"curPage":1}; all that is left is to pull the 87 out of it.
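That attribute is plain JSON, so it can be parsed directly. A minimal sketch — the helper name and wiring are my own illustration, not code from the spider above:

import json

def total_pages(response):
    # page-data looks like: {"totalPage":87,"curPage":1}
    page_data = response.xpath(
        '//div[@class="page-box house-lst-page-box"]/@page-data').extract_first()
    return json.loads(page_data)['totalPage'] if page_data else 1

You could call this from a first-pass callback over page 1 of each district and then yield one request per page instead of hard-coding range(1, 11).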
You can follow the crawl's progress in the Scrapy log while it runs.
Next comes storing the scraped data in the database. To be honest, I'm an Android developer and don't know databases well (I did consider just dumping everything to a txt file — really just laziness on my part), so it took me half a day to get this working. For connecting Python to the database I used peewee, a simple, lightweight Python ORM. After studying the docs for a while, I suddenly wondered how this was supposed to work together with Scrapy; no matter, I kept digging, and it turned out to be simple too.
First, create a model:
# -*- coding: utf-8 -*-
from peewee import *

db = MySQLDatabase('lianjia', host='localhost', port=3306, user='root', passwd='12345678',
                   charset='utf8')

# define base model
class BaseModel(Model):
    class Meta:
        database = db

class LianjiaInfo(BaseModel):
    region = CharField()
    community = CharField()
    deal_time = CharField()
    total_price = CharField()
    unit_price = CharField()
    style = CharField()
    floor = CharField()
    size = CharField()
    orientation = CharField()
    build_year = CharField()
    decoration = CharField()
    property_time = CharField()
    elevator = CharField()
    info = TextField()

db.connect()
db.create_tables([LianjiaInfo], safe=True)
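Before wiring the model into Scrapy, it's worth a quick throwaway check that the connection and the table actually work; the values below are made up purely for illustration:

if __name__ == '__main__':
    LianjiaInfo.create(region='西湖', community='測試小區(qū)', deal_time='2017.10.01',
                       total_price='300萬', unit_price='30000元/平', style='2室1廳',
                       floor='中樓層', size='89㎡', orientation='南', build_year='2005',
                       decoration='精裝', property_time='70年', elevator='有', info='暫無信息')
    print(LianjiaInfo.select().count())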
Then insert each item straight into the table from pipelines.py. The create() call does the work; it just needs to live inside a standard process_item pipeline:
from .models import LianjiaInfo  # assumes the model above was saved as models.py in the project

class LianjiaPipeline(object):  # class name is illustrative
    def process_item(self, item, spider):
        LianjiaInfo.create(
            region=item['region'], community=item['community'], deal_time=item['deal_time'],
            total_price=item['total_price'], unit_price=item['unit_price'], style=item['style'],
            floor=item['floor'], size=item['size'], orientation=item['orientation'],
            build_year=item['build_year'], decoration=item['decoration'],
            property_time=item['property_time'], elevator=item['elevator'], info=item['info'])
        return item
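Scrapy only calls a pipeline that has been enabled, so it also needs an entry in settings.py — assuming the project module is named lianjia:

ITEM_PIPELINES = {
    'lianjia.pipelines.LianjiaPipeline': 300,
}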
OK, let's look at the result: 2,516 rows in total. By rights, at 30 listings per page, 10 pages, and 9 districts, there should be 2,700 rows, which means 184 rows went missing. Forgive me — I haven't been doing Python scraping for long, and I honestly don't understand why.
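To start tracking down where the missing rows went, a per-district count is a quick first check. This is a one-off peewee query, not part of the project code:

from peewee import fn
from models import LianjiaInfo  # adjust to wherever the model file lives

query = (LianjiaInfo
         .select(LianjiaInfo.region, fn.COUNT(LianjiaInfo.id).alias('n'))
         .group_by(LianjiaInfo.region))
for row in query:
    print(row.region, row.n)  # any district short of 300 is where rows were lost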
The full code is up on GitHub; anyone interested can clone it and take a look.