Where extra explanation is needed, this article draws on the Zhihu article 屌絲想買(mǎi)房…… and the Scrapy getting-started tutorial.
This tutorial reaches the goal stated in the title in the following five steps:
1. Create a Scrapy project
This tutorial recommends installing Anaconda3. Anaconda makes it easy to keep multiple Python versions side by side, switch between them, and install all kinds of third-party packages. Anaconda uses the conda tool/command to manage packages and environments, and it already bundles Python and the related tooling.
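For instance, a clean environment for this project can be set up roughly like this (a sketch; the environment name lianjia is arbitrary):

```
conda create -n lianjia python=3.6   # new isolated environment
conda activate lianjia               # "source activate lianjia" on older conda versions
conda install -c conda-forge scrapy  # install Scrapy into it
```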
- 1. Create the project:
scrapy startproject lianjia
- 2. Change into the project directory:
cd lianjia
- 3. Generate the spider:
scrapy genspider Ljia sh.lianjia.com/zufang
Project files:
scrapy.cfg — the project's configuration information
items.py — templates for the scraped data, used to structure it
pipelines.py — data-processing behaviors, e.g., persisting the structured data to a database
settings.py — configuration such as recursion depth, concurrency, and download delay
spiders/ — the directory of spiders that do the real work: extracting and cleaning data from the pages
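For reference, the skeleton that startproject and genspider generate looks like this (middlewares.py appears on newer Scrapy versions):

```
lianjia/
├── scrapy.cfg
└── lianjia/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── Ljia.py
```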
2. Define the Item to extract
The Item is the template for the data we want to scrape, so we first edit the items.py file under lianjia/lianjia.
Look at the example screenshot of a rental listing and decide up front which key pieces of information you want to extract.
(Figure: rental listing example)
The target fields I define are fairly detailed (verbose, you might say); see the code comments for what each field means.
```python
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class LianjiaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()       # listing title, e.g. 申金大廈,好樓層,鑰匙在鏈家,鏈家好房
    roomType = scrapy.Field()    # layout: number of bedrooms and living rooms
    roomName = scrapy.Field()    # building name, e.g. 申金大廈
    roomPrice = scrapy.Field()   # rent, per month
    roomStatus = scrapy.Field()  # viewing status, e.g. 隨時(shí)看房 (view anytime)
    roomDate = scrapy.Field()    # date listed, e.g. 2017.08.06
    areaB = scrapy.Field()       # district, e.g. 浦東, 黃浦
    street = scrapy.Field()      # street/transit info, e.g. 距離5號(hào)線金平路站794米
```
3屈呕、編寫(xiě)爬取網(wǎng)站的spider并提取Item
網(wǎng)頁(yè)解析
可以看到網(wǎng)頁(yè)元素還是很好抽取的,但是我們要先做一些準(zhǔn)備工作
- 1棺亭、把setting.py里面的ROBOT協(xié)議尊守改為False,不修改爬不了任何數(shù)據(jù)).
- 2虎眨、添加瀏覽器代理
- 3、取消注釋
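A minimal sketch of the settings.py entries involved (the User-Agent string is only an example; any current browser UA will do):

```python
# settings.py (excerpt)

# do not obey robots.txt, otherwise every request gets filtered out
ROBOTSTXT_OBEY = False

# present ourselves as a regular browser (example UA string)
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36')

# enable the pipeline written in step 4
ITEM_PIPELINES = {
    'lianjia.pipelines.LianjiaPipeline': 300,
}
```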
The code below uses XPath to pull out the target fields; XPath is among the most convenient and fastest ways to extract HTML elements. For details, see an XPath syntax reference.
```python
# -*- coding: utf-8 -*-
import re

import scrapy

from lianjia.items import LianjiaItem


class LjiaSpider(scrapy.Spider):
    name = 'Ljia'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['sh.lianjia.com']
    start_urls = ['https://sh.lianjia.com/zufang']

    def parse(self, response):
        # each listing lives in an <li><div class="info-panel"> block
        for i in response.xpath('.//li/div[@class="info-panel"]'):
            item = LianjiaItem()
            item['title'] = i.xpath('.//h2/a/@title').extract_first()
            item['roomName'] = i.xpath('.//div[@class="where"]/a/span/text()').extract_first()
            # the layout text is padded with full-width spaces; strip them
            item['roomType'] = (i.xpath('.//div[@class="where"]/span/text()').extract_first() or '').strip()
            # roomDesc = i.xpath('.//div[@class="con"]').extract_first()  # full description block, currently unused
            item['roomPrice'] = i.xpath('.//div[@class="price"]/span/text()').extract_first()
            item['roomStatus'] = i.xpath('.//span[@class="anytime-ex"]/span/text()').extract_first()
            # the raw text looks like "2017.08.06 上架" wrapped in whitespace; keep only the date
            roomDate = i.xpath('.//div[@class="col-3"]/div[@class="price-pre"]/text()').extract_first()
            m = re.search(r'\d{4}\.\d{2}\.\d{2}', roomDate or '')
            item['roomDate'] = m.group() if m else None
            item['areaB'] = i.xpath('.//div[@class="con"]/a/text()').extract_first()
            item['street'] = i.xpath('.//span[@class="fang-subway-ex"]/span/text()').extract_first()
            yield item

        # follow the "next page" link until there is none
        temp_url = response.xpath('//a[@gahref="results_next_page"]/@href').extract_first()
        if temp_url:
            url = 'https://sh.lianjia.com' + temp_url
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)
```
Note: extract_first() returns the first matched node serialized as a string (or None when nothing matches). dont_filter=True tells Scrapy's scheduler not to drop the request as a duplicate, which matters here because the server may redirect the pagination URLs.
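If an XPath expression comes back empty, it is worth testing it interactively in the Scrapy shell before digging through the spider:

```
scrapy shell 'https://sh.lianjia.com/zufang'
>>> response.xpath('.//li/div[@class="info-panel"]')
>>> response.xpath('.//h2/a/@title').extract_first()
```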
4凄敢、編寫(xiě)Item PipeLine來(lái)存儲(chǔ)提取到的Item(即數(shù)據(jù))
/lianjia/lianjia/pipelines 文件
```python
import pymysql


class LianjiaPipeline(object):
    def __init__(self):
        # credentials and database name are masked; fill in your own
        self.conn = pymysql.connect(host='localhost', user='root', passwd='****',
                                    db='***', charset='utf8')
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        title = item.get('title', 'N/A')
        roomType = item.get('roomType', 'N/A')
        roomName = item.get('roomName', 'N/A')
        # roomSize = item.get('roomSize', 'N/A')
        # roomDesc = item.get('roomDesc', 'N/A')
        roomPrice = item.get('roomPrice', 'N/A')
        roomStatus = item.get('roomStatus', 'N/A')
        roomDate = item.get('roomDate', 'N/A')
        areaB = item.get('areaB', 'N/A')
        street = item.get('street', 'N/A')
        sql = ('insert into lianjia(title, roomType, roomName, roomPrice, '
               'roomStatus, roomDate, areaB, street) '
               'values(%s, %s, %s, %s, %s, %s, %s, %s)')
        self.cur.execute(sql, (title, roomType, roomName, roomPrice,
                               roomStatus, roomDate, areaB, street))
        self.conn.commit()
        # return the item so later pipelines and feed exports still see it
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```
Notes:
1. Install the pymysql package: pip install pymysql
2. self.cur (the cursor object) operates on the tables; self.conn (the connection object) commits those operations to the database.
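The pipeline assumes the lianjia table already exists. A minimal sketch of a matching schema (the database name housedb is hypothetical and the column sizes are guesses; adjust to taste):

```python
import pymysql

conn = pymysql.connect(host='localhost', user='root', passwd='****', charset='utf8')
cur = conn.cursor()
cur.execute('create database if not exists housedb character set utf8')  # hypothetical name
cur.execute('use housedb')
cur.execute('''
    create table if not exists lianjia (
        id         int auto_increment primary key,
        title      varchar(255),
        roomType   varchar(64),
        roomName   varchar(128),
        roomPrice  varchar(32),
        roomStatus varchar(64),
        roomDate   varchar(32),
        areaB      varchar(64),
        street     varchar(128)
    ) character set utf8
''')
conn.commit()
cur.close()
conn.close()
```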
5. Set the spider in motion
The crawler project above is ready; now we can run it and wait for the results.
- Initial debugging stage:
First comment out the SQL execution in the pipelines file and run scrapy crawl Ljia -o house.csv
as a first pass. Don't rush to load the database; watch the spider's progress in the terminal. If it errors out, read the messages and troubleshoot; if the crawl succeeds, open house.csv under the lianjia directory and inspect the results.
- Database import stage:
If the step above succeeds, move on to the database import: restore the SQL statements, run scrapy crawl Ljia, and watch the import in the terminal. If anything errors, track the problem down; once the rows are written, this experiment is done. If you hit database encoding problems, try to resolve them yourself (one common fix is sketched below).
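One frequent cause of encoding errors with Chinese text is a table created under MySQL's default latin1 charset. A hedged fix, reusing the cur/conn objects from the schema sketch above, is to convert the table to utf8mb4:

```python
# run once against a table whose charset is wrong (e.g. latin1)
cur.execute('alter table lianjia convert to character set utf8mb4')
conn.commit()
```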