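The Scrapy Framework
Scrapy is an application framework written in pure Python for crawling websites and extracting structured data. It is used in a wide range of applications.
That is the strength of a framework: you only need to customize a few modules to build a spider that fetches web page content and images.
Scrapy uses the Twisted ['twɪstɪd] asynchronous networking framework to handle network communication. This speeds up downloads without requiring you to implement asynchronous I/O yourself. Scrapy also exposes a range of middleware interfaces, so it can be adapted flexibly to different requirements.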
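To see why asynchronous I/O speeds up downloading, here is a minimal sketch. It does not use Twisted or Scrapy: the standard-library asyncio module stands in for the event loop, and fake_download, crawl, and the example.com URLs are hypothetical names invented for illustration. Each simulated download waits 0.1 seconds, yet all three finish in roughly the time of one because the waits overlap.

```python
import asyncio
import time


async def fake_download(url: str, delay: float = 0.1) -> str:
    # Stand-in for a network request: the sleep yields control to the
    # event loop, so other downloads can progress in the meantime.
    await asyncio.sleep(delay)
    return f"body of {url}"


async def crawl(urls):
    # Start all downloads at once and wait for every result.
    return await asyncio.gather(*(fake_download(u) for u in urls))


urls = [f"http://example.com/page/{i}" for i in range(3)]  # hypothetical URLs
start = time.perf_counter()
bodies = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start
print(len(bodies), f"pages in {elapsed:.2f}s")
```

Run sequentially, the same three waits would add up to about 0.3 seconds. Scrapy applies the same idea to real HTTP requests through Twisted.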
Scrapy architecture diagram
Installing on Windows
Python 3
Upgrade pip:
pip3 install --upgrade pip
Install the Scrapy framework with pip
pip3 install Scrapy
Installing on Ubuntu
Install the Scrapy framework with pip3
sudo pip3 install scrapy
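If the installation fails, try adding these dependencies:
Install the non-Python dependencies
sudo apt-get install python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
After installation, type scrapy in a terminal. If it prints the Scrapy version and a list of available commands, the installation succeeded.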
Creation commands
Create a spider project
scrapy startproject jobboleproject
Create a new spider file
scrapy genspider jobbole jobbole.com
Run the spider
scrapy crawl jobbole
items.py file
import scrapy


class XiachufangspiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()      # recipe name
    img = scrapy.Field()       # cover image URL
    yongliao = scrapy.Field()  # ingredients
    zuofa = scrapy.Field()     # preparation steps
    path = scrapy.Field()      # local path of the downloaded image
pipelines.py file
import json
import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


class XiachufangspiderPipeline(object):
    def __init__(self):
        # Append one JSON object per line; UTF-8 so non-ASCII text is written as-is
        self.f = open('xiachufang.json', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()


# Read the image storage directory from the project settings
IMAGES_STORE = get_project_settings().get('IMAGES_STORE')


# Download images
class XiachufangImgspiderPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the image; the download results are passed to item_completed
        if item.get('img'):
            yield scrapy.Request(url=item['img'])

    def item_completed(self, results, item, info):
        # results is a list of (ok, info) tuples; info['path'] is the stored
        # file path relative to IMAGES_STORE
        imgs = [x['path'] for ok, x in results if ok]
        if imgs:
            # Rename the downloaded file after the recipe name
            new_path = os.path.join(IMAGES_STORE, item['name'] + '.jpg')
            os.rename(os.path.join(IMAGES_STORE, imgs[0]), new_path)
            item['path'] = os.path.abspath(new_path)
        else:
            item['path'] = ""
        return item
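Neither pipeline runs until it is enabled in the project's settings.py, and the image pipeline also needs IMAGES_STORE to be set (Scrapy's ImagesPipeline additionally requires the Pillow library). The fragment below is a sketch under stated assumptions: the package name XiachufangSpider is inferred from the spider's import line, and the priority numbers and the images/ directory are illustrative choices, not values from the original project.

```python
# settings.py (fragment). Module paths assume the project package is
# named XiachufangSpider; adjust them to match your own project.
ITEM_PIPELINES = {
    # Lower numbers run first: download and rename the image first ...
    "XiachufangSpider.pipelines.XiachufangImgspiderPipeline": 200,
    # ... then write the item, which now carries its 'path', to the JSON file.
    "XiachufangSpider.pipelines.XiachufangspiderPipeline": 300,
}

# Directory where downloaded images are stored (assumed value). The trailing
# slash keeps paths correct even if they are built by string concatenation.
IMAGES_STORE = "images/"
```

Running the image pipeline before the JSON pipeline matters here, because otherwise each item would be written out before its path field is filled in.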
xiachufang.py file
# -*- coding: utf-8 -*-
import scrapy

from XiachufangSpider.items import XiachufangspiderItem


class XiachufangSpider(scrapy.Spider):
    name = 'xiachufang'
    allowed_domains = ['xiachufang.com']
    start_urls = ['http://www.xiachufang.com/category/40076/']

    def parse(self, response):
        # Each <li> in the category listing is one recipe entry
        div_list = response.xpath('//div[@class="pure-u-3-4 category-recipe-list"]//div[@class="normal-recipe-list"]//li')
        for div in div_list:
            url = div.xpath('.//p[@class="name"]/a/@href').extract_first('')
            if not url:
                continue  # skip entries without a link
            print(url)
            # Resolve the relative href against the page URL
            yield scrapy.Request(url=response.urljoin(url), callback=self.parseDetail)

    def parseDetail(self, response):
        item = XiachufangspiderItem()
        name = response.xpath('//h1/text()').extract_first('').replace('\n', '').strip()
        img = response.xpath('//div[@class="cover image expandable block-negative-margin"]/img/@src').extract_first('').replace('\n', '').strip()
        # Ingredients: join all table cell text and strip whitespace
        yongliao = ''.join(response.xpath('//tr/td//text()').extract()).replace('\n', '').replace(' ', '')
        # Preparation steps: join the text of every step paragraph
        zuofa = ''.join(response.xpath('//div[@class="steps"]//p/text()').extract())
        item['name'] = name
        item['img'] = img
        item['yongliao'] = yongliao
        item['zuofa'] = zuofa
        print(item)
        yield item
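The spider turns each href from the listing page into a full URL for the detail request. Prefixing the site root works for root-relative hrefs, but URL resolution also handles absolute and page-relative links correctly. The sketch below uses the standard library's urljoin to show the difference; Scrapy's response.urljoin performs the same kind of resolution against the response URL. The href values are hypothetical examples, not links taken from the site.

```python
from urllib.parse import urljoin

# URL of the page the links were found on (the spider's start URL)
page_url = "http://www.xiachufang.com/category/40076/"

# Hypothetical hrefs in the three forms a page may contain
hrefs = [
    "/recipe/100/",                           # root-relative
    "?page=2",                                # relative to the current page
    "http://www.xiachufang.com/recipe/100/",  # already absolute
]

resolved = [urljoin(page_url, href) for href in hrefs]
for url in resolved:
    print(url)
```

With plain concatenation, the third href would produce a malformed URL containing the site root twice, and the second would lose the category path.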