About the Scrapy Framework

Installing Scrapy

Official Scrapy documentation: http://doc.scrapy.org/en/latest

Community-maintained Chinese Scrapy documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

Installing on Windows

Python 3

Upgrade pip:

pip3 install --upgrade pip

Install the Scrapy framework with pip:

pip3 install Scrapy

Installing on Ubuntu

Install the Scrapy framework with pip3:

sudo pip3 install scrapy
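If the installation fails, try installing the following dependencies and then run the command again.

Install the non-Python dependencies:

sudo apt-get install python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev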

Architecture and data flow

[Figure: Scrapy architecture and data-flow diagram]

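1. Scrapy Engine: coordinates communication among the Spider, Item Pipeline, Downloader, and Scheduler, passing signals and data between them.

2. Scheduler: receives the Requests sent by the engine, queues them in a defined order, and hands them back to the engine when the engine asks for them.

3. Downloader: downloads every Request sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them on to the Spider for processing.

4. Spider: processes all Responses, parses them to extract the data needed for the Item fields, and submits the URLs that should be followed to the engine, which sends them back to the Scheduler.

5. Item Pipeline: the place where Items produced by the Spider are post-processed (detailed analysis, filtering, storage, and so on).

6. Downloader Middlewares: components that let you customize and extend download behavior.

7. Spider Middlewares: components that let you customize and extend the communication between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of it).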
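The hand-offs between these components can be sketched in plain Python. The toy loop below is not Scrapy code: the FAKE_SITE dictionary, the function names, and the page paths are all invented for illustration, and real Scrapy runs this cycle asynchronously with middlewares in between.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page title, outgoing links).
FAKE_SITE = {
    "/": ("Home", ["/a", "/b"]),
    "/a": ("Page A", ["/b"]),
    "/b": ("Page B", []),
}


def download(url):
    """Downloader: turn a request (here just a URL) into a response."""
    title, links = FAKE_SITE[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Spider: yield one item per page, then the URLs to follow."""
    yield {"type": "item", "url": response["url"], "title": response["title"]}
    for link in response["links"]:
        yield {"type": "request", "url": link}


def run_engine(start_urls):
    """Engine: move requests, responses and items between the parts."""
    scheduler = deque(start_urls)  # Scheduler: FIFO queue of pending requests
    seen = set(start_urls)         # duplicate filter
    stored = []                    # stands in for the Item Pipeline's storage
    while scheduler:
        url = scheduler.popleft()
        response = download(url)
        for result in parse(response):
            if result["type"] == "item":
                stored.append({"url": result["url"], "title": result["title"]})
            elif result["url"] not in seen:
                seen.add(result["url"])
                scheduler.append(result["url"])
    return stored


items = run_engine(["/"])
print([item["title"] for item in items])
```

Each page is downloaded once even though "/b" is linked from two pages, because the duplicate filter stops a URL from being queued twice.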

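Spiders come in two kinds: basic spiders and generic spiders.

Basic spider

1. Create a project
Before you start scraping, you must create a new Scrapy project. Change into the directory where you keep your projects and run:
scrapy startproject myspider
Note: consider working inside a virtual environment when developing the project.

2. Create a spider file
scrapy genspider jobbole jobbole.com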

3. Identify the target URL
http://blog.jobbole.com/all-posts/

4. Open items.py and declare the fields you want to scrape
import scrapy

class MyspiderItem(scrapy.Item):
    # Title
    title = scrapy.Field()
    # Creation date
    create_date = scrapy.Field()
    # Article URL
    url = scrapy.Field()

5. Write the spider (spiders/jobbole.py)
# -*- coding: utf-8 -*-
import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        # Extraction logic goes here
        pass

6. Parse and store the data
Store the data in the pipeline file (pipelines.py).

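Generic spider

The following command quickly generates code from the CrawlSpider template:
scrapy genspider -t crawl <spider_name> <domain>

CrawlSpider is a subclass of Spider. The Spider class is designed to crawl only the pages listed in start_urls, whereas CrawlSpider defines Rule objects that provide a convenient mechanism for following links: it extracts links from the pages it has crawled and keeps crawling them.

CrawlSpider inherits from the Spider class. Besides the inherited attributes (name, allowed_domains), it provides new attributes and methods:

rules

CrawlSpider uses the rules attribute to decide how the spider crawls, and submits requests for the matching URLs to the engine to carry out the rest of the crawl.
rules contains one or more Rule objects. Each Rule defines a specific behavior for crawling the site, such as which links to extract from the current response, whether to follow the extracted links, and which callback to set on the resulting requests.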

class scrapy.spiders.Rule(
    link_extractor,
    callback = None,
    cb_kwargs = None,
    follow = None,
    process_links = None,
    process_request = None
)

Using a generic spider

Step 1: Decide which fields to save, based on the pages you want to scrape
class ZhilianItem(scrapy.Item):
    # define the fields for your item here like:

    name = scrapy.Field()
    job_title = scrapy.Field()

Step 2: Write the spider class
# A LinkExtractor instance
jobListRult = LinkExtractor(allow=r'sou.zhaopin.com/jobs')
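The allow argument is treated as a regular expression that is searched for in each extracted URL, so its effect can be approximated with the standard re module. The sketch below uses made-up URLs to compare the pattern above with a variant whose dots are escaped.

```python
import re

# The pattern from the LinkExtractor above, plus a stricter variant
# in which the dots are escaped so they only match literal dots.
loose = re.compile(r'sou.zhaopin.com/jobs')
strict = re.compile(r'sou\.zhaopin\.com/jobs')

# Hypothetical URLs used only to exercise the patterns.
urls = [
    'https://sou.zhaopin.com/jobs/searchresult.ashx?p=2',
    'https://www.zhaopin.com/about',
    'https://souXzhaopinYcom/jobs',
]

loose_hits = [u for u in urls if loose.search(u)]
strict_hits = [u for u in urls if strict.search(u)]
print(loose_hits)
print(strict_hits)
```

An unescaped dot matches any character, so the loose pattern also accepts the third URL. Escaping the dots is the safer choice when the pattern is meant to pin down a hostname.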

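Step 3: Save the data
pipelines.py
import json

class ZhilianPipeline(object):

    def __init__(self):
        # Append mode; UTF-8 so non-ASCII text is written as-is
        self.file = open('zhilian.json', 'a+', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one JSON object per line
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(content)
        # Return the item so that later pipelines can process it
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes
        self.file.close()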
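To see what such a pipeline writes, it can be driven by hand outside of Scrapy. The sketch below is a variant rather than the pipeline above: the class name JsonLinesPipeline and its path argument exist only so the demo can write to a temporary file, and it opens the file in open_spider instead of the constructor. The sample item is made up.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Writes one JSON object per line, using Scrapy's spider hooks."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        self.file = open(self.path, 'a', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # pass the item on to the next pipeline

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        self.file.close()


# Call the hooks in the order Scrapy would during a crawl.
path = os.path.join(tempfile.mkdtemp(), 'zhilian.json')
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
returned = pipeline.process_item({'name': 'Example Co.', 'job_title': 'Python 工程师'}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

Because ensure_ascii=False is used, the Chinese characters are stored as readable text rather than as \u escapes, which is why the file is opened with an explicit UTF-8 encoding.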

Step 4: Relevant settings (settings.py)
1. ROBOTSTXT_OBEY = False: whether to obey the site's robots.txt rules
2. DOWNLOAD_DELAY = 3: delay between downloads, in seconds
3. Set the global default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:59.0) Gecko/20100101 Firefox/59.0',
}
4. Enable the item pipeline
ITEM_PIPELINES = {
    'zhilian.pipelines.ZhilianPipeline': 300,
}

Step 5: Run the spider
scrapy crawl zhilianCrawl