Scrapy Study Notes

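Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It is widely applicable and can be used for data mining, monitoring, and automated testing. (From an introduction found online.)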

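Recommended reading:

Scrapy Tutorial: the official docs, highly recommended
A minimal Chinese-language introduction, good for a quick overview
A community-driven translation project on GitHub
An XPath tutorial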

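After installing as described in the tutorial (I recommend, and only recommend, installing with conda rather than other methods such as pip), these are the commonly used commands: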

scrapy startproject projectName
scrapy genspider spiderName "xxx.html"
scrapy crawl spiderName
scrapy shell "url"

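Overall, the work is this: first design the structure of the item you want to scrape (items.py) and write the corresponding spider.py. In the spider code, fill the scraped content into the item's fields, then hand the item to a pipeline, which stores the scraped data in a file or a database.
The detailed flow is as follows: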

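1. The engine gets the initial requests to crawl from the spider.
2. The engine schedules the requests in the scheduler and asks for the next request to crawl.
3. The scheduler returns the next request to the engine.
4. The engine sends the request to the downloader, passing through the downloader middlewares.
5. Once the page finishes downloading, the downloader returns the result (a response) to the engine.
6. The engine passes the downloader's response to the spider for processing, through the spider middlewares.
7. The spider processes the response and returns the scraped items and new requests to the engine, through the spider middlewares.
8. The engine sends the processed items to the item pipelines, then sends the new requests to the scheduler, which plans the next request to crawl.
9. The process repeats (from step 1) until there are no more URL requests to crawl.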
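
The loop above can be mimicked with a toy model in plain Python. This is only a sketch to make the data flow concrete, not how Scrapy is implemented: FAKE_WEB, download, parse, and run_engine are hypothetical names made up for illustration, and real Scrapy does all of this asynchronously with middlewares in between.

```python
from collections import deque

# Hypothetical stand-in for the network: URL -> (page text, links on that page).
FAKE_WEB = {
    "page/1": ("quote A", ["page/2"]),
    "page/2": ("quote B", ["page/1"]),  # links back, to show de-duplication
}


def download(url):
    """Downloader: build a 'response' for a request (steps 4-5)."""
    text, links = FAKE_WEB[url]
    return {"url": url, "text": text, "links": links}


def parse(response):
    """Spider callback: yield items (dicts) and new requests (str URLs) (step 7)."""
    yield {"url": response["url"], "quote": response["text"]}
    for link in response["links"]:
        yield link


def run_engine(start_urls):
    """Engine: move requests and responses between the other components."""
    scheduler = deque(start_urls)   # steps 1-2: initial requests are scheduled
    seen = set(start_urls)
    stored = []                     # what the item pipeline has persisted
    while scheduler:                # step 9: repeat until no requests are left
        request = scheduler.popleft()        # step 3
        response = download(request)         # steps 4-5
        for result in parse(response):       # steps 6-7
            if isinstance(result, dict):
                stored.append(result)        # step 8: item goes to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)     # step 8: new request is scheduled
    return stored


items = run_engine(["page/1"])
print(items)
```

Each page is visited once even though the two pages link to each other, because the toy scheduler remembers which URLs it has already seen.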

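After creating the project, here are two spider code templates. The first is the most basic spider, but it is really too simple and can only crawl a page or two. To crawl many pages where each page holds many records, use the second one instead:

"""First example"""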
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

"""Second example"""
import scrapy
from scrapy.loader import ItemLoader
from myproject.items import MyItem  # assumes MyItem is defined in items.py with a 'data' field


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        urls = ['https://www.xxx.com']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.url_parse)

    def url_parse(self, response):
        """
        Collect the URLs of the individual pages listed on an index page.
        """
        # .getall() returns the matched links as strings rather than selectors
        urls = response.xpath('...').getall()
        for url in urls:
            # Extract data from each page; response.follow also accepts relative URLs
            yield response.follow(url, callback=self.data_parse)

        next_page = response.xpath('...').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.url_parse)

    def data_parse(self, response):
        """
        Extract the data from the final page.
        """
        data = response.xpath('...').getall()
        loader = ItemLoader(item=MyItem(), response=response)
        loader.add_value('data', data)
        yield loader.load_item()

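Here name is the name of the current spider, and it is the unique identifier used to launch it later. The spider automatically runs start_requests(self) and issues requests for the URLs as coded; this is the more customizable form. To save effort, you can also skip overriding start_requests() and use the following instead: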

name = "quotes"
start_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]

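I won't go into more detail on these. In short, start_urls initiates the requests; the result of each request is placed in a response object, and parse(self, response) is called with it. Our goal is to extract content from the response using parsing tools such as CSS selectors, XPath, or regular expressions. Once extracted, the content can be put into a custom item and passed to a pipeline for persistence (for example, stored in a database), or written straight to a file at the end of parse.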

First, define the item you want in items.py:

import scrapy
class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

Then, in the spider, load the item and hand it on to the pipeline:

from scrapy.loader import ItemLoader
from myproject.items import Product

# Inside the spider class: a callback that fills one Product per response
def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    # add_xpath extracts the values matched by the XPath expression
    l.add_xpath('name', '...')
    l.add_xpath('price', '...')
    l.add_xpath('stock', '...')
    l.add_xpath('last_updated', '...')
    yield l.load_item()

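You then need to implement, in pipelines.py, a pipeline that processes these items. For different spiders, you can branch on their name:

import json

class MyPipeline:
    def open_spider(self, spider):
        # Choose the output file by spider name. The fallback name
        # 'items.txt' is an assumption so that self.file always exists.
        if spider.name == 'spider1':
            filename = 'file1.txt'
        elif spider.name == 'spider2':
            filename = 'file2.txt'
        else:
            filename = 'items.txt'
        self.file = open(filename, 'a+', encoding='utf-8')

    def process_item(self, item, spider):
        # Write each item as one JSON object per line
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()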

Finally, register this pipeline in settings.py so that it actually goes into service:

ITEM_PIPELINES = {
   'myproject.pipelines.MyPipeline': 300,
}

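The number given here determines the order in which the components run: the smaller the number, the earlier it runs. By convention the values are kept in the 0-1000 range.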
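
To make the ordering rule concrete, here is a small sketch with two pipelines. Both class paths are hypothetical, and the sorted() line is only there to show the resulting order; it is not something you would put in a real settings.py.

```python
# settings.py sketch; both pipeline class paths are hypothetical.
ITEM_PIPELINES = {
    "myproject.pipelines.SaveToFilePipeline": 800,
    "myproject.pipelines.CleanPricePipeline": 300,
}

# Items visit the pipelines in ascending order of the number,
# regardless of the order in which the dict is written.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

Here the cleaning step runs before the saving step because 300 is smaller than 800.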

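Practical issues:

See this reference.
A common one is avoiding being banned by the target site:

Use a pool of user agents and rotate through them, picking one for each request.
Disable cookies (see COOKIES_ENABLED in settings.py); some sites use cookies to track crawler behavior.
Set a download delay (2 seconds or more). See the DOWNLOAD_DELAY setting.
If feasible, scrape from the Google cache instead of hitting the site directly.
Use an IP pool, for example the free Tor project or the paid service ProxyMesh.
Use a highly distributed downloader that circumvents bans, so that you only need to focus on parsing pages. One example is Crawlera.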
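
The first three points can be sketched as follows. The two settings are the ones named in the list; the middleware class, its name, and the User-Agent strings in the pool are all hypothetical and only illustrate the rotation idea.

```python
import random

# settings.py: the two settings named in the list above.
COOKIES_ENABLED = False   # do not send or store cookies
DOWNLOAD_DELAY = 2        # base delay, in seconds, between requests to a site

# Hypothetical pool; replace with real browser User-Agent strings.
USER_AGENT_POOL = [
    "ExampleBrowser/1.0",
    "ExampleBrowser/2.0",
    "ExampleBrowser/3.0",
]


class RotateUserAgentMiddleware:
    """Downloader middleware that picks a User-Agent for every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENT_POOL)
        return None  # None tells Scrapy to keep processing this request
```

A middleware like this only takes effect once it is registered under DOWNLOADER_MIDDLEWARES in settings.py, with the path of wherever you place the class in your project.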

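For setting random request headers in particular, I strongly recommend the blog post "Random User-Agent in Scrapy with One Line of Code" (一行代碼搞定Scrapy的隨機UserAgent).
In short, it takes only two steps:

Install the module with pip: pip install scrapy-fake-useragent
Add the module in settings.py so that it sets the User-Agent automatically.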

DOWNLOADER_MIDDLEWARES = {
    # Disable the default UA middleware and enable the random UA middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}