Scrapy-5.Items

Article URL: http://www.reibang.com/p/58781f28904f

When scraping, the main job is extracting structured data from messy sources. A Scrapy Spider can return the extracted data as a Python dict. Dicts are convenient and familiar, but they have one drawback: they have no fixed structure. In a large project with many spiders, dicts make it easy to mistype a field name or to return inconsistent data.
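
To make the drawback concrete, here is a minimal sketch in plain Python (no Scrapy needed). The two parse functions and the misspelled key are made up for illustration; the point is only that a dict accepts any key without complaint.

```python
# Two hypothetical spiders return plain dicts for the same kind of record.
def parse_shop_a():
    return {"name": "Desktop PC", "price": 1000}


def parse_shop_b():
    # Typo: "pirce" instead of "price". A dict accepts any key silently.
    return {"name": "Laptop PC", "pirce": 1500}


records = [parse_shop_a(), parse_shop_b()]

# The mistake only surfaces later, when a consumer looks the field up.
prices = [record.get("price") for record in records]
missing = [record["name"] for record in records if "price" not in record]

print(prices)   # [1000, None]
print(missing)  # ['Laptop PC']
```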

For this reason Scrapy defines a dedicated, general-purpose data structure: Item. An Item object provides a dict-like API and a convenient syntax for declaring the available fields.

Declaring Items

Items are declared in the items.py file. To declare one, subclass scrapy.Item:

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

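When you declare an Item you also declare its fields, in much the same way as setting class attributes. If you know Django, you will notice the resemblance to Django Models, but Items are simpler because there is only one Field type.

Usage: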

In [1]: product = Product(name='Desktop PC', price=1000)

In [2]: product
Out[2]: {'name': 'Desktop PC', 'price': 1000}

Field objects

As shown above, fields are declared with Field objects. The Field class is a subclass of Python's dict with nothing changed, so any data you pass when declaring a field becomes that field's metadata. The serializer=str above is one such piece of metadata: it names a serialization function.

Field metadata and field values are independent of each other. Inspecting the Item object itself shows the field values:

In [3]: product
Out[3]: {'name': 'Desktop PC', 'price': 1000}

Accessing .fields shows the field metadata instead:

In [4]: product.fields
Out[4]: {'last_updated': {'serializer': str}, 'name': {}, 'price': {}, 'stock': {}}

Working with Items

Because Item has a dict-like API, it can often be used just like a dict:

# Read a value by field name, as with a dict
In [5]: product['name']
Out[5]: 'Desktop PC'

# The get method works too
In [6]: product.get('name')
Out[6]: 'Desktop PC'

# A default can be supplied for a field that has no value yet
In [7]: product.get('last_updated', 'not set')
Out[7]: 'not set'

# Assign to a declared field, as with a dict
In [8]: product['last_updated'] = 'today'
In [9]: product['last_updated']
Out[9]: 'today'

There is one difference, however: operating on a field that the Item does not declare raises an exception:

# Reading an undeclared field
In [10]: product['lala']
Traceback (most recent call last):
    ...
KeyError: 'lala'

# get with a default also works for an undeclared field
In [11]: product.get('lala', 'unknown field')
Out[11]: 'unknown field'

# Assigning to an undeclared field
In [12]: product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
    ...
KeyError: 'Product does not support field: lala'

An Item can also be created directly from a dict:

In [13]: Product({'name': 'Laptop PC', 'price': 1500})
Out[13]: Product(price=1500, name='Laptop PC')

In [14]: Product({'name': 'Laptop PC', 'lala': 1500}) # warning: unknown field in dict
Traceback (most recent call last):
    ...
KeyError: 'Product does not support field: lala'

To extend or modify an Item class, use inheritance:

class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
    discount_expiration_date = scrapy.Field()

class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)

Item Pipeline

After a Spider returns an Item, the Item is sent to the Item Pipeline. Pipelines are typically used to:

Clean the data
Validate the scraped data (for example, check that certain fields are present)
Check for duplicates
Store the data in a database

Each Item Pipeline is a Python class that implements a few simple methods.

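Enabling an Item Pipeline

Pipelines are enabled in much the same way as middleware, through a setting that maps each class path to a priority. The lower the priority value, the earlier the pipeline is called.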

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
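
The ordering rule can be sketched in a few lines of plain Python. This only illustrates "lower value runs first"; it is not how Scrapy loads pipelines internally.

```python
# The same mapping as in the settings above, deliberately listed out of order.
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 800,
    'myproject.pipelines.PricePipeline': 300,
}

# Sort by the integer value: lower numbers come first.
call_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda pair: pair[1])
]
```

With these values, PricePipeline (300) sees each Item before JsonWriterPipeline (800), so dropped Items never reach the file.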

Writing a custom Item Pipeline

A custom Item Pipeline must implement the following method:

process_item(self, item, spider)

Scrapy calls this method on every Item Pipeline component to process each Item.

Parameters:
- item (Item object or dict) - the scraped Item.
- spider (Spider object) - the Spider that scraped the Item.
The method must do one of the following:
- Return an Item or dict
  
  Scrapy goes on to call the next Item Pipeline components.
- Raise a DropItem exception
  
  The remaining Item Pipeline components are not called for this Item.
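
The return-or-drop contract can be sketched with a tiny runner in plain Python. Everything here is a stand-in: the DropItem class mimics scrapy.exceptions.DropItem so the sketch runs without Scrapy, and run_pipelines, RequirePrice, and AddCurrency are hypothetical names, not Scrapy's own machinery.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class RequirePrice:
    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem("Missing price in %s" % item)
        return item


class AddCurrency:
    def process_item(self, item, spider):
        item["currency"] = "USD"  # illustrative field
        return item


def run_pipelines(item, pipelines, spider=None):
    """Hypothetical runner: pass the item through each component in order."""
    for pipeline in pipelines:
        try:
            item = pipeline.process_item(item, spider)
        except DropItem:
            return None  # later components never see a dropped item
    return item


pipelines = [RequirePrice(), AddCurrency()]
kept = run_pipelines({"name": "Desktop PC", "price": 1000}, pipelines)
dropped = run_pipelines({"name": "Laptop PC"}, pipelines)
```

The first item passes through both components and gains a currency; the second is dropped by RequirePrice, so AddCurrency is never called for it.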

A pipeline can also implement any of the following methods:

open_spider(self, spider)

This method is called when the Spider is opened.

close_spider(self, spider)

This method is called when the Spider is closed.

from_crawler(cls, crawler)

If this class method exists, it is called to create a Pipeline instance from the Crawler. It must return a new Pipeline instance.

The Crawler object gives access to Scrapy's core components, such as settings and signals. This is how a Pipeline instance can hook into them.
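
A minimal sketch of how these hooks fit together, runnable without Scrapy. MinPricePipeline and the MIN_PRICE setting are hypothetical, and the fake crawler imitates only the settings.get() call; a real Crawler offers much more.

```python
from types import SimpleNamespace


class MinPricePipeline:
    """Illustrative pipeline; MIN_PRICE is a hypothetical setting name."""

    def __init__(self, min_price):
        self.min_price = min_price
        self.seen = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Read configuration from the crawler's settings and build the instance.
        return cls(min_price=crawler.settings.get("MIN_PRICE", 0))

    def open_spider(self, spider):
        # Reset per-run state when the spider starts.
        self.seen = 0

    def process_item(self, item, spider):
        self.seen += 1
        return item


# A stand-in for the real Crawler: only .settings.get() is imitated.
fake_crawler = SimpleNamespace(settings={"MIN_PRICE": 500})
pipeline = MinPricePipeline.from_crawler(fake_crawler)
pipeline.open_spider(spider=None)
result = pipeline.process_item({"name": "Desktop PC", "price": 1000}, spider=None)
```

Keeping configuration lookup in from_crawler leaves __init__ free of any Scrapy dependency, which makes the pipeline easy to construct in tests.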

Item Pipeline examples

1. Validating data

If the Item has the price_excludes_vat attribute set, adjust its price attribute accordingly, and drop any Item that has no price.

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        # Use get() so an Item with an unset field is dropped instead of raising KeyError.
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

2. Writing Items to a JSON file

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
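
Because the pipeline writes one JSON object per line, the output can be read back line by line. The sketch below is plain Python and writes to a temporary directory instead of the working directory; the sample items are made up.

```python
import json
import os
import tempfile

items = [
    {"name": "Desktop PC", "price": 1000},
    {"name": "Laptop PC", "price": 1500},
]

# Write the way the pipeline does: json.dumps(...) plus a newline per item.
path = os.path.join(tempfile.mkdtemp(), "items.jl")
with open(path, "w") as f:
    for item in items:
        f.write(json.dumps(item) + "\n")

# Read it back one line at a time; each line parses independently.
with open(path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]
```

One practical advantage of this layout is that a partially written file is still readable up to its last complete line.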

3. Writing Items to MongoDB

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

4. Filtering duplicates

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
