除了爬取文本实牡,我們可能還需要下載文件斋荞、視頻悴务、圖片、壓縮包等譬猫,這也是一些常見的需求讯檐。scrapy提供了FilesPipeline和ImagesPipeline,專門用于下載普通文件及圖片染服。兩者的使用方法也十分簡(jiǎn)單别洪,首先看下FilesPipeline的使用方式。
FilesPipeline
The FilesPipeline workflow is as follows:
- In the spider, collect the URLs of the files you want to download and put them in the item's `file_urls` field;
- the spider returns the item, which is passed along the pipeline chain;
- when FilesPipeline handles the item, it checks for a `file_urls` field and, if present, hands the URLs to the Scrapy scheduler and downloader;
- once the download completes, the result is written into another item field, `files`, which holds each file's current local path (relative to the `FILES_STORE` setting), a `checksum`, and its `url`.
The workflow above implies a few prerequisites for using FilesPipeline:

- the Item must define both the `file_urls` and `files` fields;
- FilesPipeline must be enabled in the settings;
- the download directory `FILES_STORE` must be configured.
The example below downloads the Python example code listed on https://twistedmatrix.com/documents/current/core/examples/:
```python
# items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ExamplesItem(scrapy.Item):
    file_urls = scrapy.Field()  # URLs of the files to download
    files = scrapy.Field()      # filled in with download info once the files are fetched
```
```python
# example.py
# -*- coding: utf-8 -*-
import scrapy

from ..items import ExamplesItem


class ExamplesSpider(scrapy.Spider):
    name = 'examples'
    allowed_domains = ['twistedmatrix.com']
    start_urls = ['https://twistedmatrix.com/documents/current/core/examples/']

    def parse(self, response):
        urls = response.css('a.reference.download.internal::attr(href)').extract()
        for url in urls:
            yield ExamplesItem(file_urls=[response.urljoin(url)])
```
```python
# settings.py
# ...
FILES_STORE = '/root/TwistedExamples/file_store'
# ...
```
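The `# ...` lines elide the rest of the settings; the pipeline itself is enabled there as well, which looks like this (the priority value 1 is arbitrary):

```python
# settings.py — enable the built-in FilesPipeline (priority value is arbitrary)
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
```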
Run `scrapy crawl examples`, and the downloaded files appear under the `FILES_STORE/full` directory. At this point each file is named with the SHA1 hash of its URL; how to customize the names is covered later. First, let's look at ImagesPipeline.
ImagesPipeline
ImagesPipeline works in much the same way as FilesPipeline; only the field names and configuration keys differ:
| | FilesPipeline | ImagesPipeline |
| --- | --- | --- |
| Pipeline class | scrapy.pipelines.files.FilesPipeline | scrapy.pipelines.images.ImagesPipeline |
| Item fields | file_urls, files | image_urls, images |
| Storage path setting | FILES_STORE | IMAGES_STORE |
In addition, ImagesPipeline supports a couple of extra features (see the sketch after this list):

- thumbnail generation, configured via `IMAGES_THUMBS = {'size_name': (width, height), ...}`;
- filtering out images that are too small, via `IMAGES_MIN_HEIGHT` and `IMAGES_MIN_WIDTH`.
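For example, a minimal settings sketch (the size names, dimensions, and thresholds here are illustrative):

```python
# settings.py — illustrative values
IMAGES_THUMBS = {
    'small': (50, 50),    # saved under <IMAGES_STORE>/thumbs/small/<sha1>.jpg
    'big': (270, 270),    # saved under <IMAGES_STORE>/thumbs/big/<sha1>.jpg
}
IMAGES_MIN_HEIGHT = 110   # drop images shorter than 110 px
IMAGES_MIN_WIDTH = 110    # drop images narrower than 110 px
```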
Below we crawl the gallery images at http://image.so.com/z?ch=beauty to see ImagesPipeline in action. Watching the requests the page makes shows that image addresses come from the API http://image.so.com/zj?ch=beauty&sn=0&listtype=new&temp=1, where `sn=0` is the offset into the image data; each call returns 30 image records by default, as a JSON string like this:
"end": false,
"count": 30,
"lastid": 30,
"list": [{
"id": "b0cd2c3beced890b801b845a7d2de081",
"imageid": "f90d2737a6d14cbcb2f1f2d5192356dc",
"group_title": "清純美女戶外迷人寫真笑顏迷人",
"tag": "萌女",
"grpseq": 1,
"cover_imgurl": "http:\/\/i1.umei.cc\/uploads\/tu\/201608\/80\/0dexb2tjurx.jpg",
"cover_height": 960,
"cover_width": 640,
"total_count": 8,
"index": 1,
"qhimg_url": "http:\/\/p0.so.qhmsg.com\/t017d478b5ab2f639ff.jpg",
"qhimg_thumb_url": "http:\/\/p0.so.qhmsg.com\/sdr\/238__\/t017d478b5ab2f639ff.jpg",
"qhimg_width": 238,
"qhimg_height": 357,
"dsptime": ""
},
......省略
, {
"id": "37f6474ea039f34b5936eb70d77c057c",
"imageid": "3125c84c138f1d31096f620c29b94512",
"group_title": "美女蘿莉鐵路制服寫真清純動(dòng)人",
"tag": "萌女",
"grpseq": 1,
"cover_imgurl": "http:\/\/i1.umei.cc\/uploads\/tu\/201701\/798\/kuojthsyf1j.jpg",
"cover_height": 587,
"cover_width": 880,
"total_count": 8,
"index": 30,
"qhimg_url": "http:\/\/p2.so.qhimgs1.com\/t0108dc82794264fe32.jpg",
"qhimg_thumb_url": "http:\/\/p2.so.qhimgs1.com\/sdr\/238__\/t0108dc82794264fe32.jpg",
"qhimg_width": 238,
"qhimg_height": 159,
"dsptime": ""
}]
}
We can take each image link from the `qhimg_url` field of the response. The code is as follows:
```python
# items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class BeautyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
```
```python
# beauty.py
# -*- coding: utf-8 -*-
import json

import scrapy

from ..items import BeautyItem


class BeautypicSpider(scrapy.Spider):
    name = 'beautypic'
    allowed_domains = ['image.so.com']
    url_pattern = 'http://image.so.com/zj?ch=beauty&sn={offset}&listtype=new&temp=1'
    # start_urls = ['http://image.so.com/']

    def start_requests(self):
        step = 30
        for page in range(0, 3):
            url = self.url_pattern.format(offset=page * step)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        ret = json.loads(response.body)
        for row in ret['list']:
            yield BeautyItem(image_urls=[row['qhimg_url']], name=row['group_title'])
```
```python
# settings.py
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 5,
}
IMAGES_STORE = '/root/beauty/store_file'
```
The downloaded image files end up under `IMAGES_STORE/full`, again named by the SHA1 hash of their URLs.
Changing the default file names
As seen with FilesPipeline and ImagesPipeline, the downloaded file names are rather cryptic and not very readable: each is the SHA1 hash of the file's URL, which mainly keeps same-named files from overwriting one another. Sometimes, though, we want files named the way we expect. For downloaded files, reading the FilesPipeline source shows that the name is determined by `FilesPipeline.file_path`; the relevant code is:
```python
class FilesPipeline(MediaPipeline):
    ...
    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('FilesPipeline.file_key(url) method is deprecated, please use '
                          'file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() method has been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        ## end of deprecation warning block

        media_guid = hashlib.sha1(to_bytes(url)).hexdigest()  # change to request.url after deprecation
        media_ext = os.path.splitext(url)[1]  # change to request.url after deprecation
        return 'full/%s%s' % (media_guid, media_ext)
    ...
```
So we can subclass FilesPipeline and override its `file_path()` method to redefine the file name. The new custom `SelfDefineFilePipline` looks like this:
```python
# pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline


class MatplotlibExamplesPipeline(object):
    def process_item(self, item, spider):
        return item


class SelfDefineFilePipline(FilesPipeline):
    """Subclass FilesPipeline and change how downloaded files are named."""

    def file_path(self, request, response=None, info=None):
        # name the file after the last path segment of its URL
        parse_result = urlparse(request.url)
        path = parse_result.path
        basename = os.path.basename(path)
        return basename
```
Enable SelfDefineFilePipline in settings.py (see the snippet below) and run the spider; the downloaded files are now named after the last segment of their URLs.
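A settings sketch for that (the module path assumes the project package is named `matplotlib_examples`, as the pipeline file above suggests; adjust it to your own project):

```python
# settings.py — module path assumed from the project layout above
ITEM_PIPELINES = {
    'matplotlib_examples.pipelines.SelfDefineFilePipline': 5,
}
FILES_STORE = '/root/TwistedExamples/file_store'
```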
This is just one approach, meant to illustrate the idea; there are many ways to change file names, and the right one depends on the scenario. In the image-download example earlier, for instance, the URL does not carry the image's name, so overriding `file_path()` alone cannot produce the desired names: `item['name']` is never passed into that method. Searching the source shows that `get_media_requests()` builds the Request used to download each image and does have the item in scope, so we can pass `item['name']` through the Request's `meta` parameter and read it back inside `file_path()`; a sketch of that follows. Reading the source is, in itself, a good way to learn a framework.
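A minimal sketch of that idea (the class name `NamedImagesPipeline` is ours, and clashes between items with identical names are not handled):

```python
# pipelines.py — a sketch; class name is ours, name collisions are not handled
import os
from urllib.parse import urlparse

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class NamedImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # attach the item's name to each download request via meta
        for url in item['image_urls']:
            yield Request(url, meta={'name': item['name']})

    def file_path(self, request, response=None, info=None):
        # read the name back from meta, keeping the URL's file extension
        ext = os.path.splitext(urlparse(request.url).path)[1] or '.jpg'
        return 'full/%s%s' % (request.meta['name'], ext)
```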
Summary
This post covered how to use Scrapy's built-in FilesPipeline and ImagesPipeline to download files and images, and then how to subclass them and override their methods to change the file-naming scheme. The next post looks at LinkExtractor for quickly extracting links, and at Exporters for writing results out to files.