Scraping 360 Photography Images
Reference: "Python 3 Web Crawler Development in Practice" (《Python3網(wǎng)絡(luò)爬蟲開發(fā)實(shí)戰(zhàn)》), p. 497, by Cui Qingcai
Goal: use Scrapy to crawl 360's photography gallery, save the data to a MongoDB database, and download the images locally
Target URL: http://image.so.com/z?ch=photography
Analysis / key points:
Difficulty:
a. Beginner level. The static page contains no image data; images are fetched and rendered via AJAX, and the responses are JSON. Image downloading: use the built-in ImagesPipeline with a few methods overridden;
b. MongoDB storage.
- Create the Scrapy project and the images spider
Terminal: > scrapy startproject images360
Terminal: > scrapy genspider images image.so.com
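For orientation, the two commands above generate Scrapy's standard project scaffolding, roughly:

```
images360/
    scrapy.cfg
    images360/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            images.py   # created by genspider
```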
- Configure settings.py
# MongoDB configuration
MONGO_URI = 'localhost'
MONGO_DB = 'images360'
# Default directory for downloaded images (used by ImagesPipeline)
IMAGES_STORE = './images'
# heh heh heh... (ignore robots.txt)
ROBOTSTXT_OBEY = False
# headers
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
# Enable the pipelines (ImagePipeline must run first, so it gets the lower number / higher priority)
ITEM_PIPELINES = {
'images360.pipelines.ImagePipeline': 300,
'images360.pipelines.MongoPipeline': 301,
}
- Write items.py
from scrapy import Item, Field
# Capture every image info field returned by the API
class ImageItem(Item):
cover_height = Field()
cover_imgurl = Field()
cover_width = Field()
dsptime = Field()
group_title = Field()
grpseq = Field()
id = Field()
imageid = Field()
index = Field()
label = Field()
qhimg_height = Field()
qhimg_thumb_url = Field()
qhimg_url = Field()
qhimg_width = Field()
tag = Field()
total_count = Field()
- Write pipelines.py
a) ImagePipeline: adapted from the Scrapy official docs, "Downloading and processing files and images":
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

# Image-download pipeline
class ImagePipeline(ImagesPipeline):
def file_path(self, request, response=None, info=None):
'''
Override file_path to name the file after the last segment of the image URL
'''
url = request.url
file_name = url.split('/')[-1]
return file_name
def item_completed(self, results, item, info):
'''
Drop items whose image failed to download so they are not saved to the database
'''
image_paths = [x['path'] for ok, x in results if ok]
if not image_paths:
raise DropItem('Image Downloaded Failed')
return item
def get_media_requests(self, item, info):
'''
Request the item's image URL; the scheduler arranges the download
'''
yield Request(url=item['qhimg_url'])
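As a quick sanity check, the overridden file_path above simply keeps the last path segment of the image URL as the local file name (the URL below is a made-up example, not a real 360 image address):

```python
# Hypothetical image URL, used only to illustrate file_path's logic
url = 'http://p0.so.qhimgs1.com/t01abc123.jpg'
file_name = url.split('/')[-1]
print(file_name)  # t01abc123.jpg
```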
b) MongoPipeline: adapted from the Scrapy official documentation: https://doc.scrapy.org/en/latest/topics/item-pipeline.html?highlight=mongo (code omitted)
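Since the MongoPipeline code is omitted above, here is a sketch following the pattern in the linked Scrapy item-pipeline docs. Assumptions: pymongo is imported lazily inside open_spider so the class has no import-time dependency, and the item class name is used as the collection name (a common choice, not necessarily the book's):

```python
class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read MONGO_URI / MONGO_DB from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        # Lazy import: only needed once the spider actually opens
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Collection named after the item class; store a plain-dict copy
        self.db[item.__class__.__name__].insert_one(dict(item))
        return item
```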
- Write spiders > images.py
Notes:
a) Override start_requests(self);
b) Build the request URLs dynamically; assign the Fields dynamically and yield the corresponding ImageItem
# Assign each image's fields dynamically and yield an ImageItem
for image in images:
item = ImageItem()
for field in item.fields:
if field in image.keys():
item[field] = image.get(field)
yield item
c) Full code:
import json
from scrapy import Spider, Request
from images360.items import ImageItem
class ImagesSpider(Spider):
name = 'images'
# allowed_domains = ['image.so.com']
# start_urls = ['http://image.so.com/z?ch=photography']
url = 'http://image.so.com/zj?ch=photography&sn={sn}&listtype=new&temp=1'
# Overridden
def start_requests(self):
# Request the first ~1200 images in a loop (40 requests; sn = 30, 60, ..., 1200)
for sn in range(1, 41):
yield Request(url=self.url.format(sn=sn * 30), callback=self.parse)
def parse(self, response):
results = json.loads(response.text)
# Check whether 'list' is among results' keys
if 'list' in results.keys():
images = results.get('list')
# Assign each image's fields dynamically and yield an ImageItem
for image in images:
item = ImageItem()
for field in item.fields:
if field in image.keys():
item[field] = image.get(field)
yield item
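To illustrate the parse logic in isolation, the snippet below runs the same key-filtering loop against a hand-written JSON string shaped like the /zj endpoint's response (the field list and values here are made up; the real ImageItem declares more fields):

```python
import json

# Hand-written sample shaped like the AJAX response (values are made up)
sample = ('{"total": 1200, "list": [{"qhimg_url": "http://example.com/a.jpg",'
          ' "group_title": "demo", "extra_key": 1}]}')

fields = ['qhimg_url', 'group_title', 'tag']  # stand-in for ImageItem.fields
results = json.loads(sample)
items = []
if 'list' in results:
    for image in results['list']:
        # Copy only the keys the item declares; ignore everything else
        items.append({f: image[f] for f in fields if f in image})
print(items)  # [{'qhimg_url': 'http://example.com/a.jpg', 'group_title': 'demo'}]
```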
- Run the spider and inspect the results
Terminal: > scrapy crawl images
Summary
- A beginner-level project; further practice with the Scrapy workflow;
- Practice fetching and parsing AJAX-returned JSON;
- A first look at ImagesPipeline and how to override it as needed.