Crawler framework: Scrapy 1.5
Target site: https://movie.douban.com/top250
Goal: scrape the movie cover image from each page of the list and download it to a specified directory.
Project directory layout:
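(The directory listing itself is not reproduced here; a typical layout, assuming the project was created with scrapy startproject douban, looks like this:)

douban/
    scrapy.cfg
    douban/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            movie_spider.py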
Spider: movie_spider.py
import scrapy
from douban.items import DoubanItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    start_urls = ['https://movie.douban.com/top250']
    allowed_domains = ['douban.com']

    def parse(self, response):
        movies = response.xpath("//ol[@class='grid_view']//li/div[@class='item']")
        for movie in movies:
            # Create a fresh item for each movie; reusing a single item
            # across yields would leave every result pointing at the same object.
            item = DoubanItem()
            item['title'] = movie.xpath("./div/div/a/span[@class='title'][1]/text()").extract_first()
            item['num'] = movie.xpath(".//div[@class='pic']/em/text()").extract_first()
            item['stars'] = movie.xpath(".//span[@class='rating_num']/text()").extract_first()
            item['src'] = movie.xpath("./div/a/img/@src").extract_first()
            yield item
        # The "next" link is a relative query string such as ?start=25&filter=;
        # it is absent on the last page, which ends the crawl.
        next_page = response.xpath("//div[@class='paginator']/span[@class='next']/a/@href").extract_first()
        if next_page is not None:
            next_url = "https://movie.douban.com/top250" + next_page
            yield scrapy.Request(next_url)
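The spider imports DoubanItem from the project's items.py, which is not listed above; a minimal sketch matching the four fields the spider fills in would be:

import scrapy

class DoubanItem(scrapy.Item):
    title = scrapy.Field()   # movie title
    num = scrapy.Field()     # rank on the Top 250 list
    stars = scrapy.Field()   # rating score
    src = scrapy.Field()     # cover image URL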
Pipeline code: pipelines.py
import json
import os
import urllib.request
from scrapy.exceptions import DropItem

class DoubanPipeline(object):
    def __init__(self):
        self.file = open('movies.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if item['src'] is None:
            raise DropItem('missing cover url for %s' % item['title'])
        os.makedirs('download', exist_ok=True)  # make sure the target directory exists
        conn = urllib.request.urlopen(item['src'])
        with open('download/' + item['num'] + item['title'] + '.jpg', 'wb') as file:
            file.write(conn.read())  # the with block closes the file; no explicit close() needed
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')  # one JSON object per line
        return item

    def close_spider(self, spider):
        self.file.close()
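As a design note, Scrapy also ships a built-in image pipeline (scrapy.pipelines.images.ImagesPipeline, which requires Pillow) that handles downloading, deduplication and file storage for you. A minimal sketch of the settings, assuming the spider stored its cover URLs in a list field named image_urls instead of src:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'download'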
Configuration: settings.py
Set the request User-Agent and enable the item pipeline:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
ITEM_PIPELINES = {
'douban.pipelines.DoubanPipeline': 300,
}
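One caveat, assuming the project was generated with scrapy startproject: the default template enables ROBOTSTXT_OBEY = True, and if douban.com's robots.txt disallows the crawl, the scheduler will filter out every request. In that case, also add:

ROBOTSTXT_OBEY = False

Then run the spider from the project root:

scrapy crawl movie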
Execution result: