As a web-scraping enthusiast, you will find that while the requests library is enough for simple little crawlers, and selenium can mimic browser behavior, learning a framework lets you finish scraping jobs far more conveniently and efficiently.
Case study: scraping pictures of the BMW 5 Series
1. Create a new scrapy project
scrapy startproject bw
cd bw
scrapy genspider bw5 "XXXXXXX"  # replace XXXXXXX with the target domain (here: car.autohome.com.cn)
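For reference, these commands should leave a project skeleton roughly like the following (exact files may vary slightly by Scrapy version):
bw/
    scrapy.cfg
    bw/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            bw5.py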
2. Define three fields in items.py
import scrapy

class BwItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()       # sub-category name
    image_urls = scrapy.Field()  # list of image URLs to download
    images = scrapy.Field()      # filled in by the images pipeline after download
3. In the spider file bw5.py, import BwItem and parse the page
import scrapy
from bw.items import BwItem

class Bw5Spider(scrapy.Spider):
    name = 'bw5'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        # skip the first uibox, which is not a picture category
        uiboxs = response.xpath("//div[@class='uibox']")[1:]
        for uibox in uiboxs:
            title = uibox.xpath(".//div[@class='uibox-title']/a/text()").get()
            urls = uibox.xpath(".//div[@class='uibox-con carpic-list03']/ul/li/a/img/@src").getall()
            # the src values are relative, so join each one with the response URL
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = BwItem(title=title, image_urls=urls)
            yield item
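The XPath expressions above can be sanity-checked interactively in scrapy shell before running the full spider; the exact output naturally depends on the live page:
scrapy shell "https://car.autohome.com.cn/pic/series/65.html"
>>> response.xpath("//div[@class='uibox']//div[@class='uibox-title']/a/text()").getall()
>>> response.xpath("//div[@class='uibox-con carpic-list03']/ul/li/a/img/@src").getall()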
4. Scrapy already ships with an asynchronous media downloader; we only need to switch on the right pipeline and configure a storage path.
To download files with the Files Pipeline, follow these steps (a minimal sketch of the setup follows this list):
1. Define an Item with two fields, file_urls and files. file_urls holds the URLs of the files to download and must be a list.
2. When a file finishes downloading, the download details (storage path, source URL, file checksum, etc.) are stored in the item's files field.
3. In settings.py, set FILES_STORE, which is the directory the files are downloaded to.
4. Enable the pipeline by adding scrapy.pipelines.files.FilesPipeline: 1 to ITEM_PIPELINES.
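Put together, a minimal FilesPipeline setup looks roughly like this (DocItem is a hypothetical item name, not part of this project):
# items.py
import scrapy

class DocItem(scrapy.Item):
    file_urls = scrapy.Field()  # list of file URLs to download
    files = scrapy.Field()      # filled in by FilesPipeline after download

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'downloads'  # files land under downloads/full/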
To download images with the Images Pipeline:
1. Define an Item with two fields, image_urls and images. image_urls holds the URLs of the images to download and must be a list.
2. When an image finishes downloading, the download details (storage path, source URL, file checksum, etc.) are stored in the item's images field.
3. In settings.py, set IMAGES_STORE, which is the directory the images are downloaded to.
4. Enable the pipeline by adding scrapy.pipelines.images.ImagesPipeline: 1 to ITEM_PIPELINES.
If you have custom storage requirements, override the relevant methods in pipelines.py. Here get_media_requests attaches the item to every download request so that file_path can read its title and sort each image into a per-category folder instead of the default full/ directory:
import os
from scrapy.pipelines.images import ImagesPipeline
from bw import settings

class BwImagesPipeline(ImagesPipeline):  # subclass the built-in ImagesPipeline
    def get_media_requests(self, item, info):
        # let the parent build the download requests, then attach the item
        # to each request so file_path() can read its title
        request_objs = super(BwImagesPipeline, self).get_media_requests(item, info)
        for request_obj in request_objs:
            request_obj.item = item
        return request_objs

    def file_path(self, request, response=None, info=None):
        # the default path looks like "full/<hash>.jpg"; move the file into
        # a folder named after the item's title instead
        path = super(BwImagesPipeline, self).file_path(request, response, info)
        title = request.item.get("title")
        images_store = settings.IMAGES_STORE
        title_path = os.path.join(images_store, title)
        os.makedirs(title_path, exist_ok=True)
        image_name = path.replace("full/", "")
        image_path = os.path.join(title_path, image_name)
        return image_path
5. Turn on the relevant settings in settings.py
import os

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # ignore robots.txt for this crawl

# default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
}

# enable the item pipelines
ITEM_PIPELINES = {
    # 'bw.pipelines.BwPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'bw.pipelines.BwImagesPipeline': 1,
}

# image download root, used by the images pipeline
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
6. Create a start.py file in the project root to launch the spider
from scrapy import cmdline
cmdline.execute("scrapy crawl bw5".split())
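Running python start.py is equivalent to executing scrapy crawl bw5 from the command line. If everything is wired up, the download root should fill up roughly like this, with one folder per uibox title (actual names depend on the live page):
images/
    <title>/
        <hash>.jpg
        ...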