Goal: scrape the images from the image site http://hunter-its.com
1. Create the beauty project:
scrapy startproject beauty
2. cd into the project directory and generate a new spider from the basic template:
cd beauty
scrapy genspider hunter hunter-its.com
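For reference, the basic template generates a spider skeleton in beauty/spiders/hunter.py roughly like the following (the exact contents vary slightly between Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class HunterSpider(scrapy.Spider):
    name = 'hunter'
    allowed_domains = ['hunter-its.com']
    start_urls = ['http://hunter-its.com/']

    def parse(self, response):
        pass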
3. Open the project in PyCharm and write the item first. Open items.py and define fields for the image name and URL:
import scrapy


class BeautyItem(scrapy.Item):
    name = scrapy.Field()
    address = scrapy.Field()
4. Write the spider file. Import the BeautyItem class defined above, along with Request:
from beauty.items import BeautyItem
from scrapy.http import Request
Use an XPath expression to select all of the image nodes:
pics = response.xpath('//div[@class="pic"]/ul/li')
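Before wiring the selector into the spider, it is worth sanity-checking it interactively; a quick session with scrapy shell, assuming the listing page structure described above:

scrapy shell 'http://hunter-its.com/m/1.html'
>>> pics = response.xpath('//div[@class="pic"]/ul/li')
>>> len(pics)                                   # how many image nodes matched
>>> pics[0].xpath('./a/img/@src').extract()[0]  # first image URL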
Loop over the li nodes and pull each image's name and URL into an item:
for pic in pics:
    item = BeautyItem()
    name = pic.xpath('./a/img/@alt').extract()[0]
    address = pic.xpath('./a/img/@src').extract()[0]
    item['name'] = name
    item['address'] = address
    yield item
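Note that extract()[0] raises an IndexError whenever an li node lacks the expected attribute. A slightly more defensive sketch of the same loop, using extract_first(), which returns None instead of raising:

for pic in pics:
    item = BeautyItem()
    # extract_first() yields None rather than raising when the attribute is missing
    item['name'] = pic.xpath('./a/img/@alt').extract_first()
    item['address'] = pic.xpath('./a/img/@src').extract_first()
    if item['name'] and item['address']:  # skip incomplete nodes
        yield item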
Yield follow-up Requests back into parse to crawl pages 2 through 7 of the listing (a note on the duplicate requests this produces follows the snippet):
for i in range(2, 8):
    url = 'http://hunter-its.com/m/' + str(i) + '.html'
    print(url)
    yield Request(url, callback=self.parse)
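Because every response re-enters parse, pages 2 through 7 are re-yielded on each callback; Scrapy's built-in duplicate filter silently drops the repeats, so the code works, but an explicit guard makes the intent clearer. A sketch of that alternative, yielding the follow-up pages only from the first response:

if response.url == self.start_urls[0]:
    for i in range(2, 8):
        url = 'http://hunter-its.com/m/' + str(i) + '.html'
        yield Request(url, callback=self.parse)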
Complete code:
# -*- coding: utf-8 -*-
import scrapy
from beauty.items import BeautyItem
from scrapy.http import Request


class HunterSpider(scrapy.Spider):
    name = 'hunter'
    allowed_domains = ['hunter-its.com']
    start_urls = ['http://hunter-its.com/m/1.html']

    def parse(self, response):
        # select all of the image nodes
        pics = response.xpath('//div[@class="pic"]/ul/li')
        for pic in pics:
            item = BeautyItem()
            name = pic.xpath('./a/img/@alt').extract()[0]
            address = pic.xpath('./a/img/@src').extract()[0]
            item['name'] = name
            item['address'] = address
            yield item
        # queue up pages 2-7; the dupefilter drops already-seen URLs
        for i in range(2, 8):
            url = 'http://hunter-its.com/m/' + str(i) + '.html'
            print(url)
            yield Request(url, callback=self.parse)
5. Write the item-processing pipeline in pipelines.py, using the requests library to download each image:
import requests


class BeautyPipeline(object):
    def process_item(self, item, spider):
        # pretend to be a browser
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
        # fetch the image with a GET request via the requests library
        r = requests.get(url=item['address'], headers=headers, timeout=4)
        print(item['address'])
        # save the image into a local directory
        with open(r'/Users/vincentwen/Downloads/hunter/' + item['name'] + '.jpg', 'wb') as f:
            f.write(r.content)
        # process_item must return the item (or raise DropItem)
        return item
6. Enable the pipeline by editing ITEM_PIPELINES in settings.py:
ITEM_PIPELINES = {
    'beauty.pipelines.BeautyPipeline': 100,
}
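As an alternative to the hand-rolled pipeline, Scrapy ships a built-in ImagesPipeline that handles downloading, deduplication, and hash-based file naming (it requires Pillow). A minimal sketch of that route; note it expects the item to expose image_urls/images fields, so it does not drop in directly over this tutorial's name/address item:

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/Users/vincentwen/Downloads/hunter'  # download directory

# items.py -- the field names ImagesPipeline looks for
class BeautyItem(scrapy.Item):
    image_urls = scrapy.Field()  # list of image URLs to fetch
    images = scrapy.Field()      # filled in by the pipeline with download results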
7. Run the spider:
scrapy crawl hunter
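While the images themselves are saved by the pipeline, the scraped name/address items can also be exported with Scrapy's built-in feed export flag, e.g.:

scrapy crawl hunter -o items.json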