It's been a while since I last wrote a crawler; I've been busy with other things lately. Today I watched a video about scraping images, and out of boredom I wrote a crawler for 千图网 (58pic.com) myself. It took me a surprisingly long time, so I'm clearly rusty; once I finish catching up on web fundamentals I really need to practice more.
OK, let's first take a look at the site and see how to traverse all of it. The screenshot shows the entry point I chose.
With the entry point found, the rest is straightforward. The main idea is the same as crawling the whole Yisou (宜搜) site earlier: first scrape the URLs of all the sub-categories on the homepage, then use each sub-category's page count to construct the URL of every listing page, and from there traverse the whole site. Here is the main spider code from the Scrapy project.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from qiantu_spider.items import QiantuSpiderItem


class QiantuSpider(scrapy.Spider):
    name = "qiantu"
    allowed_domains = ["58pic.com"]
    start_urls = ['http://58pic.com/']

    def parse(self, response):
        # Grab the URL of every sub-category linked from the homepage.
        all_url = response.xpath('//div[@class="moren-content"]/a/@href').extract()
        for single_url in all_url:
            # Turn each category into its first-page URL so the next
            # callback can read that category's maximum page count.
            each_html = single_url + '0/day-1.html'
            # Pass the category's base URL along to the next function.
            yield Request(each_html, callback=self.list_page, meta={'front': single_url})

    def list_page(self, response):
        front_url = response.meta['front']
        try:
            # Extract the maximum page number of this category.
            max_page = response.xpath('//*[@id="showpage"]/a[8]/text()').extract()[0]
            print(max_page)
            try:
                for i in range(1, int(max_page) + 1):
                    # Construct every listing-page URL of this category;
                    # the image addresses are extracted from these pages next.
                    img_page = front_url + '0/day-' + str(i) + '.html'
                    yield Request(url=img_page, callback=self.get_img_link)
            except Exception:
                print('This page has no data')
        except Exception:
            print('Page has no maximum page count, discarding it')

    def get_img_link(self, response):
        item = QiantuSpiderItem()
        img_link1 = response.xpath("//a[@class='thumb-box']/img/@src").extract()
        if img_link1:
            # This site is a bit quirky: images are embedded in two different
            # page layouts, so the two cases are handled separately.
            item['img_urls'] = img_link1
            yield item
        else:
            img_link2 = response.xpath('//*[@id="main"]/div/div/div/div/div/a/img/@src').extract()
            item['img_urls'] = img_link2
            yield item
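The spider imports QiantuSpiderItem from items.py, which the post doesn't show. A minimal sketch that matches the single img_urls field used above would be (assuming the standard Scrapy item template):

# items.py - a minimal sketch; the spider above only ever sets 'img_urls'.
import scrapy

class QiantuSpiderItem(scrapy.Item):
    img_urls = scrapy.Field()  # list of image URLs scraped from one listing page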
Below is the pipelines code, which downloads the images into a specified folder using the urlretrieve method.
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import urllib.request
import re
import os


class QiantuSpiderPipeline(object):
    def process_item(self, item, spider):
        file = 'E://qiantu/'
        os.makedirs(file, exist_ok=True)  # make sure the target folder exists
        for url in item['img_urls']:
            try:
                # Strip everything after the '!' in the image URL; what is
                # left is the address of the high-resolution image.
                real_url = re.sub(r'!(.*)', '', url)
                # Keep the tail of the URL and drop characters that are
                # illegal in file names; this part nearly killed me.
                name = real_url[-24:].replace('/', '')
                urllib.request.urlretrieve(real_url, filename=file + name)
            except Exception as e:
                print(e, 'This image has no high-resolution address')
        print('Successfully downloaded one page of images')
        return item
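To see what that re.sub call does, here is a quick check; the '!w290'-style thumbnail suffix and the exact URL are made-up examples for illustration, not taken from the site:

import re

# Hypothetical thumbnail URL; the post says the site marks thumbnails
# with a '!' suffix, and stripping it yields the high-resolution address.
url = 'http://pic.58pic.com/58pic/sample/thumb-12345.jpg!w290'
print(re.sub(r'!(.*)', '', url))
# -> http://pic.58pic.com/58pic/sample/thumb-12345.jpg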
Crawling the whole 千图网 site is simple, but remember to turn off the robots.txt rule in settings, and ideally fake a User-Agent as well.
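Concretely, that means something like the following in settings.py; the User-Agent string is just an illustrative value, and the pipeline path assumes the standard Scrapy project layout:

# settings.py - the relevant tweaks for this crawl
ROBOTSTXT_OBEY = False  # stop Scrapy from honoring 58pic.com's robots.txt

# Any common browser User-Agent works; this one is illustrative.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0 Safari/537.36')

ITEM_PIPELINES = {
    'qiantu_spider.pipelines.QiantuSpiderPipeline': 300,
}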
The screenshot shows the result after just a few minutes of crawling.
Next time I need to pick a really challenging site to crawl. GitHub address: https://github.com/xiaobeibei26