By "every image on the site" I mean the site's actual content images; ad slots, recommendation widgets and the like are excluded.
This is a beginner's write-up; veterans, feel free to skip it.
Step 1: find the pattern in the site's paginated URLs
Pick the category you want to crawl (to grab every image, skip the category filter and you'll see all of them; adapt the details to your actual target site).
Check the URL shown in the address bar, then flip to the next few pages and compare their URLs.
From this we can tell the pagination parameter is simply page/<page number>.
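Once the page/N pattern is known, the list-page URLs can be generated directly. A minimal sketch, using the base URL this article targets:

```python
# Build every list-page URL from the page/N pattern found above.
BASE = 'http://www.jitaotu.com/tag/meitui/page/{}'

def page_urls(total):
    """Return the URL of every list page from 1 to total."""
    return [BASE.format(n) for n in range(1, total + 1)]
```

For example, `page_urls(3)` ends with `.../page/3`.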
Step 2: get the total page count and make the URL requests
Several ways to determine the page count:
1. The most direct: read it off the page in your browser.
2. Count how many image sets a full page holds (say 15); any page with fewer than 15 sets is the last one (though the last page can also happen to have exactly 15).
3. Much like method 1, except the program reads the total from the pagination bar.
4. Skip counting entirely: wrap the requests in HTTP error handling, and when a GET returns 404 the crawl is done.
5. Look for a "next page" link on each page; when it's gone, the crawl is finished (awkwardly, some small sites have no "next page" tag at all).
This article uses method 3.
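Method 4 from the list above can be sketched like this. `fetch_status` is a stand-in for a real request (e.g. `requests.get(url).status_code`), so the stopping logic can be shown without touching the network:

```python
def crawl_until_404(fetch_status, max_pages=100000):
    """Request page after page until a 404 says there is nothing left."""
    crawled = []
    for page in range(1, max_pages + 1):
        if fetch_status(page) == 404:
            break  # past the last page: stop
        crawled.append(page)  # a real spider would parse the page here
    return crawled

# Simulate a site with exactly 5 pages:
fake_site = lambda page: 200 if page <= 5 else 404
```

With the simulated site, `crawl_until_404(fake_site)` visits pages 1 through 5 and stops.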
The screenshot shows the total page count.
Capturing the page count in code:
As the screenshot shows, the pagination controls live in a single div. Normally they read "prev 1 2 3 4 5 ... 101 next", nine items in all, so the second-to-last item is the total page count.
The XPath rule for the pagination tags:
Right-click the element and choose Copy → Copy XPath.
Then open a Chrome XPath extension
and paste the copied XPath in.
You can see it returns the total, 101. But since it's the second-to-last tag that holds the total, we fetch the whole list rather than one fixed node: the number of pagination tags changes from page to page, while the total always sits in the same relative position. (This is specific to the site I tested; adjust to your own case.)
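To make "grab the whole list, take the second-to-last item" concrete, here is the same trick run on a tiny inline snippet (the HTML is made up, but mirrors the "prev 1 2 3 ... 101 next" layout described above):

```python
from lxml import etree

# Made-up pagination bar shaped like the one described above.
snippet = '''
<nav><div>
  <a>prev</a><a>1</a><a>2</a><a>3</a><a>...</a><a>101</a><a>next</a>
</div></nav>'''

html = etree.HTML(snippet)
texts = html.xpath('/html/body/nav/div/a/text()')  # all pagination labels
total_pages = int(texts[-2])  # second-to-last label is the total
```

However many page-number links appear, the total stays in the second-to-last slot, so `texts[-2]` is stable where a copied absolute XPath would not be.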
Get the pagination list, call the page-count method, and take the second-to-last value as the total.
Then add a check: when the current page number exceeds the total, break out of the loop.
With paging sorted out, on to crawling the images. Four steps in all:
1. Request the first page and read the total page count.
2. Grab the detail-page links of the 10 image sets on each list page.
3. Request a set's first detail page to learn how many images it contains.
4. Request every detail page and save its image.
This could really be merged into two steps; written this way I make two extra requests. I was too lazy to change it, but feel free to refactor.
After all that rambling, here's the part that's actually useful: multithreading.
Depending on your bandwidth, downloading one image is no problem, but what about 100, or a million?
The slowest part of the whole HTTP flow is requesting the image URL and saving the file, because image resources are large (obviously).
Say a set has 70 images and each save takes 3 seconds; how long is 70 of those? Don't ask me (I barely finished primary school). But with threads, saving one image takes 3 seconds and saving 100 in parallel takes about 3 seconds too (I'll skip the theory, everyone gets it; on to the code).
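The "100 saves take about as long as one" claim can be demonstrated without a network at all; in this toy model, `time.sleep` stands in for the slow image download:

```python
import threading
import time

# Toy model of the claim above: ten slow "saves" finish in roughly the time
# of one when each runs in its own thread.
results = []
lock = threading.Lock()

def save_one(i):
    time.sleep(0.1)          # pretend this is a 0.1-second image download
    with lock:
        results.append(i)

start = time.time()
threads = [threading.Thread(target=save_one, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start  # close to 0.1 s, not 1.0 s
```

This works because the threads spend their time waiting, not computing; downloads are I/O-bound, which is exactly the case where Python threads help despite the GIL.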
Set up a queue.
Add the detail-page URLs to the queue and start the threads. (This could be a single for loop, but every time I tried that it enqueued too many items, so I split the enqueueing and the thread-starting into two loops; try a single loop yourself if you like.)
When each thread finishes, mark one corresponding queue item as done.
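The queue pattern described above, in miniature: enqueue every URL, let each worker call `task_done()` after its item, and `q.join()` returns once every item has been handled.

```python
import threading
from queue import Queue

q = Queue()
processed = []
lock = threading.Lock()

def worker():
    url = q.get()
    with lock:
        processed.append(url)  # a real worker would download and save here
    q.task_done()              # tell the queue this item is finished

urls = ['u0', 'u1', 'u2', 'u3', 'u4']
for u in urls:
    q.put(u)
for _ in urls:
    threading.Thread(target=worker).start()
q.join()  # blocks until task_done() has been called once per item
```

If a worker can fail, it must still call `task_done()` on its error path, otherwise `join()` blocks forever; the multithreaded version below does exactly that in its except branch.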
(Because mac and Windows write paths differently, I didn't add auto-creation of the output folder; before running, create an imgs folder in the same directory as the script.)
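If you'd rather not create the folder by hand, the standard library handles the mac/Windows path difference for you; a small sketch, run in a temp directory so it is self-contained:

```python
import os
import tempfile

base = tempfile.mkdtemp()             # throwaway directory for the demo
img_dir = os.path.join(base, 'imgs')  # os.path.join picks the right separator
os.makedirs(img_dir, exist_ok=True)   # creates the folder if it is missing
os.makedirs(img_dir, exist_ok=True)   # calling it again is harmless
```

Dropping `os.makedirs('./imgs', exist_ok=True)` into the spider's `__init__` would remove the manual step on both platforms.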
Compare the speeds.
The premise of multithreading is that the site doesn't limit request frequency; small sites (and the less reputable kind) usually don't, so... you get the idea!
Crawling the same set of images.
Plain version:
import requests
import random
from lxml import etree


class ImgSpider:
    def __init__(self):
        self.urls = 'http://www.jitaotu.com/tag/meitui/page/{}'
        self.detail = 'http://www.jitaotu.com/xinggan/{}'
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Cookie': 'UM_distinctid=16b7398b425391-0679d7e790c7ad-3e385b04-1fa400-16b7398b426663;Hm_lvt_7a498bb678e31981e74e8d7923b10a80=1561012516;CNZZDATA1270446221 = 1356308073 - 1561011117 - null % 7C1561021918;Hm_lpvt_7a498bb678e31981e74e8d7923b10a80 = 1561022022',
            'Host': 'www.jitaotu.com',
            'If-None-Match': '"5b2b5dc3-2f7a7"',
            'Proxy-Connection': 'keep-alive',
            'Referer': 'http://www.jitaotu.com/cosplay/68913_2.html',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
        }

    def pages(self):
        # Total page count: second-to-last entry in the pagination bar
        response = requests.get(self.urls.format(2))
        html = etree.HTML(response.content.decode())
        page = html.xpath('/html/body/section[1]/nav/div/a/text()')
        return page[-2]

    def html_list(self, page):
        # Links to the image-set detail pages on one list page
        print(self.urls.format(page))
        response = requests.get(self.urls.format(page))
        html = etree.HTML(response.content.decode())
        return html.xpath('/html/body/section[1]/div/ul/li/div[1]/a/@href')

    def detail_page(self, imgde):
        # Number of images in one set: last pagination entry on the detail page
        response = requests.get(self.detail.format(imgde))
        html = etree.HTML(response.content.decode())
        page = html.xpath("//*[@id='imagecx']/div[4]/a[@class='page-numbers']/text()")
        return page[-1]

    def detail_list(self, imgde, page):
        # Visit every detail page of the set and grab the image address
        urls = imgde[-10:-5]  # key fragment cut out of the link
        print('Visiting the image pages and saving the images')
        for i in range(int(page)):
            detail_url = self.detail.format(urls + '_' + str(i + 1) + '.html')
            print(detail_url)
            response = requests.get(detail_url)
            html = etree.HTML(response.content.decode())
            imgs = html.xpath('//*[@id="imagecx"]/div[3]/p/a/img/@src')
            # Save the image
            self.save_img(imgs)

    def save_img(self, imgs):
        print(imgs[0])
        response = requests.get(imgs[0], headers=self.headers)
        # Random 10-character file name
        name = ''.join(random.sample('zyxwvutsrqponmlkjihgfedcba1234567890', 10))
        with open('./imgs/' + name + '.jpg', 'wb') as f:
            f.write(response.content)
        print('Image saved')

    def run(self):
        page = 1
        # Get the total page count
        pageall = int(self.pages())
        print('Total pages: ' + str(pageall))
        while True:
            print('Visiting page ' + str(page))
            # Grab the detail-page links of the 10 sets on this list page
            html_list = self.html_list(page)
            s = 1
            for htmls in html_list:
                print('Visiting page %d, set %d' % (page, s))
                imgdetalpage = self.detail_page(htmls)
                # Walk the detail pages and fetch the image addresses
                print('Page %d, set %d has %s images' % (page, s, imgdetalpage))
                self.detail_list(htmls, imgdetalpage)
                s += 1
            page += 1
            if page > pageall:
                print('Crawl finished, exiting the loop')
                return


if __name__ == '__main__':
    ImgSpider().run()
Multithreaded version:
import requests
import random
import threading
from lxml import etree
from queue import Queue


class ImgSpider:
    def __init__(self):
        self.urls = 'http://www.jitaotu.com/tag/meitui/page/{}'
        self.detail = 'http://www.jitaotu.com/xinggan/{}'
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Cookie': 'UM_distinctid=16b7398b425391-0679d7e790c7ad-3e385b04-1fa400-16b7398b426663;Hm_lvt_7a498bb678e31981e74e8d7923b10a80=1561012516;CNZZDATA1270446221 = 1356308073 - 1561011117 - null % 7C1561021918;Hm_lpvt_7a498bb678e31981e74e8d7923b10a80 = 1561022022',
            'Host': 'www.jitaotu.com',
            'If-None-Match': '"5b2b5dc3-2f7a7"',
            'Proxy-Connection': 'keep-alive',
            'Referer': 'http://www.jitaotu.com/cosplay/68913_2.html',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
        }
        self.url_queue = Queue()

    def pages(self):
        # Total page count: second-to-last entry in the pagination bar
        response = requests.get(self.urls.format(2))
        html = etree.HTML(response.content.decode())
        page = html.xpath('/html/body/section[1]/nav/div/a/text()')
        return page[-2]

    def html_list(self, page):
        # Links to the image-set detail pages on one list page
        print(self.urls.format(page))
        response = requests.get(self.urls.format(page))
        html = etree.HTML(response.content.decode())
        return html.xpath('/html/body/section[1]/div/ul/li/div[1]/a/@href')

    def detail_page(self, imgde):
        # Number of images in one set: last pagination entry on the detail page
        response = requests.get(self.detail.format(imgde))
        html = etree.HTML(response.content.decode())
        page = html.xpath("//*[@id='imagecx']/div[4]/a[@class='page-numbers']/text()")
        return page[-1]

    def detail_list(self, imgde, page):
        # Enqueue every detail-page URL of the set, then start one thread per page
        urls = imgde[-10:-5]  # key fragment cut out of the link
        print('Visiting the image pages and saving the images')
        for i in range(int(page)):
            urlss = self.detail.format(urls + '_' + str(i + 1) + '.html')
            print(urlss)
            self.url_queue.put(urlss)
        for i in range(int(page)):
            t_url = threading.Thread(target=self.more_list)
            t_url.start()
        self.url_queue.join()
        print('Set finished, moving on to the next one')

    def more_list(self):
        # Worker: take one URL off the queue, grab the image address
        urls = self.url_queue.get()
        response = requests.get(urls)
        html = etree.HTML(response.content.decode())
        imgs = html.xpath('//*[@id="imagecx"]/div[3]/p/a/img/@src')
        # Save the image
        self.save_img(imgs)

    def save_img(self, imgs):
        try:
            print(imgs[0])
            response = requests.get(imgs[0], headers=self.headers)
        except Exception:
            # Still mark the item done, or url_queue.join() would block forever
            print('Timed out, skipping')
            self.url_queue.task_done()
            return
        # Random 10-character file name
        name = ''.join(random.sample('zyxwvutsrqponmlkjihgfedcba1234567890', 10))
        with open('./imgs/' + name + '.jpg', 'wb') as f:
            f.write(response.content)
        print('Image saved')
        self.url_queue.task_done()

    def run(self):
        page = 1
        # Get the total page count
        pageall = int(self.pages())
        print('Total pages: ' + str(pageall))
        while True:
            print('Visiting page ' + str(page))
            # Grab the detail-page links of the 10 sets on this list page
            html_list = self.html_list(page)
            s = 1
            for htmls in html_list:
                print('Visiting page %d, set %d' % (page, s))
                imgdetalpage = self.detail_page(htmls)
                # Walk the detail pages and fetch the image addresses
                print('Page %d, set %d has %s images' % (page, s, imgdetalpage))
                self.detail_list(htmls, imgdetalpage)
                s += 1
            page += 1
            if page > pageall:
                print('Crawl finished, exiting the loop')
                return


if __name__ == '__main__':
    ImgSpider().run()
If anything here is unclear, feel free to ask; I'm a newbie too and happy to chat: QQ 1341485724