Python Crawler 120 Source DIY -- 001 -- Scraping Desktop Wallpapers
Target site for this scrape: http://www.netbian.com/fengjing/, which hosts a large number of HD wallpapers. The listing pages show only preview thumbnails; the real HD image is two clicks deeper, so the crawler assembles the HD image URL itself and downloads from there.
Reference post:
The original post used regular expressions with the re module and downloaded only the preview images. I reworked it on that basis to fetch the HD versions; this article is just one possible approach, offered for reference.
Tools:
python3
the requests library
the parsel library
Approach:
Inspecting the page, every preview image sits under the `list` container, and each image maps to a shortened address such as `<a href="/desk/23791.htm...`
Use parsel to extract these initial addresses; the raw result still contains links we do not need:

```python
url = 'http://www.netbian.com/fengjing/'
response = requests.get(url)
sel = parsel.Selector(response.content.decode('gbk'))
lists = sel.css('.list li a::attr(href)').extract()  # initial addresses
```
過濾不需要的地址
查看鏈接獲得高清圖像的地址為:http://www.netbian.com/desk/23791-1920x1080.htm,這里使用startswith()方法挑選出適合的地址并存入新列表,這里直接用一個函數(shù)清洗地址并重新組裝
最終獲得如下地址列表:# 清洗組裝地址 def clearurl(lists): nurls = [] for i in lists: if i.startswith('/desk/'): i = wurl + i[:-4] + '-1920x1080.htm' nurls.append(i) return nurls
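As a quick offline sanity check, the cleaning logic above can be exercised on a few sample hrefs (the sample list here is made up for illustration; only the `/desk/` entry mirrors the site's real pattern):

```python
wurl = 'http://www.netbian.com'

# same cleaning logic as above: keep only /desk/ links,
# strip the '.htm' suffix and append the HD resolution
def clearurl(lists):
    nurls = []
    for i in lists:
        if i.startswith('/desk/'):
            nurls.append(wurl + i[:-4] + '-1920x1080.htm')
    return nurls

sample = ['/fengjing/', '/desk/23791.htm', 'http://www.netbian.com/']
print(clearurl(sample))  # ['http://www.netbian.com/desk/23791-1920x1080.htm']
```

Only the relative `/desk/` link survives the filter; navigation and absolute links are dropped.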
Parse the HD image page to get the final image URL and download it
The image address is taken from the page source:
```python
gqurls = clearurl(lists)
response = requests.get(gqurls[0])  # for testing, take only the first address
sel = parsel.Selector(response.content.decode('gbk'))
gpic = sel.css('td a::attr(href)').extract_first()
image = requests.get(gpic).content
with open('../eg001/' + '1.jpg', 'wb') as f:
    f.write(image)
```
With that, the first image is parsed and downloaded end to end.
Parse the pagination to collect more image addresses
Observing how the page URL changes gives the pattern:
http://www.netbian.com/fengjing/index.htm
http://www.netbian.com/fengjing/index_2.htm
http://www.netbian.com/fengjing/index_3.htm
.....
http://www.netbian.com/fengjing/index_205.htm
Only the first page lacks a number; every later page follows the pattern, so the URLs can be generated as a list:
```python
def urls():
    url_list = ['http://www.netbian.com/fengjing/index_{}.htm'.format(i) for i in range(2, 206)]
    url_list.insert(0, 'http://www.netbian.com/fengjing/')
    return url_list
```
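A quick check of what this produces: 205 entries, with the unnumbered first page inserted at the front:

```python
def urls():
    # pages 2..205 follow the index_N.htm pattern
    url_list = ['http://www.netbian.com/fengjing/index_{}.htm'.format(i) for i in range(2, 206)]
    # the first page has no number, so prepend it separately
    url_list.insert(0, 'http://www.netbian.com/fengjing/')
    return url_list

u = urls()
print(len(u))  # 205
print(u[0])    # http://www.netbian.com/fengjing/
print(u[1])    # http://www.netbian.com/fengjing/index_2.htm
```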
Wrap the sel object in a function
Since a sel object is needed to parse every page, factor its construction out as a function:

```python
# build a sel object
def t_sel(url):
    response = requests.get(url)
    sel = parsel.Selector(response.content.decode('gbk'))
    return sel
```
Complete source:
```python
#!/usr/bin/env python
# coding=utf-8
'''
No. 001
Wallpaper scraper
http://www.netbian.com/fengjing/
'''
import requests
import parsel

url = 'http://www.netbian.com/fengjing/'
wurl = 'http://www.netbian.com'

# build the list of paginated URLs
def urls():
    # limited to one extra page here for testing; use range(2, 206) for all pages
    url_list = ['http://www.netbian.com/fengjing/index_{}.htm'.format(i) for i in range(2, 3)]
    url_list.insert(0, url)
    return url_list

# build a sel object
def t_sel(url):
    response = requests.get(url)
    sel = parsel.Selector(response.content.decode('gbk'))
    return sel

# clean and reassemble the HD page addresses
def clearurl(lists):
    nurls = []
    for i in lists:
        # print(i)
        if i.startswith('/desk/'):
            i = wurl + i[:-4] + '-1920x1080.htm'
            nurls.append(i)
    return nurls

# download every HD image in the list
def savepic(gqurls):
    for g_url in gqurls:
        sel = t_sel(g_url)
        gpic = sel.css('td a::attr(href)').extract_first()
        image = requests.get(gpic).content
        with open('../eg001/' + str(g_url[28:-4]) + '.jpg', 'wb') as f:
            f.write(image)

if __name__ == '__main__':
    ulist = urls()
    for url in ulist:
        sel = t_sel(url)
        lists = sel.css('.list li a::attr(href)').extract()  # initial addresses
        gqurls = clearurl(lists)
        savepic(gqurls)
```
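One detail worth spelling out in `savepic`: the slice `g_url[28:-4]` derives the filename by cutting the fixed 28-character prefix `http://www.netbian.com/desk/` and the trailing `.htm`:

```python
# the HD page URL has a fixed-length prefix before the image id
g_url = 'http://www.netbian.com/desk/23791-1920x1080.htm'
prefix = 'http://www.netbian.com/desk/'
print(len(prefix))    # 28
name = g_url[28:-4]   # drop the prefix and the '.htm' suffix
print(name)           # 23791-1920x1080
```

So each file is saved as `<id>-1920x1080.jpg`, e.g. `23791-1920x1080.jpg`. Note the slice relies on the prefix length staying fixed; a different listing category would need a different offset.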