A crawler is a program that fetches a web page's content, or the resources on the page. Because every page's structure and logic can differ, the way you extract resources differs too, so a crawler is usually targeted: it is written against one specific site.
Below is the source of a simple crawler. "Simple" here means no login is required: once the page source is fetched, the resources can be downloaded directly.
import re
import requests

def Spider(url):
    head = 'http://www.xxx.com'
    r = requests.get(url).text    # page source, decoded to text
    pic_url = re.findall('class="mb10" src="(.*?)"', r, re.S)
    i = 0
    for each in pic_url:
        if '@' in each:           # drop the size suffix after '@'
            each = each[0:each.find('@')]
        print(each)
        pic = requests.get(each)
        fp = open('pic\\' + str(i) + '.jpg', 'wb')   # the pic directory must already exist
        fp.write(pic.content)
        fp.close()
        i += 1
    nextPage = re.findall("<a href='(.*?)' btnmode='true' hideFocus class='pageNext'>", r)
    if len(nextPage) <= 0:
        return
    nextPage = nextPage[0]
    print(nextPage)
    if nextPage.strip() != '':
        nextPage = head + nextPage
    else:
        return
    Spider(nextPage)
Analysis:
(1): the requests module fetches the page source of url

    r = requests.get(url).text

Note that .text is the body decoded to a string, while .content is the raw bytes; since the regex patterns below are strings, matching is done against .text.
(2): the re module locates the target class with a regular expression

    pic_url = re.findall('class="mb10" src="(.*?)"', r, re.S)

The re module searches by class via a regex; every site's markup is different, so the pattern has to be written for the specific page.
This yields a list of src values, i.e. a list of resource URLs.
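The extraction in (2) can be checked offline by running re.findall against a sample of page source. The HTML fragment below is made up to match the pattern above; it is not taken from the real site:

```python
import re

# Hypothetical fragment shaped like the pages this spider targets.
html = '''
<img class="mb10" src="http://img.xxx.com/a.jpg@100w">
<img class="mb10" src="http://img.xxx.com/b.jpg">
'''

# Same pattern as the spider: capture whatever sits between src=" and the next ".
pic_url = re.findall('class="mb10" src="(.*?)"', html, re.S)
print(pic_url)
# ['http://img.xxx.com/a.jpg@100w', 'http://img.xxx.com/b.jpg']
```

The non-greedy (.*?) is what keeps each match from running past the closing quote into the rest of the line.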
(3): iterate over the src list (again following the structure of this page's source), fetch each resource, and write it to a local file

    for each in pic_url:
        if '@' in each:
            each = each[0:each.find('@')]
        print(each)
        pic = requests.get(each)
        fp = open('pic\\' + str(i) + '.jpg', 'wb')
        fp.write(pic.content)
        fp.close()
        i += 1

Here .content is the right choice: an image is binary data, so the raw bytes are written to the file.
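The trimming and file-writing steps in (3) can be exercised without the network. In this sketch, a byte string stands in for requests.get(each).content, a temporary directory stands in for the pic folder, and enumerate replaces the manual i counter:

```python
import os
import tempfile

# Hypothetical resource URLs; the first carries a size suffix after '@'.
urls = ['http://img.xxx.com/a.jpg@100w', 'http://img.xxx.com/b.jpg']
out_dir = tempfile.mkdtemp()           # stand-in for the 'pic' directory
saved = []

for i, each in enumerate(urls):
    if '@' in each:                     # drop the size suffix, keep the real URL
        each = each[:each.find('@')]
    data = b'fake image bytes'          # stand-in for requests.get(each).content
    path = os.path.join(out_dir, str(i) + '.jpg')
    with open(path, 'wb') as fp:        # 'wb': images are binary data
        fp.write(data)
    saved.append(path)

print(saved)
```

Using a with block (or fp.close() as in the original) ensures each file handle is released before the next download starts.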
(4): nextPage implements automatic paging: if a "next page" link is found in the source, Spider is called again on it; otherwise the crawl stops.
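One caveat with the recursive Spider(nextPage) call: on a site with many pages it will eventually hit Python's recursion limit. The same paging logic can be written as a loop. Here fetch is a stub keyed on made-up URLs, standing in for requests.get(url).text:

```python
import re

# Stub pages; in the real spider each value would come from requests.get(url).text.
PAGES = {
    '/p1': "<a href='/p2' btnmode='true' hideFocus class='pageNext'>next</a>",
    '/p2': "<a href='/p3' btnmode='true' hideFocus class='pageNext'>next</a>",
    '/p3': "last page, no next link",
}

def fetch(url):
    return PAGES[url]

visited = []
url = '/p1'
while url:
    html = fetch(url)
    visited.append(url)
    # ... download the pictures on this page here ...
    nxt = re.findall("<a href='(.*?)' btnmode='true' hideFocus class='pageNext'>", html)
    url = nxt[0] if nxt and nxt[0].strip() else None   # stop when no next link

print(visited)
# ['/p1', '/p2', '/p3']
```

The loop terminates exactly where the recursive version returns: when the pageNext pattern finds nothing, or finds only an empty href.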