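I. Requirement:
Use Python to download the images on a Neihan Duanzi (內涵段子, neihan8.com) web page to the local disk.
II. Implementation:
1. Get the URL of the page to crawl
2. Set the headers
3. Request the page content and convert the HTML into XML (an element tree that XPath can query)
4. Parse the content and download the images
III. Walkthrough, using the page in the figure below as the example
1. Get the URL of the page to crawl: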
url="http://www.neihan8.com/gaoxiaomanhua/index_2.html"
2. Set the headers:
headers={"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"}
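The code in this article is Python 2 (urllib2). For readers on Python 3, here is a minimal sketch of the same two steps, assuming urllib.request as the replacement for urllib2. The helper name build_request is my own and is not part of any library. Nothing is sent over the network here; the request object is only built and inspected.

```python
from urllib.request import Request

url = "http://www.neihan8.com/gaoxiaomanhua/index_2.html"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"}


def build_request(page_url, extra_headers):
    """Create a request object; nothing is sent until urlopen() is called."""
    return Request(page_url, headers=extra_headers)


request = build_request(url, headers)
print(request.full_url)
# urllib stores header names capitalized, so the lookup key is "User-agent"
print(request.get_header("User-agent"))
```

Setting a browser-like User-Agent matters because some sites refuse requests that announce themselves with the default Python client string.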
3. Request the page content and convert the HTML into XML:
request = urllib2.Request(url,headers=headers)
response = urllib2.urlopen(request).read()
content = etree.HTML(response)  # etree has to be imported first: from lxml import etree
4. Parse the content and download the images. Using the figure above, we can work out the exact XPath of the image addresses.
linklist = content.xpath('/html/body/div[@class="main wrap"]//div[@class="left"]/div[@class="pic-column-list mt10"]/div/a/img/@src')
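If lxml is not installed, roughly the same extraction can be approximated with the standard library. This is not the XPath approach used in this article but a swapped-in technique: a small html.parser subclass (Python 3) that collects the src of every img inside the div whose class is "pic-column-list mt10". The class name ImageSrcCollector and the sample HTML are made up for illustration; the real page will have more markup around it.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect img src values found inside a div with the given class attribute."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0    # div nesting depth inside the target container; 0 means outside
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif attrs.get("class") == self.target_class:
                self.depth = 1
        elif tag == "img" and self.depth > 0 and attrs.get("src"):
            self.links.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


# Hypothetical sample markup, shaped like the structure the XPath expects
sample_html = """
<div class="pic-column-list mt10">
  <div><a href="/x"><img src="http://example.com/a.jpg"></a></div>
  <div><a href="/y"><img src="http://example.com/b.png"></a></div>
</div>
<div class="other"><img src="http://example.com/ad.gif"></div>
"""

collector = ImageSrcCollector("pic-column-list mt10")
collector.feed(sample_html)
print(collector.links)
```

The XPath version is shorter and more precise, so this is only worth using when adding lxml as a dependency is not an option.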
PS: linklist holds every value matched by this XPath expression, so to download the images we have to go through the list one entry at a time:
for link in linklist:
    image_request = urllib2.Request(link)
    response = urllib2.urlopen(image_request).read()
    fileName = link[-10:]  # use the last 10 characters of the address as the file name
    with open(fileName, "wb") as f:
        f.write(response)
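Slicing the last 10 characters works for the addresses on this page, but it depends on how long the file names happen to be. Here is a hedged Python 3 sketch of a more general way to name the files, and of how relative src values could be resolved if the page ever returns them. The helper names resolve_link and filename_from_link, the default name and the sample paths are all hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def resolve_link(page_url, src):
    """Turn a possibly relative src value into an absolute URL."""
    return urljoin(page_url, src)


def filename_from_link(link, default="image.jpg"):
    """Use the last path segment of the URL as the local file name."""
    name = posixpath.basename(urlsplit(link).path)
    return name or default


page_url = "http://www.neihan8.com/gaoxiaomanhua/index_2.html"
# Hypothetical src values, one relative and one absolute with a query string
samples = ["/upload/pic/0123456789.jpg", "http://img.example.com/a/b.png?size=large"]
for src in samples:
    link = resolve_link(page_url, src)
    print(link, "->", filename_from_link(link))
```

Taking the name from the URL path also drops any query string, which a fixed-length slice would otherwise leave inside the file name.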
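The sections above explain each step separately. All of the code was typed by hand, and this is my first article, so please forgive the rough edges. The complete code follows: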
import urllib2
from lxml import etree


class Spider:
    def __init__(self):
        self.pageNum = 2  # number of the listing page to crawl
        self.switch = True

    def loadImage(self):
        # Fetch the listing page with a browser-like User-Agent
        url = "http://www.neihan8.com/gaoxiaomanhua/index_" + str(self.pageNum) + ".html"
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"}
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request).read()
        # Parse the HTML and collect the src attribute of every image in the list
        content = etree.HTML(response)
        linklist = content.xpath('/html/body/div[@class="main wrap"]//div[@class="left"]/div[@class="pic-column-list mt10"]/div/a/img/@src')
        for image_link in linklist:
            print "downLoading..."
            self.writeImage(image_link)

    def writeImage(self, link_address):
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"}
        download_request = urllib2.Request(link_address, headers=headers)
        response = urllib2.urlopen(download_request).read()
        # Use the last 10 characters of the address as the local file name
        fileName = link_address[-10:]
        with open(fileName, "wb") as f:
            f.write(response)
        print "downLoad---FINISH"


if __name__ == "__main__":
    spider = Spider()
    spider.loadImage()
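The Spider class keeps a pageNum, so the obvious next step is to crawl more than one listing page. Here is a small Python 3 sketch that only generates the page addresses. It assumes the other pages follow the same index_N.html pattern as index_2.html, which is the only address this article has actually checked. The names PAGE_TEMPLATE and page_urls are my own.

```python
PAGE_TEMPLATE = "http://www.neihan8.com/gaoxiaomanhua/index_{}.html"


def page_urls(first_page, last_page):
    """Yield listing-page URLs for first_page..last_page, inclusive."""
    for page_num in range(first_page, last_page + 1):
        # Assumption: every page number uses the same index_N.html pattern
        yield PAGE_TEMPLATE.format(page_num)


for page_url in page_urls(2, 4):
    print(page_url)
```

Each generated address could then be handed to loadImage in place of the fixed self.pageNum.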