Python Web Page Downloader
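Note: this code was written in PyCharm on a Mac, with Python 2.7 and Python 3 (macOS ships with one; I installed the other myself). PyCharm download (official site): https://www.jetbrains.com/pycharm/download/#section=mac
The registration code has to be obtained from this site: http://idea.lanyus.com
If the registration code does not work on the Mac:
In a terminal, run: sudo vim /private/etc/hosts. You will be asked for your computer password; just type it (nothing is displayed in the terminal as you type). The file then opens in the editor. Press i (insert), add the line 0.0.0.0 account.jetbrains.com to the file, then press Esc and type :wq to save.
To type Chinese in PyCharm:
PyCharm ----> File ----> Default Settings ------> Editor -------> File Encodings; you also need to add # coding:utf-8 at the top of the source file
Setting the Python version to use
(I use 2.7 here; select it when creating the project)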
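Because both Python 2.7 and Python 3 are installed, a script can pick the right module names at import time. This is a minimal sketch, not part of the original code: in Python 3 the functionality of urllib2 lives in urllib.request and cookielib was renamed http.cookiejar. The alias names request_lib and cookie_lib are hypothetical, chosen only for this illustration.

```python
# Try the Python 2 module names first, then fall back to the Python 3 ones.
try:
    import urllib2 as request_lib          # Python 2
    import cookielib as cookie_lib         # Python 2
except ImportError:
    import urllib.request as request_lib   # Python 3
    import http.cookiejar as cookie_lib    # Python 3

# The names used later in this article are available under either version.
print(request_lib.urlopen, request_lib.build_opener, cookie_lib.CookieJar)
```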
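Syntax used in this article:
urllib2.urlopen(url[, data[, timeout]]): creates a file-like object representing the remote URL; you then work with that object as you would a local file to fetch the remote data.
url: the path to the data, usually a web address;
data: parameters (used when submitting data via POST);
timeout: timeout in seconds for the connection. (urllib2.urlopen has no proxies parameter; that belongs to the older urllib.urlopen. With urllib2, a proxy is set through urllib2.ProxyHandler.)
urlopen returns a file-like object that provides the following methods:
read(), readline(), readlines(), fileno(), close()
info(): returns an httplib.HTTPMessage object representing the headers sent back by the remote server;
getcode(): returns the HTTP status code. For an HTTP request, 200 means the request completed successfully; 404 means the address was not found;
geturl(): returns the URL that was actually retrieved (it can differ from the requested one after a redirect)
cookielib.CookieJar(): used to hold cookies (e.g. capturing a site's login information)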
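The methods listed above can be tried without a network connection. The following is a minimal sketch rather than the article's own code: it assumes Python 3, where urlopen lives in urllib.request, and it uses a data: URL as a stand-in for a remote page. Since a data: URL involves no HTTP exchange, there is no meaningful status code, so getcode() is left out here.

```python
import urllib.request

# Assumption: a data: URL stands in for a remote address so the sketch runs offline.
url = "data:text/plain,hello"

response = urllib.request.urlopen(url)
body = response.read()                             # the whole body, as bytes
final_url = response.geturl()                      # URL that was retrieved
content_type = response.info().get_content_type()  # taken from the header information
response.close()

print(body, final_url, content_type)
```

Against a real http:// address the same calls apply, and getcode() would report the status (200, 404, and so on).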
Code example:
#!/usr/bin/env python2
# coding:utf-8
import urllib2
import cookielib

url = "https://www.baidu.com"

# Method 1: pass the URL straight to urlopen
print('***One')
response1 = urllib2.urlopen(url)
print response1.getcode()
print len(response1.read())

# Method 2: build a Request object and add a header
print('***Two')
request = urllib2.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

# Method 3: install an opener that stores cookies in a CookieJar
print('***Three')
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj
print response3.read()
Output:
***One
200
227
***Two
200
227
***Three
200
<CookieJar[<Cookie BIDUPSID=D96D25C9BB820406A577452D0C0D288A for .baidu.com/>, <Cookie PSTM=1494991402 for .baidu.com/>, <Cookie __bsi=14053390655258788841_00_31_N_N_55_0301_002F_N_N_N_0 for .www.baidu.com/>, <Cookie BD_NOT_HTTPS=1 for www.baidu.com/>]>
<html>
<head>
<script>
location.replace(location.href.replace("https://","http://"));
</script>
</head>
<body>
<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
/Commented version/
# coding=utf-8
# Downloading web pages with urllib2
Method 1:
import urllib2
response = urllib2.urlopen('http://www.baidu.com')
print response.getcode()
print len(response.read())
Method 2: add data and headers
Three parameters: url, data, headers
import urllib2
request = urllib2.Request('http://www.baidu.com')  # create the request object
request.add_data('a')  # attach a request body; this turns the request into a POST
request.add_header('User-Agent', 'Mozilla/5.0')  # identify as a browser
response = urllib2.urlopen(request)  # send the request
print response.getcode()
print len(response.read())
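To see what data and headers change on a request before anything is sent, the Request object can be inspected directly. This is a minimal sketch under stated assumptions: it uses Python 3's urllib.request (where add_data no longer exists, so the body is passed to the constructor as bytes), and the example.com address is a hypothetical placeholder.

```python
import urllib.request

# Hypothetical target URL; nothing is sent over the network in this sketch.
url = "http://www.example.com/"

plain = urllib.request.Request(url)
with_data = urllib.request.Request(
    url,
    data=b"a",                              # the body must be bytes in Python 3
    headers={"User-Agent": "Mozilla/5.0"},
)

plain_method = plain.get_method()           # no body, so the method is GET
data_method = with_data.get_method()        # a body is attached, so the method is POST
agent = with_data.get_header("User-agent")  # header names are stored capitalized

print(plain_method, data_method, agent)
```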
Method 3: add a handler for special scenarios
import urllib2, cookielib
cj = cookielib.CookieJar()  # create the cookie container
handler = urllib2.HTTPCookieProcessor(cj)  # create a handler
opener = urllib2.build_opener(handler)  # create an opener
urllib2.install_opener(opener)  # install the opener into urllib2
response = urllib2.urlopen("http://www.baidu.com")  # fetch the page with the cookie-aware urllib2
print response.read()
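The cookie handler in Method 3 does two things with the CookieJar: it stores cookies from each response and attaches matching cookies to each later request. The sketch below walks through those two steps by hand, offline. It rests on several assumptions: it uses the Python 3 names (http.cookiejar, urllib.request), FakeResponse is a hypothetical stand-in for a real server response, and the example.com URLs and the sid cookie are made up for illustration.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Hypothetical stand-in for a server response; the jar only calls info()."""

    def __init__(self, headers):
        self._headers = headers

    def info(self):
        return self._headers


# Pretend the server answered with a single Set-Cookie header.
headers = email.message.Message()
headers["Set-Cookie"] = "sid=abc123; Path=/"

cj = http.cookiejar.CookieJar()

# Step 1: after a response arrives, its cookies are stored in the jar.
first = urllib.request.Request("http://www.example.com/")
cj.extract_cookies(FakeResponse(headers), first)

# Step 2: before a later request is sent, matching cookies are attached to it.
second = urllib.request.Request("http://www.example.com/next")
cj.add_cookie_header(second)

names = [cookie.name for cookie in cj]
sent = second.get_header("Cookie")
print(names, sent)
```

With an installed opener, as in Method 3, both steps happen automatically on every urlopen call.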