First, here is the original code, copied straight from the book:
from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

rp = RobotFileParser()
# Fetch robots.txt, decode it, and hand the lines to the parser
rp.parse(urlopen('http://www.reibang.com/robots.txt').read().decode('utf-8').split('\n'))
print(rp.can_fetch('*', 'http://www.reibang.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'http://www.reibang.com/search?q=python&page=1&type=collections'))
PyCharm then reported the following error:
Traceback (most recent call last):
  File "E:/PythonProject/PaChong/first.py", line 15, in <module>
    rp.parse((urlopen('http://www.reibang.com/robots.txt').read().decode('utf-8').split('\n')))
  File "E:\Python\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "E:\Python\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "E:\Python\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "E:\Python\lib\urllib\request.py", line 563, in error
    result = self._call_chain(*args)
  File "E:\Python\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "E:\Python\lib\urllib\request.py", line 755, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "E:\Python\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "E:\Python\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "E:\Python\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "E:\Python\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "E:\Python\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Look straight at the last line: HTTP Error 403: Forbidden.
After some searching, I found the cause: when you open a URL with urllib.request.urlopen, the server receives only a bare request for that page and learns nothing about the browser, operating system, and so on behind it. Requests missing this information often look like non-human traffic, so some sites simply block them.
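To see why, you can inspect the headers a fresh urllib opener attaches by default. A minimal sketch (the exact version suffix depends on your Python build):

from urllib import request

opener = request.build_opener()
# The default headers sent with every request through this opener,
# typically [('User-agent', 'Python-urllib/3.x')], which makes it
# obvious the request comes from a script rather than a browser
print(opener.addheaders)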
So how do we fix this? Just add a User-Agent header to the request, like so:
from urllib.robotparser import RobotFileParser
from urllib import request

rp = RobotFileParser()
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
}
url = 'http://www.reibang.com/robots.txt'
# Build a Request carrying the User-Agent header, then open it
req = request.Request(url=url, headers=headers)
response = request.urlopen(req)
# Feed the decoded robots.txt lines to the parser
rp.parse(response.read().decode('utf-8').split('\n'))
print(rp.can_fetch('*', 'http://www.reibang.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'http://www.reibang.com/search?q=python&page=1&type=collections'))
With that, the problem is completely solved.
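As an aside, if you would rather let RobotFileParser fetch robots.txt itself via set_url() and read(), you can install a global opener whose User-Agent overrides the default. A minimal sketch of that variant, reusing the same User-Agent string as above:

from urllib import request
from urllib.robotparser import RobotFileParser

# Install a global opener so every urlopen() call, including the one
# RobotFileParser.read() makes internally, carries our User-Agent
opener = request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')]
request.install_opener(opener)

rp = RobotFileParser()
rp.set_url('http://www.reibang.com/robots.txt')
rp.read()  # fetched through the installed opener, so no 403
print(rp.can_fetch('*', 'http://www.reibang.com/p/b67554025d7d'))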