Basics:
1. URL (Uniform Resource Locator): the address of a standard resource on the Internet, commonly called a "web address".
2. In Python 3.x the urllib2 library no longer exists; there is only urllib.
3. URL encoding is also called percent-encoding.
4. Python 2.7's urllib2 roughly corresponds to urllib.request in Python 3 (its exceptions moved to urllib.error),
and robotparser became a module inside the urllib package.
According to the official manual, urllib is a package for working with URLs:
It contains four modules:
1. urllib.request, for opening and reading URLs.
     1.1. The urlopen function is the usual way to open a URL.
     1.2. Building an opener with the build_opener function is the advanced way to open pages (see the sketch after this list).
2. urllib.error, containing the exceptions raised while running urllib.request.
3. urllib.parse, for parsing URLs.
4. urllib.robotparser, for parsing robots.txt files.
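As a minimal sketch of item 1.2 (the crawler name in the User-Agent header is hypothetical; any identifying string works):

from urllib import request

# build_opener returns an OpenerDirector; addheaders sets headers sent with every request
opener = request.build_opener()
opener.addheaders = [('User-Agent', 'MyCrawler/1.0')]  # hypothetical identifier
response = opener.open('https://docs.python.org/3.5/', timeout=5)
print(response.getcode())  # 200 on success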
I. Common functions in urllib.request
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
1. The urllib.request module uses HTTP/1.1 and includes a Connection: close header in its HTTP requests.
2. The optional timeout parameter specifies a timeout in seconds for blocking operations such as the connection attempt; if the connection has not succeeded after timeout seconds, a timeout exception is raised. If it is not set, the global default timeout is used.
3. For HTTP and HTTPS URLs, this function returns an http.client.HTTPResponse object (slightly modified), which has the following methods (a combined sketch follows this list):
- The object is file-like, so the usual file methods (read, readline, fileno, close) can all be used.
- geturl(): returns the URL of the resource actually retrieved.
- getcode(): returns the HTTP status code of the response; 200 means the request succeeded, 404 means the requested resource was not found.
- info(): returns an http.client.HTTPMessage object holding the headers returned by the remote server.
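A minimal sketch tying points 1-3 together, assuming the docs.python.org URL used in the examples below; the exception classes come from the urllib.error module listed above:

from urllib import request, error

try:
    # use an explicit 5-second timeout instead of the global default
    with request.urlopen('https://docs.python.org/3.5/', timeout=5) as response:
        print(response.geturl())    # the URL actually retrieved (after any redirect)
        print(response.getcode())   # 200 means the request succeeded
        print(response.info().get('Content-Type'))  # one of the server's headers
        body = response.read()      # the body, as bytes
except error.HTTPError as e:
    print('the server answered with an error:', e.code)  # e.g. 404
except error.URLError as e:
    print('failed to reach the server:', e.reason)       # covers timeouts too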
二镀梭、urllib.parse中常用函數(shù):
1. urllib.parse.urlparse(url, scheme='', allow_fragments=True):
- Parses a URL and splits it into 6 components.
- Returns a 6-element tuple (scheme, netloc, path, params, query, fragment); it is a urllib.parse.ParseResult object,
and the object has attributes corresponding to those 6 components.
eg:
>>> from urllib import parse
>>> url = r'https://docs.python.org/3.5/search.html?q=parse&check_keywords=yes&area=default'
>>> parseResult = parse.urlparse(url)
>>> parseResult  # split the address into its components
ParseResult(scheme='https', netloc='docs.python.org', path='/3.5/search.html', params='', query='q=parse&check_keywords=yes&area=default', fragment='')
>>> parseResult.query
'q=parse&check_keywords=yes&area=default'
The output makes clear what each component means.
2. urllib.parse.urlunparse(components)
- The inverse of urlparse.
- Input is a 6-element tuple; output is the complete URL string.
3. urllib.parse.urljoin(base, url, allow_fragments=True)
        Join a base URL and a possibly relative URL to form an absolute
        interpretation of the latter.
- base is the base URL.
- base is combined with the relative address in the second argument to produce an absolute URL.
eg:
>>> scheme = 'http'
>>> netloc = 'www.python.org'
>>> path = 'lib/module-urlparse.html'
>>> modlist = ('urllib', 'urllib2', 'httplib')
>>> unparsed_url = parse.urlunparse((scheme, netloc, path, '', '', ''))
>>> unparsed_url
'http://www.python.org/lib/module-urlparse.html'
>>> for mod in modlist:
...     url = parse.urljoin(unparsed_url, 'module-%s.html' % mod)
...     print(url)
# the relative part replaces everything after the last "/" of the base
http://www.python.org/lib/module-urllib.html
http://www.python.org/lib/module-urllib2.html
http://www.python.org/lib/module-httplib.html
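Two more urljoin cases are worth noting: a relative URL beginning with "/" replaces the whole path, and an absolute URL replaces the base entirely:

>>> parse.urljoin(unparsed_url, '/about.html')
'http://www.python.org/about.html'
>>> parse.urljoin(unparsed_url, 'http://example.com/x.html')
'http://example.com/x.html'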
4. urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace'):
- Parses a query given as a string argument.
- The qs argument is a percent-encoded query string (as in a GET request).
- Returns a dict mapping each parameter name to a list of values.
eg (continuing from the urlparse example above):
>>> param_dict = parse.parse_qs(parseResult.query)
>>> param_dict
{'area': ['default'], 'check_keywords': ['yes'], 'q': ['parse']}
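The related parse_qsl (documented in the dump at the end of this article) returns a list of (name, value) pairs instead, preserving parameter order:

>>> parse.parse_qsl(parseResult.query)
[('q', 'parse'), ('check_keywords', 'yes'), ('area', 'default')]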
5. urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)
# merges a query dict (or sequence of pairs) into a single string and percent-encodes it
>>> from urllib import parse
>>> query={'name':'walker','age':99}
>>> parse.urlencode(query)
'name=walker&age=99'
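If a value is itself a sequence, pass doseq=True to turn each element into a separate parameter:

>>> parse.urlencode({'tag': ['a', 'b']}, doseq=True)
'tag=a&tag=b'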
Summary:
Items 1 and 2 handle the URL as a whole, splitting it apart and putting it back together.
Items 4 and 5 handle the query component of a URL.
6. urllib.parse.quote(string, safe='/', encoding=None, errors=None)
# percent-encodes a string
1. If a URL string contains Chinese characters, the Chinese part must be brought to a consistent encoding (converted from GBK to UTF-8 if that is the source encoding) before being passed through urllib.parse.quote(); only then can the URL be opened normally, otherwise the encoding goes wrong.
2. Likewise, when extracting a Chinese field from a URL, unquote() it first and then decode it, as GBK or UTF-8 depending on the situation. (A sketch follows the example below.)
eg:
>>> from urllib import parse
>>> parse.quote('a&b/c')       # the slash is not encoded ('/' is in safe by default)
'a%26b/c'
>>> parse.quote_plus('a&b/c')  # the slash is encoded as well (safe defaults to '')
'a%26b%2Fc'
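To make the GBK/UTF-8 point concrete: in Python 3, quote() and unquote() accept an encoding argument directly, so no manual encode/decode step is needed. A sketch:

>>> parse.quote('中文')                     # UTF-8 is the default
'%E4%B8%AD%E6%96%87'
>>> parse.quote('中文', encoding='gbk')     # percent-encode the GBK bytes instead
'%D6%D0%CE%C4'
>>> parse.unquote('%E4%B8%AD%E6%96%87')     # decoded as UTF-8 by default
'中文'
>>> parse.unquote('%D6%D0%CE%C4', encoding='gbk')
'中文'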
7. urllib.parse.unquote(string, encoding='utf-8', errors='replace')
>>> parse.unquote('1+2')
'1+2'
>>> parse.unquote_plus('1+2')
'1 2'
III. urllib.robotparser
Parses a site's robots.txt file to check whether a given crawler is allowed to fetch a URL.
eg:
>>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')  # point at the robots.txt file
>>> rp.read()  # fetch and parse it
>>> url = 'http://example.webscraping.com'
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True
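Putting robotparser and urllib.request together gives a minimal "polite crawler" sketch (reusing the example.webscraping.com URL from above):

from urllib import request, robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.webscraping.com/robots.txt')
rp.read()                          # fetch and parse robots.txt
user_agent = 'GoodCrawler'
url = 'http://example.webscraping.com'
if rp.can_fetch(user_agent, url):  # only fetch what robots.txt allows
    req = request.Request(url, headers={'User-Agent': user_agent})
    with request.urlopen(req, timeout=5) as response:
        page = response.read()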
For the full details, see the function documentation below (the output of help(urllib.parse)):
FUNCTIONS
    parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')
        Parse a query given as a string argument.
        Arguments:
        qs: percent-encoded query string to be parsed
        keep_blank_values: flag indicating whether blank values in
            percent-encoded queries should be treated as blank strings.
            A true value indicates that blanks should be retained as
            blank strings.  The default false value indicates that
            blank values are to be ignored and treated as if they were
            not included.
        strict_parsing: flag indicating what to do with parsing errors.
            If false (the default), errors are silently ignored.
            If true, errors raise a ValueError exception.
        encoding and errors: specify how to decode percent-encoded sequences
            into Unicode characters, as accepted by the bytes.decode() method.
    parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')
        Parse a query given as a string argument.
        Arguments:
        qs: percent-encoded query string to be parsed
        keep_blank_values: flag indicating whether blank values in
            percent-encoded queries should be treated as blank strings.  A
            true value indicates that blanks should be retained as blank
            strings.  The default false value indicates that blank values
            are to be ignored and treated as if they were not included.
        strict_parsing: flag indicating what to do with parsing errors. If
            false (the default), errors are silently ignored. If true,
            errors raise a ValueError exception.
        encoding and errors: specify how to decode percent-encoded sequences
            into Unicode characters, as accepted by the bytes.decode() method.
        Returns a list, as G-d intended.
    quote(string, safe='/', encoding=None, errors=None)
        quote('abc def') -> 'abc%20def'
        Each part of a URL, e.g. the path info, the query, etc., has a
        different set of reserved characters that must be quoted.
        RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax lists
        the following reserved characters.
        reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                      "$" | ","
        Each of these characters is reserved in some component of a URL,
        but not necessarily in all of them.
        By default, the quote function is intended for quoting the path
        section of a URL.  Thus, it will not encode '/'.  This character
        is reserved, but in typical usage the quote function is being
        called on a path where the existing slash characters are used as
        reserved characters.
        string and safe may be either str or bytes objects. encoding and errors
        must not be specified if string is a bytes object.
        The optional encoding and errors parameters specify how to deal with
        non-ASCII characters, as accepted by the str.encode method.
        By default, encoding='utf-8' (characters are encoded with UTF-8), and
        errors='strict' (unsupported characters raise a UnicodeEncodeError).
    quote_from_bytes(bs, safe='/')
        Like quote(), but accepts a bytes object rather than a str, and does
        not perform string-to-bytes encoding.  It always returns an ASCII string.
        quote_from_bytes(b'abc def?') -> 'abc%20def%3f'
    quote_plus(string, safe='', encoding=None, errors=None)
        Like quote(), but also replace ' ' with '+', as required for quoting
        HTML form values. Plus signs in the original string are escaped unless
        they are included in safe. It also does not have safe default to '/'.
    unquote(string, encoding='utf-8', errors='replace')
        Replace %xx escapes by their single-character equivalent. The optional
        encoding and errors parameters specify how to decode percent-encoded
        sequences into Unicode characters, as accepted by the bytes.decode()
        method.
        By default, percent-encoded sequences are decoded with UTF-8, and invalid
        sequences are replaced by a placeholder character.
        unquote('abc%20def') -> 'abc def'.
    unquote_plus(string, encoding='utf-8', errors='replace')
        Like unquote(), but also replace plus signs by spaces, as required for
        unquoting HTML form values.
        unquote_plus('%7e/abc+def') -> '~/abc def'
    unquote_to_bytes(string)
        unquote_to_bytes('abc%20def') -> b'abc def'.
    urldefrag(url)
        Removes any existing fragment from URL.
        Returns a tuple of the defragmented URL and the fragment.  If
        the URL contained no fragments, the second element is the
        empty string.
    urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)
        Encode a dict or sequence of two-element tuples into a URL query string.
        If any values in the query arg are sequences and doseq is true, each
        sequence element is converted to a separate parameter.
        If the query arg is a sequence of two-element tuples, the order of the
        parameters in the output will match the order of parameters in the
        input.
        The components of a query arg may each be either a string or a bytes type.
        The safe, encoding, and errors parameters are passed down to the function
        specified by quote_via (encoding and errors only if a component is a str).
    urljoin(base, url, allow_fragments=True)
        Join a base URL and a possibly relative URL to form an absolute
        interpretation of the latter.
    urlparse(url, scheme='', allow_fragments=True)
        Parse a URL into 6 components:
        <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        Return a 6-tuple: (scheme, netloc, path, params, query, fragment).
        Note that we don't break the components up in smaller bits
        (e.g. netloc is a single string) and we don't expand % escapes.
    urlsplit(url, scheme='', allow_fragments=True)
        Parse a URL into 5 components:
        <scheme>://<netloc>/<path>?<query>#<fragment>
        Return a 5-tuple: (scheme, netloc, path, query, fragment).
        Note that we don't break the components up in smaller bits
        (e.g. netloc is a single string) and we don't expand % escapes.
    urlunparse(components)
        Put a parsed URL back together again.  This may result in a
        slightly different, but equivalent URL, if the URL that was parsed
        originally had redundant delimiters, e.g. a ? with an empty query
        (the draft states that these are equivalent).
    urlunsplit(components)
        Combine the elements of a tuple as returned by urlsplit() into a
        complete URL as a string. The data argument can be any five-item iterable.
        This may result in a slightly different, but equivalent URL, if the URL that
        was parsed originally had unnecessary delimiters (for example, a ? with an
        empty query; the RFC states that these are equivalent).