7-22
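Scraping the Maoyan Top 100 movies
References: https://www.bilibili.com/video/av21857172/?p=14
https://blog.csdn.net/yaoyefengchen/article/details/79025943
1. Fetching a single page
2. Using regular expressions
3. Multithreaded scraping
Fetching a single page
On the Maoyan board page, right-click → Inspect, click the arrow icon in the top-left corner of the developer tools, then select an element on the page. That shows roughly where the content you want to scrape sits in the HTML.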
import requests
from requests.exceptions import RequestException


def get_one_page(url):
    # Return the page HTML, or None if the request fails or is rejected.
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def main():
    url = 'http://maoyan.com/board/4'
    html = get_one_page(url)
    print(html)


if __name__ == '__main__':
    main()
The result, however, was a bit awkward (screenshot: 2.PNG).
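A small modification:
```
def get_one_page(url):
    # Send a browser-like user-agent instead of the requests default.
    kv = {"user-agent": "Mozilla/5.0"}
    try:
        rep = requests.get(url, headers=kv)
        if rep.status_code == 200:
            return rep.text
        return None
    except RequestException:
        return None
```
Now it runs successfully.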
This is the request-header change I covered in my day-6 post. Generally speaking, simple anti-scraping measures just check the user-agent.
The rest of the code runs fine as it is.
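To make the user-agent point concrete, here is a minimal sketch that builds a request with a browser-like header and inspects it without sending anything. It swaps in the standard library's `urllib.request` (not the `requests` library used in this post) so it runs offline, and `build_request` is a hypothetical helper name. By default, `requests` identifies itself as `python-requests/<version>`, which is the kind of value a simple check can reject.

```python
from urllib.request import Request


def build_request(url, user_agent="Mozilla/5.0"):
    # Hypothetical helper: attach a browser-like User-Agent header.
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://maoyan.com/board/4")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))  # prints: Mozilla/5.0
```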
Using regular expressions
Compare the pattern with the HTML to be scraped:
import re

# Raw strings keep \d from being read as a string escape.
pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name">'
                     + r'<a.*?>(.*?)</a>.*?"star">(.*?)</p>.*?releasetime">(.*?)</p>'
                     + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
<dd>
    <i class="board-index board-index-10">10</i>
    <a href="/films/2760" title="魂斷藍橋" class="image-link" data-act="boarditem-click" data-val="{movieId:2760}">
        <img src="http://ms0.meituan.net/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
        <img data-src="http://p0.meituan.net/movie/46c29a8b8d8424bdda7715e6fd779c66235684.jpg@160w_220h_1e_1c" alt="魂斷藍橋" class="board-img" />
    </a>
    <div class="board-item-main">
        <div class="board-item-content">
            <div class="movie-item-info">
                <p class="name"><a href="/films/2760" title="魂斷藍橋" data-act="boarditem-click" data-val="{movieId:2760}">魂斷藍橋</a></p>
                <p class="star">
                    主演:費雯·麗,羅伯特·泰勒,露塞爾·沃特森
                </p>
                <p class="releasetime">上映時間:1940-05-17(美國)</p> </div>
            <div class="movie-item-number score-num">
                <p class="score"><i class="integer">9.</i><i class="fraction">2</i></p>
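To be honest, I am not very familiar with regular expressions myself and need to study them properly. I will dig up the links from my earlier posts and go through them again.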
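As a rough sketch of how the pattern lines up with the HTML, the snippet below runs it against a trimmed copy of the entry shown in this post. The closing `</dd>` is added by hand so the pattern can match. `parse_one_page` and the dictionary keys are illustrative names, not something fixed by the site.

```python
import re

# The pattern from this post, written with raw strings.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name">'
    r'<a.*?>(.*?)</a>.*?"star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S,  # let "." match newlines, since each entry spans several lines
)


def parse_one_page(html):
    """Yield one dict per <dd> entry matched in the page (hypothetical helper)."""
    for index, image, title, star, release, integer, fraction in PATTERN.findall(html):
        yield {
            "index": index,
            "image": image,
            "title": title,
            "star": star.strip(),         # drop the surrounding newlines and spaces
            "releasetime": release,
            "score": integer + fraction,  # "9." + "2" -> "9.2"
        }


# Trimmed sample entry; the closing </dd> is added for the demo.
SAMPLE = '''<dd>
<i class="board-index board-index-10">10</i>
<img data-src="http://p0.meituan.net/movie/46c29a8b8d8424bdda7715e6fd779c66235684.jpg@160w_220h_1e_1c" alt="魂斷藍橋" class="board-img" />
<p class="name"><a href="/films/2760" title="魂斷藍橋">魂斷藍橋</a></p>
<p class="star">
    主演:費雯·麗,羅伯特·泰勒,露塞爾·沃特森
</p>
<p class="releasetime">上映時間:1940-05-17(美國)</p>
<p class="score"><i class="integer">9.</i><i class="fraction">2</i></p>
</dd>'''

movies = list(parse_one_page(SAMPLE))
print(movies[0]["index"], movies[0]["title"], movies[0]["score"])
```

Each `(...)` group in the pattern becomes one field of the result, and the lazy `.*?` pieces skip over the markup in between.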
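Multithreaded scraping
Multithreaded (asynchronous) scraping is mainly a way to speed up the crawl.
I do not understand it well yet, so here are some resources to read: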
https://blog.csdn.net/cymy001/article/details/78218024
https://blog.csdn.net/ztf312/article/details/78858512
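As a starting point, here is a minimal sketch of the idea using the standard library's `concurrent.futures.ThreadPoolExecutor`. The `?offset=` paging parameter is an assumption about how the board splits its 100 entries into pages, and `build_urls`, `fetch_all`, and `fake_fetch` are hypothetical names. `fake_fetch` stands in for `get_one_page` so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def build_urls(base_url, pages=10, page_size=10):
    """Build one URL per page.

    The ?offset= parameter is an assumption about the site's paging;
    adjust it to whatever the real board uses.
    """
    return [f"{base_url}?offset={page * page_size}" for page in range(pages)]


def fetch_all(urls, fetch, max_workers=5):
    """Call fetch(url) for every URL on a pool of worker threads.

    pool.map returns results in the same order as urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for get_one_page so the sketch runs offline.
    return f"<html>{url}</html>"


urls = build_urls("http://maoyan.com/board/4", pages=3)
pages = fetch_all(urls, fake_fetch)
print(len(pages))  # prints: 3
```

Threads suit this job because each worker spends most of its time waiting on the network, so several requests can overlap. Passing `get_one_page` as `fetch` would download the real pages.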