Starting with a Web Crawler
A crawler is the spider of the internet; it came into being alongside the search engine. Every second, crawlers fetch huge numbers of web pages, extract the key information, and store it in databases for later analysis. A crawler can be as small as ten lines of Python, or as large as Google's globally distributed crawler: millions of lines of code spread across tens of thousands of internal servers, sniffing out information across the whole world.
A simple crawler example:
import time

def crawl_page(url):
    print('crawling {}'.format(url))
    sleep_time = int(url.split('_')[-1])
    time.sleep(sleep_time)
    print('OK {}'.format(url))

def main(urls):
    for url in urls:
        crawl_page(url)

%time main(['url_1', 'url_2', 'url_3', 'url_4'])
########## Output ##########
crawling url_1
OK url_1
crawling url_2
OK url_2
crawling url_3
OK url_3
crawling url_4
OK url_4
Wall time: 10 s
A very simple idea presents itself: this kind of crawling is a natural candidate for concurrency. Let's see how to write it with coroutines.
import asyncio

async def crawl_page(url):
    print('crawling {}'.format(url))
    sleep_time = int(url.split('_')[-1])
    await asyncio.sleep(sleep_time)
    print('OK {}'.format(url))

async def main(urls):
    for url in urls:
        await crawl_page(url)

%time asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))
########## Output ##########
crawling url_1
OK url_1
crawling url_2
OK url_2
crawling url_3
OK url_3
crawling url_4
OK url_4
Wall time: 10 s
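Note that this coroutine version still takes 10 s: each `await crawl_page(url)` suspends `main` until that single page finishes, so the four pages are still fetched one after another. To get real concurrency, the coroutines have to be scheduled together, for example with `asyncio.gather`. A minimal sketch, reusing the same `crawl_page` but timing with `time.perf_counter` instead of the `%time` magic:

```python
import asyncio
import time

async def crawl_page(url):
    print('crawling {}'.format(url))
    sleep_time = int(url.split('_')[-1])
    await asyncio.sleep(sleep_time)
    print('OK {}'.format(url))

async def main(urls):
    # Schedule all coroutines at once; gather runs them concurrently,
    # so the total time is the longest single sleep, not the sum.
    await asyncio.gather(*[crawl_page(url) for url in urls])

start = time.perf_counter()
asyncio.run(main(['url_1', 'url_2', 'url_3', 'url_4']))
elapsed = time.perf_counter() - start
print('elapsed: {:.1f} s'.format(elapsed))
```

With gather, the wall time drops to roughly 4 s (the longest individual delay) instead of the 10 s sum.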
Hands-on: a crawler for movies coming soon on Douban
Task description: the page https://movie.douban.com/cinema/later/beijing/ lists movies opening soon in Beijing. Can you use Python to get each movie's name, release date, and poster? The posters on that page are thumbnails; I'd like you to grab the full poster from each movie's detail page.
import requests
from bs4 import BeautifulSoup

def main():
    url = "https://movie.douban.com/cinema/later/beijing/"
    init_page = requests.get(url).content
    init_soup = BeautifulSoup(init_page, 'lxml')

    all_movies = init_soup.find('div', id="showing-soon")
    for each_movie in all_movies.find_all('div', class_="item"):
        all_a_tag = each_movie.find_all('a')
        all_li_tag = each_movie.find_all('li')
        movie_name = all_a_tag[1].text
        url_to_fetch = all_a_tag[1]['href']
        movie_date = all_li_tag[0].text

        # Fetch each detail page sequentially to get the full-size poster
        response_item = requests.get(url_to_fetch).content
        soup_item = BeautifulSoup(response_item, 'lxml')
        img_tag = soup_item.find('img')

        print('{} {} {}'.format(movie_name, movie_date, img_tag['src']))

%time main()
########## Output ##########
阿拉丁 05月24日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2553992741.jpg
龍珠超:布羅利 05月24日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2557371503.jpg
五月天人生無限公司 05月24日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2554324453.jpg
... ...
直播攻略 06月04日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2555957974.jpg
Wall time: 56.6 s
Now let's rewrite it with coroutines, fetching all the detail pages concurrently with aiohttp:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

# Browser-style request header (the User-Agent value here is illustrative);
# Douban may reject requests that don't look like they come from a browser.
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}

async def fetch_content(url):
    async with aiohttp.ClientSession(
        headers=header, connector=aiohttp.TCPConnector(ssl=False)
    ) as session:
        async with session.get(url) as response:
            return await response.text()

async def main():
    url = "https://movie.douban.com/cinema/later/beijing/"
    init_page = await fetch_content(url)
    init_soup = BeautifulSoup(init_page, 'lxml')

    movie_names, urls_to_fetch, movie_dates = [], [], []

    all_movies = init_soup.find('div', id="showing-soon")
    for each_movie in all_movies.find_all('div', class_="item"):
        all_a_tag = each_movie.find_all('a')
        all_li_tag = each_movie.find_all('li')
        movie_names.append(all_a_tag[1].text)
        urls_to_fetch.append(all_a_tag[1]['href'])
        movie_dates.append(all_li_tag[0].text)

    # Fetch every detail page concurrently instead of one at a time
    tasks = [fetch_content(url) for url in urls_to_fetch]
    pages = await asyncio.gather(*tasks)

    for movie_name, movie_date, page in zip(movie_names, movie_dates, pages):
        soup_item = BeautifulSoup(page, 'lxml')
        img_tag = soup_item.find('img')
        print('{} {} {}'.format(movie_name, movie_date, img_tag['src']))

%time asyncio.run(main())
########## Output ##########
阿拉丁 05月24日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2553992741.jpg
龍珠超:布羅利 05月24日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2557371503.jpg
五月天人生無限公司 05月24日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2554324453.jpg
... ...
直播攻略 06月04日 https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2555957974.jpg
Wall time: 4.98 s
Summary
- The difference between coroutines and multithreading comes down to two points: first, coroutines run in a single thread; second, with coroutines the user decides where to give up control and switch to the next task.
- Coroutine code is more concise and clear; combining async / await syntax with create_task handles small-to-medium concurrency needs with ease.
- When writing coroutine programs, keep a clear mental model of the event loop: know when the program needs to pause and wait for I/O, and when it should run straight through to the end.
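To make the last two points concrete, here is a minimal sketch of the async/await + create_task pattern the summary refers to (the task names and delays are illustrative, not from the examples above):

```python
import asyncio

async def worker(name, delay):
    # The await below is where this coroutine hands control
    # back to the event loop until its sleep finishes.
    await asyncio.sleep(delay)
    return name

async def main():
    # create_task schedules each coroutine on the event loop right away;
    # awaiting the tasks afterwards just collects their results.
    tasks = [asyncio.create_task(worker('task_{}'.format(i), 0.1)) for i in range(3)]
    results = [await t for t in tasks]
    print(results)
    return results

results = asyncio.run(main())
```

All three workers sleep concurrently, so the whole run takes about 0.1 s rather than 0.3 s.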