title: A Web Crawler With asyncio Coroutines
author: A. Jesse Jiryu Davis and Guido van Rossum
<markdown>
_A. Jesse Jiryu Davis is a staff engineer at MongoDB in New York. He wrote Motor, the async MongoDB Python driver, and he is the lead developer of the MongoDB C Driver and a member of the PyMongo team. He contributes to asyncio and Tornado. He writes at http://emptysqua.re.
Guido van Rossum is the creator of Python, one of the major programming languages on and off the web. The Python community refers to him as the BDFL (Benevolent Dictator For Life), a title straight from a Monty Python skit. Guido's home on the web is http://www.python.org/~guido/.
</markdown>
Introduction
Classical computer science emphasizes efficient algorithms that complete computations as quickly as possible. But many networked programs spend their time not computing, but holding open many connections that are slow, or have infrequent events. These programs present a very different challenge: to wait for a huge number of network events efficiently. A contemporary approach to this problem is asynchronous I/O, or "async".
This chapter presents a simple web crawler. The crawler is an archetypal async application because it waits for many responses, but does little computation. The more pages it can fetch at once, the sooner it completes. If it devotes a thread to each in-flight request, then as the number of concurrent requests rises it will run out of memory or other thread-related resource before it runs out of sockets. It avoids the need for threads by using asynchronous I/O.
We present the example in three stages. First, we show an async event loop and sketch a crawler that uses the event loop with callbacks: it is very efficient, but extending it to more complex problems would lead to unmanageable spaghetti code. Second, therefore, we show that Python coroutines are both efficient and extensible. We implement simple coroutines in Python using generator functions. In the third stage, we use the full-featured coroutines from Python's standard "asyncio" library[^16], and coordinate them using an async queue.
The Task
A web crawler finds and downloads all pages on a website, perhaps to archive or index them. Beginning with a root URL, it fetches each page, parses it for links to unseen pages, and adds these to a queue. It stops when it fetches a page with no unseen links and the queue is empty.
We can hasten this process by downloading many pages concurrently. As the crawler finds new links, it launches simultaneous fetch operations for the new pages on separate sockets. It parses responses as they arrive, adding new links to the queue. There may come some point of diminishing returns where too much concurrency degrades performance, so we cap the number of concurrent requests, and leave the remaining links in the queue until some in-flight requests complete.
The Traditional Approach
How do we make the crawler concurrent? Traditionally we would create a thread pool. Each thread would be in charge of downloading one page at a time over a socket. For example, to download a page from `xkcd.com`:
def fetch(url):
    sock = socket.socket()
    sock.connect(('xkcd.com', 80))
    request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(url)
    sock.send(request.encode('ascii'))
    response = b''
    chunk = sock.recv(4096)
    while chunk:
        response += chunk
        chunk = sock.recv(4096)

    # Page is now downloaded.
    links = parse_links(response)
    q.add(links)
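A thread pool that drives this `fetch` could be as small as the following sketch; `ThreadPoolExecutor`, the pool size, and the starting URLs here are illustrative choices, not part of the chapter's code:

from concurrent.futures import ThreadPoolExecutor

# Sketch: run the blocking fetch() on a fixed pool of worker threads.
# The pool size caps how many pages are in flight at once.
with ThreadPoolExecutor(max_workers=10) as pool:
    for url in ['/', '/353/', '/1354/']:  # hypothetical starting URLs
        pool.submit(fetch, url)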
By default, socket operations are blocking: when the thread calls a method like `connect` or `recv`, it pauses until the operation completes.[^15] Consequently to download many pages at once, we need many threads. A sophisticated application amortizes the cost of thread-creation by keeping idle threads in a thread pool, then checking them out to reuse them for subsequent tasks; it does the same with sockets in a connection pool.
And yet, threads are expensive, and operating systems enforce a variety of hard caps on the number of threads a process, user, or machine may have. On Jesse's system, a Python thread costs around 50k of memory, and starting tens of thousands of threads causes failures. If we scale up to tens of thousands of simultaneous operations on concurrent sockets, we run out of threads before we run out of sockets. Per-thread overhead or system limits on threads are the bottleneck.
In his influential article "The C10K problem"[^8], Dan Kegel outlines the limitations of multithreading for I/O concurrency. He begins,
It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.
Kegel coined the term "C10K" in 1999. Ten thousand connections sounds dainty now, but the problem has changed only in size, not in kind. Back then, using a thread per connection for C10K was impractical. Now the cap is orders of magnitude higher. Indeed, our toy web crawler would work just fine with threads. Yet for very large scale applications, with hundreds of thousands of connections, the cap remains: there is a limit beyond which most systems can still create sockets, but have run out of threads. How can we overcome this?
Async
Asynchronous I/O frameworks do concurrent operations on a single thread using
non-blocking sockets. In our async crawler, we set the socket non-blocking
before we begin to connect to the server:
sock = socket.socket()
sock.setblocking(False)
try:
    sock.connect(('xkcd.com', 80))
except BlockingIOError:
    pass
Irritatingly, a non-blocking socket throws an exception from `connect`, even when it is working normally. This exception replicates the irritating behavior of the underlying C function, which sets `errno` to `EINPROGRESS` to tell you it has begun.
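As an aside, the standard library also offers `connect_ex`, which reports the error as a return value rather than raising; a small sketch, with the same example host:

import socket

sock = socket.socket()
sock.setblocking(False)
# connect_ex returns an error number instead of raising an exception.
# On Unix-like systems a non-blocking connect typically reports EINPROGRESS here.
err = sock.connect_ex(('xkcd.com', 80))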
Now our crawler needs a way to know when the connection is established, so it can send the HTTP request. We could simply keep trying in a tight loop:
request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(url)
encoded = request.encode('ascii')

while True:
    try:
        sock.send(encoded)
        break  # Done.
    except OSError as e:
        pass

print('sent')
This method not only wastes electricity, but it cannot efficiently await events on multiple sockets. In ancient times, BSD Unix's solution to this problem was `select`, a C function that waits for an event to occur on a non-blocking socket or a small array of them. Nowadays the demand for Internet applications with huge numbers of connections has led to replacements like `poll`, then `kqueue` on BSD and `epoll` on Linux. These APIs are similar to `select`, but perform well with very large numbers of connections.

Python 3.4's `DefaultSelector` uses the best `select`-like function available on your system. To register for notifications about network I/O, we create a non-blocking socket and register it with the default selector:
from selectors import DefaultSelector, EVENT_WRITE

selector = DefaultSelector()

sock = socket.socket()
sock.setblocking(False)
try:
    sock.connect(('xkcd.com', 80))
except BlockingIOError:
    pass

def connected():
    selector.unregister(sock.fileno())
    print('connected!')

selector.register(sock.fileno(), EVENT_WRITE, connected)
We disregard the spurious error and call `selector.register`, passing in the socket's file descriptor and a constant that expresses what event we are waiting for. To be notified when the connection is established, we pass `EVENT_WRITE`: that is, we want to know when the socket is "writable". We also pass a Python function, `connected`, to run when that event occurs. Such a function is known as a callback.
We process I/O notifications as the selector receives them, in a loop:
def loop():
    while True:
        events = selector.select()
        for event_key, event_mask in events:
            callback = event_key.data
            callback()
The `connected` callback is stored as `event_key.data`, which we retrieve and execute once the non-blocking socket is connected.
Unlike in our fast-spinning loop above, the call to `select` here pauses, awaiting the next I/O events. Then the loop runs callbacks that are waiting for these events. Operations that have not completed remain pending until some future tick of the event loop.
What have we demonstrated already? We showed how to begin an operation and execute a callback when the operation is ready. An async framework builds on the two features we have shown—non-blocking sockets and the event loop—to run concurrent operations on a single thread.
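Pieced together, the fragments above make a small runnable sketch; the host is only an example, and we stop the loop as soon as this single connection succeeds:

from selectors import DefaultSelector, EVENT_WRITE
import socket

selector = DefaultSelector()
stopped = False

sock = socket.socket()
sock.setblocking(False)
try:
    sock.connect(('xkcd.com', 80))
except BlockingIOError:
    pass

def connected():
    global stopped
    selector.unregister(sock.fileno())
    print('connected!')
    stopped = True  # Nothing else to wait for in this tiny demo.

selector.register(sock.fileno(), EVENT_WRITE, connected)

while not stopped:
    events = selector.select()
    for event_key, event_mask in events:
        callback = event_key.data
        callback()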
We have achieved "concurrency" here, but not what is traditionally called "parallelism". That is, we built a tiny system that does overlapping I/O. It is capable of beginning new operations while others are in flight. It does not actually utilize multiple cores to execute computation in parallel. But then, this system is designed for I/O-bound problems, not CPU-bound ones.[^14]
So our event loop is efficient at concurrent I/O because it does not devote thread resources to each connection. But before we proceed, it is important to correct a common misapprehension that async is faster than multithreading. Often it is not—indeed, in Python, an event loop like ours is moderately slower than multithreading at serving a small number of very active connections. In a runtime without a global interpreter lock, threads would perform even better on such a workload. What asynchronous I/O is right for, is applications with many slow or sleepy connections with infrequent events.[^11]<latex>[^bayer]</latex>
Programming With Callbacks
With the runty async framework we have built so far, how can we build a web crawler? Even a simple URL-fetcher is painful to write.
We begin with global sets of the URLs we have yet to fetch, and the URLs we have seen:
urls_todo = set(['/'])
seen_urls = set(['/'])
The `seen_urls` set includes `urls_todo` plus completed URLs. The two sets are initialized with the root URL "/".
Fetching a page will require a series of callbacks. The `connected` callback fires when a socket is connected, and sends a GET request to the server. But then it must await a response, so it registers another callback. If, when that callback fires, it cannot read the full response yet, it registers again, and so on.
Let us collect these callbacks into a `Fetcher` object. It needs a URL, a socket object, and a place to accumulate the response bytes:
class Fetcher:
    def __init__(self, url):
        self.response = b''  # Empty array of bytes.
        self.url = url
        self.sock = None
We begin by calling `Fetcher.fetch`:
    # Method on Fetcher class.
    def fetch(self):
        self.sock = socket.socket()
        self.sock.setblocking(False)
        try:
            self.sock.connect(('xkcd.com', 80))
        except BlockingIOError:
            pass

        # Register next callback.
        selector.register(self.sock.fileno(),
                          EVENT_WRITE,
                          self.connected)
The `fetch` method begins connecting a socket. But notice the method returns before the connection is established. It must return control to the event loop to wait for the connection. To understand why, imagine our whole application was structured so:
# Begin fetching http://xkcd.com/353/
fetcher = Fetcher('/353/')
fetcher.fetch()

while True:
    events = selector.select()
    for event_key, event_mask in events:
        callback = event_key.data
        callback(event_key, event_mask)
All event notifications are processed in the event loop when it calls `select`. Hence `fetch` must hand control to the event loop, so that the program knows when the socket has connected. Only then does the loop run the `connected` callback, which was registered at the end of `fetch` above.
Here is the implementation of `connected`:
    # Method on Fetcher class.
    def connected(self, key, mask):
        print('connected!')
        selector.unregister(key.fd)
        request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(self.url)
        self.sock.send(request.encode('ascii'))

        # Register the next callback.
        selector.register(key.fd,
                          EVENT_READ,
                          self.read_response)
The method sends a GET request. A real application would check the return value of `send` in case the whole message cannot be sent at once. But our request is small and our application unsophisticated. It blithely calls `send`, then waits for a response. Of course, it must register yet another callback and relinquish control to the event loop. The next and final callback, `read_response`, processes the server's reply:
    # Method on Fetcher class.
    def read_response(self, key, mask):
        global stopped

        chunk = self.sock.recv(4096)  # 4k chunk size.
        if chunk:
            self.response += chunk
        else:
            selector.unregister(key.fd)  # Done reading.
            links = self.parse_links()

            # Python set-logic:
            for link in links.difference(seen_urls):
                urls_todo.add(link)
                Fetcher(link).fetch()  # <- New Fetcher.

            seen_urls.update(links)
            urls_todo.remove(self.url)
            if not urls_todo:
                stopped = True
The callback is executed each time the selector sees that the socket is "readable", which could mean two things: the socket has data or it is closed.
The callback asks for up to four kilobytes of data from the socket. If less is ready, `chunk` contains whatever data is available. If there is more, `chunk` is four kilobytes long and the socket remains readable, so the event loop runs this callback again on the next tick. When the response is complete, the server has closed the socket and `chunk` is empty.
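Earlier we noted that a real application would check the return value of `send`. A sketch of that bookkeeping in this callback style could look like the following; `send_all` and `on_sent` are hypothetical names, not part of the chapter's fetcher:

def send_all(sock, data, on_sent):
    # Send as much as the kernel accepts each time the socket is writable;
    # once every byte has gone out, unregister and run the next callback.
    def write(key, mask):
        nonlocal data
        sent = sock.send(data)
        data = data[sent:]
        if not data:
            selector.unregister(key.fd)
            on_sent()

    selector.register(sock.fileno(), EVENT_WRITE, write)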
The `parse_links` method, not shown, returns a set of URLs. We start a new fetcher for each new URL, with no concurrency cap. Note a nice feature of async programming with callbacks: we need no mutex around changes to shared data, such as when we add links to `seen_urls`. There is no preemptive multitasking, so we cannot be interrupted at arbitrary points in our code.
We add a global `stopped` variable and use it to control the loop:
stopped = False

def loop():
    while not stopped:
        events = selector.select()
        for event_key, event_mask in events:
            callback = event_key.data
            callback()
Once all pages are downloaded the fetcher stops the global event loop and the program exits.
This example makes async's problem plain: spaghetti code. We need some way to express a series of computations and I/O operations, and schedule multiple such series of operations to run concurrently. But without threads, a series of operations cannot be collected into a single function: whenever a function begins an I/O operation, it explicitly saves whatever state will be needed in the future, then returns. You are responsible for thinking about and writing this state-saving code.
Let us explain what we mean by that. Consider how simply we fetched a URL on a thread with a conventional blocking socket:
# Blocking version.
def fetch(url):
    sock = socket.socket()
    sock.connect(('xkcd.com', 80))
    request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(url)
    sock.send(request.encode('ascii'))
    response = b''
    chunk = sock.recv(4096)
    while chunk:
        response += chunk
        chunk = sock.recv(4096)

    # Page is now downloaded.
    links = parse_links(response)
    q.add(links)
What state does this function remember between one socket operation and the next? It has the socket, a URL, and the accumulating `response`. A function that runs on a thread uses basic features of the programming language to store this temporary state in local variables, on its stack. The function also has a "continuation"—that is, the code it plans to execute after I/O completes. The runtime remembers the continuation by storing the thread's instruction pointer. You need not think about restoring these local variables and the continuation after I/O. It is built in to the language.
But with a callback-based async framework, these language features are no help. While waiting for I/O, a function must save its state explicitly, because the function returns and loses its stack frame before I/O completes. In lieu of local variables, our callback-based example stores `sock` and `response` as attributes of `self`, the Fetcher instance. In lieu of the instruction pointer, it stores its continuation by registering the callbacks `connected` and `read_response`. As the application's features grow, so does the complexity of the state we manually save across callbacks. Such onerous bookkeeping makes the coder prone to migraines.
Even worse, what happens if a callback throws an exception, before it schedules the next callback in the chain? Say we did a poor job on the `parse_links` method and it throws an exception parsing some HTML:
Traceback (most recent call last):
  File "loop-with-callbacks.py", line 111, in <module>
    loop()
  File "loop-with-callbacks.py", line 106, in loop
    callback(event_key, event_mask)
  File "loop-with-callbacks.py", line 51, in read_response
    links = self.parse_links()
  File "loop-with-callbacks.py", line 67, in parse_links
    raise Exception('parse error')
Exception: parse error
The stack trace shows only that the event loop was running a callback. We do not remember what led to the error. The chain is broken on both ends: we forgot where we were going and whence we came. This loss of context is called "stack ripping", and in many cases it confounds the investigator. Stack ripping also prevents us from installing an exception handler for a chain of callbacks, the way a "try / except" block wraps a function call and its tree of descendents.[^7]
So, even apart from the long debate about the relative efficiencies of multithreading and async, there is this other debate regarding which is more error-prone: threads are susceptible to data races if you make a mistake synchronizing them, but callbacks are stubborn to debug due to stack ripping.
Coroutines
We entice you with a promise. It is possible to write asynchronous code that combines the efficiency of callbacks with the classic good looks of multithreaded programming. This combination is achieved with a pattern called "coroutines". Using Python 3.4's standard asyncio library, and a package called "aiohttp", fetching a URL in a coroutine is very direct[^10]:
@asyncio.coroutine
def fetch(self, url):
    response = yield from self.session.get(url)
    body = yield from response.read()
It is also scalable. Compared to the 50k of memory per thread and the operating system's hard limits on threads, a Python coroutine takes barely 3k of memory on Jesse's system. Python can easily start hundreds of thousands of coroutines.
The concept of a coroutine, dating to the elder days of computer science, is simple: it is a subroutine that can be paused and resumed. Whereas threads are preemptively multitasked by the operating system, coroutines multitask cooperatively: they choose when to pause, and which coroutine to run next.
There are many implementations of coroutines; even in Python there are several. The coroutines in the standard "asyncio" library in Python 3.4 are built upon generators, a Future class, and the "yield from" statement. Starting in Python 3.5, coroutines are a native feature of the language itself[^17]; however, understanding coroutines as they were first implemented in Python 3.4, using pre-existing language facilities, is the foundation to tackle Python 3.5's native coroutines.
To explain Python 3.4's generator-based coroutines, we will engage in an exposition of generators and how they are used as coroutines in asyncio, and trust you will enjoy reading it as much as we enjoyed writing it. Once we have explained generator-based coroutines, we shall use them in our async web crawler.
How Python Generators Work
Before you grasp Python generators, you have to understand how regular Python functions work. Normally, when a Python function calls a subroutine, the subroutine retains control until it returns, or throws an exception. Then control returns to the caller:
>>> def foo():
...     bar()
...
>>> def bar():
...     pass
The standard Python interpreter is written in C. The C function that executes a Python function is called, mellifluously, `PyEval_EvalFrameEx`. It takes a Python stack frame object and evaluates Python bytecode in the context of the frame. Here is the bytecode for `foo`:
>>> import dis
>>> dis.dis(foo)
  2           0 LOAD_GLOBAL              0 (bar)
              3 CALL_FUNCTION            0 (0 positional, 0 keyword pair)
              6 POP_TOP
              7 LOAD_CONST               0 (None)
             10 RETURN_VALUE
The `foo` function loads `bar` onto its stack and calls it, then pops its return value from the stack, loads `None` onto the stack, and returns `None`.
When `PyEval_EvalFrameEx` encounters the `CALL_FUNCTION` bytecode, it creates a new Python stack frame and recurses: that is, it calls `PyEval_EvalFrameEx` recursively with the new frame, which is used to execute `bar`.
It is crucial to understand that Python stack frames are allocated in heap memory! The Python interpreter is a normal C program, so its stack frames are normal stack frames. But the Python stack frames it manipulates are on the heap. Among other surprises, this means a Python stack frame can outlive its function call. To see this interactively, save the current frame from within `bar`:
>>> import inspect
>>> frame = None
>>> def foo():
...     bar()
...
>>> def bar():
...     global frame
...     frame = inspect.currentframe()
...
>>> foo()
>>> # The frame was executing the code for 'bar'.
>>> frame.f_code.co_name
'bar'
>>> # Its back pointer refers to the frame for 'foo'.
>>> caller_frame = frame.f_back
>>> caller_frame.f_code.co_name
'foo'
\aosafigure[240pt]{crawler-images/function-calls.png}{Function Calls}{500l.crawler.functioncalls}
The stage is now set for Python generators, which use the same building blocks—code objects and stack frames—to marvelous effect.
This is a generator function:
>>> def gen_fn():
...     result = yield 1
...     print('result of yield: {}'.format(result))
...     result2 = yield 2
...     print('result of 2nd yield: {}'.format(result2))
...     return 'done'
...
When Python compiles `gen_fn` to bytecode, it sees the `yield` statement and knows that `gen_fn` is a generator function, not a regular one. It sets a flag to remember this fact:
>>> # The generator flag is bit position 5.
>>> generator_bit = 1 << 5
>>> bool(gen_fn.__code__.co_flags & generator_bit)
True
When you call a generator function, Python sees the generator flag, and it does not actually run the function. Instead, it creates a generator:
>>> gen = gen_fn()
>>> type(gen)
<class 'generator'>
A Python generator encapsulates a stack frame plus a reference to some code, the body of `gen_fn`:
>>> gen.gi_code.co_name
'gen_fn'
All generators from calls to `gen_fn` point to this same code. But each has its own stack frame. This stack frame is not on any actual stack, it sits in heap memory waiting to be used:
\aosafigure[240pt]{crawler-images/generator.png}{Generators}{500l.crawler.generators}
The frame has a "last instruction" pointer, the instruction it executed most recently. In the beginning, the last instruction pointer is -1, meaning the generator has not begun:
>>> gen.gi_frame.f_lasti
-1
When we call `send`, the generator reaches its first `yield`, and pauses. The return value of `send` is 1, since that is what `gen` passes to the `yield` expression:
>>> gen.send(None)
1
The generator's instruction pointer is now 3 bytecodes from the start, part way through the 56 bytes of compiled Python:
>>> gen.gi_frame.f_lasti
3
>>> len(gen.gi_code.co_code)
56
The generator can be resumed at any time, from any function, because its stack frame is not actually on the stack: it is on the heap. Its position in the call hierarchy is not fixed, and it need not obey the first-in, last-out order of execution that regular functions do. It is liberated, floating free like a cloud.
We can send the value "hello" into the generator and it becomes the result of the `yield` expression, and the generator continues until it yields 2:
>>> gen.send('hello')
result of yield: hello
2
Its stack frame now contains the local variable `result`:
>>> gen.gi_frame.f_locals
{'result': 'hello'}
Other generators created from `gen_fn` will have their own stack frames and local variables.
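For instance, two generators made from the same function advance independently:

>>> gen1, gen2 = gen_fn(), gen_fn()
>>> gen1.send(None)
1
>>> gen1.send('one')
result of yield: one
2
>>> gen2.gi_frame.f_lasti  # gen2 has not started; its frame is untouched.
-1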
When we call `send` again, the generator continues from its second `yield`, and finishes by raising the special `StopIteration` exception:
>>> gen.send('goodbye')
result of 2nd yield: goodbye
Traceback (most recent call last):
  File "<input>", line 1, in <module>
StopIteration: done
The exception has a value, which is the return value of the generator: the string `"done"`.
Building Coroutines With Generators
So a generator can pause, and it can be resumed with a value, and it has a return value. Sounds like a good primitive upon which to build an async programming model, without spaghetti callbacks! We want to build a "coroutine": a routine that is cooperatively scheduled with other routines in the program. Our coroutines will be a simplified version of those in Python's standard "asyncio" library. As in asyncio, we will use generators, futures, and the "yield from" statement.
First we need a way to represent some future result that a coroutine is waiting for. A stripped-down version:
class Future:
    def __init__(self):
        self.result = None
        self._callbacks = []

    def add_done_callback(self, fn):
        self._callbacks.append(fn)

    def set_result(self, result):
        self.result = result
        for fn in self._callbacks:
            fn(self)
A future is initially "pending". It is "resolved" by a call to `set_result`.[^12]
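A tiny usage sketch: a callback added before the future resolves runs as soon as `set_result` is called.

>>> f = Future()
>>> f.add_done_callback(lambda fut: print('resolved with', fut.result))
>>> f.set_result(42)
resolved with 42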
Let us adapt our fetcher to use futures and coroutines. We wrote `fetch` with a callback:
class Fetcher:
    def fetch(self):
        self.sock = socket.socket()
        self.sock.setblocking(False)
        try:
            self.sock.connect(('xkcd.com', 80))
        except BlockingIOError:
            pass
        selector.register(self.sock.fileno(),
                          EVENT_WRITE,
                          self.connected)

    def connected(self, key, mask):
        print('connected!')
        # And so on....
The `fetch` method begins connecting a socket, then registers the callback, `connected`, to be executed when the socket is ready. Now we can combine these two steps into one coroutine:
    def fetch(self):
        sock = socket.socket()
        sock.setblocking(False)
        try:
            sock.connect(('xkcd.com', 80))
        except BlockingIOError:
            pass

        f = Future()

        def on_connected():
            f.set_result(None)

        selector.register(sock.fileno(),
                          EVENT_WRITE,
                          on_connected)

        yield f
        selector.unregister(sock.fileno())
        print('connected!')
Now `fetch` is a generator function, rather than a regular one, because it contains a `yield` statement. We create a pending future, then yield it to pause `fetch` until the socket is ready. The inner function `on_connected` resolves the future.
But when the future resolves, what resumes the generator? We need a coroutine driver. Let us call it "task":
class Task:
    def __init__(self, coro):
        self.coro = coro
        f = Future()
        f.set_result(None)
        self.step(f)

    def step(self, future):
        try:
            next_future = self.coro.send(future.result)
        except StopIteration:
            return

        next_future.add_done_callback(self.step)

# Begin fetching http://xkcd.com/353/
fetcher = Fetcher('/353/')
Task(fetcher.fetch())
loop()
The task starts the `fetch` generator by sending `None` into it. Then `fetch` runs until it yields a future, which the task captures as `next_future`. When the socket is connected, the event loop runs the callback `on_connected`, which resolves the future, which calls `step`, which resumes `fetch`.
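The same machinery works with no sockets at all. In this sketch we play the event loop ourselves and resolve the future by hand; `tiny_coro` and `waiting` are made-up names for illustration:

waiting = []  # Futures our pretend "event loop" still has to resolve.

def tiny_coro():
    f = Future()
    waiting.append(f)
    result = yield f  # Pause here until someone resolves f.
    print('woke up with:', result)

Task(tiny_coro())  # Runs tiny_coro up to its first yield.
waiting.pop().set_result('pretend I/O result')  # Resolving it resumes the coroutine.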
Factoring Coroutines With yield from
Once the socket is connected, we send the HTTP GET request and read the server response. These steps need no longer be scattered among callbacks; we gather them into the same generator function:
    def fetch(self):
        # ... connection logic from above, then:
        sock.send(request.encode('ascii'))

        while True:
            f = Future()

            def on_readable():
                f.set_result(sock.recv(4096))

            selector.register(sock.fileno(),
                              EVENT_READ,
                              on_readable)

            chunk = yield f
            selector.unregister(sock.fileno())
            if chunk:
                self.response += chunk
            else:
                # Done reading.
                break
This code, which reads a whole message from a socket, seems generally useful. How can we factor it from `fetch` into a subroutine? Now Python 3's celebrated `yield from` takes the stage. It lets one generator delegate to another.
To see how, let us return to our simple generator example:
>>> def gen_fn():
...     result = yield 1
...     print('result of yield: {}'.format(result))
...     result2 = yield 2
...     print('result of 2nd yield: {}'.format(result2))
...     return 'done'
...
To call this generator from another generator, delegate to it with `yield from`:
>>> # Generator function:
>>> def caller_fn():
...     gen = gen_fn()
...     rv = yield from gen
...     print('return value of yield-from: {}'
...           .format(rv))
...
>>> # Make a generator from the
>>> # generator function.
>>> caller = caller_fn()
The `caller` generator acts as if it were `gen`, the generator it is delegating to:
>>> caller.send(None)
1
>>> caller.gi_frame.f_lasti
15
>>> caller.send('hello')
result of yield: hello
2
>>> caller.gi_frame.f_lasti # Hasn't advanced.
15
>>> caller.send('goodbye')
result of 2nd yield: goodbye
return value of yield-from: done
Traceback (most recent call last):
  File "<input>", line 1, in <module>
StopIteration
While `caller` yields from `gen`, `caller` does not advance. Notice that its instruction pointer remains at 15, the site of its `yield from` statement, even while the inner generator `gen` advances from one `yield` statement to the next.[^13] From our perspective outside `caller`, we cannot tell if the values it yields are from `caller` or from the generator it delegates to. And from inside `gen`, we cannot tell if values are sent in from `caller` or from outside it. The `yield from` statement is a frictionless channel, through which values flow in and out of `gen` until `gen` completes.
A coroutine can delegate work to a sub-coroutine with `yield from` and receive the result of the work. Notice, above, that `caller` printed "return value of yield-from: done". When `gen` completed, its return value became the value of the `yield from` statement in `caller`:
rv = yield from gen
Earlier, when we criticized callback-based async programming, our most strident complaint was about "stack ripping": when a callback throws an exception, the stack trace is typically useless. It only shows that the event loop was running the callback, not why. How do coroutines fare?
>>> def gen_fn():
...     raise Exception('my error')
>>> caller = caller_fn()
>>> caller.send(None)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 3, in caller_fn
  File "<input>", line 2, in gen_fn
Exception: my error
This is much more useful! The stack trace shows `caller_fn` was delegating to `gen_fn` when it threw the error. Even more comforting, we can wrap the call to a sub-coroutine in an exception handler, the same as with normal subroutines:
>>> def gen_fn():
...     yield 1
...     raise Exception('uh oh')
...
>>> def caller_fn():
...     try:
...         yield from gen_fn()
...     except Exception as exc:
...         print('caught {}'.format(exc))
...
>>> caller = caller_fn()
>>> caller.send(None)
1
>>> caller.send('hello')
caught uh oh
So we factor logic with sub-coroutines just like with regular subroutines. Let us factor some useful sub-coroutines from our fetcher. We write a `read` coroutine to receive one chunk:
def read(sock):
    f = Future()

    def on_readable():
        f.set_result(sock.recv(4096))

    selector.register(sock.fileno(), EVENT_READ, on_readable)
    chunk = yield f  # Read one chunk.
    selector.unregister(sock.fileno())
    return chunk
We build on `read` with a `read_all` coroutine that receives a whole message:
def read_all(sock):
    response = []
    # Read whole response.
    chunk = yield from read(sock)
    while chunk:
        response.append(chunk)
        chunk = yield from read(sock)

    return b''.join(response)
If you squint the right way, the `yield from` statements disappear and these look like conventional functions doing blocking I/O. But in fact, `read` and `read_all` are coroutines. Yielding from `read` pauses `read_all` until the I/O completes. While `read_all` is paused, asyncio's event loop does other work and awaits other I/O events; `read_all` is resumed with the result of `read` on the next loop tick once its event is ready.
At the stack's root, `fetch` calls `read_all`:
class Fetcher:
    def fetch(self):
        # ... connection logic from above, then:
        sock.send(request.encode('ascii'))
        self.response = yield from read_all(sock)
Miraculously, the Task class needs no modification. It drives the outer `fetch` coroutine just the same as before:
Task(fetcher.fetch())
loop()
When `read` yields a future, the task receives it through the channel of `yield from` statements, precisely as if the future were yielded directly from `fetch`. When the loop resolves a future, the task sends its result into `fetch`, and the value is received by `read`, exactly as if the task were driving `read` directly:
\aosafigure[240pt]{crawler-images/yield-from.png}{Yield From}{500l.crawler.yieldfrom}
To perfect our coroutine implementation, we polish out one mar: our code uses `yield` when it waits for a future, but `yield from` when it delegates to a sub-coroutine. It would be more refined if we used `yield from` whenever a coroutine pauses. Then a coroutine need not concern itself with what type of thing it awaits.
We take advantage of the deep correspondence in Python between generators and iterators. Advancing a generator is, to the caller, the same as advancing an iterator. So we make our Future class iterable by implementing a special method:
    # Method on Future class.
    def __iter__(self):
        # Tell Task to resume me here.
        yield self
        return self.result
The future's `__iter__` method is a coroutine that yields the future itself. Now when we replace code like this:
# f is a Future.
yield f
...with this:
# f is a Future.
yield from f
...the outcome is the same! The driving Task receives the future from its call to `send`, and when the future is resolved it sends the new result back into the coroutine.
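Concretely, the `read` coroutine from before needs only its `yield` line changed; a sketch:

def read(sock):
    f = Future()

    def on_readable():
        f.set_result(sock.recv(4096))

    selector.register(sock.fileno(), EVENT_READ, on_readable)
    chunk = yield from f  # Was "yield f"; works because Future is now iterable.
    selector.unregister(sock.fileno())
    return chunk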
What is the advantage of using `yield from` everywhere? Why is that better than waiting for futures with `yield` and delegating to sub-coroutines with `yield from`? It is better because now, a method can freely change its implementation without affecting the caller: it might be a normal method that returns a future that will resolve to a value, or it might be a coroutine that contains `yield from` statements and returns a value. In either case, the caller need only `yield from` the method in order to wait for the result.
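As a sketch of that freedom, with hypothetical names: one version returns a future, the other is a coroutine, and a caller waits on either the same way:

def read_future(sock):
    # A plain function: register interest and return a Future.
    f = Future()
    def on_readable():
        selector.unregister(sock.fileno())
        f.set_result(sock.recv(4096))
    selector.register(sock.fileno(), EVENT_READ, on_readable)
    return f

def read_coro(sock):
    # A coroutine with the same contract, built on the function above.
    chunk = yield from read_future(sock)
    return chunk

# Either call site reads identically:
#     chunk = yield from read_future(sock)  # a Future is iterable
#     chunk = yield from read_coro(sock)    # generator delegation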
Gentle reader, we have reached the end of our enjoyable exposition of coroutines in asyncio. We peered into the machinery of generators, and sketched an implementation of futures and tasks. We outlined how asyncio attains the best of both worlds: concurrent I/O that is more efficient than threads and more legible than callbacks. Of course, the real asyncio is much more sophisticated than our sketch. The real framework addresses zero-copy I/O, fair scheduling, exception handling, and an abundance of other features.
To an asyncio user, coding with coroutines is much simpler than you saw here. In the code above we implemented coroutines from first principles, so you saw callbacks, tasks, and futures. You even saw non-blocking sockets and the call to `select`. But when it comes time to build an application with asyncio, none of this appears in your code. As we promised, you can now sleekly fetch a URL:
@asyncio.coroutine
def fetch(self, url):
    response = yield from self.session.get(url)
    body = yield from response.read()
Satisfied with this exposition, we return to our original assignment: to write an async web crawler, using asyncio.