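My crawlers used to run on a crawler framework I wrote myself: a group of workers fetched tasks from a master and then started crawling. The number of processes equaled the number of processor cores, and I raised the crawl speed by adding more threads.
I recently looked at Celery. Its interface is really elegant, and I wanted to try writing a crawler on top of an asynchronous model.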
Simulated target
To make testing easier, I set up a simple server with Tornado to simulate the website being crawled.
It does one simple thing: every request blocks for 6 seconds before the response is sent.
import tornado.web
import tornado.ioloop
import time
from concurrent.futures import ThreadPoolExecutor
from tornado.concurrent import run_on_executor
import tornado.gen


class MainHandler(tornado.web.RequestHandler):
    # Thread pool that runs the blocking sleep off the IOLoop thread.
    executor = ThreadPoolExecutor(40)

    # gen.coroutine alone makes the handler asynchronous;
    # tornado.web.asynchronous was removed in Tornado 6, so it is omitted.
    @tornado.gen.coroutine
    def get(self):
        print(time.asctime())
        # Wait 6 seconds without blocking other requests.
        yield self.sleep(6)
        self.write('from server:' + time.asctime())
        self.finish()

    @run_on_executor
    def sleep(self, sec):
        # Runs in the executor thread, so time.sleep is safe here.
        time.sleep(sec)


if __name__ == '__main__':
    app = tornado.web.Application(handlers=[
        ('^/.*', MainHandler)
    ])
    app.listen(10240)
    tornado.ioloop.IOLoop.current().start()
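Before wiring in Celery, it can help to confirm that the test server really handles slow requests concurrently. The following is a minimal sketch using only the standard library; the helper names `fetch` and `fetch_concurrently` are hypothetical and not part of the original setup.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10):
    """Fetch one URL and return (body, seconds_elapsed)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode('utf-8')
    return body, time.monotonic() - start


def fetch_concurrently(urls, fetch_one=fetch):
    """Fetch every URL in its own thread; return (results, total_seconds)."""
    start = time.monotonic()
    # One thread per URL is fine for a handful of test requests.
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        results = list(pool.map(fetch_one, urls))
    return results, time.monotonic() - start
```

With the server above running, calling `fetch_concurrently(['http://127.0.0.1:10240/{}'.format(i) for i in range(5)])` should report a total time close to 6 seconds rather than 30 if the handler is not blocking the IOLoop.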
Consumer
The tasks module contains a single function, spider, which uses gevent to request the given target.
# Patch the socket module before anything else imports it.
import gevent.monkey
gevent.monkey.patch_socket()
from celery import Celery
import socket
import requests
import gevent

app = Celery('tasks',
             broker='redis://127.0.0.1:6379/3',
             backend='redis://127.0.0.1:6379/3')


@app.task
def spider(url):
    # Run the request in its own greenlet so this task can keep yielding.
    resp = gevent.spawn(requests.get, url)
    tmp = 0
    while True:
        print('wait...', tmp)
        if resp.ready():
            return 'from:' + socket.getfqdn() + '\nres:' + str(resp.value.text)
        # Yield to other greenlets for one second before checking again.
        gevent.sleep(1)
        tmp += 1
Start the Celery worker with the gevent pool:
celery -A tasks worker --loglevel info -c 100 -P gevent
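The polling loop in spider only works because each wait hands control to other greenlets in the same process. The sketch below illustrates that cooperative interleaving with asyncio from the standard library as a stand-in for gevent; it is not the gevent API, and `fake_spider` and `run_all` are hypothetical names used only for illustration.

```python
import asyncio


async def fake_spider(name, delay, log):
    """Wait for a simulated slow response, recording a tick on every poll."""
    # The background task stands in for the spawned requests.get greenlet.
    waiter = asyncio.ensure_future(
        asyncio.sleep(delay, result='response for ' + name))
    ticks = 0
    while not waiter.done():
        log.append((name, ticks))
        # Yield control, as gevent.sleep does in the real task.
        await asyncio.sleep(0.05)
        ticks += 1
    return waiter.result()


async def run_all(names, delay=0.2):
    """Run one fake spider per name inside a single event loop."""
    log = []
    results = await asyncio.gather(
        *(fake_spider(name, delay, log) for name in names))
    return results, log


results, log = asyncio.run(run_all(['a', 'b', 'c']))
```

The entries in `log` alternate between the three names instead of finishing one spider before starting the next, which is the same pattern that shows up in the worker log.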
Producer
Use the spider function written above to crawl the target.
In testing, I ran the code below in 6 processes at once, and every result came back within 7 seconds, which shows that it worked.
from tasks import spider
import time
import random

# Queue one task; the random path just makes each request distinct.
res = spider.delay('http://127.0.0.1:10240/{}'.format(random.randint(1, 999)))
i = 0
while True:
    if res.ready():
        print('res:', res.get())
        break
    else:
        print('wait...', i)
    time.sleep(1)
    i += 1
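Instead of launching several producer processes, a single producer could queue several tasks and poll them together. If the six 6-second requests ran one after another they would take about 36 seconds, so finishing in roughly 7 seconds is the sign of concurrency. The helper below is a hypothetical sketch, not part of the original code; it assumes only that each result object exposes `ready()` and `get()`, which Celery's `AsyncResult` does.

```python
import time


def poll_until_ready(results, interval=1.0, timeout=30.0):
    """Poll result objects until all are ready; return values in input order.

    `results` is a list of objects with ready() and get() methods.
    Raises TimeoutError if some are still pending after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    pending = dict(enumerate(results))
    values = {}
    while pending:
        # Collect everything that has finished since the last check.
        for idx in [i for i, r in pending.items() if r.ready()]:
            values[idx] = pending.pop(idx).get()
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(
                '{} result(s) still pending'.format(len(pending)))
        time.sleep(interval)
    return [values[i] for i in range(len(results))]
```

With the worker running, something like `poll_until_ready([spider.delay(url) for url in urls])` would return the six responses once the slowest one has arrived.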
Part of the Celery log output:
It shows that, within a single Celery process, multiple spider calls take turns executing.
[2016-08-20 21:27:11,281: INFO/MainProcess] Starting new HTTP connection (1): 127.0.0.1
[2016-08-20 21:27:11,313: INFO/MainProcess] Received task: tasks.spider[7b8b6f63-2bef-491e-a3a8-fdbcff824b9c]
[2016-08-20 21:27:11,314: WARNING/MainProcess] wait...
[2016-08-20 21:27:11,314: WARNING/MainProcess] 0
[2016-08-20 21:27:11,316: INFO/MainProcess] Starting new HTTP connection (1): 127.0.0.1
[2016-08-20 21:27:11,354: INFO/MainProcess] Received task: tasks.spider[5aa05e65-504d-4a04-8247-3f5708bfa46f]
[2016-08-20 21:27:11,356: WARNING/MainProcess] wait...
[2016-08-20 21:27:11,356: WARNING/MainProcess] 0
[2016-08-20 21:27:11,357: INFO/MainProcess] Starting new HTTP connection (1): 127.0.0.1
[2016-08-20 21:27:11,821: WARNING/MainProcess] wait...
[2016-08-20 21:27:11,821: WARNING/MainProcess] 1
[2016-08-20 21:27:11,989: WARNING/MainProcess] wait...
[2016-08-20 21:27:11,990: WARNING/MainProcess] 1
[2016-08-20 21:27:12,059: WARNING/MainProcess] wait...
[2016-08-20 21:27:12,059: WARNING/MainProcess] 2
[2016-08-20 21:27:12,208: WARNING/MainProcess] wait...
[2016-08-20 21:27:12,209: WARNING/MainProcess] 1
[2016-08-20 21:27:12,225: WARNING/MainProcess] wait...
[2016-08-20 21:27:12,225: WARNING/MainProcess] 1
[2016-08-20 21:27:12,246: WARNING/MainProcess] wait...
[2016-08-20 21:27:12,247: WARNING/MainProcess] 2
[2016-08-20 21:27:12,282: WARNING/MainProcess] wait...
[2016-08-20 21:27:12,282: WARNING/MainProcess] 1
[2016-08-20 21:27:12,316: WARNING/MainProcess] wait...
[2016-08-20 21:27:12,316: WARNING/MainProcess] 1
[2016-08-20 21:27:12,357: WARNING/MainProcess] wait...
[2016-08-20 21:27:12,357: WARNING/MainProcess] 1
[2016-08-20 21:27:12,823: WARNING/MainProcess] wait...
[2016-08-20 21:27:12,823: WARNING/MainProcess] 2
[2016-08-20 21:27:12,991: WARNING/MainProcess] wait...
[2016-08-20 21:27:12,992: WARNING/MainProcess] 2
[2016-08-20 21:27:13,061: WARNING/MainProcess] wait...
[2016-08-20 21:27:13,061: WARNING/MainProcess] 3
[2016-08-20 21:27:13,210: WARNING/MainProcess] wait...
[2016-08-20 21:27:13,211: WARNING/MainProcess] 2
[2016-08-20 21:27:13,227: WARNING/MainProcess] wait...
[2016-08-20 21:27:13,227: WARNING/MainProcess] 2
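Conclusion
With Celery, the crawler scales horizontally with little effort: just add consumer processes on more servers.
With gevent, requests becomes non-blocking inside a single process, whereas I used to handle blocking with multiple threads.
I have only been learning Celery and gevent for one day. Now that this little toy is done, it is time to read the documentation and dig deeper!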