Installation
The service
pip install scrapyd
Command-line tool
python3 -m pip install scrapyd-client
Python client library
python3 -m pip install python-scrapyd-api
Find the directory where pip placed the executables, and make scrapyd runnable directly from the shell.
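If scrapyd is not already on your PATH, a symlink is enough; a sketch assuming the same Python prefix as the scrapyd-deploy symlink shown later:

ln -s /usr/local/bin/python3/bin/scrapyd /usr/bin/scrapyd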
Start the service
scrapyd
On startup, scrapyd reads the default configuration file shipped inside the package (default_scrapyd.conf). The most important options:
max_proc
Maximum number of concurrent Scrapy processes. The default is 0, which means no fixed limit: the effective maximum becomes the number of available CPUs * max_proc_per_cpu.
max_proc_per_cpu
Maximum number of Scrapy processes per CPU (default 4).
bind_address
IP address to bind to. Set it to 0.0.0.0 to make the service reachable from other machines (the port must be open in the firewall).
http_port
HTTP port to listen on (default 6800).
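These can be overridden in your own scrapyd.conf. A minimal sketch that keeps the shipped defaults except for bind_address, which is opened up for remote access:

[scrapyd]
bind_address     = 0.0.0.0
http_port        = 6800
max_proc         = 0
max_proc_per_cpu = 4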
After starting scrapyd, open the corresponding address in a browser and you will see the web console's home page.
The command examples listed on the home page are not complete; you can edit website.py inside the scrapyd package and add the missing ones to the HTML it renders:
<p><code>curl http://localhost:6800/schedule.json -d project=default -d spider=somespider</code></p>
<p><code> curl http://localhost:6800/cancel.json -d project=myproject -d job=6487ec79947edab326d6db28a2d86511e8247444 </code></p>
<p><code> curl http://localhost:6800/listprojects.json </code></p>
<p><code> curl http://localhost:6800/listversions.json?project=myproject </code></p>
<p><code> curl http://localhost:6800/listspiders.json?project=myproject </code></p>
<p><code> curl http://localhost:6800/listjobs.json?project=myproject </code></p>
<p><code> curl http://localhost:6800/delproject.json -d project=myproject </code></p>
<p><code> curl http://localhost:6800/delversion.json -d project=myproject -d version=r99 </code></p>
After the change, the home page lists the full set of commands.
That completes the scrapyd service setup. Next we deploy a Sina news crawler to show how a spider project is uploaded to scrapyd.
cd sinanews
Entering the project's top-level directory shows the standard Scrapy project layout; scrapy.cfg is the file relevant to deployment.
The url parameter can stay at its default when deploying to the local machine; project is the project name, which identifies the project inside scrapyd.
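A scrapy.cfg consistent with the commands below might look like this; the target name abc and project name sinanews come from the deploy command used later, and the settings module is an assumption:

[settings]
default = sinanews.settings

[deploy:abc]
url = http://localhost:6800/
project = sinanews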
Uploading is done with the scrapyd-deploy command. Installing scrapyd-client places the executable under Python's bin directory, so it needs a symlink too:
ln -s /usr/local/bin/python3/bin/scrapyd-deploy /usr/bin/scrapyd-deploy
scrapyd-deploy -l
會根據(jù)scrapy.cfg文件列出可以選擇的tag和project,tag是:后面的標識
To upload, run:
scrapyd-deploy <target> -p <project> --version <version>
scrapyd-deploy abc -p sinanews --version 1
The version argument may be omitted, in which case a value is generated automatically. A successful deploy returns a confirmation like the sketch below.
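A sketch of the output of a successful deploy; the node_name and spider count are assumptions based on the transcripts below:

Packing version 1
Deploying to project "sinanews" in http://localhost:6800/addversion.json
Server response (200):
{"node_name": "localhost.localdomain", "status": "ok", "project": "sinanews", "version": "1", "spiders": 1}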
Back in the web UI, the project you just uploaded now appears.
Let's exercise the API.
List projects
[root@localhost sinanews]# curl http://localhost:6800/listprojects.json
{"node_name": "localhost.localdomain", "status": "ok", "projects": ["sinanews"]}
List versions
[root@localhost sinanews]# curl http://localhost:6800/listversions.json?project=sinanews
{"node_name": "localhost.localdomain", "status": "ok", "versions": ["1"]}
Deploy a version 2, then list versions again:
[root@localhost sinanews]# curl http://localhost:6800/listversions.json?project=sinanews
{"node_name": "localhost.localdomain", "status": "ok", "versions": ["1", "2"]}
List spiders
[root@localhost sinanews]# curl http://localhost:6800/listspiders.json?project=sinanews
{"node_name": "localhost.localdomain", "status": "ok", "spiders": ["sina"]}
Run a spider
[root@localhost sinanews]# curl http://localhost:6800/schedule.json -d project=sinanews -d spider=sina
{"node_name": "localhost.localdomain", "status": "ok", "jobid": "2157910a9ef811e995c020040fe78714"}
Cancel a job
[root@localhost sinanews]# curl http://localhost:6800/cancel.json -d project=sinanews -d job=2157910a9ef811e995c020040fe78714
{"node_name": "localhost.localdomain", "status": "ok", "prevstate": null}
Delete a project
curl http://localhost:6800/delproject.json -d project=myproject
Delete a specific version
curl http://localhost:6800/delversion.json -d project=myproject -d version=r99
Log file storage
Logs are stored as <logs dir>/<project>/<spider>/<job id>.log; how many are kept per spider is governed by the configuration (the jobs_to_keep setting).
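For the job scheduled above, for example, the log ends up at (assuming the default logs_dir):

logs/sinanews/sina/2157910a9ef811e995c020040fe78714.log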
Eggs
Uploaded project code is packed into an egg file, stored as <eggs dir>/<project>/<version>.egg.
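For the two versions deployed above, and assuming the default eggs_dir, that would be:

eggs/sinanews/1.egg
eggs/sinanews/2.egg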
Using scrapyd_api
Scheduling
from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://localhost:6800')
# schedule() returns the job id of the newly started run
job_id = scrapyd.schedule('sinanews', 'sina')
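With the job id in hand, the run can be cancelled through the same client (cancel is part of python-scrapyd-api's standard API):

scrapyd.cancel('sinanews', job_id)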
Modifying the source to make cancel easier to use
cancel.json requires a job id, which you may not have kept around, so we add a spiderid.json endpoint to scrapyd that returns the latest job id for a given project and spider.
# scrapyd/webservice.py -- add this class (WsResource, native_stringify_dict
# and copy are already imported at the top of this module)
class SpiderId(WsResource):

    def render_POST(self, txrequest):
        args = native_stringify_dict(copy(txrequest.args), keys_only=False)
        project = args['project'][0]
        spider = args['spider'][0]
        spiders = self.root.launcher.processes.values()
        running = [(s.job, s.start_time.isoformat(' '))
                   for s in spiders if s.project == project and s.spider == spider]
        # pending jobs could be included as well:
        # queue = self.root.poller.queues[project]
        # pending = [(x["_job"],) for x in queue.list() if x["name"] == spider]
        finished = [(s.job, s.start_time.isoformat(' '))
                    for s in self.root.launcher.finished
                    if s.project == project and s.spider == spider]
        alist = running + finished
        if len(alist) == 0:
            return {"node_name": self.root.nodename, "status": "error",
                    "message": 'no such project or spider'}
        # pick the most recently started job; ISO timestamps sort chronologically
        last_id = max(alist, key=lambda a: a[1])
        return {"node_name": self.root.nodename, "status": "ok", 'id': last_id[0]}
# scrapyd/default_scrapyd.conf -- register the endpoint under the [services] section
spiderid.json = scrapyd.webservice.SpiderId
# scrapyd/website.py -- the home page HTML, with the new spiderid.json command added:
<p><code> curl http://localhost:6800/cancel.json -d project=myproject -d job=6487ec79947edab326d6db28a2d86511e8247444 </code></p>
<p><code> curl http://localhost:6800/listprojects.json </code></p>
<p><b><code> curl http://localhost:6800/spiderid.json -d project=myproject -d spider=somespider </code></b></p>
<p><code> curl http://localhost:6800/listversions.json?project=myproject </code></p>
<p><code> curl http://localhost:6800/listspiders.json?project=myproject </code></p>
<p><code> curl http://localhost:6800/listjobs.json?project=myproject </code></p>
<p><code> curl http://localhost:6800/delproject.json -d project=myproject </code></p>
<p><code> curl http://localhost:6800/delversion.json -d project=myproject -d version=r99 </code></p>
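These are edits to the installed scrapyd package, so restart the scrapyd process for them to take effect. The new endpoint can then be exercised directly; given the SpiderId class above, the response should look roughly like this (the id echoes whichever job ran last):

curl http://localhost:6800/spiderid.json -d project=sinanews -d spider=sina
{"node_name": "localhost.localdomain", "status": "ok", "id": "2157910a9ef811e995c020040fe78714"}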
Changes to the scrapyd-api client code
# scrapyd_api/constants.py -- add the endpoint constant and register it
SPIDERID_ENDPOINT = 'spiderid'

DEFAULT_ENDPOINTS = {
    ADD_VERSION_ENDPOINT: '/addversion.json',
    CANCEL_ENDPOINT: '/cancel.json',
    DELETE_PROJECT_ENDPOINT: '/delproject.json',
    DELETE_VERSION_ENDPOINT: '/delversion.json',
    LIST_JOBS_ENDPOINT: '/listjobs.json',
    LIST_PROJECTS_ENDPOINT: '/listprojects.json',
    LIST_SPIDERS_ENDPOINT: '/listspiders.json',
    LIST_VERSIONS_ENDPOINT: '/listversions.json',
    SCHEDULE_ENDPOINT: '/schedule.json',
    SPIDERID_ENDPOINT: '/spiderid.json',
}
# scrapyd_api/wrapper.py -- add this method to the ScrapydAPI class
def spiderid(self, project, spider):
    """Return the job id of the most recent run of `spider` in `project`."""
    url = self._build_url(constants.SPIDERID_ENDPOINT)
    params = {'project': project, 'spider': spider}
    json = self.client.post(url, data=params)
    return json['id']
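With all three changes in place, cancelling the most recent run of a spider no longer requires keeping track of job ids yourself; a minimal sketch using this article's project and spider names:

from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://localhost:6800')
# look up the latest job id through the new spiderid.json endpoint, then cancel it
job_id = scrapyd.spiderid('sinanews', 'sina')
scrapyd.cancel('sinanews', job_id)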