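YiSpider is a distributed crawler platform that helps you manage and develop crawlers more easily.
It ships with a built-in set of crawler definition rules (templates): you can define a crawler quickly from a template, or use YiSpider as a framework and write the crawler by hand.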
Gitee: https://gitee.com/bilibala/YiSpider
GitHub: https://github.com/2young2simple/yispider
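Architecture
The framework currently consists of two parts:
1. Crawler part (spider node):
Its internal structure is modeled on Python's Scrapy framework and is built mainly around the schedule, page process, and pipline components. Each crawler has its own scheduler and its own context management. Two pipline outputs are currently built in: console and file. Node information is registered in etcd so that manage nodes can discover spider nodes.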
- core: manages the crawler lifecycle and context, and runs the crawlers.
- schedule: schedules crawler requests. (Currently there is only a channel-based scheduler, so a single crawler cannot run on multiple workers; you can add that by implementing your own scheduler backed by Redis or an MQ service.)
- process (page process): handles the results of requests.
- pipline: writes results to different destinations, such as the console, files, message queues, databases, and so on.
- register: handles service registration (currently only etcd is supported).
- http: provides some HTTP endpoints.
2. Management part (manage node):
Manages the spider nodes. It discovers spider nodes through etcd and communicates with them over HTTP.
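To illustrate the channel-based scheduler mentioned for the spider node, here is a minimal sketch. The `Request` struct, the `Scheduler` interface, and the `ChannelScheduler` type are hypothetical names chosen for this example and are not taken from the YiSpider source.

```go
package main

import "fmt"

// Request is a hypothetical, simplified request type for this sketch.
type Request struct {
	Method      string
	Url         string
	ProcessName string
}

// Scheduler is an assumed interface: Push enqueues a request, Pop dequeues one.
type Scheduler interface {
	Push(r *Request) bool
	Pop() (*Request, bool)
}

// ChannelScheduler keeps pending requests in a buffered, in-process channel.
// Because the queue lives inside one process, workers on other nodes cannot share it.
type ChannelScheduler struct {
	queue chan *Request
}

// NewChannelScheduler creates a scheduler that can hold up to size pending requests.
func NewChannelScheduler(size int) *ChannelScheduler {
	return &ChannelScheduler{queue: make(chan *Request, size)}
}

// Push adds a request without blocking; it reports false when the queue is full.
func (s *ChannelScheduler) Push(r *Request) bool {
	select {
	case s.queue <- r:
		return true
	default:
		return false
	}
}

// Pop removes the oldest request without blocking; it reports false when the queue is empty.
func (s *ChannelScheduler) Pop() (*Request, bool) {
	select {
	case r := <-s.queue:
		return r, true
	default:
		return nil, false
	}
}

func main() {
	var s Scheduler = NewChannelScheduler(2)
	s.Push(&Request{Method: "get", Url: "https://example.com/a", ProcessName: "list"})
	s.Push(&Request{Method: "get", Url: "https://example.com/b", ProcessName: "list"})
	for {
		r, ok := s.Pop()
		if !ok {
			break
		}
		fmt.Println(r.Method, r.Url)
	}
}
```

A scheduler backed by Redis or an MQ service could satisfy the same two-method interface while keeping the queue outside the process, which is what would let several workers serve one crawler.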
Getting started
1. JSON template
Calling the HTTP endpoint
curl -d '{"id":"douban-movie","Name":"douban-movie","request":[{"url":"https://movie.douban.com/j/new_search_subjects?sort=T\u0026range=0,10\u0026tags=\u0026start={0-100,20}","method":"get","type":"","data":null,"header":null,"cookies":{"url":"","data":""},"process_name":"movie"}],"process":[{"name":"movie","reg_url":null,"type":"json","template_rule":{"Rule":null},"json_rule":{"Rule":{"casts":"casts","cover":"cover","id":"id","node":"array|data","rate":"rate","star":"star","title":"title","url":"url"}},"add_queue":null}],"pipline":"file","depth":0,"end_count":0}' "http://127.0.0.1:7774/task/addAndRun"
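The same call can be made from Go. This is a minimal sketch using only the standard library: the address and the `/task/addAndRun` path come from the curl command above, while `newTaskRequest` and the `task.json` file are hypothetical names for this example. The response format is not described here, so the body is printed unparsed.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
)

// newTaskRequest builds the POST request that submits a task definition to a spider node.
func newTaskRequest(baseURL string, taskJSON []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, baseURL+"/task/addAndRun", bytes.NewReader(taskJSON))
	if err != nil {
		return nil, err
	}
	// curl -d sends this content type by default; whether the server checks it is an assumption.
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	return req, nil
}

func main() {
	// task.json is a hypothetical file holding one of the JSON templates.
	task, err := os.ReadFile("task.json")
	if err != nil {
		fmt.Println("read task:", err)
		return
	}
	req, err := newTaskRequest("http://127.0.0.1:7774", task)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("submit task:", err)
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read response:", err)
		return
	}
	fmt.Println(resp.StatusCode, string(body))
}
```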
Douban movie template
{
  "id": "douban-movie",
  "Name": "douban-movie",
  "request": [
    {
      "url": "https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10,20}",
      "method": "get",
      "process_name": "movie"
    }
  ],
  "process": [
    {
      "name": "movie",
      "type": "json",
      "json_rule": {
        "Rule": {
          "casts": "casts",
          "cover": "cover",
          "id": "id",
          "node": "array|data",
          "rate": "rate",
          "star": "star",
          "title": "title",
          "url": "url"
        }
      },
      "add_queue": null
    }
  ],
  "pipline": "file",
  "depth": 0,
  "end_count": 0
}
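The request URL above contains a placeholder of the form `{start-end,step}`. The sketch below shows one plausible reading of it: one URL is generated per value from start to end in increments of step. The inclusive upper bound and the `expandRange` helper are assumptions made for this example; YiSpider's own expansion logic may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// rangePattern matches a numeric placeholder such as {0-100,20}.
var rangePattern = regexp.MustCompile(`\{(\d+)-(\d+),(\d+)\}`)

// expandRange replaces the first {start-end,step} placeholder with each value
// start, start+step, ... up to and including end (assumed to be inclusive).
// A URL without a placeholder is returned unchanged; an empty range yields no URLs.
func expandRange(url string) []string {
	m := rangePattern.FindStringSubmatch(url)
	if m == nil {
		return []string{url}
	}
	start, err1 := strconv.Atoi(m[1])
	end, err2 := strconv.Atoi(m[2])
	step, err3 := strconv.Atoi(m[3])
	if err1 != nil || err2 != nil || err3 != nil || step <= 0 {
		return []string{url} // leave malformed placeholders untouched
	}
	var urls []string
	for i := start; i <= end; i += step {
		urls = append(urls, strings.Replace(url, m[0], strconv.Itoa(i), 1))
	}
	return urls
}

func main() {
	base := "https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-100,20}"
	for _, u := range expandRange(base) {
		fmt.Println(u)
	}
}
```

Under this reading, `{0-100,20}` yields six URLs (start=0, 20, 40, 60, 80, 100), while `{0-10,20}` yields only the first page.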
dilidili template
{
  "id": "dilidili",
  "Name": "dilidili",
  "request": [
    {
      "url": "http://www.dilidili.wang/{gaoxiao|kehuan|yundong|danmei|zhiyuxi|luoli|zhenren|zhuangbi|youxi|tuili|qingchun|kongbu|jizhan|rexue|qingxiaoshuo|maoxian|hougong|qihuan|tongnian|lianai|meishaonv|lizhi|baihe|paomianfan|yinv}/",
      "method": "get",
      "process_name": "animelist"
    }
  ],
  "process": [
    {
      "name": "animelist",
      "type": "template",
      "template_rule": {
        "Rule": {
          "content": "text|dd div",
          "desc": "text|dd p",
          "href": "attr.href|dt a",
          "img": "attr.src|dt a img",
          "node": "array|.anime_list dl",
          "title": "text|dd h3 a"
        }
      },
      "add_queue": [
        {
          "url": "http://www.dilidili.wang{href}",
          "method": "get",
          "process_name": "animeinfo"
        }
      ]
    },
    {
      "name": "animeinfo",
      "type": "template",
      "template_rule": {
        "Rule": {
          "episode": "texts|.time_con .swiper-slide .clear li a em",
          "episode-link": "attrs.href|.time_con .swiper-slide .clear li a",
          "title": "text|.detail dl dd h1"
        }
      },
      "add_queue": [
        {
          "url": "{episode-link}",
          "method": "get",
          "process_name": "episodeinfo"
        }
      ]
    },
    {
      "name": "episodeinfo",
      "reg_url": null,
      "type": "template",
      "template_rule": {
        "Rule": {
          "player": "attr.src|.player_main iframe",
          "title": "text|#intro2 h1",
          "url": "attr.href|link[rel=\"canonical\"]"
        }
      },
      "add_queue": null
    }
  ],
  "pipline": "file",
  "depth": 0,
  "end_count": 0
}
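The rule strings in the templates above follow a `kind|selector` shape, where the kind may carry an attribute name after a dot (for example `attr.href|dt a`). The sketch below splits such a string into its parts. The `Rule` struct and `parseRule` function are hypothetical, and the meaning of each kind (`text`, `texts`, `attr`, `attrs`, `array`) is inferred from the examples rather than taken from the YiSpider source.

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is a hypothetical parsed form of a template rule string.
type Rule struct {
	Kind     string // e.g. "text", "texts", "attr", "attrs", "array"; empty when no prefix is given
	Attr     string // attribute name for "attr"/"attrs" rules, otherwise empty
	Selector string // everything after the first "|"
}

// parseRule splits "kind.attr|selector" at the first "|" and then splits the
// kind at the first ".". A string without "|" is treated as a bare selector.
func parseRule(s string) Rule {
	i := strings.Index(s, "|")
	if i < 0 {
		return Rule{Selector: s}
	}
	kind, selector := s[:i], s[i+1:]
	attr := ""
	if j := strings.Index(kind, "."); j >= 0 {
		kind, attr = kind[:j], kind[j+1:]
	}
	return Rule{Kind: kind, Attr: attr, Selector: selector}
}

func main() {
	for _, s := range []string{
		"array|.anime_list dl",
		"attr.href|dt a",
		"texts|.time_con .swiper-slide .clear li a em",
		"casts",
	} {
		fmt.Printf("%+v\n", parseRule(s))
	}
}
```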
2. Writing a template in code
Douban movies
package main

import (
	"YiSpider/spider"
	"YiSpider/spider/model"
	spider2 "YiSpider/spider/spider"
)

func main() {
	// Define the task in code; the fields mirror the JSON template.
	task := &model.Task{
		Id:   "douban-movie",
		Name: "douban-movie",
		Request: []*model.Request{
			{
				Method:      "get",
				Url:         "https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={0-10000,20}",
				ProcessName: "movie",
			},
		},
		Process: []model.Process{
			{
				Name: "movie",
				Type: "json",
				JsonRule: model.JsonRule{
					Rule: map[string]string{
						"node":  "array|data",
						"rate":  "rate",
						"star":  "star",
						"id":    "id",
						"url":   "url",
						"title": "title",
						"cover": "cover",
						"casts": "casts",
					},
				},
			},
		},
		Pipline: "file",
	}

	// Create the app, register a spider built from the task, and run it.
	app := spider.New()
	app.AddSpider(spider2.InitWithTask(task))
	app.Run()
}
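To make the `json` rule type more concrete, the sketch below shows one way such a rule could be applied to a response body using `encoding/json`. It assumes that `"node": "array|data"` selects the array under the top-level `data` key and that every other entry maps an output field to a key of each array element. The `applyJSONRule` function and the sample payload are made up for this example and are not YiSpider's implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// applyJSONRule extracts one row per element of the array named by the "node" rule.
// Each remaining rule entry maps an output field name to a key of the element.
func applyJSONRule(body []byte, rule map[string]string) ([]map[string]interface{}, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal(body, &doc); err != nil {
		return nil, err
	}
	nodeKey := strings.TrimPrefix(rule["node"], "array|")
	items, ok := doc[nodeKey].([]interface{})
	if !ok {
		return nil, fmt.Errorf("key %q is missing or not an array", nodeKey)
	}
	var rows []map[string]interface{}
	for _, item := range items {
		obj, ok := item.(map[string]interface{})
		if !ok {
			continue // skip elements that are not JSON objects
		}
		row := map[string]interface{}{}
		for field, key := range rule {
			if field == "node" {
				continue
			}
			row[field] = obj[key]
		}
		rows = append(rows, row)
	}
	return rows, nil
}

func main() {
	// Made-up sample response; a real payload would carry more fields.
	body := []byte(`{"data":[{"id":"1","title":"A","rate":"9.0"},{"id":"2","title":"B","rate":"8.5"}]}`)
	rule := map[string]string{"node": "array|data", "id": "id", "title": "title", "rate": "rate"}
	rows, err := applyJSONRule(body, rule)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, row := range rows {
		fmt.Println(row["id"], row["title"], row["rate"])
	}
}
```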
dilidili anime series
package main

import (
	"YiSpider/spider"
	"YiSpider/spider/model"
	spider2 "YiSpider/spider/spider"
)

func main() {
	// Define the task in code; the fields mirror the JSON template.
	task := &model.Task{
		Id:   "dilidili",
		Name: "dilidili",
		Request: []*model.Request{
			{
				Method:      "get",
				Url:         "http://www.dilidili.wang/{gaoxiao|kehuan|yundong|danmei|zhiyuxi|luoli|zhenren|zhuangbi|youxi|tuili|qingchun|kongbu|jizhan|rexue|qingxiaoshuo|maoxian|hougong|qihuan|tongnian|lianai|meishaonv|lizhi|baihe|paomianfan|yinv}/",
				ProcessName: "animelist",
			},
		},
		Process: []model.Process{
			{
				// Category list page: queue each series page for "animeinfo".
				Name: "animelist",
				Type: "template",
				TemplateRule: model.TemplateRule{
					Rule: map[string]string{
						"node":    "array|.anime_list dl",
						"img":     "attr.src|dt a img",
						"title":   "text|dd h3 a",
						"href":    "attr.href|dt a",
						"content": "text|dd div",
						"desc":    "text|dd p",
					},
				},
				AddQueue: []*model.Request{
					{
						Method:      "get",
						Url:         "http://www.dilidili.wang{href}",
						ProcessName: "animeinfo",
					},
				},
			},
			{
				// Series page: queue each episode link for "episodeinfo".
				Name: "animeinfo",
				Type: "template",
				TemplateRule: model.TemplateRule{
					Rule: map[string]string{
						"episode":      "texts|.time_con .swiper-slide .clear li a em",
						"title":        "text|.detail dl dd h1",
						"episode-link": "attrs.href|.time_con .swiper-slide .clear li a",
					},
				},
				AddQueue: []*model.Request{
					{
						Method:      "get",
						Url:         "{episode-link}",
						ProcessName: "episodeinfo",
					},
				},
			},
			{
				// Episode page: final step, nothing further is queued.
				Name: "episodeinfo",
				Type: "template",
				TemplateRule: model.TemplateRule{
					Rule: map[string]string{
						"url":    "attr.href|link[rel=\"canonical\"]",
						"title":  "text|#intro2 h1",
						"player": "attr.src|.player_main iframe",
					},
				},
			},
		},
		Pipline: "file",
	}

	// Create the app, register a spider built from the task, and run it.
	app := spider.New()
	app.AddSpider(spider2.InitWithTask(task))
	app.Run()
}
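The `AddQueue` entries above use placeholders such as `{href}` and `{episode-link}` that refer to fields extracted by the same process. The sketch below shows one plausible way to turn a URL template and the extracted values into follow-up URLs, producing one URL per value so that plural rules such as `attrs.href` fan out into several requests. The `followUpURLs` helper, the single-placeholder limitation, and the sample paths are assumptions for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fieldPattern matches a field placeholder such as {href} or {episode-link}.
var fieldPattern = regexp.MustCompile(`\{([A-Za-z0-9_-]+)\}`)

// followUpURLs fills the first {field} placeholder in the template with every
// value extracted for that field. A template without a placeholder is returned
// as-is; a field with no extracted values yields no URLs.
func followUpURLs(template string, item map[string][]string) []string {
	m := fieldPattern.FindStringSubmatch(template)
	if m == nil {
		return []string{template}
	}
	var urls []string
	for _, value := range item[m[1]] {
		urls = append(urls, strings.Replace(template, m[0], value, 1))
	}
	return urls
}

func main() {
	// Made-up extracted values for illustration only.
	listItem := map[string][]string{"href": {"/anime/example/"}}
	fmt.Println(followUpURLs("http://www.dilidili.wang{href}", listItem))

	infoItem := map[string][]string{"episode-link": {"http://example.com/ep/1", "http://example.com/ep/2"}}
	fmt.Println(followUpURLs("{episode-link}", infoItem))
}
```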
- Writing a crawler in pure code