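I stumbled on colly by accident. I have always written crawlers in Python; while learning Go, I wrote a demo crawler framework in Go modeled on the scrapy architecture. I had always assumed Go was a poor fit for crawling and that its home turf was back-end services. Then I searched for colly and found that it is quite popular. I personally enjoy writing crawlers: data on the web is effectively a public API, so a crawler simply requests those endpoints to fetch data. Of course, I follow the gentleman's agreement (robots.txt).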
With that, on to the main topic: an introduction to colly.
Introduction to colly
Lightning Fast and Elegant Scraping Framework for Gophers
Colly provides a clean interface to write any kind of crawler/scraper/spider.
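According to the official introduction, gocolly is fast and elegant and can issue more than 1K requests per second on a single core. It exposes its interface as a set of callback functions, which is enough to implement any kind of crawler, and it relies on the goquery library so that web elements can be selected jQuery-style.
Installation and usage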
go get -u github.com/gocolly/colly/...
import "github.com/gocolly/colly"
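Architecture
Anyone familiar with crawlers knows the lifecycle of a crawl request:
- Build the request
- Send the request
- Receive the document or data
- Parse the document or clean the data
- Process or persist the data
scrapy's design philosophy is to extract each of the steps above into its own component and then have a scheduler chain the components into a pipeline.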
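To make the component idea concrete, here is a minimal Go sketch in which each lifecycle step is its own interface and a small `run` function plays the scheduler. The names `Downloader`, `Parser`, `Pipeline`, `mapDownloader`, `lineParser`, `sliceSink` and `run` are all invented for this illustration; they are not scrapy's or colly's real types, and the "pages" are served from memory rather than the network.

```go
package main

import (
	"fmt"
	"strings"
)

// Downloader, Parser and Pipeline are hypothetical stage interfaces
// for this sketch only.
type Downloader interface {
	Fetch(url string) (string, error)
}

type Parser interface {
	Parse(body string) []string
}

type Pipeline interface {
	Process(item string) error
}

// mapDownloader serves canned pages from memory instead of the network.
type mapDownloader map[string]string

func (m mapDownloader) Fetch(url string) (string, error) {
	body, ok := m[url]
	if !ok {
		return "", fmt.Errorf("no page for %s", url)
	}
	return body, nil
}

// lineParser treats every non-empty line of the body as one item.
type lineParser struct{}

func (lineParser) Parse(body string) []string {
	var items []string
	for _, line := range strings.Split(body, "\n") {
		if s := strings.TrimSpace(line); s != "" {
			items = append(items, s)
		}
	}
	return items
}

// sliceSink upper-cases each item and keeps it in memory.
type sliceSink struct{ items []string }

func (s *sliceSink) Process(item string) error {
	s.items = append(s.items, strings.ToUpper(item))
	return nil
}

// run is the "scheduler": it pushes one URL through every stage in order.
func run(url string, d Downloader, p Parser, out Pipeline) error {
	body, err := d.Fetch(url)
	if err != nil {
		return err
	}
	for _, item := range p.Parse(body) {
		if err := out.Process(item); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	pages := mapDownloader{"http://example.test/": "alpha\n\nbeta\n"}
	sink := &sliceSink{}
	if err := run("http://example.test/", pages, lineParser{}, sink); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sink.items)
}
```

Because every stage sits behind an interface, any one of them can be swapped (a real HTTP downloader, a database-backed pipeline) without touching the others, which is the property the component design is after.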
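Let's look at scrapy's architecture diagram. This is only a brief overview; when I have time later I will cover scrapy in depth.
As the diagram shows, the downloader requests and fetches pages, the spiders hold the document-parsing logic, and the item pipeline does the final data processing. In between sit middlewares that decorate the flow with extra features such as proxies and request rate limiting.
Now for colly's architecture.
colly's logic reads more like procedural programming: it processes a request as a pipeline in the order of the lifecycle above, and at each stage it invokes the callback functions registered for that stage.
The walkthrough below follows the same order.
Source code analysis
First, an example: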
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Instantiate default collector
	c := colly.NewCollector(
		// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
		colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
	)

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})

	// Before making a request print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}
This is the official example. colly.NewCollector creates a collector, and all of colly's processing logic is built around the Collector.
Here is the definition of the Collector struct:
// Collector provides the scraper instance for a scraping job
type Collector struct {
	// UserAgent is the User-Agent string used by HTTP requests
	UserAgent string
	// MaxDepth limits the recursion depth of visited URLs.
	// Set it to 0 for infinite recursion (default).
	MaxDepth int
	// AllowedDomains is a domain whitelist.
	// Leave it blank to allow any domains to be visited
	AllowedDomains []string
	// DisallowedDomains is a domain blacklist.
	DisallowedDomains []string
	// DisallowedURLFilters is a list of regular expressions which restricts
	// visiting URLs. If any of the rules matches to a URL the
	// request will be stopped. DisallowedURLFilters will
	// be evaluated before URLFilters
	// Leave it blank to allow any URLs to be visited
	DisallowedURLFilters []*regexp.Regexp
	// URLFilters is a list of regular expressions which restricts
	// visiting URLs. If any of the rules matches to a URL the
	// request won't be stopped. DisallowedURLFilters will
	// be evaluated before URLFilters
	// Leave it blank to allow any URLs to be visited
	URLFilters []*regexp.Regexp
	// AllowURLRevisit allows multiple downloads of the same URL
	AllowURLRevisit bool
	// MaxBodySize is the limit of the retrieved response body in bytes.
	// 0 means unlimited.
	// The default value for MaxBodySize is 10MB (10 * 1024 * 1024 bytes).
	MaxBodySize int
	// CacheDir specifies a location where GET requests are cached as files.
	// When it's not defined, caching is disabled.
	CacheDir string
	// IgnoreRobotsTxt allows the Collector to ignore any restrictions set by
	// the target host's robots.txt file. See http://www.robotstxt.org/ for more
	// information.
	IgnoreRobotsTxt bool
	// Async turns on asynchronous network communication. Use Collector.Wait() to
	// be sure all requests have been finished.
	Async bool
	// ParseHTTPErrorResponse allows parsing HTTP responses with non 2xx status codes.
	// By default, Colly parses only successful HTTP responses. Set ParseHTTPErrorResponse
	// to true to enable it.
	ParseHTTPErrorResponse bool
	// ID is the unique identifier of a collector
	ID uint32
	// DetectCharset can enable character encoding detection for non-utf8 response bodies
	// without explicit charset declaration. This feature uses https://github.com/saintfish/chardet
	DetectCharset bool
	// RedirectHandler allows control on how a redirect will be managed
	RedirectHandler func(req *http.Request, via []*http.Request) error
	// CheckHead performs a HEAD request before every GET to pre-validate the response
	CheckHead bool

	store             storage.Storage
	debugger          debug.Debugger
	robotsMap         map[string]*robotstxt.RobotsData
	htmlCallbacks     []*htmlCallbackContainer
	xmlCallbacks      []*xmlCallbackContainer
	requestCallbacks  []RequestCallback
	responseCallbacks []ResponseCallback
	errorCallbacks    []ErrorCallback
	scrapedCallbacks  []ScrapedCallback
	requestCount      uint32
	responseCount     uint32
	backend           *httpBackend
	wg                *sync.WaitGroup
	lock              *sync.RWMutex
}
I won't go through the individual fields; the comments explain them.
I'll walk through the source code following the example above.
// Create a Collector
c := colly.NewCollector(
	// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
	colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
)
// Register an HTML callback
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	link := e.Attr("href")
	// Print link
	fmt.Printf("Link found: %q -> %s\n", e.Text, link)
	// Visit link found on page
	// Only those links are visited which are in AllowedDomains
	c.Visit(e.Request.AbsoluteURL(link))
})
// Register a Request callback
c.OnRequest(func(r *colly.Request) {
	fmt.Println("Visiting", r.URL.String())
})
// Start crawling
c.Visit("https://hackerspaces.org/")
How are the callbacks used, and what do they do? I'll hold that thought for now. c.Visit("https://hackerspaces.org/") is the entry point, so let's analyze it first.
// Visit starts Collector's collecting job by creating a
// request to the URL specified in parameter.
// Visit also calls the previously provided callbacks
func (c *Collector) Visit(URL string) error {
	if c.CheckHead {
		if check := c.scrape(URL, "HEAD", 1, nil, nil, nil, true); check != nil {
			return check
		}
	}
	return c.scrape(URL, "GET", 1, nil, nil, nil, true)
}
That brings up another method, scrape:
func (c *Collector) scrape(u, method string, depth int, requestData io.Reader, ctx *Context, hdr http.Header, checkRevisit bool) error {
	// Check whether the request is allowed
	if err := c.requestCheck(u, method, depth, checkRevisit); err != nil {
		return err
	}
	// Parse the URL
	parsedURL, err := url.Parse(u)
	if err != nil {
		return err
	}
	if parsedURL.Scheme == "" {
		parsedURL.Scheme = "http"
	}
	if !c.isDomainAllowed(parsedURL.Hostname()) {
		return ErrForbiddenDomain
	}
	// Honor robots.txt
	if method != "HEAD" && !c.IgnoreRobotsTxt {
		if err = c.checkRobots(parsedURL); err != nil {
			return err
		}
	}
	// Default headers
	if hdr == nil {
		hdr = http.Header{"User-Agent": []string{c.UserAgent}}
	}
	rc, ok := requestData.(io.ReadCloser)
	if !ok && requestData != nil {
		rc = ioutil.NopCloser(requestData)
	}
	// The Go HTTP API ignores "Host" in the headers, preferring the client
	// to use the Host field on Request.
	host := parsedURL.Host
	if hostHeader := hdr.Get("Host"); hostHeader != "" {
		host = hostHeader
	}
	// Build the http.Request
	req := &http.Request{
		Method:     method,
		URL:        parsedURL,
		Proto:      "HTTP/1.1",
		ProtoMajor: 1,
		ProtoMinor: 1,
		Header:     hdr,
		Body:       rc,
		Host:       host,
	}
	// Convert the request data (requestData) into an io.ReadCloser body
	setRequestBody(req, requestData)
	u = parsedURL.String()
	c.wg.Add(1)
	// Asynchronous mode
	if c.Async {
		go c.fetch(u, method, depth, requestData, ctx, hdr, req)
		return nil
	}
	return c.fetch(u, method, depth, requestData, ctx, hdr, req)
}
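The URL handling at the top of scrape (parse, default the scheme to http, reject hosts outside the allow-list) can be tried in isolation. The following is a rough standard-library sketch of those three steps only; `normalize` and `errForbiddenDomain` are names made up for this example, and the rule that an empty allow-list permits every host is taken from the AllowedDomains field comment rather than from colly's isDomainAllowed, whose body is not shown here.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// errForbiddenDomain is a stand-in for this sketch, not colly's own error value.
var errForbiddenDomain = errors.New("forbidden domain")

// normalize parses raw, defaults a missing scheme to http, and rejects
// hosts that are not in allowed. An empty allow-list permits every host
// (assumption based on the AllowedDomains field comment).
func normalize(raw string, allowed []string) (*url.URL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	if u.Scheme == "" {
		u.Scheme = "http"
	}
	if len(allowed) == 0 {
		return u, nil
	}
	for _, d := range allowed {
		if d == u.Hostname() {
			return u, nil
		}
	}
	return nil, errForbiddenDomain
}

func main() {
	allowed := []string{"hackerspaces.org", "wiki.hackerspaces.org"}
	for _, raw := range []string{"//hackerspaces.org/", "https://example.com/"} {
		u, err := normalize(raw, allowed)
		if err != nil {
			fmt.Println(raw, "->", err)
			continue
		}
		fmt.Println(raw, "->", u)
	}
}
```

Note that the comparison is an exact match on the host name without the port, so "hackerspaces.org" does not implicitly allow "wiki.hackerspaces.org"; that is why the official example lists both.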
Most of the code above is validation. We are still in the request stage with no response yet, so look at c.fetch next.
fetch is the core of colly:
func (c *Collector) fetch(u, method string, depth int, requestData io.Reader, ctx *Context, hdr http.Header, req *http.Request) error {
	defer c.wg.Done()
	if ctx == nil {
		ctx = NewContext()
	}
	request := &Request{
		URL:       req.URL,
		Headers:   &req.Header,
		Ctx:       ctx,
		Depth:     depth,
		Method:    method,
		Body:      requestData,
		collector: c, // the Collector is stored on the request so callbacks can issue follow-up requests
		ID:        atomic.AddUint32(&c.requestCount, 1),
	}
	// Run the OnRequest callbacks
	c.handleOnRequest(request)
	if request.abort {
		return nil
	}
	if method == "POST" && req.Header.Get("Content-Type") == "" {
		req.Header.Add("Content-Type", "application/x-www-form-urlencoded")
	}
	if req.Header.Get("Accept") == "" {
		req.Header.Set("Accept", "*/*")
	}
	origURL := req.URL
	// The network request happens here; it ends up calling http.Client.Do
	response, err := c.backend.Cache(req, c.MaxBodySize, c.CacheDir)
	if proxyURL, ok := req.Context().Value(ProxyURLKey).(string); ok {
		request.ProxyURL = proxyURL
	}
	// Run the OnError callbacks
	if err := c.handleOnError(response, err, request, ctx); err != nil {
		return err
	}
	if req.URL != origURL {
		request.URL = req.URL
		request.Headers = &req.Header
	}
	atomic.AddUint32(&c.responseCount, 1)
	response.Ctx = ctx
	response.Request = request
	err = response.fixCharset(c.DetectCharset, request.ResponseCharacterEncoding)
	if err != nil {
		return err
	}
	// Run the OnResponse callbacks
	c.handleOnResponse(response)
	// Run the OnHTML callbacks
	err = c.handleOnHTML(response)
	if err != nil {
		c.handleOnError(response, err, request, ctx)
	}
	// Run the OnXML callbacks
	err = c.handleOnXML(response)
	if err != nil {
		c.handleOnError(response, err, request, ctx)
	}
	// Run the OnScraped callbacks
	c.handleOnScraped(response)
	return err
}
As you can see, this is the complete flow. Now let's see what the callback handlers do.
func (c *Collector) handleOnRequest(r *Request) {
	if c.debugger != nil {
		c.debugger.Event(createEvent("request", r.ID, c.ID, map[string]string{
			"url": r.URL.String(),
		}))
	}
	for _, f := range c.requestCallbacks {
		f(r)
	}
}
The core is the single statement for _, f := range c.requestCallbacks { f(r) }. Below I go through each callback in turn.
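The register-then-dispatch pattern is small enough to reproduce with the standard library alone. The sketch below is a simplified stand-in, not colly's code: `miniCollector`, `request`, `requestCallback` and `visit` are hypothetical names. It keeps two behaviors visible in the excerpts above: callbacks run in registration order, and the abort flag is only checked after all of them have run.

```go
package main

import (
	"fmt"
	"sync"
)

// request is a pared-down stand-in for a crawl request.
type request struct {
	URL   string
	abort bool
}

// Abort marks the request so that it is not sent.
func (r *request) Abort() { r.abort = true }

type requestCallback func(*request)

// miniCollector only knows how to store and run request callbacks.
type miniCollector struct {
	mu        sync.RWMutex
	callbacks []requestCallback
}

// OnRequest appends f to the callback list under the write lock.
func (c *miniCollector) OnRequest(f requestCallback) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.callbacks = append(c.callbacks, f)
}

// visit runs every callback in registration order and reports whether
// the request should still be sent (i.e. no callback aborted it).
func (c *miniCollector) visit(url string) bool {
	r := &request{URL: url}
	c.mu.RLock()
	cbs := append([]requestCallback(nil), c.callbacks...)
	c.mu.RUnlock()
	for _, f := range cbs {
		f(r)
	}
	return !r.abort
}

func main() {
	c := &miniCollector{}
	c.OnRequest(func(r *request) { fmt.Println("Visiting", r.URL) })
	c.OnRequest(func(r *request) {
		if r.URL == "http://example.test/private" {
			r.Abort()
		}
	})
	for _, u := range []string{"http://example.test/", "http://example.test/private"} {
		fmt.Println(u, "send:", c.visit(u))
	}
}
```

Copying the slice under a read lock before iterating is a choice made for this sketch so that a callback may itself register further callbacks without deadlocking; it is not a claim about how colly guards its own slices.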
Callback functions
They are presented in lifecycle order.
1. OnRequest
// OnRequest registers a function. Function will be executed on every
// request made by the Collector
// This registers the callback in requestCallbacks
func (c *Collector) OnRequest(f RequestCallback) {
	c.lock.Lock()
	if c.requestCallbacks == nil {
		c.requestCallbacks = make([]RequestCallback, 0, 4)
	}
	c.requestCallbacks = append(c.requestCallbacks, f)
	c.lock.Unlock()
}

// Called first in fetch, before any other handler
func (c *Collector) handleOnRequest(r *Request) {
	if c.debugger != nil {
		c.debugger.Event(createEvent("request", r.ID, c.ID, map[string]string{
			"url": r.URL.String(),
		}))
	}
	for _, f := range c.requestCallbacks {
		f(r)
	}
}
2. OnResponse & handleOnResponse
// OnResponse registers a function. Function will be executed on every response
func (c *Collector) OnResponse(f ResponseCallback) {
	c.lock.Lock()
	if c.responseCallbacks == nil {
		c.responseCallbacks = make([]ResponseCallback, 0, 4)
	}
	c.responseCallbacks = append(c.responseCallbacks, f)
	c.lock.Unlock()
}

func (c *Collector) handleOnResponse(r *Response) {
	if c.debugger != nil {
		c.debugger.Event(createEvent("response", r.Request.ID, c.ID, map[string]string{
			"url":    r.Request.URL.String(),
			"status": http.StatusText(r.StatusCode),
		}))
	}
	for _, f := range c.responseCallbacks {
		f(r)
	}
}
3. OnHTML & handleOnHTML
// OnHTML registers a function. Function will be executed on every HTML
// element matched by the GoQuery Selector parameter.
// GoQuery Selector is a selector used by https://github.com/PuerkitoBio/goquery
func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback) {
	c.lock.Lock()
	if c.htmlCallbacks == nil {
		c.htmlCallbacks = make([]*htmlCallbackContainer, 0, 4)
	}
	c.htmlCallbacks = append(c.htmlCallbacks, &htmlCallbackContainer{
		Selector: goquerySelector,
		Function: f,
	})
	c.lock.Unlock()
}

// The HTML handler has somewhat more logic than the others
func (c *Collector) handleOnHTML(resp *Response) error {
	if len(c.htmlCallbacks) == 0 || !strings.Contains(strings.ToLower(resp.Headers.Get("Content-Type")), "html") {
		return nil
	}
	doc, err := goquery.NewDocumentFromReader(bytes.NewBuffer(resp.Body))
	if err != nil {
		return err
	}
	if href, found := doc.Find("base[href]").Attr("href"); found {
		resp.Request.baseURL, _ = url.Parse(href)
	}
	for _, cc := range c.htmlCallbacks {
		i := 0
		doc.Find(cc.Selector).Each(func(_ int, s *goquery.Selection) {
			for _, n := range s.Nodes {
				e := NewHTMLElementFromSelectionNode(resp, s, n, i)
				i++
				if c.debugger != nil {
					c.debugger.Event(createEvent("html", resp.Request.ID, c.ID, map[string]string{
						"selector": cc.Selector,
						"url":      resp.Request.URL.String(),
					}))
				}
				cc.Function(e)
			}
		})
	}
	return nil
}
4. OnXML & handleOnXML
// OnXML registers a function. Function will be executed on every XML
// element matched by the xpath Query parameter.
// xpath Query is used by https://github.com/antchfx/xmlquery
func (c *Collector) OnXML(xpathQuery string, f XMLCallback) {
	c.lock.Lock()
	if c.xmlCallbacks == nil {
		c.xmlCallbacks = make([]*xmlCallbackContainer, 0, 4)
	}
	c.xmlCallbacks = append(c.xmlCallbacks, &xmlCallbackContainer{
		Query:    xpathQuery,
		Function: f,
	})
	c.lock.Unlock()
}

func (c *Collector) handleOnXML(resp *Response) error {
	if len(c.xmlCallbacks) == 0 {
		return nil
	}
	contentType := strings.ToLower(resp.Headers.Get("Content-Type"))
	isXMLFile := strings.HasSuffix(strings.ToLower(resp.Request.URL.Path), ".xml") || strings.HasSuffix(strings.ToLower(resp.Request.URL.Path), ".xml.gz")
	if !strings.Contains(contentType, "html") && (!strings.Contains(contentType, "xml") && !isXMLFile) {
		return nil
	}
	if strings.Contains(contentType, "html") {
		doc, err := htmlquery.Parse(bytes.NewBuffer(resp.Body))
		if err != nil {
			return err
		}
		if e := htmlquery.FindOne(doc, "//base"); e != nil {
			for _, a := range e.Attr {
				if a.Key == "href" {
					resp.Request.baseURL, _ = url.Parse(a.Val)
					break
				}
			}
		}
		for _, cc := range c.xmlCallbacks {
			for _, n := range htmlquery.Find(doc, cc.Query) {
				e := NewXMLElementFromHTMLNode(resp, n)
				if c.debugger != nil {
					c.debugger.Event(createEvent("xml", resp.Request.ID, c.ID, map[string]string{
						"selector": cc.Query,
						"url":      resp.Request.URL.String(),
					}))
				}
				cc.Function(e)
			}
		}
	} else if strings.Contains(contentType, "xml") || isXMLFile {
		doc, err := xmlquery.Parse(bytes.NewBuffer(resp.Body))
		if err != nil {
			return err
		}
		for _, cc := range c.xmlCallbacks {
			xmlquery.FindEach(doc, cc.Query, func(i int, n *xmlquery.Node) {
				e := NewXMLElementFromXMLNode(resp, n)
				if c.debugger != nil {
					c.debugger.Event(createEvent("xml", resp.Request.ID, c.ID, map[string]string{
						"selector": cc.Query,
						"url":      resp.Request.URL.String(),
					}))
				}
				cc.Function(e)
			})
		}
	}
	return nil
}
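Both handlers decide whether to run at all by looking at the Content-Type header (and, for XML, the URL path). Pulled out of the handlers, that gating looks roughly like the sketch below. `wantsHTML` and `xmlMode` are helper names invented here, the returned strings are only labels for the three branches, and the "no callbacks registered" early return is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// wantsHTML mirrors the Content-Type test at the top of handleOnHTML.
func wantsHTML(contentType string) bool {
	return strings.Contains(strings.ToLower(contentType), "html")
}

// xmlMode mirrors the branching in handleOnXML: HTML bodies go to the
// HTML-aware XPath parser, XML bodies (detected by header or by file
// suffix) go to the XML parser, and anything else is skipped.
func xmlMode(contentType, path string) string {
	ct := strings.ToLower(contentType)
	p := strings.ToLower(path)
	isXMLFile := strings.HasSuffix(p, ".xml") || strings.HasSuffix(p, ".xml.gz")
	switch {
	case strings.Contains(ct, "html"):
		return "htmlquery"
	case strings.Contains(ct, "xml") || isXMLFile:
		return "xmlquery"
	default:
		return "skip"
	}
}

func main() {
	fmt.Println(wantsHTML("text/html; charset=utf-8"))
	fmt.Println(xmlMode("text/html; charset=utf-8", "/"))
	fmt.Println(xmlMode("application/octet-stream", "/sitemap.xml.gz"))
	fmt.Println(xmlMode("application/json", "/api/items"))
}
```

One consequence worth noticing: an HTML page triggers both OnHTML (CSS selectors) and OnXML (XPath) callbacks if both kinds are registered, because each handler applies its own test independently.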
5. OnError & handleOnError
This handler can be invoked several times for a single request. It does real work mostly when err != nil, that is, whenever something goes wrong during the crawl.
// OnError registers a function. Function will be executed if an error
// occurs during the HTTP request.
func (c *Collector) OnError(f ErrorCallback) {
	c.lock.Lock()
	if c.errorCallbacks == nil {
		c.errorCallbacks = make([]ErrorCallback, 0, 4)
	}
	c.errorCallbacks = append(c.errorCallbacks, f)
	c.lock.Unlock()
}

func (c *Collector) handleOnError(response *Response, err error, request *Request, ctx *Context) error {
	if err == nil && (c.ParseHTTPErrorResponse || response.StatusCode < 203) {
		return nil
	}
	if err == nil && response.StatusCode >= 203 {
		err = errors.New(http.StatusText(response.StatusCode))
	}
	if response == nil {
		response = &Response{
			Request: request,
			Ctx:     ctx,
		}
	}
	if c.debugger != nil {
		c.debugger.Event(createEvent("error", request.ID, c.ID, map[string]string{
			"url":    request.URL.String(),
			"status": http.StatusText(response.StatusCode),
		}))
	}
	if response.Request == nil {
		response.Request = request
	}
	if response.Ctx == nil {
		response.Ctx = request.Ctx
	}
	for _, f := range c.errorCallbacks {
		f(response, err)
	}
	return err
}
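The first two if statements of handleOnError are what turn an HTTP status code into a Go error. As a small illustration, the sketch below copies just that decision into a standalone function; `classify` and `errTimeout` are hypothetical names, and the threshold of 203 is taken verbatim from the excerpt above (other colly versions may differ).

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errTimeout is a made-up transport error used to exercise the sketch.
var errTimeout = errors.New("timeout")

// classify mirrors the first two checks of handleOnError as quoted above:
// a transport error is passed through, and a response without a transport
// error becomes an error only if its status is 203 or higher and
// parseHTTPErrorResponse is false.
func classify(err error, status int, parseHTTPErrorResponse bool) error {
	if err == nil && (parseHTTPErrorResponse || status < 203) {
		return nil
	}
	if err == nil && status >= 203 {
		return errors.New(http.StatusText(status))
	}
	return err
}

func main() {
	fmt.Println(classify(nil, 200, false))
	fmt.Println(classify(nil, 404, false))
	fmt.Println(classify(nil, 404, true))
	fmt.Println(classify(errTimeout, 0, false))
}
```

This also explains why fetch can call handleOnError unconditionally after the network request: for a healthy response the function returns nil before any callback runs.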
6. OnScraped & handleOnScraped
The callback for the final step.
// OnScraped registers a function. Function will be executed after
// OnHTML, as a final part of the scraping.
func (c *Collector) OnScraped(f ScrapedCallback) {
	c.lock.Lock()
	if c.scrapedCallbacks == nil {
		c.scrapedCallbacks = make([]ScrapedCallback, 0, 4)
	}
	c.scrapedCallbacks = append(c.scrapedCallbacks, f)
	c.lock.Unlock()
}

func (c *Collector) handleOnScraped(r *Response) {
	if c.debugger != nil {
		c.debugger.Event(createEvent("scraped", r.Request.ID, c.ID, map[string]string{
			"url": r.Request.URL.String(),
		}))
	}
	for _, f := range c.scrapedCallbacks {
		f(r)
	}
}
A few other callback-registration methods are not listed here; have a look at them yourself if you are interested.
With that covered, look back at the example:
// On every a element which has href attribute call callback
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	link := e.Attr("href")
	// Print link
	fmt.Printf("Link found: %q -> %s\n", e.Text, link)
	// Visit link found on page
	// Only those links are visited which are in AllowedDomains
	c.Visit(e.Request.AbsoluteURL(link))
})
// Before making a request print "Visiting ..."
c.OnRequest(func(r *colly.Request) {
	fmt.Println("Visiting", r.URL.String())
})
Document parsing generally goes in the HTML and XML callbacks.
Following links to other pages
There are generally two cases: pages that share the same logic, such as the next page of a listing, and pages that need different logic, such as sub-pages.
- After parsing in the HTML or XML callback, build a new request. First, the same-logic case:
// On every a element which has href attribute call callback
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	// If attribute class is this long string return from callback
	// As this a is irrelevant
	if e.Attr("class") == "Button_1qxkboh-o_O-primary_cv02ee-o_O-md_28awn8-o_O-primaryLink_109aggg" {
		return
	}
	link := e.Attr("href")
	// If link start with browse or includes either signup or login return from callback
	if !strings.HasPrefix(link, "/browse") || strings.Index(link, "=signup") > -1 || strings.Index(link, "=login") > -1 {
		return
	}
	// start scaping the page under the link found
	e.Request.Visit(link)
})
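The above is an HTML callback: it parses the page, extracts a URL, and calls e.Request.Visit(link). That is essentially e.Request.collector.Visit(link), except that it keeps the current request's Context and increases the depth by one.
Let me explain: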
func (c *Collector) fetch(u, method string, depth int, requestData io.Reader, ctx *Context, hdr http.Header, req *http.Request) error {
	defer c.wg.Done()
	if ctx == nil {
		ctx = NewContext()
	}
	request := &Request{
		URL:       req.URL,
		Headers:   &req.Header,
		Ctx:       ctx,
		Depth:     depth,
		Method:    method,
		Body:      requestData,
		collector: c, // explained above
		ID:        atomic.AddUint32(&c.requestCount, 1),
	}
	// ....
}

// Visit continues Collector's collecting job by creating a
// request and preserves the Context of the previous request.
// Visit also calls the previously provided callbacks
func (r *Request) Visit(URL string) error {
	return r.collector.scrape(r.AbsoluteURL(URL), "GET", r.Depth+1, nil, r.Ctx, nil, true)
}
This approach is used often in real-world development.
- Handling sub-pages
colly is centered on the Collector, with the various callbacks doing the processing. Sub-pages need different callbacks, so they need a new Collector:
// Instantiate default collector
c := colly.NewCollector(
	// Visit only domains: coursera.org, www.coursera.org
	colly.AllowedDomains("coursera.org", "www.coursera.org"),
	// Cache responses to prevent multiple download of pages
	// even if the collector is restarted
	colly.CacheDir("./coursera_cache"),
)
// Create another collector to scrape course details
detailCollector := c.Clone()
// Before making a request print "Visiting ..."
c.OnRequest(func(r *colly.Request) {
	log.Println("visiting", r.URL.String())
})
// On every a HTML element which has name attribute call callback
c.OnHTML(`a[name]`, func(e *colly.HTMLElement) {
	// Activate detailCollector if the link contains "coursera.org/learn"
	courseURL := e.Request.AbsoluteURL(e.Attr("href"))
	if strings.Index(courseURL, "coursera.org/learn") != -1 {
		// Sub-page or other page
		detailCollector.Visit(courseURL)
	}
})
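The reason a second collector helps is that a clone shares the configuration but starts with an empty callback list, so list pages and detail pages can each get their own handlers. Here is a toy model of that idea using only the standard library; `collector`, `page`, `OnPage` and `handle` are invented for this sketch and pages are passed in directly instead of being downloaded.

```go
package main

import "fmt"

// page is a stand-in for a downloaded response.
type page struct{ URL string }

// collector holds some shared configuration plus its own callbacks.
type collector struct {
	allowed   []string
	callbacks []func(page)
}

// OnPage registers a callback on this collector only.
func (c *collector) OnPage(f func(page)) {
	c.callbacks = append(c.callbacks, f)
}

// Clone copies the configuration but deliberately not the callbacks.
func (c *collector) Clone() *collector {
	return &collector{allowed: append([]string(nil), c.allowed...)}
}

// handle runs this collector's callbacks on p.
func (c *collector) handle(p page) {
	for _, f := range c.callbacks {
		f(p)
	}
}

func main() {
	list := &collector{allowed: []string{"example.test"}}
	detail := list.Clone()

	detail.OnPage(func(p page) { fmt.Println("detail page:", p.URL) })
	list.OnPage(func(p page) {
		fmt.Println("list page:", p.URL)
		// Hand the sub-page to the collector that knows how to parse it.
		detail.handle(page{URL: p.URL + "learn/go"})
	})

	list.handle(page{URL: "http://example.test/"})
}
```

In the coursera example the same split applies: c only discovers course links, and whatever OnHTML callbacks are registered on detailCollector extract the course details.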
Persistence
The Collector has a field store storage.Storage, but that storage holds the collector's own state (visited requests and cookies); it does not clean or save the data you scrape.
If, say, I need to persist data to a database, it is simple: do it in a callback.
Here is an example:
c.OnHTML("#currencies-all tbody tr", func(e *colly.HTMLElement) {
	mysql.WriteObjectStrings([]string{
		e.ChildText(".currency-name-container"),
		e.ChildText(".col-symbol"),
		e.ChildAttr("a.price", "data-usd"),
		e.ChildAttr("a.volume", "data-usd"),
		e.ChildAttr(".market-cap", "data-usd"),
		e.ChildAttr(".percent-change[data-timespan=\"1h\"]", "data-percentusd"),
		e.ChildAttr(".percent-change[data-timespan=\"24h\"]", "data-percentusd"),
		e.ChildAttr(".percent-change[data-timespan=\"7d\"]", "data-percentusd"),
	})
})
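The mysql.WriteObjectStrings helper above is not shown in this article, so as a self-contained stand-in here is the same "persist inside the callback" idea with the standard library's CSV writer. Everything in it is illustrative: `row`, `export` and `onRow` are hypothetical names, the callback is called directly rather than by a collector, and the single record is placeholder data, not a real quote.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// row is a hypothetical parsed record, one per matched table row.
type row struct{ Name, Symbol, PriceUSD string }

// export writes a header and then lets a callback persist each row,
// the way an OnHTML callback would persist each matched element.
func export(rows []row) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write([]string{"name", "symbol", "price_usd"}); err != nil {
		return "", err
	}
	// onRow plays the role of the scraping callback.
	onRow := func(r row) error {
		return w.Write([]string{r.Name, r.Symbol, r.PriceUSD})
	}
	for _, r := range rows {
		if err := onRow(r); err != nil {
			return "", err
		}
	}
	w.Flush()
	return buf.String(), w.Error()
}

func main() {
	// Placeholder values for illustration only.
	out, err := export([]row{{Name: "Example Coin", Symbol: "EXC", PriceUSD: "1.00"}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

With a real collector the writer would typically be created before the crawl starts and flushed after it finishes, and concurrent callbacks (Async mode) would need a mutex around the write.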
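Summary
That wraps up the walkthrough. I did not cover how to use colly, nor did I write any code of my own; I only wanted to share the characteristics of this architecture and its design pattern. I hope you can borrow it in your own work, since frameworks are generally written with this mindset.
The diagram below sums it up vividly: this is all there is to a crawler framework.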