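Integrating WebMagic with Spring Boot to Build a Crawler (Beginner Guide, with Gitee Source Code)

Integrating WebMagic with Spring Boot

Preface

Why integrate WebMagic?

WebMagic is a simple, flexible Java crawler framework. With WebMagic you can quickly develop an efficient, easy-to-maintain crawler.

Some websites do not support uploading images through external links, and I had already uploaded my image resources elsewhere. So I needed to fetch and consolidate all of those resources and then re-upload them to the sites in question.

For example:

Example of a failed image upload

The items in the red box are the ones that failed to upload.

Frustrating, but the site does not accept external link addresses, so there is no way around it: I have to re-upload everything myself.

So I use WebMagic to crawl the blog posts I keep on other sites and then upload them again.

As shown below:


CSDN blog post

Resources obtained after crawling:


Image resources crawled from article 120436445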

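Document revisions

| version | Feature | Description |
| --- | --- | --- |
| 0.0.1 | Generic crawl endpoints for lists (multiple pages) and detail content (single page), built by integrating WebMagic | "Generic list (multiple pages)" means the detail pages linked from a list page; "single page" means a detail page |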

Technology versions (tech stack)

jdk - version - 1.8

maven - version - 3.8.1

SpringBoot - version - 2.2.2.RELEASE

Swagger - version - 2.7.0

remark - version - 1.0.0

webmagic - version - 0.7.3

Others: see the pom

Project location

Gitee: https://gitee.com/zjydzyjs/spring-boot-use-case-collection/tree/master/webmagic

Prerequisites

WebMagic (you need to study it on your own)

[WebMagic official site](http://webmagic.io/)

[WebMagic detailed documentation](http://webmagic.io/docs/zh/)

A personal aside: the author of WebMagic is really impressive. He says on the official site that he developed the framework in his spare time. That spirit deserves respect, and I would like to thank the author ([Yihua Huang](https://github.com/code4craft)).

Assuming you have already studied WebMagic, you can read on.

A simple WebMagic test

The first example in the documentation:

In IDEA you can search (press Shift twice quickly) for the class GithubRepoPageProcessor to find this example, which the author has already written. Let's test it.

package us.codecraft.webmagic.processor.example;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000).setTimeOut(10000);

    public GithubRepoPageProcessor() {
    }

    public void process(Page page) {
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/[\\w\\-]+/[\\w\\-]+)").all());
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/[\\w\\-])").all());
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            page.setSkip(true);
        }

        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    public Site getSite() {
        return this.site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl(new String[]{"https://github.com/code4craft"}).thread(5).run();
    }
}
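
To see what the `author` regex in `process()` actually extracts, here is a small JDK-only sketch. `AuthorRegexDemo` is a hypothetical helper of mine, not part of WebMagic; it reuses the same pattern string and assumes "find the pattern, return the first capture group" semantics.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of WebMagic: replays the "author" regex with java.util.regex. */
final class AuthorRegexDemo {
    // Same pattern string as in GithubRepoPageProcessor.process().
    private static final Pattern AUTHOR = Pattern.compile("https://github\\.com/(\\w+)/.*");

    /** Returns the first capture group if the pattern is found in the URL, otherwise null. */
    static String author(String url) {
        Matcher matcher = AUTHOR.matcher(url);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

Calling `AuthorRegexDemo.author("https://github.com/code4craft/webmagic")` yields `code4craft`, while the start URL `https://github.com/code4craft` yields `null`, because the pattern requires a `/` after the user name.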

You can test it directly by calling it from the test case below, located in the class {@link com.blacktea.webmagic.demo.WebMagicTest}:

@Test
void testDemo1(){
    Spider.create(new GithubRepoPageProcessor())
            .addUrl(new String[]{"https://github.com/code4craft"})
            .thread(5).run();
}

Running it produces the following error:

10:34:13.307 [main] DEBUG us.codecraft.webmagic.scheduler.QueueScheduler - push to queue https://github.com/code4craft
10:34:13.730 [main] INFO us.codecraft.webmagic.Spider - Spider github.com started!
10:34:13.762 [pool-1-thread-1] DEBUG org.apache.http.client.protocol.RequestAddCookies - CookieSpec selected: standard
10:34:13.766 [pool-1-thread-1] DEBUG org.apache.http.client.protocol.RequestAuthCache - Auth cache not set in the context
10:34:13.766 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection request: [route: {s}->https://github.com:443][total kept alive: 0; route allocated: 0 of 100; total allocated: 0 of 5]
10:34:13.774 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection leased: [id: 0][route: {s}->https://github.com:443][total kept alive: 0; route allocated: 1 of 100; total allocated: 1 of 5]
10:34:13.775 [pool-1-thread-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Opening connection {s}->https://github.com:443
10:34:13.778 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.DefaultHttpClientConnectionOperator - Connecting to github.com/140.82.112.4:443
10:34:13.778 [pool-1-thread-1] DEBUG org.apache.http.conn.ssl.SSLConnectionSocketFactory - Connecting socket to github.com/140.82.112.4:443 with timeout 10000
10:34:14.052 [pool-1-thread-1] DEBUG org.apache.http.conn.ssl.SSLConnectionSocketFactory - Enabled protocols: [TLSv1]
10:34:14.052 [pool-1-thread-1] DEBUG org.apache.http.conn.ssl.SSLConnectionSocketFactory - Enabled cipher suites:[TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDH_RSA_WITH_AES_256_CBC_SHA, TLS_DHE_RSA_WITH_AES_256_CBC_SHA, TLS_DHE_DSS_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDH_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_DSS_WITH_AES_128_CBC_SHA, TLS_EMPTY_RENEGOTIATION_INFO_SCSV]
10:34:14.052 [pool-1-thread-1] DEBUG org.apache.http.conn.ssl.SSLConnectionSocketFactory - Starting handshake
10:34:14.314 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-0: Shutdown connection
10:34:14.314 [pool-1-thread-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Connection discarded
10:34:14.314 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection released: [id: 0][route: {s}->https://github.com:443][total kept alive: 0; route allocated: 0 of 100; total allocated: 0 of 5]
10:34:14.316 [pool-1-thread-1] WARN us.codecraft.webmagic.downloader.HttpClientDownloader - download page https://github.com/code4craft error
javax.net.ssl.SSLHandshakeException: Received fatal alert: protocol_version
    at sun.security.ssl.Alert.createSSLException(Alert.java:131)
    at sun.security.ssl.Alert.createSSLException(Alert.java:117)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:340)
    at sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:293)
    at sun.security.ssl.TransportContext.dispatch(TransportContext.java:186)
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:154)
    at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1279)
    at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1188)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:401)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:373)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at us.codecraft.webmagic.downloader.HttpClientDownloader.download(HttpClientDownloader.java:85)
    at us.codecraft.webmagic.Spider.processRequest(Spider.java:404)
    at us.codecraft.webmagic.Spider.access$000(Spider.java:61)
    at us.codecraft.webmagic.Spider$1.run(Spider.java:320)
    at us.codecraft.webmagic.thread.CountableThreadPool$1.run(CountableThreadPool.java:74)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
10:34:15.324 [main] INFO us.codecraft.webmagic.Spider - Spider github.com closed! 1 pages downloaded.

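The very first example already fails, so this has to be solved, right?

So I looked through the GitHub issues and found this problem.

(protocol_version) bug fix

The author says it will be fixed in 0.7.4. You might say we can simply switch versions, but sorry: the author is probably too busy with work, and 0.7.4 still has not been released. So we have to change it ourselves.

Fixing the protocol_version bug manually

First find the *buildSSLConnectionSocketFactory* method in the **HttpClientGenerator** class that the author mentions: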
private SSLConnectionSocketFactory buildSSLConnectionSocketFactory() {
    try {
        return new SSLConnectionSocketFactory(this.createIgnoreVerifySSL());
    } catch (KeyManagementException var2) {
        this.logger.error("ssl connection fail", var2);
    } catch (NoSuchAlgorithmException var3) {
        this.logger.error("ssl connection fail", var3);
    }

    return SSLConnectionSocketFactory.getSocketFactory();
}

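Changing the body of this method is all that is needed. However, because this is library source code, if you do not want to repackage the library you have to write your own replacement for HttpClientGenerator and reference it yourself.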
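
The log above shows `Enabled protocols: [TLSv1]`, and the server answered with a `protocol_version` alert, so the change comes down to offering newer TLS versions. Below is a minimal JDK-only sketch of choosing that protocol list. `TlsProtocolPicker` and `pick` are hypothetical names of mine, not part of WebMagic or HttpClient, and this is not necessarily how the project's MyHttpClientGenerator does it.

```java
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;

/** Hypothetical helper: keeps only the preferred protocols that this JVM supports. */
final class TlsProtocolPicker {
    static String[] pick(String... preferred) {
        List<String> supported;
        try {
            // Protocol names the default SSLContext supports, e.g. TLSv1.2.
            supported = Arrays.asList(
                    SSLContext.getDefault().getSupportedSSLParameters().getProtocols());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("no default SSLContext available", e);
        }
        List<String> result = new ArrayList<>();
        for (String protocol : preferred) {
            if (supported.contains(protocol)) {
                result.add(protocol); // preserve the caller's order of preference
            }
        }
        return result.toArray(new String[0]);
    }
}
```

In a custom generator, the array returned by something like `TlsProtocolPicker.pick("TLSv1.2", "TLSv1.1", "TLSv1")` could be passed as the `supportedProtocols` argument of HttpClient's `SSLConnectionSocketFactory(SSLContext, String[], String[], HostnameVerifier)` constructor. I am assuming HttpClient 4.4 or later here.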

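How do we find out where HttpClientGenerator is used?
According to the official site, WebMagic is structured around four components: `Downloader`, `PageProcessor`, `Scheduler`, and `Pipeline`.

So how do we know how these four components relate to each other?

We can see it from the way the earlier **GithubRepoPageProcessor** class is run: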
 Spider.create(new GithubRepoPageProcessor())
        .addUrl(new String[]{"https://github.com/code4craft"})
        .thread(5)
        .run();
From the above we can see that a crawl task has to be started through **Spider**, so let's look at that class:
package us.codecraft.webmagic;
......

public class Spider implements Runnable, Task {
    protected Downloader downloader;
    protected List<Pipeline> pipelines = new ArrayList();
    protected PageProcessor pageProcessor;
    protected List<Request> startRequests;
    protected Site site;
    protected String uuid;
    protected Scheduler scheduler = new QueueScheduler();
    protected Logger logger = LoggerFactory.getLogger(this.getClass());
    protected CountableThreadPool threadPool;
    protected ExecutorService executorService;
    protected int threadNum = 1;
    protected AtomicInteger stat = new AtomicInteger(0);
    protected boolean exitWhenComplete = true;
    protected static final int STAT_INIT = 0;
    protected static final int STAT_RUNNING = 1;
    protected static final int STAT_STOPPED = 2;
    protected boolean spawnUrl = true;
    protected boolean destroyWhenExit = true;
    private ReentrantLock newUrlLock = new ReentrantLock();
    private Condition newUrlCondition;
    private List<SpiderListener> spiderListeners;
    private final AtomicLong pageCount;
    private Date startTime;
    private int emptySleepTime;

 ...... (omitted)
     
     
    /** @deprecated */
    public Spider downloader(Downloader downloader) {
        return this.setDownloader(downloader);
    }

    public Spider setDownloader(Downloader downloader) {
        this.checkIfRunning();
        this.downloader = downloader;
        return this;
    }
     
     public void run() {
        this.checkRunningStat();
        this.initComponent();
        this.logger.info("Spider {} started!", this.getUUID());
        ......
        this.logger.info("Spider {} closed! {} pages downloaded.", this.getUUID(), this.pageCount.get());
     }
    
    protected void initComponent() {
        if (this.downloader == null) {
            // Here we can also see that HttpClientDownloader is used by default
            this.downloader = new HttpClientDownloader();
        }

        if (this.pipelines.isEmpty()) {
            this.pipelines.add(new ConsolePipeline());
        }

        this.downloader.setThread(this.threadNum);
        if (this.threadPool == null || this.threadPool.isShutdown()) {
            if (this.executorService != null && !this.executorService.isShutdown()) {
                this.threadPool = new CountableThreadPool(this.threadNum, this.executorService);
            } else {
                this.threadPool = new CountableThreadPool(this.threadNum);
            }
        }

        if (this.startRequests != null) {
            Iterator var1 = this.startRequests.iterator();

            while(var1.hasNext()) {
                Request request = (Request)var1.next();
                this.addRequest(request);
            }

            this.startRequests.clear();
        }

        this.startTime = new Date();
    }
    
}

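From the above we can see that all four components are present. Click into Downloader

and you will find


Implementations of the Downloader interface

three implementation classes (the ones in the red box; the one prefixed with "my" is my own implementation)!

Then click into HttpClientDownloader: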

package us.codecraft.webmagic.downloader;

......
@ThreadSafe
public class HttpClientDownloader extends AbstractDownloader {
    private Logger logger = LoggerFactory.getLogger(this.getClass());
    private final Map<String, CloseableHttpClient> httpClients = new HashMap();
    // This is where HttpClientGenerator is used
    private HttpClientGenerator httpClientGenerator = new HttpClientGenerator();
    private HttpUriRequestConverter httpUriRequestConverter = new HttpUriRequestConverter();
    private ProxyProvider proxyProvider;
    private boolean responseHeader = true;
    ...... (omitted)
}

From the above we know that the Spider class has a field of the Downloader interface type, and that this interface has an implementation class, HttpClientDownloader. When you call Spider.run(), run() calls initComponent(). If you have not set Spider's downloader field, HttpClientDownloader is chosen automatically as the implementation, and HttpClientDownloader in turn uses an HttpClientGenerator. Since we want to change the buildSSLConnectionSocketFactory() method of HttpClientGenerator, we have to re-implement both HttpClientGenerator and the Downloader, which gives us MyHttpClientGenerator and MyHttpClientDownloader.

How do we use the new Downloader implementation?

1. Call setDownloader(new MyHttpClientDownloader()) when you use Spider, and that is all.

Spider.create(pageProcessor)
        .setDownloader(new MyHttpClientDownloader());

2. Or add a class that extends Spider and override this part of the initComponent() method:

    protected void initComponent() {
        if (this.downloader == null) {
            // Change this line to: this.downloader = new MyHttpClientDownloader();
            this.downloader = new HttpClientDownloader();
        }

        if (this.pipelines.isEmpty()) {
            this.pipelines.add(new ConsolePipeline());
        }

        this.downloader.setThread(this.threadNum);
        if (this.threadPool == null || this.threadPool.isShutdown()) {
            if (this.executorService != null && !this.executorService.isShutdown()) {
                this.threadPool = new CountableThreadPool(this.threadNum, this.executorService);
            } else {
                this.threadPool = new CountableThreadPool(this.threadNum);
            }
        }

        if (this.startRequests != null) {
            Iterator var1 = this.startRequests.iterator();

            while(var1.hasNext()) {
                Request request = (Request)var1.next();
                this.addRequest(request);
            }

            this.startRequests.clear();
        }

        this.startTime = new Date();
    }
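
Both options rely on the same null check in initComponent(): the default is only used when no downloader has been set. The sketch below models that with simplified stand-in classes (`MiniDownloader`, `MiniSpider`, `PatchedMiniSpider` are hypothetical and are NOT WebMagic classes), so it runs with the JDK alone.

```java
/** Simplified stand-in for the Downloader interface; illustration only, not WebMagic. */
interface MiniDownloader {
    String name();
}

/** Stand-in for Spider: falls back to a default downloader only when none was set. */
class MiniSpider {
    protected MiniDownloader downloader;

    MiniSpider setDownloader(MiniDownloader downloader) {
        this.downloader = downloader;
        return this;
    }

    // Mirrors the null check at the top of Spider.initComponent().
    protected void initComponent() {
        if (this.downloader == null) {
            this.downloader = () -> "default";
        }
    }

    String run() {
        initComponent();
        return downloader.name();
    }
}

/** Option 2: a subclass that installs its own default before delegating to the parent. */
class PatchedMiniSpider extends MiniSpider {
    @Override
    protected void initComponent() {
        if (this.downloader == null) {
            this.downloader = () -> "patched";
        }
        super.initComponent(); // the parent's null check is now skipped
    }
}
```

Under the same assumption, a real subclass of Spider could assign `new MyHttpClientDownloader()` and then call `super.initComponent()` instead of copying the whole method body, since `downloader` and `initComponent()` are both `protected` in the listing above.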

Run testDemo1() again:

    @Test
    void testDemo1(){
        // Fails with -> javax.net.ssl.SSLHandshakeException: Received fatal alert: protocol_version
//        Spider.create(new GithubRepoPageProcessor())
//                .addUrl(new String[]{"https://github.com/code4craft"})
//                .thread(5).run();

        // MyHttpClientDownloader fixes protocol_version, but the request to GitHub may still time out; that depends on whether GitHub is reachable (it is not from my network)
        Spider.create(new GithubRepoPageProcessor())
                .addUrl(new String[]{"https://github.com/code4craft"})
                .setDownloader(new MyHttpClientDownloader())
                .thread(5).run();
    }

Other official examples

package us.codecraft.webmagic.processor;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;

public interface PageProcessor {
    void process(Page var1);

    Site getSite();
}

You can find the official examples by looking up the implementations of the PageProcessor interface:

Implementations of PageProcessor

Implementation in this article

#### Features

The current version uses WebMagic to implement three [endpoints](https://gitee.com/zjydzyjs/spring-boot-use-case-collection/blob/master/webmagic/src/main/java/com/blacktea/webmagic/demo/controller/ProTestController.java), which provide three general-purpose features.

Request body

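| Name | Required | Type | Description |
| --- | --- | --- | --- |
| fileName | true | String | Name of the downloaded file (must include the file-type extension) |
| itemPageUrl | false | String | Target page URL (independent of helpUrl and targetUrl; only needed when crawling an article detail page) |
| helpUrl | false | String | URL of the page that provides the target addresses, for example a CSDN list page |
| targetUrl | false | String | Pattern used to find the target addresses; currently only regex matching is supported |
| fields | true | List<Field> | Collection of matching rules; its length must be greater than 0 |
| key | true | String | (Field.key) the fields[index].key property. An alias for the rule; when exporting a zip, the key is used as a first-level or second-level directory name (first-level for item, second-level for list) to partition files inside the zip |
| value | true | String | (Field.value) the fields[index].value property. The expression matched against the page |
| type | true | String | (Field.type) the fields[index].type property. Used together with Field.value; currently supports ['XPath', 'Regex', 'Css', 'JsonPath']. It reuses WebMagic's {@link us.codecraft.webmagic.model.annotation.Type}; see the matchVal() method in the source |
| fileType | true | String | (Field.fileType) the fields[index].fileType property. Determines how the matched result data is processed, by type; see the getDataMap() method in the source |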
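
The required/optional rules in the table can be read as a small validation routine. The sketch below is only my interpretation of the table; `CrawlRequestCheck` is a hypothetical class, and the project's real validation may differ.

```java
import java.util.List;

/** Hypothetical validator derived from the request table; not taken from the project source. */
final class CrawlRequestCheck {
    /** Returns null when the request looks valid, otherwise a short error message. */
    static String validate(String fileName, String itemPageUrl,
                           String helpUrl, String targetUrl, List<?> fields) {
        if (fileName == null || !fileName.contains(".")) {
            return "fileName must include a file extension";
        }
        if (fields == null || fields.isEmpty()) {
            return "fields must contain at least one rule";
        }
        boolean itemMode = itemPageUrl != null && !itemPageUrl.isEmpty();
        boolean listMode = helpUrl != null && !helpUrl.isEmpty()
                && targetUrl != null && !targetUrl.isEmpty();
        if (!itemMode && !listMode) {
            return "provide itemPageUrl, or both helpUrl and targetUrl";
        }
        return null;
    }
}
```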
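Zip directory layout
To keep the resources from different matches apart, and to prevent files with the same name from overwriting each other, I partition the archive into directories:

1. When you use `itemPageUrl`, the file structure is

    xxx.zip/key/fileType;

2. When you use `helpUrl` and `targetUrl`, the file structure is

    <page index>/key/fileType;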
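
A minimal sketch of that layout using only `java.util.zip`. `ZipLayoutDemo` and its methods are hypothetical; I am also assuming that `fileType` is a directory level that contains the individual files, which the layout above does not spell out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical illustration of the directory partitioning; not the project's code. */
final class ZipLayoutDemo {
    /** pageIndex == null means a single detail page (itemPageUrl mode). */
    static String entryPath(Integer pageIndex, String key, String fileType, String fileName) {
        String prefix = pageIndex == null ? "" : pageIndex + "/";
        return prefix + key + "/" + fileType + "/" + fileName;
    }

    /** Writes one entry per path into an in-memory zip. */
    static byte[] zip(List<String> paths) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (String path : paths) {
                out.putNextEntry(new ZipEntry(path));
                out.write(path.getBytes(StandardCharsets.UTF_8)); // placeholder content
                out.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    /** Reads the entry names back, in archive order. */
    static List<String> entryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry entry = in.getNextEntry(); entry != null; entry = in.getNextEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }
}
```

Because every entry path starts with the rule's key (and, in list mode, the page index), two rules that both produce a file called `0.png` end up in different directories instead of overwriting each other.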
Debugging match results

WebMagicUtil.getSelectableValue()

private static Object getSelectableValue(Field field, Page page, boolean all){
    Selectable selectable = WebMagicUtil.matchVal(field, page);
    // Check for null before logging; calling selectable.all() on null would throw a NullPointerException
    if (Validator.isNotNull(selectable)){
        log.debug("current field.key={}, matched results: {}", field.getKey(), JSON.toJSONString(selectable.all()));
        // get() or all(): depends on whether a single file (md) or an archive (zip) is exported
        if (all){
           return selectable.all();
        }else {
           return selectable.get();
        }
    }
    return null;
}

This method logs the matching results; by default it logs all of them.

1. Exporting a generic detail page as md
1.1 Downloading the md file with Swagger

Use swagger-ui to download the md file; address -> http://localhost:8000/pro/item/export/md

POST, application/json

Swagger md download request parameters
1.2 Actual request JSON (CSDN example)
{
    "fileName":"md-download-test.md",
    "itemPageUrl":"https://blog.csdn.net/weixin_43917143/article/details/120436445",
    "fields":[
        {"key":"md","value":"//*[@id=\"article_content\"]/[@id=\"content_views\"]","type":"XPath","fileType":"MD"},
        {"key":"md1","value":"//*[@id=\"article_content\"]/[@id=\"content_views\"]","type":"XPath","fileType":"MD"}
    ]
}

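When there are several keys, the content matched by all of them is written into a single md file. Whenever more than one piece of content is matched, an extra "# md分割-下一篇\n\n" (roughly "md separator, next article") is inserted before the next piece is appended.

1.3 Result
md download in Swagger
Final md file

Because the crawl here produced 2 matches, one "# md分割-下一篇\n\n" separator is inserted.

Note: 2 pieces are exported not because fields.size = 2, but because the rules you configured matched 2 pieces of content.

A rule you configured may match nothing,

or

a single rule may produce several concatenated pieces.

Several pieces (whether there are several is decided by the all parameter of WebMagicUtil.csdnProcess()) are also possible!

all can be

false: at most one match is taken for the current field rule, i.e. only match result [0] is appended;

true: all matches are taken, i.e. every match result is appended in a loop.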
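
The concatenation rule described above can be sketched as follows. `MarkdownJoinDemo` is a hypothetical class that reflects my reading of the behavior (a separator between consecutive pieces, `all` deciding how many matches each rule contributes); the exact whitespace around the separator in the real code may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the md concatenation rule; not the project's implementation. */
final class MarkdownJoinDemo {
    // Separator text quoted in the article.
    static final String SEPARATOR = "# md分割-下一篇\n\n";

    /**
     * @param matchesPerField one list of matched fragments per configured field rule
     * @param all             false: take at most the first match of each rule; true: take every match
     */
    static String join(List<List<String>> matchesPerField, boolean all) {
        List<String> pieces = new ArrayList<>();
        for (List<String> matches : matchesPerField) {
            if (matches.isEmpty()) {
                continue; // a rule may match nothing
            }
            if (all) {
                pieces.addAll(matches);
            } else {
                pieces.add(matches.get(0));
            }
        }
        // The separator only appears when there is more than one piece.
        return String.join(SEPARATOR, pieces);
    }
}
```

With two rules that each match once, the result contains exactly one separator; with a single rule that matches three times, `all = false` gives one piece and `all = true` gives three pieces joined by two separators.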
2. Exporting a generic detail page as zip
###### 2.1 Downloading the zip with Swagger

Use swagger-ui to download the zip file; [address](http://localhost:8000/swagger-ui.html#!/ProTestController/exportCSDNMdUsingPOST) -> http://localhost:8000/pro/item/export/zip

POST, application/json
2.2 JSON

CSDN example

{
    "fileName":"img-download-test.zip",
    "itemPageUrl":"https://blog.csdn.net/weixin_43917143/article/details/120436445",
    "fields":[
          {"key":"images","value":"//*[@id=\"article_content\"]//*//img","type":"XPath","fileType":"PNG"},
          {"key":"icons","value":"//div[@class=\"mouse-box\"]//img","type":"XPath","fileType":"PNG"}
    ]
}

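I recommend testing this with Postman, because Postman can save the returned file stream as a zip file.

2.3 Result
Detail-page zip download result
3. Crawling detail pages from a generic list page and exporting them as zip
3.1 Download address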
http://localhost:8000/pro/list/export/zip

POST, application/json
3.2 JSON
CSDN example:
{
    "fileName":"xxx.zip",
    "helpUrl":"https://blog.csdn.net/weixin_43917143",
    "targetUrl":"(https://blog\\.csdn\\.net/weixin_43917143/article/details/\\w+)",
    "fields":[
        {"key":"md","value":"//*[@id=\"article_content\"]/[@id=\"content_views\"]","type":"XPath","fileType":"MD"},
        {"key":"images","value":"//*[@id=\"article_content\"]//*//img","type":"XPath","fileType":"PNG"},
        {"key":"icons","value":"//div[@class=\"mouse-box\"]//img","type":"XPath","fileType":"PNG"}
    ]
}
Jianshu example:
{
    "fileName":"jianshu-xxx.zip",
    "helpUrl":"http://www.reibang.com/u/7dcee71ee038",
    "targetUrl":"(https://www\\.jianshu\\.com/p/\\w+)",
    "fields":[
        {"key":"md","value":"//div[@class=\"_gp-ck\"]/section[@class=\"ouvjEz\"][1]","type":"XPath","fileType":"MD"}
    ]
}
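
To check a `targetUrl` pattern before sending a request, it helps to run it against a few candidate links. The sketch below does that with plain `java.util.regex`. `TargetUrlFilterDemo` is a hypothetical helper; the de-duplication is my own simplification and not necessarily what the project or WebMagic does.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for trying out a targetUrl regex; not part of the project. */
final class TargetUrlFilterDemo {
    /** Returns the matched part of every link the pattern is found in, without duplicates. */
    static List<String> filter(String targetUrlRegex, List<String> links) {
        Pattern pattern = Pattern.compile(targetUrlRegex);
        Set<String> matched = new LinkedHashSet<>(); // keeps first-seen order
        for (String link : links) {
            Matcher matcher = pattern.matcher(link);
            if (matcher.find()) {
                // Use the first capture group when the pattern has one, else the whole match.
                matched.add(matcher.groupCount() >= 1 ? matcher.group(1) : matcher.group());
            }
        }
        return new ArrayList<>(matched);
    }
}
```

Note that the doubled backslashes in the JSON (`\\.`, `\\w`) become single backslashes after JSON parsing, which is the same escaping a Java string literal needs. With the CSDN pattern above, the list page URL is dropped, and an article link carrying a query string is reduced to the plain article URL because `\w+` stops at `?`.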
3.3 Result

CSDN result:

List zip download result