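Integrating WebMagic with Spring Boot
Preface
Why integrate WebMagic?
WebMagic is a simple, flexible Java crawler framework. With it, you can quickly build a crawler that is efficient and easy to maintain.
Some sites do not accept images hosted at external links, and my image resources were already uploaded elsewhere. So I need to fetch all of those resources, bundle them, and upload them again to the sites in question.
For example:
The items in the red box are the uploads that failed.
There is nothing to be done about it: the site does not support external link addresses, so I have to re-upload everything myself.
That is why I use WebMagic to crawl the blog posts I keep on other sites and then upload them again.
As shown below:
The resources obtained after crawling:
Document revisions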
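Version | Feature | Description |
---|---|---|
0.0.1 | Generic crawl endpoints built on WebMagic, covering list pages (multiple pages) and detail content (single page) | "List (multiple pages)" means the detail pages linked from a list page; "single page" means one detail page |
Project technology versions (technology selection)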
jdk - version - 1.8
maven - version - 3.8.1
SpringBoot - version - 2.2.2.RELEASE
Swagger - version - 2.7.0
remark - version - 1.0.0
webmagic - version - 0.7.3
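Others: see the pom
Project address
Gitee: https://gitee.com/zjydzyjs/spring-boot-use-case-collection/tree/master/webmagic
Prerequisites
WebMagic (you need to study it on your own first)
[WebMagic official site](http://webmagic.io/)
[WebMagic detailed documentation](http://webmagic.io/docs/zh/)
A personal aside: the author of WebMagic is truly impressive. As he says on the official site, he develops the framework in his spare time. That dedication deserves respect, and I would like to thank the author ([Yihua Huang](https://github.com/code4craft)) here.
Assuming you have worked through that material, you can continue.
A simple WebMagic test
The first example in the documentation:
In IDEA, open search (press Shift twice quickly) and type GithubRepoPageProcessor. You will find this example, already written by the author, and we can test it.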
package us.codecraft.webmagic.processor.example;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
public class GithubRepoPageProcessor implements PageProcessor {
private Site site = Site.me().setRetryTimes(3).setSleepTime(1000).setTimeOut(10000);
public GithubRepoPageProcessor() {
}
public void process(Page page) {
page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/[\\w\\-]+/[\\w\\-]+)").all());
page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/[\\w\\-])").all());
page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
if (page.getResultItems().get("name") == null) {
page.setSkip(true);
}
page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
}
public Site getSite() {
return this.site;
}
public static void main(String[] args) {
Spider.create(new GithubRepoPageProcessor()).addUrl(new String[]{"https://github.com/code4craft"}).thread(5).run();
}
}
You can test it directly by calling it from the following test case, located in the class {@link com.blacktea.webmagic.demo.WebMagicTest}:
@Test
void testDemo1(){
Spider.create(new GithubRepoPageProcessor())
.addUrl(new String[]{"https://github.com/code4craft"})
.thread(5).run();
}
Running it produces the following error:
10:34:13.307 [main] DEBUG us.codecraft.webmagic.scheduler.QueueScheduler - push to queue https://github.com/code4craft
10:34:13.730 [main] INFO us.codecraft.webmagic.Spider - Spider github.com started!
10:34:13.762 [pool-1-thread-1] DEBUG org.apache.http.client.protocol.RequestAddCookies - CookieSpec selected: standard
10:34:13.766 [pool-1-thread-1] DEBUG org.apache.http.client.protocol.RequestAuthCache - Auth cache not set in the context
10:34:13.766 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection request: [route: {s}->https://github.com:443][total kept alive: 0; route allocated: 0 of 100; total allocated: 0 of 5]
10:34:13.774 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection leased: [id: 0][route: {s}->https://github.com:443][total kept alive: 0; route allocated: 1 of 100; total allocated: 1 of 5]
10:34:13.775 [pool-1-thread-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Opening connection {s}->https://github.com:443
10:34:13.778 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.DefaultHttpClientConnectionOperator - Connecting to github.com/140.82.112.4:443
10:34:13.778 [pool-1-thread-1] DEBUG org.apache.http.conn.ssl.SSLConnectionSocketFactory - Connecting socket to github.com/140.82.112.4:443 with timeout 10000
10:34:14.052 [pool-1-thread-1] DEBUG org.apache.http.conn.ssl.SSLConnectionSocketFactory - Enabled protocols: [TLSv1]
10:34:14.052 [pool-1-thread-1] DEBUG org.apache.http.conn.ssl.SSLConnectionSocketFactory - Enabled cipher suites:[TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDH_RSA_WITH_AES_256_CBC_SHA, TLS_DHE_RSA_WITH_AES_256_CBC_SHA, TLS_DHE_DSS_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDH_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_DSS_WITH_AES_128_CBC_SHA, TLS_EMPTY_RENEGOTIATION_INFO_SCSV]
10:34:14.052 [pool-1-thread-1] DEBUG org.apache.http.conn.ssl.SSLConnectionSocketFactory - Starting handshake
10:34:14.314 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-0: Shutdown connection
10:34:14.314 [pool-1-thread-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Connection discarded
10:34:14.314 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection released: [id: 0][route: {s}->https://github.com:443][total kept alive: 0; route allocated: 0 of 100; total allocated: 0 of 5]
10:34:14.316 [pool-1-thread-1] WARN us.codecraft.webmagic.downloader.HttpClientDownloader - download page https://github.com/code4craft error
javax.net.ssl.SSLHandshakeException: Received fatal alert: protocol_version
at sun.security.ssl.Alert.createSSLException(Alert.java:131)
at sun.security.ssl.Alert.createSSLException(Alert.java:117)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:340)
at sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:293)
at sun.security.ssl.TransportContext.dispatch(TransportContext.java:186)
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:154)
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1279)
at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1188)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:401)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:373)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at us.codecraft.webmagic.downloader.HttpClientDownloader.download(HttpClientDownloader.java:85)
at us.codecraft.webmagic.Spider.processRequest(Spider.java:404)
at us.codecraft.webmagic.Spider.access$000(Spider.java:61)
at us.codecraft.webmagic.Spider$1.run(Spider.java:320)
at us.codecraft.webmagic.thread.CountableThreadPool$1.run(CountableThreadPool.java:74)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
10:34:15.324 [main] INFO us.codecraft.webmagic.Spider - Spider github.com closed! 1 pages downloaded.
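The very first example fails, so this has to be fixed. The log shows the cause: the client enables only TLSv1 (`Enabled protocols: [TLSv1]`), and the server rejects the handshake with `protocol_version`.
So I searched the GitHub issues and found this problem reported there.
The author says it will be fixed in 0.7.4. You might say we should simply switch versions. Unfortunately, probably because the author is busy with work, 0.7.4 has been slow to arrive, so we have no choice but to patch it ourselves.
Fixing the protocol_version bug by hand
First locate the method *buildSSLConnectionSocketFactory* in the **HttpClientGenerator** class that the author mentions: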
private SSLConnectionSocketFactory buildSSLConnectionSocketFactory() {
try {
return new SSLConnectionSocketFactory(this.createIgnoreVerifySSL());
} catch (KeyManagementException var2) {
this.logger.error("ssl connection fail", var2);
} catch (NoSuchAlgorithmException var3) {
this.logger.error("ssl connection fail", var3);
}
return SSLConnectionSocketFactory.getSocketFactory();
}
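Changing the body of this method is enough. However, because it is library source code, if you do not want to repackage the library you need to write your own replacement for HttpClientGenerator and reference that instead.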
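As a rough illustration of what the change has to achieve, the sketch below uses only JDK classes (no WebMagic or HttpClient) to enable newer TLS versions on an `SSLEngine`. The class name `TlsProtocolSketch` and the helper `pickProtocols` are made up for this example. In a replacement generator, the same protocol array would presumably be handed to the `SSLConnectionSocketFactory` constructor that accepts a list of supported protocols; that wiring is an assumption and is not shown here.

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TlsProtocolSketch {

    // Keeps only the wanted protocols that the runtime reports as supported,
    // preserving the order of the wanted list.
    static String[] pickProtocols(String[] supported, String... wanted) {
        List<String> available = Arrays.asList(supported);
        List<String> picked = new ArrayList<>();
        for (String protocol : wanted) {
            if (available.contains(protocol)) {
                picked.add(protocol);
            }
        }
        return picked.toArray(new String[0]);
    }

    public static void main(String[] args) throws Exception {
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, null, null); // default key managers, trust managers and randomness
        SSLEngine engine = context.createSSLEngine();

        // Ask for TLSv1.2 first instead of offering TLSv1 only, as the failing log did.
        String[] protocols = pickProtocols(engine.getSupportedProtocols(), "TLSv1.2", "TLSv1.1", "TLSv1");
        engine.setEnabledProtocols(protocols);
        System.out.println("Enabled protocols: " + Arrays.toString(engine.getEnabledProtocols()));
    }
}
```

Filtering against `getSupportedProtocols()` avoids an `IllegalArgumentException` on a runtime that does not know one of the requested names.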
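How do we know where HttpClientGenerator is used?
According to the official site, WebMagic is structured as four major components: `Downloader`, `PageProcessor`, `Scheduler`, and `Pipeline`.
So how do we find out how these four components are wired together?
We can see it from the run method of the **GithubRepoPageProcessor** class shown earlier: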
Spider.create(new GithubRepoPageProcessor())
.addUrl(new String[]{"https://github.com/code4craft"})
.thread(5)
.run();
From this we know that a crawl task is started through **Spider**, so let's take a look at that class:
package us.codecraft.webmagic;
......
public class Spider implements Runnable, Task {
protected Downloader downloader;
protected List<Pipeline> pipelines = new ArrayList();
protected PageProcessor pageProcessor;
protected List<Request> startRequests;
protected Site site;
protected String uuid;
protected Scheduler scheduler = new QueueScheduler();
protected Logger logger = LoggerFactory.getLogger(this.getClass());
protected CountableThreadPool threadPool;
protected ExecutorService executorService;
protected int threadNum = 1;
protected AtomicInteger stat = new AtomicInteger(0);
protected boolean exitWhenComplete = true;
protected static final int STAT_INIT = 0;
protected static final int STAT_RUNNING = 1;
protected static final int STAT_STOPPED = 2;
protected boolean spawnUrl = true;
protected boolean destroyWhenExit = true;
private ReentrantLock newUrlLock = new ReentrantLock();
private Condition newUrlCondition;
private List<SpiderListener> spiderListeners;
private final AtomicLong pageCount;
private Date startTime;
private int emptySleepTime;
...... (omitted)
/** @deprecated */
public Spider downloader(Downloader downloader) {
return this.setDownloader(downloader);
}
public Spider setDownloader(Downloader downloader) {
this.checkIfRunning();
this.downloader = downloader;
return this;
}
public void run() {
this.checkRunningStat();
this.initComponent();
this.logger.info("Spider {} started!", this.getUUID());
......
this.logger.info("Spider {} closed! {} pages downloaded.", this.getUUID(), this.pageCount.get());
}
protected void initComponent() {
if (this.downloader == null) {
// Here we can also see that HttpClientDownloader is used by default
this.downloader = new HttpClientDownloader();
}
if (this.pipelines.isEmpty()) {
this.pipelines.add(new ConsolePipeline());
}
this.downloader.setThread(this.threadNum);
if (this.threadPool == null || this.threadPool.isShutdown()) {
if (this.executorService != null && !this.executorService.isShutdown()) {
this.threadPool = new CountableThreadPool(this.threadNum, this.executorService);
} else {
this.threadPool = new CountableThreadPool(this.threadNum);
}
}
if (this.startRequests != null) {
Iterator var1 = this.startRequests.iterator();
while(var1.hasNext()) {
Request request = (Request)var1.next();
this.addRequest(request);
}
this.startRequests.clear();
}
this.startTime = new Date();
}
}
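From the above we can see that all four components are present. Click into the Downloader interface
and you will find
three implementation classes (inside the red box; the one prefixed with "my" is my own implementation).
Then click into the HttpClientDownloader class: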
package us.codecraft.webmagic.downloader;
......
@ThreadSafe
public class HttpClientDownloader extends AbstractDownloader {
private Logger logger = LoggerFactory.getLogger(this.getClass());
private final Map<String, CloseableHttpClient> httpClients = new HashMap();
// Here we find that HttpClientGenerator is used
private HttpClientGenerator httpClientGenerator = new HttpClientGenerator();
private HttpUriRequestConverter httpUriRequestConverter = new HttpUriRequestConverter();
private ProxyProvider proxyProvider;
private boolean responseHeader = true;
...... (omitted)
}
From the above we know that the Spider class holds a field of the Downloader interface type, and that this interface has the implementation class HttpClientDownloader. When you call Spider.run(), run() calls initComponent(); if you have not set the Spider's downloader field, HttpClientDownloader is chosen automatically as the implementation. HttpClientDownloader in turn holds an HttpClientGenerator as a field. Since we want to change the buildSSLConnectionSocketFactory() method of HttpClientGenerator, we have to reimplement both HttpClientGenerator and the Downloader, which gives us MyHttpClientGenerator and MyHttpClientDownloader.
How do we use the new Downloader implementation?
1. Call setDownloader(new MyHttpClientDownloader()) when building the Spider, and that is all.
Spider.create(pageProcessor)
.setDownloader(new MyHttpClientDownloader());
2. Or add a class that extends Spider and override this part of the initComponent() method:
protected void initComponent() {
if (this.downloader == null) {
// Change this line to: this.downloader = new MyHttpClientDownloader();
this.downloader = new HttpClientDownloader();
}
if (this.pipelines.isEmpty()) {
this.pipelines.add(new ConsolePipeline());
}
this.downloader.setThread(this.threadNum);
if (this.threadPool == null || this.threadPool.isShutdown()) {
if (this.executorService != null && !this.executorService.isShutdown()) {
this.threadPool = new CountableThreadPool(this.threadNum, this.executorService);
} else {
this.threadPool = new CountableThreadPool(this.threadNum);
}
}
if (this.startRequests != null) {
Iterator var1 = this.startRequests.iterator();
while(var1.hasNext()) {
Request request = (Request)var1.next();
this.addRequest(request);
}
this.startRequests.clear();
}
this.startTime = new Date();
}
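The two options can be compared in a small stand-alone model. The sketch below does not use WebMagic: `Downloader`, `MiniSpider`, `MySpider`, `DefaultDownloader` and `MyDownloader` are simplified, hypothetical stand-ins that only reproduce the "fall back to a default when the field is null" logic from the listing above.

```java
class DownloaderWiringSketch {

    // Stand-in for the Downloader interface (hypothetical, reduced to one method).
    interface Downloader {
        String name();
    }

    static class DefaultDownloader implements Downloader {
        public String name() { return "default"; }
    }

    static class MyDownloader implements Downloader {
        public String name() { return "my"; }
    }

    // Stand-in for Spider: falls back to a default downloader when none was set.
    static class MiniSpider {
        protected Downloader downloader;

        MiniSpider setDownloader(Downloader downloader) {
            this.downloader = downloader;
            return this;
        }

        protected void initComponent() {
            if (this.downloader == null) {
                this.downloader = new DefaultDownloader();
            }
        }

        String run() {
            initComponent();
            return downloader.name();
        }
    }

    // Option 2: the subclass fills in its own default, then delegates to the parent.
    static class MySpider extends MiniSpider {
        @Override
        protected void initComponent() {
            if (this.downloader == null) {
                this.downloader = new MyDownloader();
            }
            super.initComponent();
        }
    }

    public static void main(String[] args) {
        System.out.println(new MiniSpider().run());                                   // prints "default"
        System.out.println(new MiniSpider().setDownloader(new MyDownloader()).run()); // prints "my" (option 1)
        System.out.println(new MySpider().run());                                     // prints "my" (option 2)
    }
}
```

Setting the field first and then calling `super.initComponent()` avoids copying the rest of the method (thread pool and start requests). Carrying this over to WebMagic relies on `downloader` and `initComponent()` being `protected`, as they appear in the Spider listing above.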
Run testDemo1() again:
@Test
void testDemo1(){
// Fails with -> javax.net.ssl.SSLHandshakeException: Received fatal alert: protocol_version
// Spider.create(new GithubRepoPageProcessor())
// .addUrl(new String[]{"https://github.com/code4craft"})
// .thread(5).run();
// MyHttpClientDownloader fixes protocol_version, but the request to GitHub may still time out; that depends on whether GitHub is reachable (I cannot reach it)
Spider.create(new GithubRepoPageProcessor())
.addUrl(new String[]{"https://github.com/code4craft"})
.setDownloader(new MyHttpClientDownloader())
.thread(5).run();
}
Other examples from the official site
package us.codecraft.webmagic.processor;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
public interface PageProcessor {
void process(Page var1);
Site getSite();
}
You can find the official examples by looking up the implementations of the PageProcessor interface.
This article's implementation
#### Features
The current version uses WebMagic to implement three [endpoints](https://gitee.com/zjydzyjs/spring-boot-use-case-collection/blob/master/webmagic/src/main/java/com/blacktea/webmagic/demo/controller/ProTestController.java), which provide three generic features.
Request payload
Name | Required | Type | Description |
---|---|---|---|
fileName | true | String | Name of the downloaded file (must include the file-type extension) |
itemPageUrl | false | String | URL of the target page (independent of helpUrl and targetUrl; itemPageUrl is needed only when crawling a single article detail page) |
helpUrl | false | String | URL of a page that links to the target addresses, for example a CSDN list page |
targetUrl | false | String | Pattern used to find the target addresses; currently only regex matching is supported |
fields | true | List<Field> | Collection of matching rules; its length must be greater than 0 |
key | true | String (Field.key) | The fields[index].key property, an alias for the rule. When exporting a zip, the key automatically becomes a level-1 or level-2 directory name (level 1 for item, level 2 for list) that partitions the files inside the zip |
value | true | String (Field.value) | The fields[index].value property: the expression that is matched against the page |
type | true | String (Field.type) | The fields[index].type property, used together with Field.value. Currently supports ['XPath', 'Regex', 'Css', 'JsonPath']; it reuses WebMagic's {@link us.codecraft.webmagic.model.annotation.Type}, see the matchVal() method in the source |
fileType | true | String (Field.fileType) | The fields[index].fileType property: how the matched result data is processed according to its type, see the getDataMap() method in the source |
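To make the role of `targetUrl` concrete, here is a minimal sketch of regex-based link selection using only `java.util.regex`. It is not the project's code: the class `TargetUrlSketch`, the method `selectTargets` and the sample links are invented for illustration, and the real selection is done by WebMagic on the links found in the `helpUrl` page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TargetUrlSketch {

    // For every link the pattern matches, keeps capture group 1,
    // or the whole match when the pattern has no capture group.
    static List<String> selectTargets(List<String> links, String targetUrl) {
        Pattern pattern = Pattern.compile(targetUrl);
        List<String> targets = new ArrayList<>();
        for (String link : links) {
            Matcher matcher = pattern.matcher(link);
            if (matcher.find()) {
                targets.add(matcher.groupCount() >= 1 ? matcher.group(1) : matcher.group());
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // Hypothetical links as they might appear on a list page.
        List<String> links = Arrays.asList(
                "https://blog.csdn.net/weixin_43917143/article/details/120436445",
                "https://blog.csdn.net/weixin_43917143/category_1.html",
                "https://blog.csdn.net/someone_else/article/details/1");
        String targetUrl = "(https://blog\\.csdn\\.net/weixin_43917143/article/details/\\w+)";
        // prints [https://blog.csdn.net/weixin_43917143/article/details/120436445]
        System.out.println(selectTargets(links, targetUrl));
    }
}
```

Note that the doubled backslashes are needed both in a Java string literal and in the JSON payload, so the pattern can be copied between the two unchanged.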
Zip partitioning
To keep the matched resources apart and to avoid files being overwritten because of duplicate names, I partition the output into directories:
1. When you use `itemPageUrl`, the file structure is
xxx.zip/key/fileType;
2. When you use `helpUrl and targetUrl`, the file structure is
page index/key/fileType;
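The layout can be sketched with `java.util.zip`. This is one possible reading of the structure described above, not the project's implementation: the class `ZipLayoutSketch`, its methods, and file names such as `0.png` are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipLayoutSketch {

    // pageIndex == null models the itemPageUrl case (no page-level directory);
    // otherwise the page index becomes the top-level directory (list case).
    static String entryPath(Integer pageIndex, String key, String fileType, String fileName) {
        String prefix = (pageIndex == null) ? "" : pageIndex + "/";
        return prefix + key + "/" + fileType + "/" + fileName;
    }

    // Writes one small placeholder entry per path into an in-memory zip.
    static byte[] zip(List<String> paths) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (String path : paths) {
                out.putNextEntry(new ZipEntry(path));
                out.write(("content of " + path).getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    // Reads the entry names back, in archive order.
    static List<String> listEntries(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry entry = in.getNextEntry(); entry != null; entry = in.getNextEntry()) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // itemPageUrl case: key is the level-1 directory.
        System.out.println(listEntries(zip(Arrays.asList(
                entryPath(null, "images", "PNG", "0.png"),
                entryPath(null, "icons", "PNG", "0.png")))));
        // helpUrl + targetUrl case: the page index comes first, key is level 2.
        System.out.println(listEntries(zip(Arrays.asList(
                entryPath(1, "md", "MD", "0.md"),
                entryPath(2, "md", "MD", "0.md")))));
    }
}
```

Because the key (and, for lists, the page index) is part of every entry path, two rules that both produce a file called `0.png` no longer collide.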
Debugging match results
WebMagicUtil.getSelectableValue()
private static Object getSelectableValue(Field field,Page page, boolean all){
Selectable selectable = WebMagicUtil.matchVal(field, page);
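if (Validator.isNotNull(selectable)){
// Log only after the null check, so a rule that produces no Selectable cannot cause a NullPointerException here
log.debug("Current field.key={}, match result: {}",field.getKey(),JSON.toJSONString(selectable.all()));
// selectable.get() or all(): depends on whether a single file (md) or an archive (zip) is exported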
if (all){
return selectable.all();
}else {
return selectable.get();
}
}
return null;
}
This method prints the results that match; by default it prints all of them.
1. Generic detail page export to md
1.1 Downloading the md file with swagger
Download the md file through swagger-ui at -> http://localhost:8000/pro/item/export/md
POST, application/json
1.2 Actual request JSON (CSDN example)
{
"fileName":"md-download-test.md",
"itemPageUrl":"https://blog.csdn.net/weixin_43917143/article/details/120436445",
"fields":[
{"key":"md","value":"//*[@id=\"article_content\"]/[@id=\"content_views\"]","type":"XPath","fileType":"MD"},
{"key":"md1","value":"//*[@id=\"article_content\"]/[@id=\"content_views\"]","type":"XPath","fileType":"MD"}
]
}
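With multiple keys, the content matched by all of them is written into a single md file. Whenever more than one piece of content matches, an extra "# md分割-下一篇\n\n" (roughly "md separator, next article") is inserted before the next piece is appended.
1.3 Result
Because my crawl produced 2 matches here, one "# md分割-下一篇\n\n" separator is added.
Note: two pieces are exported here not because fields.size = 2, but because the rules you configured matched 2 pieces of content.
It is possible that a configured rule matches nothing,
or
that a single rule yields several concatenated pieces.
Multiple pieces (controlled by the all parameter of WebMagicUtil.csdnProcess()) are also possible!
When all is
false, at most one result is taken for the current field rule, i.e. only match [0] is appended;
true, every result is taken, i.e. all size matches are appended in a loop.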
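The concatenation rule can be sketched as follows. This is a simplified model of the behaviour described above, not the code of WebMagicUtil.csdnProcess(): the class `MdJoinSketch`, the method `build` and the sample strings are made up, and the separator is written with Unicode escapes only to keep the source file ASCII.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MdJoinSketch {

    // The separator quoted in the text, spelled with Unicode escapes.
    static final String SEPARATOR = "# md\u5206\u5272-\u4e0b\u4e00\u7bc7\n\n";

    // matchesPerField holds, for each field rule, the pieces that rule matched (possibly none).
    static String build(List<List<String>> matchesPerField, boolean all) {
        List<String> pieces = new ArrayList<>();
        for (List<String> matches : matchesPerField) {
            if (matches.isEmpty()) {
                continue; // a rule that matched nothing contributes nothing
            }
            // all == false: only match [0]; all == true: every match of this rule
            pieces.addAll(all ? matches : matches.subList(0, 1));
        }
        // The separator goes between consecutive pieces, so a single piece gets none.
        return String.join(SEPARATOR, pieces);
    }

    public static void main(String[] args) {
        List<List<String>> perField = Arrays.asList(
                Arrays.asList("first article\n\n", "second article\n\n"),
                Arrays.asList("third article\n\n"));
        System.out.print(build(perField, true));
    }
}
```

With `all` set to `false`, the same input yields two pieces and one separator; the number of separators follows the number of matched pieces, not the number of rules.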
2. Generic detail page export to zip
###### 2.1 Downloading the zip with swagger
Download the zip file through swagger-ui, [address](http://localhost:8000/swagger-ui.html#!/ProTestController/exportCSDNMdUsingPOST) -> http://localhost:8000/pro/item/export/zip
POST, application/json
2.2 json
CSDN example
{
"fileName":"img-download-test.zip",
"itemPageUrl":"https://blog.csdn.net/weixin_43917143/article/details/120436445",
"fields":[
{"key":"images","value":"//*[@id=\"article_content\"]//*//img","type":"XPath","fileType":"PNG"},
{"key":"icons","value":"//div[@class=\"mouse-box\"]//img","type":"XPath","fileType":"PNG"}
]
}
I recommend testing this with Postman, which can save the resulting file stream as a zip file.
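If you would rather script the download than use Postman, a plain JDK client along these lines should work. It is only a sketch: the class `ZipDownloadSketch` and its methods are invented here, the endpoint URL is the one given above, and it assumes the endpoint answers the POST with the zip bytes as the response body.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class ZipDownloadSketch {

    // Streams the response body to a file, replacing an existing one; returns the bytes written.
    static long save(InputStream body, Path target) throws IOException {
        return Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // POSTs the JSON payload and saves whatever the endpoint returns.
    static long postAndSave(String endpoint, String json, Path target) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(endpoint).openConnection();
        connection.setRequestMethod("POST");
        connection.setRequestProperty("Content-Type", "application/json");
        connection.setDoOutput(true);
        try (OutputStream out = connection.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = connection.getInputStream()) {
            return save(in, target);
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: ZipDownloadSketch <request.json> <output.zip>");
            return;
        }
        // The request file is expected to contain a payload like the JSON example above.
        String json = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        long bytes = postAndSave("http://localhost:8000/pro/item/export/zip", json, Paths.get(args[1]));
        System.out.println("saved " + bytes + " bytes");
    }
}
```

The body is copied as a stream rather than read into a string, because the response is binary zip data.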
2.3 Result
3. Generic list page: collect the detail pages and export them as a zip
3.1 Download URL
http://localhost:8000/pro/list/export/zip
POST, application/json
3.2 json
CSDN example:
{
"fileName":"xxx.zip",
"helpUrl":"https://blog.csdn.net/weixin_43917143",
"targetUrl":"(https://blog\\.csdn\\.net/weixin_43917143/article/details/\\w+)",
"fields":[
{"key":"md","value":"//*[@id=\"article_content\"]/[@id=\"content_views\"]","type":"XPath","fileType":"MD"},
{"key":"images","value":"//*[@id=\"article_content\"]//*//img","type":"XPath","fileType":"PNG"},
{"key":"icons","value":"//div[@class=\"mouse-box\"]//img","type":"XPath","fileType":"PNG"}
]
}
Jianshu example:
{
"fileName":"jianshu-xxx.zip",
"helpUrl":"http://www.reibang.com/u/7dcee71ee038",
"targetUrl":"(https://www\\.jianshu\\.com/p/\\w+)",
"fields":[
{"key":"md","value":"//div[@class=\"_gp-ck\"]/section[@class=\"ouvjEz\"][1]","type":"XPath","fileType":"MD"}
]
}
3.3 Result
CSDN result: