jsoup is an HTML parser for Java that can directly parse a URL or a string of HTML text. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
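Without further ado, let's put together a small example to try it out.
First, some preparation. We need four jars in total:
1. The jsoup Java HTML Parser jar
2. The JUnit jar
3. The Apache Commons IO jar
4. The Apache Commons Lang jar
With Maven, they can be declared as follows: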
<!-- https://mvnrepository.com/artifact/junit/junit -->
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.4</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.4</version>
</dependency>
Opening the Jsoup source shows that its constructor is private and every method is static, so it is just a simple utility class. Let's give it a try.
//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by Fernflower decompiler)
//
package org.jsoup;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.jsoup.helper.DataUtil;
import org.jsoup.helper.HttpConnection;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Document.OutputSettings;
import org.jsoup.parser.Parser;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

public class Jsoup {
    private Jsoup() {
    }

    public static Document parse(String html, String baseUri) {
        return Parser.parse(html, baseUri);
    }

    public static Document parse(String html, String baseUri, Parser parser) {
        return parser.parseInput(html, baseUri);
    }

    public static Document parse(String html) {
        return Parser.parse(html, "");
    }

    public static Connection connect(String url) {
        return HttpConnection.connect(url);
    }

    public static Document parse(File in, String charsetName, String baseUri) throws IOException {
        return DataUtil.load(in, charsetName, baseUri);
    }

    public static Document parse(File in, String charsetName) throws IOException {
        return DataUtil.load(in, charsetName, in.getAbsolutePath());
    }

    public static Document parse(InputStream in, String charsetName, String baseUri) throws IOException {
        return DataUtil.load(in, charsetName, baseUri);
    }

    public static Document parse(InputStream in, String charsetName, String baseUri, Parser parser) throws IOException {
        return DataUtil.load(in, charsetName, baseUri, parser);
    }

    public static Document parseBodyFragment(String bodyHtml, String baseUri) {
        return Parser.parseBodyFragment(bodyHtml, baseUri);
    }

    public static Document parseBodyFragment(String bodyHtml) {
        return Parser.parseBodyFragment(bodyHtml, "");
    }

    public static Document parse(URL url, int timeoutMillis) throws IOException {
        Connection con = HttpConnection.connect(url);
        con.timeout(timeoutMillis);
        return con.get();
    }

    public static String clean(String bodyHtml, String baseUri, Whitelist whitelist) {
        Document dirty = parseBodyFragment(bodyHtml, baseUri);
        Cleaner cleaner = new Cleaner(whitelist);
        Document clean = cleaner.clean(dirty);
        return clean.body().html();
    }

    public static String clean(String bodyHtml, Whitelist whitelist) {
        return clean(bodyHtml, "", whitelist);
    }

    public static String clean(String bodyHtml, String baseUri, Whitelist whitelist, OutputSettings outputSettings) {
        Document dirty = parseBodyFragment(bodyHtml, baseUri);
        Cleaner cleaner = new Cleaner(whitelist);
        Document clean = cleaner.clean(dirty);
        clean.outputSettings(outputSettings);
        return clean.body().html();
    }

    public static boolean isValid(String bodyHtml, Whitelist whitelist) {
        Document dirty = parseBodyFragment(bodyHtml, "");
        Cleaner cleaner = new Cleaner(whitelist);
        return cleaner.isValid(dirty);
    }
}
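Because every entry point is static, callers never construct a Jsoup object. As a rough sketch of calling two of the overloads listed above, the following parses an in-memory string with a base URI and then a bare body fragment. The class name StaticEntryPoints, the sample markup, and the http://example.com/ base URI are assumptions made up for illustration; they do not come from the original project.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class StaticEntryPoints {

    // parse(String html, String baseUri): the base URI lets jsoup resolve relative links
    static String resolveFirstLink(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        Element link = document.getElementsByTag("a").first();
        // first() returns null when nothing matched, so guard before calling absUrl
        return link == null ? "" : link.absUrl("href");
    }

    // parseBodyFragment(String bodyHtml): wraps a fragment in a full document skeleton
    static String fragmentText(String bodyHtml) {
        Document document = Jsoup.parseBodyFragment(bodyHtml);
        return document.body().text();
    }

    public static void main(String[] args) {
        // Hypothetical inputs, used only to exercise the two helpers
        System.out.println(resolveFirstLink("<a href=\"/docs\">Docs</a>", "http://example.com/"));
        System.out.println(fragmentText("<p>Hello <b>jsoup</b></p>"));
    }
}
```

Passing a base URI only matters when the markup contains relative links; for plain text extraction the single-argument parse(String) is enough.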
Jsoup's parse method has many overloads: some read a string, others read a file, and so on. The simple examples below demonstrate them.
import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.junit.Test;

import java.io.File;
import java.io.IOException;
import java.net.URL;

public class JsoupDemo {

    // Parse a URL
    @Test
    public void urldemo() throws IOException {
        // Fetch and parse the URL, returning a Document object (timeout in milliseconds)
        Document document = Jsoup.parse(new URL("http://www.baidu.com"), 1000);
        // Get the elements whose tag name is title
        // String title = document.getElementsByTag("title").first().text();
        Elements elements = document.getElementsByTag("title");
        System.out.println(elements.text());
    }

    // Parse a string
    @Test
    public void stringdemo() throws Exception {
        // Read the file into a string with the utility class
        String str = FileUtils.readFileToString(new File("C:\\Users\\asus\\Desktop\\嗶哩嗶哩 (゜-゜)つロ 干杯_-bilibili.html"), "utf8");
        // Parse the string, returning a Document object
        Document document = Jsoup.parse(str);
        // Get the text of the first title tag
        String string = document.getElementsByTag("title").first().text();
        // Print the result
        System.out.println(string);
    }

    // Parse a file
    @Test
    public void filedemo() throws IOException {
        // Parse the file, returning a DOM object
        Document document = Jsoup.parse(new File("C:\\Users\\asus\\Desktop\\嗶哩嗶哩 (゜-゜)つロ 干杯_-bilibili.html"), "utf8");
        // Get the text of the first title tag
        String str = document.getElementsByTag("title").first().text();
        // Print the result
        System.out.println(str);
    }
}
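The demos above only use getElementsByTag. The CSS and jQuery-like style of lookup mentioned in the introduction goes through the select method instead. Here is a small sketch of it, assuming a made-up page held in a string; the class name SelectorDemo, the helper texts, and the markup are all hypothetical and exist only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class SelectorDemo {

    // Hypothetical sample page used only for this illustration
    static final String HTML =
            "<html><head><title>Selector demo</title></head><body>"
            + "<ul id=\"nav\">"
            + "<li class=\"item\"><a href=\"/home\">Home</a></li>"
            + "<li class=\"item active\"><a href=\"/news\">News</a></li>"
            + "</ul>"
            + "<p>No link here</p>"
            + "</body></html>";

    // Collect the text of every element matched by a CSS selector
    static List<String> texts(String html, String cssQuery) {
        Document document = Jsoup.parse(html);
        Elements matched = document.select(cssQuery);
        List<String> result = new ArrayList<String>();
        for (Element element : matched) {
            result.add(element.text());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(texts(HTML, "title"));          // tag selector
        System.out.println(texts(HTML, "#nav a[href]"));   // id, descendant, and attribute selectors
        System.out.println(texts(HTML, "li.active > a"));  // class and direct-child selectors
    }
}
```

A selector string keeps the lookup in one expression, whereas chaining getElementsBy... calls needs a separate step for each condition.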
Jsoup has many more ready-made methods, which I won't demonstrate one by one.
Methods provided by the DOM object:
document.getElementById();// find an element by id
document.getElementsByTag();// get elements by tag name
document.getElementsByClass();// get elements by class
document.getElementsByAttribute();// get elements by attribute
Methods provided by Element:
element.id();// get the element's id
element.className();// get the element's class name
element.attr();// get the value of one attribute
element.attributes();// get all of the element's attributes
element.text();// get the element's text content
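
To show the methods listed above with real arguments, here is a minimal sketch that runs them against a small in-memory page. The class name ElementMethodsDemo, its helper methods, and the sample markup (including the data-role attribute) are assumptions invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementMethodsDemo {

    // Hypothetical sample page used only for this illustration
    static final String HTML =
            "<html><head><title>Demo</title></head><body>"
            + "<div id=\"main\" class=\"box\" data-role=\"content\">Hello <b>jsoup</b></div>"
            + "<p class=\"box\">Second</p>"
            + "</body></html>";

    // Look an element up by id, then read its id, class, one attribute, attribute count, and text
    static String summarize(String html, String id) {
        Document document = Jsoup.parse(html);
        Element element = document.getElementById(id);
        if (element == null) {
            return "not found"; // getElementById returns null when no element has that id
        }
        return element.id() + "|" + element.className() + "|" + element.attr("data-role")
                + "|" + element.attributes().size() + "|" + element.text();
    }

    // Count the elements that carry a given class
    static int countByClass(String html, String className) {
        return Jsoup.parse(html).getElementsByClass(className).size();
    }

    // Count the elements that have a given attribute, whatever its value
    static int countByAttribute(String html, String attributeName) {
        return Jsoup.parse(html).getElementsByAttribute(attributeName).size();
    }

    public static void main(String[] args) {
        System.out.println(summarize(HTML, "main"));
        System.out.println(countByClass(HTML, "box"));
        System.out.println(countByAttribute(HTML, "data-role"));
    }
}
```

Note that getElementById returns a single Element (or null), while the other getElementsBy... methods return an Elements collection that may be empty.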