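Project name: Java web crawler
Technology stack: Java, Maven, MySQL, WebMagic, JSP, Servlet
Approach: the focus is on getting to know development with WebMagic, a Java crawler framework. Students use the Java they have learned to crawl and parse data from a specified website, then display it on a page with Servlet and JSP.
Lab environment: one computer per student; practice alongside the lecture.
Overview:
The main goal of this lab is to strengthen students' understanding of the WebMagic framework and Servlets, and to apply the theory they have learned in a hands-on crawler project. Students are expected to become proficient with the widely used MySQL database, the Java language, the WebMagic framework, and Servlet development, and to gain a basic understanding of how medium and large companies in the big-data industry work.
The cases covered in this lab are:
Basic MySQL database operations
Basic Java syntax
Setting up the WebMagic framework and developing a crawler project
Studying these topics should deepen students' understanding of computing, support their coursework, and gradually improve their employability.
Steps:
1. Download and install Maven, and configure the Maven settings in Eclipse.
1) Download and install Maven
Download URL: http://maven.apache.org/download.cgi. Choose the version that suits your system:
Extract the downloaded archive to a suitable location; that completes the Maven installation:
2) Set the environment variable
Copy the path of the bin directory under the Maven installation path and add it to your computer's environment variables:
Copy the path of the bin directory:
Add the environment variable:
In cmd, run mvn --version to check whether Maven was installed successfully. Output like the following means it was: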
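If mvn --version is not recognized, the usual cause is that the bin directory never made it into PATH. The following is a minimal sketch, using only the JDK, of how one might check that from Java; the class name MavenPathCheck and the helper findOnPath are hypothetical names introduced here, and the launcher file names mvn and mvn.cmd are an assumption about what the Maven bin directory contains.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

class MavenPathCheck {
    /** Returns the first file on the given PATH-style string whose name matches one of the candidates. */
    static Optional<Path> findOnPath(String pathValue, String... executableNames) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String entry : pathValue.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue;
            }
            for (String name : executableNames) {
                try {
                    Path candidate = Paths.get(entry, name);
                    if (Files.isRegularFile(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException ignored) {
                    // Skip PATH entries that are not valid paths on this platform
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumption: the Maven launcher is named "mvn" (Unix) or "mvn.cmd" (Windows)
        Optional<Path> mvn = findOnPath(System.getenv("PATH"), "mvn", "mvn.cmd");
        System.out.println(mvn.map(p -> "Found Maven launcher: " + p)
                .orElse("No Maven launcher found on PATH"));
    }
}
```

A new cmd window is needed after editing the environment variables, because already-open windows keep the old PATH.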
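3) Optional: edit settings.xml in the conf directory of the Maven installation (E:\apache-maven-3.5.4\conf\settings.xml) to set the local repository location and switch the remote repository mirror to the Aliyun mirror:
Configure the local repository by adding the path of the local repository you want to create (adjust to your own setup):
Maven's default repository is hosted outside China, so it tends to be slow, especially when downloading dependencies. Switching to the Aliyun mirror in China speeds this up considerably: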
<mirror>
    <id>aliyun</id>
    <name>aliyun Maven</name>
    <mirrorOf>*</mirrorOf>
    <url>http://maven.aliyun.com/nexus/content/repositories/central</url>
</mirror>
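4) Eclipse configuration
The following steps may look different on each person's computer (the screenshots come from different projects, so ignore package names, class names, and similar details; some screenshots come from the web, and details may differ between them), but the procedure is the same, so just follow along. To install Maven in Eclipse, open Eclipse and click Window > Preferences; a dialog appears:
After clicking OK, the following appears:
After clicking Finish:
Configure Maven in Eclipse:
Open the Eclipse preferences
Find the Maven configuration entry
Set Maven's global configuration file, settings.xml
Update the configuration
2. Create a Maven project in Eclipse
1) Open Eclipse, right-click and choose New > Other, then locate Maven Project as shown below, or search for "maven project" directly:
Create the project:
2) Select Maven Project, check "Create a simple project (skip archetype selection)", then click Next:
3) Fill in Group Id and Artifact Id. Leave Version at its default; Packaging defaults to jar; Name and Description are optional; everything else can be left blank:
Then click Finish. Downloading the required files takes a while. Once created, the complete project structure should look like the figure below:
3. Write the Java crawler code to scrape information from https://hr.tencent.com/position.php:
1) Content to scrape from the page: position name, position category, headcount, location, publish date
2) Based on the content to be scraped (position name, position category, headcount, location, publish date), design the MySQL database along the lines of the SQL statements below: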
/*
SQLyog Ultimate v12.5.0 (64 bit)
MySQL - 5.5.27 : Database - mysql_java
*********************************************************************
*/
/*!40101 SET NAMES utf8 */;
/*!40101 SET SQL_MODE=''*/;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`mysql_java` /*!40100 DEFAULT CHARACTER SET utf8 */;
USE `mysql_java`;
/*Table structure for table `tencent_position` */
DROP TABLE IF EXISTS `tencent_position`;
CREATE TABLE `tencent_position` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `p_name` varchar(200) NOT NULL,
    `p_link` varchar(200) NOT NULL,
    `p_type` varchar(100) NOT NULL,
    `p_num` varchar(20) NOT NULL,
    `p_location` varchar(20) NOT NULL,
    `p_publish_time` varchar(20) NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1125 DEFAULT CHARSET=utf8;
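Every column in this table is a NOT NULL varchar with a fixed maximum length, so a scraped value that is null or too long can cause trouble at insert time (depending on the server's SQL mode, an over-long value is either rejected or truncated). The sketch below shows one possible way to normalize values before they reach the database. The class ColumnLimitsDemo and the method fit are hypothetical names, and silently truncating is only one policy; rejecting the row would be equally reasonable.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ColumnLimitsDemo {
    // Maximum lengths copied from the CREATE TABLE statement above
    private static final Map<String, Integer> MAX_CHARS = new LinkedHashMap<>();
    static {
        MAX_CHARS.put("p_name", 200);
        MAX_CHARS.put("p_link", 200);
        MAX_CHARS.put("p_type", 100);
        MAX_CHARS.put("p_num", 20);
        MAX_CHARS.put("p_location", 20);
        MAX_CHARS.put("p_publish_time", 20);
    }

    /** Trims the value, maps null to "", and cuts it to the column's character limit. */
    static String fit(String column, String value) {
        Integer max = MAX_CHARS.get(column);
        if (max == null) {
            throw new IllegalArgumentException("Unknown column: " + column);
        }
        String trimmed = value == null ? "" : value.trim();
        // Count code points rather than UTF-16 units so supplementary characters count once
        if (trimmed.codePointCount(0, trimmed.length()) <= max) {
            return trimmed;
        }
        return trimmed.substring(0, trimmed.offsetByCodePoints(0, max));
    }

    public static void main(String[] args) {
        System.out.println(fit("p_num", " 3 "));
        System.out.println(fit("p_location", "abcdefghijklmnopqrstuvwxy"));
    }
}
```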
3) In the project you created, first configure pom.xml, then create four classes and one interface (choose your own names): MySQLUtils, TencentPageProcessor, TencentPosition, TencentPositionDao (interface), TencentPositionDaoImpl
Configure pom.xml:
Setting up pom.xml: after filling in the <dependency>...</dependency> entries, be sure to press Ctrl+S or click the Save button. Eclipse then automatically downloads the required files from the configured Maven repository, which may take some time:
The dependency snippets come from http://mvnrepository.com/. Search for webmagic and mysql respectively to find the relevant entries.
Click a search result and copy the code in the box into the <dependency>...</dependency> section of pom.xml:
You can check under Maven Dependencies whether the download has finished:
The following is sample code; when writing your own, be sure to adapt it.
The MySQLUtils class is as follows:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class MySQLUtils {
    // Returns a new connection on every call; the caller is responsible for closing it.
    // (The DAO closes its connection after each insert, so a cached connection must not be reused.)
    public static Connection getConnection() throws ClassNotFoundException, SQLException {
        Class.forName("com.mysql.jdbc.Driver");
        // Adjust the URL, user, and password to your own environment
        String url = "jdbc:mysql://localhost:3306/mysql_java";
        String user = "root";
        String password = "root";
        return DriverManager.getConnection(url, user, password);
    }
}
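Because the table uses the utf8 character set and the scraped text is Chinese, it can help to state the encoding explicitly in the JDBC URL. The sketch below only builds the URL string; useUnicode and characterEncoding are MySQL Connector/J connection properties, while the class JdbcUrlDemo and the method mysqlUrl are hypothetical names. Whether the extra parameters are actually needed depends on the driver version and server defaults, so treat this as an option to try if stored text comes out garbled.

```java
class JdbcUrlDemo {
    /** Builds a MySQL JDBC URL that asks the driver to exchange text as UTF-8. */
    static String mysqlUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?useUnicode=true&characterEncoding=utf8";
    }

    public static void main(String[] args) {
        // Host, port, and database name match the values used in MySQLUtils
        System.out.println(mysqlUrl("localhost", 3306, "mysql_java"));
    }
}
```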
The TencentPosition class is as follows:
public class TencentPosition {
    private String positionName;
    private String positionLink;
    private String positionType;
    private String positionNum;
    private String workLocation;
    private String publishTime;

    public TencentPosition() {
        super();
    }

    public TencentPosition(String positionName, String positionLink, String positionType, String positionNum,
            String workLocation, String publishTime) {
        super();
        this.positionName = positionName;
        this.positionLink = positionLink;
        this.positionType = positionType;
        this.positionNum = positionNum;
        this.workLocation = workLocation;
        this.publishTime = publishTime;
    }

    public String getPositionName() {
        return positionName;
    }

    public void setPositionName(String positionName) {
        this.positionName = positionName;
    }

    public String getPositionLink() {
        return positionLink;
    }

    public void setPositionLink(String positionLink) {
        this.positionLink = positionLink;
    }

    public String getPositionType() {
        return positionType;
    }

    public void setPositionType(String positionType) {
        this.positionType = positionType;
    }

    public String getPositionNum() {
        return positionNum;
    }

    public void setPositionNum(String positionNum) {
        this.positionNum = positionNum;
    }

    public String getWorkLocation() {
        return workLocation;
    }

    public void setWorkLocation(String workLocation) {
        this.workLocation = workLocation;
    }

    public String getPublishTime() {
        return publishTime;
    }

    public void setPublishTime(String publishTime) {
        this.publishTime = publishTime;
    }

    @Override
    public String toString() {
        return "TencentPosition [positionName=" + positionName + ", positionLink=" + positionLink + ", positionType="
                + positionType + ", positionNum=" + positionNum + ", workLocation=" + workLocation + ", publishTime="
                + publishTime + "]";
    }
}
The TencentPositionDao interface is as follows:
public interface TencentPositionDao {
    // Stores one position and returns the number of rows inserted
    int add(TencentPosition position);
}
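One benefit of hiding storage behind an interface is that the crawler logic can be exercised without a running MySQL server. The following is a sketch of an in-memory implementation under that assumption. To keep it compilable on its own, it declares trimmed-down stand-ins (Position, PositionDao) for the tutorial's TencentPosition and TencentPositionDao; all names in this block are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class InMemoryDaoDemo {
    // Stand-in for TencentPosition, reduced to a single field for brevity
    static class Position {
        final String name;
        Position(String name) {
            this.name = name;
        }
    }

    // Stand-in for TencentPositionDao with the same add contract
    interface PositionDao {
        int add(Position position);
    }

    // Keeps rows in a synchronized list so several crawler threads can add concurrently
    static class InMemoryPositionDao implements PositionDao {
        private final List<Position> rows = Collections.synchronizedList(new ArrayList<>());

        @Override
        public int add(Position position) {
            if (position == null) {
                return 0; // nothing stored, mirroring "0 rows affected"
            }
            rows.add(position);
            return 1;
        }

        int size() {
            return rows.size();
        }
    }

    public static void main(String[] args) {
        InMemoryPositionDao dao = new InMemoryPositionDao();
        dao.add(new Position("Backend Engineer"));
        System.out.println("Stored rows: " + dao.size());
    }
}
```

Swapping this in for the JDBC-backed implementation would make it possible to check the parsing code first and add the database afterwards.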
The TencentPositionDaoImpl class is as follows:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TencentPositionDaoImpl implements TencentPositionDao {
    public int add(TencentPosition position) {
        String sql = "INSERT INTO tencent_position(p_name, p_link, p_type, p_num, p_location, p_publish_time)"
                + " VALUES(?, ?, ?, ?, ?, ?)";
        Connection conn = null;
        PreparedStatement pst = null;
        try {
            conn = MySQLUtils.getConnection();
            pst = conn.prepareStatement(sql);
            // Bind the six fields in the same order as the column list
            pst.setString(1, position.getPositionName());
            pst.setString(2, position.getPositionLink());
            pst.setString(3, position.getPositionType());
            pst.setString(4, position.getPositionNum());
            pst.setString(5, position.getWorkLocation());
            pst.setString(6, position.getPublishTime());
            return pst.executeUpdate();
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } finally {
            // Close the statement first, then the connection
            if (pst != null) {
                try {
                    pst.close();
                } catch (SQLException e) {
                    e.printStackTrace();
                } finally {
                    pst = null;
                }
            }
            if (conn != null) {
                try {
                    conn.close();
                } catch (SQLException e) {
                    e.printStackTrace();
                } finally {
                    conn = null;
                }
            }
        }
        // Reached only when an exception occurred
        return 0;
    }
}
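The nested finally blocks in add can be expressed more compactly with try-with-resources, which closes resources automatically in the reverse order of their declaration. The sketch below demonstrates that ordering with fake resources instead of a real database, so it runs anywhere; Tracked and CloseOrderDemo are hypothetical names. Since java.sql.Connection and java.sql.PreparedStatement both implement AutoCloseable, the same pattern should carry over to the DAO.

```java
import java.util.ArrayList;
import java.util.List;

class CloseOrderDemo {
    // A fake resource that records when it is closed
    static class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    /** Runs a try-with-resources block and returns the recorded sequence of events. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Tracked conn = new Tracked("connection", log);
             Tracked pst = new Tracked("statement", log)) {
            log.add("execute");
        }
        return log;
    }

    public static void main(String[] args) {
        // The statement is closed before the connection, as in the manual finally block
        System.out.println(run());
    }
}
```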
The TencentPageProcessor class is as follows:
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

public class TencentPageProcessor implements PageProcessor {
    // Retry failed downloads up to 5 times and sleep 1000 ms between pages
    private Site site = Site.me().setRetryTimes(5).setSleepTime(1000);
    private static TencentPositionDao dao = new TencentPositionDaoImpl();
    public static AtomicLong count = new AtomicLong();
    public static AtomicLong total = new AtomicLong();

    public Site getSite() {
        return site;
    }

    public void process(Page page) {
        // Queue every pagination link found on the current page
        List<String> urlList = page.getHtml().links().regex("https://hr.tencent.com/position.php\\?&start=\\d+").all();
        System.out.println(urlList);
        page.addTargetRequests(urlList);
        // Extract one list per column; note that only rows with class 'odd' are selected
        List<String> positionNames = page.getHtml().xpath("//tr[@class='odd']/td[1]/a/text()").all();
        List<String> positionLinks = page.getHtml().xpath("//tr[@class='odd']/td[1]/a/@href").all();
        List<String> positionTypes = page.getHtml().xpath("//tr[@class='odd']/td[2]/text()").all();
        List<String> positionNums = page.getHtml().xpath("//tr[@class='odd']/td[3]/text()").all();
        List<String> workLocations = page.getHtml().xpath("//tr[@class='odd']/td[4]/text()").all();
        List<String> publishTimes = page.getHtml().xpath("//tr[@class='odd']/td[5]/text()").all();
        // Combine the i-th entry of each list into one record and store it
        for (int i = 0; i < positionNames.size(); i++) {
            TencentPosition position = new TencentPosition();
            position.setPositionName(positionNames.get(i));
            position.setPositionLink(positionLinks.get(i));
            position.setPositionType(positionTypes.get(i));
            position.setPositionNum(positionNums.get(i));
            position.setPublishTime(publishTimes.get(i));
            position.setWorkLocation(workLocations.get(i));
            dao.add(position);
        }
        // Alternative (disabled): take only the first matching row and pass it to the pipeline
        // String positionName = page.getHtml().xpath("//tr[@class='odd']/td[1]/a/text()").get();
        // String positionType = page.getHtml().xpath("//tr[@class='odd']/td[2]/text()").get();
        // String positionLink = "https://hr.tencent.com/" + page.getHtml().xpath("//tr[@class='odd']/td[1]/a/@href").get();
        // String positionNum = page.getHtml().xpath("//tr[@class='odd']/td[3]/text()").get();
        // String workLocation = page.getHtml().xpath("//tr[@class='odd']/td[4]/text()").get();
        // String publishTime = page.getHtml().xpath("//tr[@class='odd']/td[5]/text()").get();
        // page.putField("positionName", positionName);
        // page.putField("positionLink", positionLink);
        // page.putField("positionType", positionType);
        // page.putField("positionNum", positionNum);
        // page.putField("workLocation", workLocation);
        // page.putField("publishTime", publishTime);
        // TencentPosition position = new TencentPosition();
        // position.setPositionName(positionName);
        // position.setPositionLink(positionLink);
        // position.setPositionType(positionType);
        // position.setPositionNum(positionNum);
        // position.setPublishTime(publishTime);
        // position.setWorkLocation(workLocation);
        // dao.add(position);
    }

    public static void main(String[] args) {
        Spider.create(new TencentPageProcessor())
                .addUrl("https://hr.tencent.com/position.php?&start=0")
                .addPipeline(new JsonFilePipeline("web_code"))
                .thread(100)
                .run();
    }
}
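Two details of process can be tried out without WebMagic: which URLs the pagination pattern accepts, and how a relative href from the table becomes an absolute link. The sketch below uses only java.util.regex and java.net.URI. It escapes the dots in the pattern, on the assumption that literal dots were intended, and the detail-page href in main is a made-up example rather than a real URL from the site; CrawlUrlDemo, isListPage, and toAbsolute are hypothetical names.

```java
import java.net.URI;
import java.util.regex.Pattern;

class CrawlUrlDemo {
    // Pagination pattern from process(), with dots escaped so they match only a literal '.'
    private static final Pattern LIST_PAGE =
            Pattern.compile("https://hr\\.tencent\\.com/position\\.php\\?&start=\\d+");

    /** True when the whole URL is a pagination link such as ...position.php?&start=10. */
    static boolean isListPage(String url) {
        return LIST_PAGE.matcher(url).matches();
    }

    /** Resolves an href taken from a page against that page's own URL. */
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "https://hr.tencent.com/position.php?&start=0";
        System.out.println(isListPage("https://hr.tencent.com/position.php?&start=10"));
        // Hypothetical relative link, standing in for a value read from @href
        System.out.println(toAbsolute(page, "position_detail.php?id=123"));
    }
}
```

Resolving against the page URL, rather than concatenating a fixed prefix, also handles hrefs that start with a slash or are already absolute.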
4. Once coding is complete, click Run to test
When the console shows output like the following, the scrape succeeded:
Now check your database for data. If there is data, and it matches the data to be scraped from the web page, the Java crawler project is complete:
Lab complete.
When reposting, please retain or cite the source: http://www.reibang.com/p/afa7071a4458