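Crawling segmentfault Q&A with a PHP spider
### 1. Requirements overview
This project crawls question, answer and tag data from segmentfault.com, a leading developer community in China. The data gives an indirect view of current technology trends and of what developers in China are paying attention to.
Note: the crawler script is purely a personal technical exercise and is not used for any commercial purpose.
### 2. Development environment and package dependencies
Runtime environment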
- CentOS Linux release 7.0.1406 (Core)
- PHP 7.0.2
- Redis 3.0.5
- MySQL 5.5.46
- Composer 1.0-dev
Composer dependencies
### 3. Process and implementation
First, design two tables: `post` and `post_tag`.
```sql
CREATE TABLE `post` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'pk',
  `post_id` varchar(32) NOT NULL COMMENT 'post ID',
  `author` varchar(64) NOT NULL COMMENT 'author',
  `title` varchar(512) NOT NULL COMMENT 'post title',
  `view_num` int(11) NOT NULL COMMENT 'view count',
  `reply_num` int(11) NOT NULL COMMENT 'reply count',
  `collect_num` int(11) NOT NULL COMMENT 'favorite count',
  `tag_num` int(11) NOT NULL COMMENT 'number of tags',
  `vote_num` int(11) NOT NULL COMMENT 'vote count',
  `post_time` date NOT NULL COMMENT 'post date',
  `create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'crawl time',
  PRIMARY KEY (`id`),
  KEY `idx_post_id` (`post_id`)
) ENGINE=MyISAM AUTO_INCREMENT=7108 DEFAULT CHARSET=utf8 COMMENT='posts';

CREATE TABLE `post_tag` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'PK',
  `post_id` varchar(32) NOT NULL COMMENT 'post ID',
  `tag_name` varchar(128) NOT NULL COMMENT 'tag name',
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=15349 DEFAULT CHARSET=utf8 COMMENT='post-tag association table';
```
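Some would say this design is wrong: a tag is an independent entity, so there should be three tables, `post`, `tag` and `post_tag`, with the third one linking posts to tags. That would be cleaner and easier to query.
I kept it simple for two reasons. First, this is not a formal development requirement, just something built for fun, and the simpler it is, the faster it gets done. Second, with three tables every crawl would have to write to one more table and, more importantly, check whether each tag already exists, which would slow the crawl down.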
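The project files are laid out as follows:
- `app/config/config.php`: configuration
- `app/helper/Db.php`: database insertion
- `app/helper/Redis.php`: cache service
- `app/helper/Spider.php`: fetching and parsing service
- `app/helper/Util.php`: utilities
- `app/vendor/composer/`: Composer autoloader files
- `app/vendor/symfony/`: third-party crawling library
- `app/vendor/autoload.php`: autoload entry point
- `app/composer.json`: project configuration
- `app/composer.lock`: project configuration (locked dependency versions)
- `app/run.php`: entry script

The functionality is simple, so there is no need to bring in a third-party open-source PHP framework.
**Basic configuration**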
```php
class Config
{
    // Crawler settings
    public static $spider = [
        'base_url'  => 'http://segmentfault.com/questions?',
        'from_page' => 1,
        'timeout'   => 5,
    ];
    // Redis connection settings
    public static $redis = [
        'host'    => '127.0.0.1',
        'port'    => 10000,
        'timeout' => 5,
    ];
    // MySQL connection settings
    public static $mysql = [
        'host'    => '127.0.0.1',
        'port'    => '3306',
        'dbname'  => 'segmentfault',
        'dbuser'  => 'user',
        'dbpwd'   => 'user',
        'charset' => 'utf8',
    ];
}
```
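Two points need attention when fetching pages:
First, enable `CURLOPT_FOLLOWLOCATION` so that 301 redirects are followed, because segmentfault redirects between domains; for example, `http://www.segmentfault.com/` redirects to `http://segmentfault.com`.
Second, set a User-Agent; otherwise the request is redirected (301) to a browser-upgrade page.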
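
Both points map directly onto cURL options. The sketch below is one minimal way a fetch helper such as `getUrlContent()` could be written. The function names `buildCurlOptions` and `fetchUrl`, the redirect cap and the User-Agent string are assumptions made for illustration, not the original script's code.

```php
<?php
/**
 * Build the cURL options for one page fetch (hypothetical helper).
 */
function buildCurlOptions(string $url, int $timeout = 5): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow 301/302 redirects
        CURLOPT_MAXREDIRS      => 5,        // assumed cap to avoid redirect loops
        CURLOPT_TIMEOUT        => $timeout, // overall timeout in seconds
        // Assumed browser-like User-Agent; any current browser string should do
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0 Safari/537.36',
    ];
}

/**
 * Fetch a URL and return its body, throwing if the transfer fails.
 */
function fetchUrl(string $url, int $timeout = 5): string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildCurlOptions($url, $timeout));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('fetch failed: ' . curl_error($ch));
    }
    return $body;
}
```

Keeping the options in their own function makes it easy to check the redirect and User-Agent settings without touching the network.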
**Parsing with the crawler**
```php
public function craw()
{
    $content = $this->getUrlContent($this->getUrl());
    $crawler = new Crawler();
    $crawler->addHtmlContent($content);
    // No list items means we have gone past the last page
    $found = $crawler->filter(".stream-list__item");
    if ($found->count()) {
        $data = $found->each(function (Crawler $node, $i) {
            // Q&A post ID, taken from the third segment of the link
            $href = trim($node->filter(".author li a")->eq(1)->attr('href'));
            $a = explode("/", $href);
            $post_id = isset($a[2]) ? $a[2] : 0;
            // Parse the entry if its ID is unknown or it has not been crawled yet
            if ($post_id == 0 || !(new Redis())->checkPostExists($post_id)) {
                return $this->getPostData($node, $post_id, $href);
            }
            return false;
        });
        // Drop the entries that were skipped
        foreach ($data as $i => $v) {
            if (!$v) {
                unset($data[$i]);
            }
        }
        $data = array_values($data);
        $this->incrementPage();
        $continue = true;
    } else {
        $data = [];
        $continue = false;
    }
    return [$data, $continue];
}

private function getPostData(Crawler $node, $post_id, $href)
{
    $tmp = [];
    $tmp['post_id'] = $post_id;
    // Title
    $tmp['title'] = trim($node->filter(".summary h2.title a")->text());
    // Number of answers
    $tmp['reply_num'] = intval(trim($node->filter(".qa-rank .answers")->text()));
    // Number of views
    $tmp['view_num'] = intval(trim($node->filter(".qa-rank .views")->text()));
    // Number of votes
    $tmp['vote_num'] = intval(trim($node->filter(".qa-rank .votes")->text()));
    // Author
    $tmp['author'] = trim($node->filter(".author li a")->eq(0)->text());
    // Post time: use the list text when it ends with the "asked" label,
    // otherwise read the date from the detail page
    $origin_time = trim($node->filter(".author li a")->eq(1)->text());
    if (mb_substr($origin_time, -2, 2, 'utf-8') == '提問') {
        $tmp['post_time'] = Util::parseDate($origin_time);
    } else {
        $tmp['post_time'] = Util::parseDate($this->getPostDateByDetail($href));
    }
    // Number of favorites
    $collect = $node->filter(".author .pull-right");
    if ($collect->count()) {
        $tmp['collect_num'] = intval(trim($collect->text()));
    } else {
        $tmp['collect_num'] = 0;
    }
    // Tag list
    $tmp['tags'] = [];
    $tags = $node->filter(".taglist--inline");
    if ($tags->count()) {
        $tmp['tags'] = $tags->filter(".tagPopup")->each(function (Crawler $node, $i) {
            return $node->filter('.tag')->text();
        });
    }
    $tmp['tag_num'] = count($tmp['tags']);
    return $tmp;
}
```
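The crawler parses each fetched list page into a two-dimensional array ready for insertion, and the page parameter is incremented after every page. A few points to note:
1. Some Q&A posts have already been crawled and must be excluded before insertion, hence the Redis cache check.
2. The creation time has to be resolved according to the entry's status ("asked", "answered" or "updated"): an "asked" time from the list is used directly, otherwise the date is read from the detail page.
3. Relative times such as "5 minutes ago", "12 hours ago" and "3 days ago" must be converted to the standard `Y-m-d` format.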
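
The relative-time conversion in point 3 can be sketched as follows. This is an illustrative stand-in for `Util::parseDate`, whose real implementation is not shown here. The function name `parseRelativeDate`, the exact Chinese labels it matches, the handling of absolute dates and the fallback to the reference date are all assumptions.

```php
<?php
/**
 * Convert a relative time label into a Y-m-d date (illustrative helper).
 *
 * @param string                 $text label such as "3天前提问"
 * @param DateTimeImmutable|null $now  reference time, defaults to the current time
 */
function parseRelativeDate(string $text, $now = null): string
{
    $now = $now ?? new DateTimeImmutable();
    // Map the Chinese units (simplified and traditional) to DateTime units
    $units = [
        '分钟' => 'minutes', '分鐘' => 'minutes',
        '小时' => 'hours',   '小時' => 'hours',
        '天'   => 'days',
    ];
    // "N minutes/hours/days ago": subtract the interval from the reference time
    if (preg_match('/(\d+)\s*(分钟|分鐘|小时|小時|天)前/u', $text, $m)) {
        return $now->modify('-' . $m[1] . ' ' . $units[$m[2]])->format('Y-m-d');
    }
    // Assumed absolute form "YYYY-M-D": zero-pad it
    if (preg_match('/(\d{4})-(\d{1,2})-(\d{1,2})/', $text, $m)) {
        return sprintf('%04d-%02d-%02d', $m[1], $m[2], $m[3]);
    }
    // Unknown format: fall back to the reference date
    return $now->format('Y-m-d');
}
```

Passing the reference time as a parameter keeps the function deterministic, which makes it easy to test.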
**Database insertion**
```php
public function multiInsert($post)
{
    if (!$post || !is_array($post)) {
        return false;
    }
    $this->beginTransaction();
    try {
        // Insert the Q&A posts
        if (!$this->multiInsertPost($post)) {
            throw new Exception("failed(insert post)");
        }
        // Insert the tags
        if (!$this->multiInsertTag($post)) {
            throw new Exception("failed(insert tag)");
        }
        $this->commit();
        // Remember the inserted IDs so they are skipped on later runs
        $this->pushPostIdToCache($post);
        $ret = true;
    } catch (Exception $e) {
        $this->rollBack();
        $ret = false;
    }
    return $ret;
}
```
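Posts and tags are written in batches inside a single transaction, and once the commit succeeds the `post_id` values are pushed to the Redis cache. Note that `rollBack()` only has an effect on a transactional storage engine such as InnoDB; the MyISAM tables defined above do not support transactions.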
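
The bodies of `multiInsertPost` and `multiInsertTag` are not shown. A batch insert of this kind is commonly done with one multi-row `INSERT` and positional placeholders. The sketch below shows that idea under stated assumptions: `buildBatchInsert` is a hypothetical helper, not part of the original `Db` class, and the table and column names must come from trusted code because they are interpolated into the SQL.

```php
<?php
/**
 * Build one multi-row INSERT with positional placeholders (hypothetical helper).
 *
 * @param string $table   table name (trusted, not user input)
 * @param array  $columns column names (trusted, not user input)
 * @param array  $rows    list of associative arrays keyed by column name
 * @return array [string $sql, array $params]
 */
function buildBatchInsert(string $table, array $columns, array $rows): array
{
    if (!$rows) {
        throw new InvalidArgumentException('no rows to insert');
    }
    // One "(?, ?, ...)" group per row
    $rowPlaceholder = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    // Flatten the values in the same column order for every row
    $params = [];
    foreach ($rows as $row) {
        foreach ($columns as $column) {
            $params[] = $row[$column];
        }
    }
    $sql = sprintf(
        'INSERT INTO `%s` (`%s`) VALUES %s',
        $table,
        implode('`, `', $columns),
        implode(', ', array_fill(0, count($rows), $rowPlaceholder))
    );
    return [$sql, $params];
}
```

With a PDO connection, the result could then be run as `$pdo->prepare($sql)->execute($params)` between `beginTransaction()` and `commit()`.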
**Starting the job**
```php
require './vendor/autoload.php';

use helper\Spider;
use helper\Db;

$spider = new Spider();
while (true) {
    echo 'crawling from page:' . $spider->getUrl() . PHP_EOL;
    // $ret starts out as the "continue" flag returned by craw()
    list($data, $ret) = $spider->craw();
    if ($data) {
        // From here on $ret tells whether the insert succeeded
        $ret = (new Db)->multiInsert($data);
        echo count($data) . " new post crawled " . ($ret ? 'success' : 'failed') . PHP_EOL;
    } else {
        echo 'no new post crawled' . PHP_EOL;
    }
    echo PHP_EOL;
    // Stop after the last page or after a failed insert
    if (!$ret) {
        exit("work done");
    }
}
```
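The crawl runs in an infinite `while` loop and exits automatically when a crawl fails, that is, when there are no more pages or an insert fails. It can also be interrupted at any time with `Ctrl + C`.
### 4. Results
____
**Crawl in progress**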
![start](http://upload-images.jianshu.io/upload_images/67516-fb5b96370a5c728e.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
**Q&A screenshot**
![post](http://upload-images.jianshu.io/upload_images/67516-03d4743d1de325c3.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
**Tag screenshot**
![tag](http://upload-images.jianshu.io/upload_images/67516-aec36581e5ff95aa.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
### 5. Summary
____
The design and scripts above are enough for simple crawling and statistical analysis.
Let's look at the statistics for the top 25 tags first:
![tag_stat.jpg](http://upload-images.jianshu.io/upload_images/67516-3d9233f8795bd0e4.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
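On segmentfault, the three most discussed tags are `javascript`, `php` and `java`. Of the top 25 tags, 13 are front-end related (not counting mobile apps), which is more than 50%.
Running this tag count once a month makes it easy to follow technology trends: which technologies are losing attention and which are gaining it.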
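
How the ranking above was produced is not shown; a `GROUP BY tag_name` query on `post_tag` would be one option. As a plain-PHP alternative, the sketch below counts tags in memory, assuming the `tag_name` values have already been loaded into a flat array. The function name `topTags` is hypothetical.

```php
<?php
/**
 * Count tag occurrences and return the most frequent ones (hypothetical helper).
 *
 * @param array $tagNames flat list of tag names, one entry per post_tag row
 * @param int   $limit    how many tags to keep
 * @return array tag name => count, most frequent first
 */
function topTags(array $tagNames, int $limit = 25): array
{
    // Tally each distinct tag name
    $counts = array_count_values($tagNames);
    // Sort by count, highest first, keeping the tag names as keys
    arsort($counts);
    // Keep only the first $limit entries
    return array_slice($counts, 0, $limit, true);
}
```

Storing one such result per month would give the trend comparison described above.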
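**Limitations and possible improvements**
1. The crawler runs as a single process, which is somewhat slow. Running multiple processes would require a way to keep them from crawling the same pages.
2. Incremental updates are not supported yet: each run starts at the page number set in the configuration and continues to the end. The already-crawled `post_id` values could serve as a stop condition (`post_id` values are not consecutive, but they always increase).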
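
The stop condition suggested in point 2 could look roughly like this. It is only a sketch of the idea: it assumes the list pages are ordered newest first, and both `compareIds` and `pageIsAlreadyCrawled` are hypothetical helpers. IDs are compared as digit strings so that long values do not depend on integer size.

```php
<?php
/**
 * Compare two non-negative integer strings.
 * Returns a negative number, zero or a positive number.
 */
function compareIds(string $a, string $b): int
{
    $a = ltrim($a, '0');
    $b = ltrim($b, '0');
    // A longer digit string is the larger number; equal lengths compare digit by digit
    return strlen($a) <=> strlen($b) ?: strcmp($a, $b);
}

/**
 * Return true when every post ID on the page is at or below the highest
 * ID already stored, meaning the page holds nothing new.
 */
function pageIsAlreadyCrawled(array $pagePostIds, string $maxStoredId): bool
{
    if (!$pagePostIds) {
        return false; // an empty page says nothing about stored posts
    }
    foreach ($pagePostIds as $id) {
        if (compareIds((string) $id, $maxStoredId) > 0) {
            return false; // a newer post is on this page
        }
    }
    return true;
}
```

The crawl loop could end as soon as `pageIsAlreadyCrawled` returns true for a page, with `$maxStoredId` read from the `post` table before the run starts.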
[1]:http://segmentfault.com
[2]:http://symfony.com/doc/current/components/dom_crawler.html
[3]:https://github.com/sinopex/self-learning-project/tree/master/segmentfault