引言
團(tuán)隊(duì)最近經(jīng)常需要分析一些網(wǎng)站數(shù)據(jù)钮惠,需要從多個(gè)數(shù)據(jù)網(wǎng)站去手動(dòng)復(fù)制數(shù)據(jù)到 Excel 里面,這種重復(fù)勞動(dòng)且沒(méi)有意義的體力活應(yīng)該交給機(jī)器去干鸟赫,釋放出人的勞動(dòng)力去干更有意思的事强缘,所以有了學(xué)習(xí)采集方法的這篇文章添忘。開(kāi)源的采集庫(kù)有 python 的 scraper买决,java 的 selenium沛婴,ruby 的 watir,nodejs 的 puppeteer督赤,golang 的 chromedp⌒何茫基于快速上手入門(mén)就選擇了 puppeteer躲舌,備選是 chromedp,因?yàn)槿粘J鞘褂?golang 開(kāi)發(fā)項(xiàng)目性雄。
目錄
- 環(huán)境搭建
- 網(wǎng)頁(yè)截屏 demo
- terminal 運(yùn)行 script 采集目標(biāo)數(shù)據(jù)
- web 服務(wù)化運(yùn)行 script 采集目標(biāo)數(shù)據(jù)
- 總結(jié)
- 了解更多
1没卸、環(huán)境搭建
# mac terminal 運(yùn)行
# 安裝homebrew,配置國(guó)內(nèi)鏡像的參考 https://mirrors.ustc.edu.cn/help/brew.git.html
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# 等待 node 安裝完成
$ brew install node
# 查看版本
$ node -v
$ npm -v
# 安裝puppeteer 環(huán)境
$ npm i puppeteer
2秒旋、網(wǎng)頁(yè)截屏 demo
// https://github.com/puppeteer/puppeteer/blob/main/examples/screenshot.js
"use strict";
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("http://example.com");
await page.screenshot({ path: "example.png" });
await browser.close();
})();
mac 下面直接運(yùn)行可能會(huì)提示瀏覽器版本問(wèn)題约计,需要指定下載對(duì)應(yīng)的版本才能運(yùn)行起來(lái)。所以修改之后的代碼
// 版本號(hào)查找鏈接 http://omahaproxy.appspot.com/
const puppeteer = require("puppeteer");
const browserFetcher = puppeteer.createBrowserFetcher();
// 下載指定版本的chrome瀏覽器迁筛,下載完成之后返回chrome瀏覽器對(duì)象
browserFetcher.download("809590").then(async (res) => {
// options 參數(shù)見(jiàn)
// https://zhaoqize.github.io/puppeteer-api-zh_CN/#?product=Puppeteer&version=v8.0.0&show=api-class-puppeteer
const options = {
executablePath: res.executablePath, // chrome執(zhí)行路徑
headless: true, // 瀏覽器無(wú)頭模式煤蚌,后臺(tái)運(yùn)行,false 會(huì)打卡瀏覽器
defaultViewport: {
width: 1800,
height: 768,
},
args: ["--start-maximized"],
};
puppeteer.launch(options).then(async (browser) => {
const page = await browser.newPage();
await page.goto("http://www.baidu.com");
await page.screenshot({ path: "baidu.png" });
await browser.close();
});
});
# 運(yùn)行上面的腳本文件
$ node test.js
$ ls baidu.png
至此已經(jīng) puppeteer 入門(mén)了。
3尉桩、terminal 運(yùn)行 script 采集目標(biāo)數(shù)據(jù)
接下來(lái)就是開(kāi)始針對(duì)團(tuán)隊(duì)需要分析的數(shù)據(jù)采集了筒占,先熟悉 puppeteer api 文檔,主要熟悉 Page蜘犁、JSHandle翰苫、以及 ElementHandle 對(duì)象,下面的代碼會(huì)經(jīng)常用到這 3 個(gè) api这橙。
除了上面的 3 個(gè)常用對(duì)象之外奏窑,還要熟悉 chrome 開(kāi)發(fā)者工具的 api,知道怎么去查找 dom 節(jié)點(diǎn)的路徑屈扎。下圖是直接打開(kāi) chrome 開(kāi)發(fā)者工具埃唯,在 Elements 面板里面,選擇要查找的 dom 節(jié)點(diǎn)上右擊彈出菜單助隧。
<img src="https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/7960a891dae9454f9eaccd3ab36b70c7~tplv-k3u1fbpfcp-zoom-1.image" width="30%">
具體幾個(gè)功能見(jiàn) Console Utilities API reference筑凫。
下面就開(kāi)始實(shí)戰(zhàn),要想拿到目標(biāo)網(wǎng)站的數(shù)據(jù)并村,需要 2 個(gè)步驟巍实,賬號(hào)登陸和打開(kāi)指定網(wǎng)頁(yè)。登陸的代碼是通過(guò)腳本登陸之后把 cookie 保存到本地文件里面哩牍,然后采集的腳本就可以直接載入 cookie 文件直接使用棚潦,這樣避免了沒(méi)有身份的問(wèn)題。這個(gè)登陸腳本的弊端是無(wú)非解決無(wú)頭模式下面的掃碼登陸膝昆。
直接上代碼丸边,登陸代碼腳本如下,
// login.js
const puppeteer = require("puppeteer");
const fs = require("fs").promises;
const fs2 = require("fs");
const puppeteerNode = puppeteer;
const browserFetcher = puppeteerNode.createBrowserFetcher();
browserFetcher.download("809590").then((res) => {
let options = {
executablePath: res.executablePath, //chrome執(zhí)行路徑
headless: false, //瀏覽器無(wú)頭模式
defaultViewport: {
width: 1800,
height: 768,
},
args: ["--start-maximized"],
};
puppeteer.launch(options).then(async (browser) => {
let cookies = {};
const page = await browser.newPage();
if (fs2.existsSync("./cookies.json")) {
const cookiesString = await fs.readFile("./cookies.json");
let cookies = JSON.parse(cookiesString);
await page.setCookie(...cookies);
await page.goto("https://dy.mock.com/login?routerstr=workbench");
} else {
await page.goto("https://dy.mock.com/login?routerstr=workbench");
await page.waitForSelector("#app > div > div.bg_login > div > div > div.of_hidden > div");
const tabs = await page.$$("#app > div > div.bg_login > div > div > div.of_hidden > div");
await tabs[1].click().then(async (res) => {
await page.waitForSelector("#input-msg > div > input");
await page.waitForSelector("#app > div > div > div > div > div > div > form > div.form_item.pointer");
const inputs = await page.$$("#input-msg > div > input");
await inputs[0].type("賬號(hào)");
await inputs[1].type("密碼");
await page.click(
"#app > div > div > div > div > div > div > form > div.form_item.pointer"
);
////有頭模式下面荚孵,需要等待時(shí)間妹窖,以便掃碼登陸操作完成
// await page.waitForTimeout(5000);
cookies = await page.cookies();
await fs.writeFile("./cookies.json", JSON.stringify(cookies, null, 2));
});
}
browser.close();
});
});
$ node login.js
$ ls cookies.json
下面是抓取頁(yè)面內(nèi)容的腳本代碼。
// grab.js
const puppeteer = require("puppeteer");
const fs = require("fs").promises;
const browserFetcher = puppeteer.createBrowserFetcher();
browserFetcher.download("809590").then((res) => {
puppeteer
.launch({
executablePath: res.executablePath, //chrome執(zhí)行路徑
headless: false, //瀏覽器無(wú)頭模式
})
.then(async (browser) => {
const page = await browser.newPage();
await page.setViewport({ width: 1800, height: 768 });
// 缺少登陸驗(yàn)證收叶,默認(rèn)已經(jīng)執(zhí)行過(guò)上面的登陸腳本
const cookiesString = await fs.readFile("./cookies.json");
let cookies = JSON.parse(cookiesString);
await page.setCookie(...cookies);
await page.goto("https://dy.fake.com/kol_list/kol_list");
await page.waitForSelector("table");
await page.waitForSelector("#app > div > div.main_view > div > div:nth-child(1) > div:nth-child(1) > div > div > div > div > div");
const tabs = await page.$$("#app > div > div.main_view > div > div:nth-child(1) > div:nth-child(1) > div > div > div > div > div");
for (let [i, tab] of tabs.entries()) {
const tabName = await (await tab.getProperty("innerText")).jsonValue();
tab.click().then(async (res) => {
await page.waitForSelector("table");
const result = await page.$$eval("table", (tables) => {
let trs = tables[2].children[1].children;
let t = [];
let csv ="名稱,粉絲總數(shù),粉絲質(zhì)量,中位點(diǎn)贊數(shù),中位評(píng)論數(shù),中位分享數(shù),指數(shù)\n";
for (const tr of trs) {
let name = tr.children[2].innerText;
let fans = tr.children[3].innerText.replace(",", "");
let fansQ = tr.children[4].innerText.replace(",", "");
let likeAvg = tr.children[5].innerText.replace(",", "");
let commentAvg = tr.children[6].innerText.replace(",", "");
let shareAvg = tr.children[7].innerText.replace(",", "");
let index = tr.children[8].innerText.replace(",", "");
let tmp = [name,fans,fansQ,likeAvg,commentAvg,shareAvg,index];
t.push({name: name,fans: fans,q: fansQ,likeAvg: likeAvg,commentAvg: commentAvg,shareAvg: shareAvg,index: index});
csv += tmp.join(",") + "\n";
}
return [t, csv];
});
await fs.writeFile("./" + i + "-" + tabName + "-index.csv",result[1]);
});
await page.waitForTimeout(3000);
}
});
});
# 運(yùn)行腳本
$ node grab.js
$ ls -la *.csv
可以看一下抓取到的csv文件內(nèi)容
<image src="https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/6fba17f2e7204db990544c053683d9f4~tplv-k3u1fbpfcp-zoom-1.image" width="30%">
至此 terminal 爬取數(shù)據(jù)的部分就結(jié)束了骄呼。
4、web 服務(wù)運(yùn)行腳本采集數(shù)據(jù)
web 服務(wù)運(yùn)行腳本這部分代碼很簡(jiǎn)單判没,就是把上面的 terminal 的代碼包裝一下蜓萄,通過(guò) web 服務(wù) 對(duì)外訪問(wèn)提供服務(wù),這樣可以通過(guò)瀏覽器直接打開(kāi)網(wǎng)頁(yè)進(jìn)行操作澄峰,無(wú)需開(kāi) terminal 去運(yùn)行一些命令嫉沽。
web 服務(wù)這塊,我直接選擇了 eggjs 框架搭建業(yè)務(wù)邏輯代碼俏竞,不用再去寫(xiě)一些 http 相關(guān)的代碼绸硕。
代碼如下
'use strict';
const Controller = require('egg').Controller;
const path = require('path');
const puppeteer = require('puppeteer');
const fs = require('fs').promises;
const fs2 = require('fs');
const archiver = require('archiver');
class HomeController extends Controller {
async index() {
const { ctx } = this;
await ctx.render('home/index.tpl');
}
async login() {
const { ctx } = this;
const v = await this.puppeteerLogin(ctx);
await ctx.render('home/login.tpl', { v });
}
async grab() {
const { ctx } = this;
const account = ctx.cookies.get('account');
if (account === '' || account === undefined) {
return ctx.redirect('/');
}
const puppeteerNode = puppeteer;
const browserFetcher = puppeteerNode.createBrowserFetcher();
let v = await browserFetcher.download('809590').then(res => {
const options = {
executablePath: res.executablePath, // chrome執(zhí)行路徑
headless: true, // 瀏覽器無(wú)頭模式
defaultViewport: {
width: 1800,
height: 768,
},
args: [ '--start-maximized' ],
};
const v = puppeteer.launch(options)
.then(async browser => {
const page = await browser.newPage();
await page.setViewport({ width: 1800, height: 768 });
const sessionCookieDir = path.join(ctx.app.config.sessionDir, account);
const fileName = account + '-target.zip';
const publicZipFile = path.join(ctx.app.config.publicDir, fileName);
if (!fs2.existsSync(sessionCookieDir)) {
return 'pls wait seconds for login ';
}
const lockFile = sessionCookieDir + '/start.lock';
if (fs2.existsSync(lockFile)) {
console.log('pls wait seconds for done');
return 'pls wait seconds for done';
}
const zipFilePath = path.join(sessionCookieDir, fileName);
if (fs2.existsSync(publicZipFile)) {
console.log(publicZipFile);
return publicZipFile;
}
await fs.writeFile(sessionCookieDir + '/start.lock', 'lock');
const sessionDataDir = path.join(sessionCookieDir, 'data');
if (!fs2.existsSync(sessionDataDir)) {
fs2.mkdirSync(sessionDataDir, '0777', true);
}
const sessionCookiePath = path.join(sessionCookieDir, 'cookies.json');
const cookiesString = await fs.readFile(sessionCookiePath);
const cookies = JSON.parse(cookiesString);
await page.setCookie(...cookies);
await page.goto('https://dy.fake.com/kol_list/kol_list');
await page.waitForSelector('table');
await page.waitForSelector('#app > div > div.main_view > div > div:nth-child(1) > div:nth-child(1) > div > div > div > div > div');
const tabs = await page.$$('#app > div > div.main_view > div > div:nth-child(1) > div:nth-child(1) > div > div > div > div > div');
for (const [ i, tab ] of tabs.entries()) {
const tabName = await (await tab.getProperty('innerText')).jsonValue();
tab.click().then(async res => {
await page.waitForSelector('table');
const result = await page.$$eval('table', tables => {
const trs = tables[2].children[1].children;
const t = [];
let csv = '名稱,粉絲總數(shù),粉絲質(zhì)量,中位點(diǎn)贊數(shù),中位評(píng)論數(shù),中位分享數(shù),指數(shù)' + "\n";
for (const tr of trs) {
const name = tr.children[2].innerText;
const fans = tr.children[3].innerText.replace(',', '');
const fansQ = tr.children[4].innerText.replace(',', '');
const likeAvg = tr.children[5].innerText.replace(',', '');
const commentAvg = tr.children[6].innerText.replace(',', '');
const shareAvg = tr.children[7].innerText.replace(',', '');
const index = tr.children[8].innerText.replace(',', '');
const tmp = [ name, fans, fansQ, likeAvg, commentAvg, shareAvg, index ];
t.push({name,fans,fansQ,likeAvg,commentAvg,shareAvg,index});
csv += tmp.join(',') + "\n";
}
return [ t, csv ];
});
await fs.writeFile(sessionDataDir + '/' + i + '-' + tabName + '-cassindex.csv', result[1]);
});
await page.waitForTimeout(1000);
}
const output = fs2.createWriteStream(zipFilePath);
const archive = archiver('zip', {zlib: {level: 9}});
await output.on('close', function() {
console.log(archive.pointer() + ' total bytes');
console.log('archiver has been finalized and the output file descriptor has closed.');
fs.copyFile(zipFilePath, publicZipFile);
fs.rm(zipFilePath);
});
await output.on('end', function() {
console.log('Data has been drained');
});
await archive.on('error', function(err) {
throw err;
});
await archive.on('warning', function(err) {
if (err.code === 'ENOENT') {
console.log('warning---->' + err);
} else {
throw err;
}
});
await archive.pipe(output);
await archive.directory(sessionDataDir, false);
await archive.finalize();
await fs.rm(lockFile);
return publicZipFile;
});
return v;
});
if (v.indexOf('/public/') >= 0) {
v = v.replaceAll(ctx.app.config.webRootDir, '');
await ctx.render('home/download.tpl', { v });
} else {
ctx.body = v;
}
}
puppeteerLogin(ctx) {
const username = ctx.request.body.username;
const password = ctx.request.body.password;
const puppeteerNode = puppeteer;
const browserFetcher = puppeteerNode.createBrowserFetcher();
const v = browserFetcher.download('809590')
.then(res => {
const options = {
executablePath: res.executablePath, // chrome執(zhí)行路徑
headless: true, // 瀏覽器無(wú)頭模式
defaultViewport: {
width: 1800,
height: 768,
},
args: [ '--start-maximized' ],
};
const v = puppeteer.launch(options)
.then(
async browser => {
let cookies = {};
const page = await browser.newPage();
const sessionCookieDir = path.join(ctx.app.config.sessionDir, username);
if (!fs2.existsSync(sessionCookieDir)) {
fs2.mkdirSync(sessionCookieDir, '0777', true);
}
const sessionCookiePath = path.join(sessionCookieDir, 'cookies.json');
if (fs2.existsSync(sessionCookiePath)) {
const cookiesString = await fs.readFile(sessionCookiePath);
const cookies = JSON.parse(cookiesString);
await page.setCookie(...cookies);
await page.goto('https://dy.fake.com/login?routerstr=workbench');
} else {
await page.goto('https://dy.fake.com/login?routerstr=workbench');
await page.waitForSelector('#app > div > div.bg_login > div > div > div.of_hidden > div');
const tabs = await page.$$('#app > div > div.bg_login > div > div > div.of_hidden > div');
await tabs[1].click()
.then(async res => {
await page.waitForSelector('#input-msg > div > input');
await page.waitForSelector('#app > div > div > div > div > div > div > form > div.form_item.pointer');
const inputs = await page.$$('#input-msg > div > input');
await inputs[0].type(username);
await inputs[1].type(password);
await page.click('#app > div > div > div > div > div > div > form > div.form_item.pointer');
await page.waitForTimeout(1500);
cookies = await page.cookies();
await fs.writeFile(sessionCookiePath, JSON.stringify(cookies, null, 2));
});
}
await browser.close();
ctx.cookies.set('account', username);
return 'success';
}
);
return v;
});
return v;
}
}
module.exports = HomeController;
web 服務(wù)部署參見(jiàn) eggjs 官網(wǎng)文檔堂竟,根據(jù)步驟部署完之后,啟動(dòng)服務(wù)可能會(huì)遇到一些問(wèn)題臣咖,在 linux 服務(wù)器下面安裝 puppeteer 涉及一些庫(kù)依賴跃捣,這些錯(cuò)誤根據(jù)具體提示,直接 Google 一下應(yīng)該能解決夺蛇。環(huán)境安裝完之后運(yùn)行代碼還是依然會(huì)遇到代碼問(wèn)題疚漆,這是因?yàn)樵?linux 下面上面的代碼需要做出調(diào)整,把瀏覽器啟動(dòng)的參數(shù)里面改成
args: [ '--start-maximized', '--no-sandbox', '--disable-setuid-sandbox'],
關(guān)閉沙箱模式之后刁赦,再啟動(dòng)服務(wù)就可以正常運(yùn)行了娶聘。
5、總結(jié)
在 puppeteer 入門(mén)過(guò)程中甚脉,遇到各種問(wèn)題丸升,節(jié)點(diǎn)查找,節(jié)點(diǎn)文本提取牺氨,循環(huán)遍歷節(jié)點(diǎn)等等狡耻,這些通過(guò)不斷的輸出調(diào)試以及查找 api 和 google 搜索,至此把遇到的問(wèn)題都給解決了猴凹,雖然都解決了問(wèn)題夷狰,但是代碼還是不夠完善的,只能是跑起來(lái)的一個(gè) demo郊霎,還需要持續(xù)優(yōu)化代碼的沼头。
對(duì)于 nodejs 使用的不多,在遇到異步回調(diào)的會(huì)忘記等待返回或者沒(méi)有執(zhí)行 promoise 的 callback书劝,導(dǎo)致寫(xiě)這塊的代碼比較慢进倍,要查找 api 再來(lái)寫(xiě)代碼。接下來(lái)要再系統(tǒng)的學(xué)一下nodejs的知識(shí)购对。
代碼里涉及到的抓取網(wǎng)頁(yè)鏈接地址被打碼了猾昆,無(wú)法正常訪問(wèn)的。需要的同學(xué)可以通過(guò)了解更多聯(lián)系獲取骡苞。
項(xiàng)目里相關(guān)鏈接
- puppeteer
- puppeteer api: 中文API
- chrome devtools: Console Utilities API reference
- eggjs
- chromedp
-
risk of spider
6毡庆、了解更多
原文鏈接:Puppeteer 入門(mén)