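1. vmgirls
vmgirls is a personally maintained photo album site.
It collects portrait photo sets from Tuchong, POCO, NetEase Photography, NetEase LOFTER, and similar photography communities.
Most images are high-resolution, around 2000px, so the site works well both for collecting material and for casual browsing.
The site structure is very simple, so either plain regular expressions or a parsing library will do.
This article scrapes it with the cheerio library.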
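For the regular-expression route mentioned above, a minimal sketch might look like the following. The markup in `sampleHtml` and the helper name `extractLinks` are assumptions made for illustration, not taken from the site. Attribute order and quoting on the real pages may differ, which is one reason a parser such as cheerio is the more robust choice.

```typescript
// Hypothetical helper: pull href/title pairs out of anchor tags with a regex.
interface AlbumLink {
  href: string
  title: string
}

function extractLinks(html: string): AlbumLink[] {
  const links: AlbumLink[] = []
  // Assumes double-quoted attributes, with href appearing before title
  const pattern = /<a\s[^>]*href="([^"]*)"[^>]*title="([^"]*)"[^>]*>/g
  let match: RegExpExecArray | null
  while ((match = pattern.exec(html)) !== null) {
    links.push({ href: match[1], title: match[2] })
  }
  return links
}

// Made-up markup, for demonstration only
const sampleHtml = `
<div class="media"><a href="/1.html" title="Album A"></a></div>
<div class="media"><a href="/2.html" title="Album B"></a></div>`

console.log(extractLinks(sampleHtml))
```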
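2. Analysis
On the home page, the nav bar, the directory, and the categories are all cluttered and incomplete.
I settled on the "featured topics" page instead.
It splits all albums into just two categories, and further albums load as you scroll down.
Inspecting the page shows that it is rendered in batches.
When the page opens, it initially renders eight albums.
Clicking "load more" requests more data from the server and then updates the page.
Watch the requests in the browser's developer tools (F12).
Each click on "load more" sends one POST request
carrying four parameters.
Comparing two requests suggests that the parameters that determine the returned data are query and paged.
Try requesting the data with an ajax helper: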
Request URL: https://www.vmgirls.com/wp-admin/admin-ajax.php
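// In Node.js, XMLHttpRequest comes from the "xmlhttprequest" package (see the full listing)
function ajax(data) {
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest()
    xhr.open("post", "https://www.vmgirls.com/wp-admin/admin-ajax.php")
    xhr.setRequestHeader(
      "Content-Type",
      "application/x-www-form-urlencoded; charset=UTF-8"
    )
    // Register the handlers before sending the request
    xhr.onload = function () {
      if (xhr.status === 200) {
        resolve(xhr.responseText)
      } else {
        reject(new Error(`unexpected status ${xhr.status}`))
      }
    }
    // Without this, a network failure would leave the promise pending forever
    xhr.onerror = function () {
      reject(new Error("network error"))
    }
    xhr.send(data)
  })
}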
The response matches expectations: one batch of data covering 8 albums.
const page = 1
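// options: 輕私房 / 小姐姐
const category = "小姐姐"
// paged takes the page number (the original snippet referenced an undefined variable i here)
const data = `append=list-archive&paged=${page}&action=ajax_load_posts&query=${encodeURI(category)}&page=tax`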
page determines how many albums are downloaded:
albums = page × 8
category = 小姐姐 / 輕私房
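The relationship between page, category, and the request bodies can be sketched as below. `buildBodies` and `ALBUMS_PER_PAGE` are names introduced here for illustration. The field names and values are the ones from the request shown earlier, and `URLSearchParams` is used only as one convenient way to form-encode them.

```typescript
// Eight albums per response, as observed in the test request
const ALBUMS_PER_PAGE = 8

// Hypothetical helper: one form-encoded body per page, for pages 1..pages
function buildBodies(pages: number, category: string): string[] {
  const bodies: string[] = []
  for (let paged = 1; paged <= pages; paged++) {
    const params = new URLSearchParams({
      append: "list-archive",
      paged: String(paged),
      action: "ajax_load_posts",
      query: category, // percent-encoded by URLSearchParams
      page: "tax",
    })
    bodies.push(params.toString())
  }
  return bodies
}

const bodies = buildBodies(3, "小姐姐")
// Upper bound on the number of albums these requests can return
console.log(bodies.length * ALBUMS_PER_PAGE)
```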
Use the cheerio library: pass in a selector to capture the elements.
const $ = cheerio.load(res)
const $arr = $(".col-6 .list-item .media a")
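The img elements are not selected here: testing showed that the rendered src attribute of img differs from the one in the page source,
presumably because some anti-scraping JavaScript runs on the page.
The data-src attribute does match, but in my tests the webp images it points to could not be written out via axios arraybuffer.
So the a elements are captured instead.
All that remains is to send the required ajax requests to get the albums,
then download the images of each album.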
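One way to turn an album's image links into download targets is sketched below. `planDownload` is a hypothetical helper, and the sample href is invented. It resolves each href against the site root with `new URL` instead of string concatenation, and it names the local file `<title>_<n><ext>` inside a per-album folder.

```typescript
import * as path from "path"

// Hypothetical helper: resolve an image href against the site root and
// derive a local file path of the form images/<title>/<title>_<n><ext>.
function planDownload(
  title: string,
  index: number,
  href: string,
  base = "https://www.vmgirls.com/"
): { url: string; file: string } {
  // new URL() accepts both relative and absolute hrefs
  const url = new URL(href, base)
  // Take the extension from the URL path so a query string cannot leak into it
  const ext = path.posix.extname(url.pathname)
  const file = path.posix.join("images", title, `${title}_${index + 1}${ext}`)
  return { url: url.href, file }
}

// Invented href, for demonstration only
console.log(planDownload("Album A", 0, "image/2020/01/a.jpeg?x=1"))
```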
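3. Results
4. Implementation
The functions are already wrapped up.
Change the global variables page and category,
then run tsc followed by node.
Notes
- A POST request sent with axios did not return the expected result
- A small number of detail pages use a ul li layout, which this code does not cover
- The vmgirls server is somewhat slow and the images are large, so consider raising the axios timeout
- webp images could not be downloaded via axios arraybuffer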
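Besides raising the timeout, a slow server can also be handled by retrying failed requests. The sketch below is an assumption layered on top of the article's code, not part of it: `withRetry` is a hypothetical helper that wraps any promise-returning function.

```typescript
// Hypothetical helper: retry an async task a fixed number of times,
// waiting a little longer before each new attempt.
async function withRetry<T>(
  task: () => Promise<T>,
  attempts = 3,
  delayMs = 500
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task()
    } catch (error) {
      lastError = error
      if (attempt < attempts) {
        // Linear backoff: delayMs, then 2 * delayMs, and so on
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs * attempt))
      }
    }
  }
  // Every attempt failed: surface the last error to the caller
  throw lastError
}
```

For example, wrapping a download as `withRetry(() => axios({ url, timeout: 5000, responseType: "arraybuffer" }))` would try it up to three times before giving up.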
const XMLHttpRequest = require("xmlhttprequest").XMLHttpRequest
const cheerio = require("cheerio")
const axios = require("axios")
const fs = require("fs")
const path = require("path")
const chalk = require('chalk')
type IvmGirls = () => Promise<void>
type IgetDataArr = () => Array<string>
type IgetDetailArr = (dataArr: string[]) => Promise<Array<object>>
type IgetImgArr = (detailArr: any[]) => Promise<Array<object>>
// downloadPic and writeFile are asynchronous, so their types return promises
type IdownloadPic = (imgArr: any[]) => Promise<void>
type Iajax = (data: string) => Promise<string>
type IwriteFile = (path: string, content: any) => Promise<void>
const page = 1
// options: 輕私房 / 小姐姐
const category = "小姐姐"
const vmGirls: IvmGirls = async () => {
  // Build the ajax request bodies
  const dataArr = getDataArr()
  // Collect the detail page URLs
  const detailArr = await getDetailArr(dataArr)
  // Collect the image URLs
  const imgArr = await getImgArr(detailArr)
  // Download the images and wait for them to finish
  await downloadPic(imgArr)
}
const getDataArr: IgetDataArr = () => {
  const dataArr = []
  for (let i = 1; i <= page; i++) {
    const data = `append=list-archive&paged=${i}&action=ajax_load_posts&query=${encodeURI(
      category
    )}&page=tax`
    dataArr.push(data)
  }
  return dataArr
}
const getDetailArr: IgetDetailArr = async (dataArr) => {
  const detailArr = []
  for (let i = 0; i < dataArr.length; i++) {
    try {
      const res = await ajax(dataArr[i])
      const $ = cheerio.load(res)
      const $arr = $(".col-6 .list-item .media a")
      for (let j = 0; j < $arr.length; j++) {
        const detailHref = $arr[j].attribs["href"]
        const title = $arr[j].attribs["title"]
        const obj = {
          href: detailHref,
          title
        }
        detailArr.push(obj)
      }
      console.log(`ajax paged ${i + 1} Successful\n`)
    } catch (error) {
      console.log(`ajax paged ${i + 1} Failure\n`)
      continue
    }
  }
  return detailArr
}
const getImgArr: IgetImgArr = async (detailArr) => {
  const arr = []
  for (let i = 0; i < detailArr.length; i++) {
    try {
      const res = await axios({
        url: detailArr[i].href,
        timeout: 3000,
      })
      const $ = cheerio.load(res.data)
      const $arr = $("div.nc-light-gallery p a")
      const imgArr = []
      const title = $arr[0].attribs["alt"]
      for (let j = 0; j < $arr.length; j++) {
        const imgUrl = 'https://www.vmgirls.com/' + $arr[j].attribs["href"]
        imgArr.push(imgUrl)
      }
      const obj = {
        title,
        url: imgArr,
      }
      arr.push(obj)
      console.log(chalk.green(`${detailArr[i].title} atlas get success\n`))
    } catch (error) {
      console.log(chalk.red(`${detailArr[i].title} atlas get failure\n`))
    }
  }
  return arr
}
const downloadPic: IdownloadPic = async (imgArr) => {
  for (let atlas of imgArr) {
    const title = atlas.title
    // Create ./images/<title> synchronously so it exists before any file is written
    fs.mkdirSync(`./images/${title}`, { recursive: true })
    for (let i in atlas.url) {
      const extname = path.extname(atlas.url[i])
      try {
        const res = await axios({
          url: atlas.url[i],
          timeout: 2000,
          responseType: "arraybuffer",
        })
        await writeFile(`./images/${title}/${title}_${+i + 1}${extname}`, res.data)
        console.log(chalk.green(`${title} (${+i + 1}/${atlas.url.length}) Download Successful\n`))
      } catch (error) {
        console.log(chalk.red(`${title} (${+i + 1}/${atlas.url.length}) Download Failure\n`))
      }
    }
  }
}
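const ajax: Iajax = (data) => {
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest()
    xhr.open("post", "https://www.vmgirls.com/wp-admin/admin-ajax.php")
    xhr.setRequestHeader(
      "Content-Type",
      "application/x-www-form-urlencoded; charset=UTF-8"
    )
    // Register the handlers before sending the request
    xhr.onload = function () {
      if (xhr.status === 200) {
        resolve(xhr.responseText)
      } else {
        reject(new Error(`unexpected status ${xhr.status}`))
      }
    }
    // Without this, a network failure would leave the promise pending forever
    xhr.onerror = function () {
      reject(new Error("network error"))
    }
    xhr.send(data)
  })
}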
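const writeFile: IwriteFile = (path, content) => {
  return new Promise<void>((resolve, reject) => {
    fs.writeFile(path, content, (err: Error | null) => {
      // Reject on failure so the caller's try/catch sees the error instead of waiting forever
      if (err) reject(err)
      else resolve()
    })
  })
}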
vmGirls()
export { }