I like to read in my spare time, and friends occasionally ask me to scrape novels for them, so now that things have quieted down I'll briefly write up how I put this crawler together.
-
First, import the required modules:
import requests
from lxml import etree
-
How to install these modules:
# quick install with pip
pip install requests
pip install lxml
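Optionally, a quick sanity check that both libraries import in the interpreter you plan to run the script with (the version numbers will vary):
python -c "import requests, lxml.etree; print(requests.__version__, lxml.etree.__version__)"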
-
Send a request to the site and get the page data
As shown in the screenshot, the red box marks this novel's index page: https://www.xbiquge.la/7/7194/
From that URL we can fetch the page data:
def book():
    url = "https://www.xbiquge.la/7/7194/"
    response = requests.get(url)
    response.encoding = 'utf-8'  # the page's encoding; change to 'gbk' if the site serves gbk
    html = etree.HTML(response.text)
-
Note
The page encoding can be checked by opening the browser's developer tools, as shown in the screenshot:
-
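If you would rather not dig through the developer tools, requests can also guess the encoding from the response body. A minimal sketch (the guess is heuristic, so treat it as a hint rather than the final word):

resp = requests.get("https://www.xbiquge.la/7/7194/")
print(resp.encoding)                     # encoding declared in the HTTP headers, may be missing or wrong
print(resp.apparent_encoding)            # encoding guessed from the response body
resp.encoding = resp.apparent_encoding   # apply the guess before reading resp.text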
How to get the chapter URLs and chapter names
The red boxes in the screenshot mark the chapter links and chapter names on the index page, so both pieces of information can be extracted from it.
XPath syntax
To quickly locate the elements you need, you can use the Chrome browser extension
XPath Helper
-
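If you want to try the XPath expressions without the browser plugin, here is a minimal sketch run against a made-up HTML fragment (the fragment is invented for illustration; the real index page is what the next snippet targets):

from lxml import etree

sample = '''
<div id="list">
  <dl>
    <dd><a href="/7/7194/1.html">Chapter 1</a></dd>
    <dd><a href="/7/7194/2.html">Chapter 2</a></dd>
  </dl>
</div>
'''
doc = etree.HTML(sample)
print(doc.xpath('//div[@id="list"]/dl/dd/a/@href'))   # ['/7/7194/1.html', '/7/7194/2.html']
print(doc.xpath('//div[@id="list"]/dl/dd/a/text()'))  # ['Chapter 1', 'Chapter 2']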
The screenshot shows XPath Helper in action; the same expressions then go into the script:
def book():
    url = "https://www.xbiquge.la/7/7194/"
    response = requests.get(url)
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
    url_list = html.xpath('//div[@id="list"]/dl/dd/a/@href')
    name_list = html.xpath('//div[@id="list"]/dl/dd/a/text()')
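Note that the @href values come back as site-relative paths, which is why the later code prefixes the domain by hand. If you prefer not to hard-code the prefix, urljoin from the standard library builds the absolute URL; a small sketch (the sample path is a placeholder, real values come from url_list above):

from urllib.parse import urljoin

base = "https://www.xbiquge.la/7/7194/"
sample_href = '/7/7194/123.html'   # placeholder; real hrefs come from the XPath above
print(urljoin(base, sample_href))  # https://www.xbiquge.la/7/7194/123.html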
-
Get the chapter text
def book():
    url = "https://www.xbiquge.la/7/7194/"
    response = requests.get(url)
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
    url_list = html.xpath('//div[@id="list"]/dl/dd/a/@href')
    name_list = html.xpath('//div[@id="list"]/dl/dd/a/text()')
    for ur, na in zip(url_list, name_list):
        res = requests.get(f'https://www.xbiquge.la{ur}')  # request each chapter page
        res.encoding = 'utf-8'
        res_html = etree.HTML(res.text)
        info = res_html.xpath('//div[@id="content"]/text()')
-
Finally, write the chapter text to a file and the scraper is done.
The complete code:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
from lxml import etree


def book():
    url = "https://www.xbiquge.la/7/7194/"
    response = requests.get(url)
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
    url_list = html.xpath('//div[@id="list"]/dl/dd/a/@href')
    name_list = html.xpath('//div[@id="list"]/dl/dd/a/text()')
    fp = open("修真聊天群.txt", 'w', encoding='utf-8')  # write the output file as UTF-8
    for ur, na in zip(url_list, name_list):
        res = requests.get(f'https://www.xbiquge.la{ur}')  # request the chapter page
        res.encoding = 'utf-8'
        res_html = etree.HTML(res.text)
        info = res_html.xpath('//div[@id="content"]/text()')
        fp.write(f'{na}\n\n')
        print(f'{na}__{ur}')  # show progress: current chapter name and link
        for i in info:
            i = i.replace('\xa0', '').replace('\n\n', '\n')  # strip non-breaking spaces, tidy line breaks
            if i == '\r':
                continue
            fp.write(i)  # write the chapter body to the file
        fp.write('\n\n')
    fp.close()


if __name__ == '__main__':
    book()
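The script is fine for a quick one-off grab, but the site can be slow or may reject rapid-fire requests. A hedged sketch of two easy hardening steps, a browser-like User-Agent plus a simple retry-and-wait wrapper (the header string, timeout and delay are arbitrary choices, not anything the site documents):

import time
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # arbitrary browser-like UA string

def fetch(url, retries=3):
    # Hypothetical helper: retry requests.get a few times, pausing between attempts.
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10)
            resp.encoding = 'utf-8'
            return resp
        except requests.RequestException as exc:
            print(f'retry {attempt + 1}/{retries} for {url}: {exc}')
            time.sleep(1)
    raise RuntimeError(f'giving up on {url}')

Inside book(), the per-chapter call res = requests.get(f'https://www.xbiquge.la{ur}') could then become res = fetch(f'https://www.xbiquge.la{ur}'), with an extra time.sleep(1) at the end of each loop iteration to stay polite.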