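I am very fond of the Zhuangzi (《莊子》), especially two of its stories, "Dumb as a Wooden Chicken" (呆若木雞) and "Cook Ding Butchers an Ox" (庖丁解牛), which reveal three levels of attainment in how one conducts oneself and handles affairs. I wanted to download a plain-text copy to reread in the near future. A search showed that the version on 華語網 (thn21.com) is of high quality, but it is split into one page per chapter, and I did not feel like copying it over piece by piece, so I decided to scrape it with Python.
1. Scrape the links to each chapter from the page shown below (https://www.thn21.com/wen/Famous/5609.html).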
import requests, re
from bs4 import BeautifulSoup
novelname = ''
names = []
urls = []
req = requests.get(url = 'https://www.thn21.com/wen/Famous/5609.html')
req.encoding = 'gb2312'
html = req.text
div_bf = BeautifulSoup(html, "html.parser")
novelname = re.search(r'(.+?)簡介', div_bf.title.string).group(1)
a = div_bf.body.select('[href^="/wen/famous/hdnj/zuangzi"]')
for each in a:
    names.append(each.string)  # chapter title
    # href already starts with '/', so the base has no trailing slash
    urls.append('https://www.thn21.com' + each.get('href'))  # chapter link
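Because each href matched above is root-relative (it starts with `/`), joining it to the site root by hand is easy to get subtly wrong, for example by doubling the slash. The following is a minimal sketch using the standard library's `urllib.parse.urljoin`. The helper name `chapter_url` and the chapter path in the example are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

BASE = 'https://www.thn21.com/'

def chapter_url(href, base=BASE):
    """Resolve a (possibly root-relative) href against the site root."""
    return urljoin(base, href)

# Hypothetical href, for illustration only
example = chapter_url('/wen/famous/hdnj/zuangzi1.html')
print(example)
```

With `urljoin`, the result is the same whether or not the base ends in a slash.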
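2. Fetch the content of each chapter link in turn.
Note that the chapter text contains Traditional Chinese characters, so it has to be decoded with gb18030. The gb2312 codec covers essentially only Simplified characters, whereas gb18030 is a superset that also includes the Traditional forms.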
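To see why the codec choice matters, here is a small sketch that runs without any network access. It encodes a phrase containing the Traditional character 雞 and then tries to decode the bytes with each codec. The helper `can_decode` is a hypothetical name introduced only for this illustration.

```python
def can_decode(raw, encoding):
    """Return True if raw bytes decode cleanly with the given codec."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# 雞 is a Traditional character that gb2312 does not define
raw = '呆若木雞'.encode('gb18030')

print(can_decode(raw, 'gb2312'))   # False
print(can_decode(raw, 'gb18030'))  # True
```

Passing `'ignore'` as the error handler, as the scraping code does, suppresses such errors instead of raising them, so picking the wider codec is what actually keeps the text intact.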
for i in range(len(a)):
    # Append each chapter to <novelname>.txt
    with open(novelname + '.txt', 'a', encoding='utf-8') as f:
        f.write(' ' + names[i] + '\n')
        req = requests.get(url = urls[i])
        # Decode as gb18030 so Traditional characters survive
        html = req.content.decode('gb18030', 'ignore')
        bf = BeautifulSoup(html, "html.parser")
        texts = ""
        # .string is None for <p> tags with nested markup, so those are skipped
        for p in bf.body.find_all('p'):
            if p.string:
                texts += " " + p.string + "\n"
        f.write(texts)
        f.write('\n\n')
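One thing to keep in mind with the loop above: the file is opened in append mode, so running the script a second time adds another full copy of the book to the same file. A possible alternative, sketched below, is to collect the chapters first and write the file once in `'w'` mode. The function `write_book` and the sample data are hypothetical stand-ins, not part of the original script.

```python
import os
import tempfile

def write_book(path, chapters):
    """Write (title, paragraphs) pairs to path, overwriting any old copy."""
    with open(path, 'w', encoding='utf-8') as f:
        for title, paragraphs in chapters:
            f.write(' ' + title + '\n')
            for text in paragraphs:
                f.write(' ' + text + '\n')
            f.write('\n\n')

# Placeholder data standing in for scraped chapters
sample = [('Chapter A', ['first paragraph', 'second paragraph']),
          ('Chapter B', ['only paragraph'])]

out_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
write_book(out_path, sample)
write_book(out_path, sample)  # a second run replaces the file instead of appending
```

Keeping the writing step separate from the fetching step also makes it possible to check the output format without touching the network.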