Today we'll use BeautifulSoup to scrape Qiushibaike: http://www.qiushibaike.com/
The fields to grab are: author, age, joke content, number of laughs, and number of comments.
Main idea: use BeautifulSoup to pull the data out of the page, then save it to a local CSV file.
Next, a quick look at how to use BeautifulSoup. First of all, you have to import the bs4 library; you then build a soup object from the page's HTML and query it with CSS selectors.
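A minimal sketch of that workflow (the HTML fragment below is made up purely for illustration; the 'lxml' parser also requires the lxml package to be installed):

    from bs4 import BeautifulSoup

    # A tiny made-up HTML fragment, just to show the select() workflow
    html = '<div class="author"><h2>someone</h2></div>'
    soup = BeautifulSoup(html, 'lxml')  # parse the HTML with the lxml parser
    print(soup.select('div.author > h2')[0].get_text())  # prints: someone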
Here is the full code:
import requests
from bs4 import BeautifulSoup
import csv

# Send a browser-like User-Agent, otherwise the site may block the request
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
headers = {'User-Agent': user_agent}

html = requests.get('http://www.qiushibaike.com', headers=headers).content
soup = BeautifulSoup(html, 'lxml')

# Each joke lives in its own container div; pulling the fields out of
# each div (rather than zipping page-wide selections) keeps the rows
# aligned even when a field, such as the age, is missing from a post
divs = soup.select('.article.block.untagged.mb15')

# Collect the scraped rows in a list
rows = []
for div in divs:
    author = div.select_one('div > a > h2')
    age = div.select_one('div.author.clearfix > div')
    content = div.select_one('a > div.content > span')
    laugh = div.select_one('div.stats > span > i')
    comment = div.select_one('div.stats > span > a > i')
    rows.append({
        'author': author.get_text().strip() if author else '',
        'age': age.get_text().strip() if age else '不知道',  # anonymous posts have no age
        'content': content.get_text().strip() if content else '',
        'laugh': laugh.get_text().strip() if laugh else '0',
        'comment': comment.get_text().strip() if comment else '0',
    })

# Save the list of rows to a local CSV file; the with block
# closes the file automatically, so no explicit close() is needed
csv_name = ['author', 'age', 'content', 'laugh', 'comment']
with open('qiubai.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=csv_name)
    writer.writeheader()
    writer.writerows(rows)
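To sanity-check the output, the file can be read straight back with csv.DictReader (a quick sketch, assuming qiubai.csv was just written by the code above):

    import csv

    # Read the freshly written CSV back and print a few fields per row
    with open('qiubai.csv', newline='', encoding='utf-8') as csvfile:
        for row in csv.DictReader(csvfile):
            print(row['author'], row['laugh'], row['comment'])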
Result: the scraped rows end up in the local file qiubai.csv.
Summary:
1. Scraping with BeautifulSoup feels much more convenient than the earlier regex approach, and there is an even handier method later in the series; see the next post, 爬蟲基礎_03——xpath (http://www.reibang.com/p/0abf49d3816b);
2. The data here is stored in a CSV file, but once the dataset grows large this approach has its limits; other storage methods will be introduced later, one of which is sketched below.
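As one example of such an alternative (not covered in this post, just a hedged sketch): Python's built-in sqlite3 module copes with larger datasets better than a flat CSV. This reuses the rows list built above; the qiubai.db filename is an arbitrary choice:

    import sqlite3

    # Sketch: store the scraped rows (the `rows` list of dicts from the
    # code above) in a local SQLite database instead of a CSV file
    conn = sqlite3.connect('qiubai.db')
    conn.execute('CREATE TABLE IF NOT EXISTS jokes '
                 '(author TEXT, age TEXT, content TEXT, laugh TEXT, comment TEXT)')
    conn.executemany(
        'INSERT INTO jokes VALUES (:author, :age, :content, :laugh, :comment)',
        rows)
    conn.commit()
    conn.close()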