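1 Environment
Windows 7 x64
Python 3.7
2 Workflow
i) Import the libraries;
ii) Fetch the page's HTML source;
iii) Extract the fields of interest and save them to a txt file;
iv) Set the limits on the crawl and on when the code runs.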
3 Code
3.1 Set up the libraries (requests and BS4)
Input
import requests
from bs4 import BeautifulSoup
Output
Imports the libraries the crawler depends on.
3.2 Fetch the page source
Input
def download_page(url):  # download a page and return its HTML as text
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"}
    r = requests.get(url, headers=headers)
    return r.text
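Output
Defines the download helper: it sends a browser-like User-Agent header, requests the page, and returns the response body as text.
Note:
A top-level function definition (def) should be preceded by two blank lines (PEP 8, the Python style guide).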
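The helper above returns whatever the server sends back, including error pages, and it waits indefinitely if the server never answers. A minimal sketch of a more defensive variant follows, assuming a 10-second timeout is acceptable. The name download_page_checked and the get parameter are hypothetical additions for illustration (the latter only makes the function easy to exercise without network access); timeout and raise_for_status() are standard requests features.

```python
import requests

# Same browser-like header as in the tutorial's download_page.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) "
                  "Gecko/20100101 Firefox/61.0"
}


def download_page_checked(url, timeout=10, get=requests.get):
    """Fetch url and return its text; raise on HTTP errors or timeouts.

    `get` is a hypothetical injection point so the function can be
    tried with a stand-in instead of a real network call.
    """
    r = get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return r.text
```

Whether to fail loudly or skip a bad page is a design choice; raising keeps a half-broken response from reaching the parser.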
3.3 Extract specific fields and save them
Input
def get_content(html, page):
    output = """Page {} Author: {} Gender: {} Age: {} Votes: {} Comments: {}\n{}\n------------\n"""
    soup = BeautifulSoup(html, 'html.parser')
    con = soup.find(id='content-left')
    con_list = con.find_all('div', class_="article")
    for i in con_list:
        author = i.find('h2').string  # author name
        content = i.find('div', class_='content').find('span').get_text()  # post text
        stats = i.find('div', class_='stats')
        vote = stats.find('span', class_='stats-vote').find('i', class_='number').string  # vote count
        comment = stats.find('span', class_='stats-comments').find('i', class_='number').string  # comment count
        author_info = i.find('div', class_='articleGender')  # author's age and gender
        if author_info is not None:  # registered (non-anonymous) user
            class_list = author_info['class']
            if "womenIcon" in class_list:
                gender = 'female'
            elif "manIcon" in class_list:
                gender = 'male'
            else:
                gender = ''
            age = author_info.string  # age
        else:  # anonymous user
            gender = ''
            age = ''

        save_txt(output.format(page, author, gender, age, vote, comment, content))
Output
Extracts from the page source: page number, author details (name, gender, age), vote count, comment count, and post text.
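To see what the selectors in get_content expect, it can help to run them against a small hand-written fragment. The sketch below is illustrative only: SAMPLE_HTML is hypothetical markup that mimics the ids and class names the tutorial code looks for, not a copy of the real site, and parse_posts is a hypothetical variant that returns dictionaries instead of writing to a file.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the ids/classes used by get_content.
SAMPLE_HTML = """
<div id="content-left">
  <div class="article">
    <h2>alice</h2>
    <div class="articleGender womenIcon">28</div>
    <div class="content"><span>First post.</span></div>
    <div class="stats">
      <span class="stats-vote"><i class="number">120</i></span>
      <span class="stats-comments"><i class="number">7</i></span>
    </div>
  </div>
  <div class="article">
    <h2>anonymous</h2>
    <div class="content"><span>Second post.</span></div>
    <div class="stats">
      <span class="stats-vote"><i class="number">3</i></span>
      <span class="stats-comments"><i class="number">0</i></span>
    </div>
  </div>
</div>
"""


def parse_posts(html):
    """Return one dict per post found under #content-left."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id="content-left")
    posts = []
    for article in container.find_all("div", class_="article"):
        stats = article.find("div", class_="stats")
        info = article.find("div", class_="articleGender")
        # Anonymous posts have no articleGender block at all.
        classes = info["class"] if info is not None else []
        if "womenIcon" in classes:
            gender = "female"
        elif "manIcon" in classes:
            gender = "male"
        else:
            gender = ""
        posts.append({
            "author": article.find("h2").get_text(strip=True),
            "gender": gender,
            "age": info.get_text(strip=True) if info is not None else "",
            "votes": stats.find("span", class_="stats-vote")
                          .find("i", class_="number").get_text(strip=True),
            "comments": stats.find("span", class_="stats-comments")
                             .find("i", class_="number").get_text(strip=True),
            "content": article.find("div", class_="content")
                              .find("span").get_text(strip=True),
        })
    return posts


posts = parse_posts(SAMPLE_HTML)
```

Returning data rather than writing inside the loop keeps parsing and saving separate, which makes the selectors easier to check when the site's markup changes.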
Input
def save_txt(*args):
    for i in args:
        with open('qiushibaike.txt', 'a', encoding='utf-8') as file:
            file.write(i)
Output
Saves the scraped information to a txt file.
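Because save_txt opens the file in append mode ('a'), every call adds to the end of the file, and running the script twice leaves two copies of the output. The sketch below demonstrates that behavior in a temporary directory. save_lines is a hypothetical variant that opens the file once per call rather than once per argument; the file name demo.txt is likewise made up for the example.

```python
import os
import tempfile


def save_lines(path, *chunks):
    """Append every chunk to path, opening the file a single time."""
    with open(path, "a", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(chunk)


def demo():
    """Write twice to the same file and return its final contents."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")
        save_lines(path, "first run\n")
        save_lines(path, "second run\n")  # appended, not overwritten
        with open(path, encoding="utf-8") as f:
            return f.read()


result = demo()
```

If a fresh file per run is preferred, one option is to delete or truncate the output file before the crawl starts.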
3.4 Limits
Input
def main():
    for i in range(1, 11):
        url = 'https://qiushibaike.com/text/page/{}'.format(i)
        html = download_page(url)
        get_content(html, i)
Output
Limits the crawl to pages 1-10 (page 10 included).
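The upper bound is 11 because range() excludes its end value. A small sketch of the same URL construction, with the page limits pulled out as parameters; the names BASE_URL and page_urls are hypothetical and used only for illustration.

```python
# URL pattern taken from the tutorial's main() function.
BASE_URL = "https://qiushibaike.com/text/page/{}"


def page_urls(first=1, last=10):
    """Build the list of page URLs from first to last, inclusive."""
    # range() stops before its end argument, hence last + 1.
    return [BASE_URL.format(n) for n in range(first, last + 1)]


urls = page_urls()
```

Here urls holds ten entries, from page 1 through page 10.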
Input
if __name__ == '__main__':
    main()
Output
When the .py file is run directly, the block executes; when the file is imported as a module, it does not.
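The guard works because Python sets __name__ to '__main__' only for the file being executed, and to the module's own name otherwise. The sketch below shows both cases using the standard runpy module, which runs a file under a chosen __name__. The tiny module source and the name guard_demo are made up for the demonstration.

```python
import os
import runpy
import tempfile

# A throwaway module that records whether its main() was called.
MODULE_SOURCE = '''
RAN_MAIN = False


def main():
    global RAN_MAIN
    RAN_MAIN = True


if __name__ == '__main__':
    main()
'''


def demo():
    """Return (ran_when_imported, ran_when_executed) for the module."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "guard_demo.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(MODULE_SOURCE)
        # Simulate an import: __name__ is the module name, guard is False.
        imported = runpy.run_path(path, run_name="guard_demo")
        # Simulate direct execution: __name__ is '__main__', guard is True.
        executed = runpy.run_path(path, run_name="__main__")
        return imported["RAN_MAIN"], executed["RAN_MAIN"]


ran_when_imported, ran_when_executed = demo()
```

In practice this means the crawler's functions can be imported into another script or an interactive session without starting a crawl.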
4 Full listing
The complete code is as follows:
# -*- coding: utf-8 -*-
###############################################################################
# Crawler
# Author: Lenox
# Date: 2019.03.22
# License: BSD 3.0
###############################################################################
# Set up the libraries
import requests
from bs4 import BeautifulSoup


# Fetch a page and return the response body as text (r.text)
def download_page(url):  # download a page
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0"}
    r = requests.get(url, headers=headers)
    return r.text


# Extract the fields of interest from one page of results
def get_content(html, page):
    output = """Page {} Author: {} Gender: {} Age: {} Votes: {} Comments: {}\n{}\n------------\n"""
    soup = BeautifulSoup(html, 'html.parser')
    con = soup.find(id='content-left')
    con_list = con.find_all('div', class_="article")
    for i in con_list:
        author = i.find('h2').string  # author name
        content = i.find('div', class_='content').find('span').get_text()  # post text
        stats = i.find('div', class_='stats')
        vote = stats.find('span', class_='stats-vote').find('i', class_='number').string  # vote count
        comment = stats.find('span', class_='stats-comments').find('i', class_='number').string  # comment count
        author_info = i.find('div', class_='articleGender')  # author's age and gender
        if author_info is not None:  # registered (non-anonymous) user
            class_list = author_info['class']
            if "womenIcon" in class_list:
                gender = 'female'
            elif "manIcon" in class_list:
                gender = 'male'
            else:
                gender = ''
            age = author_info.string  # age
        else:  # anonymous user
            gender = ''
            age = ''
        save_txt(output.format(page, author, gender, age, vote, comment, content))


# Save the scraped content to a txt file
def save_txt(*args):
    for i in args:
        with open('qiushibaike.txt', 'a', encoding='utf-8') as file:
            file.write(i)


# Limit the crawl to pages 1-10 (page 10 included). Beautiful Soup could also
# be used to read the total number of pages from the pager at the bottom of the page.
def main():
    for i in range(1, 11):
        url = 'https://qiushibaike.com/text/page/{}'.format(i)
        html = download_page(url)
        get_content(html, i)


# When the .py file is run directly, this block executes; when the file is
# imported as a module, it does not.
if __name__ == '__main__':
    main()
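
The comment above main() mentions that Beautiful Soup could read the total page count from the bottom of the page instead of hard-coding 10. A minimal sketch of that idea follows, under a loud assumption: the pager markup in SAMPLE_PAGER is hypothetical, since the original does not show the site's real pagination HTML. The only thing taken from the tutorial is the /text/page/N URL pattern. The function name last_page is also made up.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical pager markup; the real site's structure may differ.
SAMPLE_PAGER = """
<ul class="pagination">
  <li><a href="/text/page/1">1</a></li>
  <li><a href="/text/page/2">2</a></li>
  <li><a href="/text/page/13">13</a></li>
  <li><a href="/text/page/2"><span class="next">Next</span></a></li>
</ul>
"""


def last_page(html, default=1):
    """Return the highest page number linked via /text/page/N, else default."""
    soup = BeautifulSoup(html, "html.parser")
    numbers = []
    for link in soup.find_all("a", href=True):
        match = re.search(r"/text/page/(\d+)", link["href"])
        if match:
            numbers.append(int(match.group(1)))
    return max(numbers, default=default)


total = last_page(SAMPLE_PAGER)
```

With a value like total in hand, the loop in main() could run over range(1, total + 1) rather than a fixed range(1, 11).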