Taking the "Li Yi" Tieba as an example, let's write a small program that crawls the forum pages and saves them locally. The code below runs under Python 3; under Python 2, response.content is already a str and needs no decoding, so simply remove the decode() call. For the specific differences, see the article "04 Small differences of the requests module between Python 2 and Python 3".
Also note that under Python 2 the encoding parameter in the save method must be removed, as noted in the code comments.
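To make the Python 3 behavior concrete, here is a minimal sketch of the decode step, using a literal bytes value to stand in for an actual response.content (no request is sent):

```python
# In Python 3, requests' response.content is bytes and must be decoded to str.
raw = "李毅吧".encode("utf-8")  # stands in for response.content (bytes)
text = raw.decode()            # decode() defaults to UTF-8
print(text)
```

In Python 2, response.content is already a str, which is why the decode() call must be dropped there.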
# coding=utf-8
import requests


class TiebaSpider:
    def __init__(self, tieba_name):
        self.tieba_name = tieba_name
        # URL template: the pn parameter advances by 50 per page
        self.temp_url = 'https://tieba.baidu.com/f?kw=' + tieba_name + '&pn={}'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
        }

    def get_url_list(self):  # build the list of page URLs
        return [self.temp_url.format(i * 50) for i in range(1000)]

    def parse_url(self, url):  # send the request and get the response
        print('now parse', url)
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def save_html(self, html, page_num):  # save the html locally
        file_path = self.tieba_name + "_" + str(page_num) + ".html"
        # On Windows, pass encoding='utf-8' because the default encoding is gbk.
        # Under Python 2, open() has no encoding parameter, so remove it there.
        with open(file_path, "w", encoding='utf-8') as f:
            f.write(html)
        print("saved", file_path)

    def run(self):
        # 1. build the url list
        url_list = self.get_url_list()
        # 2. send a request for each url and get the response
        for page_num, url in enumerate(url_list, start=1):
            html_str = self.parse_url(url)
            # 3. save each page, numbered from 1
            self.save_html(html_str, page_num)


if __name__ == '__main__':
    tieba = TiebaSpider("李毅")
    tieba.run()
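As a quick offline sanity check of the pagination pattern that get_url_list builds, the snippet below reproduces just the URL template (an ASCII keyword is used so the URLs are readable; no request is sent):

```python
# Tieba's pn parameter advances in steps of 50 (50 posts per page),
# so page 1 is pn=0, page 2 is pn=50, and so on.
temp_url = 'https://tieba.baidu.com/f?kw=' + 'python' + '&pn={}'
url_list = [temp_url.format(i * 50) for i in range(3)]
for url in url_list:
    print(url)
```

When the keyword contains Chinese characters, requests percent-encodes it automatically on sending; no manual quoting is needed for this simple case.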
Running the code saves each page locally; the results are shown in the figure below.