After learning Python, I wanted to build a small project to put what I learned into practice. Below is a simple web scraper.
Goals:
1. Scrape jokes from Qiushibaike
2. Save the jokes to a local file
First, let's analyze the Qiushibaike page.
Each joke block starts with a div whose class is author clearfix; the divs below it hold the content, the username, the vote count, and so on. In this example we extract the username, the content, and the vote count. Pulling these pieces out of the HTML calls for regular expressions, and Python's re module handles that nicely.
pattern = re.compile('<div class="author.*?>.*?<h2>(.*?)</h2>.*?<span>(.*?)</span>.*?<span class="stats.*? class="number">(.*?)</i>', re.S)
items = re.findall(pattern, response.text)
The two lines above parse the pieces we need out of the HTML. Each (.*?) is a capture group, i.e., one piece of content we want to keep.
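As a quick check of how the groups behave, here is a minimal sketch that runs the same pattern over a made-up HTML fragment which only mimics the structure described above (the user name, joke text, and vote count are invented for illustration):

import re

# Hypothetical fragment shaped like one joke block (invented for illustration)
html = '''<div class="author clearfix">
<h2>someuser</h2>
<div class="content"><span>a short joke</span></div>
<span class="stats-vote"><i class="number">123</i></span>'''

pattern = re.compile('<div class="author.*?>.*?<h2>(.*?)</h2>.*?<span>(.*?)</span>.*?<span class="stats.*? class="number">(.*?)</i>', re.S)

# findall returns one tuple per joke, with one element per (.*?) group
print(re.findall(pattern, html))
# [('someuser', 'a short joke', '123')]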
Fetching the page itself only takes requests. Here is part of the code.
import re
import requests

# url = "http://www.qiushibaike.com/hot/"
url = "http://www.qiushibaike.com/8hr/page/2/?s=4975313"
session = requests.session()
agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
host = "www.qiushibaike.com"
headers = {
    'Host': host,
    'User-Agent': agent
}
response = session.get(url, headers=headers)
pattern = re.compile('<div class="author.*?>.*?<h2>(.*?)</h2>.*?<span>(.*?)</span>.*?<span class="stats.*? class="number">(.*?)</i>', re.S)
items = re.findall(pattern, response.text)
for item in items:
    print("author=" + item[0] + "\n" + "article=" + item[1] + "\n" + "vote=" + item[2])
Complete code:
import re
import requests
import time
import random


class QSBK:
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
        self.host = "www.qiushibaike.com"
        self.headers = {
            'Host': self.host,
            'User-Agent': self.user_agent
        }
        self.stories = []
        self.session = requests.session()
        self.enable = False

    # Fetch the HTML of one listing page
    def getPage(self, pageIndex):
        url = "http://www.qiushibaike.com/8hr/page/" + str(pageIndex) + "/?s=4975313"
        try:
            response = self.session.get(url, headers=self.headers)
            if response.text:
                return response.text
        except requests.RequestException:
            print("network wrong")
        return None

    # Parse one page into a list of [author, content, votes] entries
    def getPageItems(self, pageIndex):
        pageStories = []
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print("Failed to fetch the page!")
            return pageStories
        pattern = re.compile('<div class="author.*?>.*?<h2>(.*?)</h2>.*?<span>(.*?)</span>.*?<span class="stats.*? class="number">(.*?)</i>', re.S)
        items = re.findall(pattern, pageCode)
        for item in items:
            pageStories.append([item[0], item[1], item[2]])
        return pageStories

    # Load and parse a page, then add its jokes to the list
    def loadPage(self):
        # If fewer than 2 unread pages are left, load a new one
        if self.enable:
            if len(self.stories) < 2:
                # Fetch a new page
                pageStories = self.getPageItems(self.pageIndex)
                # Store that page's jokes in the shared list
                if pageStories:
                    self.stories.append(pageStories)
                    # Move the page index forward so the next load reads the next page
                    self.pageIndex += 1

    def getOneStory(self, pageStories, page):
        for story in pageStories:
            game = input()
            self.loadPage()
            if game == "Q":
                self.enable = False
                return
            # Save the joke to a file; mode 'a' appends to the existing content
            with open('qsbks.txt', 'a') as f:
                f.write(story[0] + '\n' + story[1] + '\n' + story[2] + '\n\n')
            print("Page %s\tAuthor: %s\t\n%s\nVotes: %s" % (page, story[0], story[1], story[2]))

    def start(self):
        print("Reading Qiushibaike")
        self.enable = True
        self.loadPage()
        nowPage = 0
        while self.enable:
            if len(self.stories) > 0:
                pageStories = self.stories[0]
                nowPage += 1
                del self.stories[0]
                self.getOneStory(pageStories, nowPage)


spider = QSBK()
spider.start()
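When the script runs, press Enter to show the next joke and type Q followed by Enter to stop; every joke displayed is also appended to qsbks.txt in the working directory, which covers the second goal.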