Preface
As I understand it, writing a crawler breaks down into the following steps:
- Analyze the target website
- Request a single page URL and fetch its source
- Extract the data
- Save the data
- Crawl the remaining pages
Now, on to the main part.
1. Analyze the target website
- The target is Jianshu's seven-day trending list, http://www.reibang.com/trending/weekly. The data to extract: author, title, view count, comment count, like count, and reward count.
(screenshot: the fields to extract)
- Inspecting the page with Chrome DevTools shows the list is loaded via Ajax. Following the requests reveals the URL http://www.reibang.com/trending/weekly?page=1, with page running from 1 to 5.
(screenshot: the URL pattern in DevTools)
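To make the pagination rule concrete, here is a minimal sketch (plain Python, assuming nothing beyond the URL pattern above) that builds the five page URLs:

base = 'http://www.reibang.com/trending/weekly?page={}'
urls = [base.format(i) for i in range(1, 6)]  # pages 1 through 5
for u in urls:
    print(u)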
2. Request a single page URL and fetch its source
- Set the URL
url = 'http://www.reibang.com/trending/weekly?page=1'
- Set the request headers (used to disguise the request as a browser; optional for this page)
import urllib2  # Python 2; the complete code at the end switches to Python 3
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
request = urllib2.Request(url=url, headers=headers)
- Send the request and receive the response
html = urllib2.urlopen(request)
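The snippet above uses Python 2's urllib2. The complete code at the end of this post runs on Python 3, so for reference here is a minimal Python 3 sketch of the same request (same URL and headers; only the import changes):

# Python 3 equivalent of the urllib2 snippet above
from urllib.request import Request, urlopen

url = 'http://www.reibang.com/trending/weekly?page=1'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
request = Request(url=url, headers=headers)
html = urlopen(request)  # html.read() returns the raw bytes of the page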
3. Extract the data from the page source
# First run the response through BeautifulSoup so it can be parsed below
bsObj = BeautifulSoup(html.read(), 'lxml')
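A quick sanity check here (my own habit, not part of the original flow) confirms the parse worked and the article blocks are actually present before extracting anything:

# Each article on the trending page sits inside a <div class="content"> block
items = bsObj.findAll("div", {"class": "content"})
print(len(items))  # if this prints 0, the selector or the page structure is off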
- Pull each article's block out of the source and extract the target data (the code is rough, but it works)
(screenshot: the HTML of one article block)
items = bsObj.findAll("div", {"class": "content"})
for item in items:
    author = item.find("a", {"class": "blue-link"}).get_text()
    title = item.find("a", {"class": "title"}).get_text()
    other = item.find("div", {"class": "meta"}).get_text()
    pattern = re.compile(r'(\d+)')
    content = re.findall(pattern, other)
    view = content[0]
    comment = content[1]
    like = content[2]
    money = content[3] if (len(content) == 4) else 0  # not rigorous; a stopgap for now
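The positional indexing above raises an IndexError if the meta line ever carries fewer than three numbers. A slightly more defensive sketch (the helper name parse_meta is mine; the field order view/comment/like/reward is still just the assumption the original code makes):

import re

def parse_meta(other):
    # Pull the view/comment/like/reward counts out of the meta text,
    # padding with '0' so a missing reward count cannot raise IndexError.
    numbers = re.findall(r'\d+', other)
    view, comment, like, money = (numbers + ['0'] * 4)[:4]
    return view, comment, like, money

# e.g. parse_meta("1234  56  78")  ->  ('1234', '56', '78', '0')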
4. Save the data
with open('articlesOfSevenDays.csv', 'a') as resultFile:
    wr = csv.writer(resultFile, dialect='excel')
    wr.writerow([author, title, view, comment, like, money])
因?yàn)橛龅骄幋a問題七扰,所以添加以下代碼
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
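Note that reload(sys)/setdefaultencoding only exists on Python 2. Under Python 3, which the complete code below uses, the same problem is handled by giving the file an explicit encoding when it is opened; a minimal sketch with a made-up sample row:

import csv

row = ['some author', 'some title', '1234', '56', '78', '0']  # sample row, same column order as above

# Python 3: declare the encoding up front; newline='' stops the csv module
# from writing blank lines between rows on Windows.
with open('articlesOfSevenDays.csv', 'a', encoding='utf-8', newline='') as resultFile:
    csv.writer(resultFile).writerow(row)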
5. Crawl the remaining pages
for i in range(1, 6):
    print "Crawling page {}...".format(i)
    url = 'http://www.reibang.com/trending/weekly?page={}'.format(i)
    # repeat the extraction and saving code from above
The complete code
#!/usr/bin/env python
# coding=utf-8
from urllib.request import Request, urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re
import csv
import os

def getHTML(i):
    # Fetch page i of the weekly trending list and return its article blocks
    url = 'http://www.reibang.com/trending/weekly?page={}'.format(i)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
    try:
        request = Request(url=url, headers=headers)
        html = urlopen(request)
        bsObj = BeautifulSoup(html.read(), 'lxml')
        items = bsObj.findAll("div", {"class": "content"})
    except HTTPError as e:
        print(e)
        exit()
    return items

def getArticleInfo(items):
    # Extract author, title and the numeric counts from each article block
    articleInfo = []
    for item in items:
        author = item.find("a", {"class": "blue-link"}).get_text()
        title = item.find("a", {"class": "title"}).get_text()
        other = item.find("div", {"class": "meta"}).get_text()
        pattern = re.compile(r'(\d+)')
        content = re.findall(pattern, other)
        view = content[0]
        comment = content[1]
        like = content[2]
        money = content[3] if (len(content) == 4) else 0  # not rigorous
        articleInfo.append([author, title, view, comment, like, money])
    return articleInfo

dir = "../jianshu/"
if not os.path.exists(dir):
    os.makedirs(dir)
# newline='' keeps the csv module from writing blank lines between rows on Windows
csvFile = open("../jianshu/jianshuSevenDaysArticles.csv", "wt", encoding='utf-8', newline='')
writer = csv.writer(csvFile)
writer.writerow(("author", "title", "view", "comment", "like", "money"))
try:
    for i in range(1, 6):
        items = getHTML(i)
        articleInfo = getArticleInfo(items)
        for item in articleInfo:
            writer.writerow(item)
finally:
    csvFile.close()
The crawl results
(screenshot: the resulting CSV)
Summary
- My page-parsing skills are still weak; next on the list to learn: regular expressions, BeautifulSoup, lxml
- The encoding problems I ran into still need more study
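As a starting point for the lxml item above: with the lxml parser installed, BeautifulSoup also accepts CSS selectors through select()/select_one(), which makes the extraction in this post a little tidier. A rough sketch, still assuming the same class names used throughout:

from bs4 import BeautifulSoup

def extract(html_text):
    # Yield (author, title, meta_text) for each article block in the page source
    soup = BeautifulSoup(html_text, 'lxml')
    for block in soup.select('div.content'):  # one block per article
        author = block.select_one('a.blue-link').get_text(strip=True)
        title = block.select_one('a.title').get_text(strip=True)
        meta = block.select_one('div.meta').get_text(" ", strip=True)
        yield author, title, meta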