I like listening to audiobooks on my commute, so I'm often hunting for resources on Ximalaya. Finding a good program isn't easy: the Ximalaya site lets you browse by category, but you can't sort by like count or search through comment content, which is inconvenient. So I wrote a crawler in Python to grab the metadata for every sound along with all of its comments, loaded everything into a database, and did the analysis there. That way, whatever kind of resource I'm after, I can aggregate and filter directly on the sound and comment data, and the answer is obvious at a glance.
Implementation flow diagram
urllib, requests, and selenium
Page fetching uses urllib and urllib2. Compared with requests and selenium they felt faster and more stable here; when I tried requests earlier I ran into trouble handling some https URLs. Selenium lets you automate the browser without working out each page's logic yourself, but for this volume of data it is far too slow.
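As a quick illustration, this is the basic fetch-and-parse pattern the whole script is built on (a minimal Python 2 sketch; the XPath is the same selector the full script uses to pull album links off a listing page):

# Python 2: fetch a listing page with urllib and parse it with lxml
import urllib
from lxml import etree

html = urllib.urlopen('http://www.ximalaya.com/dq/music/').read()
tree = etree.HTML(html)
album_urls = tree.xpath("//div[@class='albumfaceOutter']/a/@href")
for album_url in album_urls:
    print album_url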
Multithreading and queues
There are two subclasses of threading.Thread. The Ximalaya class parses sound albums: it extracts the sound entries inside each album and works out the comment-list URL for each sound. The CommentDown class is dedicated to fetching and saving comment content. A sound usually has many comments, sometimes over a thousand, so CommentDown gets 10 threads for comments while Ximalaya gets 3 threads for album and sound parsing.
Two queues are used: one for album URLs with a maximum size of 100, and one for comment URLs with a maximum size of 200. When a queue is full, the producer blocks until earlier items have been consumed, which throttles the whole crawl and keeps the requests from being so frequent that the site blocks us.
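Stripped of the parsing details, the thread-and-queue wiring looks roughly like this (a sketch only; Worker is a placeholder standing in for the two real classes shown in the full source below):

# Python 2: bounded queues throttle the crawl, worker threads block on get()
import threading
from Queue import Queue

queue_in = Queue(100)   # album urls
queue = Queue(200)      # comment page urls

class Worker(threading.Thread):
    def __init__(self, q):
        threading.Thread.__init__(self)
        self.q = q
        self.setDaemon(True)
    def run(self):
        while True:
            url = self.q.get()   # blocks until an item is available
            # ... fetch and parse url here ...
            self.q.task_done()

for _ in range(3):
    Worker(queue_in).start()     # album parsers
for _ in range(10):
    Worker(queue).start()        # comment downloaders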
Data storage
MongoDB is used for storage; a NoSQL store copes well with this kind of concurrent writing and, for this job, is faster and simpler than a relational database. The music collection holds the sound metadata (album name, album URL, sound URL, duration, like count, and so on), and the bookcomment collection holds the comment content.
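With pymongo a record is just a dict insert; a minimal sketch with made-up sample values (the field names match the ones the script actually stores):

# Python 2 + pymongo: one document per sound, one per comment
from pymongo import MongoClient

client = MongoClient()
db = client.test
db.music.insert({               # sound metadata, sample values only
    'album_title': u'Sample album',
    'url': 'http://www.ximalaya.com/...',
    'playcount': '1000',
    'likecount': '50',
})
db.bookcomment.insert({         # comment content, sample values only
    'music_id': '123',
    'content': u'Sample comment',
    'url': 'http://www.ximalaya.com/sounds/123/comment_list?page=1',
})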
斷點續(xù)傳来庭、重復(fù)處理
遇到中途中斷后要繼續(xù)執(zhí)行妒蔚,還得考慮下斷點續(xù)傳,這里處理得比較簡單粗暴月弛,在Ximalaya處理聲音的時候肴盏,會先判斷數(shù)據(jù)庫是否有聲音的地址,如果存在就是跳過不再處理帽衙,在CommentDown處理評論的時候菜皂,判斷判斷數(shù)據(jù)庫是否有聲音的地址,如果存在就是跳過不再處理厉萝,這樣對于后面比較費時處理部分都可以直接跳過恍飘,也不會存在有重復(fù)的數(shù)據(jù)會影響最后的分析的問題。
Exception handling
Parsing pages can hit timeouts and similar errors, so exception handling was added. socket.setdefaulttimeout(20) sets a global 20-second timeout; anything slower raises an error, which is then caught. A retry counter is kept as well, and once a problem page has been retried the allowed number of times it is dropped, so a sound page that has been deleted doesn't keep failing forever.
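The retry logic boils down to the global timeout plus a bounded retry loop around the failing call; a sketch (retry_fetch is a made-up helper, the real script keeps a counter per method instead):

# Python 2: 20s global socket timeout plus a capped retry loop
import socket
import time
import urllib

socket.setdefaulttimeout(20)    # any request slower than 20s raises an exception

def retry_fetch(url, retries=3):
    for attempt in range(retries):
        try:
            return urllib.urlopen(url).read()
        except Exception as e:
            print 'fetch %s failed (%s), retry %s/%s' % (url, e, attempt + 1, retries)
            time.sleep(15)
    return None                  # give up, e.g. the sound page has been deleted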
Data analysis
最后對保存到數(shù)據(jù)庫的數(shù)據(jù)進行分析博杖,做分析的時候Nosql做關(guān)聯(lián)分析太痛苦了椿胯,完全不如sql查詢方便,于是把數(shù)據(jù)導(dǎo)入到Oracle來進行的分析剃根,根據(jù)評論內(nèi)容中的關(guān)鍵字來標(biāo)識判斷(例如:“點贊”哩盲,“好聽”,“太棒“之類的都判斷成受歡迎)狈醉,最后再用匯總統(tǒng)計出結(jié)果廉油。
Top 20 audiobooks with the most positive comments:
Top 10 variety shows with the most positive comments:
Top 20 music programs with the most positive comments:
Programs whose Mandarin gets the most praise in comments:
最后源代碼也附上了,有興趣的朋友可以自己看看:
爬蟲源代碼
#coding:utf-8
import urllib
from lxml import etree
import re
from pymongo import MongoClient
import socket
import threading
from Queue import Queue
import time
import random
import sys
import traceback

socket.setdefaulttimeout(20)
PgEr = 0
queue = Queue(200)
queue_in = Queue(100)


class Ximalaya(threading.Thread):
    def __init__(self, queue, queue_in):
        threading.Thread.__init__(self)
        self.queue = queue
        self.queue_in = queue_in
        self.host = 'http://www.ximalaya.com'
        self.client = MongoClient()
        self.db = self.client.test
        self.musicInfo = self.db.music
        self.commentInfo = self.db.bookcomment
        self.AgsEr = 0
        self.SdEr = 0
        self.ClgEr = 0
        self.headers = {'Accept-Language': 'zh-CN',
                        'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
                        'Content-Type': 'application/x-www-form-urlencoded',
                        'Host': 'www.ximalaya.com',
                        'Connection': 'Keep-Alive',
                        }

    def run(self):
        while not exit_flag.is_set():
            time.sleep(random.uniform(1, 3))
            album_url = self.queue_in.get()
            self.albumGetSounds(album_url)
            self.queue_in.task_done()
            print '*** Album done, url: %s, Queue: %s, Queue_in: %s, Page: %s ***' % (album_url, self.queue.qsize(), self.queue_in.qsize(), page)

    def albumGetSounds(self, ab_url):
        try:
            html = urllib.urlopen(ab_url).read()
            tree = etree.HTML(html)
            sound_urls = tree.xpath("//div[@class='miniPlayer3']/a/@href")
            album_title = tree.xpath("//div[@class='detailContent_title']/h1")[0].text
            if sound_urls:
                for sound_url in sound_urls:
                    sound_url = self.host + sound_url
                    #print 'Sound url is: ', sound_url
                    # If the sound url is already in the database, skip parsing it again
                    exists_flag = self.Musicinfo_find(sound_url)
                    if not exists_flag:
                        print '>>> Sound put %s, Queue: %s, Queue_in: %s, Page: %s' % (sound_url, self.queue.qsize(), self.queue_in.qsize(), page)
                        self.soundpage(ab_url, album_title, sound_url)
                    else:
                        pass
                        #print 'Sound %s already exists, goto next >>>' % sound_url
        except Exception as e:
            print '***albumGetSounds error, error: %s, album url: %s' % (e, ab_url)
            traceback.print_exc()
            self.AgsEr += 1
            if self.AgsEr < 3:
                print 'time sleep 15s.'
                time.sleep(15)
                self.albumGetSounds(ab_url)

    def soundpage(self, a_url, a_title, s_url):
        try:
            html = urllib.urlopen(s_url).read()
            tree = etree.HTML(html)
            title = tree.xpath("//div[@class='detailContent_title']/h1")[0].text
            music_type = tree.xpath("//div[@class='detailContent_category']/a")[0].text
            tags = tree.xpath("//div[@class='tagBtnList']/a[@class='tagBtn2']/span")
            tagString = ','.join(i.text for i in tags)
            playcount = tree.xpath("//div[@class='soundContent_playcount']")[0].text
            likecount = tree.xpath("//a[@class='likeBtn link1 ']/span[@class='count']")[0].text
            commentcount = tree.xpath("//a[@class='commentBtn link1']/span[@class='count']")[0].text
            forwardcount = tree.xpath("//a[@class='forwardBtn link1']/span[@class='count']")[0].text
            mp3duration = tree.xpath("//div[@class='sound_titlebar']/div[@class='fr']/span[@class='sound_duration']")[0].text
            username = tree.xpath("//div[@class='username']")[0].text
            username = username.split()[0]
            track_id = re.search(r'track_id="(\d+)"', html)
            track_id = track_id.group(1) if track_id else None
            comment_url = 'http://www.ximalaya.com/sounds/' + track_id + '/comment_list'
            #print likecount, commentcount, forwardcount, mp3duration, username, track_id, comment_url
            info_sound = {}
            info_sound['album_title'] = a_title
            info_sound['album_url'] = a_url
            info_sound['title'] = title
            info_sound['music_type'] = music_type
            info_sound['tags'] = tagString
            info_sound['music_id'] = track_id
            info_sound['playcount'] = playcount
            info_sound['likecount'] = likecount
            info_sound['commentcount'] = commentcount
            info_sound['forwardcount'] = forwardcount
            info_sound['mp3duration'] = mp3duration
            info_sound['user'] = username
            info_sound['url'] = s_url
            # Insert the sound info into the database
            self.Musicinfo_insert(info_sound)
            # Build the comment url and hand it to CommenlistGet
            self.CommenlistGet(comment_url)
        except Exception as e:
            print '***soundpage error: %s, Album url: %s, sound url: %s' % (e, a_url, s_url)
            traceback.print_exc()
            self.SdEr += 1
            if self.SdEr < 3:
                print 'Time sleep 15s'
                time.sleep(15)
                self.soundpage(a_url, a_title, s_url)

    def CommenlistGet(self, url):
        try:
            html = urllib.urlopen(url).read()
            tree = etree.HTML(html)
            pages = tree.xpath("//div[@class='pagingBar_wrapper']/a/text()")
            # If there are multiple comment pages, take the max page number and queue every page
            if pages:
                pages = [int(i) for i in pages if i.isdigit()]  # keep only the numeric entries
                max_page = max(pages)
                for i in xrange(1, max_page + 1):
                    comment_url = url + '?page=%s' % i
                    # If the comment url is already in the database, do not queue it again
                    com_count = self.CommentInfo_find(comment_url)
                    if not com_count:
                        self.queue.put(comment_url)
                        #print 'Queue size is: %s' % self.queue.qsize()
                        #print 'Comment url %s put in queue' % comment_url
                    else:
                        #pass
                        print 'Comment already in posts database, url: %s' % comment_url
            else:
                # If the comment url is already in the database, do not queue it again
                com_count = self.CommentInfo_find(url)
                if not com_count:
                    self.queue.put(url)
                    #print 'Queue size is: %s' % self.queue.qsize()
                    #print 'Comment url %s put in queue' % url
                else:
                    pass
                    #print 'Comment already in posts database, url: %s' % url
        except Exception as e:
            print '***CommenlistGet get %s error, error: %s' % (url, e)
            traceback.print_exc()
            self.ClgEr += 1
            if self.ClgEr < 3:
                time.sleep(15)
                self.CommenlistGet(url)

    def Musicinfo_insert(self, infoData):
        music_id = self.musicInfo.insert(infoData)
        return music_id

    def Musicinfo_find(self, url):
        count = self.musicInfo.find_one({"url": url})
        return count

    def CommentInfo_find(self, url):
        comCount = self.commentInfo.find_one({"url": url})
        return comCount


class CommentDown(threading.Thread):
    def __init__(self, queue, queue_in):
        threading.Thread.__init__(self)
        self.queue = queue
        self.queue_in = queue_in
        self.client = MongoClient()
        self.db = self.client.test
        self.PgEr = 0

    def run(self):
        while not exit_flag.is_set():
            flag = self.queue.qsize()
            #print 'flag is : %s' % flag
            if flag < 30:
                print 'Queue size < 30, Time sleep 60s.'
                time.sleep(60)
            url = self.queue.get()
            time.sleep(random.uniform(1, 3))
            self.postpage(url)
            sys.stdout.flush()
            self.queue.task_done()
            print '*** Comment done, url: %s ***, Queue: %s, Queue_In: %s, Page: %s' % (url, self.queue.qsize(), self.queue_in.qsize(), page)

    def postpage(self, post_url):
        try:
            html = urllib.urlopen(post_url).read()
            tree = etree.HTML(html)
            contents = tree.xpath('//div[@class="right"]')
            music_id = re.search(r'sounds/(\d+)/comment_list', post_url)
            music_id = music_id.group(1) if music_id else None
            for content in contents:
                user = content.xpath("div/a")[0].text
                mark = content.xpath("div/span")[0].text
                comment_list = content.xpath(".//div[@class='comment_content']/text()")
                comment_txt = comment_list[0] if comment_list else None
                reply_time = content.xpath(".//span[@class='comment_createtime']/text()")[0]
                user = user.encode('raw_unicode_escape').strip()
                mark = mark.encode('raw_unicode_escape').strip()
                comment_txt = comment_txt.encode('raw_unicode_escape').strip() if comment_txt else None
                #print user
                #print mark
                #print comment_txt
                #print reply_time
                info_comment = {}
                info_comment['music_id'] = music_id
                info_comment['user'] = user
                info_comment['mark'] = mark
                info_comment['content'] = comment_txt
                info_comment['reply_time'] = reply_time
                info_comment['url'] = post_url
                self.Post_insert(info_comment)
        except Exception as e:
            print '***postpage get %s error, error: %s' % (post_url, e)
            traceback.print_exc()
            self.PgEr += 1
            if self.PgEr < 5:
                print 'time sleep 15s.'
                time.sleep(15)
                self.postpage(post_url)

    def Post_insert(self, postdata):
        Post_data = self.db.bookcomment
        post_id = Post_data.insert(postdata)
        return post_id


# Put every album url found on a listing page into the queue_in queue
def pageGetAlbums(url):
    try:
        html = urllib.urlopen(url).read()
        tree = etree.HTML(html)
        album_urls = tree.xpath("//div[@class='albumfaceOutter']/a/@href")
        if album_urls:
            for album_url in album_urls:
                # 'Album url is: ', album_url
                global queue_in
                global queue
                queue_in.put(album_url)
                print 'Queue_in: %s, Queue: %s, Page: %s' % (queue_in.qsize(), queue.qsize(), page)
    except Exception as e:
        print '****pageGetAlbums %s get error, error: %s' % (url, e)
        traceback.print_exc()
        global PgEr
        PgEr += 1
        if PgEr < 5:
            print 'time sleep 15s'
            time.sleep(15)
            pageGetAlbums(url)


if __name__ == '__main__':
    exit_flag = threading.Event()
    exit_flag.clear()
    for i in range(3):
        xi = Ximalaya(queue, queue_in)
        xi.start()
    for i in range(10):
        downer = CommentDown(queue, queue_in)
        downer.start()
    url_host = 'http://www.ximalaya.com/dq/music/'
    # Walk the hot listing pages and put every album url into queue_in
    html = urllib.urlopen(url_host).read()
    tree = etree.HTML(html)
    pages = tree.xpath("//div[@class='pagingBar_wrapper']/a/text()")
    pages = [int(i) for i in pages if i.isdigit()]  # keep only the numeric entries
    max_page = max(pages)
    for page in xrange(1, max_page + 1):
        print 'Page No: %s' % page
        url = '%s%s' % (url_host, page)
        pageGetAlbums(url)
    queue_in.join()
    queue.join()
    exit_flag.set()
    print 'All downloaded!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'