I like listening to audiobooks on my commute, so I'm often hunting for resources on Ximalaya. Finding a good program isn't easy: the Ximalaya site lets you browse by category, but you can't sort by like count or search through comment content, which is inconvenient. So I wrote a crawler in Python to grab the metadata for every sound along with all of its comments, loaded everything into a database, and did the analysis there. That way, whatever kind of resource I'm after, I can aggregate and filter directly on the sound and comment data, and the answer is obvious at a glance.
Implementation flow diagram
urllib, requests, and selenium
Page fetching uses urllib and urllib2. Compared with requests and selenium they felt faster and more stable here; when I tried requests earlier I ran into trouble handling some https URLs. Selenium lets you automate the browser without working out each page's logic yourself, but for this volume of data it is far too slow.
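As a quick illustration, this is the basic fetch-and-parse pattern the whole script is built on (a minimal Python 2 sketch; the XPath is the same selector the full script uses to pull album links off a listing page):

# Python 2: fetch a listing page with urllib and parse it with lxml
import urllib
from lxml import etree

html = urllib.urlopen('http://www.ximalaya.com/dq/music/').read()
tree = etree.HTML(html)
album_urls = tree.xpath("//div[@class='albumfaceOutter']/a/@href")
for album_url in album_urls:
    print album_url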
Multithreading and queues
There are two subclasses of threading.Thread. The Ximalaya class parses sound albums: it extracts the sound entries inside each album and works out the comment-list URL for each sound. The CommentDown class is dedicated to fetching and saving comment content. A sound usually has many comments, sometimes over a thousand, so CommentDown gets 10 threads for comments while Ximalaya gets 3 threads for album and sound parsing.
Two queues are used: one for album URLs with a maximum size of 100, and one for comment URLs with a maximum size of 200. When a queue is full, the producer blocks until earlier items have been consumed, which throttles the whole crawl and keeps the requests from being so frequent that the site blocks us.
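Stripped of the parsing details, the thread-and-queue wiring looks roughly like this (a sketch only; Worker is a placeholder standing in for the two real classes shown in the full source below):

# Python 2: bounded queues throttle the crawl, worker threads block on get()
import threading
from Queue import Queue

queue_in = Queue(100)   # album urls
queue = Queue(200)      # comment page urls

class Worker(threading.Thread):
    def __init__(self, q):
        threading.Thread.__init__(self)
        self.q = q
        self.setDaemon(True)
    def run(self):
        while True:
            url = self.q.get()   # blocks until an item is available
            # ... fetch and parse url here ...
            self.q.task_done()

for _ in range(3):
    Worker(queue_in).start()     # album parsers
for _ in range(10):
    Worker(queue).start()        # comment downloaders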
Data storage
MongoDB is used for storage; a NoSQL store copes well with this kind of concurrent writing and, for this job, is faster and simpler than a relational database. The music collection holds the sound metadata (album name, album URL, sound URL, duration, like count, and so on), and the bookcomment collection holds the comment content.
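With pymongo a record is just a dict insert; a minimal sketch with made-up sample values (the field names match the ones the script actually stores):

# Python 2 + pymongo: one document per sound, one per comment
from pymongo import MongoClient

client = MongoClient()
db = client.test
db.music.insert({               # sound metadata, sample values only
    'album_title': u'Sample album',
    'url': 'http://www.ximalaya.com/...',
    'playcount': '1000',
    'likecount': '50',
})
db.bookcomment.insert({         # comment content, sample values only
    'music_id': '123',
    'content': u'Sample comment',
    'url': 'http://www.ximalaya.com/sounds/123/comment_list?page=1',
})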
斷點續(xù)傳来庭、重復(fù)處理
遇到中途中斷后要繼續(xù)執(zhí)行妒蔚,還得考慮下斷點續(xù)傳,這里處理得比較簡單粗暴月弛,在Ximalaya處理聲音的時候肴盏,會先判斷數(shù)據(jù)庫是否有聲音的地址,如果存在就是跳過不再處理帽衙,在CommentDown處理評論的時候菜皂,判斷判斷數(shù)據(jù)庫是否有聲音的地址,如果存在就是跳過不再處理厉萝,這樣對于后面比較費時處理部分都可以直接跳過恍飘,也不會存在有重復(fù)的數(shù)據(jù)會影響最后的分析的問題。
Exception handling
Parsing pages can hit timeouts and similar errors, so exception handling was added. socket.setdefaulttimeout(20) sets a global 20-second timeout; anything slower raises an error, which is then caught. A retry counter is kept as well, and once a problem page has been retried the allowed number of times it is dropped, so a sound page that has been deleted doesn't keep failing forever.
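The retry logic boils down to the global timeout plus a bounded retry loop around the failing call; a sketch (retry_fetch is a made-up helper, the real script keeps a counter per method instead):

# Python 2: 20s global socket timeout plus a capped retry loop
import socket
import time
import urllib

socket.setdefaulttimeout(20)    # any request slower than 20s raises an exception

def retry_fetch(url, retries=3):
    for attempt in range(retries):
        try:
            return urllib.urlopen(url).read()
        except Exception as e:
            print 'fetch %s failed (%s), retry %s/%s' % (url, e, attempt + 1, retries)
            time.sleep(15)
    return None                  # give up, e.g. the sound page has been deleted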
Data analysis
最后對保存到數(shù)據(jù)庫的數(shù)據(jù)進行分析博杖,做分析的時候Nosql做關(guān)聯(lián)分析太痛苦了椿胯,完全不如sql查詢方便,于是把數(shù)據(jù)導(dǎo)入到Oracle來進行的分析剃根,根據(jù)評論內(nèi)容中的關(guān)鍵字來標(biāo)識判斷(例如:“點贊”哩盲,“好聽”,“太棒“之類的都判斷成受歡迎)狈醉,最后再用匯總統(tǒng)計出結(jié)果廉油。
Top 20 audiobooks with the most positive comments:
Top 10 variety shows with the most positive comments:
Top 20 music programs with the most positive comments:
Programs whose Mandarin gets the most praise in comments:
最后源代碼也附上了,有興趣的朋友可以自己看看:
爬蟲源代碼
#coding:utf-8
import urllib
from lxml import etree
import re
from pymongo import MongoClient
import socket
import threading
from Queue import Queue
import time
import random
import sys
import traceback

socket.setdefaulttimeout(20)
PgEr = 0
queue = Queue(200)
queue_in = Queue(100)


class Ximalaya(threading.Thread):
    def __init__(self, queue, queue_in):
        threading.Thread.__init__(self)
        self.queue = queue
        self.queue_in = queue_in
        self.host = 'http://www.ximalaya.com'
        self.client = MongoClient()
        self.db = self.client.test
        self.musicInfo = self.db.music
        self.commentInfo = self.db.bookcomment
        self.AgsEr = 0
        self.SdEr = 0
        self.ClgEr = 0
        self.headers = {'Accept-Language': 'zh-CN',
                        'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
                        'Content-Type': 'application/x-www-form-urlencoded',
                        'Host': 'www.ximalaya.com',
                        'Connection': 'Keep-Alive',
                        }

    def run(self):
        while not exit_flag.is_set():
            time.sleep(random.uniform(1, 3))
            album_url = self.queue_in.get()
            self.albumGetSounds(album_url)
            self.queue_in.task_done()
            print '*** Album done, url: %s, Queue: %s, Queue_in: %s, Page: %s ***' % (album_url, self.queue.qsize(), self.queue_in.qsize(), page)

    def albumGetSounds(self, ab_url):
        try:
            html = urllib.urlopen(ab_url).read()
            tree = etree.HTML(html)
            sound_urls = tree.xpath("//div[@class='miniPlayer3']/a/@href")
            album_title = tree.xpath("//div[@class='detailContent_title']/h1")[0].text
            if sound_urls:
                for sound_url in sound_urls:
                    sound_url = self.host + sound_url
                    #print 'Sound url is: ', sound_url
                    # If the sound url is already in the database, skip parsing it again
                    exists_flag = self.Musicinfo_find(sound_url)
                    if not exists_flag:
                        print '>>> Sound put %s, Queue: %s, Queue_in: %s, Page: %s' % (sound_url, self.queue.qsize(), self.queue_in.qsize(), page)
                        self.soundpage(ab_url, album_title, sound_url)
                    else:
                        pass
                        #print 'Sound %s already exists, goto next >>>' % sound_url
        except Exception as e:
            print '***albumGetSounds error, error: %s, album url: %s' % (e, ab_url)
            traceback.print_exc()
            self.AgsEr += 1
            if self.AgsEr < 3:
                print 'time sleep 15s.'
                time.sleep(15)
                self.albumGetSounds(ab_url)

    def soundpage(self, a_url, a_title, s_url):
        try:
            html = urllib.urlopen(s_url).read()
            tree = etree.HTML(html)
            title = tree.xpath("//div[@class='detailContent_title']/h1")[0].text
            music_type = tree.xpath("//div[@class='detailContent_category']/a")[0].text
            tags = tree.xpath("//div[@class='tagBtnList']/a[@class='tagBtn2']/span")
            tagString = ','.join(i.text for i in tags)
            playcount = tree.xpath("//div[@class='soundContent_playcount']")[0].text
            likecount = tree.xpath("//a[@class='likeBtn link1 ']/span[@class='count']")[0].text
            commentcount = tree.xpath("//a[@class='commentBtn link1']/span[@class='count']")[0].text
            forwardcount = tree.xpath("//a[@class='forwardBtn link1']/span[@class='count']")[0].text
            mp3duration = tree.xpath("//div[@class='sound_titlebar']/div[@class='fr']/span[@class='sound_duration']")[0].text
            username = tree.xpath("//div[@class='username']")[0].text
            username = username.split()[0]
            track_id = re.search(r'track_id="(\d+)"', html)
            track_id = track_id.group(1) if track_id else None
            comment_url = 'http://www.ximalaya.com/sounds/' + track_id + '/comment_list'
            #print likecount, commentcount, forwardcount, mp3duration, username, track_id, comment_url
            info_sound = {}
            info_sound['album_title'] = a_title
            info_sound['album_url'] = a_url
            info_sound['title'] = title
            info_sound['music_type'] = music_type
            info_sound['tags'] = tagString
            info_sound['music_id'] = track_id
            info_sound['playcount'] = playcount
            info_sound['likecount'] = likecount
            info_sound['commentcount'] = commentcount
            info_sound['forwardcount'] = forwardcount
            info_sound['mp3duration'] = mp3duration
            info_sound['user'] = username
            info_sound['url'] = s_url
            # Insert the sound info into the database
            self.Musicinfo_insert(info_sound)
            # Build the comment url and hand it to CommenlistGet
            self.CommenlistGet(comment_url)
        except Exception as e:
            print '***soundpage error: %s, Album url: %s, sound url: %s' % (e, a_url, s_url)
            traceback.print_exc()
            self.SdEr += 1
            if self.SdEr < 3:
                print 'Time sleep 15s'
                time.sleep(15)
                self.soundpage(a_url, a_title, s_url)

    def CommenlistGet(self, url):
        try:
            html = urllib.urlopen(url).read()
            tree = etree.HTML(html)
            pages = tree.xpath("//div[@class='pagingBar_wrapper']/a/text()")
            # If there are multiple comment pages, take the max page number and queue every page
            if pages:
                pages = [int(i) for i in pages if i.isdigit()]  # keep only the numeric entries
                max_page = max(pages)
                for i in xrange(1, max_page + 1):
                    comment_url = url + '?page=%s' % i
                    # If the comment url is already in the database, do not queue it again
                    com_count = self.CommentInfo_find(comment_url)
                    if not com_count:
                        self.queue.put(comment_url)
                        #print 'Queue size is: %s' % self.queue.qsize()
                        #print 'Comment url %s put in queue' % comment_url
                    else:
                        #pass
                        print 'Comment already in posts database, url: %s' % comment_url
            else:
                # If the comment url is already in the database, do not queue it again
                com_count = self.CommentInfo_find(url)
                if not com_count:
                    self.queue.put(url)
                    #print 'Queue size is: %s' % self.queue.qsize()
                    #print 'Comment url %s put in queue' % url
                else:
                    pass
                    #print 'Comment already in posts database, url: %s' % url
        except Exception as e:
            print '***CommenlistGet get %s error, error: %s' % (url, e)
            traceback.print_exc()
            self.ClgEr += 1
            if self.ClgEr < 3:
                time.sleep(15)
                self.CommenlistGet(url)

    def Musicinfo_insert(self, infoData):
        music_id = self.musicInfo.insert(infoData)
        return music_id

    def Musicinfo_find(self, url):
        count = self.musicInfo.find_one({"url": url})
        return count

    def CommentInfo_find(self, url):
        comCount = self.commentInfo.find_one({"url": url})
        return comCount


class CommentDown(threading.Thread):
    def __init__(self, queue, queue_in):
        threading.Thread.__init__(self)
        self.queue = queue
        self.queue_in = queue_in
        self.client = MongoClient()
        self.db = self.client.test
        self.PgEr = 0

    def run(self):
        while not exit_flag.is_set():
            flag = self.queue.qsize()
            #print 'flag is : %s' % flag
            if flag < 30:
                print 'Queue size < 30, Time sleep 60s.'
                time.sleep(60)
            url = self.queue.get()
            time.sleep(random.uniform(1, 3))
            self.postpage(url)
            sys.stdout.flush()
            self.queue.task_done()
            print '*** Comment done, url: %s ***, Queue: %s, Queue_In: %s, Page: %s' % (url, self.queue.qsize(), self.queue_in.qsize(), page)

    def postpage(self, post_url):
        try:
            html = urllib.urlopen(post_url).read()
            tree = etree.HTML(html)
            contents = tree.xpath('//div[@class="right"]')
            music_id = re.search(r'sounds/(\d+)/comment_list', post_url)
            music_id = music_id.group(1) if music_id else None
            for content in contents:
                user = content.xpath("div/a")[0].text
                mark = content.xpath("div/span")[0].text
                comment_list = content.xpath(".//div[@class='comment_content']/text()")
                comment_txt = comment_list[0] if comment_list else None
                reply_time = content.xpath(".//span[@class='comment_createtime']/text()")[0]
                user = user.encode('raw_unicode_escape').strip()
                mark = mark.encode('raw_unicode_escape').strip()
                comment_txt = comment_txt.encode('raw_unicode_escape').strip() if comment_txt else None
                #print user
                #print mark
                #print comment_txt
                #print reply_time
                info_comment = {}
                info_comment['music_id'] = music_id
                info_comment['user'] = user
                info_comment['mark'] = mark
                info_comment['content'] = comment_txt
                info_comment['reply_time'] = reply_time
                info_comment['url'] = post_url
                self.Post_insert(info_comment)
        except Exception as e:
            print '***postpage get %s error, error: %s' % (post_url, e)
            traceback.print_exc()
            self.PgEr += 1
            if self.PgEr < 5:
                print 'time sleep 15s.'
                time.sleep(15)
                self.postpage(post_url)

    def Post_insert(self, postdata):
        Post_data = self.db.bookcomment
        post_id = Post_data.insert(postdata)
        return post_id


# Put every album url found on a listing page into the queue_in queue
def pageGetAlbums(url):
    try:
        html = urllib.urlopen(url).read()
        tree = etree.HTML(html)
        album_urls = tree.xpath("//div[@class='albumfaceOutter']/a/@href")
        if album_urls:
            for album_url in album_urls:
                # 'Album url is: ', album_url
                global queue_in
                global queue
                queue_in.put(album_url)
                print 'Queue_in: %s, Queue: %s, Page: %s' % (queue_in.qsize(), queue.qsize(), page)
    except Exception as e:
        print '****pageGetAlbums %s get error, error: %s' % (url, e)
        traceback.print_exc()
        global PgEr
        PgEr += 1
        if PgEr < 5:
            print 'time sleep 15s'
            time.sleep(15)
            pageGetAlbums(url)


if __name__ == '__main__':
    exit_flag = threading.Event()
    exit_flag.clear()
    for i in range(3):
        xi = Ximalaya(queue, queue_in)
        xi.start()
    for i in range(10):
        downer = CommentDown(queue, queue_in)
        downer.start()
    url_host = 'http://www.ximalaya.com/dq/music/'
    # Walk the hot listing pages and put every album url into queue_in
    html = urllib.urlopen(url_host).read()
    tree = etree.HTML(html)
    pages = tree.xpath("//div[@class='pagingBar_wrapper']/a/text()")
    pages = [int(i) for i in pages if i.isdigit()]  # keep only the numeric entries
    max_page = max(pages)
    for page in xrange(1, max_page + 1):
        print 'Page No: %s' % page
        url = '%s%s' % (url_host, page)
        pageGetAlbums(url)
    queue_in.join()
    queue.join()
    exit_flag.set()
    print 'All downloaded!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'