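It has been a while since my last update. After my thesis defense I did go out and have fun with my classmates for some time. Since I have not started my job yet, the peace and quiet at home is a good chance to properly study things I have long been interested in.
I have always been interested in web crawlers, so I started learning Python, which I had wanted to pick up for a long time, and I have really enjoyed using it. Along the way I scraped data from Zhihu, Douban, Jandan and a wallpaper site, but the site I scraped most was the live-streaming site Douyu. What to do with the data once you have it is another question. My approach was to build a website on top of it with Flask, a Python web framework, as a way of showing what I learned during this period. A website needs a front end, so I gritted my teeth and learned Bootstrap and picked up some CSS and JavaScript. The main results are four projects: LearningFlask, BeautifulPics, Danmu and DouyuFan (I am new to Python, so the code quality may not be great -.-). The last one, DouyuFan, pulls together the earlier projects. It scrapes danmaku (bullet comment) messages from Douyu to get the distribution of live-stream gifts, a record of historical data, and information on the currently hottest room.
Over the next three posts I will cover what I learned and the problems I ran into in data scraping, back-end setup, and front-end/back-end data communication.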
Fetching data on live rooms
Anyone who has scraped data with Python will be familiar with the requests and Beautiful Soup libraries.
requests bills itself as "HTTP for Humans", and it lives up to the name: fetching page data with it really is simple and clear.
requests and Beautiful Soup are usually used together. Take scraping Douyu's currently live rooms as an example:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup  # BeautifulSoup extracts the target elements from a page
import re  # regular expressions, handy for quickly finding and filtering text
import requests  # requests fetches the page data
from datetime import datetime
from pymongo import MongoClient
Start by importing these libraries. You can then forge the HTTP request headers, which lowers the chance of the crawler being banned by the server.
HOST = "http://www.douyu.com"
Directory_url = "http://www.douyu.com/directory?isAjax=1"
Qurystr = "/?page=1&isAjax=1"
agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.86 Safari/537.36'
accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
connection = "keep-alive"
CacheControl = "no-cache"
UpgradeInsecureRequests = "1"  # header values must be strings, not ints
headers = {
    'User-Agent': agent,
    'Host': "www.douyu.com",  # the Host header carries the bare hostname, without the scheme
    'Accept': accept,
    'Cache-Control': CacheControl,
    'Connection': connection,
    'Upgrade-Insecure-Requests': UpgradeInsecureRequests
}
# The headers only take effect when passed with a request,
# e.g. session.get(url, headers=headers)
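Defining the headers is only half of the job: they have to be attached to the requests that go out. Below is a minimal sketch of one way to do that, assuming all requests go through a shared `requests.Session`. The shortened `User-Agent` string and the variable names are illustrative, not taken from the project. `prepare_request` builds the outgoing request without touching the network, so you can check what would be sent:

```python
import requests

# Illustrative header set (shortened User-Agent); every value is a string
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Cache-Control": "no-cache",
    "Upgrade-Insecure-Requests": "1",
}

session = requests.Session()
session.headers.update(headers)  # requests made through this session now carry them

# Build the request without sending it, to inspect the merged headers
request = requests.Request("GET", "http://www.douyu.com/directory?isAjax=1")
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])
```

Setting the headers once on the session avoids having to repeat `headers=headers` on every `get` call.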
Next comes fetching the live-room data and storing it in the database:
cli = MongoClient(host="ip",port=xxx)
db = cli["Douyu"]
col = db["Roominfo"]
def get_roominfo(data):
    if not data:
        return
    firstpage = BeautifulSoup(data, "html.parser")
    roomlist = firstpage.select('li')
    print(len(roomlist))
    for room in roomlist:
        try:
            roomid = room["data-rid"]
            roomtitle = room.a["title"]
            roomowner = room.select("p > span")
            roomtag = room.select("div > span")[0].string
            roomimg = room.a
            date = datetime.now()
            if len(roomowner) == 2:
                zbname = roomowner[0].string
                audience = roomowner[1].get_text()
                image = roomimg.span.img["data-original"]
                # Large audience counts appear on the page as strings in units of
                # ten thousand (万); convert them to int before storing
                word = u"万"
                if word in audience:
                    r = re.compile(r'(\d+)(\.?)(\d*)')
                    number = r.match(audience).group(0)
                    audience = int(float(number) * 10000)
                else:
                    audience = int(audience)
                roominfo = {
                    "roomid": int(roomid),
                    "roomtitle": roomtitle,
                    "anchor": zbname,
                    "audience": audience,
                    "tag": roomtag,
                    "date": date,
                    "img": image
                }
                col.insert_one(roominfo)
        except Exception:
            # Skip <li> elements that are not room entries or lack an expected field
            pass
def insert_info():
    session = requests.session()
    pagecontent = session.get(Directory_url).text
    pagesoup = BeautifulSoup(pagecontent, "html.parser")
    games = pagesoup.select('a')
    col.drop()  # clear the previous snapshot before inserting the new one
    for game in games:
        links = game.get("href")
        if not links:
            continue  # skip anchors without an href
        gameurl = HOST + links + Qurystr
        print(gameurl)
        gamedata = session.get(gameurl).text
        get_roominfo(gamedata)
I usually use MongoDB for storage. The script first connects to the database, then fetches all of Douyu's current room categories, and then fetches the live rooms in each category one by one. For each room it records roomid (the room number, Douyu's unique identifier for a room), roomtitle (room title), anchor (streamer id), audience (viewer count), tag (the room's category), date (when the data was fetched) and img (the room's cover image). Running this script on a schedule gives you the room with the largest audience at that moment (usually a LoL stream =.=), and later lets you look up the essential information about a room by its roomid.
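The audience conversion is the fiddliest part of each record, so here it is as a standalone sketch. The helper name `parse_audience` and the sample strings are hypothetical. It also uses `Decimal` where the scraper above uses `float`, so a value such as "1.15万" cannot be truncated by a floating-point rounding error:

```python
import re
from decimal import Decimal

WAN = u"万"  # "ten thousand", the unit used for large audience counts
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def parse_audience(text):
    """Convert a displayed count such as u'1.2万' or u'356' to an int."""
    match = NUMBER.search(text)
    if match is None:
        raise ValueError("no number found in %r" % text)
    value = Decimal(match.group(0))  # exact decimal arithmetic, no float rounding
    if WAN in text:
        value *= 10000
    return int(value)

print(parse_audience(u"1.2万"))  # 12000
print(parse_audience(u"356"))    # 356
```

Raising `ValueError` on unparseable text makes a bad record visible instead of silently storing a wrong number.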
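Fetching danmaku data
Anyone who watches Douyu streams regularly will know the term "danmaku master". I originally wanted to scrape danmaku so I could collect a large amount of it from live rooms and run some natural-language analysis. I never got around to studying that, so I dropped the idea. Danmaku also carries the rocket (gift) messages that are broadcast to every channel, though, and monitoring those over a long period should be interesting too.
So I got to work. Getting a room's danmaku is different from the page scraping described above. Fortunately Douyu officially provides a way to do it, the Douyu Danmaku Server Third-Party Access Protocol. The document explains how to obtain danmaku data and describes the message types, which makes the job much easier.
According to the protocol, once you open a TCP connection to the danmaku server you can receive danmaku continuously. I used the socket library for this.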
import socket
import time

HOST = 'openbarrage.douyutv.com'
PORT = 8601
RID = 97376
LOGIN_INFO = "type@=loginreq/username@=qq_aPSMdfM5" + \
    "/password@=1234567890123456/roomid@=" + str(RID) + "/"
JION_GROUP = "type@=joingroup/rid@=" + str(RID) + "/gid@=-9999" + "/"
ROOM_ID = "type@=qrl/rid@=" + str(RID) + "/"
# Note: tick is computed once, when this line runs, not on every send
KEEP_ALIVE = "type@=keeplive/tick@=" + \
    str(int(time.time())) + "/vbw@=0/k@=19beba41da8ac2b4c7895a66cab81e23/"
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
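Pay attention to a few variables here: host, port, roomid and gid. HOST is the address of the danmaku server, port is the publicly open port, roomid is the id of the streamer's room, and gid is the danmaku group to join. Group -9999 delivers all danmaku; it is the "mass danmaku" group.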
import pymongo

# roomcol is assumed to be the "Roominfo" collection filled by the scraper above

def get_Hotroom():
    hotroom = roomcol.find().limit(1).sort(
        [("audience", pymongo.DESCENDING), ("date", pymongo.DESCENDING)])
    for item in hotroom:
        return item["roomid"]

def create_Conn():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((HOST, PORT))
    RID = get_Hotroom()
    print("Current hottest room: %s" % RID)
    LOGIN_INFO = "type@=loginreq/username@=qq_aPSMdfM5" + \
        "/password@=1234567890123456/roomid@=" + str(RID) + "/"
    print(LOGIN_INFO)
    JION_GROUP = "type@=joingroup/rid@=" + str(RID) + "/gid@=-9999" + "/"
    print(JION_GROUP)
    s.sendall(tranMsg(LOGIN_INFO))
    s.sendall(tranMsg(JION_GROUP))
    return s
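create_Conn() relies on a tranMsg helper that this post does not show. The following is only a sketch of what such a helper might look like, not the project's actual code. It assumes the framing as I understand it from the protocol document: a little-endian length written twice, a 2-byte message type (689 for client-to-server), two zero bytes, then the text followed by a NUL byte. Treat the layout and the name `tran_msg` as assumptions and check them against the protocol document.

```python
import struct

CLIENT_TO_SERVER = 689  # assumed message type for client-to-server messages

def tran_msg(content):
    """Frame one message using the assumed layout described above."""
    body = content.encode("utf-8") + b"\x00"  # payload ends with a NUL byte
    # Assumed: the length counts everything after the first 4-byte length field,
    # i.e. the repeated length (4) + type/encrypt/reserved (4) + body
    length = len(body) + 8
    header = struct.pack("<IIHBB", length, length, CLIENT_TO_SERVER, 0, 0)
    return header + body

packet = tran_msg("type@=joingroup/rid@=97376/gid@=-9999/")
print(len(packet))
```

Returning `bytes` keeps the helper usable with `sock.sendall` directly.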
After that, get_Hotroom() finds the current hottest room (the one with the largest audience) and create_Conn() sets up the connection to the server. Once connected, you can happily receive and save danmaku data:
def insert_msg(sock):
    sendtime = 0
    while True:
        if sendtime % 20 == 0:
            print("----------Keep Alive---------")
            try:
                sock.sendall(tranMsg(KEEP_ALIVE))
            except socket.error:
                print("alive error")
                # Reconnect and stay in this loop; calling insert_msg() again
                # here would nest deeper on every failure
                sock = create_Conn()
        sendtime += 1
        print(sendtime)
        try:
            data = sock.recv(4000)
            if data:
                strdata = repr(data)
                if "type@=spbc" in strdata:
                    get_rocket(data)  # rocket broadcast
                if "type@=chatmsg" in strdata:
                    get_chatmsg(data)  # ordinary chat danmaku
        except socket.error:
            print("chat error")
            sock = create_Conn()
        time.sleep(1)
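A KEEP_ALIVE message is sent to the server roughly every 20 seconds (on every 20th pass through the loop) to keep the connection alive. Based on the content of the received data, ordinary chat danmaku and rocket broadcast danmaku are told apart and saved to separate databases so they can serve different purposes later.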
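One thing to keep in mind with `sock.recv(4000)` is that TCP hands back whatever bytes have arrived, which can be several messages at once or only part of one. The sketch below shows one way to cut a byte buffer into complete messages and turn each into a dict. It is not the project's code. It assumes a 12-byte header whose first field is a little-endian length excluding its own 4 bytes, and that values escape `/` as `@S` and `@` as `@A`, as I understand the protocol document. The names `split_frames` and `parse_fields`, the type code 690 and the sample field values are made up for illustration.

```python
import struct

def split_frames(buffer):
    """Split buffered bytes into complete payloads; return (payloads, leftover)."""
    payloads = []
    while len(buffer) >= 12:
        (length,) = struct.unpack("<I", buffer[:4])
        end = 4 + length  # assumed: length excludes its own 4 bytes
        if len(buffer) < end:
            break  # frame is incomplete, wait for more bytes
        text = buffer[12:end].rstrip(b"\x00").decode("utf-8", "replace")
        payloads.append(text)
        buffer = buffer[end:]
    return payloads, buffer

def parse_fields(payload):
    """Turn 'key@=value/key@=value/' into a dict, undoing the @S and @A escapes."""
    fields = {}
    for item in payload.split("/"):
        if "@=" not in item:
            continue
        key, value = item.split("@=", 1)
        # Undo @S before @A so an escaped '@' followed by 'S' is not misread
        fields[key] = value.replace("@S", "/").replace("@A", "@")
    return fields

def frame(content):
    """Build a sample frame with the same assumed layout."""
    body = content.encode("utf-8") + b"\x00"
    return struct.pack("<IIHBB", len(body) + 8, len(body) + 8, 690, 0, 0) + body

# One complete chat message followed by the first 10 bytes of a second frame
stream = frame("type@=chatmsg/nn@=alice/txt@=a@Sb/") + frame("type@=spbc/sn@=bob/")[:10]
payloads, leftover = split_frames(stream)
messages = [parse_fields(p) for p in payloads]
print(messages[0]["type"], messages[0]["txt"])
```

The leftover bytes would be prepended to the next `recv` result, so a message split across two reads is still parsed once it is complete.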
Fetching danmaku looks roughly like this:
![Fetching danmaku](https://o3sw4xojp.qnssl.com/Douyufandanmu.gif)
What's next
That completes the data collection for this project. The next two posts will cover how to use this data to build the pages, and the problems I ran into along the way.