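1: Introduction
For a while now I have wanted to write a crawler for Douyu danmu (the live chat comments that scroll across the stream), but exams kept eating my time. Douyu's API is open, yet I could not find a project on GitHub that connects to it cleanly. So I read a lot of articles, learned what I could, and am writing up a summary here. It also prepares the ground for later data analysis: first a simple word cloud of the danmu, then visualizations of the data from different rooms.
Code: github.com/rieuse/DouyuTV
The room crawled this time is 蕪湖大司馬 on Douyu. He is very popular, so there is plenty of data to analyze. He is also from my hometown, hehe. For every danmu, the uid, nickname, level, and danmu text are saved to MongoDB.
First, a look at the result.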
2: Environment
- IDE: PyCharm
- Python 3.6
- pymongo 3.4.0
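3: Walkthrough
Before crawling danmu, read the official developer documentation.
- Point one is the message format. As the function below shows, every message starts with a 12-byte little-endian header: the message length (written twice, each 4 bytes) followed by the message type code 689. The UTF-8 body comes after the header.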
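def sendmsg(msgstr):
    # Body of the message, UTF-8 encoded
    msg = msgstr.encode('utf-8')
    # The length field counts the body plus 8 more header bytes
    data_length = len(msg) + 8
    # Message type code for requests sent by the client
    code = 689
    # 12-byte header: length, length again, then the type code (all little-endian)
    msgHead = data_length.to_bytes(4, 'little') \
        + data_length.to_bytes(4, 'little') + code.to_bytes(4, 'little')
    # sendall keeps sending until the whole header and body are written
    client.sendall(msgHead + msg)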
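To make the byte layout easier to inspect, here is a minimal sketch that builds the same packet as a pure function, without touching a socket. The name `pack_message` is my own and not part of the original code; the layout simply mirrors what `sendmsg` does (two 4-byte lengths and the 4-byte code 689, all little-endian).

```python
import struct

CLIENT_MSG_CODE = 689  # same type code that sendmsg uses


def pack_message(body: str) -> bytes:
    """Return header + body for one message (hypothetical helper)."""
    payload = body.encode('utf-8')
    # The length field counts the body plus 8 more header bytes, as in sendmsg
    length = len(payload) + 8
    # '<III' = three unsigned 32-bit integers, little-endian
    header = struct.pack('<III', length, length, CLIENT_MSG_CODE)
    return header + payload


packet = pack_message('type@=keeplive/tick@=1/\0')
print(len(packet), packet[:12].hex())
```

With a helper like this, sending becomes `client.sendall(pack_message(msg))`, and the header can be checked in a test without a network connection.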
- Point two is the login request. Build the string below and pass it to sendmsg to send it:
msg = 'type@=loginreq/username@=rieuse/password@=douyu/roomid@={}/\0'.format(roomid)
sendmsg(msg)
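- Point three is receiving danmu. After logging in, send a joingroup request for the room; the server then starts pushing danmu messages: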
msg_more = 'type@=joingroup/rid@={}/gid@=-9999/\0'.format(roomid)
sendmsg(msg_more)
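The login and joingroup strings share one shape: `key@=value/` pairs followed by a trailing `\0`. The sketch below builds such strings from a dict. `build_message` and `escape` are hypothetical helpers of mine, and the escaping rule (`@` becomes `@A`, `/` becomes `@S`) is my understanding of Douyu's serialization format rather than something shown in the code above, so treat it as an assumption and check it against the official documentation. The room ID is a made-up placeholder.

```python
def escape(value: str) -> str:
    # Assumed escaping rule: '@' -> '@A', '/' -> '@S'.
    # '@' must be replaced first, because escaping '/' introduces new '@' characters.
    return value.replace('@', '@A').replace('/', '@S')


def build_message(fields: dict) -> str:
    """Serialize key/value pairs as 'key@=value/' items ending in '\\0'."""
    items = ''.join('{}@={}/'.format(escape(str(k)), escape(str(v)))
                    for k, v in fields.items())
    return items + '\0'


roomid = '12345'  # placeholder room ID, for illustration only
login = build_message({'type': 'loginreq', 'username': 'rieuse',
                       'password': 'douyu', 'roomid': roomid})
join = build_message({'type': 'joingroup', 'rid': roomid, 'gid': '-9999'})
print(repr(login))
print(repr(join))
```

The two results are the same strings the hand-written `format` calls produce, so either can be handed to `sendmsg` unchanged.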
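- Point four is keeping the login alive by sending a heartbeat message at a fixed interval:
def keeplive():
    while True:
        # The heartbeat carries the current Unix timestamp
        msg = 'type@=keeplive/tick@=' + str(int(time.time())) + '/\0'
        sendmsg(msg)
        time.sleep(15)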
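The loop above never exits, which is fine for a script but awkward if the crawler should shut down cleanly. One possible variation, sketched here under my own assumptions, takes the send function and a `threading.Event` as parameters so the heartbeat can be stopped and tested without a real connection. `keepalive_loop` is a hypothetical name, and the stand-in `sent.append` replaces `sendmsg` only for the demonstration.

```python
import threading
import time


def keepalive_loop(send, stop, interval=15.0):
    """Send a heartbeat every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        send('type@=keeplive/tick@={}/\0'.format(int(time.time())))
        # wait() returns early as soon as stop is set, unlike time.sleep()
        stop.wait(interval)


# Demonstration with a list standing in for the socket
sent = []
stop = threading.Event()
worker = threading.Thread(target=keepalive_loop, args=(sent.append, stop, 0.01))
worker.start()
time.sleep(0.1)
stop.set()
worker.join(timeout=1)
print(len(sent), 'heartbeats collected')
```

In the crawler itself the call would be `keepalive_loop(sendmsg, stop)`, with `stop.set()` invoked when the receive loop ends.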
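- Point five is decoding the received bytes into readable text and saving the result to MongoDB (a plain text file works too).
- Additional notes
That covers the main features of the API. What remains is the concrete implementation, which comes down to these points:
- 1. The user enters a room ID, and the script fetches the room information.
- 2. After we send our requests, Douyu sends data back. That data is binary, so it has to be decoded.
- 3. For each danmu I collect the sender's uid, nickname, level, and the danmu text. Some users have no level in the message, which causes an error if left unhandled, so fall back to a default as shown below.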
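if not level_more:
    # findall returns a list, so the fallback must be a list too;
    # with a bare b'0', level_more[0] would be the integer 48 and .decode() would fail
    level_more = [b'0']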
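As an alternative to one regular expression per field, the body of a message can be split into a dict, which makes a missing field such as the level easy to default. This is only a sketch: `parse_message` and `unescape` are hypothetical helpers, the `@S`/`@A` unescaping is my assumption about Douyu's format, and the sample bytes are invented for illustration rather than captured from the server.

```python
def unescape(value: str) -> str:
    # Assumed inverse of the escaping rule: '@S' -> '/', then '@A' -> '@'
    return value.replace('@S', '/').replace('@A', '@')


def parse_message(body: bytes) -> dict:
    """Split 'key@=value/' items into a dict of strings."""
    # errors='ignore' drops a multi-byte character cut in half by recv()
    text = body.decode('utf-8', errors='ignore').rstrip('\0')
    fields = {}
    for item in text.split('/'):
        if '@=' not in item:
            continue  # skip empty or malformed pieces
        key, value = item.split('@=', 1)
        fields[unescape(key)] = unescape(value)
    return fields


# Invented sample message, for illustration only
sample = b'type@=chatmsg/uid@=123/nn@=tester/txt@=hello/cid@=abc/level@=5/\0'
msg = parse_message(sample)
print(msg['nn'], msg['txt'], msg.get('level', '0'))
```

`msg.get('level', '0')` plays the same role as the `if not level_more` check: users without a level end up with `'0'` instead of raising an error.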
4: Full code
import re
import socket
import threading
import time

import pymongo
import requests
from bs4 import BeautifulSoup

# MongoDB: database "DouyuTV_danmu", collection "info"
clients = pymongo.MongoClient('localhost')
db = clients["DouyuTV_danmu"]
col = db["info"]

# One TCP connection to the danmu server, shared by the receive and heartbeat threads
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
host = socket.gethostbyname("openbarrage.douyutv.com")
port = 8601
client.connect((host, port))

# Byte patterns that pull each field out of a chat message
danmu_path = re.compile(b'txt@=(.+?)/cid@')
uid_path = re.compile(b'uid@=(.+?)/nn@')
nickname_path = re.compile(b'nn@=(.+?)/txt@')
level_path = re.compile(b'level@=([1-9][0-9]?)/sahf')


def sendmsg(msgstr):
    msg = msgstr.encode('utf-8')
    data_length = len(msg) + 8
    code = 689
    # 12-byte header: length, length again, then the type code (all little-endian)
    msgHead = data_length.to_bytes(4, 'little') \
        + data_length.to_bytes(4, 'little') + code.to_bytes(4, 'little')
    # sendall keeps sending until the whole header and body are written
    client.sendall(msgHead + msg)


def start(roomid):
    # Log in, then join the room's danmu group
    msg = 'type@=loginreq/username@=rieuse/password@=douyu/roomid@={}/\0'.format(roomid)
    sendmsg(msg)
    msg_more = 'type@=joingroup/rid@={}/gid@=-9999/\0'.format(roomid)
    sendmsg(msg_more)
    print('---------------Connected to the room of {}---------------'.format(get_name(roomid)))
    while True:
        data = client.recv(1024)
        if not data:
            break  # the server closed the connection
        uid_more = uid_path.findall(data)
        nickname_more = nickname_path.findall(data)
        level_more = level_path.findall(data)
        danmu_more = danmu_path.findall(data)
        for i in range(len(danmu_more)):
            try:
                # Same fallback as in section 3: a missing level becomes '0'
                level = level_more[i] if i < len(level_more) else b'0'
                product = {
                    'uid': uid_more[i].decode('utf-8'),
                    'nickname': nickname_more[i].decode('utf-8'),
                    'level': level.decode('utf-8'),
                    'danmu': danmu_more[i].decode('utf-8')
                }
                print(product)
                # insert_one replaces the deprecated Collection.insert
                col.insert_one(product)
                print('Saved to MongoDB')
            except Exception as e:
                print(e)


def keeplive():
    while True:
        msg = 'type@=keeplive/tick@=' + str(int(time.time())) + '/\0'
        sendmsg(msg)
        time.sleep(15)


def get_name(roomid):
    # Scrape the streamer's name from the room page
    r = requests.get("http://www.douyu.com/" + roomid)
    soup = BeautifulSoup(r.text, 'lxml')
    return soup.find('a', {'class': 'zb-name'}).string


if __name__ == '__main__':
    room_id = input('Enter the room ID: ')
    # Threads share the single socket above, so heartbeats go out on the logged-in
    # connection on every platform (separate processes may each open their own)
    p1 = threading.Thread(target=start, args=(room_id,))
    p2 = threading.Thread(target=keeplive, daemon=True)
    p1.start()
    p2.start()
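One limitation of the receive loop is that `recv(1024)` returns an arbitrary slice of the TCP stream, so a single chunk may hold several messages or only part of one. A more careful approach buffers the bytes and cuts them into messages using the length header. The sketch below assumes server messages carry the same 12-byte header that `sendmsg` writes; that is an assumption on my part, and `split_packets` is a hypothetical helper.

```python
import struct


def split_packets(buffer: bytes):
    """Cut complete messages off the front of buffer.

    Returns (bodies, remainder). The remainder holds an incomplete
    message that needs more bytes from the next recv().
    """
    bodies = []
    while len(buffer) >= 4:
        (length,) = struct.unpack('<I', buffer[:4])
        if length < 8:
            raise ValueError('malformed length field: {}'.format(length))
        total = 4 + length  # the first length field does not count itself
        if len(buffer) < total:
            break  # wait for the rest of this message
        bodies.append(buffer[12:total])  # skip the 12-byte header
        buffer = buffer[total:]
    return bodies, buffer


# Demonstration with two hand-built packets, the second one cut short
def fake_packet(payload: bytes) -> bytes:
    length = len(payload) + 8
    return struct.pack('<III', length, length, 690) + payload


stream = fake_packet(b'txt@=a/\0') + fake_packet(b'txt@=bb/\0')
bodies, rest = split_packets(stream[:-2])
print(bodies, len(rest))
```

In the receive loop this would look like `buffer += client.recv(1024)` followed by `bodies, buffer = split_packets(buffer)`, with the field extraction applied to each body in turn.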
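5: Using the danmu afterwards
Here each danmu's uid, nickname, level, and text are saved to MongoDB, so later analysis can read them straight from the database. If only the danmu text is needed, it is enough to write the text alone to a txt file.
Here is my GitHub repository, which holds my crawler code and my notes on the basics. If you like it, feel free to star and follow so we can learn together! github.com/rieuse/DouyuTV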
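For the text-only case, a minimal sketch could look like the following. `save_danmu_text` and the file name are hypothetical, and the sample strings are placeholders rather than real danmu.

```python
import os
import tempfile


def save_danmu_text(path, danmu_list):
    """Append each danmu to a UTF-8 text file, one per line."""
    with open(path, 'a', encoding='utf-8') as f:
        for text in danmu_list:
            # Keep one danmu per line even if the text contains a newline
            f.write(text.replace('\n', ' ') + '\n')


# Demonstration in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'danmu.txt')
save_danmu_text(demo_path, ['first danmu', 'second\ndanmu'])
save_danmu_text(demo_path, ['third danmu'])
with open(demo_path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

Inside the crawler, the call would sit where the MongoDB insert is now, for example `save_danmu_text('danmu.txt', [product['danmu']])`. One danmu per line is also a convenient input format for a word cloud later.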