Preface
On December 7, 2018, the year's final blockbuster, Aquaman, opened as scheduled. Its Maoyan rating currently stands at 9.5, and on a production budget of $150 million it has already grossed close to 900 million yuan, a textbook case of a small bet paying off big. This article scrapes 30,000+ Maoyan comments to give you a multi-angle read on whether it's worth watching. Truth be told, I haven't seen it either (I'm broke)!
Scraping the Data
Maoyan's movie pages now appear to be rendered entirely server-side, and no comment API is exposed there. Borrowing the approach from earlier articles on scraping Maoyan data, I found the comment endpoint:
http://m.maoyan.com/mmdb/comments/movie/249342.json?v=yes&offset=15&startTime=2018-12-08%2019%3A17%3A16
We had the endpoint but not Aquaman's movie ID. That's no obstacle: using the Maoyan app together with Charles (an HTTP debugging proxy), we captured the traffic and found the movie ID, 249342.
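Putting the pieces together: the request URL is just the movie ID in the path plus an offset and a URL-encoded startTime in the query string. A minimal sketch of assembling it (the helper name build_comment_url is my own):

```python
from urllib.parse import quote

# Build the comment URL; 249342 is Aquaman's movie ID, startTime pages backwards
def build_comment_url(movie_id, start_time, offset=15):
    base = 'http://m.maoyan.com/mmdb/comments/movie/%d.json' % movie_id
    return '%s?v=yes&offset=%d&startTime=%s' % (base, offset, quote(start_time))

print(build_comment_url(249342, '2018-12-08 19:17:16'))
```

quote() percent-encodes the space and colons in the timestamp, which is exactly the encoding seen in the captured URL above.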
Next, scrape the comments:
# Fetch the raw response body from the comment endpoint
import requests

def get_data(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    }
    html = requests.request(method='GET', url=url, headers=headers)
    if html.status_code == 200:
        return html.content
    else:
        return None
Parse the data returned by the endpoint:
# Parse the JSON returned by the endpoint
import json

def parse_data(html):
    json_data = json.loads(html)['cmts']
    comments = []
    try:
        for item in json_data:
            comment = {
                'nickName': item['nickName'],
                'cityName': item['cityName'] if 'cityName' in item else '',
                'content': item['content'].strip().replace('\n', ''),
                'score': item['score'],
                'startTime': item['startTime']
            }
            comments.append(comment)
        return comments
    except Exception as e:
        print(e)
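To make the field handling concrete, here is the same extraction run on a hand-made payload shaped like the endpoint's cmts array (all sample values invented); note how a missing cityName falls back to an empty string:

```python
# A fabricated payload with the same shape as the endpoint's 'cmts' array
sample = {'cmts': [
    {'nickName': 'abc', 'cityName': '北京', 'content': '特效震撼\n', 'score': 5,
     'startTime': '2018-12-08 19:17:16'},
    {'nickName': 'xyz', 'content': '劇情一般', 'score': 3,
     'startTime': '2018-12-08 19:15:02'},  # no cityName field
]}

comments = []
for item in sample['cmts']:
    comments.append({
        'nickName': item['nickName'],
        'cityName': item.get('cityName', ''),  # missing city -> ''
        'content': item['content'].strip().replace('\n', ''),
        'score': item['score'],
        'startTime': item['startTime'],
    })
```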
Build the paginated URLs and save the data:
import time
import datetime

# Page backwards through comments by startTime and append each row to a file
def change_url_and_save():
    start_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime()).replace(' ', '%20')
    end_time = '2018-12-07 00:00:00'  # release day: stop once we reach it
    while start_time > end_time:
        url = 'http://m.maoyan.com/mmdb/comments/movie/249342.json?v=yes&offset=15&startTime=' + start_time
        try:
            html = get_data(url)
        except Exception:
            time.sleep(0.5)  # back off once and retry
            html = get_data(url)
        else:
            time.sleep(0.1)  # throttle successful requests
        comments = parse_data(html)
        # Move the cursor to one second before the oldest comment on this page
        start_time = comments[14]['startTime']
        print(start_time)
        next_time = datetime.datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S') - datetime.timedelta(seconds=1)
        start_time = next_time.strftime('%Y-%m-%d %H:%M:%S').replace(' ', '%20')
        for item in comments:
            print(item)
            with open('/Users/mac/Desktop/H5DOC/H5learn/REPTILE/comments.txt', 'a', encoding='utf-8') as f:
                f.write(item['nickName'] + ',' + item['cityName'] + ',' + item['content'] + ',' +
                        str(item['score']) + ',' + item['startTime'] + '\n')
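The fiddly part above is advancing the cursor: take the startTime of the 15th (oldest) comment on the page, subtract one second so the next request doesn't re-fetch it, and re-encode the space. That step in isolation (the helper name next_cursor is my own):

```python
import datetime

# Step the pagination cursor back one second and re-encode the space
def next_cursor(start_time):
    t = datetime.datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S')
    t -= datetime.timedelta(seconds=1)
    return t.strftime('%Y-%m-%d %H:%M:%S').replace(' ', '%20')

print(next_cursor('2018-12-08 19:17:16'))  # 2018-12-08%2019:17:15
```

Because the timestamp goes through a real datetime, the subtraction also rolls over correctly across minute, hour, and day boundaries.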
In the end we collected roughly 33,000 comments.
Data Analysis
For the analysis we used pyecharts (a Python charting library built on Baidu's ECharts), Excel, and wordcloud to generate the word cloud.
First, a heat map of where the comments come from:
Beijing-Tianjin-Hebei, the Yangtze River Delta, and the Pearl River Delta, the regions that dominate every ranking, still hold the top spots on the heat map, with the new first-tier Chengdu-Chongqing area, Zhengzhou, and Wuhan close behind!
Here are the 20 cities with the most comments:
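The chart came from pyecharts, but the ranking behind it is a one-liner over the saved rows with collections.Counter; a sketch with invented sample rows in the scraper's nickName,cityName,content,score,startTime format:

```python
from collections import Counter

# Invented sample rows in the scraper's output format:
# nickName,cityName,content,score,startTime
rows = [
    'a,北京,好看,5,2018-12-08 10:00:00',
    'b,上海,特效震撼,5,2018-12-08 10:01:00',
    'c,北京,劇情不錯,4,2018-12-08 10:02:00',
]

cities = Counter(row.split(',')[1] for row in rows if row.split(',')[1])
top20 = cities.most_common(20)  # [(city, count), ...] sorted by count
print(top20)
```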
Nationwide distribution of comments:
The map largely mirrors the heat map: comments cluster in the first-tier and new first-tier cities. As for why Hangzhou sits all the way down at 17th, my guess is that, being Alibaba's home turf, everyone there buys tickets on Taopiaopiao instead!
Next, the breakdown of scores:
As the chart shows, scores of 4 and above make up 94% of all comments, and the average score reaches 4.68!
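Those two figures are straightforward to recompute from the saved scores; a sketch with an invented sample (the real numbers come from the 33,000 scraped rows):

```python
# Invented sample scores; the real list would be read from comments.txt
scores = [5, 5, 4.5, 4, 3, 5, 4.5, 5, 2, 5]

high_share = sum(1 for s in scores if s >= 4) / len(scores)  # share rated 4 or above
average = sum(scores) / len(scores)
print('%.0f%% rated 4 or above, average %.2f' % (high_share * 100, average))
```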
Now scores by city:
After the scores, the word cloud of the comments:
The words that appear most often are "good-looking", "special effects", "plot", and "stunning", so viewers clearly approve of both the effects and the story. After all, with 73% fresh on Rotten Tomatoes and only $150 million to work with, pulling this off is no small feat. I've decided to hit the theater this weekend!
Word cloud code:
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

def data_wordclound():
    comments = ''
    with open('comments.txt', 'r') as f:
        rows = f.readlines()
    try:
        for row in rows:
            lit = row.split(',')
            if len(lit) >= 3:
                comment = lit[2]
                if comment != '':
                    # Segment the Chinese text with jieba so word frequencies can be counted
                    comments += ' '.join(jieba.cut(comment.strip()))
    except Exception as e:
        print(e)
    # Filter out uninformative stopwords
    stopwords = STOPWORDS.copy()
    stopwords.update(['電影', '一部', '一個', '沒有', '什么', '有點',
                      '感覺', '海王', '就是', '覺得', 'DC'])
    bg_image = plt.imread('hai.jpeg')  # mask image, also used for recoloring
    font_path = '/System/Library/Fonts/STHeiti Light.ttc'  # a font with CJK glyphs
    wc = WordCloud(width=1024, height=768, background_color='white', mask=bg_image,
                   font_path=font_path, stopwords=stopwords, max_font_size=400,
                   random_state=50)
    wc.generate(comments)
    images_colors = ImageColorGenerator(bg_image)
    plt.figure()
    plt.imshow(wc.recolor(color_func=images_colors))
    plt.axis('off')
    plt.show()
To sum up: if you haven't seen it yet, join me this weekend and chip in at the box office! Hahaha