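First, the data used in this post was downloaded from the IMDB 5000 Movie Dataset on Kaggle.
The original author scraped more than 5,000 observations from IMDB and then used regression to model the IMDB score of each movie. The author's article is:
Predict Movie Rating - NYC Data Science Academy Blog
This post uses that dataset to complete the homework for Lecture 4 of the big data analysis course, "Complex Data and Analysis", as hands-on practice with the material covered in that lecture.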
Loading the required packages
library(ggplot2)
library(stringr)
library(dplyr)
Importing the data
# Root path of the current project
# e.g. G:/DataCruiser/workspace/IMDB Analysis
projectPath <- getwd()
# Path to movie_metadata.csv
# e.g. G:/DataCruiser/workspace/IMDB Analysis/data/movie_metadata.csv
servicePath <- str_c(projectPath, "data", "movie_metadata.csv", sep = "/")
# Import the data
movies <- read.csv(servicePath, header = TRUE, stringsAsFactors = FALSE)
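The path construction above can also be done with base R alone. The following is a minimal sketch, assuming the same `data/movie_metadata.csv` layout; the names `project_path` and `data_path` are illustrative and not from the original code. It checks that the file exists before `read.csv` is called, so a wrong working directory shows up early.

```r
# Build the path with base R's file.path(), which joins components with "/".
project_path <- getwd()
data_path <- file.path(project_path, "data", "movie_metadata.csv")

# Report a missing file before trying to read it.
if (file.exists(data_path)) {
  movies <- read.csv(data_path, header = TRUE, stringsAsFactors = FALSE)
} else {
  message("movie_metadata.csv not found at: ", data_path)
}
```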
Processing the director and movie score data
disDirector <- function(){
  # Select the subset of columns
  mymovies <- select(movies,
                     title_year,
                     imdb_score,
                     director_facebook_likes,
                     actor_1_facebook_likes)
  # Rename columns: new name on the left of the equals sign, old name on the right
  mymovies <- rename(mymovies,
                     year = title_year,
                     scores = imdb_score,
                     directorlikes = director_facebook_likes,
                     actorlikes = actor_1_facebook_likes)
  # Drop rows with missing values
  mymovies <- filter(mymovies,
                     !is.na(year),
                     !is.na(scores),
                     !is.na(directorlikes),
                     !is.na(actorlikes))
  # Sort the data
  mymovies <- arrange(mymovies, desc(year))
  # Aggregate by year: mean director Facebook likes vs. mean IMDB score of the movies
  # Note: each %>% must end a line; a line that starts with %>% is a syntax error in R
  result <- mymovies %>%
    group_by(year) %>%
    summarise(count = n(),
              mean_scores = mean(scores, na.rm = TRUE),
              mean_likes = mean(directorlikes, na.rm = TRUE)) %>%
    filter(count > 0)  # minimum number of movies per year; 0 keeps every year
  return(result)
}
Processing the actor and movie score data
disActor <- function(){
  # Select the subset of columns
  mymovies <- select(movies,
                     title_year,
                     imdb_score,
                     director_facebook_likes,
                     actor_1_facebook_likes)
  # Rename columns: new name on the left of the equals sign, old name on the right
  mymovies <- rename(mymovies,
                     year = title_year,
                     scores = imdb_score,
                     directorlikes = director_facebook_likes,
                     actorlikes = actor_1_facebook_likes)
  # Drop rows with missing values
  mymovies <- filter(mymovies,
                     !is.na(year),
                     !is.na(scores),
                     !is.na(directorlikes),
                     !is.na(actorlikes))
  # Sort the data
  mymovies <- arrange(mymovies, desc(year))
  # Aggregate by year: mean Facebook likes of the lead actor (actor 1) vs. mean IMDB score of the movies
  result <- mymovies %>%
    group_by(year) %>%
    summarise(count = n(),
              mean_scores = mean(scores, na.rm = TRUE),
              mean_likes = mean(actorlikes, na.rm = TRUE)) %>%
    filter(count > 0)  # minimum number of movies per year; 0 keeps every year
  return(result)
}
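The two functions above differ only in which likes column they average, so they could be folded into one parameterized helper. The following is a rough sketch of that idea, not the original author's code: `yearly_summary`, its `likes_col` and `min_count` arguments, and the toy `demo` data frame are all made up for illustration. It also differs in one detail: it drops rows with a missing value only in the selected likes column, whereas the functions above drop rows missing either likes column.

```r
library(dplyr)

# Hypothetical helper (not part of the original post): yearly mean score and
# yearly mean of one likes column, keeping years with at least min_count movies.
yearly_summary <- function(movies, likes_col, min_count = 1) {
  movies %>%
    select(year = title_year, scores = imdb_score, likes = all_of(likes_col)) %>%
    filter(!is.na(year), !is.na(scores), !is.na(likes)) %>%
    group_by(year) %>%
    summarise(count = n(),
              mean_scores = mean(scores),
              mean_likes = mean(likes)) %>%
    filter(count >= min_count) %>%
    arrange(desc(year))
}

# Toy data invented for this sketch; not taken from movie_metadata.csv.
demo <- data.frame(
  title_year = c(2000, 2000, 2001, NA),
  imdb_score = c(6, 8, 7, 5),
  director_facebook_likes = c(100, 300, 50, 10),
  actor_1_facebook_likes = c(1000, 3000, NA, 20)
)

director_yearly <- yearly_summary(demo, "director_facebook_likes")
actor_yearly <- yearly_summary(demo, "actor_1_facebook_likes")
```

With the real data, `yearly_summary(movies, "director_facebook_likes")` would stand in for `disDirector()`, apart from the missing-value difference noted above.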
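Plotting director likes against score
# Scatter plot of director likes vs. score
# disDirector is a function, so it must be called to get the summarised data frame
directorView <- ggplot(data = disDirector()) +
  geom_point(mapping = aes(x = mean_likes, y = mean_scores)) +
  geom_smooth(mapping = aes(x = mean_likes, y = mean_scores))
The result is shown below:
movieScore vs direcetorLikes.jpg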
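Plotting actor likes against score
# Scatter plot of actor likes vs. score
# disActor is a function, so it must be called to get the summarised data frame
actorView <- ggplot(data = disActor()) +
  geom_point(mapping = aes(x = mean_likes, y = mean_scores)) +
  geom_smooth(mapping = aes(x = mean_likes, y = mean_scores))
The result is shown below:
movieScore vs actorLikes.jpg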
Saving the results
# Save the director plot
outputpath <- str_c(projectPath, "output", "movieScore vs direcetorLikes.jpg", sep = "/")
ggsave(filename = outputpath, plot = directorView)
# Save the actor plot
outputpath <- str_c(projectPath, "output", "movieScore vs actorLikes.jpg", sep = "/")
ggsave(filename = outputpath, plot = actorView)
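The save step above writes into an `output` folder, so that folder needs to exist before the files are written. The following is a small sketch of creating it first. It uses a temporary directory so it runs without touching a real project, and `output_dir` and `output_path` are illustrative names; in the setting of this post the base path would be `projectPath`.

```r
# Assumption for this sketch: tempdir() stands in for the project root.
output_dir <- file.path(tempdir(), "output")

# Create the folder only if it is missing; recursive = TRUE also creates parents.
if (!dir.exists(output_dir)) {
  dir.create(output_dir, recursive = TRUE)
}

output_path <- file.path(output_dir, "movieScore vs actorLikes.jpg")
# ggsave(filename = output_path, plot = actorView)  # actorView as built earlier in this post
```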
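Analysis of the results
Assuming that the IMDB score reflects how good a movie is, the analysis of the 5,000+ IMDB records leads to the following preliminary conclusions:
- Overall, in the yearly averages, the number of Facebook likes a director has is positively correlated with movie quality, while the number of Facebook likes of the lead actor (actor 1) is negatively correlated with it; judging a movie by its director tends to be the more reliable approach;
- Some non-mainstream directors have few Facebook likes, but that does not rule out their making good movies.
One caveat: years with a small count were not removed here. Setting a different noise threshold gives slightly different conclusions, and the actor trend in particular changes considerably.
In addition, the source code and output for this post have been uploaded to: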
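To make the threshold caveat concrete, the sketch below shows one way to check how the correlation between `mean_likes` and `mean_scores` moves as the minimum `count` changes. The data frame `yearly` is toy data invented for illustration, not output from the IMDB dataset, and the use of Spearman rank correlation via base R's `cor` is an assumption of this sketch rather than the method used in this post.

```r
# Toy yearly summary, invented for this sketch; the first row is a year with
# a single movie whose values run against the trend of the other years.
yearly <- data.frame(
  year        = c(1990, 2000, 2005, 2010),
  count       = c(1, 5, 10, 20),
  mean_likes  = c(100, 20, 30, 40),
  mean_scores = c(2, 6, 7, 8)
)

# Spearman correlation between likes and scores for years with enough movies.
cor_at_threshold <- function(df, min_count) {
  kept <- df[df$count >= min_count, ]
  cor(kept$mean_likes, kept$mean_scores, method = "spearman")
}

rho_all      <- cor_at_threshold(yearly, 1)  # keeps every year
rho_filtered <- cor_at_threshold(yearly, 5)  # drops years with fewer than 5 movies
```

In this toy case a single low-count year is enough to flip the sign of the correlation, which is the kind of sensitivity described above.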
jijiwhywhy/IMDB-Analysis