Python in the left hand, R in the right

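Preface

I have been learning Python recently, and I want to use a real case to write about the differences between Python and R for data analysis.
I am not yet very fluent in Python, so the Python code comes from a highly voted Kaggle post.
Here I only port it to an R version. After porting the Python code, my impression is that Python is not as capable as R for data analysis and visualization. The code is posted here as is; when I get the chance, I will write about how to analyze data with tidyverse.
Most of the code is covered here; I skipped the remaining parts whose workflow is much the same. As for the machine learning part, I may write up a basic framework with tidymodels when I am in the mood.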


Netflix is a streaming service that keeps growing in popularity, shows, and content. This is an EDA
(exploratory data analysis), a piece of storytelling through its data, along with a content-based
recommendation system and a wide range of graphs and visuals.


The python source code is from here

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

In R

library(tidyverse)
library(skimr)
# Loading the dataset
data <- tidytuesdayR::tt_load('2021-04-20')
netfix_dta <- data$netflix_titles
# Install a Python module from R if your Python environment lacks it
# reticulate::py_install('seaborn', pip = TRUE)

Pass the data from R to Python in RStudio (reticulate exposes R objects to Python chunks through the r object)


netflix_overall=r.netfix_dta
netflix_overall.head()

Also, you can do the same thing using R

head(netfix_dta)

Or

glimpse(netfix_dta)

Therefore, it is clear that the dataset contains 12 columns for exploratory analysis.

netflix_overall.count()

In R, skim() from the skimr package gives a fuller summary in a single call, including missing-value counts per column.

skim(netfix_dta)
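For readers who stay on the Python side, a rough analogue of skim() can be assembled from a few pandas calls. The sketch below is only an illustration on a made-up three-row frame (the name toy and its values are hypothetical, not taken from the Netflix data); it collects the dtype, non-null count, and missing count for each column.

```python
import pandas as pd

# Hypothetical toy frame standing in for netflix_overall.
toy = pd.DataFrame(
    {
        "type": ["Movie", "TV Show", "Movie"],
        "director": ["A. Director", None, "B. Director"],
        "release_year": [2019, 2020, 2021],
    }
)

# One row per column: dtype, non-null count, and number of missing values.
summary = pd.DataFrame(
    {
        "dtype": toy.dtypes.astype(str),
        "non_null": toy.count(),
        "missing": toy.isna().sum(),
    }
)

print(toy.shape)  # (rows, columns)
print(summary)
```

This is far less complete than skim(), which also reports distributions, but it answers the two questions asked most often: how many columns there are and where the gaps are.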

netflix_shows=netflix_overall[netflix_overall['type']=='TV Show']
netflix_shows.head()

In R, you can chain steps with the pipe (%>%), which makes the script easier to read.

netflix_shows <- netfix_dta %>%
  filter(type == "TV Show")

head(netflix_shows)
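pandas supports a comparable style through method chaining and DataFrame.pipe. The following is a minimal sketch on hypothetical toy data; the helper only_type is an illustrative name and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for netflix_overall.
toy = pd.DataFrame(
    {"title": ["A", "B", "C"], "type": ["Movie", "TV Show", "TV Show"]}
)


def only_type(df, kind):
    """Keep the rows whose `type` column equals `kind` (hypothetical helper)."""
    return df[df["type"] == kind]


# Each step feeds the next one, similar in spirit to %>% in R.
shows = (
    toy
    .pipe(only_type, "TV Show")
    .sort_values("title")
    .reset_index(drop=True)
)
print(shows)
```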

netflix_movies=netflix_overall[netflix_overall['type']=='Movie']

In R

netflix_movies <- netfix_dta %>%
  filter(type == "Movie")

Analysis of Movies vs TV Shows.


sns.set(style="darkgrid") 
ax = sns.countplot(x="type", data=netflix_overall, palette="Set2")
plt.show()

In R

netfix_dta %>% 
  ggplot(aes(x = fct_rev(type), fill = type)) + 
  geom_bar() + 
  theme_bw()

It is evident that there are more Movies on Netflix than TV shows.

If a producer wants to release some content, in which month should they do so? (The month when the least content is added)

netflix_date = netflix_shows[['date_added']].dropna()
netflix_date['year'] = netflix_date['date_added'].apply(lambda x : x.split(', ')[-1])
netflix_date['month'] = netflix_date['date_added'].apply(lambda x : x.lstrip().split(' ')[0])

month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'][::-1]
df = netflix_date.groupby('year')['month'].value_counts().unstack().fillna(0)[month_order].T
plt.figure(figsize=(10, 7), dpi=200)
plt.pcolor(df, cmap='afmhot_r', edgecolors='white', linewidths=2) # heatmap
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns, fontsize=7, fontfamily='serif')
plt.yticks(np.arange(0.5, len(df.index), 1), df.index, fontsize=7, fontfamily='serif')

plt.title('Netflix Contents Update', fontsize=12, fontfamily='calibri', fontweight='bold', position=(0.20, 1.0+0.02))
cbar = plt.colorbar()

cbar.ax.tick_params(labelsize=8) 
cbar.ax.minorticks_on()
plt.show()

In R

library(lubridate)
library(viridis)

netfix_dta %>% 
  select(date_added) %>% 
  mutate(date_added = mdy(date_added),
         month = month(date_added, label = TRUE, abbr = FALSE),
         year = year(date_added)) %>% 
  group_by(year, month) %>% 
  filter(!is.na(month)) %>% 
  summarise(contents = n()) %>% 
  ggplot(aes(x = year, y = fct_rev(month), fill = contents)) + 
  geom_tile() + 
  viridis::scale_fill_viridis(option = "A") + 
  labs(title = 'Netflix Contents Update',
       x = '',
       y = '')
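The Python version above extracts the year and month by splitting strings, while the R version parses real dates with mdy(). A hedged pandas equivalent of the date-parsing approach is sketched below on hypothetical toy values; it assumes the "Month day, year" format seen in date_added and builds the month-by-year count table that a heatmap would be drawn from.

```python
import pandas as pd

# Toy stand-in for the date_added column (hypothetical values).
dates = pd.Series(
    ["August 14, 2020", " January 1, 2019", "August 2, 2019", None],
    name="date_added",
)

# Drop missing entries, strip stray leading spaces, then parse to real dates.
parsed = pd.to_datetime(dates.dropna().str.strip(), format="%B %d, %Y")

# Name the two derived series explicitly so the table axes are labelled.
months = parsed.dt.month_name().rename("month")
years = parsed.dt.year.rename("year")

# Count titles per (month, year) cell; absent combinations become 0.
table = pd.crosstab(months, years)
print(table)
```

Parsing to dates first means malformed strings fail loudly at one place instead of silently producing odd month or year labels.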

Movie ratings analysis

plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(x="rating", data=netflix_movies, palette="Set2", order=netflix_movies['rating'].value_counts().index[0:15])
plt.show()

In R

# Use the movie subset so the chart matches the Python version
netflix_movies %>% 
  group_by(rating) %>% 
  summarise(n = n()) %>% 
  filter(!is.na(rating)) %>% 
  ggplot(aes(x = fct_reorder(rating,n, .desc = TRUE), y = n, fill = rating)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  scale_y_continuous(expand = expansion(c(0,.1))) + 
  labs(
    x = 'Rating',
    y = 'Count'
  )

Analysing IMDB ratings to get top rated movies on Netflix

imdb_ratings=pd.read_csv('netflix/IMDb ratings.csv',usecols=['weighted_average_vote'])

imdb_titles=pd.read_csv('netflix/IMDb movies.csv', usecols=['title','year','genre'])

ratings = pd.DataFrame({'Title':imdb_titles.title, 'Release Year':imdb_titles.year, 'Rating': imdb_ratings.weighted_average_vote, 'Genre':imdb_titles.genre})
ratings.drop_duplicates(subset=['Title','Release Year','Rating'], inplace=True)
ratings.shape

ratings.head()

In R

imdb_ratings <- read_csv('netflix/IMDb ratings.csv') %>% 
  select(1,2)
imdb_titles <- read_csv('netflix/IMDb movies.csv') %>% 
  select(1, title, year, genre)

ratings <- left_join(imdb_titles, imdb_ratings, by = "imdb_title_id") %>% 
  select(-1) %>% 
  select(1:3,Rating = "weighted_average_vote")
ratings
dim(ratings)

In Python

# dropna() returns a new DataFrame, so assign the result back
ratings = ratings.dropna()
joint_data=ratings.merge(netflix_overall,left_on='Title',right_on='title',how='inner')
joint_data=joint_data.sort_values(by='Rating', ascending=False)

joint_data.head()
joint_data.shape

In R

joint_data <- ratings %>% 
  drop_na() %>%   # drop rows with any missing value (tidyr)
  inner_join(., netfix_dta, by = "title") %>% 
  arrange(desc(Rating))

dim(joint_data)
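Two details of the join are easy to miss: dropna() does not modify a frame in place, and an inner join keeps only titles present in both tables. The sketch below illustrates both on hypothetical toy frames (the titles, ratings, and countries are made up).

```python
import pandas as pd

# Hypothetical stand-ins for the IMDb ratings and the Netflix titles.
ratings = pd.DataFrame({"Title": ["A", "B", "C"], "Rating": [8.1, None, 6.5]})
netflix = pd.DataFrame(
    {"title": ["A", "C", "D"], "country": ["India", "Spain", "Canada"]}
)

# dropna() returns a new frame; the original keeps its missing row
# unless the result is assigned.
clean = ratings.dropna()

# Inner join on the title columns, then sort by rating, highest first.
joint = (
    clean
    .merge(netflix, left_on="Title", right_on="title", how="inner")
    .sort_values(by="Rating", ascending=False)
)
print(len(ratings), len(clean))
print(joint)
```

Title "B" is lost because its rating is missing, and "D" is lost because it has no IMDb match, so only "A" and "C" survive the join.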

In Python

import plotly.express as px
top_rated=joint_data[0:10]
top_rated
fig =px.sunburst(
    top_rated,
    path=['title','country'],
    values='Rating',
    color='Rating')
fig.show()

In R

library(plotly)
top_rated <- joint_data[1:10,]
fig <- plot_ly(
  ids = c(top_rated$title, paste0(top_rated$title,"-",top_rated$country)),
  labels = c(top_rated$title,top_rated$country),
  parents = c(rep('',10), top_rated$title),
  colors = c(top_rated$Rating,top_rated$Rating),
  type = "sunburst",
  branchvalues = 'total'
)

fig

Alternatively, pass the R data frame top_rated to Python and reuse plotly.express

fig = px.sunburst(
    r.top_rated,
    path=['title','country'],
    values='Rating',
    color='Rating')
fig.show()

Countries with highest rated content.

country_count=joint_data['country'].value_counts().sort_values(ascending=False)
country_count=pd.DataFrame(country_count)
topcountries=country_count[0:11]
topcountries

In R

topcountries <- joint_data %>% 
  group_by(country) %>% 
  summarise(n = n()) %>% 
  arrange(desc(n)) %>% 
  filter(!is.na(country))

In Python

import plotly.express as px
data = dict(
    number=[1063,619,135,60,44,41,40,40,38,35],
    country=["United States", "India", "United Kingdom", "Canada", "Spain",'Turkey','Philippines','France','South Korea','Australia'])
fig = px.funnel(data, x='number', y='country')
fig.show()

In R

# py$data pulls the Python dict defined above into R
library(reticulate)
data <- py$data %>% 
  as.data.frame() %>% 
  arrange(desc(number))

plot_ly(
  y = data$country,
  x = data$number,
  type = "funnel"
) %>% 
  layout(yaxis = list(categoryarray = data$country))
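The counts fed to the funnel chart above are typed in by hand. As a hedged alternative, the same dictionary shape can be derived from the joined table; the sketch below uses a hypothetical toy stand-in for joint_data (the name joint_toy and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical stand-in for the country column of joint_data.
joint_toy = pd.DataFrame(
    {
        "country": [
            "India", "United States", "India", None,
            "United States", "India", "Spain",
        ]
    }
)

# value_counts() skips missing values and sorts from most to least frequent.
counts = joint_toy["country"].value_counts().head(10)

# Same layout as the hand-written dict: parallel lists of numbers and names.
data = {"number": counts.tolist(), "country": counts.index.tolist()}
print(data)
```

A dictionary built this way can be passed to px.funnel(data, x='number', y='country') exactly like the hard-coded one, and it stays in sync if the dataset changes.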

Year-wise analysis

plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(y="release_year", data=netflix_movies, palette="Set2", order=netflix_movies['release_year'].value_counts().index[0:15])
plt.show()

In R

netflix_movies %>% 
  group_by(release_year) %>% 
  summarise(n = n()) %>% 
  arrange(desc(n)) %>% 
  slice(1:15) %>% 
  mutate(release_year = factor(release_year, levels = release_year)) %>% 
  ggplot(aes(y = fct_rev(release_year), x = n, fill = release_year)) + 
  geom_bar(stat = "identity",show.legend = FALSE) + 
  ggsci::scale_fill_simpsons()

Analysis of movie durations

netflix_movies['duration']=netflix_movies['duration'].str.replace(' min','')
netflix_movies['duration']=netflix_movies['duration'].astype(str).astype(int)
netflix_movies['duration']
plt.figure(figsize=(8,8))
sns.set(style="darkgrid")
sns.kdeplot(data=netflix_movies['duration'], fill=True)  # fill replaces the deprecated shade argument
plt.show()

In R

netflix_movies %>% 
  mutate(duration = str_remove(duration, " min") %>% as.double()) %>% 
  ggplot(aes(x = duration)) + 
  geom_density(fill = "blue2",alpha = .4) + 
  ggthemes::theme_solarized()
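Both versions assume every duration looks like "90 min". A slightly more defensive pandas sketch, shown here on hypothetical toy values, turns anything that cannot be parsed into NaN instead of raising an error during the integer conversion.

```python
import pandas as pd

# Hypothetical stand-in for the duration column of netflix_movies.
durations = pd.Series(["90 min", "125 min", None], name="duration")

# Remove the literal suffix, then convert; unparseable entries become NaN.
minutes = pd.to_numeric(
    durations.str.replace(" min", "", regex=False),
    errors="coerce",
)
print(minutes.describe())
```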

In Python

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image

from collections import Counter

genres=list(netflix_movies['listed_in'])
gen=[]

for i in genres:
    i=list(i.split(','))
    for j in i:
        gen.append(j.replace(' ',""))
g=Counter(gen)

text = list(set(gen))
plt.rcParams['figure.figsize'] = (13, 13)

wordcloud = WordCloud(max_words=1000000,background_color="white").generate(str(text))

plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.show()

In R

library(wordcloud)
library(tidytext)
set.seed(2021)
netflix_movies %>% 
  unnest_tokens(word, listed_in) %>% 
  count(word, sort = TRUE) %>% 
  with(wordcloud(word, n, max.words = 100))


In Python

import matplotlib  # needed for matplotlib.use(); only pyplot was imported earlier
matplotlib.use('TkAgg')
g={k: v for k, v in sorted(g.items(), key=lambda item: item[1], reverse= True)}
g
fig, ax = plt.subplots()

x=list(g.keys())
y=list(g.values())
ax.vlines(x, ymin=0, ymax=y, color='green')
ax.plot(x,y, "o", color='maroon')
ax.tick_params(axis='x', rotation=90)  # rotate the genre labels
ax.set_ylabel("Count of movies")
# set a title
ax.set_title("Genres")
plt.show()

In R

# py$g pulls the sorted genre counts from Python into R
g <- py$g %>% unlist() %>% data.frame() %>% select(n = ".")

g %>% 
  mutate(name = rownames(g),
         name = fct_reorder(name, n, .desc = TRUE)) %>% 
  ggplot(aes(x = name, y = n)) + 
  geom_segment(aes(x = name, xend = name, y= 0, yend = n)) + 
  geom_point(size = 5, color = 'orange') + 
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1)
  )
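The genre tally behind this chart comes from splitting the comma-separated listed_in strings and counting the pieces. A compact standard-library sketch of that step is shown below on hypothetical toy entries; note that it uses strip() rather than replace(' ', ''), so multi-word genres keep their internal spaces, which is a deliberate deviation from the Python code above.

```python
from collections import Counter

# Hypothetical stand-in for the listed_in column.
listed_in = [
    "Dramas, International Movies",
    "Comedies, Dramas",
    "Dramas, Independent Movies",
]

# Split each entry on commas, trim surrounding spaces, and count genres.
genre_counts = Counter(
    genre.strip()
    for entry in listed_in
    for genre in entry.split(",")
)

# most_common() already returns the genres sorted by descending count.
for genre, n in genre_counts.most_common():
    print(genre, n)
```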

Lowest number of seasons.

features=['title','duration']
durations= netflix_shows[features].copy()  # copy so the new column is not written to a view

durations['no_of_seasons']=durations['duration'].str.replace(' Season','')

#durations['no_of_seasons']=durations['no_of_seasons'].astype(str).astype(int)
durations['no_of_seasons']=durations['no_of_seasons'].str.replace('s','')
durations['no_of_seasons']=durations['no_of_seasons'].astype(str).astype(int)

t=['title','no_of_seasons']
top=durations[t]

top=top.sort_values(by='no_of_seasons', ascending=False)
bottom=top.sort_values(by='no_of_seasons')
bottom=bottom[20:50]

import plotly.graph_objects as go
# Set the width and height of the figure
plt.figure(figsize=(15,15))
fig = go.Figure(data=[go.Table(header=dict(values=['Title', 'No of seasons']), cells=dict(values=[bottom['title'],bottom['no_of_seasons']],fill_color='lavender'))])
fig.show()

In R

library(kableExtra)
netflix_shows %>% 
  select(title, duration) %>% 
  separate(duration, into = c('duration','season'), sep = ' ') %>% 
  mutate(duration = as.numeric(duration)) %>% 
  arrange(desc(duration)) %>% 
  kbl() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
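The Python version strips " Season" and then "s" in two passes to reach the number. As a hedged alternative, a single regular expression can pull the digits out directly; the sketch below uses hypothetical toy shows (the name shows_toy and its values are assumptions) and picks the titles with the fewest seasons.

```python
import pandas as pd

# Hypothetical stand-in for the title and duration columns of netflix_shows.
shows_toy = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "duration": ["1 Season", "3 Seasons", "2 Seasons"],
    }
)

# Extract the leading number of seasons and convert it to an integer.
shows_toy["no_of_seasons"] = (
    shows_toy["duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

# The two shows with the fewest seasons.
fewest = shows_toy.nsmallest(2, "no_of_seasons")[["title", "no_of_seasons"]]
print(fewest)
```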
