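The data comes from Kaggle: https://www.kaggle.com/lava18/google-play-store-apps
The dataset mainly contains each app's name, category, number of user reviews, rating, price, size, and so on. The data processing and visualization here are done in Python. This is my first small project, and I hope to do better in the future.
First, import the relevant libraries and load the data.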
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'E:\PYTHON\google\google-play-store-apps\googleplaystore.csv')
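Next, clean and preprocess the data.
# First, drop duplicate rows (keep one row per app name)
df.drop_duplicates(subset='App', inplace=True)
# A comparison with np.nan is always True, so use notna() to drop missing values
df = df[df['Android Ver'].notna()]
df = df[df['Android Ver'] != 'NaN']
# Drop malformed rows where 'Installs' holds a Type value instead of a count
df = df[df['Installs'] != 'Free']
df = df[df['Installs'] != 'Paid']
df = df.dropna(subset=['Type', 'Content Rating', 'Current Ver', 'Android Ver'])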
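One pitfall in this cleaning step is worth spelling out: NaN never compares equal to anything, including itself, so a filter written as `!= np.nan` keeps every row. Here is a minimal sketch on a made-up frame (the `toy` data is hypothetical and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({'Android Ver': ['4.1 and up', np.nan, '5.0 and up']})

# Comparing with np.nan is True for every row, so nothing is filtered out
kept_by_comparison = toy[toy['Android Ver'] != np.nan]

# notna() (or dropna) is the reliable way to drop missing values
kept_by_notna = toy[toy['Android Ver'].notna()]

print(len(kept_by_comparison), len(kept_by_notna))
```

On the real data, the `dropna(subset=...)` call is what actually removes the rows with a missing 'Android Ver'.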
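After these steps, df.info() shows that only Rating still has missing values. Leave those alone for now and preprocess the features first.
Installs contains symbols such as '+' and ',' that need to be removed before converting to a numeric type.
Size contains 'Varies with device', 'M', and 'k', which also need to be handled before converting to a numeric type.
Price contains '$', which needs to be removed before converting to a numeric type.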
df.Reviews = df.Reviews.astype('int64')
df['Installs'] = df['Installs'].apply(lambda x :x.replace('+','') if '+' in str(x) else x)
df['Installs'] = df['Installs'].apply(lambda x : x.replace(',','') if ',' in str(x) else x)
df['Installs'] = df['Installs'].apply(lambda x : float(x))
df['Size'] = df['Size'].apply(lambda x : str(x).replace('Varies with device','NaN') if 'Varies with device' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x : str(x).replace('M','') if 'M' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x : float(str(x).replace('k',''))/1000 if 'k' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x : float(x))
df['Price'] = df['Price'].apply(lambda x : str(x).replace('$','') if '$' in str(x) else str(x))
df['Price'] = df['Price'].apply(lambda x :float(x))
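The same conversions can also be written with pandas' vectorized string methods instead of row-by-row `apply`. The following is only a sketch under assumptions: the `raw` frame is hypothetical sample data, and the regular expressions assume the three value shapes described above.

```python
import pandas as pd

# Hypothetical sample rows mimicking the raw string columns
raw = pd.DataFrame({
    'Installs': ['10,000+', '500+', '1,000,000+'],
    'Size': ['19M', '201k', 'Varies with device'],
    'Price': ['0', '$4.99', '$399.99'],
})

clean = pd.DataFrame()

# Strip '+' and ',' in one pass, then convert to float
clean['Installs'] = raw['Installs'].str.replace(r'[+,]', '', regex=True).astype(float)

# Split Size into a number and a unit; 'Varies with device' yields NaN for both
parts = raw['Size'].str.extract(r'^([\d.]+)([Mk])$')
size = pd.to_numeric(parts[0])
# Express everything in megabytes, using the same k/1000 convention as above
clean['Size'] = size.where(parts[1] != 'k', size / 1000)

# Drop the '$' sign and convert
clean['Price'] = pd.to_numeric(raw['Price'].str.replace('$', '', regex=False))

print(clean)
```

Unmatched sizes come out as NaN directly, so no intermediate 'NaN' string is needed.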
Visualize the data to observe the overall trends.
labels = df['Type'].value_counts(sort=True).index
sizes = df['Type'].value_counts(sort=True)
explode = (0.1,0)
plt.pie(sizes,explode = explode,labels = labels,autopct='%1.1f%%',startangle=270)
plt.title('Payment category')
df_counts = df.groupby(['Category','Type']).size().unstack().sort_values(by = 'Free',ascending=False)
df_counts.plot.bar(figsize = (16,7))
plt.ylabel('Counts')
plt.legend(fontsize = 20)
df_counts['free_proportion'] = df_counts['Free']/(df['Type'].value_counts()['Free'])
df_counts['paid_proportion'] = df_counts['Paid']/(df['Type'].value_counts()['Paid'])
# DataFrame.plot creates its own figure, so a separate plt.figure() call is not needed
df_counts[['free_proportion','paid_proportion']].plot(kind='bar',figsize=(16,7))
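These charts show the number of apps split by whether they are free or paid.
First chart: free apps far outnumber paid apps.
Second chart: FAMILY, GAME, and TOOLS are the top three categories by number of free apps. The number of paid apps is somewhat correlated with the number of free apps, so to look further, each category's share of all free apps and of all paid apps is computed.
Third chart: among paid apps, FAMILY still has the highest share, and the shares of GAME and TOOLS are also high. Notably, MEDICAL and PERSONALIZATION have relatively many paid apps, while BUSINESS has relatively few.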
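The per-type shares behind the third chart can also be computed in a single step. This is a small sketch on made-up data; `pd.crosstab` with `normalize='columns'` is an alternative to dividing by `value_counts()` by hand, not the approach used in the code above.

```python
import pandas as pd

# Hypothetical toy data: one row per app
toy = pd.DataFrame({
    'Category': ['FAMILY', 'FAMILY', 'GAME', 'MEDICAL', 'MEDICAL', 'TOOLS'],
    'Type':     ['Free',   'Paid',   'Free', 'Paid',    'Paid',    'Free'],
})

# normalize='columns' divides each count by its column total, so each
# Type column sums to 1 and the values are shares within Free or within Paid
shares = pd.crosstab(toy['Category'], toy['Type'], normalize='columns')
print(shares)
```

Because each column sums to 1, free and paid shares can be compared directly even though free apps are far more numerous.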
Next, look at the price distribution of paid apps.
plt.figure(figsize=(16,7))
df[df.Type != 'Free']['Price'].plot(kind = 'hist',bins = 100)
plt.xlabel('Price')
Most prices are below 50, and many are even below 20. A few are above 350, so let's see what those are.
df_highprice = df[df['Price'] > 350]
print(df_highprice.App)
4197 most expensive app (H)
4362 ?? I'm rich
4367 I'm Rich - Trump Edition
5351 I am rich
5354 I am Rich Plus
5356 I Am Rich Premium
5357 I am extremely Rich
5358 I am Rich!
5359 I am rich(premium)
5362 I Am Rich Pro
5364 I am rich (Most expensive app)
5366 I Am Rich
5369 I am Rich
5373 I AM RICH PRO PLUS
9917 Eu Sou Rico
9934 I'm Rich/Eu sou Rico/??? ???/我很有錢
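These apps basically all share the same name, just in different language versions around the world. Their only purpose is to show that the person who downloaded them is rich.
"I'm Rich" is extremely simple in function; you could even say it has none. After opening the app, all you see is a glowing red gem.
Next, analyze the rating. First remove the rows where Rating is null, then analyze it together with installs, number of reviews, and target audience.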
# .copy() avoids SettingWithCopyWarning when new columns are added below
df_rating = df[df['Rating'].notnull()].copy()
# Buckets are non-overlapping: 'low' ends at 1000 and 'mid' starts at 1001
df_rating.loc[df_rating['Installs'] < 101, 'downloads'] = 'very low'
df_rating.loc[(df_rating['Installs'] < 1001) & (df_rating['Installs'] >= 101), 'downloads'] = 'low'
df_rating.loc[(df_rating['Installs'] < 1000001) & (df_rating['Installs'] >= 1001), 'downloads'] = 'mid'
df_rating.loc[(df_rating['Installs'] < 10000001) & (df_rating['Installs'] >= 1000001), 'downloads'] = 'high'
df_rating.loc[(df_rating['Installs'] < 1000000001) & (df_rating['Installs'] >= 10000001), 'downloads'] = 'very high'
g = sns.catplot(x='downloads',y='Rating',data = df_rating,kind='box',height=10,palette='Set1',order=['very high','high','mid','low','very low'])
g.despine(left=True)
g.set_ylabels('rating')
# Select 'Rating' before mean() so non-numeric columns are not aggregated
print(df_rating.groupby('downloads')['Rating'].mean())
# Same bucketing for review counts; 'low' ends at 1000 and 'mid' starts at 1001
df_rating.loc[df_rating['Reviews'] < 101, 'Number of comments '] = 'very low'
df_rating.loc[(df_rating['Reviews'] < 1001) & (df_rating['Reviews'] >= 101), 'Number of comments '] = 'low'
df_rating.loc[(df_rating['Reviews'] < 1000001) & (df_rating['Reviews'] >= 1001), 'Number of comments '] = 'mid'
df_rating.loc[(df_rating['Reviews'] < 10000001) & (df_rating['Reviews'] >= 1000001), 'Number of comments '] = 'high'
df_rating.loc[(df_rating['Reviews'] < 78158307) & (df_rating['Reviews'] >= 10000001), 'Number of comments '] = 'very high'
df_reviews = df_rating.groupby('Number of comments ')['Rating'].agg(['count','mean']).reset_index()
plt.figure(figsize = (16,7))
plt.bar(x=df_reviews['Number of comments '],height = df_reviews['mean'],width = 0.03,zorder = 1)
plt.scatter(df_reviews['Number of comments '],df_reviews['mean'],s =list((df_reviews['count'].values/3).astype(int)),color='red',zorder = 2,marker = '*')
plt.ylim(0,5)
plt.ylabel('Rating')
plt.xlabel('Number of comments (red star size reflects the number of apps in each group)')
df_reviews.sort_values('mean')
downloads rating
high 4.271049
low 4.088268
mid 4.119602
very high 4.353456
very low 4.421136
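The higher the install count, the more stable an app's rating tends to be. Apart from apps with very few installs, more installs go together with higher ratings. This may indirectly suggest that apps with many installs also tend to have a good reputation.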
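The install buckets used for this comparison were assigned with chained `.loc` masks. `pd.cut` is a more compact way to express the same idea. In the sketch below, the `installs` values are hypothetical and the bin edges are assumptions chosen to mirror the buckets that the code above ends up producing.

```python
import pandas as pd

# Hypothetical install counts
installs = pd.Series([50, 500, 5000, 1_000_000, 5_000_000, 1_000_000_000])

# Right-closed bins: (-1, 100], (100, 1000], (1000, 1e6], (1e6, 1e7], (1e7, 1e9]
# These edges are assumptions chosen to match the thresholds used earlier
edges = [-1, 100, 1_000, 1_000_000, 10_000_000, 1_000_000_000]
labels = ['very low', 'low', 'mid', 'high', 'very high']

downloads = pd.cut(installs, bins=edges, labels=labels)
print(downloads.tolist())
```

Because the result is an ordered categorical, grouping and plotting follow the order of `labels` without an explicit `order` argument.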
Number of comments count mean
1 low 1486 4.035599
4 very low 1906 4.091815
2 mid 4449 4.234390
3 very high 30 4.403333
0 high 319 4.428527
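Most apps have review counts between about 100 and 1,000,000, and the review count is also related to the rating: apps with more reviews tend to have relatively higher ratings. These results suggest that an app which attracts more downloads or reviews also tends to have a better rating and reputation. Many popular apps seem to attract a lot of complaints, but that is really just because their user base is so large.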
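One way to put a number on this kind of relationship is a rank correlation. This is only a sketch on made-up values, so it says nothing about the real result; on the actual data the two series would be `df_rating['Reviews']` and `df_rating['Rating']`.

```python
import pandas as pd

# Hypothetical values; on the real data these would be
# df_rating['Reviews'] and df_rating['Rating']
reviews = pd.Series([12, 150, 900, 20_000, 450_000, 3_000_000])
rating = pd.Series([4.0, 3.9, 4.1, 4.2, 4.4, 4.3])

# Spearman correlation is the Pearson correlation of the ranks, which
# suits heavily skewed counts such as review numbers
rho = reviews.rank().corr(rating.rank())
print(round(rho, 3))
```

A value near 1 would support the pattern seen in the grouped means, while a value near 0 would suggest the buckets overstate it.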
Let's look at the apps that have both a high install count and a high review count.
df_heat = df.loc[(df['Installs']>10000001)&(df['Reviews']>10000000),:]
g = df_heat.groupby('Category').size().sort_values(ascending=False)
plt.figure(figsize=(12,7))
plt.bar(g.index,g.values,color = 'c',edgecolor='black',width=0.5)
plt.xticks(rotation=90)
plt.ylabel('Counts')
plt.ylim(0,13)
#plt.grid(axis = 'y')
for a,b in zip(g.index,g.values):
    plt.text(a,b+0.3,b,ha='center')
plt.show()
#df_heat.to_csv()
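That is the rough picture of the Google Play app store. Since this is my first attempt, it may be short on good ideas and the thread of the analysis is not very clear. I will post a tidier version next time; this is just a start.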