標(biāo)簽(空格分隔): 數(shù)據(jù)分析 python 數(shù)據(jù)挖掘
MovieLens 1M數(shù)據(jù)集
一組從20世紀(jì)90年末到21世紀(jì)初由MovieLens用戶提供的電影評(píng)分?jǐn)?shù)據(jù)佩厚。這些數(shù)據(jù)中包括電影評(píng)分、電影元數(shù)據(jù)(風(fēng)格類型和年代)以及關(guān)于用戶的人口統(tǒng)計(jì)學(xué)數(shù)據(jù)(年齡锅纺、郵編、性別和職業(yè)等)羡铲。
MovieLens 1M數(shù)據(jù)集含有來(lái)自6000名用戶對(duì)4000部電影的100萬(wàn)條評(píng)分?jǐn)?shù)據(jù)弧岳。分為三個(gè)表:評(píng)分、用戶信息和電影信息今艺。
以下代碼,通過(guò)pandas.read_table將各個(gè)表分別讀到一個(gè)pandas DataFrame對(duì)象中:
import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('G:\\lcw\\movielens\\users.dat', sep='::', header=None, names=unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('G:\\lcw\\movielens\\ratings.dat', sep='::', header=None, names=rnames)
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('G:\\lcw\\movielens\\movies.dat', sep='::', header=None, names=mnames)
# 如果是讀取CSV的數(shù)據(jù)格式(說(shuō)明):
import pandas as pd
# Reading data locally
df = pd.read_csv('/Users/al-ahmadgaidasaad/Documents/d.csv')
# Reading data from web
data_url = "https://raw.githubusercontent.com/alstat/Analysis-with-Programming/master/2014/Python/Numerical-Descriptions-of-the-Data/data.csv"
df = pd.read_csv(data_url)
問(wèn)題:對(duì)分布在三個(gè)表的數(shù)據(jù)進(jìn)行分析同時(shí)進(jìn)行分析很難爵卒,那必須將所有的數(shù)據(jù)都合并到一個(gè)表中進(jìn)行分析,下面撵彻,用pandas的merge函數(shù)將ratings跟users合并到一起钓株,然后再將movies也合并進(jìn)去实牡。pandas會(huì)根據(jù)列名的重疊情況推斷出哪些列是合并(或連接)鍵:
>>> data = pd.merge(pd.merge(ratings,users),movies)
>>> data
user_id movie_id rating timestamp gender age occupation zip \
0 1 1193 5 978300760 F 1 10 48067
1 2 1193 5 978298413 M 56 16 70072
2 12 1193 4 978220179 M 25 12 32793
3 15 1193 4 978199279 M 25 7 22903
4 17 1193 5 978158471 M 50 1 95350
5 18 1193 4 978156168 F 18 3 95825
6 19 1193 5 982730936 M 1 10 48073
7 24 1193 5 978136709 F 25 7 10023
8 28 1193 3 978125194 F 25 1 14607
9 33 1193 5 978557765 M 45 3 55421
10 39 1193 5 978043535 M 18 4 61820
11 42 1193 3 978038981 M 25 8 24502
下面對(duì)pandas進(jìn)行聚類操作:
1、按性別計(jì)算每部電影的平均得分轴合,用Pivot_table方法:
DataFrame 對(duì)象有一個(gè) .pivot_table(data, values=None, rows=None, cols=None, aggfunc='mean', fill_value=None, margins=False, dropna=True) 方法可以用來(lái)制作透視表创坞,同時(shí) pd.pivot_table() 也是一個(gè)頂層函數(shù)。
- data 參數(shù)相當(dāng)于 self受葛,這里將其命名為 data 也許是為了與頂級(jí)函數(shù)版本的 pivot_table 保持一致题涨。
- values 參數(shù)可以是一個(gè)以列名為元素的列表,用于指定想要聚合的數(shù)據(jù)总滩,不給出的話默認(rèn)使用全部數(shù)據(jù)纲堵。
- rows 參數(shù)用于指定行分組鍵
- cols 參數(shù)用于指定列分組鍵
- aggfunc 參數(shù)用于指定聚合函數(shù),默認(rèn)為均值(mean)闰渔,也可以是np.sum
- margins 參數(shù)是小計(jì)(Total)功能的開(kāi)關(guān)席函,設(shè)為 True 后結(jié)果集中會(huì)出現(xiàn)名為 “ALL” 的行和列
>>> df
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7
>>> table = pivot_table(df, values='D', rows=['A', 'B'], cols=['C'], aggfunc=np.sum)
>>> table
small large
foo one 1 4
two 6 NaN
bar one 5 4
two 6 7
具體實(shí)例:
>>> mean_ratings = data.pivot_table('rating', rows='title',cols='gender',aggfunc='mean')
>>> mean_ratings[:5]
gender F M
title
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024
- ps 該操作產(chǎn)生了另一個(gè)DataFrame,其內(nèi)容為電影平均得分冈涧,行標(biāo)為電影名稱茂附,列表為性別。
2督弓、過(guò)濾掉評(píng)分?jǐn)?shù)據(jù)不夠250條的電影营曼,為了達(dá)到這個(gè)目的,先對(duì)title進(jìn)行分組愚隧,然后利用size() 得到一個(gè)含有各電影分組大小的Series對(duì)象:
>>> ratings_by_titlr = data.groupby('title').size()
>>> ratings_by_titlr[:10]
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
1-900 (1994) 2
10 Things I Hate About You (1999) 700
101 Dalmatians (1961) 565
101 Dalmatians (1996) 364
12 Angry Men (1957) 616
dtype: int64
>>> active_titles = ratings_by_titlr.index[ratings_by_titlr >250]
>>> active_titles
- ps 這個(gè)地方得到的其實(shí)是滿足條件的title的Index索引值蒂阱,然后在下面用索引字段ix中直接獲取
然后從1中切出評(píng)論大于250條的電影了:
>>> mean_ratings = mean_ratings.ix[active_titles]
>>> active_titles
Index([u''burbs, The (1989)', u'10 Things I Hate About You (1999)', u'101 Dalmatians (1961)', ...], dtype='object')
- ps :這個(gè)地方用ix[active_titles] 直接獲取對(duì)應(yīng)的數(shù)據(jù)
- 我們做了個(gè)試驗(yàn):
>>> data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],'year': [2000, 2001, 2002, 2001, 2002],'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}
>>> frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],index=['one', 'two', 'three', 'four', 'five'])
>>> frame2.columns
Index([u'year', u'state', u'pop', u'debt'], dtype='object')
>>> frame2.ix['three']
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
>>> obj = frame2.index
>>> frame2.ix[obj]
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
這里涉及到切片,DataFrame對(duì)象的標(biāo)準(zhǔn)切片語(yǔ)法為:.ix[::,::].ix對(duì)象可以接受兩套切片奸攻,分別為行(axis =0)和列(axis =1)的方向:
>>> df
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
[3 rows x 3 columns]
>>> df.ix[:2,:2]
Ohio Texas
a 0 1
c 3 4
[2 rows x 2 columns]
>>> df.ix['a','Ohio']
0
而不使用 ix蒜危,直接切的情況就特殊了:索引時(shí),選取的是列睹耐,切片時(shí)辐赞,選取的是行
>>> df['Ohio']
a 0
c 3
d 6
Name: Ohio, dtype: int32
>>> df[:'c']
Ohio Texas California
a 0 1 2
c 3 4 5
[2 rows x 3 columns]
>>> df[:2]
Ohio Texas California
a 0 1 2
c 3 4 5
[2 rows x 3 columns]
使用布爾型數(shù)組的情況,注意行與列的不同切法(列切法的“:”不能氏跹怠)
>>> df['Texas']>=4
a False
c True
d True
Name: Texas, dtype: bool
>>> df[df['Texas']>=4]
Ohio Texas California
c 3 4 5
d 6 7 8
[2 rows x 3 columns]
>>> df.ix[:,df.ix['c']>=4]
Texas California
a 1 2
c 4 5
d 7 8
[3 rows x 2 columns]
- 我只是不知道為什么不能顯示摘要模式
接下來(lái)就是對(duì)某一行的數(shù)據(jù)進(jìn)行排列响委,主要用到的是sort_index
>>> top_female_ratings = mean_ratings.sort_index(by = 'F', ascending = False)
>>> top_female_ratings
- Series 的sort_index(ascending=True) 方法可以對(duì)index進(jìn)行排序操作,ascdending 參數(shù)用于控制升序或者降序窖梁,默認(rèn)是升序 DataFrame 上赘风,.sort_index(axis=0, by=None, ascending=True) 方法多了一個(gè)軸向的選擇參數(shù)與一個(gè) by 參數(shù),by 參數(shù)的作用是針對(duì)某一(些)列進(jìn)行排序(不能對(duì)行使用 by 參數(shù)):纵刘。
3邀窃、計(jì)算評(píng)分分歧,找出男性和女性觀眾分歧最大的電影假哎。在mean_ratings 加上一個(gè)用于存放平均得分之差的列瞬捕,并對(duì)其進(jìn)行排序:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
>>> sort_by_diff = mean_ratings.sort_index(by='diff')
>>> sort_by_diff[:15]
gender F M diff
title
Dirty Dancing (1987) 3.790378 2.959596 -0.830782
Jumpin' Jack Flash (1986) 3.254717 2.578358 -0.676359
Grease (1978) 3.975265 3.367041 -0.608224
Little Women (1994) 3.870588 3.321739 -0.548849
Steel Magnolias (1989) 3.901734 3.365957 -0.535777
Anastasia (1997) 3.800000 3.281609 -0.518391
Rocky Horror Picture Show, The (1975) 3.673016 3.160131 -0.512885
Color Purple, The (1985) 4.158192 3.659341 -0.498851
Age of Innocence, The (1993) 3.827068 3.339506 -0.487561
Free Willy (1993) 2.921348 2.438776 -0.482573
French Kiss (1995) 3.535714 3.056962 -0.478752
Little Shop of Horrors, The (1960) 3.650000 3.179688 -0.470312
Guys and Dolls (1955) 4.051724 3.583333 -0.468391
Mary Poppins (1964) 4.197740 3.730594 -0.467147
Patch Adams (1998) 3.473282 3.008746 -0.464536
對(duì)排序結(jié)果反序并取出前15行鞍历,得到的則是觀眾更喜歡的電影:[:: -1] 取反
>>> sort_by_diff[:: -1][:15]
gender F M diff
title
Good, The Bad and The Ugly, The (1966) 3.494949 4.221300 0.726351
Kentucky Fried Movie, The (1977) 2.878788 3.555147 0.676359
Dumb & Dumber (1994) 2.697987 3.336595 0.638608
Longest Day, The (1962) 3.411765 4.031447 0.619682
Cable Guy, The (1996) 2.250000 2.863787 0.613787
Evil Dead II (Dead By Dawn) (1987) 3.297297 3.909283 0.611985
Hidden, The (1987) 3.137931 3.745098 0.607167
Rocky III (1982) 2.361702 2.943503 0.581801
Caddyshack (1980) 3.396135 3.969737 0.573602
For a Few Dollars More (1965) 3.409091 3.953795 0.544704
Porky's (1981) 2.296875 2.836364 0.539489
Animal House (1978) 3.628906 4.167192 0.538286
Exorcist, The (1973) 3.537634 4.067239 0.529605
Fright Night (1985) 2.973684 3.500000 0.526316
Barb Wire (1996) 1.585366 2.100386 0.515020
但是,只是想要找出分歧最大的電影(不考慮性別因素)肪虎,則可以計(jì)算得分?jǐn)?shù)據(jù)的方差或者標(biāo)準(zhǔn)差
>>> rating_std_by_title = data.groupby('title')['rating'].std()
>>> rating_std_by_title
title
$1,000,000 Duck (1971) 1.092563
'Night Mother (1986) 1.118636
'Til There Was You (1997) 1.020159
'burbs, The (1989) 1.107760
...And Justice for All (1979) 0.878110
1-900 (1994) 0.707107
10 Things I Hate About You (1999) 0.989815
101 Dalmatians (1961) 0.982103
101 Dalmatians (1996) 1.098717
12 Angry Men (1957) 0.812731
13th Warrior, The (1999) 1.140421
187 (1997) 1.057919
2 Days in the Valley (1996) 0.921592
20 Dates (1998) 1.151943
20,000 Leagues Under the Sea (1954) 0.869685
...
Name: rating, Length: 3706, dtype: float64
- 注意劣砍,這個(gè)地方用了groupby 默認(rèn)的情況下就是統(tǒng)計(jì)這個(gè)屬性出現(xiàn)的次數(shù),但是扇救,就像我們用mysql一樣刑枝,用了groupby,肯定也會(huì)有很多相關(guān)的統(tǒng)計(jì)的方法:
一些numpy常用的統(tǒng)計(jì)方法: - count 非 NA 值的數(shù)量
- describe 針對(duì) Series 或 DF 的列計(jì)算匯總統(tǒng)計(jì)
- min , max 最小值和最大值
- argmin , argmax 最小值和最大值的索引位置(整數(shù))
- idxmin , idxmax 最小值和最大值的索引值
- quantile 樣本分位數(shù)(0 到 1)
- sum 求和
- mean 均值
- median 中位數(shù)
- mad 根據(jù)均值計(jì)算平均絕對(duì)離差
- var 方差
- std 標(biāo)準(zhǔn)差
- skew 樣本值的偏度(三階矩)
- kurt 樣本值的峰度(四階矩)
- cumsum 樣本值的累計(jì)和
- cummin , cummax 樣本值的累計(jì)最大值和累計(jì)最小值
- cumprod 樣本值的累計(jì)積
- diff 計(jì)算一階差分(對(duì)時(shí)間序列很有用)
- pct_change 計(jì)算百分?jǐn)?shù)變化
接下來(lái)迅腔,過(guò)濾掉評(píng)論不足250條的記錄:
>>> rating_std_by_title = rating_std_by_title.ix[active_titles]
>>> rating_std_by_title .order(ascending = False)[: 10]
title
Dumb & Dumber (1994) 1.321333
Blair Witch Project, The (1999) 1.316368
Natural Born Killers (1994) 1.307198
Tank Girl (1995) 1.277695
Rocky Horror Picture Show, The (1975) 1.260177
Eyes Wide Shut (1999) 1.259624
Evita (1996) 1.253631
Billy Madison (1995) 1.249970
Fear and Loathing in Las Vegas (1998) 1.246408
Bicentennial Man (1999) 1.245533
Name: rating, dtype: float64
- 若要按值對(duì) Series 進(jìn)行排序装畅,當(dāng)使用 .order(na_last=True, ascending=True, kind='mergesort') 方法,任何缺失值默認(rèn)都會(huì)被放到 Series 的末尾钾挟。
擴(kuò)展: 排名(Series.rank(method='average', ascending=True))的作用與排序的不同之處在于洁灵,他會(huì)把對(duì)象的 values 替換成名次(從 1 到 n)。這時(shí)唯一的問(wèn)題在于如何處理平級(jí)項(xiàng)掺出,方法里的 method 參數(shù)就是起這個(gè)作用的徽千,他有四個(gè)值可選:average, min, max, first。
>>> ser=Series([3,2,0,3],index=list('abcd'))
>>> ser
a 3
b 2
c 0
d 3
dtype: int64
>>> ser.rank()
a 3.5
b 2.0
c 1.0
d 3.5
dtype: float64
>>> ser.rank(method='min')
a 3
b 2
c 1
d 3
dtype: float64
>>> ser.rank(method='max')
a 4
b 2
c 1
d 4
dtype: float64
>>> ser.rank(method='first')
a 3
b 2
c 1
d 4
dtype: float64