5. Descriptive statistics
Series and DataFrame provide a large number of descriptive-statistics functions and related operations.
Aggregation functions such as sum(), mean(), and quantile() reduce the dimensionality of the data, whereas functions such as cumsum() and cumprod() return an object of the same size as the original.
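The difference between the two families is easiest to see from the shapes of the results. The following is a minimal sketch; `df_demo` is hypothetical data invented for illustration and is not the `df` used elsewhere in this article.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration.
df_demo = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

total = df_demo.sum()       # aggregation: one value per column (a Series)
running = df_demo.cumsum()  # cumulative: same shape as df_demo

print(total.shape)    # (2,)
print(running.shape)  # (3, 2)
```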
In general, these functions take an axis parameter that accepts an integer (0 or 1) or an axis name ("index" or "columns") to specify the dimension to operate on:
- Series: no axis argument is needed
- DataFrame: "index" (axis=0, the default) or "columns" (axis=1)
For example:
In [78]: df
Out[78]:
one two three
a 1.394981 1.772517 NaN
b 0.343054 1.912123 -0.050390
c 0.695246 1.478369 1.227435
d NaN 0.279344 -0.613172
In [79]: df.mean(0)
Out[79]:
one 0.811094
two 1.360588
three 0.187958
dtype: float64
In [80]: df.mean(1)
Out[80]:
a 1.583749
b 0.734929
c 1.133683
d -0.166914
dtype: float64
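The axis can also be given by name instead of by number. The sketch below uses a small hypothetical frame (`df_demo`, not the `df` above) to show that the two spellings give the same result.

```python
import pandas as pd

# Hypothetical data for illustration only.
df_demo = pd.DataFrame({"one": [1.0, 2.0], "two": [3.0, 5.0]}, index=["a", "b"])

col_means = df_demo.mean(axis="index")    # same as axis=0: one mean per column
row_means = df_demo.mean(axis="columns")  # same as axis=1: one mean per row

print(col_means.equals(df_demo.mean(axis=0)))  # True
print(row_means.equals(df_demo.mean(axis=1)))  # True
```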
All of these functions also take a skipna parameter that specifies whether to exclude missing values (True by default).
In [81]: df.sum(0, skipna=False)
Out[81]:
one NaN
two 5.442353
three NaN
dtype: float64
In [82]: df.sum(axis=1, skipna=True)
Out[82]:
a 3.167498
b 2.204786
c 3.401050
d -0.333828
dtype: float64
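Combined with broadcasting and arithmetic operations, these functions let you express many statistical procedures very concisely. One example is standardization, which rescales the data to zero mean and unit standard deviation: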
In [83]: ts_stand = (df - df.mean()) / df.std()
In [84]: ts_stand.std()
Out[84]:
one 1.0
two 1.0
three 1.0
dtype: float64
In [85]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)
In [86]: xs_stand.std(1)
Out[86]:
a 1.0
b 1.0
c 1.0
d 1.0
dtype: float64
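The same column-wise standardization can be checked on fixed data. This is a sketch on a hypothetical frame (`df_demo`), assuming the default sample standard deviation (ddof=1) that `std()` uses.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df_demo = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0], "two": [10.0, 0.0, 5.0, 25.0]})

# Subtract each column's mean and divide by its standard deviation;
# the Series returned by mean() and std() are aligned on the column labels.
z = (df_demo - df_demo.mean()) / df_demo.std()

means_are_zero = np.allclose(z.mean(), 0.0)
stds_are_one = np.allclose(z.std(), 1.0)
print(means_are_zero, stds_are_one)
```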
Note: cumsum() and cumprod() preserve the positions of NaN values.
In [87]: df.cumsum()
Out[87]:
one two three
a 1.394981 1.772517 NaN
b 1.738035 3.684640 -0.050390
c 2.433281 5.163008 1.177045
d NaN 5.442353 0.563873
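A small sketch makes the NaN handling concrete. The series `s_demo` is hypothetical; with the default skipna=True the NaN stays in place and the running total continues after it, while skipna=False makes every later entry NaN as well.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
s_demo = pd.Series([1.0, np.nan, 2.0])

skipping = s_demo.cumsum()            # NaN kept in place, sum continues: 1.0, NaN, 3.0
strict = s_demo.cumsum(skipna=False)  # NaN propagates forward: 1.0, NaN, NaN

print(skipping.tolist())
print(strict.tolist())
```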
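If the data has a multi-level index, the level parameter specifies the level to aggregate over. Note that the level parameter of these functions was deprecated in pandas 1.3 and removed in pandas 2.0, so the calls below require an older version.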
>>> df = pd.DataFrame(np.random.randint(80, 120, size=(2, 4)), index=['girl', 'boy'],
...                   columns=[['English', 'English', 'Chinese', 'Chinese'],
...                            ['like', 'dislike', 'like', 'dislike']])
>>> df
     English         Chinese
        like dislike    like dislike
girl     108     104     115     102
boy       94     109     105      92
>>> df.mean(level=0, axis=1)
English Chinese
girl 106.0 108.5
boy 101.5 98.5
>>> df.columns.names = ['language', 'like']
>>> df.mean(level='language', axis=1)
language English Chinese
girl 106.0 108.5
boy 101.5 98.5
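On pandas 2.0 and later, where reductions no longer accept level, one way to get the same per-language means is to group on the column level instead. The sketch below is an assumed equivalent, not something from the original session: it rebuilds the frame with the values printed above under the hypothetical name `scores`, transposes it, groups by the `language` level, and transposes back.

```python
import pandas as pd

# Rebuild the frame shown above with fixed values (the name `scores` is hypothetical).
columns = pd.MultiIndex.from_product(
    [["English", "Chinese"], ["like", "dislike"]], names=["language", "like"]
)
scores = pd.DataFrame(
    [[108, 104, 115, 102], [94, 109, 105, 92]],
    index=["girl", "boy"],
    columns=columns,
)

# Transpose so the column levels become row levels, group by language,
# average within each group, then transpose back.
# sort=False keeps the original English/Chinese order.
by_language = scores.T.groupby(level="language", sort=False).mean().T
print(by_language)
```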
The following are commonly used summary functions.
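As a rough illustration, the sketch below calls a handful of these reductions on one hypothetical series (`s_demo`). It is a selection, not a complete list; all of them skip the NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
s_demo = pd.Series([1.0, 2.0, 2.0, np.nan, 5.0])

summary = {
    "count": s_demo.count(),         # number of non-NaN values
    "sum": s_demo.sum(),
    "mean": s_demo.mean(),
    "median": s_demo.median(),
    "min": s_demo.min(),
    "max": s_demo.max(),
    "var": s_demo.var(),             # sample variance (ddof=1)
    "std": s_demo.std(),             # sample standard deviation (ddof=1)
    "q50": s_demo.quantile(0.5),     # same value as the median
}
for name, value in summary.items():
    print(name, value)
```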
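Note: some NumPy functions, such as mean, std, and sum, skip missing values by default when given a Series, because the call is dispatched to the corresponding Series method. Applied to the underlying NumPy array, they do not: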
In [88]: np.mean(df["one"])
Out[88]: 0.8110935116651192
In [89]: np.mean(df["one"].to_numpy())
Out[89]: nan
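When you do need to work on the raw array, NumPy's NaN-aware variants such as np.nanmean give the same answer as the Series method. A minimal sketch on a hypothetical series (`s_demo`):

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
s_demo = pd.Series([1.0, np.nan, 3.0])

via_series = np.mean(s_demo)                # handled by Series.mean, NaN skipped
via_array = np.mean(s_demo.to_numpy())      # plain ndarray mean, NaN propagates
via_nanmean = np.nanmean(s_demo.to_numpy()) # NaN-aware NumPy mean

print(via_series, via_array, via_nanmean)
```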
Series.nunique() returns the number of unique non-NaN values in a Series.
In [90]: series = pd.Series(np.random.randn(500))
In [91]: series[20:500] = np.nan
In [92]: series[10:20] = 5
In [93]: series.nunique()
Out[93]: 11
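To count NaN as a value of its own, nunique() accepts dropna=False. A short sketch on a hypothetical series (`s_demo`):

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
s_demo = pd.Series([1.0, 1.0, 2.0, np.nan, np.nan])

without_nan = s_demo.nunique()             # counts 1.0 and 2.0
with_nan = s_demo.nunique(dropna=False)    # also counts NaN once

print(without_nan, with_nan)  # 2 3
```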
5.1 Summarizing data: describe
The describe() function computes a variety of summary statistics for a Series or for each column of a DataFrame (NaN values are, of course, excluded).
In [94]: series = pd.Series(np.random.randn(1000))
In [95]: series[::2] = np.nan
In [96]: series.describe()
Out[96]:
count 500.000000
mean -0.021292
std 1.015906
min -2.683763
25% -0.699070
50% -0.069718
75% 0.714483
max 3.160915
dtype: float64
In [97]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])
In [98]: frame.iloc[::2] = np.nan
In [99]: frame.describe()
Out[99]:
a b c d e
count 500.000000 500.000000 500.000000 500.000000 500.000000
mean 0.033387 0.030045 -0.043719 -0.051686 0.005979
std 1.017152 0.978743 1.025270 1.015988 1.006695
min -3.000951 -2.637901 -3.303099 -3.159200 -3.188821
25% -0.647623 -0.576449 -0.712369 -0.691338 -0.691115
50% 0.047578 -0.021499 -0.023888 -0.032652 -0.025363
75% 0.729907 0.775880 0.618896 0.670047 0.649748
max 2.740139 2.752332 3.004229 2.728702 3.240991
You can choose specific percentiles to include in the output:
In [100]: series.describe(percentiles=[0.05, 0.25, 0.75, 0.95])
Out[100]:
count 500.000000
mean -0.021292
std 1.015906
min -2.683763
5% -1.645423
25% -0.699070
50% -0.069718
75% 0.714483
95% 1.711409
max 3.160915
dtype: float64
By default, the median (the 50% row) is always included.
For a non-numeric Series, describe() gives a simple summary that includes the number of unique values and the most frequently occurring value together with its frequency:
In [101]: s = pd.Series(["a", "a", "b", "b", "a", "a", np.nan, "c", "d", "a"])
In [102]: s.describe()
Out[102]:
count 9
unique 4
top a
freq 5
dtype: object
Note: on a DataFrame with mixed column types, describe() restricts the summary to numeric columns. If there are no numeric columns, only the categorical columns are summarized.
In [103]: frame = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})
In [104]: frame.describe()
Out[104]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000
The include and exclude parameters take lists of data types to include in or exclude from the summary, and passing include="all" summarizes every column.
In [105]: frame.describe(include=["object"])
Out[105]:
a
count 4
unique 2
top No
freq 2
In [106]: frame.describe(include=["number"])
Out[106]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000
In [107]: frame.describe(include="all")
Out[107]:
a b
count 4 4.000000
unique 2 NaN
top No NaN
freq 2 NaN
mean NaN 1.500000
std NaN 1.290994
min NaN 0.000000
25% NaN 0.750000
50% NaN 1.500000
75% NaN 2.250000
max NaN 3.000000
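The session above only exercises include. The sketch below shows the complementary exclude parameter on the same two-column frame, rebuilt here under the hypothetical name `frame_demo` so the block is self-contained.

```python
import pandas as pd

# Same shape of data as the frame above, rebuilt so this block runs on its own.
frame_demo = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})

# Drop numeric columns from the summary; only the string column "a" remains.
non_numeric = frame_demo.describe(exclude=["number"])
print(non_numeric)
```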
5.2 Index of the minimum and maximum values
The idxmin() and idxmax() functions on Series and DataFrame return the index labels of the minimum and maximum values.
In [108]: s1 = pd.Series(np.random.randn(5))
In [109]: s1
Out[109]:
0 1.118076
1 -0.352051
2 -1.242883
3 -1.277155
4 -0.641184
dtype: float64
In [110]: s1.idxmin(), s1.idxmax()
Out[110]: (3, 0)
In [111]: df1 = pd.DataFrame(np.random.randn(5, 3), columns=["A", "B", "C"])
In [112]: df1
Out[112]:
A B C
0 -0.327863 -0.946180 -0.137570
1 -0.186235 -0.257213 -0.486567
2 -0.507027 -0.871259 -0.111110
3 2.000339 -2.430505 0.089759
4 -0.321434 -0.033695 0.096271
In [113]: df1.idxmin(axis=0)
Out[113]:
A 2
B 3
C 1
dtype: int64
In [114]: df1.idxmax(axis=1)
Out[114]:
0 C
1 A
2 C
3 A
4 C
dtype: object
When several entries share the minimum or maximum value, idxmin() and idxmax() return the first matching index.
In [115]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=["A"], index=list("edcba"))
In [116]: df3
Out[116]:
A
e 2.0
d 1.0
c 1.0
b 3.0
a NaN
In [117]: df3["A"].idxmin()
Out[117]: 'd'
5.3 Value counts and mode
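value_counts() counts how many times each value occurs in a Series or an array. It is available as a Series method and as the top-level function pd.value_counts(), although the top-level function is deprecated since pandas 2.1 in favor of pd.Series(data).value_counts().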
In [118]: data = np.random.randint(0, 7, size=50)
In [119]: data
Out[119]:
array([6, 6, 2, 3, 5, 3, 2, 5, 4, 5, 4, 3, 4, 5, 0, 2, 0, 4, 2, 0, 3, 2,
2, 5, 6, 5, 3, 4, 6, 4, 3, 5, 6, 4, 3, 6, 2, 6, 6, 2, 3, 4, 2, 1,
6, 2, 6, 1, 5, 4])
In [120]: s = pd.Series(data)
In [121]: s.value_counts()
Out[121]:
2 10
6 10
4 9
3 8
5 8
0 3
1 2
dtype: int64
In [122]: pd.value_counts(data)
Out[122]:
2 10
6 10
4 9
3 8
5 8
0 3
1 2
dtype: int64
On a DataFrame, the value_counts() method counts combinations of values across multiple columns. By default all columns are used, but a subset can be selected with the subset parameter.
In [123]: data = {"a": [1, 2, 3, 4], "b": ["x", "x", "y", "y"]}
In [124]: frame = pd.DataFrame(data)
In [125]: frame.value_counts()
Out[125]:
a b
1 x 1
2 x 1
3 y 1
4 y 1
dtype: int64
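The session above uses all columns. The subset parameter mentioned in the text can be sketched as follows, on the same data rebuilt under the hypothetical name `frame_demo`.

```python
import pandas as pd

# Same data as the frame above, rebuilt so this block runs on its own.
frame_demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "x", "y", "y"]})

# Count combinations using only column "b": "x" and "y" each occur twice.
counts = frame_demo.value_counts(subset=["b"])
print(counts)
```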
Similarly, you can get the most frequently occurring value(s), that is the mode, of a Series or DataFrame:
In [126]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])
In [127]: s5.mode()
Out[127]:
0 3
1 7
dtype: int64
In [128]: df5 = pd.DataFrame(
.....: {
.....: "A": np.random.randint(0, 7, size=50),
.....: "B": np.random.randint(-10, 15, size=50),
.....: }
.....: )
.....:
In [129]: df5.mode()
Out[129]:
A B
0 1.0 -9
1 NaN 10
2 NaN 13
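The NaN entries in the output above appear because columns can have different numbers of modes, and the shorter columns are padded. A deterministic sketch with a hypothetical frame (`df_demo`):

```python
import pandas as pd

# Hypothetical data: column A has a single mode, column B has three
# (every value occurs once, so all of them are modes).
df_demo = pd.DataFrame({"A": [1, 1, 2], "B": [3, 4, 5]})

modes = df_demo.mode()
print(modes)  # A is padded with NaN below its single mode
```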
5.4 Discretization and quantiles
Continuous values can be discretized with the cut() function (bins based on values) and the qcut() function (bins based on sample quantiles).
In [130]: arr = np.random.randn(20)
In [131]: factor = pd.cut(arr, 4)
In [132]: factor
Out[132]:
[(-0.251, 0.464], (-0.968, -0.251], (0.464, 1.179], (-0.251, 0.464], (-0.968, -0.251], ..., (-0.251, 0.464], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251]]
Length: 20
Categories (4, interval[float64]): [(-0.968, -0.251] < (-0.251, 0.464] < (0.464, 1.179] <
(1.179, 1.893]]
In [133]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])
In [134]: factor
Out[134]:
[(0, 1], (-1, 0], (0, 1], (0, 1], (-1, 0], ..., (-1, 0], (-1, 0], (-1, 0], (-1, 0], (-1, 0]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]
qcut() computes sample quantiles. For example, we can slice some normally distributed data into four quartile bins of roughly equal size:
In [135]: arr = np.random.randn(30)
In [136]: factor = pd.qcut(arr, [0, 0.25, 0.5, 0.75, 1])
In [137]: factor
Out[137]:
[(0.569, 1.184], (-2.278, -0.301], (-2.278, -0.301], (0.569, 1.184], (0.569, 1.184], ..., (-0.301, 0.569], (1.184, 2.346], (1.184, 2.346], (-0.301, 0.569], (-2.278, -0.301]]
Length: 30
Categories (4, interval[float64]): [(-2.278, -0.301] < (-0.301, 0.569] < (0.569, 1.184] <
(1.184, 2.346]]
In [138]: pd.value_counts(factor)
Out[138]:
(-2.278, -0.301] 8
(1.184, 2.346] 8
(-0.301, 0.569] 7
(0.569, 1.184] 7
dtype: int64
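Instead of listing the quantile edges, qcut() also accepts the number of bins. The sketch below, which uses a seeded NumPy generator (an assumption made for reproducibility, not part of the original session), checks that both spellings assign every value to the same bin.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible; the seed is arbitrary.
arr_demo = np.random.default_rng(0).standard_normal(30)

by_edges = pd.qcut(arr_demo, [0, 0.25, 0.5, 0.75, 1])  # explicit quantile edges
by_count = pd.qcut(arr_demo, 4)                        # four equal-frequency bins

same_bins = bool((by_edges.codes == by_count.codes).all())
bin_sizes = pd.Series(by_count).value_counts()
print(same_bins)
print(bin_sizes)
```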
We can also use infinite values to define the bins:
In [139]: arr = np.random.randn(20)
In [140]: factor = pd.cut(arr, [-np.inf, 0, np.inf])
In [141]: factor
Out[141]:
[(-inf, 0.0], (0.0, inf], (0.0, inf], (-inf, 0.0], (-inf, 0.0], ..., (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], (0.0, inf], (0.0, inf]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
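Open-ended bins like these are often paired with readable names through the labels parameter of cut(). A minimal sketch with hypothetical values and label names:

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only.
values = np.array([-1.5, -0.2, 0.0, 0.3, 2.0])

# Bins are right-closed by default, so 0.0 falls into (-inf, 0].
signs = pd.cut(values, [-np.inf, 0, np.inf], labels=["non-positive", "positive"])
print(list(signs))
```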