Background
Part 8 of the polars learning series: handling categorical data
The articles in this series are shared on GitHub; you can download the Jupyter notebooks and follow along.
Repository: https://github.com/DataShare-duo/polars_learn
Runtime environment
import sys
print('python version:', sys.version.split('|')[0])
#python version: 3.11.9
import polars as pl
print("polars version:", pl.__version__)
#polars version: 0.20.22
Categorical data
Categorical data is data that would typically be stored as codes in a database, such as gender, age, country, city, or occupation. Encoding these values saves storage space.
Polars supports two data types for categorical data: Enum and Categorical.
- Use Enum when the categories are known in advance; all categories must be supplied up front.
- Use Categorical when the categories are unknown or not fixed.
enum_dtype = pl.Enum(["Polar", "Panda", "Brown"])
enum_series = pl.Series(
    ["Polar", "Panda", "Brown", "Brown", "Polar"],
    dtype=enum_dtype,
)
cat_series = pl.Series(
    ["Polar", "Panda", "Brown", "Brown", "Polar"],
    dtype=pl.Categorical,
)
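The Categorical type
Categorical is the more flexible of the two: the categories do not need to be known in advance, and a new category is encoded automatically when it appears.
Appending two Categorical columns that were created separately, as in the code below, is slow. Polars assigns codes to strings in order of first appearance, so the same string can have different codes in different Series; merging them directly forces a global re-encoding, which is expensive: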
cat_series = pl.Series(
    ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
)
cat2_series = pl.Series(
    ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
)
#CategoricalRemappingWarning: Local categoricals have different encodings,
#expensive re-encoding is done to perform this merge operation.
#Consider using a StringCache or an Enum type if the categories are known in advance
print(cat_series.append(cat2_series))
The global string cache that polars provides, StringCache, avoids this re-encoding and speeds up processing:
with pl.StringCache():
    cat_series = pl.Series(
        ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
    )
    cat2_series = pl.Series(
        ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
    )
    print(cat_series.append(cat2_series))
Enum
The slow append described above for two separately created columns does not occur with Enum, because all categories are known in advance:
dtype = pl.Enum(["Polar", "Panda", "Brown"])
cat_series = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=dtype)
cat2_series = pl.Series(["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=dtype)
print(cat_series.append(cat2_series))
#shape: (10,)
#Series: '' [enum]
[
"Polar"
"Panda"
"Brown"
"Brown"
"Polar"
"Panda"
"Brown"
"Brown"
"Polar"
"Polar"
]
If a string to be encoded is not one of the categories declared in the Enum, an error is raised (OutOfBounds):
dtype = pl.Enum(["Polar", "Panda", "Brown"])
try:
    cat_series = pl.Series(["Polar", "Panda", "Brown", "Black"], dtype=dtype)
except Exception as e:
    print(e)
#conversion from `str` to `enum` failed
#in column '' for 1 out of 4 values: ["Black"]
#Ensure that all values in the input column are present
#in the categories of the enum datatype.
Comparisons
- Categorical vs Categorical
- Categorical vs String
- Enum vs Enum
- Enum vs String (the string must be one of the categories declared in the Enum)
Categorical vs Categorical
with pl.StringCache():
    cat_series = pl.Series(["Brown", "Panda", "Polar"], dtype=pl.Categorical)
    cat_series2 = pl.Series(["Polar", "Panda", "Black"], dtype=pl.Categorical)
    print(cat_series == cat_series2)
#shape: (3,)
#Series: '' [bool]
[
false
true
false
]
Categorical vs String
cat_series = pl.Series(["Brown", "Panda", "Polar"], dtype=pl.Categorical)
print(cat_series <= "Cat")
#shape: (3,)
#Series: '' [bool]
[
true
false
false
]
cat_series = pl.Series(["Brown", "Panda", "Polar"], dtype=pl.Categorical)
cat_series_utf = pl.Series(["Panda", "Panda", "A Polar"])
print(cat_series <= cat_series_utf)
#shape: (3,)
#Series: '' [bool]
[
true
true
false
]
Enum vs Enum
dtype = pl.Enum(["Polar", "Panda", "Brown"])
cat_series = pl.Series(["Brown", "Panda", "Polar"], dtype=dtype)
cat_series2 = pl.Series(["Polar", "Panda", "Brown"], dtype=dtype)
print(cat_series == cat_series2)
#shape: (3,)
#Series: '' [bool]
[
false
true
false
]
Enum vs String (the string must be one of the categories declared in the Enum)
try:
    cat_series = pl.Series(
        ["Low", "Medium", "High"], dtype=pl.Enum(["Low", "Medium", "High"])
    )
    cat_series <= "Excellent"
except Exception as e:
    print(e)
#conversion from `str` to `enum` failed
#in column '' for 1 out of 1 values: ["Excellent"]
#Ensure that all values in the input column are present
#in the categories of the enum datatype.
dtype = pl.Enum(["Low", "Medium", "High"])
cat_series = pl.Series(["Low", "Medium", "High"], dtype=dtype)
print(cat_series <= "Medium")
#shape: (3,)
#Series: '' [bool]
[
true
true
false
]
dtype = pl.Enum(["Low", "Medium", "High"])
cat_series = pl.Series(["Low", "Medium", "High"], dtype=dtype)
cat_series2 = pl.Series(["High", "High", "Low"])
print(cat_series <= cat_series2)
#shape: (3,)
#Series: '' [bool]
[
true
true
false
]
Earlier articles in this series
- Python polars learning-01: Reading and writing files
- Python polars learning-02: Contexts and expressions
- polars learning-03: Data type conversion
- Python polars learning-04: String data processing
- Python polars learning-05: Data structures
- Python polars learning-06: Lazy / Eager API
- Python polars learning-07: Missing values
These are issues I ran into in my own practice, shared here for reference. You are welcome to follow the WeChat official account DataShare, which posts practical content from time to time.