13 dtypes
In most cases, pandas uses NumPy arrays and dtypes for Series and for each column of a DataFrame.
NumPy supports the float, int, bool, timedelta64[ns] and datetime64[ns] dtypes.
Note: NumPy does not support timezone-aware datetimes.
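pandas adds its own extension types on top of these, for example timezone-aware datetimes (datetime64[ns, tz]), category and StringDtype, all of which appear later in this section.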
pandas has two ways to store string data:
- object dtype, which can hold any Python object, including strings.
- StringDtype, which is dedicated to strings.
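StringDtype is generally recommended. Arbitrary objects can be stored with the object dtype, but this causes performance and interoperability problems and should be avoided where possible.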
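As a minimal sketch of the two storage options, the snippet below builds the same values once as object and once as StringDtype. The variable names are illustrative only, and the dtype is requested through the "string" alias.

```python
import pandas as pd

# The same values stored two ways (names here are illustrative).
as_object = pd.Series(["a", "b", None], dtype="object")
as_string = pd.Series(["a", "b", None], dtype="string")

print(as_object.dtype)  # generic Python-object storage
print(as_string.dtype)  # dedicated string extension dtype

# Missing values are tracked in both cases.
missing_mask = as_string.isna().tolist()
print(missing_mask)
```

Requesting the dtype explicitly documents the intent of the column and keeps it from silently turning into a container of arbitrary objects.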
DataFrame has a convenient dtypes attribute that returns a Series holding the data type of each column.
In [347]: dft = pd.DataFrame(
.....: {
.....: "A": np.random.rand(3),
.....: "B": 1,
.....: "C": "foo",
.....: "D": pd.Timestamp("20010102"),
.....: "E": pd.Series([1.0] * 3).astype("float32"),
.....: "F": False,
.....: "G": pd.Series([1] * 3, dtype="int8"),
.....: }
.....: )
.....:
In [348]: dft
Out[348]:
A B C D E F G
0 0.035962 1 foo 2001-01-02 1.0 False 1
1 0.701379 1 foo 2001-01-02 1.0 False 1
2 0.281885 1 foo 2001-01-02 1.0 False 1
In [349]: dft.dtypes
Out[349]:
A float64
B int64
C object
D datetime64[ns]
E float32
F bool
G int8
dtype: object
On a Series, use the dtype attribute.
In [350]: dft["A"].dtype
Out[350]: dtype('float64')
If a pandas object holds several data types in a single column, the column's dtype is chosen so that it can accommodate all of them (upcasting). object is the most general.
# these ints are coerced to floats
In [351]: pd.Series([1, 2, 3, 4, 5, 6.0])
Out[351]:
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
dtype: float64
# string data forces an ``object`` dtype
In [352]: pd.Series([1, 2, 3, 6.0, "foo"])
Out[352]:
0 1
1 2
2 3
3 6.0
4 foo
dtype: object
The number of columns of each dtype in a DataFrame can be found by calling DataFrame.dtypes.value_counts().
In [353]: dft.dtypes.value_counts()
Out[353]:
float32 1
datetime64[ns] 1
float64 1
bool 1
int8 1
object 1
int64 1
dtype: int64
Different dtypes can coexist in a DataFrame. Whether a dtype is set through the dtype keyword or arrives with a passed ndarray or Series, it is preserved in DataFrame operations.
Furthermore, different numeric dtypes are not combined.
In [354]: df1 = pd.DataFrame(np.random.randn(8, 1), columns=["A"], dtype="float32")
In [355]: df1
Out[355]:
A
0 0.224364
1 1.890546
2 0.182879
3 0.787847
4 -0.188449
5 0.667715
6 -0.011736
7 -0.399073
In [356]: df1.dtypes
Out[356]:
A float32
dtype: object
In [357]: df2 = pd.DataFrame(
.....: {
.....: "A": pd.Series(np.random.randn(8), dtype="float16"),
.....: "B": pd.Series(np.random.randn(8)),
.....: "C": pd.Series(np.array(np.random.randn(8), dtype="uint8")),
.....: }
.....: )
.....:
In [358]: df2
Out[358]:
A B C
0 0.823242 0.256090 0
1 1.607422 1.426469 0
2 -0.333740 -0.416203 255
3 -0.063477 1.139976 0
4 -1.014648 -1.193477 0
5 0.678711 0.096706 0
6 -0.040863 -1.956850 1
7 -0.357422 -0.714337 0
In [359]: df2.dtypes
Out[359]:
A float16
B float64
C uint8
dtype: object
13.1 Defaults
By default, integer types are int64 and float types are float64.
This holds regardless of whether the platform is 32-bit or 64-bit: the following all produce int64 dtypes.
In [360]: pd.DataFrame([1, 2], columns=["a"]).dtypes
Out[360]:
a int64
dtype: object
In [361]: pd.DataFrame({"a": [1, 2]}).dtypes
Out[361]:
a int64
dtype: object
In [362]: pd.DataFrame({"a": 1}, index=list(range(2))).dtypes
Out[362]:
a int64
dtype: object
Note: NumPy chooses platform-dependent types when creating arrays. The following code returns int32 on a 32-bit platform.
In [363]: frame = pd.DataFrame(np.array([1, 2]))
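The difference between the two construction paths can be checked directly. This is a small sketch with illustrative variable names. It compares a frame built from Python integers with one wrapping a NumPy array, then pins the dtype explicitly.

```python
import numpy as np
import pandas as pd

# Built from Python integers: pandas chooses int64.
from_list = pd.DataFrame({"a": [1, 2]})

# Built from a NumPy array: the array's own (possibly platform-dependent)
# dtype is carried over unchanged.
raw = np.array([1, 2])
from_array = pd.DataFrame(raw)

# Passing dtype explicitly removes the platform dependence.
pinned = pd.DataFrame(np.array([1, 2], dtype="int64"))

print(from_list.dtypes["a"])
print(from_array.dtypes[0], "matches", raw.dtype)
print(pinned.dtypes[0])
```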
13.2 Upcasting
Types can be implicitly upcast when combined with other types, meaning they are promoted from the current type to another one (for example, int to float).
In [364]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
In [365]: df3
Out[365]:
A B C
0 1.047606 0.256090 0.0
1 3.497968 1.426469 0.0
2 -0.150862 -0.416203 255.0
3 0.724370 1.139976 0.0
4 -1.203098 -1.193477 0.0
5 1.346426 0.096706 0.0
6 -0.052599 -1.956850 1.0
7 -0.756495 -0.714337 0.0
In [366]: df3.dtypes
Out[366]:
A float32
B float64
C float64
dtype: object
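DataFrame.to_numpy() returns the lowest common denominator of the dtypes, that is, the dtype that can accommodate all of the columns in the resulting homogeneous NumPy array. This can force some upcasting.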
In [367]: df3.to_numpy().dtype
Out[367]: dtype('float64')
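To make the "common dtype" rule concrete, here is a minimal sketch with two hypothetical frames: one purely numeric, and one that also carries an object column of strings.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({
    "i": np.array([1, 2], dtype="int64"),
    "f": np.array([0.5, 1.5], dtype="float64"),
})
# Same frame plus a column of Python strings stored as object.
mixed = numeric.assign(s=pd.Series(["x", "y"], dtype="object"))

# int64 and float64 can both be represented as float64.
numeric_dtype = numeric.to_numpy().dtype
# Strings cannot share a numeric dtype, so everything becomes object.
mixed_dtype = mixed.to_numpy().dtype

print(numeric_dtype, mixed_dtype)
```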
13.3 astype
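You can use the astype() method to explicitly convert dtypes from one to another.
By default it returns a copy, even if the dtype is unchanged (pass copy=False to change this behavior).
In addition, an exception is raised if the astype operation is invalid.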
In [368]: df3
Out[368]:
A B C
0 1.047606 0.256090 0.0
1 3.497968 1.426469 0.0
2 -0.150862 -0.416203 255.0
3 0.724370 1.139976 0.0
4 -1.203098 -1.193477 0.0
5 1.346426 0.096706 0.0
6 -0.052599 -1.956850 1.0
7 -0.756495 -0.714337 0.0
In [369]: df3.dtypes
Out[369]:
A float32
B float64
C float64
dtype: object
# conversion of dtypes
In [370]: df3.astype("float32").dtypes
Out[370]:
A float32
B float32
C float32
dtype: object
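The failure case mentioned above is easy to reproduce. The sketch below uses made-up data: converting object strings to int64 succeeds when every element parses, and raises when one does not.

```python
import pandas as pd

bad = pd.Series(["1", "2", "three"], dtype="object")

# "three" cannot be parsed as an integer, so astype() raises.
try:
    bad.astype("int64")
    error = None
except ValueError as exc:
    error = exc
print(type(error).__name__)

# With clean input the same call succeeds.
clean = pd.Series(["1", "2", "3"], dtype="object").astype("int64")
print(clean.dtype)
```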
Use astype() to convert a subset of columns to a specified type.
In [371]: dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
In [372]: dft[["a", "b"]] = dft[["a", "b"]].astype(np.uint8)
In [373]: dft
Out[373]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [374]: dft.dtypes
Out[374]:
a uint8
b uint8
c int64
dtype: object
Convert certain columns to specific dtypes by passing a dict to astype().
In [375]: dft1 = pd.DataFrame({"a": [1, 0, 1], "b": [4, 5, 6], "c": [7, 8, 9]})
In [376]: dft1 = dft1.astype({"a": np.bool_, "c": np.float64})
In [377]: dft1
Out[377]:
a b c
0 True 4 7.0
1 False 5 8.0
2 True 6 9.0
In [378]: dft1.dtypes
Out[378]:
a bool
b int64
c float64
dtype: object
Note
When trying to convert a subset of columns to a specified type using astype() and loc, upcasting occurs, because loc tries to fit the assigned values into the existing dtypes. The following code therefore produces an unintended result:
In [379]: dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
In [380]: dft.loc[:, ["a", "b"]].astype(np.uint8).dtypes
Out[380]:
a uint8
b uint8
dtype: object
In [381]: dft.loc[:, ["a", "b"]] = dft.loc[:, ["a", "b"]].astype(np.uint8)
In [382]: dft.dtypes
Out[382]:
a int64
b int64
c int64
dtype: object
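One way to sidestep the loc pitfall, sketched here on the same toy frame, is to let astype() build a new frame from a dict of column names. Only the named columns are converted and no loc assignment is involved.

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# astype() with a dict converts only the listed columns and returns a new frame.
converted = dft.astype({"a": np.uint8, "b": np.uint8})

dtype_names = [str(t) for t in converted.dtypes]
print(dtype_names)
```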
13.4 Object conversion
pandas offers various functions that try to force conversion from the object dtype to other types.
When the data already has the correct type but is stored in an object array, the DataFrame.infer_objects() and Series.infer_objects() methods can be used to soft-convert it to the correct type.
In [383]: import datetime
In [384]: df = pd.DataFrame(
.....: [
.....: [1, 2],
.....: ["a", "b"],
.....: [datetime.datetime(2016, 3, 2), datetime.datetime(2016, 3, 2)],
.....: ]
.....: )
.....:
In [385]: df = df.T
In [386]: df
Out[386]:
0 1 2
0 1 a 2016-03-02
1 2 b 2016-03-02
In [387]: df.dtypes
Out[387]:
0 object
1 object
2 datetime64[ns]
dtype: object
Because the data was transposed, the original inference stored the columns as object dtype, which infer_objects() corrects:
In [388]: df.infer_objects().dtypes
Out[388]:
0 int64
1 object
2 datetime64[ns]
dtype: object
The following functions are available for one-dimensional object arrays or scalars and perform a hard conversion to a specified type:
- to_numeric() (conversion to numeric dtypes)
In [389]: m = ["1.1", 2, 3]
In [390]: pd.to_numeric(m)
Out[390]: array([1.1, 2. , 3. ])
- to_datetime() (conversion to datetime objects)
In [391]: import datetime
In [392]: m = ["2016-07-09", datetime.datetime(2016, 3, 2)]
In [393]: pd.to_datetime(m)
Out[393]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)
- to_timedelta() (conversion to timedelta objects)
In [394]: m = ["5us", pd.Timedelta("1day")]
In [395]: pd.to_timedelta(m)
Out[395]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)
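To force a conversion, pass an errors argument, which specifies how pandas should handle elements that cannot be converted to the desired dtype or object.
By default, errors='raise', meaning that any error encountered during conversion raises an exception.
With errors='coerce', these errors are ignored and pandas converts the problematic elements to pd.NaT (for datetime and timedelta) or np.nan (for numeric).
This is useful when your data is mostly of the desired type but contains a few non-conforming elements that you would rather mark as missing than raise on.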
In [396]: import datetime
In [397]: m = ["apple", datetime.datetime(2016, 3, 2)]
In [398]: pd.to_datetime(m, errors="coerce")
Out[398]: DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)
In [399]: m = ["apple", 2, 3]
In [400]: pd.to_numeric(m, errors="coerce")
Out[400]: array([nan, 2., 3.])
In [401]: m = ["apple", pd.Timedelta("1day")]
In [402]: pd.to_timedelta(m, errors="coerce")
Out[402]: TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)
With errors='ignore', if any error is encountered during conversion, the passed-in data is simply returned unchanged.
In [403]: import datetime
In [404]: m = ["apple", datetime.datetime(2016, 3, 2)]
In [405]: pd.to_datetime(m, errors="ignore")
Out[405]: Index(['apple', 2016-03-02 00:00:00], dtype='object')
In [406]: m = ["apple", 2, 3]
In [407]: pd.to_numeric(m, errors="ignore")
Out[407]: array(['apple', 2, 3], dtype=object)
In [408]: m = ["apple", pd.Timedelta("1day")]
In [409]: pd.to_timedelta(m, errors="ignore")
Out[409]: array(['apple', Timedelta('1 days 00:00:00')], dtype=object)
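Recent pandas releases deprecate errors='ignore' (starting with 2.2) and suggest catching the exception explicitly instead. The sketch below shows one way to get the same "return the input unchanged" behavior. The helper name to_numeric_or_passthrough is hypothetical and not part of pandas.

```python
import pandas as pd

def to_numeric_or_passthrough(values):
    """Return pd.to_numeric(values), or values unchanged if parsing fails.

    Hypothetical helper; mirrors what errors='ignore' used to do.
    """
    try:
        return pd.to_numeric(values)
    except (ValueError, TypeError):
        return values

unparseable = ["apple", 2, 3]
parseable = ["1.1", 2, 3]

same = to_numeric_or_passthrough(unparseable)    # input handed back as-is
converted = to_numeric_or_passthrough(parseable) # numeric array
print(same)
print(converted)
```

The same pattern applies to to_datetime() and to_timedelta(); only the wrapped call changes.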
In addition to object conversion, to_numeric() provides another argument, downcast, which downcasts newly numeric (or already numeric) data to a smaller dtype in order to save memory.
In [410]: m = ["1", 2, 3]
In [411]: pd.to_numeric(m, downcast="integer") # smallest signed int dtype
Out[411]: array([1, 2, 3], dtype=int8)
In [412]: pd.to_numeric(m, downcast="signed") # same as 'integer'
Out[412]: array([1, 2, 3], dtype=int8)
In [413]: pd.to_numeric(m, downcast="unsigned") # smallest unsigned int dtype
Out[413]: array([1, 2, 3], dtype=uint8)
In [414]: pd.to_numeric(m, downcast="float") # smallest float dtype
Out[414]: array([1., 2., 3.], dtype=float32)
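The chosen dtype depends on the values, not just on the downcast option. A small sketch with made-up Series shows the smallest signed integer type that fits each one, along with the effect on memory use.

```python
import pandas as pd

small = pd.Series([1, 2, 3], dtype="int64")
large = pd.Series([1, 300, 70000], dtype="int64")

# Each Series is reduced to the smallest signed integer dtype that can
# still hold all of its values.
small_down = pd.to_numeric(small, downcast="integer")
large_down = pd.to_numeric(large, downcast="integer")

print(small.dtype, "->", small_down.dtype)
print(large.dtype, "->", large_down.dtype)
print(small.memory_usage(index=False), "bytes ->",
      small_down.memory_usage(index=False), "bytes")
```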
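These methods apply only to one-dimensional arrays, lists or scalars, so they cannot be used directly on multi-dimensional objects such as DataFrames. However, apply() can be used to run the function on each column.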
In [415]: import datetime
In [416]: df = pd.DataFrame([["2016-07-09", datetime.datetime(2016, 3, 2)]] * 2, dtype="O")
In [417]: df
Out[417]:
0 1
0 2016-07-09 2016-03-02 00:00:00
1 2016-07-09 2016-03-02 00:00:00
In [418]: df.apply(pd.to_datetime)
Out[418]:
0 1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02
In [419]: df = pd.DataFrame([["1.1", 2, 3]] * 2, dtype="O")
In [420]: df
Out[420]:
0 1 2
0 1.1 2 3
1 1.1 2 3
In [421]: df.apply(pd.to_numeric)
Out[421]:
0 1 2
0 1.1 2 3
1 1.1 2 3
In [422]: df = pd.DataFrame([["5us", pd.Timedelta("1day")]] * 2, dtype="O")
In [423]: df
Out[423]:
0 1
0 5us 1 days 00:00:00
1 5us 1 days 00:00:00
In [424]: df.apply(pd.to_timedelta)
Out[424]:
0 1
0 0 days 00:00:00.000005 1 days
1 0 days 00:00:00.000005 1 days
13.5 Gotchas
Performing selection operations on integer data can easily upcast it to float. The dtype of the input data is preserved in cases where no nans are introduced.
In [425]: dfi = df3.astype("int32")
In [426]: dfi["E"] = 1
In [427]: dfi
Out[427]:
A B C E
0 1 0 0 1
1 3 1 0 1
2 0 0 255 1
3 0 1 0 1
4 -1 -1 0 1
5 1 0 0 1
6 0 -1 1 1
7 0 0 0 1
In [428]: dfi.dtypes
Out[428]:
A int32
B int32
C int32
E int64
dtype: object
In [429]: casted = dfi[dfi > 0]
In [430]: casted
Out[430]:
A B C E
0 1.0 NaN NaN 1
1 3.0 1.0 NaN 1
2 NaN NaN 255.0 1
3 NaN 1.0 NaN 1
4 NaN NaN NaN 1
5 1.0 NaN NaN 1
6 NaN NaN 1.0 1
7 NaN NaN NaN 1
In [431]: casted.dtypes
Out[431]:
A float64
B float64
C float64
E int64
dtype: object
Float dtypes, by contrast, are unchanged.
In [432]: dfa = df3.copy()
In [433]: dfa["A"] = dfa["A"].astype("float32")
In [434]: dfa.dtypes
Out[434]:
A float32
B float64
C float64
dtype: object
In [435]: casted = dfa[df2 > 0]
In [436]: casted
Out[436]:
A B C
0 1.047606 0.256090 NaN
1 3.497968 1.426469 NaN
2 NaN NaN 255.0
3 NaN 1.139976 NaN
4 NaN NaN NaN
5 1.346426 0.096706 NaN
6 NaN NaN 1.0
7 NaN NaN NaN
In [437]: casted.dtypes
Out[437]:
A float32
B float64
C float64
dtype: object
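Since the frames above are built from random data, here is a deterministic sketch of the same gotcha on a tiny hand-made frame. The column names mirror the example, but the values are invented.

```python
import numpy as np
import pandas as pd

dfi = pd.DataFrame({
    "A": np.array([1, -2, 3], dtype="int32"),  # has a non-positive value
    "E": np.array([1, 1, 1], dtype="int64"),   # all values positive
})

# Boolean-frame selection replaces non-matching cells with NaN.
casted = dfi[dfi > 0]

# "A" needed a NaN, so it was upcast to float; "E" did not, so it kept int64.
print(casted.dtypes)
```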
14 Selecting columns based on dtype
The select_dtypes() method extracts a subset of columns based on their dtype.
First, let's create a DataFrame with a variety of dtypes.
In [438]: df = pd.DataFrame(
.....: {
.....: "string": list("abc"),
.....: "int64": list(range(1, 4)),
.....: "uint8": np.arange(3, 6).astype("u1"),
.....: "float64": np.arange(4.0, 7.0),
.....: "bool1": [True, False, True],
.....: "bool2": [False, True, False],
.....: "dates": pd.date_range("now", periods=3),
.....: "category": pd.Series(list("ABC")).astype("category"),
.....: }
.....: )
.....:
In [439]: df["tdeltas"] = df.dates.diff()
In [440]: df["uint64"] = np.arange(3, 6).astype("u8")
In [441]: df["other_dates"] = pd.date_range("20130101", periods=3)
In [442]: df["tz_aware_dates"] = pd.date_range("20130101", periods=3, tz="US/Eastern")
In [443]: df
Out[443]:
string int64 uint8 float64 bool1 ... category tdeltas uint64 other_dates tz_aware_dates
0 a 1 3 4.0 True ... A NaT 3 2013-01-01 2013-01-01 00:00:00-05:00
1 b 2 4 5.0 False ... B 1 days 4 2013-01-02 2013-01-02 00:00:00-05:00
2 c 3 5 6.0 True ... C 1 days 5 2013-01-03 2013-01-03 00:00:00-05:00
[3 rows x 12 columns]
The dtypes of all columns:
In [444]: df.dtypes
Out[444]:
string object
int64 int64
uint8 uint8
float64 float64
bool1 bool
bool2 bool
dates datetime64[ns]
category category
tdeltas timedelta64[ns]
uint64 uint64
other_dates datetime64[ns]
tz_aware_dates datetime64[ns, US/Eastern]
dtype: object
select_dtypes() has two parameters:
- include: select the columns with these dtypes
- exclude: leave out the columns with these dtypes
For example, to select bool columns:
In [445]: df.select_dtypes(include=[bool])
Out[445]:
bool1 bool2
0 True False
1 False True
2 True False
You can also pass the name of a dtype in the NumPy dtype hierarchy:
In [446]: df.select_dtypes(include=["bool"])
Out[446]:
bool1 bool2
0 True False
1 False True
2 True False
select_dtypes() also works with generic dtypes.
For example, to select all numeric and boolean columns while excluding unsigned integers:
In [447]: df.select_dtypes(include=["number", "bool"], exclude=["unsignedinteger"])
Out[447]:
int64 float64 bool1 bool2 tdeltas
0 1 4.0 True False NaT
1 2 5.0 False True 1 days
2 3 6.0 True False 1 days
String columns stored as object dtype, as in this frame, are selected with the object dtype:
In [448]: df.select_dtypes(include=["object"])
Out[448]:
string
0 a
1 b
2 c
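The include and exclude arguments can be tried on a smaller frame whose contents are fixed. This is a sketch with invented column names, not the frame used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "label": pd.Series(["a", "b", "c"], dtype="object"),
    "count": np.array([1, 2, 3], dtype="int64"),
    "small": np.array([3, 4, 5], dtype="uint8"),
    "ratio": np.array([4.0, 5.0, 6.0], dtype="float64"),
    "flag": [True, False, True],
})

# Numeric and boolean columns, minus the unsigned integer one.
numeric_signed = df.select_dtypes(include=["number", "bool"],
                                  exclude=["unsignedinteger"])
# Everything that is not numeric (bool is not part of "number").
non_numeric = df.select_dtypes(exclude=["number"])

print(list(numeric_signed.columns))
print(list(non_numeric.columns))
```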
To see all the child dtypes of a generic dtype, you can define a function like the following, which returns a tree of child dtypes:
In [449]: def subdtypes(dtype):
.....: subs = dtype.__subclasses__()
.....: if not subs:
.....: return dtype
.....: return [dtype, [subdtypes(dt) for dt in subs]]
In [450]: subdtypes(np.generic)
Out[450]:
[numpy.generic,
[[numpy.number,
[[numpy.integer,
[[numpy.signedinteger,
[numpy.int8,
numpy.int16,
numpy.int32,
numpy.int64,
numpy.longlong,
numpy.timedelta64]],
[numpy.unsignedinteger,
[numpy.uint8,
numpy.uint16,
numpy.uint32,
numpy.uint64,
numpy.ulonglong]]]],
[numpy.inexact,
[[numpy.floating,
[numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
[numpy.complexfloating,
[numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
[numpy.flexible,
[[numpy.character, [numpy.bytes_, numpy.str_]],
[numpy.void, [numpy.record]]]],
numpy.bool_,
numpy.datetime64,
numpy.object_]]
Note
pandas also defines the category and datetime64[ns, tz] types, but they are not integrated into the normal NumPy hierarchy and therefore do not appear in the result above.
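The hierarchy also explains why the tdeltas column was returned when "number" was selected earlier: in NumPy, timedelta64 sits under signedinteger, whereas datetime64 and bool_ sit directly under generic. A quick check with np.issubdtype:

```python
import numpy as np

# timedelta64 is a child of signedinteger, hence also of number.
timedelta_is_number = np.issubdtype(np.timedelta64, np.number)
# datetime64 and bool_ are not part of the number branch.
datetime_is_number = np.issubdtype(np.datetime64, np.number)
bool_is_number = np.issubdtype(np.bool_, np.number)
# uint8 falls under unsignedinteger, which is why it can be excluded by name.
uint8_is_unsigned = np.issubdtype(np.uint8, np.unsignedinteger)

print(timedelta_is_number, datetime_is_number, bool_is_number, uint8_is_unsigned)
```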