Pandas has two data structure types: Series and DataFrame.
Series
A Series is a one-dimensional data structure, similar to a dictionary or to a NumPy array whose elements carry labels, but more powerful than a dictionary. Every element has a label (index), which can be a number or a string. Because it has an index, a Series gives you key-to-value lookup and also supports sorting, slicing, and similar operations.
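The behaviour described above can be tried on a small example. This is only a sketch: the `prices` Series and its fruit labels are made up for illustration.

```python
import pandas as pd

# Hypothetical data: prices keyed by string labels.
prices = pd.Series([3.5, 2.8, 1.2], index=['pear', 'apple', 'kiwi'])

by_label = prices['apple']             # dictionary-style lookup by label
label_slice = prices['pear':'apple']   # label slices include both endpoints
by_value = prices.sort_values()        # sort by value; labels travel with the values
by_index = prices.sort_index()         # sort by label

print(by_label)
print(label_slice)
print(by_value)
print(by_index)
```

Unlike a plain dictionary, the Series keeps its order, can be sliced by label, and can be re-sorted by either labels or values.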
DataFrame
A DataFrame is a two-dimensional table structure. A pandas DataFrame can store many different data types, but all values within one column share a single data type, and each axis (rows and columns) has its own labels (index). You can think of it as a dictionary of Series.
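To make the "dictionary of Series" picture concrete, here is a minimal sketch. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical columns; each Series becomes one column with its own dtype.
labels = ['a', 'b', 'c']
names = pd.Series(['Ann', 'Bob', 'Cy'], index=labels)
ages = pd.Series([31, 25, 47], index=labels)
scores = pd.Series([88.5, 92.0, 79.5], index=labels)

people = pd.DataFrame({'name': names, 'age': ages, 'score': scores})

print(people.dtypes)          # one dtype per column
age_column = people['age']    # pulling a column out gives back a Series
print(type(age_column).__name__)
```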
1.1 Creating a Series
Create a Series from a list; by default pandas creates an integer index.
import pandas as pd
import numpy as np
s = pd.Series([0, 1, 2, 3, 4, np.nan, 5, 'A'])  # np.nan: the np.NAN alias was removed in NumPy 2.0
In [74]:s
Out[74]:
0 0
1 1
2 2
3 3
4 4
5 NaN
6 5
7 A
dtype: object
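The `dtype: object` in the output above comes from mixing numbers with the string 'A'. A small sketch of the difference, using shortened example lists:

```python
import numpy as np
import pandas as pd

mixed = pd.Series([0, 1, 2, np.nan, 'A'])   # numbers plus a string
numeric = pd.Series([0, 1, 2, np.nan])      # numbers and a missing value only

print(mixed.dtype)            # object
print(numeric.dtype)          # float64, because NaN is a float
print(numeric.isna().sum())   # number of missing entries
```

Keeping a Series purely numeric lets pandas use fast vectorized arithmetic on it; an object Series falls back to generic Python objects.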
2.1 Creating a DataFrame
Method 1: use an array and specify the index and the column names.
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=['A','B','C','D'])
In [76]:df
Out[76]:
A B C D
2013-01-01 -2.359309 -0.065001 1.099911 -0.886392
2013-01-02 0.318336 0.715261 0.060752 1.326758
2013-01-03 0.515914 1.482326 -0.973154 1.766126
2013-01-04 1.875221 -0.316619 -0.543997 0.864037
2013-01-05 -0.697887 0.065137 -0.899040 0.826392
2013-01-06 -0.205943 -1.532289 1.849114 1.267895
Method 2: create a DataFrame from a dictionary.
df2 = pd.DataFrame({'A':1,
'B':pd.Timestamp('20130102'),
'C':pd.Series(1,index=range(4)),
'D':np.array([3]*4,dtype='int'),
'E':'foo'})
In [78]:df2
Out[78]:
A B C D E
0 1 2013-01-02 1 3 foo
1 1 2013-01-02 1 3 foo
2 1 2013-01-02 1 3 foo
3 1 2013-01-02 1 3 foo
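In the dictionary form, scalar values such as 1 and 'foo' are repeated to fill every row, and the row count comes from the list-like entries. A minimal sketch of that rule, using a reduced version of the dictionary above:

```python
import pandas as pd

# Scalars are repeated to match the length set by the Series' index.
frame = pd.DataFrame({
    'A': 1,
    'C': pd.Series(1, index=range(4)),
    'E': 'foo',
})
print(frame.shape)

# With scalars only there is no length to infer, so pandas asks for an index.
scalar_error = None
try:
    pd.DataFrame({'A': 1, 'E': 'foo'})
except ValueError as err:
    scalar_error = str(err)
print(scalar_error)

only_scalars = pd.DataFrame({'A': 1, 'E': 'foo'}, index=range(2))
print(only_scalars)
```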
2.1.1 Common basic operations
1. View the first N or the last M rows
In [80]:df.head(2)
Out[80]:
A B C D
2013-01-01 -2.359309 -0.065001 1.099911 -0.886392
2013-01-02 0.318336 0.715261 0.060752 1.326758
In [81]:df.tail(2)
Out[81]:
A B C D
2013-01-05 -0.697887 0.065137 -0.899040 0.826392
2013-01-06 -0.205943 -1.532289 1.849114 1.267895
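A short sketch of the argument handling of `head` and `tail`, on a made-up ten-row frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'n': np.arange(10)})

first_default = frame.head()       # no argument: the first 5 rows
last_three = frame.tail(3)         # the last 3 rows
all_but_last_two = frame.head(-2)  # negative n: everything except the last 2 rows

print(len(first_default), list(last_three['n']), len(all_but_last_two))
```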
2. View the index
In [82]:df.index
Out[82]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None
3. View the values
In [83]:df.values
Out[83]:
array([[-2.35930948, -0.06500052, 1.09991148, -0.88639213],
[ 0.31833619, 0.71526129, 0.06075226, 1.32675777],
[ 0.51591397, 1.48232627, -0.97315391, 1.76612637],
[ 1.87522057, -0.31661914, -0.54399686, 0.86403681],
[-0.69788733, 0.06513657, -0.89903951, 0.82639165],
[-0.20594297, -1.53228941, 1.84911405, 1.26789462]])
4. View the column names and the data type of each column
In [84]:df.columns
Out[84]: Index([u'A', u'B', u'C', u'D'], dtype='object')
In [85]:df.dtypes
Out[85]:
A float64
B float64
C float64
D float64
dtype: object
5. View the number of rows
In [74]:len(df)
Out[74]: 6
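Besides `len`, the `shape` attribute reports rows and columns together. A minimal sketch on a zero-filled frame of the same size as `df`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.zeros((6, 4)), columns=list('ABCD'))

n_rows = len(frame)                 # rows only
shape_rows, n_cols = frame.shape    # (rows, columns)
n_cells = frame.size                # rows * columns

print(n_rows, n_cols, n_cells)
```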
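6. View summary statistics (mean, standard deviation, minimum, maximum, quantiles)
From here on the session uses a fresh random draw for df, so the values differ from those shown earlier.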
In [9]:df.describe()
Out[9]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.329473 0.087595 -0.172075 0.308271
std 0.595492 1.106105 0.524659 0.864240
min -0.218562 -1.454443 -0.992808 -0.790523
25% 0.112395 -0.458519 -0.362685 -0.434517
50% 0.135337 0.000715 -0.197997 0.653177
75% 0.281296 0.630096 0.137386 0.864773
max 1.490029 1.750290 0.524751 1.195568
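The `std` row of `describe()` is the sample standard deviation (divisor n - 1), and the reported quantiles can be changed with the `percentiles` argument. A small sketch with made-up numbers:

```python
import pandas as pd

frame = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})

summary = frame.describe()
print(summary.loc['mean', 'x'])   # 2.5
print(summary.loc['std', 'x'])    # same value as frame['x'].std(ddof=1)

# Ask for the 10% and 90% quantiles instead of the default quartiles.
custom = frame.describe(percentiles=[0.1, 0.9])
print(custom.index.tolist())
```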
7. Make an identical copy of the object
In [11]:df2 = df.copy()
In [11]:df2
Out[11]:
A B C D
2013-01-01 0.134964 -1.454443 -0.310064 1.195568
2013-01-02 1.490029 -0.561749 0.524751 0.522473
2013-01-03 0.329824 1.750290 -0.085930 0.891737
2013-01-04 0.135711 -0.148830 -0.380225 -0.753513
2013-01-05 0.104873 0.150260 0.211825 -0.790523
2013-01-06 -0.218562 0.790041 -0.992808 0.783881
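The point of `copy()` is that the new object no longer shares data with the original, whereas plain assignment only creates a second name for the same object. A minimal sketch with a hypothetical one-column frame:

```python
import pandas as pd

original = pd.DataFrame({'A': [1, 2, 3]})

alias = original             # a second name for the same object
duplicate = original.copy()  # an independent copy (deep by default)

original.loc[0, 'A'] = 99    # modify the original in place

print(alias.loc[0, 'A'])      # 99: the alias sees the change
print(duplicate.loc[0, 'A'])  # 1: the copy does not
```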
8. Transpose the data (swap rows and columns)
In [12]:df.T
Out[12]:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.134964 1.490029 0.329824 0.135711 0.104873 -0.218562
B -1.454443 -0.561749 1.750290 -0.148830 0.150260 0.790041
C -0.310064 0.524751 -0.085930 -0.380225 0.211825 -0.992808
D 1.195568 0.522473 0.891737 -0.753513 -0.790523 0.783881
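Transposing swaps the row labels and column labels along with the data. A small sketch on a made-up frame with one integer and one float column:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]},
                     index=['x', 'y', 'z'])
flipped = frame.T

print(flipped.shape)            # (2, 3)
print(flipped.index.tolist())   # the old column names
print(flipped.columns.tolist()) # the old row labels
print(flipped.dtypes)           # int and float rows end up in float64 columns
```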
9. Set a column as the index
df.set_index('A')  # returns a new DataFrame indexed by column A; df itself is unchanged
10. Rename the columns
In [74]:df2.columns = ['E','F','G','H']
In [74]:df2
Out[74]:
E F G H
2013-01-01 0.134964 -1.454443 -0.310064 1.195568
2013-01-02 1.490029 -0.561749 0.524751 0.522473
2013-01-03 0.329824 1.750290 -0.085930 0.891737
2013-01-04 0.135711 -0.148830 -0.380225 -0.753513
2013-01-05 0.104873 0.150260 0.211825 -0.790523
2013-01-06 -0.218562 0.790041 -0.992808 0.783881
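Setting an index and renaming columns can both be done without touching the original frame. The sketch below uses a made-up two-column frame; `rename` is shown as an alternative to assigning the whole `columns` list when only some names change.

```python
import pandas as pd

frame = pd.DataFrame({'A': [10, 20, 30], 'B': [1.5, 2.5, 3.5]})

indexed = frame.set_index('A')              # new frame; column A becomes the index
renamed = frame.rename(columns={'B': 'F'})  # rename selected columns only

relabelled = frame.copy()
relabelled.columns = ['E', 'F']             # replace every column name at once

print(indexed)
print(renamed.columns.tolist())
print(relabelled.columns.tolist())
print(frame.columns.tolist())               # the original is unchanged
```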
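2.1.2 Selecting, filtering, and slicing
Indexing: selecting rows by label (index)
- loc indexes by label,
- iloc indexes by integer position,
- ix indexes primarily by label and falls back to integer position.
Note: although ix accepts both labels and positions, it is somewhat unpredictable. With an integer index it can do surprising things, such as interpreting a number as a position when you meant a label, or the reverse. loc and iloc are safe and predictable. ix was deprecated in pandas 0.20 and removed in pandas 1.0, so use loc and iloc in current code. loc is usually described as the label-based indexer, yet it also works when the index consists of numbers: the numbers are then treated as labels, not as positions. This can look contradictory at first and is best understood by trying it out.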
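The label versus position distinction is easiest to see when the integer labels do not coincide with the positions. A minimal sketch, with a made-up index of 10, 20, 30, 40:

```python
import pandas as pd

frame = pd.DataFrame({'v': ['a', 'b', 'c', 'd']}, index=[10, 20, 30, 40])

by_label = frame.loc[20, 'v']      # the row whose label is 20
by_position = frame.iloc[1, 0]     # the row at position 1 (the same row here)

label_slice = frame.loc[10:20]     # label slice: both endpoints included
position_slice = frame.iloc[0:1]   # position slice: end excluded

# There is no row labelled 0, so loc refuses instead of guessing a position.
try:
    frame.loc[0]
    missing_label = False
except KeyError:
    missing_label = True

print(by_label, by_position, len(label_slice), len(position_slice), missing_label)
```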
1. Select one column
- Method 1: df['A']
- Method 2: df.A
- Method 3: df.loc[:,['A']] (returns a one-column DataFrame instead of a Series)
In [20]:df['A']
Out[20]:
2013-01-01 0.134964
2013-01-02 1.490029
2013-01-03 0.329824
2013-01-04 0.135711
2013-01-05 0.104873
2013-01-06 -0.218562
Freq: D, Name: A, dtype: float64
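The three spellings are not fully interchangeable. The sketch below, on a made-up frame, shows the return types and one pitfall of attribute access: a column whose name collides with a DataFrame method (here the hypothetical column 'count') can only be reached with brackets.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0], 'count': [3, 4]})

as_series = frame['A']            # Series
as_attribute = frame.A            # the same Series via attribute access
as_frame = frame.loc[:, ['A']]    # list of labels: one-column DataFrame
as_series_loc = frame.loc[:, 'A'] # single label: Series

print(type(as_series).__name__, type(as_frame).__name__)

# frame.count is the DataFrame method, not the column named 'count'.
print(callable(frame.count))
print(frame['count'].tolist())
```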
2. Select two or more columns
- Method 1: df[['A','B']]
- Method 2: df.loc[:,['A','B']]
- Method 3: df.ix[:,['A','B']] (only in pandas versions before 1.0)
In [20]:df[['A','B']]
Out[29]:
A B
2013-01-01 0.134964 -1.454443
2013-01-02 1.490029 -0.561749
2013-01-03 0.329824 1.750290
2013-01-04 0.135711 -0.148830
2013-01-05 0.104873 0.150260
2013-01-06 -0.218562 0.790041
3. Filter rows by conditions on one or more columns
In [30]:df[(df.A>0) & (df.B<0)]
Out[30]:
A B C D
2013-01-01 0.134964 -1.454443 -0.310064 1.195568
2013-01-02 1.490029 -0.561749 0.524751 0.522473
2013-01-04 0.135711 -0.148830 -0.380225 -0.753513
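The filter above works by building a boolean mask, one True/False per row. A small sketch with made-up numbers; note that each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` and `<`. The `query` call at the end is an alternative spelling, not something the session above uses.

```python
import pandas as pd

frame = pd.DataFrame({'A': [0.1, 1.5, -0.3, 0.2],
                      'B': [-1.4, -0.5, 1.7, 0.1]})

mask = (frame.A > 0) & (frame.B < 0)    # element-wise AND
both = frame[mask]
either = frame[(frame.A > 0) | (frame.B < 0)]   # element-wise OR
neither = frame[~mask]                  # element-wise NOT

same_as_both = frame.query('A > 0 and B < 0')   # string form of the same filter

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```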
4. Use iloc to select by integer position
In [35]: df1 = pd.DataFrame(np.random.randn(4,4),index=[1,2,3,4],columns=['A','B','C','D'])
In [36]: df1
Out[36]:
A B C D
1 0.913335 -0.209641 -0.994628 -0.300057
2 1.260923 0.405731 -0.566145 -1.114782
3 0.437972 1.800594 -0.269038 -0.038466
4 -0.239472 0.290871 0.207056 0.105834
# View a single row
In [40]: df.iloc[3]
Out[40]:
A 0.135711
B -0.148830
C -0.380225
D -0.753513
Name: 2013-01-04 00:00:00, dtype: float64
# df1 has an integer index, so compare how loc and iloc behave here (the output is from a different random draw of df1)
In [25]:df1.loc[1:2]
Out[25]:
A B C D
1 -0.762372 -0.390335 0.037414 2.104834
2 1.265755 -0.113307 1.443822 -2.765101
In [26]:df1.iloc[1:2]
Out[26]:
A B C D
2 1.265755 -0.113307 1.443822 -2.765101
# View the second through third rows (positions 1 and 2)
In [69]:df.iloc[1:3,:]
Out[69]:
A B C D
2013-01-02 1.490029 -0.561749 0.524751 0.522473
2013-01-03 0.329824 1.750290 -0.085930 0.891737
# View the first through second rows and the first through third columns
In [70]:df.iloc[0:2,0:3]
Out[70]:
A B C
2013-01-01 0.134964 -1.454443 -0.310064
2013-01-02 1.490029 -0.561749 0.524751
# Pick specific rows and columns, e.g. the rows at positions 1, 2, 4 and the columns at positions 0, 2
In [71]:df.iloc[[1,2,4],[0,2]]
Out[71]:
A C
2013-01-02 1.490029 0.524751
2013-01-03 0.329824 -0.085930
2013-01-05 0.104873 0.211825
5. Use loc when the index holds labels instead of numbers
# The index holds dates: pick the rows '2013-01-03':'2013-01-05'
In [54]:df.loc['2013-01-03':'2013-01-05']
Out[54]:
A B C D
2013-01-03 0.329824 1.75029 -0.085930 0.891737
2013-01-04 0.135711 -0.14883 -0.380225 -0.753513
2013-01-05 0.104873 0.15026 0.211825 -0.790523
# The same selection with ix (only in pandas versions before 1.0; prefer loc)
In [55]:df.ix['2013-01-03':'2013-01-05']
Out[55]:
A B C D
2013-01-03 0.329824 1.75029 -0.085930 0.891737
2013-01-04 0.135711 -0.14883 -0.380225 -0.753513
2013-01-05 0.104873 0.15026 0.211825 -0.790523
# Columns at positions 1 and 2 (B and C); the end of an iloc slice is excluded
In [53]:df.iloc[:,1:3]
Out[53]:
B C
2013-01-01 -1.454443 -0.310064
2013-01-02 -0.561749 0.524751
2013-01-03 1.750290 -0.085930
2013-01-04 -0.148830 -0.380225
2013-01-05 0.150260 0.211825
2013-01-06 0.790041 -0.992808
# The third through fifth rows, columns A and B
In [52]:df.loc['2013-01-03':'2013-01-05',['A','B']]
Out[52]:
A B
2013-01-03 0.329824 1.75029
2013-01-04 0.135711 -0.14883
2013-01-05 0.104873 0.15026
# With integers on a non-integer index, ix slices by position (equivalent to df.iloc[1:2])
In [56]:df.ix[1:2]
Out[56]:
A B C D
2013-01-02 1.490029 -0.561749 0.524751 0.522473
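Since ix is not available in pandas 1.0 and later, each ix call above can be written with loc or iloc. A minimal sketch on a made-up date-indexed frame, including one way to combine row positions with column labels:

```python
import pandas as pd

dates = pd.date_range('20130101', periods=6)
frame = pd.DataFrame({'A': list(range(6)), 'B': list(range(10, 16))}, index=dates)

by_label = frame.loc['2013-01-03':'2013-01-05']   # replaces ix with date strings
by_position = frame.iloc[1:2]                     # replaces ix[1:2]

# Row positions plus column labels: translate the positions to labels first.
mixed = frame.loc[frame.index[1:3], ['A']]

print(by_label['A'].tolist(), by_position['A'].tolist(), mixed['A'].tolist())
```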
6. Sorting
# Sort by the index
In [57]:df.sort_index(ascending=False)
Out[57]:
A B C D
2013-01-06 -0.218562 0.790041 -0.992808 0.783881
2013-01-05 0.104873 0.150260 0.211825 -0.790523
2013-01-04 0.135711 -0.148830 -0.380225 -0.753513
2013-01-03 0.329824 1.750290 -0.085930 0.891737
2013-01-02 1.490029 -0.561749 0.524751 0.522473
2013-01-01 0.134964 -1.454443 -0.310064 1.195568
# Sort by a single column (sort_values replaces DataFrame.sort, which was removed in pandas 0.20)
In [58]:df.sort_values(by='B')
Out[58]:
A B C D
2013-01-01 0.134964 -1.454443 -0.310064 1.195568
2013-01-02 1.490029 -0.561749 0.524751 0.522473
2013-01-04 0.135711 -0.148830 -0.380225 -0.753513
2013-01-05 0.104873 0.150260 0.211825 -0.790523
2013-01-06 -0.218562 0.790041 -0.992808 0.783881
2013-01-03 0.329824 1.750290 -0.085930 0.891737
# Sort by several columns
In [59]:df.sort_values(by=['A','B'])
Out[59]:
A B C D
2013-01-06 -0.218562 0.790041 -0.992808 0.783881
2013-01-05 0.104873 0.150260 0.211825 -0.790523
2013-01-01 0.134964 -1.454443 -0.310064 1.195568
2013-01-04 0.135711 -0.148830 -0.380225 -0.753513
2013-01-03 0.329824 1.750290 -0.085930 0.891737
2013-01-02 1.490029 -0.561749 0.524751 0.522473
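With random floats the second sort key rarely matters, so the sketch below uses made-up data with ties in column A and a separate sort direction for each key.

```python
import pandas as pd

frame = pd.DataFrame({'A': [2, 1, 2, 1], 'B': [0.5, 0.9, 0.1, 0.3]})

by_b = frame.sort_values(by='B')   # ascending by B
# A ascending; within equal A, B descending.
by_a_then_b = frame.sort_values(by=['A', 'B'], ascending=[True, False])
reversed_index = frame.sort_index(ascending=False)

print(by_b.index.tolist())
print(by_a_then_b.index.tolist())
print(reversed_index.index.tolist())
```

All three calls return new frames; `frame` keeps its original row order.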
7. Handling missing values
In [66]:df3 = df.reindex(index=dates[0:4], columns = list(df.columns)+['E'])
In [66]:df3.loc[dates[0]:dates[1],['E']]=1
In [66]:df3
Out[63]:
A B C D E
2013-01-01 0.134964 -1.454443 -0.310064 1.195568 1
2013-01-02 1.490029 -0.561749 0.524751 0.522473 1
2013-01-03 0.329824 1.750290 -0.085930 0.891737 NaN
2013-01-04 0.135711 -0.148830 -0.380225 -0.753513 NaN
# Drop rows that contain missing values
In [60]: df3.dropna(how='any')
Out[60]:
A B C D E
2013-01-01 0.134964 -1.454443 -0.310064 1.195568 1
2013-01-02 1.490029 -0.561749 0.524751 0.522473 1
# Fill missing values
In [68]:df3.fillna(value=5)
Out[68]:
A B C D E
2013-01-01 0.134964 -1.454443 -0.310064 1.195568 1
2013-01-02 1.490029 -0.561749 0.524751 0.522473 1
2013-01-03 0.329824 1.750290 -0.085930 0.891737 5
2013-01-04 0.135711 -0.148830 -0.380225 -0.753513 5
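Both `dropna` and `fillna` return new frames and leave `df3` as it was. A small sketch on a made-up frame that also counts the missing values first and fills one column with its own mean (an illustrative choice, not something the session above does):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'E': [1.0, np.nan, np.nan]})

missing_per_column = frame.isna().sum()   # how many NaN values in each column
dropped = frame.dropna(how='any')         # drop rows with at least one NaN
filled = frame.fillna(value=5)            # one fill value for every column
per_column = frame.fillna({'E': frame['E'].mean()})   # per-column fill value

print(missing_per_column)
print(dropped)
print(filled)
print(per_column)
```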
2.1.3 Computing with functions and using apply
In [69]:df.mean()
Out[69]:
A 0.634212
B -0.517503
C -0.360313
D -0.178633
dtype: float64
In [70]:df.apply(np.cumsum)
Out[70]:
A B C D
2013-01-01 -1.083703 -0.984847 0.231595 0.764466
2013-01-02 -0.277971 -0.737865 -0.366301 -0.768202
2013-01-03 -0.271485 -1.006928 -0.246741 -0.483353
2013-01-04 2.491598 0.096372 -2.159432 -0.331738
2013-01-05 2.624991 -1.882532 -2.445247 -1.636275
2013-01-06 3.805273 -3.105017 -2.161877 -1.071797
In [71]:df.apply(lambda x: x.max() - x.min())
Out[71]:
A 3.846786
B 3.082203
C 2.196061
D 2.297133
dtype: float64
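By default `apply` hands each column to the function; passing `axis=1` hands it each row instead. A minimal sketch with made-up numbers that are easy to check by hand:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 4.0, 2.0], 'B': [10.0, 30.0, 20.0]})

col_range = frame.apply(lambda col: col.max() - col.min())          # per column
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)  # per row
running = frame.apply(np.cumsum)   # running total down each column
col_means = frame.mean()           # built-in reduction, no apply needed

print(col_range)
print(row_range)
print(running)
print(col_means)
```

For common reductions such as mean, sum, or cumsum, the built-in methods are usually simpler than `apply`.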