import numpy as np
import pandas as pd
0 Introduction to pandas data structures
pandas works mainly with the following three data structures:
- Series
- DataFrame
- Panel
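All of them are built on NumPy, so they are relatively fast. DataFrame is the most commonly used of the three.
Data structure | Dimensions | Description |
---|---|---|
Series | 1 | 1-D array; size is immutable, but the values it holds are mutable |
DataFrame | 2 | 2-D array; size is mutable |
Panel | 3 | 3-D array; size is mutable |
The following sections describe these three data structures in detail.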
1 Series
A Series is a one-dimensional labeled array that can hold data of any type. Its labels are called the index.
1.1 Constructor
pandas.Series(data, index, dtype, copy)
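The constructor takes the following parameters:
Parameter | Description |
---|---|
data | The data, e.g. an ndarray, a list, etc. |
index | Index labels. They must be hashable and the same length as the data; unique labels are usual, although duplicates are accepted. Defaults to np.arange(n) |
dtype | Data type |
copy | Whether to copy the data; defaults to False |
1.2 Creating a Series
1.2.1 Creating an empty Series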
s = pd.Series()
print s
Series([], dtype: float64)
1.2.2 Creating a Series from a list
With the default index, as shown below:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
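As a small sketch of the example above, the following checks the default index and shows why the dtype comes out as float64: np.nan is a float, so the integer entries are upcast. The names default_labels and s_int are illustrative only.

```python
import numpy as np
import pandas as pd

# A list containing np.nan is stored as float64, because NaN is a float value.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# With no index argument, the labels are the integers 0..n-1.
default_labels = list(s.index)

# A list of plain integers keeps an integer dtype instead.
s_int = pd.Series([1, 3, 5])

print(s.dtype)
print(default_labels)
print(s_int.dtype)
```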
1.2.3 Creating a Series from an np.ndarray
Pass an index of the same length as the data, as shown below:
s = pd.Series(np.random.rand(5), index=["a", "b", "c", "d", "e"])
print s
a 0.055684
b 0.697289
c 0.768223
d 0.428101
e 0.748015
dtype: float64
print s.index
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
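1.2.4 Creating a Series from a dict
If no index is given, the keys of the dict become the index: in sorted order in older pandas versions, and in insertion order in pandas 0.23 and later on Python 3.6+. If an index is given, the value for each index label is looked up in the dict, and labels that are not keys of the dict get NaN.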
d = {"a": 0., "b": 1., "c": 2.}
s = pd.Series(d)
print s
s = pd.Series(d, index=["b", "c", "d", "a"])
print s
a 0.0
b 1.0
c 2.0
dtype: float64
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
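The lookup-by-label behavior can be sketched as follows. The comparison with reindex is an addition for illustration, on the assumption that both paths align on labels in the same way; it is not part of the original example.

```python
import numpy as np
import pandas as pd

d = {"a": 0.0, "b": 1.0, "c": 2.0}

# Labels listed in index are looked up in the dict; "d" is not a key, so it becomes NaN.
s = pd.Series(d, index=["b", "c", "d", "a"])

# reindex performs the same kind of label-based alignment on an existing Series.
s_reindexed = pd.Series(d).reindex(["b", "c", "d", "a"])

same_labels = list(s.index) == list(s_reindexed.index)
same_values = np.allclose(s.values, s_reindexed.values, equal_nan=True)

print(s)
print(same_labels)
print(same_values)
```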
1.2.5 Creating a Series from a scalar
With this approach an index should be given; the scalar is repeated for every label, as shown below:
s = pd.Series(5., index=["a", "b", "c", "d", "e"])
print s
a 5.0
b 5.0
c 5.0
d 5.0
e 5.0
dtype: float64
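A minimal sketch of the broadcasting behavior, with fixed values so it can be checked. The second half, a scalar without an index, is an addition that reflects how current pandas behaves; very old versions may differ.

```python
import pandas as pd

# The scalar is repeated once per index label.
s = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(len(s))

# Without an index, a scalar yields a single-element Series labeled 0.
single = pd.Series(5.0)
print(len(single))
```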
1.3 ndarray-like access to a Series
A Series behaves much like an np.ndarray: most NumPy functions and indexing styles can be used to operate on and access a Series, as shown below:
s = pd.Series(np.random.rand(5), index=["a", "b", "c", "d", "e"])
print s
a 0.566958
b 0.548278
c 0.239546
d 0.218399
e 0.322169
dtype: float64
print s[0]
0.566958376402
print s[:3]
a 0.566958
b 0.548278
c 0.239546
dtype: float64
print s[s>s.median()]
a 0.566958
b 0.548278
dtype: float64
print s[[4, 3, 1]]
e 0.322169
d 0.218399
b 0.548278
dtype: float64
print np.exp(s)
a 1.762897
b 1.730271
c 1.270672
d 1.244083
e 1.380118
dtype: float64
print s + s
a 1.133917
b 1.096556
c 0.479092
d 0.436798
e 0.644338
dtype: float64
print s * 2
a 1.133917
b 1.096556
c 0.479092
d 0.436798
e 0.644338
dtype: float64
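The same operations can be sketched with fixed values so that the results are reproducible. This version uses .iloc for positional access, which is an editorial choice: recent pandas releases warn when s[0] is used positionally on a Series with string labels, while .iloc is always positional.

```python
import numpy as np
import pandas as pd

# Fixed values (instead of np.random.rand) so the results are reproducible.
s = pd.Series([0.5, 0.4, 0.1, 0.2, 0.3], index=["a", "b", "c", "d", "e"])

first = s.iloc[0]                  # positional access, same element as label "a"
head = s.iloc[:3]                  # first three elements by position
above_median = s[s > s.median()]   # boolean mask keeps values above the median
picked = s.iloc[[4, 3, 1]]         # positions 4, 3, 1 -> labels e, d, b
doubled = s * 2                    # element-wise, equal to s + s

print(first)
print(list(above_median.index))
print(list(picked.index))
```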
1.4 dict-like access to a Series
A Series can also be accessed in a dict-like way, as shown below:
print s["a"]
0.566958376402
print "e" in s
True
print "f" in s
False
print s.get("e")
print s.get("f", np.nan)
0.322169102265
nan
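A short sketch of the dict-like behavior, including the failure case that the examples above do not show. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.5, 0.4, 0.1], index=["a", "b", "c"])

has_a = "a" in s                 # membership tests the index labels, not the values
missing = s.get("f")             # get returns None when the label is absent
fallback = s.get("f", np.nan)    # or the supplied default

# Plain bracket access raises KeyError for an unknown label, like a dict.
try:
    s["f"]
    raised = False
except KeyError:
    raised = True

print(has_a, missing, fallback, raised)
```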
2 DataFrame
A DataFrame is a two-dimensional data structure whose data is accessed through the row index and the column labels, columns.
2.1 Constructor
pandas.DataFrame(data, index, columns, dtype, copy)
The constructor takes the following parameters:
Parameter | Description |
---|---|
data | The data, e.g. a 2-D ndarray, lists, Series, a dict, or another DataFrame |
index | Row labels; not necessarily unique, same length as the data. Defaults to np.arange(n) |
columns | Column labels; must be hashable and are normally unique. Defaults to np.arange(n) |
dtype | Data type |
copy | Whether to copy the data; defaults to False |
2.2 Creating a DataFrame
2.2.1 From a dict whose values are Series or dicts
The index of the DataFrame is the union of the indexes of all the Series. If a value is a dict, it is first converted to a Series. If columns is not given, all the keys of the dict are used as the columns. For example:
d = {"one": pd.Series([1., 2., 3.], index=["a", "b", "c"]),
"two": pd.Series([1., 2., 3., 4.], index=["a", "b", "c", "d"])}
df = pd.DataFrame(d)
print df
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
df1 = pd.DataFrame(d, index=["d", "b", "a"])
print df1
one two
d NaN 4.0
b 2.0 2.0
a 1.0 1.0
df2 = pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
print df2
two three
d 4.0 NaN
b 2.0 NaN
a 1.0 NaN
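The union-of-indexes rule can be checked directly. This sketch reuses the dict from the example above; rows and gap are illustrative names.

```python
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

df = pd.DataFrame(d)

# Row labels are the union of both Series indexes.
rows = list(df.index)

# "one" has no entry for "d", so that cell is NaN.
gap = df.loc["d", "one"]

# Asking for a column that is not a key of d gives an all-NaN column.
df2 = pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])

print(rows)
print(df2)
```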
2.2.2 From a dict whose values are ndarrays or lists
d = {"one": [1., 2., 3., 4.],
"two": [4., 3., 2., 1.]}
df = pd.DataFrame(d)
print df
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0
df1 = pd.DataFrame(d, index=["a", "b", "c", "d"])
print df1
one two
a 1.0 4.0
b 2.0 3.0
c 3.0 2.0
d 4.0 1.0
2.2.3 From a structured array
data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])  # "S10" is a 10-byte string; the "a10" alias was removed in NumPy 2.0
data[:] = [(1, 2., "Hello"), (2, 3., "World")]
df1 = pd.DataFrame(data)
print df1
A B C
0 1 2.0 Hello
1 2 3.0 World
df2 = pd.DataFrame(data, index=["first", "second"])
print df2
A B C
first 1 2.0 Hello
second 2 3.0 World
df3 = pd.DataFrame(data, columns=["C", "A", "B"])
print df3
C A B
0 Hello 1 2.0
1 World 2 3.0
2.2.4 From a list of dicts
data = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df1 = pd.DataFrame(data)
print df1
a b c
0 1 2 NaN
1 5 10 20.0
df2 = pd.DataFrame(data, columns=["a", "b"])
print df2
a b
0 1 2
1 5 10
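A small sketch of why column c prints as 20.0 while a and b stay integers: a missing value forces that column to float64. The dtypes dictionary is just an illustrative way to inspect the result.

```python
import pandas as pd

data = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df = pd.DataFrame(data)

# Each dict becomes one row; the columns are the union of all keys.
cols = list(df.columns)

# Column "c" contains a missing value, so it is stored as float64,
# while "a" and "b" stay integer.
dtypes = {name: df[name].dtype.kind for name in cols}

print(cols)
print(dtypes)
```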
2.2.5 From a list
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print df
0
0 1
1 2
2 3
3 4
4 5
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print df
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
2.3 Column selection, addition, and deletion
Select a column by its label:
d = {"one": pd.Series([1., 2., 3.], index=["a", "b", "c"]),
"two": pd.Series([1., 2., 3., 4.], index=["a", "b", "c", "d"])}
df = pd.DataFrame(d)
print df["one"]
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
Adding columns
df["three"] = df["one"] * df["two"] # computed from other columns
df["flag"] = df["one"] > 2 # computed from another column
df["foo"] = "bar" # every row is set to "bar"
df.insert(1, "bar", df["one"]) # insert a column at a fixed position
print df
one bar two three flag foo
a 1.0 1.0 1.0 1.0 False bar
b 2.0 2.0 2.0 4.0 False bar
c 3.0 3.0 3.0 9.0 True bar
d NaN NaN 4.0 NaN False bar
Deleting columns
del df["two"]
three = df.pop("three")
print df
one bar flag foo
a 1.0 1.0 False bar
b 2.0 2.0 False bar
c 3.0 3.0 True bar
d NaN NaN False bar
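The operations above modify df in place. As a hedged alternative, and not the original author's approach, the sketch below uses DataFrame.assign and DataFrame.drop(columns=...), which return new objects and leave the original untouched.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [1.0, 2.0, 3.0]},
                  index=["a", "b", "c"])

# assign returns a new DataFrame with the extra column; df itself is unchanged.
with_three = df.assign(three=df["one"] * df["two"])

# drop(columns=...) also returns a new DataFrame instead of mutating in place.
without_two = with_three.drop(columns=["two"])

print(list(df.columns))
print(list(with_three.columns))
print(list(without_two.columns))
```

This style is convenient when the same starting frame is reused for several derived tables.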
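2.4 Row selection, addition, and deletion
One or more rows of df can be selected in the following ways:
Method | Operation | Result |
---|---|---|
df.loc[label] | Select a row by label | Series |
df.iloc[loc] | Select a row by integer position | Series |
df[5:10] | Select multiple rows by slice | DataFrame |
df[bool_vec] | Select multiple rows by boolean array | DataFrame |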
df = pd.DataFrame(np.random.randn(10, 4), index=list("abcdefghij"), columns=["A", "B", "C", "D"])
print df
A B C D
a -0.986619 1.526696 -0.268968 -0.092091
b -1.151455 -0.512284 -0.978782 1.043218
c 0.909876 -1.032838 -0.103740 -0.002227
d -1.012738 0.519562 1.472160 -0.334393
e -0.833450 0.402912 -0.586269 -1.501751
f 0.039272 0.759840 -0.688571 -0.686812
g 0.641397 0.162648 -0.969303 1.060234
h -0.119458 0.059383 -1.328667 -0.777637
i 0.093021 -0.235605 0.166218 -0.582874
j 0.462327 -0.435135 -1.953918 0.531841
Selecting rows
print df.loc["b"]
A -1.151455
B -0.512284
C -0.978782
D 1.043218
Name: b, dtype: float64
print df.iloc[2]
A 0.909876
B -1.032838
C -0.103740
D -0.002227
Name: c, dtype: float64
print df[2:4]
A B C D
c 0.909876 -1.032838 -0.10374 -0.002227
d -1.012738 0.519562 1.47216 -0.334393
print df[[False, True, False, True, False, True, False, False, True, False]]
A B C D
b -1.151455 -0.512284 -0.978782 1.043218
d -1.012738 0.519562 1.472160 -0.334393
f 0.039272 0.759840 -0.688571 -0.686812
i 0.093021 -0.235605 0.166218 -0.582874
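The four selection styles can be sketched on a small deterministic frame, which makes the results easy to verify by hand. The frame and the threshold 3 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Small deterministic frame in place of the random 10x4 one above.
df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=list("abcd"), columns=["A", "B", "C"])

by_label = df.loc["b"]          # one row as a Series, selected by label
by_position = df.iloc[2]        # one row as a Series, selected by integer position
sliced = df[1:3]                # rows at positions 1 and 2 as a DataFrame
masked = df[df["A"] > 3]        # rows where the boolean mask is True

print(by_label.tolist())
print(list(sliced.index))
print(list(masked.index))
```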
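Appending rows
df2 = pd.DataFrame(np.random.randn(2, 4), index=["m", "n"], columns=["A", "B", "C", "D"])
df = pd.concat([df, df2]) # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions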
print df
A B C D
a -0.986619 1.526696 -0.268968 -0.092091
b -1.151455 -0.512284 -0.978782 1.043218
c 0.909876 -1.032838 -0.103740 -0.002227
d -1.012738 0.519562 1.472160 -0.334393
e -0.833450 0.402912 -0.586269 -1.501751
f 0.039272 0.759840 -0.688571 -0.686812
g 0.641397 0.162648 -0.969303 1.060234
h -0.119458 0.059383 -1.328667 -0.777637
i 0.093021 -0.235605 0.166218 -0.582874
j 0.462327 -0.435135 -1.953918 0.531841
m 1.004950 0.522191 -0.071558 -0.615419
n -0.995826 -1.055260 -1.204035 -1.444035
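Row-wise combination can also be sketched with pd.concat, which is available in both old and current pandas (DataFrame.append was removed in pandas 2.0). The tiny frames below are illustrative stand-ins for the random data above.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["a", "b"])
df2 = pd.DataFrame({"A": [5], "B": [6]}, index=["m"])

# concat stacks the rows of both frames and keeps their original labels.
combined = pd.concat([df, df2])

# ignore_index=True discards the labels and renumbers rows 0..n-1.
renumbered = pd.concat([df, df2], ignore_index=True)

print(list(combined.index))
print(list(renumbered.index))
```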
Deleting rows
df = df.drop("a")
print df
A B C D
b -1.151455 -0.512284 -0.978782 1.043218
c 0.909876 -1.032838 -0.103740 -0.002227
d -1.012738 0.519562 1.472160 -0.334393
e -0.833450 0.402912 -0.586269 -1.501751
f 0.039272 0.759840 -0.688571 -0.686812
g 0.641397 0.162648 -0.969303 1.060234
h -0.119458 0.059383 -1.328667 -0.777637
i 0.093021 -0.235605 0.166218 -0.582874
j 0.462327 -0.435135 -1.953918 0.531841
m 1.004950 0.522191 -0.071558 -0.615419
n -0.995826 -1.055260 -1.204035 -1.444035
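A brief sketch of how drop behaves: it returns a new DataFrame, accepts a list of labels, and raises KeyError for an unknown label unless errors="ignore" is passed. The frame here is a small illustrative one.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"])

# drop returns a new DataFrame; the original keeps all of its rows.
dropped = df.drop("a")

# Several labels can be dropped at once by passing a list.
dropped_two = df.drop(["a", "c"])

# Dropping a label that does not exist raises KeyError unless errors="ignore".
unchanged = df.drop("z", errors="ignore")

print(list(dropped.index))
print(list(dropped_two.index))
print(list(unchanged.index))
```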
2.5 Transpose
df_t = df.T
print df_t
b c d e f g h \
A -1.151455 0.909876 -1.012738 -0.833450 0.039272 0.641397 -0.119458
B -0.512284 -1.032838 0.519562 0.402912 0.759840 0.162648 0.059383
C -0.978782 -0.103740 1.472160 -0.586269 -0.688571 -0.969303 -1.328667
D 1.043218 -0.002227 -0.334393 -1.501751 -0.686812 1.060234 -0.777637
i j m n
A 0.093021 0.462327 1.004950 -0.995826
B -0.235605 -0.435135 0.522191 -1.055260
C 0.166218 -1.953918 -0.071558 -1.204035
D -0.582874 0.531841 -0.615419 -1.444035
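The effect of .T is easier to see on a 2x2 frame than on the wrapped output above. This is a minimal sketch with made-up values.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])
df_t = df.T

# Rows and columns swap places.
print(list(df_t.index))
print(list(df_t.columns))

# A cell at (row, column) moves to (column, row).
print(df_t.loc["B", "x"])
```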
3 Panel
Rarely used, so not covered for now; to be added later. Note that Panel was deprecated in pandas 0.20 and removed in 0.25.