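Python Data Science Handbook Study Notes
Part 2: NumPy (2)
NumPy's vectorized operations closely resemble Matlab's. Vectorized operations are far more efficient than Python loops, so prefer them over loops wherever possible.
A "ufunc" (universal function) is one of a family of functions that operate on an entire array at once.
Some more specialized functions are available through the scipy package.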
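As a rough illustration of the vectorization advice above, the sketch below computes the same reciprocals once with an explicit Python loop and once with a ufunc. The helper name reciprocal_loop and the sample values are made up for this example and are not from the book.

```python
import numpy as np

def reciprocal_loop(values):
    """Compute 1/x element by element with an explicit Python loop."""
    out = np.empty(len(values))
    for k in range(len(values)):
        out[k] = 1.0 / values[k]
    return out

values = np.arange(1, 6)            # array([1, 2, 3, 4, 5])
looped = reciprocal_loop(values)    # one Python-level iteration per element
vectorized = 1.0 / values           # a single call into the np.divide ufunc
```

Both give the same numbers; on large arrays the vectorized form is usually much faster because the loop runs in compiled code.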
In [29]: from scipy import special
In [30]: x = np.random.randint(15, size = (5,5), dtype = 'int32')
In [31]: x
Out[31]:
array([[ 4, 14, 8, 5, 7],
[ 0, 8, 8, 14, 9],
[ 9, 9, 10, 14, 1],
[13, 10, 0, 12, 12],
[ 7, 3, 2, 14, 2]], dtype=int32)
In [32]: special.erf(x)
Out[32]:
array([[ 0.99999998, 1. , 1. , 1. , 1. ],
[ 0. , 1. , 1. , 1. , 1. ],
[ 1. , 1. , 1. , 1. , 0.84270079],
[ 1. , 1. , 0. , 1. , 1. ],
[ 1. , 0.99997791, 0.99532227, 1. , 0.99532227]])
In [33]: x
Out[33]:
array([[ 4, 14, 8, 5, 7],
[ 0, 8, 8, 14, 9],
[ 9, 9, 10, 14, 1],
[13, 10, 0, 12, 12],
[ 7, 3, 2, 14, 2]], dtype=int32)
Specifying output
In [24]:
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)
[ 0. 10. 20. 30. 40.]
y = np.zeros(10)
np.power(2, x, out=y[::2])
print(y)
[ 1. 0. 2. 0. 4. 0. 8. 0. 16. 0.]
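You may wonder what the advantage is over a plain assignment.
Writing y[::2] = 2 ** x first creates a temporary array to hold the result of the right-hand side and then copies it into the sub-array on the left. Passing the out argument writes the result directly into the target memory and skips the temporary, which saves time and memory on large arrays.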
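A minimal side-by-side sketch of the two approaches, assuming the same x as in the session above (the names y_assign and y_out are chosen here for illustration). The results are identical; only the intermediate allocation differs.

```python
import numpy as np

x = np.arange(5)

# Plain assignment: 2 ** x builds a temporary array, which is then copied.
y_assign = np.zeros(10)
y_assign[::2] = 2 ** x

# out=: the ufunc writes straight into the strided view y_out[::2].
y_out = np.zeros(10)
np.power(2, x, out=y_out[::2])
```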
Aggregation
In [36]: x = np.linspace(0, 10, 5)
In [37]: x
Out[37]: array([ 0. , 2.5, 5. , 7.5, 10. ])
In [38]: np.add.reduce(x)
Out[38]: 25.0
In [39]: np.multiply.reduce(x)
Out[39]: 0.0
In [40]: np.add.accumulate(x)
Out[40]: array([ 0. , 2.5, 7.5, 15. , 25. ])
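For these common cases NumPy also offers dedicated functions. The sketch below, using the same x as above, checks that np.sum, np.prod and np.cumsum agree with the reduce and accumulate methods.

```python
import numpy as np

x = np.linspace(0, 10, 5)            # [0, 2.5, 5, 7.5, 10]

total = np.add.reduce(x)             # same result as np.sum(x)
product = np.multiply.reduce(x)      # same result as np.prod(x)
running = np.add.accumulate(x)       # same result as np.cumsum(x)
```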
Outer product
In [41]: x = np.arange(1, 5)
In [42]: x
Out[42]: array([1, 2, 3, 4])
In [43]: np.multiply.outer(x, x)
Out[43]:
array([[ 1, 2, 3, 4],
[ 2, 4, 6, 8],
[ 3, 6, 9, 12],
[ 4, 8, 12, 16]])
Aggregation functions in NumPy: min, max, and others
In [44]: x = np.arange(1, 10)
In [45]: x
Out[45]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [46]: %timeit x.sum()
1.11 μs ± 72.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [47]: %timeit sum(x) #Be careful, don't use the python-version sum()
1.3 μs ± 5.45 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [48]: x.min()
Out[48]: 1
In [49]: x.max()
Out[49]: 9
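Besides speed, Python's built-in sum and NumPy's sum also behave differently on multidimensional arrays. A small sketch with a made-up 2 x 3 array:

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)    # [[0, 1, 2], [3, 4, 5]]

builtin_result = sum(grid)           # iterates over rows: array([3, 5, 7])
numpy_result = np.sum(grid)          # sums every element: 15
```

The built-in only adds up the items along the first axis, so it returns an array here rather than a scalar.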
We can also set the axis argument to aggregate along rows or columns.
In [50]: Mat = np.random.random((3,4))
In [51]: Mat.sum(axis = 1)
Out[51]: array([ 2.54634383, 2.42121143, 1.28962794])
In [52]: Mat
Out[52]:
array([[ 0.77880176, 0.57543626, 0.6840498 , 0.508056 ],
[ 0.75612961, 0.15132258, 0.65047932, 0.86327992],
[ 0.25738888, 0.5731711 , 0.03401482, 0.42505314]])
In [53]: Mat.sum(axis = 0)
Out[53]: array([ 1.79232025, 1.29992993, 1.36854395, 1.79638906])
In [54]: # axis = 0 collapses the rows, giving one sum per column
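Random values make the axis behavior hard to check by eye, so here is a small sketch with a fixed matrix (chosen for this example) where the sums are easy to verify by hand.

```python
import numpy as np

M = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

col_sums = M.sum(axis=0)   # collapse axis 0 (rows): one value per column
row_sums = M.sum(axis=1)   # collapse axis 1 (columns): one value per row
```

The axis argument names the dimension that disappears from the result, so a (3, 4) array gives shape (4,) for axis=0 and (3,) for axis=1.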
Broadcasting
The simplest case of broadcasting
In [1]: import numpy as np
In [2]: a = np.array([1, 2, 3])
In [3]: b = 3
In [4]: a + b
Out[4]: array([4, 5, 6])
Some more complex examples
In [5]: M = np.ones((3, 3))
In [6]: M + a
Out[6]:
array([[ 2., 3., 4.],
[ 2., 3., 4.],
[ 2., 3., 4.]])
In [7]: a = np.arange(3)
In [8]: b = np.arange(3)[:, np.newaxis]
In [9]: a
Out[9]: array([0, 1, 2])
In [10]: b
Out[10]:
array([[0],
[1],
[2]])
In [11]: a + b
Out[11]:
array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
Note that in this process arrays of different shapes are "stretched" to match each other.
The three rules of broadcasting
- Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
- Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
- Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
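The three rules can be traced on small arrays. The shapes below are chosen for illustration only: the first addition succeeds through Rules 1 and 2, the second triggers Rule 3, and the last line shows one way to make the shapes compatible.

```python
import numpy as np

a = np.arange(3)                 # shape (3,)

# Rule 1: a is treated as shape (1, 3). Rule 2: it is stretched to (2, 3).
M = np.ones((2, 3))
ok = M + a

# Rule 3: (3, 2) against (1, 3) disagrees in the last dimension (2 vs 3)
# and neither size is 1, so NumPy raises a ValueError.
N = np.ones((3, 2))
try:
    N + a
    raised = False
except ValueError:
    raised = True

# Turning a into a column of shape (3, 1) makes the shapes compatible.
fixed = N + a[:, np.newaxis]
```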
A practical example
Create a data set for z = f(x, y)
# x and y have 50 steps from 0 to 5
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
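To see what broadcasting does in this expression, the sketch below repeats it on a smaller grid (4 by 3 points instead of 50 by 50, an arbitrary choice for this example) and compares it with an explicit np.meshgrid version.

```python
import numpy as np

x = np.linspace(0, 5, 4)                    # shape (4,)
y = np.linspace(0, 5, 3)[:, np.newaxis]     # shape (3, 1)

# Broadcasting combines (4,) and (3, 1) into a (3, 4) grid.
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

# The same grid built explicitly with full coordinate matrices.
xx, yy = np.meshgrid(x, y.ravel())
z_explicit = np.sin(xx) ** 10 + np.cos(10 + yy * xx) * np.cos(xx)
```

The broadcast form never materializes the full coordinate matrices xx and yy, which is why it is the more economical choice.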
Boolean masking
Here the book uses a rainfall data set to show what boolean masking can do.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values
In [4]: inches = rainfall / 254.0  # 1/10 mm -> inches
In [5]: inches.shape
Out[5]: (365,)
Next, we can visualize the data to look for patterns in it.
ufuncs
Earlier we mentioned that ufuncs are functions that operate on an array as a whole. Here we combine them with boolean masking.
In [1]: import numpy as np
In [2]: rng = np.random.RandomState(0)
In [3]: x = rng.randint(10, size = (3, 4))
In [4]: x
Out[4]:
array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])
In [5]: x < 6
Out[5]:
array([[ True, True, True, True],
[False, False, True, True],
[ True, True, False, False]], dtype=bool)
The comparison ufunc above gives us a boolean array. The author then shows what boolean arrays are good for.
In [5]: x < 6
Out[5]:
array([[ True, True, True, True],
[False, False, True, True],
[ True, True, False, False]], dtype=bool)
In [6]: np.count_nonzero(_)
Out[6]: 8
In [7]: np.sum(x < 6)
Out[7]: 8
In [8]: np.any(x > 8)
Out[8]: True
In [9]: np.all(x < 8, axis = 1)
Out[9]: array([ True, False, True], dtype=bool)
In [10]: # Working together with boolean operators
In [11]: np.sum((x < 6) & (x >= 0))
Out[11]: 8
Finally, a boolean array can also be used as a mask, which is very similar to logical arrays in Matlab.
In [12]: x[x < 6]
Out[12]: array([5, 0, 3, 3, 3, 5, 2, 4])
Back to the rainfall example: with masks we can extract the data we want very elegantly.
# construct a mask of all rainy days
rainy = (inches > 0)
# construct a mask of all summer days (June 21st is the 172nd day)
days = np.arange(365)
summer = (days > 172) & (days < 262)
print("Median precip on rainy days in 2014 (inches): ",
np.median(inches[rainy]))
print("Median precip on summer days in 2014 (inches): ",
np.median(inches[summer]))
print("Maximum precip on summer days in 2014 (inches): ",
np.max(inches[summer]))
print("Median precip on non-summer rainy days (inches):",
np.median(inches[rainy & ~summer]))
Median precip on rainy days in 2014 (inches): 0.194881889764
Median precip on summer days in 2014 (inches): 0.0
Maximum precip on summer days in 2014 (inches): 0.850393700787
Median precip on non-summer rainy days (inches): 0.200787401575
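Finally, note the difference between and / or and & / |. The keywords and and or evaluate the truth value of a whole object, whereas & and | are bitwise operators that act element by element on arrays.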
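A short sketch of that difference, using a made-up array: the & operator combines two masks element by element, while the and keyword tries to reduce each whole array to a single truth value and fails.

```python
import numpy as np

x = np.array([5, 0, 3, 7])

# Element-wise combination of two boolean masks.
mask = (x > 2) & (x < 6)

# The keyword form asks for the truth value of an entire array,
# which is ambiguous, so NumPy raises a ValueError.
try:
    (x > 2) and (x < 6)
    ambiguous = False
except ValueError:
    ambiguous = True
```

The parentheses around each comparison matter, because & binds more tightly than < and >.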
Fancy Indexing
Fancy indexing means using an array as the index into another array (the boolean masks from the previous section are one example).
In [14]: ind = np.array([[3, 7], [4, 5]])
In [15]: rand = np.random.RandomState(45)
In [16]: x= rand.randint(100, size = (10, 5))
In [17]: x
Out[17]:
array([[75, 30, 3, 32, 95],
[61, 85, 35, 68, 15],
[65, 14, 53, 57, 72],
[87, 46, 8, 53, 12],
[34, 24, 12, 17, 68],
[30, 56, 14, 36, 31],
[86, 36, 57, 61, 79],
[17, 6, 42, 11, 8],
[49, 77, 75, 63, 42],
[54, 16, 24, 95, 63]])
In [18]: x[ind]
Out[18]:
array([[[87, 46, 8, 53, 12],
[17, 6, 42, 11, 8]],
[[34, 24, 12, 17, 68],
[30, 56, 14, 36, 31]]])
In [19]: # The shape of the result reflects the shape of the index array rather than the shape of the array being indexed
In [20]: X = np.arange(12).reshape((3, 4))
In [21]: X
Out[21]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [22]: row = np.array([0, 1, 2])
In [23]: col = np.array([2, 1, 3])
In [24]: X[row, col]
Out[24]: array([ 2, 5, 11])
In [25]: # We get the (0, 2), (1, 1), (2, 3) th element
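Index arrays are also subject to broadcasting. The sketch below reuses X, row and col from the session above: paired indices pick one element per (row, col) pair, while a column-shaped row index combined with col yields a full 3 x 3 table.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

# Paired indices: elements (0, 2), (1, 1), (2, 3).
paired = X[row, col]

# row as a (3, 1) column broadcasts against col of shape (3,),
# so every row index is combined with every column index.
table = X[row[:, np.newaxis], col]
```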
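Note that X below is a different array from the 3 x 4 example above: it holds 100 two-dimensional points, and the cell that creates it is not shown in these notes.
In [34]: X.shape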
Out[34]: (100, 2)
In [35]: import matplotlib.pyplot as plt
In [36]: import seaborn; seaborn.set()
In [37]: plt.scatter(X[:, 0], X[:, 1])
Out[37]: <matplotlib.collections.PathCollection at 0x7f0cc9c461d0>
<matplotlib.figure.Figure at 0x7f0cc9c6b5f8>
In [38]: plt.show()
In [39]: indices = np.random.choice(X.shape[0], 20, replace = False)
In [40]: indices
Out[40]:
array([15, 87, 73, 17, 44, 66, 89, 91, 8, 25, 19, 39, 85, 49, 26, 20, 58,
41, 55, 24])
In [41]: selection = X[indices] # fancy indexing
In [42]: selection
Out[42]:
array([[ -1.80623391e-01, -2.15707232e+00],
[ -8.04178492e-01, -1.34828994e+00],
[ -1.24272035e+00, -2.42157557e+00],
[ 3.57111518e-01, 8.94495954e-02],
[ 2.15274973e+00, 3.24279140e+00],
[ -4.18439156e-01, -8.58736471e-01],
[ 6.08859877e-01, -2.59284917e-01],
[ -6.29633042e-01, 1.32258627e-01],
[ 1.11113414e+00, 1.77185490e+00],
[ 1.65522319e+00, 4.23558698e+00],
[ -1.40629915e-01, -1.62069848e-01],
[ 5.21162541e-01, 2.89756456e+00],
[ -1.11282410e+00, -1.82987036e+00],
[ -5.71948987e-01, -3.34258009e+00],
[ -2.34528800e+00, -3.77554207e+00],
[ -2.58467915e-01, -8.69598951e-01],
[ -1.46270269e-01, -1.27384266e-04],
[ -7.79152780e-02, -2.01423478e+00],
[ -1.79097697e+00, -1.08351482e+00],
[ -1.31637907e+00, -1.86128924e+00]])
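The same random-subset selection can be reproduced on stand-in data. In this sketch the array points is a hypothetical substitute for X (plain standard-normal samples, not the data used in the session above); only the selection step mirrors the notes.

```python
import numpy as np

rng = np.random.RandomState(0)
points = rng.randn(100, 2)       # hypothetical stand-in for X, shape (100, 2)

# Pick 20 distinct row indices, then use them as a fancy index.
indices = rng.choice(points.shape[0], 20, replace=False)
selection = points[indices]      # shape (20, 2): one row per chosen index
```

This pattern is a quick way to split off a subset of rows, for example when holding out samples for validation.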
Using Fancy Index to modify values
In [53]: x
Out[53]: array([ 0., 0., 2., 3., 4., 0.])
In [54]: i
Out[54]: [2, 3, 3, 4, 4, 4]
In [55]: x[i] += 1
In [56]: x
Out[56]: array([ 0., 0., 3., 4., 5., 0.])
In [57]: x = np.zeros(10)
In [58]: np.add.at(x, i, 1) # the proper way to accumulate over repeated indices
In [59]: x
Out[59]: array([ 0., 0., 1., 2., 3., 0., 0., 0., 0., 0.])
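The reason for the two different results is worth spelling out. x[i] += 1 is evaluated as x[i] = x[i] + 1, so a repeated index is assigned several times but incremented only once. np.add.at applies the addition once per occurrence. A sketch starting from zeros, with np.bincount added here as a cross-check:

```python
import numpy as np

i = [2, 3, 3, 4, 4, 4]

# Buffered: x[i] + 1 is computed once, then assigned; repeats collapse.
buffered = np.zeros(6)
buffered[i] += 1

# Unbuffered: the addition is applied once for every occurrence in i.
unbuffered = np.zeros(6)
np.add.at(unbuffered, i, 1)

# Cross-check: bincount counts how often each index occurs.
occurrences = np.bincount(i, minlength=6)
```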
Binning Data
In [67]: np.random.seed(42)
In [68]: x = np.random.randn(100)
In [69]: np.size(x)
Out[69]: 100
In [70]: bins = np.linspace(-5, 5, 20)
In [71]: counts = np.zeros_like(bins)
In [72]: np.size(counts)
Out[72]: 20
In [73]: i = np.searchsorted(bins, x)
In [74]: i
Out[74]:
array([11, 10, 11, 13, 10, 10, 13, 11, 9, 11, 9, 9, 10, 6, 7, 9, 8,
11, 8, 7, 13, 10, 10, 7, 9, 10, 8, 11, 9, 9, 9, 14, 10, 8,
12, 8, 10, 6, 7, 10, 11, 10, 10, 9, 7, 9, 9, 12, 11, 7, 11,
9, 9, 11, 12, 12, 8, 9, 11, 12, 9, 10, 8, 8, 12, 13, 10, 12,
11, 9, 11, 13, 10, 13, 5, 12, 10, 9, 10, 6, 10, 11, 13, 9, 8,
9, 12, 11, 9, 11, 10, 12, 9, 9, 9, 7, 11, 10, 10, 10])
In [75]: x
Out[75]:
array([ 0.49671415, -0.1382643 , 0.64768854, 1.52302986, -0.23415337,
-0.23413696, 1.57921282, 0.76743473, -0.46947439, 0.54256004,
-0.46341769, -0.46572975, 0.24196227, -1.91328024, -1.72491783,
-0.56228753, -1.01283112, 0.31424733, -0.90802408, -1.4123037 ,
1.46564877, -0.2257763 , 0.0675282 , -1.42474819, -0.54438272,
0.11092259, -1.15099358, 0.37569802, -0.60063869, -0.29169375,
-0.60170661, 1.85227818, -0.01349722, -1.05771093, 0.82254491,
-1.22084365, 0.2088636 , -1.95967012, -1.32818605, 0.19686124,
0.73846658, 0.17136828, -0.11564828, -0.3011037 , -1.47852199,
-0.71984421, -0.46063877, 1.05712223, 0.34361829, -1.76304016,
0.32408397, -0.38508228, -0.676922 , 0.61167629, 1.03099952,
0.93128012, -0.83921752, -0.30921238, 0.33126343, 0.97554513,
-0.47917424, -0.18565898, -1.10633497, -1.19620662, 0.81252582,
1.35624003, -0.07201012, 1.0035329 , 0.36163603, -0.64511975,
0.36139561, 1.53803657, -0.03582604, 1.56464366, -2.6197451 ,
0.8219025 , 0.08704707, -0.29900735, 0.09176078, -1.98756891,
-0.21967189, 0.35711257, 1.47789404, -0.51827022, -0.8084936 ,
-0.50175704, 0.91540212, 0.32875111, -0.5297602 , 0.51326743,
0.09707755, 0.96864499, -0.70205309, -0.32766215, -0.39210815,
-1.46351495, 0.29612028, 0.26105527, 0.00511346, -0.23458713])
In [76]: np.add.at(counts, i, 1)
In [77]: counts
Out[77]:
array([ 0., 0., 0., 0., 0., 1., 3., 7., 9., 23., 22.,
17., 10., 7., 1., 0., 0., 0., 0., 0.])
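The manual binning above can be compared with np.histogram. The sketch below repeats the steps with a local RandomState(42) instead of the global seed (an assumption made here to keep the example self-contained) and checks that both approaches count the same values per bin.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.randn(100)

bins = np.linspace(-5, 5, 20)
counts = np.zeros_like(bins)

# searchsorted returns, for each value, the index of the first bin edge
# that is >= the value; add.at then tallies one count per value.
i = np.searchsorted(bins, x)
np.add.at(counts, i, 1)

# np.histogram returns one count per interval between consecutive edges,
# so it has one entry fewer than counts.
hist, edges = np.histogram(x, bins)
```

Here counts[0] is only for values at or below the first edge, so counts[1:] lines up with hist as long as no sample lands exactly on a bin edge.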
Sorting
NumPy provides two main sorting functions, sort() and argsort().
In [18]: x
Out[18]: array([14, 92, 58, 74, 22])
In [19]: i = np.argsort(x)
In [20]: x[i]
Out[20]: array([14, 22, 58, 74, 92])
With the index array returned by argsort, we can use fancy indexing to build the sorted array.
In [21]: x = np.arange(1,10)
In [22]: np.random.shuffle(x)
In [23]: x
Out[23]: array([2, 9, 4, 3, 8, 6, 7, 5, 1])
In [24]: np.partition(x, 5)
Out[24]: array([1, 2, 3, 4, 5, 6, 7, 9, 8])
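With partition instead of sort we can get the k smallest elements: np.partition(x, k) places them in the first k positions, in no particular order, without fully sorting the array.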
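A small sketch of both ideas, using the shuffled array from the session and k = 3 (a value picked for this example). Since the order inside each partition is not guaranteed, the check sorts the k smallest values before comparing.

```python
import numpy as np

x = np.array([2, 9, 4, 3, 8, 6, 7, 5, 1])
k = 3

# The first k entries of the partitioned array are the k smallest values,
# in arbitrary order.
smallest_k = np.partition(x, k)[:k]

# argsort returns the indices that would sort x; fancy indexing applies them.
order = np.argsort(x)
sorted_x = x[order]
```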
Structured arrays
In [25]: name = ['Alice', 'Bob', 'Cathy', 'Doug']
...: age = [25, 45, 37, 19]
...: weight = [55.0, 85.5, 68.0, 61.5]
...:
In [26]: x = np.zeros(4, dtype=int)
In [27]: # compound data type
In [28]: data = np.zeros(4, dtype={'names':('name', 'age', 'weight'), 'formats'
...: :('U10', 'i4', 'f8')})
In [29]: data.dtype
Out[29]: dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
In [30]: data['name']=name;data['age']=age;data['weight']=weight
In [31]: data
Out[31]:
array([('Alice', 25, 55. ), ('Bob', 45, 85.5), ('Cathy', 37, 68. ),
('Doug', 19, 61.5)],
dtype=[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
In [32]: data[data['age'] < 30]['name']
Out[32]:
array(['Alice', 'Doug'],
dtype='<U10')
Besides structured arrays, NumPy also has built-in record arrays. The main difference is that the keys above can be accessed as attributes; the downside is that attribute access is slower than access by key.
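A minimal sketch of attribute access, rebuilding the structured array from the session above and viewing it as a record array with np.recarray. The variable names data_rec, ages_by_key and ages_by_attr are chosen here for illustration.

```python
import numpy as np

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

# View the same memory as a record array: fields become attributes.
data_rec = data.view(np.recarray)

ages_by_key = data['age']        # dictionary-style access
ages_by_attr = data_rec.age      # attribute-style access, same values

# Boolean masks work on fields just as they do on ordinary arrays.
young = data[data['age'] < 30]['name']
```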
Finally, pandas gives us more powerful and efficient tools for working with this kind of data.