[Study notes] Python pandas + numpy + visualization

20190605

I. pandas

pandas.png

II. numpy

numpy.png

III. Visualization

可視化.png

IV. The classic 60 exercises:

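1. The difference between (&, |) and (and, or) in Python
When the operands are plain booleans, the two forms give the same result. When the operands are numbers, & and | are bitwise operations, while and/or test whether each operand is non-zero and return one of the operands.
In pandas, (df['Height']>200) | (df['Height']<170) produces a result, whereas (df['Height']>200) or (df['Height']<170) raises a ValueError. The reason is that | is applied element by element across the two boolean Series, while or first has to reduce each whole Series to a single True/False, which pandas refuses to do because the truth value of a Series is ambiguous.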
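A minimal sketch of the behaviour described above. The three Height values are made-up sample data, not taken from the original exercise.

```python
import pandas as pd

# Hypothetical sample data; only the column name follows the note above.
df = pd.DataFrame({'Height': [165, 180, 205]})

# | works element by element and yields a boolean Series.
mask = (df['Height'] > 200) | (df['Height'] < 170)
selected = df[mask]

# `or` has to call bool() on a whole Series, which pandas rejects.
try:
    (df['Height'] > 200) or (df['Height'] < 170)
    or_failed = False
except ValueError:
    or_failed = True

# On plain integers: & and | work on bits, and/or return one of the operands.
bitwise = (6 & 3, 6 | 3)     # (2, 7)
logical = (6 and 3, 6 or 3)  # (3, 6)
```

Note the parentheses around each comparison: | binds more tightly than > and <, so leaving them out changes the meaning of the expression.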


2. Count the number of each type of animal in df.

df['animal'].value_counts()

3. Rebuilding a table with pivot_table
For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).

df.pivot_table(values='age',index='animal',columns='visits')  # index is animal, one column per distinct visits value, and each cell holds the mean age (mean is the default aggfunc)
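To make the shape of the result concrete, here is a small sketch on made-up data. The five rows are assumptions for illustration; only the column names animal, visits, and age come from the exercise.

```python
import pandas as pd

# Hypothetical data with the three columns used by the exercise.
df = pd.DataFrame({
    'animal': ['cat', 'cat', 'dog', 'dog', 'cat'],
    'visits': [1, 1, 2, 1, 3],
    'age': [2.0, 4.0, 5.0, 7.0, 1.0],
})

# One row per animal, one column per distinct visits value, mean age in each cell.
table = df.pivot_table(values='age', index='animal', columns='visits', aggfunc='mean')
print(table)
```

Combinations that never occur in the data (for example a cat with 2 visits here) show up as NaN in the pivoted table.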

4. Arithmetic (+, -, *, /) in pandas
How do you subtract the row mean from each element in the row?

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random(size=(5, 3)))
df.sub(df.mean(axis=1), axis=0)  # axis=0 aligns the row means with the row index

5. Returning the label of the minimum/maximum value along the columns or the index
Suppose you have DataFrame with 10 columns of real numbers, for example:
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
Which column of numbers has the smallest sum? (Find that column's label.)

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
df.sum().idxmin()

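6. Returning specific columns or index labels
There are only two relevant attributes for getting the row and column labels, dataframe.index and dataframe.columns (both return an Index object, not a Series). To get the labels as a list, append tolist().

To return specific rows or columns, the idea is "filter, then read off the labels". This generally needs loc/iloc rather than plain df[...] indexing, which cannot select columns with a boolean mask.

(2) Returning specific columns follows the same idea as returning rows, but needs loc/iloc; loc also supports more complex selections: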

import pandas as pd
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]

# Create a DataFrame and set the row index
df = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])
b=df.loc[:,df.isnull().sum(axis=0)==1].columns.tolist()
# names of the columns that contain exactly one NaN value

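Why does plain indexing not work here? Plain df[...] indexing accepts a column name (or a list of column names), a row slice, or a boolean Series that is aligned with the row index.
Here df.isnull().sum(axis=0)==1 returns a boolean Series indexed by column name. Passing it as df[df.isnull().sum(axis=0)==1] fails, because pandas treats a boolean Series inside df[...] as a row mask and cannot align it with the row index. A list of column names, by contrast, works.
So, for convenience, plain indexing is fine for filtering rows, but filtering columns with a boolean mask needs loc/iloc.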

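6. Removing duplicate rows/columns
For duplicate rows there are two methods: duplicated() (returns a boolean Series that marks the rows repeating an earlier row) and drop_duplicates(). Both consider all columns by default, or only the columns passed via subset.
There seems to be no dedicated method for removing duplicate columns. One workaround is to transpose with df.T and then drop the duplicate rows.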
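The transpose idea can be sketched as follows. The small DataFrame is made up for illustration, and the second variant (masking with duplicated()) is an alternative not mentioned in the original note.

```python
import pandas as pd

# Hypothetical frame in which column 'b' repeats column 'a'.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3], 'c': [4, 5, 6]})

# Option 1: transpose, drop duplicate rows, transpose back.
dedup_t = df.T.drop_duplicates().T

# Option 2: mark duplicated columns on the transpose and keep the others.
dedup_mask = df.loc[:, ~df.T.duplicated()]

print(dedup_t.columns.tolist())
print(dedup_mask.columns.tolist())
```

Transposing a frame with mixed column types turns everything into object dtype, so the second variant is probably the safer choice when the original dtypes matter.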

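7. Handling NaN values in pandas
(1) Creating a DataFrame that contains NaN values:
The following two approaches do not create missing values, because the cells hold ordinary strings (str):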

df=pd.DataFrame({'A':[1,2,3],'B':[1,2,3],'C':[1,2,3],'D':['','',''],'E':['','',''],'F':['','',''],'G':['','',''],'H':['','',''],'I':['','',''],'J':['','','']})
type(df.loc[1,'J'])  # str

df=pd.DataFrame({'A':[1,2,3],'B':[1,2,3],'C':[1,2,3]})
df['D']='NaN'
type(df.loc[1,'D'])  # str

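The following approach does create missing values: keys that are absent from a dict become NaN, and the value has type numpy.float64 because NaN is a floating-point value. It is not the only way (assigning np.nan or None also works). isnull still returns True for these NaN cells.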

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]

df = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])
df.loc['store 1','glasses']          # returns nan (displayed in lowercase, not NaN)
type(df.loc['store 1','glasses'])        # returns numpy.float64

(2) Counting missing values:
See https://blog.csdn.net/Tyro_java/article/details/81396000

# Count the NaN values in store_items
x =  store_items.isnull().sum().sum()

# Print the result
print('Number of NaN values in our DataFrame:', x)

In pandas, the boolean True has the numeric value 1 and False has the numeric value 0.
So the NaN values can be counted by counting the True values.
To get the total number of True values, .sum() is called twice.
The first sum() returns a pandas Series holding the number of True values in each column.
The second sum() adds up the entries of that Series.

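The .count() method can be used to get the total number of non-missing values:

df.count().sum()

count differs from sum: count returns the number of non-missing entries in each column, whatever their values, while sum adds the values themselves. That is why the per-column counts are totalled with sum(); df.count().count() would only return the number of columns.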
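A short sketch that puts the counting calls side by side. The data is made up, and it also shows that np.nan or None can be written into a frame directly.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with four missing cells out of nine.
df = pd.DataFrame({
    'A': [1, np.nan, 3],
    'B': [np.nan, np.nan, 6],
    'C': ['x', 'y', None],
})

per_column_missing = df.isnull().sum()      # Series: missing cells per column
n_missing = int(per_column_missing.sum())   # total number of missing cells
n_present = int(df.count().sum())           # total number of non-missing cells
n_columns = int(df.count().count())         # just the number of columns

print(n_missing, n_present, n_columns)
```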

8. groupby with a multi-level index
For each group, find the sum of the three greatest values.

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
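#df.groupby(['grps'])['vals'].nlargest(3)      # nlargest(3) returns a new Series with a two-level index (group, original row label) that holds only the three largest values of each group
df.groupby(['grps'])['vals'].nlargest(3).groupby(level=0).sum()  # sum(level=0) was removed in pandas 2.0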

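An attempt at building a ranking column by hand (create the rank column with rank, then filter → groupby → sum). Note that method='dense' gives tied values the same rank, so a group with ties could contribute more than three rows: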

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})

df['rank']=df.groupby(['grps'])['vals'].rank(method='dense',ascending=False)
(df[df.loc[:,'rank']<4].groupby(['grps'])['vals']).sum()

The classic 10 exercise sets:
1. Converting strings to numbers
Create a lambda function and change the type of item price: (item_price is originally a string, e.g. $2.39)
Use transform with a lambda function to turn the strings into floats:

chipo['item_price']=chipo['item_price'].transform(lambda x:float(x[1:]))  # drop the leading $; float() ignores surrounding whitespace. Simpler than other string conversions!

2. Counting the number of distinct rows
The simplest way is to use value_counts() to count how often each value occurs, and then count the number of entries in the result:

chipo.item_name.value_counts().count()

With a groupby approach, the GroupBy object first has to be collapsed to one entry per group (for example with size(), mean(), or sum()) before counting:

chipo.groupby('item_name').size().count()
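For comparison, a sketch of three ways to get the number of distinct item names. The tiny chipo frame is a stand-in, not the real dataset, and nunique() and ngroups are alternatives that the original note does not mention.

```python
import pandas as pd

# Hypothetical stand-in for the chipo dataset.
chipo = pd.DataFrame({'item_name': ['Burrito', 'Bowl', 'Burrito', 'Chips']})

via_value_counts = chipo.item_name.value_counts().count()
via_nunique = chipo.item_name.nunique()
via_groupby = chipo.groupby('item_name').ngroups

print(via_value_counts, via_nunique, via_groupby)
```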

3. pandas string-handling functions
Select the teams that start with G
My own attempt:

euro12[euro12.Goals>6].Team

But pandas actually has functions for testing the start and end of a string (str.startswith and str.endswith):

euro12[euro12.Team.str.startswith('G')]

4. loc/iloc select by row/column labels; how do you filter on the values inside a column?
Present only the Shooting Accuracy from England, Italy and Russia
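One approach is to combine comparisons, (euro12.Team=='England') | ..., and index with the result. The drawback is that with many values to match, it takes a long and clumsy chain of "or" conditions.
A simpler approach is to use isin and index with the result: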

euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]

5. The stack and unstack methods
https://www.cnblogs.com/bambipai/p/7658311.html
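As a quick illustration (the two-by-two frame is made up), stack moves the column labels into an inner index level and unstack moves them back:

```python
import pandas as pd

# Hypothetical 2x2 frame.
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1', 'r2'])

stacked = df.stack()           # Series indexed by (row label, column label)
restored = stacked.unstack()   # back to one column per former column label

print(stacked)
print(restored)
```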


6. Printing each group directly after groupby (a bit like enumerate, which returns an index and a value)

# Group the dataframe by regiment, and for each regiment,
for name, group in regiment.groupby('regiment'):
    # print the name of the regiment
    print(name)
    # print the data of that regiment
    print(group)

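name and group are used directly. They are not functions, just loop variables: iterating over a GroupBy object yields (name, group) pairs, where name is the group's key and group is the sub-DataFrame holding that group's rows.
This is the most direct way. The alternatives are to fetch a single group with get_group(), or to look up the row labels of each group through the groups attribute.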
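A small sketch of what iterating over a GroupBy object yields. The regiment frame and its score column are made-up sample data.

```python
import pandas as pd

# Hypothetical stand-in for the regiment dataset.
regiment = pd.DataFrame({'regiment': ['A', 'A', 'B'], 'score': [1, 2, 3]})
grouped = regiment.groupby('regiment')

# Each iteration yields the group key and the rows belonging to that group.
names = [name for name, group in grouped]
sizes = {name: len(group) for name, group in grouped}

# A single group can also be fetched by its key.
group_a = grouped.get_group('A')

print(names, sizes)
print(group_a)
```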

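7. Comparing agg, transform, apply, applymap, and map
(1) agg performs aggregations, accepts built-in functions (mean, sum, and so on) as well as lambda and user-defined functions, and allows different columns to get different operations;
(2) transform accepts lambda functions, built-in functions (mean, sum, and so on), and user-defined functions, and its result always has the same length as the input;
(3) apply accepts lambda, built-in, and user-defined functions and works one Series at a time (on a single Series it can compute a value for each element; on a DataFrame each column or row is passed in as a Series, so the function can be an aggregation such as min or max);
(4) applymap operates on every element of a DataFrame;
(5) map operates on every element of a Series.
For apply, applymap, and map, see https://blog.csdn.net/u010814042/article/details/76401133

Summary: for built-in functions, agg and transform are the natural choices; for user-defined functions, apply, applymap, and map each have their uses. To apply different user-defined functions to different columns, pass agg a dict that maps column names to functions.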
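The differences are easier to see on a tiny example. The frame below is made up; applymap is left out because newer pandas releases deprecate that name in favor of DataFrame.map.

```python
import pandas as pd

# Hypothetical data: one grouping column and two numeric columns.
df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3], 'y': [10, 20, 30]})

# agg: one result per group, with a different function for each column.
summary = df.groupby('g').agg({'x': 'sum', 'y': lambda s: s.max() - s.min()})

# transform: the result has the same length as the input.
centered = df.groupby('g')['x'].transform(lambda s: s - s.mean())

# apply on a DataFrame: each column is passed to the function as a Series.
col_range = df[['x', 'y']].apply(lambda s: s.max() - s.min())

# map on a Series: the function is called on each element.
doubled = df['x'].map(lambda v: v * 2)

print(summary)
```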

8. Converting to the datetime64 type
https://blog.csdn.net/qq_36523839/article/details/79746977
When using pd.to_datetime, remember to match the date format of the original data:
· 1/17/07 has the format "%m/%d/%y"
· 17-1-2007 has the format "%d-%m-%Y"
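A minimal sketch of the two formats listed above. The second date in each Series is an invented extra sample.

```python
import pandas as pd

# Month/day/two-digit year, as in 1/17/07.
us_style = pd.to_datetime(pd.Series(['1/17/07', '12/3/08']), format='%m/%d/%y')

# Day-month-four-digit year, as in 17-1-2007.
eu_style = pd.to_datetime(pd.Series(['17-1-2007', '3-12-2008']), format='%d-%m-%Y')

print(us_style)
print(eu_style)
```

Both Series end up with a datetime64 dtype and hold the same two dates, which shows that the format string, not the appearance of the text, decides which number is read as the month.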

9. To generate random integers, use np.random.randint()
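For example, a sketch that simulates dice rolls (the seed and the shape are arbitrary choices made for illustration):

```python
import numpy as np

np.random.seed(0)  # seeded only so that the sketch is repeatable

# low is inclusive and high is exclusive, so this draws integers from 1 to 6.
ints = np.random.randint(1, 7, size=(2, 5))

print(ints)
```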

10. The difference between np.random.rand and np.random.random
https://blog.csdn.net/xia_ri_xing/article/details/82949004
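There is no real difference: both draw uniform samples from [0, 1). They differ only in whether the shape is passed as a tuple (random) or as separate integer arguments (rand).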
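The two calling conventions side by side, with an arbitrary 2x3 shape:

```python
import numpy as np

a = np.random.rand(2, 3)       # dimensions passed as separate arguments
b = np.random.random((2, 3))   # dimensions passed as a single tuple

print(a.shape, b.shape)
```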

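11. concat, union, append, and merge
MySQL's UNION ALL and DataFrame.append are similar: both stack tables into more rows. (Plain UNION also removes duplicate rows, and DataFrame.append was removed in pandas 2.0 in favor of pd.concat.)
pd.concat can both stack rows and extend columns, so overall it is the most convenient.
pd.concat compared with merge: merge is a true table join, whereas concat only glues tables together and cannot join on the values of a column. When stacking rows, the join parameter of concat only decides whether the result keeps all of the original columns (outer) or only the columns present in every table (inner). So join in concat is about columns, and join in merge is about matching rows.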
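A small sketch that contrasts the two. The left and right frames are made up for illustration.

```python
import pandas as pd

# Hypothetical tables that share only the 'key' column.
left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})

# concat along rows: join decides which columns survive.
rows_outer = pd.concat([left, right], ignore_index=True)                 # key, x, y with NaN gaps
rows_inner = pd.concat([left, right], join='inner', ignore_index=True)   # only key

# concat along columns: the frames are placed side by side, aligned on the row index.
side_by_side = pd.concat([left, right], axis=1)

# merge: rows are matched on the values of the key column.
merged = pd.merge(left, right, on='key', how='inner')

print(rows_outer)
print(merged)
```

Here merged contains a single row (key b), because that is the only key value present in both tables, while none of the concat results look at the key values at all.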

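12. Working with multi-level indexes on a DataFrame or Series
A Series with a multi-level index can be awkward to extract values from. One option is to convert it to a DataFrame with to_frame, then call reset_index to turn the index levels into columns, and then select by column name;
To give a DataFrame multi-level column labels, one method is dataframe.columns=pd.MultiIndex.from_product().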
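Both directions can be sketched on made-up data. The column names grp, kind, and val and the level labels A/B and one/two are assumptions chosen for illustration.

```python
import pandas as pd

# Flattening: a grouped Series with a two-level index becomes ordinary columns.
df = pd.DataFrame({'grp': ['a', 'a', 'b'], 'kind': ['x', 'y', 'x'], 'val': [1, 2, 3]})
s = df.groupby(['grp', 'kind'])['val'].sum()
flat = s.to_frame().reset_index()   # columns: grp, kind, val

# Building: assign two-level column labels from the product of two lists.
wide = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]])
wide.columns = pd.MultiIndex.from_product([['A', 'B'], ['one', 'two']])

print(flat)
print(wide)
```

from_product pairs every label of the first list with every label of the second, so the number of pairs has to equal the number of columns (2 x 2 = 4 here).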