21 Days of pandas (2) - 10 Minutes to pandas, Part 2

merge


concat
pandas provides a number of set-style methods for combining data, roughly the equivalent of join/merge operations.

In [73]: df = pd.DataFrame(np.random.randn(10, 4))

In [74]: df
Out[74]: 
      0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495

# break it into pieces
In [75]: pieces = [df[:3], df[3:7], df[7:]]

In [76]: pd.concat(pieces)
Out[76]: 
      0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495
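
Concatenating along the rows is the default. As a quick sketch not taken from the original tutorial, the same frame can also be reassembled column-wise with axis=1:

pd.concat([df[[0, 1]], df[[2, 3]]], axis=1)   # glue the column blocks back side by side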

join
SQL-style merge. See the Database style joining section.

In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [79]: left
Out[79]: 
   key  lval
0  foo     1
1  foo     2

In [80]: right
Out[80]: 
   key  rval
0  foo     4
1  foo     5

In [81]: pd.merge(left, right, on='key')
Out[81]: 
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
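
Both frames above repeat the key 'foo', so the merge yields the Cartesian product of the matching rows. As a small sketch (these frames are made up for illustration), merging on distinct keys behaves like a plain SQL inner join:

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
pd.merge(left, right, on='key')
#    key  lval  rval
# 0  foo     1     4
# 1  bar     2     5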

append

In [82]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])

In [83]: df
Out[83]: 
          A         B         C         D
0  1.346061  1.511763  1.627081 -0.990582
1 -0.441652  1.211526  0.268520  0.024580
2 -1.577585  0.396823 -0.105381 -0.532532
3  1.453749  1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346  0.339969 -0.693205
5 -0.339355  0.593616  0.884345  1.591431
6  0.141809  0.220390  0.435589  0.192451
7 -0.096701  0.803351  1.715071 -0.708758

In [84]: s = df.iloc[3]

In [85]: df.append(s, ignore_index=True)
Out[85]: 
          A         B         C         D
0  1.346061  1.511763  1.627081 -0.990582
1 -0.441652  1.211526  0.268520  0.024580
2 -1.577585  0.396823 -0.105381 -0.532532
3  1.453749  1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346  0.339969 -0.693205
5 -0.339355  0.593616  0.884345  1.591431
6  0.141809  0.220390  0.435589  0.192451
7 -0.096701  0.803351  1.715071 -0.708758
8  1.453749  1.208843 -0.080952 -0.264610
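
Note for newer versions: DataFrame.append was deprecated and removed in pandas 2.0+. Assuming the df and s defined above, the equivalent spelling uses pd.concat:

pd.concat([df, s.to_frame().T], ignore_index=True)   # append the row held in s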

group
Grouping refers to the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure

See the Grouping section

In [86]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ....:                           'foo', 'bar', 'foo', 'foo'],
   ....:                    'B' : ['one', 'one', 'two', 'three',
   ....:                           'two', 'two', 'one', 'three'],
   ....:                    'C' : np.random.randn(8),
   ....:                    'D' : np.random.randn(8)})
   ....: 

In [87]: df
Out[87]: 
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033

Now group the data and apply sum() to each group:

In [88]: df.groupby('A').sum()
Out[88]: 
            C        D
A                     
bar -2.802588  2.42611
foo  3.146492 -0.63958


In [89]: df.groupby(['A','B']).sum()
Out[89]: 
                  C         D
A   B                        
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434
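
sum() is only one choice. As a brief sketch not in the original tutorial, different aggregations can be applied per column on the same df:

df.groupby('A').agg({'C': 'mean', 'D': 'sum'})   # mean of C, sum of D for each group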

The next two operations are less commonly used.

reshape

See the sections on Hierarchical Indexing and Reshaping.
stack

In [90]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
   ....:                      'foo', 'foo', 'qux', 'qux'],
   ....:                     ['one', 'two', 'one', 'two',
   ....:                      'one', 'two', 'one', 'two']]))
   ....: 

In [91]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [92]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

In [93]: df2 = df[:4]

In [94]: df2
Out[94]: 
                     A         B
first second                    
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230

The stack() method “compresses” a level in the DataFrame’s columns.

In [95]: stacked = df2.stack()

In [96]: stacked
Out[96]: 
first  second   
bar    one     A    0.029399
               B   -0.542108
       two     A    0.282696
               B   -0.087302
baz    one     A   -1.575170
               B    1.771208
       two     A    0.816482
               B    1.100230
dtype: float64

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

In [97]: stacked.unstack()
Out[97]: 
                     A         B
first second                    
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230


In [98]: stacked.unstack(1)
Out[98]: 
second        one       two
first                      
bar   A  0.029399  0.282696
      B -0.542108 -0.087302
baz   A -1.575170  0.816482
      B  1.771208  1.100230

In [99]: stacked.unstack(0)
Out[99]: 
first          bar       baz
second                      
one    A  0.029399 -1.575170
       B -0.542108  1.771208
two    A  0.282696  0.816482
       B -0.087302  1.100230

pivot table
See the section on Pivot Tables.

In [100]: df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
   .....:                    'B' : ['A', 'B', 'C'] * 4,
   .....:                    'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
   .....:                    'D' : np.random.randn(12),
   .....:                    'E' : np.random.randn(12)})
   .....: 

In [101]: df
Out[101]: 
        A  B    C         D         E
0     one  A  foo  1.418757 -0.179666
1     one  B  foo -1.879024  1.291836
2     two  C  foo  0.536826 -0.009614
3   three  A  bar  1.006160  0.392149
4     one  B  bar -0.029716  0.264599
5     one  C  bar -1.146178 -0.057409
6     two  A  foo  0.100900 -1.425638
7   three  B  foo -1.035018  1.024098
8     one  C  foo  0.314665 -0.106062
9     one  A  bar -0.773723  1.824375
10    two  B  bar -1.170653  0.595974
11  three  C  bar  0.648740  1.167115

We can produce pivot tables from this data very easily:

In [102]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[102]: 
C             bar       foo
A     B                    
one   A -0.773723  1.418757
      B -0.029716 -1.879024
      C -1.146178  0.314665
three A  1.006160       NaN
      B       NaN -1.035018
      C  0.648740       NaN
two   A       NaN  0.100900
      B -1.170653       NaN
      C       NaN  0.536826
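
The NaN entries mark (A, B) / C combinations with no underlying rows. pivot_table aggregates with the mean by default; as a hedged sketch, another aggregation can be requested via aggfunc:

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc='sum')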

time series

pandas provides strong support for frequency conversion and resampling, which is very common in the financial domain. See the Time Series section.

In [103]: rng = pd.date_range('1/1/2012', periods=100, freq='S')

In [104]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [105]: ts.resample('5Min', how='sum')
Out[105]: 
2012-01-01    25083
Freq: 5T, dtype: int32
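
Note for newer versions: the how= keyword shown above was later removed; the same resampling is written as a method call:

ts.resample('5Min').sum()   # equivalent modern spelling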




In [106]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')

In [107]: ts = pd.Series(np.random.randn(len(rng)), rng)

In [108]: ts
Out[108]: 
2012-03-06    0.464000
2012-03-07    0.227371
2012-03-08   -0.496922
2012-03-09    0.306389
2012-03-10   -2.290613
Freq: D, dtype: float64

In [109]: ts_utc = ts.tz_localize('UTC')

In [110]: ts_utc
Out[110]: 
2012-03-06 00:00:00+00:00    0.464000
2012-03-07 00:00:00+00:00    0.227371
2012-03-08 00:00:00+00:00   -0.496922
2012-03-09 00:00:00+00:00    0.306389
2012-03-10 00:00:00+00:00   -2.290613
Freq: D, dtype: float64




In [111]: ts_utc.tz_convert('US/Eastern')
Out[111]: 
2012-03-05 19:00:00-05:00    0.464000
2012-03-06 19:00:00-05:00    0.227371
2012-03-07 19:00:00-05:00   -0.496922
2012-03-08 19:00:00-05:00    0.306389
2012-03-09 19:00:00-05:00   -2.290613
Freq: D, dtype: float64


In [112]: rng = pd.date_range('1/1/2012', periods=5, freq='M')

In [113]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [114]: ts
Out[114]: 
2012-01-31   -1.134623
2012-02-29   -1.561819
2012-03-31   -0.260838
2012-04-30    0.281957
2012-05-31    1.523962
Freq: M, dtype: float64

In [115]: ps = ts.to_period()

In [116]: ps
Out[116]: 
2012-01   -1.134623
2012-02   -1.561819
2012-03   -0.260838
2012-04    0.281957
2012-05    1.523962
Freq: M, dtype: float64

In [117]: ps.to_timestamp()
Out[117]: 
2012-01-01   -1.134623
2012-02-01   -1.561819
2012-03-01   -0.260838
2012-04-01    0.281957
2012-05-01    1.523962
Freq: MS, dtype: float64

Converting between period and timestamp allows some convenient arithmetic. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:

In [118]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')

In [119]: ts = pd.Series(np.random.randn(len(prng)), prng)

In [120]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

In [121]: ts.head()
Out[121]: 
1990-03-01 09:00   -0.902937
1990-06-01 09:00    0.068159
1990-09-01 09:00   -0.057873
1990-12-01 09:00   -0.368204
1991-03-01 09:00   -1.144073
Freq: H, dtype: float64
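
A hedged breakdown of the index conversion above (the same expression as In [120], annotated):

ts.index = (prng.asfreq('M', 'e')    # each quarter -> the month in which it ends
            + 1                      # step to the month following the quarter end
            ).asfreq('H', 's') + 9   # start of that month, shifted to 09:00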

Categoricals

Since version 0.15, pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.

In [122]: df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

In [123]: df["grade"] = df["raw_grade"].astype("category")

In [124]: df["grade"]
Out[124]: 
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

# rename the categories to more meaningful names
In [125]: df["grade"].cat.categories = ["very good", "good", "very bad"]
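
Note for newer versions: assigning to .cat.categories directly was later deprecated and removed; in recent pandas the rename is spelled:

df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])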



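# reorder the categories and, at the same time, add the missing categories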
In [126]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])

In [127]: df["grade"]
Out[127]: 
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

In [128]: df.sort_values(by="grade")
Out[128]: 
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

In [129]: df.groupby("grade").size()
Out[129]: 
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
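
Note that grouping on a categorical column lists all declared categories, including the empty ones (bad and medium show a size of 0), and the rows follow the category order rather than alphabetical order.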

plot - at last, the important part: plotting!

Plotting docs.
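
These plotting examples assume matplotlib is imported, and that plt.show() is called when not running in an interactive session:

import matplotlib.pyplot as plt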

In [130]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

In [131]: ts = ts.cumsum()

In [132]: ts.plot()
Out[132]: <matplotlib.axes._subplots.AxesSubplot at 0xae3696ac>

The resulting plot is not shown here.

In [133]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
   .....:                   columns=['A', 'B', 'C', 'D'])
   .....: 

In [134]: df = df.cumsum()

In [135]: plt.figure(); df.plot(); plt.legend(loc='best')
Out[135]: <matplotlib.legend.Legend at 0xab53b26c>