pandas 中的merge是一種功能比較強大的用于兩個DataFrame或者Series進行合并的方法.
合并時會將所有的列進行合并,但是指定鍵值不存在行列會填充NaN.
直接復制官方文檔 :
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
常用參數解釋:
-
right : DataFrame or named Series:
- 當使用pandas.merge()時,right處實際填入兩個待合并的結構;當使用dataframe.merge()時,right處僅填入一個待合并的結構,此處的right與dataframe分別作為右/左結構.
-
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’:
- how指定了結構的融合的類型,是一個關于key的重要參數
- 默認inner,即采用交叉部分的key作為列的內容
- left: 即選取左側結構的key作為列的內容
- right: 即選取左側結構的key作為列的內容
- outer: 選取所有的鍵作為列內容.
- 不存在的內容用NaN填充
-
on : label or list
Column or index level names to join on. These must be found in both DataFrames.- 參數on指定了用于合并的鍵key.
- 參數on指定的鍵必須是兩個結構中共有的.
-
indicator : bool or str, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row.- indicator 用于指示說明該行所用的鍵來自于哪一邊結構.
-
left_index : bool, default False
Use the index from the left DataFrame as the join key(s).- left_index 設定為True, 即根據左側結構的index進行merge. 而不再是根據某一columns.
-
right_index : bool, default False
Use the index from the right DataFrame as the join key. Same caveats as left_index.- right_index 與 left_index同時使用.即根據兩個結構的index進行merge.
- suffixes : tuple of (str, str), default (‘_x’, ‘_y’)
Suffix to apply to overlapping column names in the left and right side, respectively. To raise an exception on overlapping columns use (False, False).- suffixes 主要用于解決兩個合并結構的列存在交叉的情況.
- 通過suffixes 的指定,名字相同可以在merge后使用不同的列名,并同時存在.
代碼:
In:df1 = pd.DataFrame({'A':['A0','A1','A2'],'B':['B0','B1','B2']},index=['KO','K1','K2'])
df2 = pd.DataFrame({'C':['C0','C2','C3'],'D':['D0','D2','D3']},index=['KO','K1','K2'])
In: df1
Out:
A B
KO A0 B0
K1 A1 B1
K2 A2 B2
In: df1
Out:
C D
KO C0 D0
K1 C2 D2
K2 C3 D3
# 打開left_index 和 right_index ,how='inner'即根據行進行merge, 合并的類型是采用交叉部分(index部分的交叉)進行合并.
In: res = pd.merge(df1,df2,left_index=True,right_index=True,how='inner')
In: res
Out:
A B C D
KO A0 B0 C0 D0
K1 A1 B1 C2 D2
K2 A2 B2 C3 D3
In: df1 = pd.DataFrame({'lkey1':['foo','bar','baz','foo'],'value':[1,2,3,4],'rkey':['ab','bc','cd','ef']})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foooo'],'value': [4, 6, 7, 8],'xxx':[1,23,4,5]})
In: df1
Out:
lkey1 value rkey
0 foo 1 ab
1 bar 2 bc
2 baz 3 cd
3 foo 4 ef
In: df2
Out:
rkey value xxx
0 foo 4 1
1 bar 6 23
2 baz 7 4
3 foooo 8 5
# 基于value鍵,使用left類型進行合并.合并結果中的value只采用df1中value值,對于df2中不存在value值對應行的情況直接填充NaN(例如value1/2/3)
In: pd.merge(df1,df2,how='left',on = 'value')
Out:
lkey1 value rkey_x rkey_y xxx
0 foo 1 ab NaN NaN
1 bar 2 bc NaN NaN
2 baz 3 cd NaN NaN
3 foo 4 ef foo 1.0
注意到在上一段代碼的運行結果中,重疊部分即rkey列,在融合后自動添加了x和y后綴,避免了重疊.這里也可以利用suffixes對后綴進行指定
In: boys = pd.DataFrame({'k':['K0','K1','K2'],'age':['1','2','3'],'name':['b1','b2','b3']})
girls = pd.DataFrame({'k':['K0','K0','K3'],'age':['4','5','6'],'name':['g1','g2','g3']})
In: boys
Out:
k age name
0 K0 1 b1
1 K1 2 b2
2 K2 3 b3
In: girls
Out:
k age name
0 K0 4 g1
1 K0 5 g2
2 K3 6 g3
# 針對name和age列存在重疊情況. 使用suffixes指明了后綴
In: pd.merge(boys,girls,suffixes=['_boys','_girls'],on = 'k',how= 'inner')
Out:
k age_boys name_boys age_girls name_girls
0 K0 1 b1 4 g1
1 K0 1 b1 5 g2
In: pd.merge(boys,girls,suffixes=['_boys','_girls'],on = ['k','age'] ,how= 'left')
Out:
k age name_boys name_girls
0 K0 1 b1 NaN
1 K1 2 b2 NaN
2 K2 3 b3 NaN