在Pandas的實(shí)踐過(guò)程中,我們經(jīng)常需要將兩個(gè)DataFrame合并組合在一起再進(jìn)行處理措伐,比如將不同來(lái)源的數(shù)據(jù)合并在一起,或者將不同日期的DataFrame合并在一起。DataFrame的合并組合從方向上分昆烁,大體上分為兩種情況:橫向的,縱向的缎岗。(這個(gè)很容易理解吧)
看下如下的圖示(圖片來(lái)自Pandas官網(wǎng))
另外需要注意的是静尼,兩個(gè)DataFrame在合在一起的時(shí)候,如果針對(duì)重疊項(xiàng)(比如都有column B)會(huì)有兩種不同的處理方式传泊,一種是針對(duì)重疊項(xiàng)進(jìn)行合并處理(比如相加茅郎,或者直接取代);另一種是忽略重疊項(xiàng)或渤,只是簡(jiǎn)單的組合在一起系冗。前者我們稱(chēng)為合并,后者我們叫做組合薪鹦。
另外掌敬,在Pandas中有很多不同函數(shù)和不同用法惯豆,比如有concat, join, merge, append,它們各有不同的使用場(chǎng)景奔害。
縱向連接
In [45]: import pandas as pd
In [46]: df = pd.DataFrame({"name": ["zhangsan", "lisi", "wangwu"], "city": ["beijing"
...: , "beijing", "shenzhen"], "order": [12, 33, 67]})
In [47]: df
Out[47]:
city name order
0 beijing zhangsan 12
1 beijing lisi 33
2 shenzhen wangwu 67
In [48]: df2 = pd.DataFrame({"name": ["zhanghai", "liyang", "wangjing"], "city": ["sha
...: nghai", "shenzhen", "chengdu"], "order": [2, 3, 7]})
In [49]: df2
Out[49]:
city name order
0 shanghai zhanghai 2
1 shenzhen liyang 3
2 chengdu wangjing 7
# concat默認(rèn)按行拼接(即縱向連接楷兽,axis=0), 也可以按列來(lái)連接(axis=1)
# 注意concat拼接時(shí),其參數(shù)是一個(gè)數(shù)組 华临,因此可以拼接多個(gè)DataFrame
In [51]: pd.concat([df, df2])
Out[51]:
city name order
0 beijing zhangsan 12
1 beijing lisi 33
2 shenzhen wangwu 67
0 shanghai zhanghai 2
1 shenzhen liyang 3
2 chengdu wangjing 7
# append是concat的一種快捷方式芯杀,用于將兩個(gè)DataFrame直接拼接在一起
# append的參數(shù)即可以是一個(gè)DataFrame,也可以是一個(gè)數(shù)組雅潭,這種情況下可以合并多個(gè)DataFrame
In [52]: df.append(df2)
Out[52]:
city name order
0 beijing zhangsan 12
1 beijing lisi 33
2 shenzhen wangwu 67
0 shanghai zhanghai 2
1 shenzhen liyang 3
2 chengdu wangjing 7
橫向連接與合并
前面我們已經(jīng)知道concat不僅可以縱向連接揭厚,也可以橫向連接
In [8]: df3 = pd.concat([df, df2], join="inner", axis=1)
In [9]: df3
Out[9]:
city name order city name order
0 beijing zhangsan 12 shanghai zhanghai 2
1 beijing lisi 33 shenzhen liyang 3
2 shenzhen wangwu 67 chengdu wangjing 7
# 實(shí)際上,concat是通過(guò)index來(lái)作為拼接的依據(jù)扶供,什么意思呢筛圆,看下面的例子
# 我們重新assign index給df3
In [12]: df3 = pd.DataFrame({"name": ["zhanghai", "liyang", "wangjing"], "city": ["shanghai", "shenzhen", "chengdu"], "order": [2, 3, 7]}, index=[
...: 1, 3, 9])
# df3的row index變?yōu)榱?, 3椿浓, 9
In [13]: df3
Out[13]:
city name order
1 shanghai zhanghai 2
3 shenzhen liyang 3
9 chengdu wangjing 7
# 當(dāng)使用join="outer"的時(shí)候太援,取index的并集,index相同的行會(huì)放在同一行扳碍,而不同的行會(huì)分別列出
In [14]: df4 = pd.concat([df, df3], join="outer", axis=1)
In [15]: df4
Out[15]:
city name order city name order
0 beijing zhangsan 12.0 NaN NaN NaN
1 beijing lisi 33.0 shanghai zhanghai 2.0
2 shenzhen wangwu 67.0 NaN NaN NaN
3 NaN NaN NaN shenzhen liyang 3.0
9 NaN NaN NaN chengdu wangjing 7.0
# 當(dāng)join="inner"時(shí)提岔,我們?nèi)〗患琲ndex相同的行才會(huì)拼接在一起
In [16]: df5 = pd.concat([df, df3], join="inner", axis=1)
In [17]: df5
Out[17]:
city name order city name order
1 beijing lisi 33 shanghai zhanghai 2
我們看到concat仍然是一種拼接笋敞,其根據(jù)index進(jìn)行join碱蒙,而merge更加靈活,可以根據(jù)指定的column來(lái)進(jìn)行合并液样,如下:
In [18]: df
Out[18]:
city name order
0 beijing zhangsan 12
1 beijing lisi 33
2 shenzhen wangwu 67
In [19]: df2
Out[19]:
city name order
1 shanghai zhanghai 2
3 shenzhen liyang 3
9 chengdu wangjing 7
# 根據(jù)指定的列"city"進(jìn)行合并振亮,同時(shí)how="outer"意味著join的方式
# on也可以跟一個(gè)list,這樣就可以針對(duì)多個(gè)列進(jìn)行join
In [21]: df6 = df.merge(df2, on="city", how="outer")
# 注意同名的列(非on column)鞭莽,會(huì)默認(rèn)添加"_x", "_y"后綴
In [22]: df6
Out[22]:
city name_x order_x name_y order_y
0 beijing zhangsan 12.0 NaN NaN
1 beijing lisi 33.0 NaN NaN
2 shenzhen wangwu 67.0 liyang 3.0
3 shanghai NaN NaN zhanghai 2.0
4 chengdu NaN NaN wangjing 7.0
可以針對(duì)多個(gè)列進(jìn)行join坊秸,并重新命名后綴,如下:
In [31]: df2
Out[31]:
city name order
1 shanghai zhanghai 2
3 shenzhen liyang 67
9 chengdu wangjing 33
In [32]: df
Out[32]:
city name order
0 beijing zhangsan 12
1 beijing lisi 33
2 shenzhen wangwu 67
# 這里針對(duì)column "city" 和 "order"進(jìn)行join
# 同時(shí)我們可以通過(guò)suffixes來(lái)為同名的列賦予更有意義的名字
In [35]: df7 = df.merge(df2, on=["city", "order"], how="outer", suffixes=('_left', '_right'))
In [36]: df7
Out[36]:
city name_left order name_right
0 beijing zhangsan 12 NaN
1 beijing lisi 33 NaN
2 shenzhen wangwu 67 liyang
3 shanghai NaN 2 zhanghai
4 chengdu NaN 33 wangjing
關(guān)于merge還有幾個(gè)常用的參數(shù)說(shuō)明如下:
- left_index & right_index: 當(dāng)我們需要通過(guò)index來(lái)進(jìn)行join的時(shí)候(類(lèi)似concat)澎怒,則可以使用left_index 或者right_index.
- sort: 默認(rèn)為False褒搔,如果True則將join的key按照字典順序進(jìn)行排序,比如我們按照"city"進(jìn)行join的時(shí)候喷面,會(huì)按照"city"的字典順序進(jìn)行排序星瘾。但如果我們不需要排序,則可以將其置為False惧辈,以提高性能
-
validate: 主要針對(duì)duplicate的情況琳状,它有以下幾個(gè)參數(shù)可以設(shè)置
“one_to_one” or “1:1”: checks if merge keys are unique in both left and right datasets.
“one_to_many” or “1:m”: checks if merge keys are unique in left dataset.
“many_to_one” or “m:1”: checks if merge keys are unique in right dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.
更多關(guān)于merge的說(shuō)明參考如下鏈接:
關(guān)于merge的說(shuō)明
另外,join是merge的一種簡(jiǎn)便寫(xiě)法盒齿,其底層是通過(guò)merge來(lái)實(shí)現(xiàn)的念逞,如下兩種表達(dá)方式是相同的困食。
left.join(right, on=key_or_keys)
# 從這里可以看出join的時(shí)候是使用左邊的df的column key,而使用右邊df的index進(jìn)行join
pd.merge(left, right, left_on=key_or_keys, right_index=True,
how='left', sort=False)
示例:
# 示例來(lái)自官網(wǎng)
In [85]: left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
....: 'B': ['B0', 'B1', 'B2', 'B3'],
....: 'key': ['K0', 'K1', 'K0', 'K1']})
....:
In [86]: right = pd.DataFrame({'C': ['C0', 'C1'],
....: 'D': ['D0', 'D1']},
....: index=['K0', 'K1'])
....:
In [87]: result = left.join(right, on='key')
需要注意的是當(dāng)要join的兩個(gè)DataFrame有同名的列時(shí)翎承,必須指定suffix硕盹,否則會(huì)報(bào)錯(cuò),如下:
In [45]: df
Out[45]:
city name order
0 beijing zhangsan 12
1 beijing lisi 33
2 shenzhen wangwu 67
In [46]: df2
Out[46]:
city name order
1 shanghai zhanghai 2
3 shenzhen liyang 67
9 chengdu wangjing 33
In [47]: df.join(df2, lsuffix="_left", rsuffix="_right")
Out[47]:
city_left name_left order_left city_right name_right order_right
0 beijing zhangsan 12 NaN NaN NaN
1 beijing lisi 33 shanghai zhanghai 2.0
2 shenzhen wangwu 67 NaN NaN NaN