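WeChat official account: 尤而小屋
Author: Peter
Editor: Peter
Hello everyone, I'm Peter~
This article introduces a very useful function in the Pandas library: assign
When processing data, we sometimes need to compute a new column from an existing one for later use; in other words, we derive a new column from columns we already have. The assign function is very convenient for this. The examples below show how to use it.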
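Pandas articles
This is the 21st article in the Pandas series, which falls into 3 categories:
Basics: articles 1-16, covering basic and commonly used Pandas operations such as data creation, retrieval and querying, ranking and sorting, and handling missing or duplicate values
Advanced: from article 17 onward, covering advanced Pandas operations
Pandas versus SQL: learning Pandas by comparing its operations side by side with SQL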
Parameters
The assign function takes a single parameter: DataFrame.assign(**kwargs).
**kwargs: dict of {str: callable or Series}
A few notes on the parameter:
- The keywords are the column names
- If a value is callable, it is computed on the DataFrame and the result is assigned to the new column
- If a value is not callable (for example a Series, a scalar, or an array), it is assigned directly
Finally, the function returns a new DataFrame that contains all existing columns plus the newly generated ones
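The parameter notes above can be sketched in a single call. This is a minimal illustration, assuming a small hypothetical frame named demo and made-up column names (from_callable, from_series, from_scalar, from_array) that are not part of the original article:

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame, mirroring the shape of the data used in this article
demo = pd.DataFrame({"col1": [12, 16, 18], "col2": ["xiaoming", "peter", "mike"]})

result = demo.assign(
    from_callable=lambda d: d["col1"] * 2,  # callable: called with the DataFrame
    from_series=demo["col1"] + 1,           # Series: assigned directly
    from_scalar=0,                          # scalar: repeated on every row
    from_array=np.array([1.5, 2.5, 3.5]),   # array: length must match the row count
)

print(result)
print(list(demo.columns))  # the original frame still has only col1 and col2
```

Each keyword becomes a column name in the returned frame, while demo itself keeps its two original columns.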
Importing libraries
import pandas as pd
import numpy as np
# Sample data
df = pd.DataFrame({
    "col1": [12, 16, 18],
    "col2": ["xiaoming", "peter", "mike"]})
df
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>col1</th>
<th>col2</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>12</td>
<td>xiaoming</td>
</tr>
<tr>
<th>1</th>
<td>16</td>
<td>peter</td>
</tr>
<tr>
<th>2</th>
<td>18</td>
<td>mike</td>
</tr>
</tbody>
</table>
</div>
Examples
When the value is callable, it is computed directly on the DataFrame:
Method 1: calling on the DataFrame directly
# Method 1: call on the DataFrame df
# Use the col1 attribute of df to generate col3
df.assign(col3=lambda x: x.col1 / 2 + 20)
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>col1</th>
<th>col2</th>
<th>col3</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>12</td>
<td>xiaoming</td>
<td>26.0</td>
</tr>
<tr>
<th>1</th>
<td>16</td>
<td>peter</td>
<td>28.0</td>
</tr>
<tr>
<th>2</th>
<td>18</td>
<td>mike</td>
<td>29.0</td>
</tr>
</tbody>
</table>
</div>
If we look at the original df, we see that it is unchanged
df  # the original DataFrame is unchanged
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>col1</th>
<th>col2</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>12</td>
<td>xiaoming</td>
</tr>
<tr>
<th>1</th>
<td>16</td>
<td>peter</td>
</tr>
<tr>
<th>2</th>
<td>18</td>
<td>mike</td>
</tr>
</tbody>
</table>
</div>
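Because the original frame is left untouched, the result has to be bound to a name if the new column is needed later. A minimal sketch, assuming hypothetical variable names df_demo and df_with_col3:

```python
import pandas as pd

df_demo = pd.DataFrame({"col1": [12, 16, 18], "col2": ["xiaoming", "peter", "mike"]})

# assign returns a new DataFrame, so bind it to a name to keep the new column
df_with_col3 = df_demo.assign(col3=lambda x: x.col1 / 2 + 20)

print("col3" in df_demo.columns)       # False: the original is untouched
print("col3" in df_with_col3.columns)  # True: the new frame has the column
print(df_with_col3 is df_demo)         # False: they are distinct objects
```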
Operating on string data:
df.assign(col3=df["col2"].str.upper())
Method 2: passing a Series
The same behavior can be achieved by referencing an existing Series or sequence directly:
# Method 2: compute from an existing Series
df.assign(col4=df["col1"] * 3 / 4 + 25)
df  # the original data is unchanged
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>col1</th>
<th>col2</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>12</td>
<td>xiaoming</td>
</tr>
<tr>
<th>1</th>
<td>16</td>
<td>peter</td>
</tr>
<tr>
<th>2</th>
<td>18</td>
<td>mike</td>
</tr>
</tbody>
</table>
</div>
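One detail worth keeping in mind with Method 2: a Series is matched to the DataFrame by index label, while a plain array is matched by position. The sketch below illustrates this with a hypothetical Series named shuffled whose index is deliberately reversed; the names are assumptions for illustration only:

```python
import pandas as pd

df_demo = pd.DataFrame({"col1": [12, 16, 18]})  # default index 0, 1, 2

# A Series whose index is in a different order than df_demo's
shuffled = pd.Series([100, 200, 300], index=[2, 1, 0])

aligned = df_demo.assign(
    by_label=shuffled,                # Series: matched on index labels
    by_position=shuffled.to_numpy(),  # plain array: matched by position
)
print(aligned)
```

When the Series comes from the same DataFrame, as in the article's example, the two behaviors coincide because the indexes are identical.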
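In Python 3.6+, we can create multiple columns in the same assign call, and one column can depend on another column defined earlier in that call. In other words, the intermediate new columns can be used right away. The dependent columns must be passed as callables, because the callable receives the intermediate DataFrame that already contains the earlier new columns:
df.assign(
    col5=lambda x: x["col1"] / 2 + 10,
    col6=lambda x: x["col5"] * 5,  # col6 uses col5 directly
    col7=lambda x: x.col2.str.upper(),
    col8=lambda x: x.col7.str.title()  # col8 uses col7
)
df  # the original data is unchanged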
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>col1</th>
<th>col2</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>12</td>
<td>xiaoming</td>
</tr>
<tr>
<th>1</th>
<td>16</td>
<td>peter</td>
</tr>
<tr>
<th>2</th>
<td>18</td>
<td>mike</td>
</tr>
</tbody>
</table>
</div>
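To see why the dependent columns are written as lambdas, the following sketch contrasts a callable with a direct reference to the original frame. The names df_demo, ok, and failed are hypothetical:

```python
import pandas as pd

df_demo = pd.DataFrame({"col1": [12, 16, 18]})

# Works: the lambda receives the intermediate frame, which already has col5
ok = df_demo.assign(
    col5=lambda x: x["col1"] / 2 + 10,
    col6=lambda x: x["col5"] * 5,
)

# Fails: df_demo["col5"] is evaluated before assign runs, and df_demo has no col5
try:
    df_demo.assign(
        col5=lambda x: x["col1"] / 2 + 10,
        col6=df_demo["col5"] * 5,
    )
    failed = False
except KeyError:
    failed = True

print(ok)
print(failed)
```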
If we assign to a column that already exists, its values are overwritten in the returned DataFrame:
df.assign(col1=df["col1"] / 2)  # col1 is overwritten in the result
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>col1</th>
<th>col2</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>6.0</td>
<td>xiaoming</td>
</tr>
<tr>
<th>1</th>
<td>8.0</td>
<td>peter</td>
</tr>
<tr>
<th>2</th>
<td>9.0</td>
<td>mike</td>
</tr>
</tbody>
</table>
</div>
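The overwrite only affects the DataFrame that assign returns. A short sketch, with hypothetical names df_demo and halved, makes that visible:

```python
import pandas as pd

df_demo = pd.DataFrame({"col1": [12, 16, 18], "col2": ["xiaoming", "peter", "mike"]})

halved = df_demo.assign(col1=df_demo["col1"] / 2)

print(halved["col1"].tolist())   # values in the returned frame
print(df_demo["col1"].tolist())  # the original column is still intact
print(list(halved.columns))      # the overwritten column keeps its position
```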
Comparison with apply
In pandas, we can get the same result with the apply function
df  # the original data
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>col1</th>
<th>col2</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>12</td>
<td>xiaoming</td>
</tr>
<tr>
<th>1</th>
<td>16</td>
<td>peter</td>
</tr>
<tr>
<th>2</th>
<td>18</td>
<td>mike</td>
</tr>
</tbody>
</table>
</div>
Make copies so that we can work on them directly:
df1 = df.copy()  # make a copy and operate on the copy
df2 = df.copy()
df1
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>col1</th>
<th>col2</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>12</td>
<td>xiaoming</td>
</tr>
<tr>
<th>1</th>
<td>16</td>
<td>peter</td>
</tr>
<tr>
<th>2</th>
<td>18</td>
<td>mike</td>
</tr>
</tbody>
</table>
</div>
df1.assign(col3=lambda x: x.col1 / 2 + 20)
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>col1</th>
<th>col2</th>
<th>col3</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>12</td>
<td>xiaoming</td>
<td>26.0</td>
</tr>
<tr>
<th>1</th>
<td>16</td>
<td>peter</td>
<td>28.0</td>
</tr>
<tr>
<th>2</th>
<td>18</td>
<td>mike</td>
<td>29.0</td>
</tr>
</tbody>
</table>
</div>
df1  # df1 remains unchanged
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>col1</th>
<th>col2</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>12</td>
<td>xiaoming</td>
</tr>
<tr>
<th>1</th>
<td>16</td>
<td>peter</td>
</tr>
<tr>
<th>2</th>
<td>18</td>
<td>mike</td>
</tr>
</tbody>
</table>
</div>
df1["col3"] = df1["col1"].apply(lambda x:x / 2 + 20)
df1  # df1 has changed
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>col1</th>
<th>col2</th>
<th>col3</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>12</td>
<td>xiaoming</td>
<td>26.0</td>
</tr>
<tr>
<th>1</th>
<td>16</td>
<td>peter</td>
<td>28.0</td>
</tr>
<tr>
<th>2</th>
<td>18</td>
<td>mike</td>
<td>29.0</td>
</tr>
</tbody>
</table>
</div>
We find that the original data is unchanged after assign, whereas the apply approach, which writes its result back with df1["col3"] = ..., has modified df1
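It may help to separate the two steps in the apply approach. In the sketch below (hypothetical names df_demo, col3, combined, cols_after_apply), Series.apply on its own returns a new Series, and only the bracket assignment writes into the existing frame:

```python
import pandas as pd

df_demo = pd.DataFrame({"col1": [12, 16, 18], "col2": ["xiaoming", "peter", "mike"]})

# Series.apply returns a new Series; df_demo is not modified at this point
col3 = df_demo["col1"].apply(lambda v: v / 2 + 20)
cols_after_apply = list(df_demo.columns)
print(cols_after_apply)

# Passing the apply result to assign keeps the whole operation non-mutating
combined = df_demo.assign(col3=df_demo["col1"].apply(lambda v: v / 2 + 20))
print(list(combined.columns))

# Only the bracket assignment writes into the existing frame
df_demo["col3"] = col3
print(list(df_demo.columns))
```

So the two functions can also be combined: apply computes the values, and assign decides whether they land in a new frame.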
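BMI
Finally, let's simulate another dataset and compute each person's BMI.
Body mass index (BMI) is a standard commonly used internationally to gauge how lean or overweight a person is and whether they are healthy.
BMI = weight / height², where weight is in kg and height is in m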
df2 = pd.DataFrame({
"name":["xiaoming","xiaohong","xiaosu"],
"weight":[78,65,87],
"height":[1.82,1.75,1.89]
})
df2
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>weight</th>
<th>height</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiaoming</td>
<td>78</td>
<td>1.82</td>
</tr>
<tr>
<th>1</th>
<td>xiaohong</td>
<td>65</td>
<td>1.75</td>
</tr>
<tr>
<th>2</th>
<td>xiaosu</td>
<td>87</td>
<td>1.89</td>
</tr>
</tbody>
</table>
</div>
# Using the assign function
df2.assign(BMI=df2["weight"] / (df2["height"] ** 2))
df2  # unchanged
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>weight</th>
<th>height</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiaoming</td>
<td>78</td>
<td>1.82</td>
</tr>
<tr>
<th>1</th>
<td>xiaohong</td>
<td>65</td>
<td>1.75</td>
</tr>
<tr>
<th>2</th>
<td>xiaosu</td>
<td>87</td>
<td>1.89</td>
</tr>
</tbody>
</table>
</div>
df2["BMI"] = df2["weight"] / (df2["height"] ** 2)
df2  # df2 now has a new column: BMI
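Because assign returns a new DataFrame, it also fits into a method chain. The following is a sketch under assumed names (people, report, BMI_rounded are hypothetical), reusing the same sample values and the BMI = weight / height² formula:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["xiaoming", "xiaohong", "xiaosu"],
    "weight": [78, 65, 87],        # kg
    "height": [1.82, 1.75, 1.89],  # m
})

report = (
    people
    .assign(BMI=lambda d: d["weight"] / d["height"] ** 2)  # BMI = weight / height^2
    .assign(BMI_rounded=lambda d: d["BMI"].round(1))       # one decimal for display
    .sort_values("BMI", ascending=False)                   # highest BMI first
)
print(report)
print(list(people.columns))  # people still has only name, weight, height
```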
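Summary
From the examples above, we find:
- The DataFrame returned by assign is a new one; the original data is not changed
- assign can create several columns in one call, and columns created earlier in the call can be used right away
- The main difference between the assign and apply approaches shown here: assign leaves the original data untouched, whereas writing the apply result back with df["col"] = ... adds the new column to the original data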