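WeChat official account: 尤而小屋
Author: Peter
Editor: Peter
Hi everyone, I'm Peter~
This article shows how to use three string-related functions in Pandas to filter text data that meets a given condition:
- contains: the string contains a given substring
- startswith: the string starts with a given substring
- endswith: the string ends with a given substring
Mock data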
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name":["xiao ming","Xiao zhang",np.nan,"sun quan","guan yu"],
"age":["22","19","20","34","39"],
"sex":["male","Female","female","Female","male"],
"address":["廣東省深圳市","浙江省杭州市","江蘇省蘇州市","福建省泉州市","廣東省廣州市"]
})
df
df.dtypes  # check the column dtypes
name object
age object
sex object
address object
dtype: object
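The mock data has four notable characteristics:
- name: contains a missing value (np.nan), and Xiao and xiao differ in case
- age: an age column that should normally be numeric, but the mock data stores it as strings (object dtype)
- sex: also mixes uppercase F and lowercase f
- address: written normally
Data type conversion
We convert the age column from string to numeric: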
df["age"] = df["age"].astype(float)
df
The resulting data looks almost identical to the original, but checking the column dtypes shows the difference:
df.dtypes
name object
age float64
sex object
address object
dtype: object
The age column has been converted to the float64 numeric type.
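astype(float) assumes every entry is a valid number. As a hedged alternative that is not part of the walkthrough above, pd.to_numeric with errors="coerce" turns unparseable entries into NaN instead of raising. The sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical age values stored as strings; "unknown" is not a valid number
ages = pd.Series(["22", "19", "unknown", "34"], dtype=object)

# errors="coerce" replaces unparseable strings with NaN instead of raising
ages_numeric = pd.to_numeric(ages, errors="coerce")

print(ages_numeric.dtype)
print(ages_numeric.tolist())
```

Whether silently coercing bad values is acceptable depends on the data; astype(float) is the stricter choice because it fails loudly.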
contains
contains is a function for Series data; its basic syntax is:
Series.str.contains(
pat,
case=True,
flags=0,
na=None,
regex=True
)
- pat: the string or regular expression to search for
- case: whether the match is case-sensitive
- flags: regular expression flags, e.g. re.IGNORECASE to ignore case
- na: optional scalar; the value used for missing entries in the original data. If not specified, the default depends on the dtype: numpy.nan for object-dtype, pandas.NA for StringDtype
- regex: boolean; True treats pat as a regular expression, False treats it as a literal string
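To make the regex parameter concrete, here is a minimal sketch contrasting regex=True with regex=False on a pattern that contains a regex metacharacter. The file names are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical file names used only for illustration
files = pd.Series(["a.txt", "abtxt", "notes.md"], dtype=object)

# regex=True (the default): "." matches any character, so "abtxt" also matches
as_regex = files.str.contains(".txt")

# regex=False: ".txt" is matched literally, so only "a.txt" matches
as_literal = files.str.contains(".txt", regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```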
Default behavior
# Example 1: find the rows whose name contains "xiao"
df["name"].str.contains("xiao")
0 True
1 False
2 NaN
3 False
4 False
Name: name, dtype: object
When the column contains missing values, pass the na parameter:
Handling missing values
# Example 2: using the na parameter
df[df["name"].str.contains("xiao",na=False)]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
</tbody>
</table>
</div>
Without it, the mask itself contains NaN and using it for boolean indexing raises an error:
df[df["name"].str.contains("xiao")]
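The failure can be reproduced in isolation. The sketch below uses a small stand-in frame (not the article's df) and assumes an object-dtype column; in the pandas versions I have used, masking with NaN raises a ValueError, so that is the exception caught here.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with one missing name (object dtype, as in the mock data)
people = pd.DataFrame(
    {"name": pd.Series(["xiao ming", np.nan, "sun quan"], dtype=object)}
)

# Without na=..., the missing name produces NaN in the mask
mask = people["name"].str.contains("xiao")

try:
    people[mask]
except ValueError as exc:
    print("masking failed:", exc)

# na=False fills the missing position with False, giving a clean boolean mask
safe_mask = people["name"].str.contains("xiao", na=False)
matched = people[safe_mask]
print(matched["name"].tolist())
```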
Ignoring case
# Example 3: using the case parameter
df["name"].str.contains("xiao",case=False)
0 True
1 True
2 NaN
3 False
4 False
Name: name, dtype: object
The result above ignores case, so two True values appear: both xiao and Xiao are matched and can be filtered out together:
df[df["name"].str.contains("xiao",case=False, na=False)]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
</tbody>
</table>
</div>
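The flags parameter offers another route to the same result. As a minimal sketch on a stand-in Series, passing flags=re.IGNORECASE should select the same rows as case=False:

```python
import re

import numpy as np
import pandas as pd

names = pd.Series(["xiao ming", "Xiao zhang", np.nan, "sun quan"], dtype=object)

# Two ways to ignore case: the case parameter, or a regex flag
by_case = names.str.contains("xiao", case=False, na=False)
by_flags = names.str.contains("xiao", flags=re.IGNORECASE, na=False)

print(by_case.tolist())
print(by_flags.tolist())
```

case=False is shorter; flags is mainly useful when other regex flags are needed as well.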
Ignoring case and missing values
# Example 4: ignore case and missing values
df[df["sex"].str.contains("f",case=False, na=False)]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江蘇省蘇州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
</tbody>
</table>
</div>
Using regular expressions
# Example 5: using a regular expression
df["address"].str.contains("^廣")
0 True
1 False
2 False
3 False
4 True
Name: address, dtype: bool
Here ^ anchors the match at the start of the string, i.e. it selects the rows whose address starts with 廣:
df[df["address"].str.contains("^廣")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>廣東省廣州市</td>
</tr>
</tbody>
</table>
</div>
In a regular expression, $ anchors the match at the end of the string; the following selects the rows whose address ends with 市:
df[df["address"].str.contains("市$")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江蘇省蘇州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>廣東省廣州市</td>
</tr>
</tbody>
</table>
</div>
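In the regular expression below, the character class [深蘇泉] matches any one of 深, 蘇, or 泉, so every row whose address contains at least one of these characters is selected: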
df[df["address"].str.contains("[深蘇泉]")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江蘇省蘇州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
</tbody>
</table>
</div>
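A character class only handles single characters. For whole keywords, one option is an alternation pattern joined with "|". The sketch below assumes a hypothetical keyword list and uses re.escape so that any regex metacharacters in a keyword are matched literally.

```python
import re

import pandas as pd

addresses = pd.Series(
    ["廣東省深圳市", "浙江省杭州市", "江蘇省蘇州市", "福建省泉州市", "廣東省廣州市"]
)

# Hypothetical keyword list; re.escape guards against regex metacharacters
keywords = ["深圳", "蘇州", "泉州"]
pattern = "|".join(re.escape(word) for word in keywords)

# Keep the rows whose address contains at least one keyword
matched = addresses[addresses.str.contains(pattern)]
print(matched.tolist())
```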
startswith
The syntax of startswith is simpler:
Series.str.startswith(pat, na=None)
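- pat: the prefix string to test for; note that regular expressions are not accepted
- na: how missing values are handled; na=False treats missing values as non-matches (False)
The pat parameter
Pass a string; regular expressions are not accepted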
df["address"].str.startswith("廣")
0 True
1 False
2 False
3 False
4 True
Name: address, dtype: bool
df[df["address"].str.startswith("廣")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>廣東省廣州市</td>
</tr>
</tbody>
</table>
</div>
This gives the same result as the regular expression for "starts with a given string":
df[df["address"].str.contains("^廣")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>廣東省廣州市</td>
</tr>
</tbody>
</table>
</div>
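The two forms differ once the prefix contains a regex metacharacter: startswith takes it literally, while contains needs it escaped. A minimal sketch with made-up labels, where the unescaped pattern "^(" would not even be a valid regular expression:

```python
import re

import pandas as pd

# Hypothetical labels; the prefix "(" is a regex metacharacter
labels = pd.Series(["(draft) report", "final report", "(old) notes"], dtype=object)
prefix = "("

# startswith treats the prefix as a literal string
literal_mask = labels.str.startswith(prefix)

# contains needs the prefix escaped before anchoring it with ^
regex_mask = labels.str.contains("^" + re.escape(prefix))

print(literal_mask.tolist())
print(regex_mask.tolist())
```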
Case sensitivity
startswith is always case-sensitive:
df[df["sex"].str.startswith("f")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江蘇省蘇州市</td>
</tr>
</tbody>
</table>
</div>
df[df["sex"].str.startswith("F")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
</tbody>
</table>
</div>
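startswith has no case parameter, so one common workaround is to normalize the case first. A minimal sketch, using the same sex values as the mock data:

```python
import pandas as pd

sex = pd.Series(["male", "Female", "female", "Female", "male"])

# Lowercase first, then test the prefix, so "Female" and "female" both match
mask = sex.str.lower().str.startswith("f")

print(sex[mask].tolist())
```

The original values are left untouched; only the temporary lowercase copy is used to build the mask.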
Handling missing values
df["name"].str.startswith("xiao")
0 True
1 False
2 NaN
3 False
4 False
Name: name, dtype: object
df[df["name"].str.startswith("xiao",na=False)]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
</tbody>
</table>
</div>
endswith
endswith tests whether each string ends with a given string; the syntax is:
Series.str.endswith(pat, na=None)
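- pat: the suffix string to test for; note that regular expressions are not accepted
- na: how missing values are handled; na=False treats missing values as non-matches (False)
The pat parameter
# ends with 市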
df[df["address"].str.endswith("市")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江蘇省蘇州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>廣東省廣州市</td>
</tr>
</tbody>
</table>
</div>
# the regular expression equivalent, using contains
df[df["address"].str.contains("市$")]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
<tr>
<th>2</th>
<td>NaN</td>
<td>20.0</td>
<td>female</td>
<td>江蘇省蘇州市</td>
</tr>
<tr>
<th>3</th>
<td>sun quan</td>
<td>34.0</td>
<td>Female</td>
<td>福建省泉州市</td>
</tr>
<tr>
<th>4</th>
<td>guan yu</td>
<td>39.0</td>
<td>male</td>
<td>廣東省廣州市</td>
</tr>
</tbody>
</table>
</div>
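Because startswith and endswith both return boolean masks, they can be combined with &, | and ~. A minimal sketch on the address values from the mock data; the particular prefix and suffix are chosen only for illustration.

```python
import pandas as pd

addresses = pd.Series(
    ["廣東省深圳市", "浙江省杭州市", "江蘇省蘇州市", "福建省泉州市", "廣東省廣州市"]
)

starts_guangdong = addresses.str.startswith("廣東省")
ends_zhou_city = addresses.str.endswith("州市")

# Both conditions: in 廣東省 and ending with 州市
both = addresses[starts_guangdong & ends_zhou_city]

# Ending with 州市 but not in 廣東省
others = addresses[ends_zhou_city & ~starts_guangdong]

print(both.tolist())
print(others.tolist())
```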
Handling missing values
df["name"].str.endswith("g")
0 True
1 True
2 NaN
3 False
4 False
Name: name, dtype: object
df[df["name"].str.endswith("g",na=False)]
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>age</th>
<th>sex</th>
<th>address</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiao ming</td>
<td>22.0</td>
<td>male</td>
<td>廣東省深圳市</td>
</tr>
<tr>
<th>1</th>
<td>Xiao zhang</td>
<td>19.0</td>
<td>Female</td>
<td>浙江省杭州市</td>
</tr>
</tbody>
</table>
</div>
# without the na parameter, an error is raised
df[df["name"].str.endswith("g")]
The cause of the error is clear: the name column contains a missing value. Passing the na parameter resolves it.
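If rows with a missing name are not needed at all, another option (not covered above, so treat it as a sketch) is to drop them first with dropna and then filter without na. The frame below is a small stand-in for the article's data.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(
    {"name": pd.Series(["xiao ming", "Xiao zhang", np.nan, "sun quan"], dtype=object)}
)

# Drop the rows whose name is missing, then filter on the remaining rows
known = people.dropna(subset=["name"])
result = known[known["name"].str.endswith("g")]

print(result["name"].tolist())
```

Unlike na=False, this approach discards the rows with missing names from every later step, so it only fits when those rows are genuinely unwanted.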