Python Basic Syntax - 4: Regular Expressions and Hands-on Projects

1 Regular expression symbols and special characters
2 Regular expression matching and grouping
3 The re library: compile, match, search, findall, sub, split, group, groups, groupdict

1. Regular expressions

There are three basic matching modes:

Simple matching
Multiple matches
Matching any character

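2. Using regular expressions

a*bc matches a zero or more times
a+bc matches a one or more times
a?bc matches a zero or one time
a{3}bc matches a exactly 3 times
a{2,5}bc matches a 2 to 5 times, preferring the largest number of repetitions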
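
The quantifiers above can be tried out with a short sketch. The helper name matches is hypothetical and only wraps re.fullmatch, so each pattern is checked against the whole string.

```python
import re

def matches(pattern, text):
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'a*bc', 'bc'))       # True: zero a's are allowed
print(matches(r'a+bc', 'bc'))       # False: at least one a is required
print(matches(r'a?bc', 'aabc'))     # False: at most one a is allowed
print(matches(r'a{3}bc', 'aaabc'))  # True: exactly three a's

# {2,5} is greedy, so it takes as many a's as it can, up to five.
greedy = re.match(r'a{2,5}', 'aaaaaaa').group()
print(greedy)                       # aaaaa
```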

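3. Character classes and boundary matching

Character classes:

\d a digit
\w a letter, digit, or underscore
\s a whitespace character (space, tab, newline)

Boundary matching:

^ start of the string
$ end of the string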
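
As a rough illustration of the character classes and anchors, the sketch below runs them against a made-up sample string (the text and variable names are assumptions for the example only).

```python
import re

text = 'room 42, floor_7'  # sample text, assumed for illustration

digits = re.findall(r'\d+', text)  # runs of digits
words = re.findall(r'\w+', text)   # runs of letters, digits, and underscores
spaces = re.findall(r'\s', text)   # individual whitespace characters

starts_with_room = re.search(r'^room', text) is not None  # ^ anchors at the start
ends_with_digit = re.search(r'\d$', text) is not None     # $ anchors at the end

print(digits)  # ['42', '7']
print(words)   # ['room', '42', 'floor_7']
print(starts_with_room, ends_with_digit)  # True True
```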

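4. Escaping and character sets

Use \ to escape special characters.
Character sets:

[a-zA-Z]{3} matches exactly 3 letters
[^abc]{2} matches 2 characters, neither of which is a, b, or c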
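
A small sketch of escaping and character sets follows. The sample strings are assumptions chosen to show the difference between an unescaped and an escaped dot.

```python
import re

sample = '1.5 and 1x5'  # sample text, assumed for illustration

# An unescaped dot matches any character; an escaped dot matches only '.'.
unescaped = re.findall(r'1.5', sample)
escaped = re.findall(r'1\.5', sample)

# A set with a quantifier: exactly three ASCII letters.
three_letters = re.fullmatch(r'[a-zA-Z]{3}', 'abc') is not None

# A negated set: two consecutive characters that are not a, b, or c.
not_abc = re.findall(r'[^abc]{2}', 'abxyab')

print(unescaped)  # ['1.5', '1x5']
print(escaped)    # ['1.5']
print(not_abc)    # ['xy']
```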

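5. Grouping

Use a group when a whole substring, not just a single character, has to be repeated.

() groups a sub-pattern, e.g. (\d{1,3}\.){3}\d{1,3} (the dot is escaped so that it matches a literal '.')
Group numbers:
\1 \2 are backreferences, e.g. He (l..e)s her \1r\. matches both "He loves her lover." and "He likes her liker."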
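
The grouping and backreference patterns can be checked with the sketch below. It assumes the dots are escaped so they match a literal '.', and note that the IP-style pattern only checks the shape of the string, not that each number is between 0 and 255.

```python
import re

# A repeated group: three "digits plus dot" blocks followed by a final block.
ip_pattern = re.compile(r'(\d{1,3}\.){3}\d{1,3}')
ip_match = ip_pattern.fullmatch('192.168.0.1')

# \1 must repeat exactly what group 1 captured earlier in the same match.
backref = re.compile(r'He (l..e)s her \1r\.')
sentences = ['He loves her lover.', 'He likes her liker.', 'He loves her liker.']
results = [backref.fullmatch(s) is not None for s in sentences]

print(ip_match is not None)  # True
print(results)               # [True, True, False]
```

The third sentence fails because group 1 captured "love" while the text after "her" starts with "like".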

6. Greedy and non-greedy modes

Greedy mode is the default and matches as much as possible, e.g. a.+b
Non-greedy mode matches as little as possible, e.g. a.+?b
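
The difference between the two modes shows up when the text contains more than one possible end point. A minimal sketch, with a made-up sample string:

```python
import re

text = 'aXXbYYb'  # sample text with two possible closing b's

greedy = re.search(r'a.+b', text).group()  # runs to the last b
lazy = re.search(r'a.+?b', text).group()   # stops at the first b

print(greedy)  # aXXbYYb
print(lazy)    # aXXb
```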

7. Practical patterns

ID card number (18 characters): (\d{6})(\d{4})((\d{2})(\d{2}))\d{3}([0-9]|X)
Email address: [a-zA-Z0-9_-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,5})
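
A sketch of how these two patterns might be used. It assumes an 18-character ID number (6-digit region code, 8-digit birth date, 3-digit sequence code, and a check character) and escapes the dots in the email pattern so they match a literal '.'. The email address is a made-up example, and neither pattern is a complete validator.

```python
import re

id_pattern = re.compile(r'(\d{6})(\d{4})((\d{2})(\d{2}))\d{3}([0-9]|X)')
email_pattern = re.compile(
    r'[a-zA-Z0-9_-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,5})'
)

id_match = id_pattern.fullmatch('232321199410270017')
region, year, month_day, month, day, check = id_match.groups()
print(year, month, day, check)  # 1994 10 27 7

valid_email = email_pattern.fullmatch('user_name@example.com') is not None
missing_domain_suffix = email_pattern.fullmatch('user@example') is not None
print(valid_email, missing_domain_suffix)  # True False
```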

8. The Python re module

compile() and match()

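import re
# Compile once with re.I so that matching ignores case
pattern = re.compile(r'Hello', re.I)
# match() only matches at the start of the string
rest = pattern.match('hello world')
print(dir(rest))    # attributes and methods of the match object
print(rest.string)  # the original string passed to match()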

findall() and search()
findall() finds every match and returns them as a list; search() finds the first match and returns a match object (or None if nothing matches).

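content = 'Hello 123 World'  # sample text, assumed for illustration
# There are two ways to call it: compiled and uncompiled
# Compiled
p = re.compile(r'[a-z]+', re.I)
rest = p.findall(content)
# Uncompiled
all_rest = re.findall(r'[a-z]+', content, flags=re.I)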

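match() and search()
match() only matches at the beginning of the string and returns None if the pattern does not match there; search() succeeds as soon as the pattern is found anywhere in the string.

group(), groups(), groupdict()
group(1) returns the text captured by the group with that number
groups() returns all captured groups as a tuple
groupdict() returns the named groups as a dict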
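
The difference between match() and search() can be seen with a pattern that only occurs in the middle of the text. A minimal sketch, with made-up sample strings:

```python
import re

p = re.compile(r'\d+')

at_start = p.match('abc123')    # digits are not at the start, so this is None
anywhere = p.search('abc123')   # search scans forward and finds '123'
leading = p.match('123abc')     # digits at the start, so match succeeds

print(at_start)          # None
print(anywhere.group())  # 123
print(leading.group())   # 123
```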

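# Named groups (?P<name>...) are also available through groupdict()
p = re.compile(r'(\d{6})(?P<year>\d{4})((?P<month>\d{2})(\d{2}))\d{3}([0-9]|X)')
id1 = '232321199410270017'
rest1 = p.search(id1)
print(rest1.group(4))     # '10', the month
print(rest1.groups())     # all captured groups as a tuple
print(rest1.groupdict())  # {'year': '1994', 'month': '10'}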

split() and sub()
split(pattern, string, maxsplit=0) splits the string, using the matched text as the separator
sub(pattern, repl, string, count=0) replaces the matched text

s = 'one1two2three'
p = re.compile(r'\d+')
# Split on runs of digits, performing at most 2 splits
rest = p.split(s, maxsplit=2)
print(rest)  # ['one', 'two', 'three']

# Replace every run of digits
s = 'one1two2three'
p = re.compile(r'\d+')
rest = p.sub('@', s)  # 'one@two@three'
# Swap positions using backreferences to the groups
s1 = 'hello world'
p1 = re.compile(r'(\w+) (\w+)')
rest1 = p1.sub(r'\2 \1', s1)  # 'world hello'
# Use a function or a lambda to build the replacement
def f(m):
  return m.group(2).upper() + ' ' + m.group(1)
rest2 = p1.sub(f, s1)  # 'WORLD hello'
rest3 = p1.sub(lambda m: m.group(2).upper() + ' ' + m.group(1), s1)

9. Hands-on: extracting image URLs

import re
def test_image_url_extraction():
  with open('sample.html', encoding='utf-8') as f:
    html = f.read()
  # Capture the src attribute of each <img> tag, ignoring case
  p = re.compile(r'<img.+?src=\"(?P<src>.+?)\".+?>', re.M|re.I)
  list_img = p.findall(html)
  for i in list_img:
    # Turn the HTML entity &amp; back into &
    print(i.replace('&amp;', '&'))
  # Use the requests library for crawling
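
To try the same pattern without a sample.html file, the sketch below runs it against an inline HTML string. The markup and the example.com URLs are made up for illustration. Note that this pattern expects at least one character between the closing quote of src and the end of the tag, and a regex is only a rough tool for HTML.

```python
import re

# Inline sample markup, assumed for illustration only.
html = '''
<p>Gallery</p>
<img class="a" src="https://example.com/a.png?x=1&amp;y=2" alt="A">
<IMG src="https://example.com/b.jpg" width="10">
'''

p = re.compile(r'<img.+?src=\"(?P<src>.+?)\".+?>', re.M | re.I)

# With a single capturing group, findall() returns the captured src values.
urls = [src.replace('&amp;', '&') for src in p.findall(html)]

for url in urls:
    print(url)
# https://example.com/a.png?x=1&y=2
# https://example.com/b.jpg
```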

10. Aircraft Battle game

Last edited: 2019.10.16 17:25:34
