[Python Basics] 9 - Text Processing

String methods
Regular expressions
Pattern matching and extraction
Search and replace
Compiling regular expressions
Further reading on regular expressions

String methods

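Translating characters
- str.maketrans() builds a translation table
- translate() maps the characters of a string according to that table
The first argument to maketrans() holds the characters to be replaced, the second holds their replacements (the two strings must have the same length), and the optional third holds characters that are mapped to None, that is, deleted
Character translation examples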

>>> greeting = '===== Have a great day ====='
>>> greeting.translate(str.maketrans('=', '-'))
'----- Have a great day -----'

>>> greeting = '===== Have a great day!! ====='
>>> greeting.translate(str.maketrans('=', '-', '!'))
'----- Have a great day -----'

>>> import string
>>> quote = 'SIMPLICITY IS THE ULTIMATE SOPHISTICATION'
>>> tr_table = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
>>> quote.translate(tr_table)
'simplicity is the ultimate sophistication'

>>> sentence = "Thi1s is34 a senten6ce"
>>> sentence.translate(str.maketrans('', '', string.digits))
'This is a sentence'
>>> greeting.translate(str.maketrans('', '', string.punctuation))
' Have a great day '
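A translation table can also map letters onto other letters. Here is a minimal sketch of a ROT13 cipher built with the same two calls. The helper name `rot13` and the sample text are made up for illustration.

```python
import string

# Build a ROT13 table: each ASCII letter maps to the letter 13 places further
# along, wrapping around. Characters missing from the table pass through as-is.
lower = string.ascii_lowercase
upper = string.ascii_uppercase
rot13_table = str.maketrans(lower + upper,
                            lower[13:] + lower[:13] + upper[13:] + upper[:13])

def rot13(text):
    """Rotate ASCII letters by 13 places (hypothetical helper)."""
    return text.translate(rot13_table)

encoded = rot13('Have a great day!')
decoded = rot13(encoded)   # applying ROT13 twice restores the original
print(encoded)             # Unir n terng qnl!
print(decoded)             # Have a great day!
```

Building the table once and reusing it avoids recreating it on every call.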

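Removing characters from the start, the end, or both
Only consecutive characters at the start/end of the string are removed
Whitespace is removed by default
If more than one character is given, the argument is treated as a set, and any combination of those characters is removed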

>>> greeting = '      Have a nice day :)     '
>>> greeting.strip()
'Have a nice day :)'
>>> greeting.rstrip()
'      Have a nice day :)'
>>> greeting.lstrip()
'Have a nice day :)     '

>>> greeting.strip(') :')
'Have a nice day'

>>> greeting = '===== Have a great day!! ====='
>>> greeting.strip('=')
' Have a great day!! '
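The set behaviour is easy to trip over when the intent is to remove a fixed prefix. The sketch below contrasts the two. `strip_prefix` is a hypothetical helper and the file name is made up.

```python
def strip_prefix(text, prefix):
    """Remove prefix once if text starts with it (hypothetical helper)."""
    if prefix and text.startswith(prefix):
        return text[len(prefix):]
    return text

filename = 'test_settings.txt'

# lstrip() treats 'test_' as the set {'t', 'e', 's', '_'} and keeps removing
# leading characters for as long as they belong to that set.
as_set = filename.lstrip('test_')

# The helper removes the exact prefix and nothing more.
as_prefix = strip_prefix(filename, 'test_')

print(as_set)      # ings.txt
print(as_prefix)   # settings.txt
```

From Python 3.9 onwards, str.removeprefix() and str.removesuffix() cover the exact-match case directly.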

Styling
The width argument specifies the total length of the output string

>>> ' Hello World '.center(40, '*')
'************* Hello World **************'

Changing case and checking case

>>> sentence = 'thIs iS a saMple StrIng'

>>> sentence.capitalize()
'This is a sample string'

>>> sentence.title()
'This Is A Sample String'

>>> sentence.lower()
'this is a sample string'

>>> sentence.upper()
'THIS IS A SAMPLE STRING'

>>> sentence.swapcase()
'THiS Is A SAmPLE sTRiNG'

>>> 'good'.islower()
True

>>> 'good'.isupper()
False

Checking whether a string consists of numeric characters

>>> '1'.isnumeric()
True
>>> 'abc1'.isnumeric()
False
>>> '1.2'.isnumeric()
False
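Since isnumeric() rejects signs and decimal points, a common alternative for "does this look like a number" is to attempt the conversion. This is only a sketch, with `is_number` as a hypothetical helper name.

```python
def is_number(text):
    """Return True if float() accepts text (hypothetical helper)."""
    try:
        float(text)
    except ValueError:
        return False
    return True

# Compare the two checks side by side
for candidate in ['1', '1.2', '-3', 'abc1', '']:
    print(repr(candidate), candidate.isnumeric(), is_number(candidate))
```

Note that float() is more permissive than it may seem: it also accepts surrounding whitespace and strings such as '1e5', 'inf' and 'nan', so a stricter check may be needed depending on the input.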

Checking whether a character sequence is present

>>> sentence = 'This is a sample string'
>>> 'is' in sentence
True
>>> 'this' in sentence
False
>>> 'This' in sentence
True
>>> 'this' in sentence.lower()
True
>>> 'is a' in sentence
True
>>> 'test' not in sentence
True

Counting occurrences of a character sequence (non-overlapping)

>>> sentence = 'This is a sample string'
>>> sentence.count('is')
2
>>> sentence.count('w')
0

>>> word = 'phototonic'
>>> word.count('oto')
1
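When overlapping occurrences should be counted as well, one possible approach is to restart the search one character after each hit. `count_overlapping` below is a hypothetical helper, not a built-in method.

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing overlaps (hypothetical helper)."""
    if not sub:
        raise ValueError('sub must be non-empty')
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Resume one character after the previous match start, not after its end
        start = text.find(sub, start + 1)
    return count

word = 'phototonic'
print(word.count('oto'))               # 1
print(count_overlapping(word, 'oto'))  # 2
```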

Matching a character sequence at the start or end

>>> sentence
'This is a sample string'

>>> sentence.startswith('This')
True
>>> sentence.startswith('The')
False

>>> sentence.endswith('ing')
True
>>> sentence.endswith('ly')
False
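Both startswith() and endswith() also accept a tuple of candidates, which saves chaining several calls with `or`. A small sketch with made-up file names:

```python
filenames = ['notes.txt', 'report.pdf', 'README.md', 'photo.png']

# A tuple argument matches if any one of the suffixes matches
text_files = [name for name in filenames if name.endswith(('.txt', '.md'))]
print(text_files)   # ['notes.txt', 'README.md']
```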

Splitting a string on a character sequence
Returns a list
To split on a regular expression, use re.split()

>>> sentence = 'This is a sample string'

>>> sentence.split()
['This', 'is', 'a', 'sample', 'string']

>>> "oranges:5".split(':')
['oranges', '5']
>>> "oranges :: 5".split(' :: ')
['oranges', '5']

>>> "a e i o u".split(' ', maxsplit=1)
['a', 'e i o u']
>>> "a e i o u".split(' ', maxsplit=2)
['a', 'e', 'i o u']

>>> line = '{1.0 2.0 3.0}'
>>> nums = [float(s) for s in line.strip('{}').split()]
>>> nums
[1.0, 2.0, 3.0]
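The maxsplit argument is handy when only the first separator is meaningful. The sketch below parses "key: value" lines into a dictionary. The input text and variable names are invented for the example.

```python
config_text = """name: sample
path: C:/data/input.txt
retries: 3"""

settings = {}
for line in config_text.splitlines():
    # maxsplit=1 keeps any later ':' characters inside the value
    key, value = line.split(':', maxsplit=1)
    settings[key.strip()] = value.strip()

print(settings)
# {'name': 'sample', 'path': 'C:/data/input.txt', 'retries': '3'}
```

Without maxsplit=1 the path line would split into three pieces and the unpacking would fail.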

Joining a list of strings

>>> str_list = ['This', 'is', 'a', 'sample', 'string']
>>> str_list
['This', 'is', 'a', 'sample', 'string']
>>> ' '.join(str_list)
'This is a sample string'
>>> '-'.join(str_list)
'This-is-a-sample-string'

>>> c = ' :: '
>>> c.join(str_list)
'This :: is :: a :: sample :: string'
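join() accepts only strings, so other items have to be converted first. A minimal sketch that rebuilds a braces-and-spaces line from a list of floats:

```python
nums = [1.0, 2.0, 3.0]

# Convert each number to str before joining; passing floats directly
# would raise TypeError
line = '{' + ' '.join(str(n) for n in nums) + '}'
print(line)   # {1.0 2.0 3.0}
```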

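Replacing characters
The third argument specifies how many replacements to make
The variable has to be reassigned explicitly to keep the result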

>>> phrase = '2 be or not 2 be'
>>> phrase.replace('2', 'to')
'to be or not to be'

>>> phrase
'2 be or not 2 be'

>>> phrase.replace('2', 'to', 1)
'to be or not 2 be'

>>> phrase = phrase.replace('2', 'to')
>>> phrase
'to be or not to be'

Further reading

Python docs - string methods
Python string methods tutorial

Regular expressions

Handy reference of regular expression elements

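Meta characters	Description
^	anchor, matches at the start of the string/line
$	anchor, matches at the end of the string/line
.	matches any character except the newline character \n
|	OR operator, used to match any one of several patterns
()	used for grouping and extracting patterns
[]	character class - matches one character out of several
\^	use \ to match a meta character literally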

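Quantifiers	Description
*	matches the preceding character 0 or more times
+	matches the preceding character 1 or more times
?	matches the preceding character 0 or 1 times
{n}	matches exactly n times
{n,}	matches at least n times
{n,m}	matches at least n times and at most m times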

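Character classes	Description
[aeiou]	matches any vowel
[^aeiou]	^ inverts the selection, so this matches any character that is not a vowel (consonants, but also digits, spaces and punctuation)
[a-f]	matches any one of the characters abcdef
\d	matches a digit, same as [0-9] for ASCII text
\D	matches a non-digit, same as [^0-9] or [^\d]
\w	matches a letter, digit or underscore, same as [a-zA-Z0-9_] for ASCII text
\W	matches any character that is not a letter, digit or underscore, same as [^a-zA-Z0-9_] or [^\w]
\s	matches a whitespace character, same as [\ \t\n\r\f\v]
\S	matches a non-whitespace character, same as [^\s]
\b	word boundary, where a word is a sequence of \w characters
\B	not a word boundary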

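Compilation flags	Description
re.I	ignore case
re.M	multiline mode, the ^ and $ anchors also match at line boundaries inside the string
re.S	dotall mode (sometimes called single-line mode), . also matches \n
re.X	verbose mode, allows whitespace and comments in the pattern for readability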

Python docs - flags - details and long names of the flags
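To make the flags in the table more concrete, here is a short sketch that exercises each of them. The date pattern and the sample text are made up for illustration.

```python
import re

# re.X: the pattern may span several lines and carry comments; whitespace in
# the pattern is ignored unless escaped or placed inside a character class.
date_pattern = re.compile(r"""
    (\d{2})   # day
    /
    (\d{2})   # month
    /
    (\d{4})   # year
    """, re.X)
dates = date_pattern.findall('from 25/06/2016 to 01/07/2016')
print(dates)   # [('25', '06', '2016'), ('01', '07', '2016')]

text = 'first line\nSecond line\nthird LINE'

# re.M: ^ and $ also match at the start and end of every line
first_words = re.findall(r'^\w+', text, re.M)
print(first_words)   # ['first', 'Second', 'third']

# re.I combined with re.M using |
line_ends = re.findall(r'line$', text, re.I | re.M)
print(line_ends)     # ['line', 'line', 'LINE']

# re.S: . also matches \n, so the match can cross line breaks
without_s = bool(re.search(r'first.*third', text))
with_s = bool(re.search(r'first.*third', text, re.S))
print(without_s, with_s)   # False True
```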

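Variable	Description
\1, \2, \3 etc.	backreferences to the matched groups
\g<1>, \g<2>, \g<3> etc.	backreferences to the matched groups, a form that keeps the group number separate from any digits that follow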

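Pattern matching and extraction

Matching/extracting character sequences
Use re.search() to check whether a string contains a pattern
Use re.findall() to get a list of all matches of a pattern
Use re.split() to get a list by splitting a string on a pattern
Their syntax is as follows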

re.search(pattern, string, flags=0)
re.findall(pattern, string, flags=0)
re.split(pattern, string, maxsplit=0, flags=0)

>>> import re
>>> string = "This is a sample string"

>>> bool(re.search('is', string))
True

>>> bool(re.search('this', string))
False

>>> bool(re.search('this', string, re.I))
True

>>> bool(re.search('T', string))
True

>>> bool(re.search('is a', string))
True

>>> re.findall('i', string)
['i', 'i', 'i']

Using regular expression elements
Use the r'' raw-string format whenever the pattern contains regular expression elements

>>> string
'This is a sample string'

>>> re.findall('is', string)
['is', 'is']

>>> re.findall('\bis', string)
[]

>>> re.findall(r'\bis', string)
['is']

>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'sample', 'string']

>>> re.split(r'\s+', string)
['This', 'is', 'a', 'sample', 'string']

>>> re.split(r'\d+', 'Sample123string54with908numbers')
['Sample', 'string', 'with', 'numbers']

>>> re.split(r'(\d+)', 'Sample123string54with908numbers')
['Sample', '123', 'string', '54', 'with', '908', 'numbers']
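The reason '\bis' found nothing above is that Python itself interprets \b in an ordinary string literal before the regex engine ever sees it. A small sketch to show the difference:

```python
import re

# In a normal string literal, '\b' is the single backspace character
plain_len = len('\b')
raw_len = len(r'\b')
print(plain_len, raw_len)     # 1 2
print('\b' == '\x08')         # True

words = 'This is a sample string'

# The raw string passes the backslash through to the regex engine,
# which then reads \b as a word boundary
raw_result = re.findall(r'\bis\b', words)
print(raw_result)             # ['is']

# Escaping the backslash by hand is equivalent, just harder to read
escaped_result = re.findall('\\bis\\b', words)
print(escaped_result)         # ['is']
```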

Backreferences

>>> quote = "So many books, so little time"

>>> re.search(r'([a-z]{2,}).*\1', quote, re.I)
<re.Match object; span=(0, 17), match='So many books, so'>

>>> re.search(r'([a-z])\1', quote, re.I)
<re.Match object; span=(9, 11), match='oo'>

>>> re.findall(r'([a-z])\1', quote, re.I)
['o', 't']
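When a pattern contains a group, findall() returns only the captured text, which is why the result above is ['o', 't'] rather than ['oo', 'tt']. One way to get the whole match is re.finditer(), sketched here with the same quote:

```python
import re

quote = "So many books, so little time"

# finditer yields match objects, so the full match and its span are available
doubled = []
spans = []
for m in re.finditer(r'([a-z])\1', quote, re.I):
    doubled.append(m.group(0))   # whole match, e.g. 'oo'
    spans.append(m.span())       # (start, end) positions in quote

print(doubled)   # ['oo', 'tt']
print(spans)     # [(9, 11), (20, 22)]
```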

Search and replace

Syntax

re.sub(pattern, repl, string, count=0, flags=0)

Simple substitution
re.sub does not modify the string passed to it; the result must be assigned explicitly

>>> sentence = 'This is a sample string'
>>> re.sub('sample', 'test', sentence)
'This is a test string'

>>> sentence
'This is a sample string'
>>> sentence = re.sub('sample', 'test', sentence)
>>> sentence
'This is a test string'

>>> re.sub('/', '-', '25/06/2016')
'25-06-2016'
>>> re.sub('/', '-', '25/06/2016', count=1)
'25-06/2016'

>>> greeting = '***** Have a great day *****'
>>> re.sub(r'\*', '=', greeting)
'===== Have a great day ====='

Backreferences

>>> words = 'night and day'
>>> re.sub(r'(\w+)( \w+ )(\w+)', r'\3\2\1', words)
'day and night'

>>> line = 'Can you spot the the mistakes? I i seem to not'
>>> re.sub(r'\b(\w+) \1\b', r'\1', line, flags=re.I)
'Can you spot the mistakes? I seem to not'
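The \g<1> form from the reference table matters when a digit has to follow the backreference directly. A minimal sketch with an invented input string:

```python
import re

text = 'id 7, id 42'

# \g<1>0 means: the text of group 1, followed by a literal 0
appended = re.sub(r'(\d+)', r'\g<1>0', text)
print(appended)   # id 70, id 420

# r'\10' is read as a reference to group 10. This pattern has only one
# group, so re.sub() raises re.error instead of appending a 0.
try:
    re.sub(r'(\d+)', r'\10', text)
    ambiguous_failed = False
except re.error as exc:
    ambiguous_failed = True
    print('error:', exc)
```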

Using a function in the replacement part of re.sub()

>>> import math
>>> numbers = '1 2 3 4 5'

>>> def fact_num(n):
...     return str(math.factorial(int(n.group(1))))
...
>>> re.sub(r'(\d+)', fact_num, numbers)
'1 2 6 24 120'

>>> re.sub(r'(\d+)', lambda m: str(math.factorial(int(m.group(1)))), numbers)
'1 2 6 24 120'
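A replacement function is also a convenient place for a dictionary lookup. In the sketch below, the `abbreviations` table and the `expand` function are hypothetical and exist only for this example.

```python
import re

# Hypothetical lookup table, for illustration only
abbreviations = {'2': 'to', 'b': 'be', 'u': 'you'}

def expand(match):
    """Return the expansion for the matched word, or the word unchanged."""
    word = match.group(0)
    return abbreviations.get(word.lower(), word)

message = '2 b or not 2 b, it is up 2 u'
expanded = re.sub(r'\b\w+\b', expand, message)
print(expanded)   # to be or not to be, it is up to you
```

Because the pattern matches whole words, 'up' is looked up as 'up' and left alone rather than being treated as 'u' plus 'p'.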

Calling a function from re.sub
Replacing a string pattern with the output of a function
lambda tutorial

Compiling regular expressions

>>> swap_words = re.compile(r'(\w+)( \w+ )(\w+)')
>>> swap_words
re.compile('(\\w+)( \\w+ )(\\w+)')

>>> words = 'night and day'

>>> swap_words.search(words).group()
'night and day'
>>> swap_words.search(words).group(1)
'night'
>>> swap_words.search(words).group(2)
' and '
>>> swap_words.search(words).group(3)
'day'
>>> swap_words.search(words).group(4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

>>> bool(swap_words.search(words))
True
>>> swap_words.findall(words)
[('night', ' and ', 'day')]

>>> swap_words.sub(r'\3\2\1', words)
'day and night'
>>> swap_words.sub(r'\3\2\1', 'yin and yang')
'yang and yin'
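A compiled pattern pays off when the same expression is applied to many strings. The sketch below also guards against search() returning None, which the chained .group() calls above would not survive. The pattern and the sample lines are made up.

```python
import re

# Compile once, reuse for every line
key_value = re.compile(r'(\w+)\s*=\s*(\d+)')

lines = ['width = 40', 'height=25', 'no numbers here']

parsed = {}
for line in lines:
    m = key_value.search(line)
    # search() returns None when nothing matches, so test before calling group()
    if m:
        parsed[m.group(1)] = int(m.group(2))

print(parsed)   # {'width': 40, 'height': 25}
```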

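Further reading on regular expressions

Python docs - re module
Python docs - introduction to using regular expressions
developers.google - regular expressions tutorial
automatetheboringstuff - regular expressions
Comprehensive reference: what is regex?
Practice tools
- online regex tester - shows explanations, offers a reference guide, and lets you save and share regexes
- regexone - interactive tutorial
- cheatsheet - interactive learning
- regexcrossword - practice by solving crossword-style puzzles; read the 'How to play' section before starting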

Last edited: 2018.02.26 00:56:47

© Copyright belongs to the author. For reprints or content cooperation, please contact the author.
