Regular Expressions

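A regular expression is a logical formula for filtering strings.

  • It can test whether a given string matches a pattern
  • It can extract specific parts of a string

These notes work through the Dataquest exercises to pick up some common usage.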

1. Introduction (instructions)

In the code cell, assign to the variable regex a regular expression that's four characters long and matches every string in the list strings.

strings = ["data science", "big data", "metadata"]
regex = 'data'

2. Wildcards in Regular Expressions (instructions)

In Python, we use the re module to work with regular expressions. The module's documentation provides a list of these special characters.

For instance, we use the special character "." to indicate that any character can be put in its place.

Assign a regular expression that is three characters long and matches every string in strings to the variable regex.

strings = ["bat",'robotics','megabyte']
regex = "b.t"
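
One way to check an answer like this is to run re.search over every string and look at what was matched. This is only a minimal sketch; the variable name matches is made up for illustration and is not part of the exercise.

```python
import re

strings = ["bat", "robotics", "megabyte"]
regex = "b.t"

# "." stands for any single character (except a newline by default),
# so "b.t" finds "bat", "bot" (inside "robotics") and "byt" (inside "megabyte").
matches = [re.search(regex, s).group() for s in strings]
print(matches)  # ['bat', 'bot', 'byt']
```

Note that re.search looks for the pattern anywhere in the string, which is why "robotics" and "megabyte" match even though they do not start with "b".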

3. Searching the Beginnings and Endings of Strings (instructions)

We can use the caret symbol ("^") to match the beginning of a string, and the dollar sign ("$") to match the end of a string.

Assign a regular expression that's seven characters long and matches every string in strings (except for bad_string) to the variable regex.

strings = ["better not put too much", "butter in the", "batter"]
bad_string = "We also wouldn't want it to be bitter"
regex = '^b.tter'
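
The effect of the two anchors can be checked with a short sketch. The names good_results, bad_result and end_result are illustrative and not part of the exercise.

```python
import re

strings = ["better not put too much", "butter in the", "batter"]
bad_string = "We also wouldn't want it to be bitter"

# "^" pins the match to the start of the string.
good_results = [re.search("^b.tter", s) is not None for s in strings]
bad_result = re.search("^b.tter", bad_string) is not None

# "$" pins the match to the end of the string, so the same letters
# do match bad_string when anchored at the end instead.
end_result = re.search("b.tter$", bad_string) is not None

print(good_results, bad_result, end_result)  # [True, True, True] False True
```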

4. Introduction to the AskReddit Data Set

The AskReddit data set has five columns that appear in the following order:

Title -- The title of the post
Score -- The number of upvotes the post received
Time -- When the post was posted
Gold -- How much Reddit Gold users gave the post
NumComs -- The number of comments the post received

5. Reading and Printing the Data Set (instructions)

Title|Score|Time|Gold|NumComs
---|---|---|---|---
What's your internet "white whale", something you've been searching for years to find with no luck?|11510|1433213314|1|26195
What's your favorite video that is 10 seconds or less?|8656|1434205517|4|8479
What are some interesting tests you can take to find out about yourself?|8480|1443409636|1|4055
PhD's of Reddit. What is a dumbed down summary of your thesis?|7927|1440188623|0|13201
What is cool to be good at, yet uncool to be REALLY good at?|7711|1440082910|0|20325

Let's use the csv module to read and print our data file, "askreddit_2015.csv". Recall that we can use the csv module by performing the following steps:

  1. Import csv.
  2. Open the file that contains our CSV data in 'r' mode.
  3. Call the csv.reader() function with the file object as input.
  4. Convert the result to a list.

Instructions

  • Use the csv module to read our data set and assign it to posts_with_header.
  • Use list slicing to exclude the first row, which represents the column names. Assign this sliced data set to posts.
  • Use a for loop and list slicing to print the first 10 rows. See if you notice any patterns in this sample of the data set.
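import csv

# Read the whole CSV file into a list of rows; the with block closes the file for us.
with open("askreddit_2015.csv", 'r') as f:
    posts_with_header = list(csv.reader(f))

# Drop the header row (the column names).
posts = posts_with_header[1:]

# Print the first 10 posts.
for post in posts[:10]:
    print(post)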

6. Counting Simple Matches in the Data Set with re

We mentioned the re module earlier, and now we'll begin to use it in our code. One useful function the module provides is re.search.

With re.search(regex, string), we can check whether string is a match for regex. If it is, the expression will return a match object. If it isn't, it will return None. For now, we won't worry about returning the actual matches - we'll just compare the result to None to see whether we have a match or not.


import re

if re.search("needle", "haystack") is not None:
    print("We found it!")
else:
    print("Not a match")

The code above will print Not a match, because "haystack" is not a match for the regex "needle".
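
For comparison, here is a small sketch of what comes back when the search does succeed. The variable names are illustrative; group() and span() are standard methods of the match object.

```python
import re

# A successful search returns a match object.
match = re.search("hay", "haystack")
found = match.group()  # the matched text
span = match.span()    # (start, end) positions of the match

# An unsuccessful search returns None.
no_match = re.search("needle", "haystack")

print(found, span, no_match)  # hay (0, 3) None
```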

You may have noticed that many of the posts in our AskReddit database are directed towards particular groups of people, using phrases like "Soldiers of Reddit". These types of posts are common, and always follow a similar format. We can use regular expressions to count how many of them are in the top 1,000. Let's do this in our next exercise. We've already read the data set into the variable posts.

Instructions

Count the number of posts in our data set that match the regex "of Reddit". Assign the count to of_reddit_count.

import re

of_reddit_count = 0
for post in posts:
    # post[0] is the title column
    if re.search('of Reddit', post[0]) is not None:
        of_reddit_count += 1
print(of_reddit_count)

7. Using Square Brackets to Match Multiple Characters

For example, the regex "[bcr]at" would match the substrings "bat", "cat", and "rat", but nothing else. We indicate that the first character in the regex can be either "b", "c" or "r".
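
A quick sketch of that behaviour, using a made-up word list (the list and the name matched are assumptions for illustration only):

```python
import re

words = ["bat", "cat", "rat", "hat", "scatter"]

# "[bcr]" matches exactly one character: "b", "c" or "r".
matched = [w for w in words if re.search("[bcr]at", w) is not None]

print(matched)  # ['bat', 'cat', 'rat', 'scatter']
```

"hat" is rejected because "h" is not in the brackets, while "scatter" is accepted because it contains the substring "cat".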

Instructions

  • Use square bracket notation to make the code account for both capitalizations of "Reddit", and count how many posts contain "of Reddit" or "of reddit" in the title.
  • Assign the resulting count to of_reddit_count.
import re

of_reddit_count = 0
for post in posts:
    # [rR] accepts either capitalization of the first letter
    if re.search('of [rR]eddit', post[0]) is not None:
        of_reddit_count += 1

8. Escaping Special Characters

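Square brackets have a special meaning in a regex, so a pattern like "[Serious]" does not match the literal tag "[Serious]". To deal with this sort of problem, we need to escape special characters with a backslash (\).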
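
The difference is easy to see on a title that has no tag at all. The title below is hypothetical, as are the variable names; this is just a sketch of why escaping matters.

```python
import re

title = "What is a serious question?"  # hypothetical title without a tag

# Unescaped, "[Serious]" is a character set: it matches any single one of
# the characters S, e, r, i, o, u, s. Here it finds the "i" in "is".
unescaped = re.search("[Serious]", title)

# Escaped, the brackets are literal, so the untagged title does not match.
escaped = re.search(r"\[Serious\]", title)

# A title that really carries the tag matches the escaped pattern.
tagged = re.search(r"\[Serious\]", "[Serious] " + title)

print(unescaped.group(), escaped, tagged.group())  # i None [Serious]
```

Writing the pattern as a raw string (the r prefix) keeps Python from interpreting the backslashes itself before the re module sees them.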

Instructions

  • Escape the square bracket characters to count the number of posts in our data set that contain the "[Serious]" tag.
  • Assign the count to serious_count.
import re

serious_count = 0
for post in posts:
    # Raw string so the backslashes reach the regex engine unchanged
    if re.search(r'\[Serious\]', post[0]) is not None:
        serious_count += 1

9. Combining Escaped Characters and Multiple Matches

Some people tag serious posts as "[Serious]", and others as "[serious]". We should account for both capitalizations.

Instructions

  • Refine the code to count how many posts have either "[Serious]" or "[serious]" in the title.
  • Assign the count to serious_count.
import re

serious_count = 0
for post in posts:
    # Escaped brackets are literal; [Ss] accepts either capitalization
    if re.search(r'\[[Ss]erious\]', post[0]) is not None:
        serious_count += 1

10. Adding More Complexity to Your Regular Expression

In our data set, some users have tagged their posts with "(Serious)" or "(serious)", including the parentheses. Therefore, we should account for both square brackets and parentheses. We can do this by using square bracket notation, and escaping the "[", "]", "(", and ")" characters with the backslash.

Instructions

  • Refine the code so that it counts how many posts have the serious tag enclosed in either square brackets or parentheses.
  • Assign the count to serious_count.
import re

serious_count = 0
for post in posts:
    # [\[\(] matches "[" or "(", and [\]\)] matches "]" or ")"
    if re.search(r'[\[\(][Ss]erious[\]\)]', post[0]) is not None:
        serious_count += 1

11. Combining Multiple Regular Expressions

To combine regular expressions, we use the "|" character.
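
As a minimal sketch of how "|" behaves, here are two anchored patterns joined into one. The pattern and the titles are made up for illustration and are not from the data set.

```python
import re

# Matches strings that start with "cat" OR end with "dog".
pattern = "^cat|dog$"

titles = ["cat person", "hot dog", "dog and cat show"]
results = [re.search(pattern, t) is not None for t in titles]

print(results)  # [True, True, False]
```

The last title contains both words, but neither is in the anchored position, so neither side of the "|" matches.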

Instructions

  • Use the "^" character to count how many posts include the serious tag at the beginning of the title. Assign this count to serious_start_count.
  • Use the "$" character to count how many posts include the serious tag at the end of the title. Assign this count to serious_end_count.
  • Use the "|" character to count how many posts include the serious tag at either the beginning or end of the title. Assign this count to serious_count_final.
import re

serious_start_count = 0
serious_end_count = 0
serious_count_final = 0

# Tag at the beginning of the title
for row in posts:
    if re.search(r'^[\[\(][Ss]erious[\]\)]', row[0]) is not None:
        serious_start_count += 1
# Tag at the end of the title
for row in posts:
    if re.search(r'[\[\(][Ss]erious[\]\)]$', row[0]) is not None:
        serious_end_count += 1
# Tag at either the beginning or the end
for row in posts:
    if re.search(r'^[\[\(][Ss]erious[\]\)]|[\[\(][Ss]erious[\]\)]$', row[0]) is not None:
        serious_count_final += 1

12. Using Regular Expressions to Substitute Strings

The re module provides a sub() function that takes the following parameters (in order):

  • pattern: The regex to match
  • repl: The string that should replace the substring matches
  • string: The string containing the pattern we want to search
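
A small sketch of the argument order, using a made-up string (the names text and replaced are illustrative):

```python
import re

text = "big data and metadata"

# re.sub(pattern, repl, string) returns a NEW string with every match replaced.
replaced = re.sub("data", "DATA", text)

print(replaced)  # big DATA and metaDATA
print(text)      # big data and metadata (the original is unchanged)
```

Because strings are immutable, the return value has to be assigned somewhere for the replacement to have any effect.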

Instructions

  • Replace "[serious]", "(Serious)", and "(serious)" with "[Serious]" for all of the titles in posts.
  • You should only need to use one call to sub(), and one regex.
  • Recall that the repl argument is an ordinary string. It's not a regex, so you don't need to escape characters like "[".

Hint

"[\[\(][Ss]erious[\]\)]" is the pattern argument to sub(), and "[Serious]" is the repl argument.

import re

for row in posts:
    # re.sub returns a new string, so write the result back into the row.
    row[0] = re.sub(r'[\[\(][Ss]erious[\]\)]', '[Serious]', row[0])

13. Matching Years with Regular Expressions

We can indicate that we're looking for integers in a pattern by using square brackets ("[" and "]"), along with a dash ("-"). For example, "[0-9]" will match any character that falls between 0 and 9 (all of which will be one-digit integers). Similarly, "[a-z]" would match any lowercase letter. We can also specify smaller ranges like "[3-5]" or "[d-g]".

To find years, we could match any four-digit number with "[0-9][0-9][0-9][0-9]". This would work, but let's also add the condition that we only want to match years after year 999 and before year 3000 (any other four-digit numbers in a string are probably not years).
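
Here is a rough sketch of how such a range-based pattern behaves. The sample strings are invented for illustration and are not the strings variable from the exercise.

```python
import re

pattern = "[1-2][0-9][0-9][0-9]"

samples = ["Founded in 1987", "Room 0420", "The year 3001", "Call 555"]
results = [re.search(pattern, s) is not None for s in samples]
print(results)  # [True, False, False, False]

# Caveat: the pattern only looks for four matching characters in a row,
# so it also fires inside a longer number.
inside_longer = re.search(pattern, "Item 51234").group()
print(inside_longer)  # 1234
```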

Instructions

  • We've loaded a number of strings into the strings variable for you.
  • Loop through strings and use re.search() to determine whether each string contains a year between 1000 and 2999.
  • Store every string that contains a year in year_strings. The .append() function will help here.
import re

year_strings = []
for string in strings:
    # First digit 1 or 2, then any three digits: 1000 through 2999
    if re.search('[1-2][0-9][0-9][0-9]', string) is not None:
        year_strings.append(string)

14. Repeating Characters in Regular Expressions

We can use curly brackets ("{" and "}") to indicate that a pattern should repeat. To match any four-digit number, for example, we could repeat the pattern "[0-9]" four times by writing "[0-9]{4}".
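
The two spellings can be compared directly. The text below is a made-up example, and the variable names are illustrative.

```python
import re

long_form = "[0-9][0-9][0-9][0-9]"
short_form = "[0-9]{4}"

text = "Released in 2015"
long_match = re.search(long_form, text).group()
short_match = re.search(short_form, text).group()
print(long_match, short_match)  # 2015 2015

# With only three digits available, "{4}" cannot be satisfied.
too_short = re.search(short_form, "Only 123 here")
print(too_short)  # None
```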

Instructions

  • We've loaded a number of strings into the strings variable for you.
  • Loop through strings and use re.search() to determine whether each string contains a year between 1000 and 2999. Use a regex that takes advantage of curly brackets.
  • Store every string that contains a year in year_strings. The .append() function will help here.
import re

year_strings = []
for string in strings:
    # {3} repeats [0-9] three times
    if re.search('[1-2][0-9]{3}', string) is not None:
        year_strings.append(string)

15. Challenge: Extracting All Years

Finally, let's extract years from a string. The re module contains a findall() function that returns a list of substrings matching the regex. re.findall("[a-z]", "abc123") would return ["a", "b", "c"], because those are the substrings that match the regex.
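
A short sketch of findall() on the same kind of input. The second and third calls are additional illustrations, not part of the exercise.

```python
import re

letters = re.findall("[a-z]", "abc123")
digits = re.findall("[0-9]{3}", "abc123")

# Matches do not overlap: scanning resumes after the end of each match.
pairs = re.findall("[0-9]{2}", "12345")

print(letters, digits, pairs)  # ['a', 'b', 'c'] ['123'] ['12', '34']
```

If nothing matches, findall() returns an empty list rather than None, so the result can be looped over without a separate check.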

Instructions

  • Use re.findall() to generate a list of all years between 1000 and 2999 in the string years_string.
  • Assign the result to years.
import re

years = re.findall('[1-2][0-9]{3}', years_string)