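Abstract: In the five days leading up to Easter 2018, Kaggle ran five data preprocessing challenges. This series reviews the key techniques I learned each day. This post covers the final day, which is about cleaning up redundant text entries that carry the same information but are formatted inconsistently.
Text data often contains spelling mistakes, stray spaces and similar problems. If values that actually mean the same thing are categorized as they stand, a machine learning algorithm will treat them as different values, which can get in the way of classifying the data correctly.
You can of course fix these by hand, but as the data grows, automated cleanup is the only approach that scales.
Environment setup
The one special package needed here is fuzzywuzzy. chardet was covered in detail in the previous post, on day four.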
# modules we'll use
import pandas as pd
import numpy as np
# helpful modules
import fuzzywuzzy
from fuzzywuzzy import process
import chardet
Looking at what causes the duplicates
After loading the data, pull out the "City" column and see how many distinct values it contains:
cities = suicide_attacks['City'].unique()
# sort them alphabetically and then take a closer look
cities.sort()
cities
The result is an array of city names:
array(['ATTOCK', 'Attock ', 'Bajaur Agency', 'Bannu', 'Bhakkar ', 'Buner', 'Chakwal ', 'Chaman', 'Charsadda', 'Charsadda ', 'D. I Khan', 'D.G Khan', 'D.G Khan ', 'D.I Khan', 'D.I Khan ', 'Dara Adam Khel', 'Dara Adam khel', 'Fateh Jang', 'Ghallanai, Mohmand Agency ', 'Gujrat', 'Hangu', 'Haripur', 'Hayatabad', 'Islamabad', 'Islamabad ', 'Jacobabad', 'KURRAM AGENCY', 'Karachi', 'Karachi ', 'Karak', 'Khanewal', 'Khuzdar', 'Khyber Agency', 'Khyber Agency ', 'Kohat', 'Kohat ', 'Kuram Agency ', 'Lahore', 'Lahore ', 'Lakki Marwat', 'Lakki marwat', 'Lasbela', 'Lower Dir', 'MULTAN', 'Malakand ', 'Mansehra', 'Mardan', 'Mohmand Agency', 'Mohmand Agency ', 'Mohmand agency', 'Mosal Kor, Mohmand Agency', 'Multan', 'Muzaffarabad', 'North Waziristan', 'North waziristan', 'Nowshehra', 'Orakzai Agency', 'Peshawar', 'Peshawar ', 'Pishin', 'Poonch', 'Quetta', 'Quetta ', 'Rawalpindi', 'Sargodha', 'Sehwan town', 'Shabqadar-Charsadda', 'Shangla ', 'Shikarpur', 'Sialkot', 'South Waziristan', 'South waziristan', 'Sudhanoti', 'Sukkur', 'Swabi ', 'Swat', 'Swat ', 'Taftan', 'Tangi, Charsadda District', 'Tank', 'Tank ', 'Taunsa', 'Tirah Valley', 'Totalai', 'Upper Dir', 'Wagah', 'Zhob', 'bannu', 'karachi', 'karachi ', 'lakki marwat', 'peshawar', 'swat'], dtype=object)
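Many of these place names are duplicates that count as different values only because of a trailing space or a difference in capitalization.
The first step is to remove the case differences and strip the extra whitespace from both ends of each string. These two simple operations eliminate a large share of the inconsistencies in English text.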
# convert to lower case
suicide_attacks['City'] = suicide_attacks['City'].str.lower()
# remove leading and trailing white spaces
suicide_attacks['City'] = suicide_attacks['City'].str.strip()
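As a rough illustration of how much these two steps collapse the list, here is a minimal plain-Python sketch. It uses a handful of values copied from the array above and ordinary string methods as a stand-in for the pandas `.str` calls; the variable names are hypothetical.

```python
# A few raw values copied from the city array above.
raw_cities = ['ATTOCK', 'Attock ', 'Karachi', 'Karachi ', 'karachi', 'karachi ',
              'D.I Khan', 'D.I Khan ', 'D. I Khan']

# Plain-Python stand-in for .str.lower() followed by .str.strip().
cleaned_cities = [city.lower().strip() for city in raw_cities]

# Compare the distinct values before and after cleaning.
unique_before = sorted(set(raw_cities))
unique_after = sorted(set(cleaned_cities))

print(len(unique_before), "distinct values before cleaning")
print(len(unique_after), "distinct values after cleaning:", unique_after)
```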
Replacing similar items with fuzzywuzzy
At this point the city list looks like this:
array(['attock', 'bajaur agency', 'bannu', 'bhakkar', 'buner', 'chakwal', 'chaman', 'charsadda', 'd. i khan', 'd.g khan', 'd.i khan', 'dara adam khel', 'fateh jang', 'ghallanai, mohmand agency', 'gujrat', 'hangu', 'haripur', 'hayatabad', 'islamabad', 'jacobabad', 'karachi', 'karak', 'khanewal', 'khuzdar', 'khyber agency', 'kohat', 'kuram agency', 'kurram agency', 'lahore', 'lakki marwat', 'lasbela', 'lower dir', 'malakand', 'mansehra', 'mardan', 'mohmand agency', 'mosal kor, mohmand agency', 'multan', 'muzaffarabad', 'north waziristan', 'nowshehra', 'orakzai agency', 'peshawar', 'pishin', 'poonch', 'quetta', 'rawalpindi', 'sargodha', 'sehwan town', 'shabqadar-charsadda', 'shangla', 'shikarpur', 'sialkot', 'south waziristan', 'sudhanoti', 'sukkur', 'swabi', 'swat', 'taftan', 'tangi, charsadda district', 'tank', 'taunsa', 'tirah valley', 'totalai', 'upper dir', 'wagah', 'zhob'], dtype=object)
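Notice that 'd. i khan' and 'd.i khan' are still split into two categories because of a single character (a space). This is where fuzzy matching comes in: it finds strings that are close to each other so one can be replaced with the other.
Fuzzy matching works like this: you supply a string, and the computer compares it against the strings in the data and scores each one. The more similar two strings are, the higher the score, up to a maximum of 100. Higher similarity means fewer characters have to change to turn one string into the other. For example, "apple" and "snapple" are two changes apart, while "in" and "on" are one change apart. The call looks like this: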
# get the top 10 closest matches to "d.i khan"
matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
# take a look at them
matches
This cell returns the 10 entries in the city list that are most similar to "d.i khan", together with their similarity scores, sorted from highest to lowest.
[('d. i khan', 100), ('d.i khan', 100), ('d.g khan', 88), ('khanewal', 50), ('sudhanoti', 47), ('hangu', 46), ('kohat', 46), ('dara adam khel', 45), ('chaman', 43), ('mardan', 43)]
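The "number of changes" idea behind these scores can be made concrete with a minimal sketch of Levenshtein edit distance in plain Python. This is only an illustration: the helper name `edit_distance` is hypothetical, and it is not fuzzywuzzy's scoring code, since scorers such as `token_sort_ratio` also preprocess the strings and report a 0 to 100 ratio rather than a raw count.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    or substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("apple", "snapple"))  # two insertions: "s" and "n"
print(edit_distance("in", "on"))          # one substitution: "i" -> "o"
```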
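Next, we programmatically replace every entry whose similarity score is at least 90 (the one scoring 88 is a different city, not a formatting error).
Whenever you need to repeat a general-purpose operation, write a function. You can then simply call it later, which saves effort and reduces the chance of mistakes.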
# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    # get a list of unique strings
    strings = df[column].unique()
    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings,
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
    # only keep matches with a ratio of at least min_ratio
    close_matches = [matches[0] for matches in matches if matches[1] >= min_ratio]
    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)
    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match
    # let us know the function's done
    print("All done!")
A couple of points are worth spelling out:
- The line
close_matches = [matches[0] for matches in matches if matches[1] >= min_ratio]
is a list comprehension. It has the same effect as:
close_matches = []
for matches in matches:
    if matches[1] >= min_ratio:
        close_matches.append(matches[0])
- df[column].isin(values) (that is, pandas.Series.isin) returns a Boolean mask that is True at every position whose value appears in values and False everywhere else.
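The two notes above can be sketched together in plain Python. The scores are the first four entries copied from the fuzzywuzzy output earlier; the list `city_column` is a made-up stand-in for the dataframe column, and the membership test is only an analogue of what `isin` computes, not pandas itself.

```python
# First four (string, score) pairs copied from the fuzzywuzzy output above.
matches = [('d. i khan', 100), ('d.i khan', 100), ('d.g khan', 88), ('khanewal', 50)]
min_ratio = 90

# Same filter as in the function, with each tuple unpacked for readability.
close_matches = [name for name, score in matches if score >= min_ratio]

# Hypothetical column values; the comprehension mimics df[column].isin(close_matches).
city_column = ['d.i khan', 'd. i khan', 'd.g khan', 'karachi', 'd. i khan']
rows_with_matches = [city in close_matches for city in city_column]

print(close_matches)
print(rows_with_matches)
```

Only the positions marked True would be overwritten by the `df.loc[...] = string_to_match` assignment; 'd.g khan' stays untouched because its score of 88 is below the threshold.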
Calling the function replaces every value that closely matches "d.i khan":
# use the function we just wrote to replace close matches to "d.i khan" with "d.i khan"
replace_matches_in_column(df=suicide_attacks, column='City', string_to_match="d.i khan")
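A quick sanity check after a replacement like this is to list the distinct values again and confirm that only the intended entries were merged. The sketch below does that on a small made-up list, using only the standard library: `difflib.SequenceMatcher` is swapped in for fuzzywuzzy here, so its scores will not match `token_sort_ratio` exactly, and the helper names `similarity` and `replace_close_values` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity of two strings on a 0-100 scale (difflib stand-in scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def replace_close_values(values, string_to_match, min_ratio=90):
    """Return a copy of values where close matches are replaced by string_to_match."""
    return [string_to_match if similarity(value, string_to_match) >= min_ratio else value
            for value in values]


# Made-up sample standing in for the cleaned 'City' column.
cities = ['d. i khan', 'd.i khan', 'd.g khan', 'karachi', 'd.i khan']
cleaned = replace_close_values(cities, 'd.i khan')

# Re-list the distinct values to confirm what was merged.
print(sorted(set(cleaned)))
```

With this sample, 'd. i khan' is folded into 'd.i khan', while 'd.g khan' and 'karachi' fall below the threshold and remain separate entries.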