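1. Use a phone app to photograph and collect all the cloze-test passages, then add them to the test-paper assembly center
2. Copy the page source into Sublime
Press F12 to locate the relevant source, then right-click and copy outerHTML
Paste it into Sublime
3. Work out a regular expression to extract the words in the options
Looking at the source, every set of A/B/C/D options is followed by a newline
So the regex is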
A.([\s\S]+?)B.([\s\S]+?)C.([\s\S]+?)D.([\s\S]+?)\n
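To see what this pattern captures, here is a minimal sketch run on a made-up snippet (the sample text is an assumption, not taken from the real page). Note that the `.` after each letter is unescaped, so it matches any character and not only a literal period.

```python
import re

# Pattern from the step above; the trailing \n terminates option D
reg = r'A.([\s\S]+?)B.([\s\S]+?)C.([\s\S]+?)D.([\s\S]+?)\n'

# Hypothetical sample: two option rows, each ending in a newline
sample = "A. happy B. sad C. angry D. tired\nA. run B. walk C. jump D. swim\n"

# findall returns one 4-tuple per option row; flatten and trim whitespace
rows = re.findall(reg, sample)
options = [word.strip() for row in rows for word in row]
print(options)
```

Each match is a tuple of four captured groups, which is why the extraction script loops twice before appending to the list.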
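4. Use code to extract all the words and save them as xlsx
Read the source file into a string
Use the regex to pull out the words after A, B, C and D
Data cleaning: replace impurities such as &nbsp;
Load the data into a list, convert it to a Series, and count frequencies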
import re
import pandas as pd

'''
Extract the options from the copied page source and count their frequency; option D is followed by a newline
'''
# Read the saved HTML file
with open("/Users/josephxie/Desktop/完型填空.html", "r") as f:
    text = f.read()  # read the whole file into one string

# Data cleaning: strip HTML tags first
text = re.sub(r'<[\s\S]+?>', '', text)
# The first argument shows up as a blank in the original notes; presumably the &nbsp; entity
text = text.replace('&nbsp;', ' ')
text = text.replace('\n ', '')
# Already covered by the tag removal above; kept from the original script
text = re.sub(r'<td width=[\s\S]+?>', '', text)

# Pull out the four options of each question
reg = r'A.([\s\S]+?)B.([\s\S]+?)C.([\s\S]+?)D.([\s\S]+?)\n'
matches = re.findall(reg, text)
words = []
for options in matches:
    for option in options:
        words.append(option.lstrip())
# print(words)

# Count how many times each word appears
data = pd.Series(words).value_counts()
data.to_excel('/Users/josephxie/Desktop/text.xlsx')
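1013 results in total
5. Inspect the results: some of the data has problems
Some options were not matched, so copy the failed data by hand into a new Sublime file and extract again
- Some of the passage text itself contains "a."
- Some options are not followed by a newline
After inspection, the regex becomes (option D is now terminated by a space instead of a newline)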
A.([\s\S]+?)B.([\s\S]+?)C.([\s\S]+?)D.([\s\S]+?)
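The missing-newline problem can be reproduced on a small made-up sample (the strings below are assumptions for illustration). The newline-terminated pattern silently skips a row that ends in a space, while the space-terminated pattern recovers it when run on that row alone.

```python
import re

reg_newline = r'A.([\s\S]+?)B.([\s\S]+?)C.([\s\S]+?)D.([\s\S]+?)\n'
reg_space = r'A.([\s\S]+?)B.([\s\S]+?)C.([\s\S]+?)D.([\s\S]+?) '

# Hypothetical sample: the second row ends with a space, not a newline
sample = "A. happy B. sad C. angry D. tired\nA. run B. walk C. jump D. swim "

# Only the first row has the newline the pattern requires
matched = re.findall(reg_newline, sample)
print(len(matched))

# Re-run the space-terminated pattern on the failed row only
failed_row = "A. run B. walk C. jump D. swim "
retry = [w.strip() for row in re.findall(reg_space, failed_row) for w in row]
print(retry)
```

Running the space-terminated pattern only on the rows that failed is deliberate: on rows that do end in a newline, option D would keep matching past the newline until the next space.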
Extract again
'''
Options that failed to extract, copied by hand from the results; option D is followed by a space
'''
with open("/Users/josephxie/Desktop/Html2", "r") as f2:
    text2 = f2.read()  # read the whole file into one string

reg = r'A.([\s\S]+?)B.([\s\S]+?)C.([\s\S]+?)D.([\s\S]+?) '
matches2 = re.findall(reg, text2)
words2 = []
for options in matches2:
    for option in options:
        words2.append(option.lstrip())

data2 = pd.Series(words2).value_counts()
data2.to_excel('/Users/josephxie/Desktop/text2.xlsx')
print(data2)
227 results in total
6. Merge the two sets of results
'''
Merge the two result DataFrames
'''
# Column names: 單詞 = word, 頻率 = frequency
df1 = pd.read_excel('/Users/josephxie/Desktop/text.xlsx', names=['單詞', '頻率'])
df2 = pd.read_excel('/Users/josephxie/Desktop/text2.xlsx', names=['單詞', '頻率'])
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df3 = pd.concat([df1, df2])
df4 = df3.groupby(by='單詞').sum()
df4.sort_values('頻率', ascending=False).to_excel('/Users/josephxie/Desktop/result.xlsx')
df4
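The merge itself is just adding up counts per word. As a rough illustration that avoids the round trip through Excel, the same idea can be sketched with the standard library's `Counter` (the counts below are made-up placeholders, not the real results):

```python
from collections import Counter

# Hypothetical counts standing in for the two extraction passes
first_pass = Counter({"happy": 3, "run": 2, "sad": 1})
second_pass = Counter({"run": 1, "swim": 4})

# Adding two Counters sums the counts of words that appear in both
merged = first_pass + second_pass

# most_common() lists the words from most to least frequent
for word, freq in merged.most_common():
    print(word, freq)
```

This is an alternative sketch, not what the script above does; the pandas version is handy when the intermediate xlsx files are wanted anyway.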
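7. Sort alphabetically in Excel and merge similar words by hand
I couldn't think of suitable code for this, so it had to be edited manually
Sort column A first, then manually merge words that share the same root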
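One possible way to automate part of this step is to reduce each word to a rough root before summing the counts. The sketch below is only a crude heuristic under stated assumptions: `crude_root` is a hypothetical helper that lowercases a word and strips a few common English suffixes, and the counts are made up. It will get irregular forms and doubled consonants wrong (for example "running"), so a proper stemmer or lemmatizer, such as those in NLTK, would likely do better, and a manual check is still needed.

```python
from collections import Counter

def crude_root(word):
    """Very rough normalisation: lowercase, then strip one common suffix."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three letters so short words like "bed" stay intact
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

# Hypothetical word counts standing in for the merged results
counts = Counter({"walk": 3, "walked": 2, "walking": 1, "Walks": 1, "sad": 2})

# Sum the frequencies of all words that share the same rough root
merged = Counter()
for word, freq in counts.items():
    merged[crude_root(word)] += freq

print(merged.most_common())
```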