This tutorial shows how to do simple data extraction and storage from text documents.
Prerequisites
- String, List, Dictionary, and Tuple functions
- File I/O functions
For details, see my other Jianshu post:
Python for Informatics(File&String&List&Dictionary&Tuple)
Example projects
- Read a remote document, extract the confidence values, and compute their average (exercise from "Python for Informatics")
from urllib.request import urlopen

file_url = 'http://www.py4inf.com/code/mbox-short.txt'
file_list = urlopen(file_url)
sign = "X-DSPAM-Confidence: "
conf_list = []
for line in file_list:
    line = str(line, 'utf-8')  # note the type conversion: urlopen() yields bytes
    if line.startswith(sign):  # keep non-target lines out of the data
        confidence = line[len(sign):].strip()  # text after the prefix, without the newline
        print(confidence)
        conf_list.append(float(confidence))
# use a name other than "sum" so the built-in function stays available
total = 0
num = 0
for conf in conf_list:
    total += conf
    num += 1
print("Average spam confidence: " + str(total / num))
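The same extraction can be wrapped in a function that works on any iterable of lines, which makes it easy to try without a network connection. This is only a minimal sketch: the function name `average_confidence`, the constant `PREFIX`, and the sample lines are made up for illustration and are not part of the exercise.

```python
PREFIX = "X-DSPAM-Confidence:"  # hypothetical constant name, same marker as above


def average_confidence(lines):
    """Return the mean of the values on lines that start with PREFIX, or None."""
    values = []
    for line in lines:
        if isinstance(line, bytes):  # urlopen() yields bytes, open() yields str
            line = line.decode("utf-8")
        if line.startswith(PREFIX):
            values.append(float(line[len(PREFIX):].strip()))
    if not values:  # avoid dividing by zero when nothing matched
        return None
    return sum(values) / len(values)


# Made-up sample lines standing in for the downloaded file
sample = [
    "From: someone@example.com\n",
    "X-DSPAM-Confidence: 0.50\n",
    b"X-DSPAM-Confidence: 0.75\n",
    "Subject: hello\n",
]
print(average_confidence(sample))  # 0.625
```

Passing `urlopen(file_url)` instead of `sample` should behave like the loop above, since the function decodes byte lines itself.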
- Read a remote document, collect every distinct word into a list, and sort the list alphabetically (exercise from "Python for Informatics")
from urllib.request import urlopen

url = "http://www.py4inf.com/code/romeo.txt"
url_file = urlopen(url)
words = []
for line in url_file:
    line = str(line, 'utf-8')  # urlopen() yields bytes, so decode first
    temp_words = line.split()
    for word in temp_words:
        if word not in words:  # keep each word only once
            words.append(word)
words.sort()  # sort once, after every line has been read
print(words)
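Checking `word not in words` against a list scans the whole list each time, which gets slow on large texts. One possible alternative, sketched here under the assumption that the input is already decoded to `str`, is to collect the words in a `set` and sort at the end. The function name `unique_sorted_words` and the sample lines are illustrative only.

```python
def unique_sorted_words(lines):
    """Return the distinct whitespace-separated words in lines, sorted."""
    seen = set()
    for line in lines:
        seen.update(line.split())  # a set drops duplicates automatically
    return sorted(seen)  # sorted() returns a new list


# Made-up sample lines standing in for the downloaded file
sample = [
    "But soft what light\n",
    "It is the east and Juliet is the sun\n",
]
print(unique_sorted_words(sample))
# ['But', 'It', 'Juliet', 'and', 'east', 'is', 'light', 'soft', 'sun', 'the', 'what']
```

Capitalized words come first because Python compares strings by code point, and uppercase letters sort before lowercase ones.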
- Find the ten most frequent words in a text (exercise from "Python for Informatics")
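import string

# Build the table once; the third argument lists the characters to delete
table = str.maketrans('', '', string.punctuation)
words = dict()
with open('text.txt') as fhand:
    for line in fhand:
        # strings are immutable: translate() and lower() return new strings,
        # so their results have to be assigned back
        line = line.translate(table)  # strip all punctuation; remember to import string (in Python 3, translate() takes a single argument)
        line = line.lower()
        word_list = line.split()
        for word in word_list:
            if word not in words:
                words[word] = 1
            else:
                words[word] += 1
# build (count, word) tuples so that sorting orders by count first
words_cooked = list()
for key, value in words.items():
    words_cooked.append((value, key))
words_cooked.sort(reverse=True)
for count, word in words_cooked[:10]:
    print(count, word)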
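The standard library's `collections.Counter` can do the counting and ranking in fewer lines. The sketch below is one possible variant, not the book's solution; the function name `top_words` and the sample lines are assumptions made for illustration.

```python
import string
from collections import Counter


def top_words(lines, n=10):
    """Return the n most common words in lines as (word, count) pairs."""
    table = str.maketrans("", "", string.punctuation)  # deletes punctuation
    counts = Counter()
    for line in lines:
        # strip punctuation, lowercase, split, then add to the running counts
        counts.update(line.translate(table).lower().split())
    return counts.most_common(n)


# Made-up sample lines standing in for text.txt
sample = ["The cat, the dog.\n", "The CAT!\n"]
print(top_words(sample, 2))  # [('the', 3), ('cat', 2)]
```

With a file, the call would look like `top_words(open('text.txt'))`. Note that ties may come out in a different order than with the tuple sort above: `most_common()` keeps equally frequent words in the order they were first seen, while sorting `(count, word)` tuples in reverse puts them in reverse alphabetical order.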