top 10的算法:我們只需要維護一個10個大小的數(shù)組,初始化放入10Query,按照每個Query的統(tǒng)計次數(shù)由大到小排序,然后遍歷這300萬條記錄罪郊,每讀一條記錄list后進行從大到小排序。如果list長度為11尚洽,則pop()默認刪除最后一個元素悔橄。
不難分析出,這樣的算法的時間復(fù)雜度是N*K, 其中K是指top多少癣疟。
#!/usr/bin/python
"""
Your mapper function should print out 10 lines containing longest posts, sorted in
ascending order from shortest to longest.
Please do not use global variables and do not change the "main" function.
"""
import sys
import csv
def mapper():
reader = csv.reader(sys.stdin, delimiter='\t')
writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
a= []
for line in reader:
a.append(line)# YOUR CODE HERE
a.sort(key=lambda x: len(x[4]),reverse=True)
if len(a) == 11:
a.pop()
for line in reversed(a[0:10]):
writer.writerow(line)
test_text = """\"\"\t\"\"\t\"\"\t\"\"\t\"333\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"88888888\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"1\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"11111111111\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"1000000000\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"22\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"4444\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"666666\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"55555\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"999999999\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"7777777\"\t\"\"
"""
# This function allows you to test the mapper with the provided test string
def main():
import StringIO
sys.stdin = StringIO.StringIO(test_text)
mapper()
sys.stdin = sys.__stdin__
main()
這里top10代碼的核心是:
def mapper():
reader = csv.reader(sys.stdin, delimiter='\t')
writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
a = []
for line in reader:
a.append(line)# YOUR CODE HERE
a.sort(key=lambda x: len(x[4]),reverse=True)
if len(a) == 11:
a.pop()
for line in reversed(a[0:10]):
writer.writerow(line)
python中的lambda表達式可以減少代碼量挣柬。參數(shù)為列表x,返回x[4]的長度作為排序的key睛挚,reverse=True表示降序排列邪蛔。 a.sort(key=lambda x: len(x[4]),reverse=True)