Douban already lists its 250 top-rated movies, so all I need to do is write a very simple Python program that collects the link and title of each movie and prints them out.
Output
[Screenshot: program output]
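Approach
Fetch the page at the given starting URL, then use regular expressions to pick out the information I need. Most of the work is writing the two functions that get called repeatedly (one extracts the links and titles from a page, the other finds the next-page link) and then calling them in a loop.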
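To make the regular-expression step concrete, here is a minimal sketch that runs offline. The HTML fragment, the `example.com` URLs, and the helper name `extract_items` are made up for illustration and are not real Douban markup; only the two lookaround patterns mirror the ones used in the source below.

```python
import re

# Hypothetical HTML fragment, loosely shaped like a list page (not real Douban markup).
SAMPLE_HTML = '''
<ol class="grid_view">
<li><a href="http://example.com/subject/1/"><span class="title">Movie A</span></a></li>
<li><a href="http://example.com/subject/2/"><span class="title">Movie B</span></a></li>
</ol>
'''

# Lookbehind/lookahead keep only the text between the delimiters.
LINK_RE = re.compile(r'(?<=href=").+?(?=")')
TITLE_RE = re.compile(r'(?<=class="title">).+?(?=</span>)')

def extract_items(html):
    """Return (link, title) pairs, one per <li> ... </li> chunk."""
    items = []
    for chunk in re.findall(r'<li>.*?</li>', html, re.S):
        link = LINK_RE.search(chunk)
        title = TITLE_RE.search(chunk)
        if link and title:
            items.append((link.group(), title.group()))
    return items

print(extract_items(SAMPLE_HTML))
```

Regular expressions are enough for markup this regular, but they break easily when the page layout changes; an HTML parser would be the sturdier choice for anything longer-lived.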
Source code
#coding:utf-8
#--------------------------------------------------
# Program: fetch the Douban Top 250 movies
# Author: lazyboy
# Blog: http://blog.lazyboy.co/
# Date: 2014-12-20
# Language: Python 2.7
#--------------------------------------------------
import re
import requests

# Starting URL
url = 'http://movie.douban.com/top250'

# Function: get the movie links and titles from one page
def getlists(u):
    links = []
    titles = []
    r = requests.get(u)
    if r.status_code == 200:
        t = r.content
        # Isolate the <ol class="grid_view"> block that holds the movie list
        p = re.compile(r'(?<=<ol\sclass="grid_view">)(.|\n)+?(?=</ol>)')
        m = p.search(t)
        if m:
            alllists = m.group()
            # Split the block into one chunk per <li> entry
            p2 = re.compile(r'(?<=</li>)\n.+?(?=<li>)')
            m2 = p2.split(alllists)
            # First href and first title span inside each chunk
            p3 = re.compile(r'(?<=href=").+?(?=")')
            p4 = re.compile(r'(?<=class="title">).+?(?=</span>)')
            for i in range(0,len(m2)):
                m3 = p3.search(m2[i])
                m4 = p4.search(m2[i])
                if m3 and m4:
                    links.append(m3.group())
                    titles.append(m4.group())
    # Both lists are empty if the request failed or nothing matched
    return (links,titles)
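
# Function: get the URL of the next page
def nexturl(u):
    r = requests.get(u)
    if r.status_code == 200:
        t = r.content
        p = re.compile(r'(?<=rel="next"\shref=").+?(?=")')
        m = p.search(t)
        if m:
            # The href is relative (a query string), so prepend the base URL
            return 'http://movie.douban.com/top250' + m.group()
    # No next-page link (last page) or the request failed
    return None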
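
l,t = getlists(url)
# Keep going while there is a next-page link
n = nexturl(url)
while n:
    url = n
    a,b = getlists(url)
    l,t = l+a,t+b
    # Look up the next link once per page instead of calling nexturl twice
    n = nexturl(url)
# The links end up in list l and the titles in list t
# Print them in the given format
for i in range(0,len(l)):
    # Titles are re-encoded from UTF-8 to GBK, presumably for a Chinese Windows console
    print '%s. [%s](%s)' % (str(i+1),t[i].decode('utf-8').encode('gbk'),l[i])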
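
The listing above requests every page in both `getlists` and `nexturl`. As a rough sketch of how the pagination loop could fetch each page only once, the version below passes the downloaded HTML to the next-link lookup instead of the URL. The names `next_url`, `crawl`, `BASE`, and `FAKE_PAGES` are hypothetical, and an in-memory dictionary stands in for real HTTP responses so the example runs without network access.

```python
import re

# Same idea as the pattern in nexturl: grab the href that follows rel="next".
NEXT_RE = re.compile(r'(?<=rel="next"\shref=").+?(?=")')

def next_url(base, html):
    """Return the absolute URL of the next page, or None on the last page."""
    m = NEXT_RE.search(html)
    return base + m.group() if m else None

def crawl(base, fetch):
    """Visit pages starting at base, calling fetch exactly once per page."""
    visited = []
    url = base
    while url:
        html = fetch(url)
        visited.append(url)
        url = next_url(base, html)
    return visited

# Hypothetical in-memory "site" standing in for real HTTP responses.
BASE = 'http://example.com/top'
FAKE_PAGES = {
    BASE: '<link rel="next" href="?start=25"/>',
    BASE + '?start=25': '<link rel="next" href="?start=50"/>',
    BASE + '?start=50': '<p>last page</p>',
}

print(crawl(BASE, FAKE_PAGES.get))
```

In a real run, `fetch` would be a small wrapper around `requests.get` that returns the page body, and the item extraction would happen on the same `html` inside the loop.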