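I've seen plenty of Python crawlers online that scrape the Douban Movie Top 250, so I figured I'd write one myself for fun.
1. Preparation:
Looking at the Douban URL, the start parameter goes up by 25 each time you move to the next page.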
[Screenshot: 捕獲.PNG]
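So we only need to add 25 to the start value in the URL for each new page.
The fields we scrape are: movie title / rank / director and cast info / score / slogan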
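The paging rule described above can be sketched as follows. This is only a minimal illustration and not part of the original script: the helper name `page_urls` and its parameters `total` and `page_size` are hypothetical, while the URL pattern is the one used in the code below.

```python
BASE_URL = "https://movie.douban.com/top250?start={}&filter="

def page_urls(total=250, page_size=25):
    """Build one URL per page by stepping the start offset by page_size."""
    return [BASE_URL.format(start) for start in range(0, total, page_size)]

urls = page_urls()
print(len(urls))   # number of pages
print(urls[1])     # URL of the second page
```

Generating the offsets with `range` avoids typing out the ten start values by hand.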
import pandas as pd
import requests
from lxml import etree

# Scrape the Douban Top 250 movies into an Excel sheet as a small crawler demo
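def crawler(num):
    """
    Fetch and parse one page of movie data.
    :param num: start offset of the page (0, 25, ..., 225)
    :return: parsed lxml HTML tree
    """
    base_url = "https://movie.douban.com/top250?start={}&filter=".format(num)
    # Assumption: Douban tends to reject requests without a browser-like User-Agent
    headers = {"User-Agent": "Mozilla/5.0"}
    r = requests.get(base_url, headers=headers, timeout=10)
    r.encoding = 'utf-8'
    res = r.text
    html = etree.HTML(res)
    return html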
def analysis(html):
    film_list = []
    # Each page lists 25 movies as li[1] .. li[25]
    for j in range(1, 26):
        rank_data = html.xpath('//*[@id="content"]/div/div[1]/ol/li[{}]/div/div[1]/em/text()'.format(j))[0]
        film_name = html.xpath('//*[@id="content"]/div/div[1]/ol/li[{}]/div/div[2]/div[1]/a/span[1]/text()'.format(j))[0]
        # normalize-space() returns a string, so no [0] is needed here
        film_star = html.xpath('normalize-space(//*[@id="content"]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/p[1]/text()[1])'.format(j))
        score = html.xpath('//*[@id="content"]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/div/span[2]/text()'.format(j))[0]
        slogan = html.xpath('//*[@id="content"]/div/div[1]/ol/li[{}]/div/div[2]/div[2]/p[2]/span/text()'.format(j))
        # Some movies have no slogan, so fall back to a placeholder
        if len(slogan) == 0:
            slogan_value = "No slogan"
        else:
            slogan_value = slogan[0]
        film_list.append((rank_data, film_name, film_star, score, slogan_value))
    return film_list
def run_data(start_list):
    all_film_list = []
    for i in start_list:
        crawler_data = crawler(i)
        analysis_data = analysis(crawler_data)
        for film_msg in analysis_data:
            all_film_list.append(film_msg)
    all_film_dataframe = pd.DataFrame(all_film_list, columns=["Rank", "Title", "Director/Cast", "Score", "Slogan"])
    print(all_film_dataframe)
    # Recent pandas versions can no longer write .xls; .xlsx needs the openpyxl package
    all_film_dataframe.to_excel("D:/work/film.xlsx", index=False)
    print("Movie download complete")
def main():
    """
    Entry point.
    :return:
    """
    start_list = [0, 25, 50, 75, 100, 125, 150, 175, 200, 225]
    # run_data returns nothing, so there is no result to keep
    run_data(start_list)

if __name__ == '__main__':
    main()
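One thing to note about `analysis`: most of its XPath results are indexed with `[0]`, which raises `IndexError` if a node is missing, and only the slogan is checked for an empty result. A small helper could apply the same fallback to every field. This is just a sketch: `first_or_default` is a hypothetical name that does not appear in the original code, and the plain lists below stand in for what `html.xpath(...)` returns.

```python
def first_or_default(values, default=""):
    """Return the first item of an XPath-style result list, or default if it is empty."""
    return values[0] if values else default

# Plain lists stand in for XPath results here
score = first_or_default(["9.7"])
slogan = first_or_default([], "No slogan")
print(score, slogan)
```

With a helper like this, the `if len(slogan)==0` branch would collapse into a single call, and a missing score or title would no longer crash the whole run.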
The result looks like this:
[Screenshot: 捕獲.PNG]
Bye~