Target URL: https://music.douban.com/top250
Fields scraped: song title, performer, genre, release date, publisher, rating
Scraping method: open each detail page and parse it with lxml and re.
Storage: MongoDB
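- The actor, style, publish_time, and publisher fields are extracted with regular expressions. Compared with locating the information by tag position, matching on the field label finds the right text more precisely and reduces mismatches.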
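- Missing information is handled with a check of the form
      if len(publishers) == 0:
          publisher = "未知"
      else:
          publisher = publishers[0].strip()
  re.findall returns an empty list when the label does not appear on the page, so the empty case falls back to the placeholder "未知" ("unknown").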
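The two points above can be sketched on a small, made-up HTML fragment. The helper name extract_field and the sample markup are assumptions for illustration only; they imitate the "label, then bare text, then <br />" layout that the regular expressions in this post rely on, not Douban's exact page source.

```python
import re

# Hypothetical fragment: each field is a label span followed by bare text.
SAMPLE_HTML = (
    '<span class="pl">流派:</span> 流行 <br />\n'
    '<span class="pl">出版者:</span> Sony Music <br />'
)


def extract_field(html_text, label, default="未知"):
    """Return the text that follows `label`, or `default` if the label is absent."""
    pattern = re.escape(label) + r":</span>(.*?)<br />"
    matches = re.findall(pattern, html_text, re.S)
    # re.findall yields an empty list when nothing matches.
    if len(matches) == 0:
        return default
    return matches[0].strip()


genre = extract_field(SAMPLE_HTML, "流派")
publisher = extract_field(SAMPLE_HTML, "出版者")
missing = extract_field(SAMPLE_HTML, "介质")  # label not present in the sample
print(genre, publisher, missing)
```

Anchoring the pattern on the label text means the match does not depend on how many spans or line breaks precede the field, which is where position-based lookups tend to go wrong.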
import requests
from lxml import etree
import re
import pymongo
import time
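def get_details_url(url):
    # Return the detail-page URLs listed on one Top 250 index page.
    # Uses the module-level `headers` defined in the main block.
    r = requests.get(url, headers=headers)
    html = etree.HTML(r.text)
    song_urls = html.xpath('//a[@class="nbg"]/@href')
    return song_urls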
def get_info(url):
    # Fetch one detail page, extract its fields, and store them in MongoDB.
    # Uses the module-level `headers` and `topmusic` defined in the main block.
    r = requests.get(url, headers=headers)
    html = etree.HTML(r.text)
    name = html.xpath('//div[@id="wrapper"]/h1/span/text()')[0]
    actor = re.findall("表演者:.*?>(.*?)</a>", r.text, re.S)[0]
    # re.findall returns an empty list when a label is missing,
    # so fall back to "未知" ("unknown").
    styles = re.findall(r"流派:</span> (.*?)<br />", r.text, re.S)
    if len(styles) == 0:
        style = "未知"
    else:
        style = styles[0].strip()
    # Douban pages are served in Simplified Chinese, so the label is written 发行时间 here.
    publish_time = re.findall(r"发行时间:</span> (.*?)<br />", r.text, re.S)[0].strip()
    publishers = re.findall(r"出版者:</span> (.*?)<br />", r.text, re.S)
    if len(publishers) == 0:
        publisher = "未知"
    else:
        publisher = publishers[0].strip()
    score = html.xpath('//strong[@class="ll rating_num"]/text()')[0]
    # print(name, actor, style, publish_time, publisher, score)
    info = {
        '歌曲名': name,
        '表演者': actor,
        '流派': style,
        '發行時間': publish_time,
        '出版者': publisher,
        '評分': score
    }
    topmusic.insert_one(info)  # insert the record into the MongoDB collection
if __name__ == "__main__":
    # Connect to MongoDB; the database and collection are created on the first insert.
    client = pymongo.MongoClient('localhost', 27017)
    mydb = client['mydb']
    topmusic = mydb['topmusic']
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'}
    # 10 list pages in total, 25 entries per page
    url_list = ['https://music.douban.com/top250?start={}'.format(i * 25) for i in range(0, 10)]
    for url in url_list:
        song_urls = get_details_url(url)
        for song_url in song_urls:
            get_info(song_url)
            time.sleep(2)  # pause between requests to avoid overloading the site
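The list-page URLs used in the main block follow a fixed offset pattern: the start parameter advances by 25 for each of the 10 pages. A minimal sketch of that pagination, runnable without network access, is shown below. The names BASE_URL, PAGE_SIZE, TOTAL_ENTRIES, and build_page_urls are hypothetical and introduced only for this illustration.

```python
BASE_URL = "https://music.douban.com/top250?start={}"
PAGE_SIZE = 25       # entries shown on each list page
TOTAL_ENTRIES = 250  # size of the Top 250 chart


def build_page_urls(total=TOTAL_ENTRIES, page_size=PAGE_SIZE):
    """Return one list-page URL per offset: 0, 25, 50, ... up to total."""
    return [BASE_URL.format(offset) for offset in range(0, total, page_size)]


page_urls = build_page_urls()
print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```

Deriving the offsets from the total and the page size keeps the page count from being hard-coded, which helps if the same loop is reused for a chart of a different length.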