I'm new to Python web scraping and keep running into problems. Today's is a big one. The goal is simple: use XPath to scrape the Douban Music Top 250 and store the results in MySQL.
1. Creating the database table:
CREATE TABLE dbmusic(
    name TEXT,
    singer TEXT,
    rate TEXT,
    url TEXT
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
2. Scraper code (using XPath):
from lxml import etree
import requests
import time
import pymysql

conn = pymysql.connect(host='localhost', user='root', passwd='******', db='testdb', port=3306, charset='utf8')
cursor = conn.cursor()

# One URL per page of 25 entries, 10 pages in total
urls = ['https://music.douban.com/top250?start={}'.format(str(i)) for i in range(0, 250, 25)]
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

for url in urls:
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    infos = selector.xpath('//tr[@class="item"]')
    for info in infos:
        name = info.xpath('td/a[@title]')[0]
        singer = name  # I couldn't work out how to extract the singer, so this is a placeholder for now. Advice welcome!
        rate = info.xpath('td/div/div/span[2]/text()')[0]
        url = info.xpath('td/div/a/@href')[0]
        cursor.execute('use testdb')
        cursor.execute(
            "insert into dbmusic(name,singer,rate,url) values(%s,%s,%s,%s)",
            (str(name), str(singer), str(rate), str(url))
        )
        print('succeed')
    time.sleep(2)  # pause between pages

conn.commit()
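Result: the data can be scraped!
However, when I run SELECT name in the database, the values turn out to be things like
"Element a at 0x5408188".
The url and rate columns are correct. I suspect the problem is the XPath used to extract name, but I have checked it against the page source and changed it several times, and the result is always the same.
Any advice from the experts would be appreciated!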
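A likely cause, offered as a sketch and not a confirmed fix: in `td/a[@title]`, the `[@title]` part is a predicate. It selects `<a>` elements that have a title attribute, so the result is the element object, and calling `str()` on it gives its repr ("Element a at 0x..."). Selecting the attribute value needs `td/a/@title`, in the same way that `@href` already works for url. The snippet below shows the difference on a small inline HTML sample. The sample markup, the `parse_row` helper, and the `td/div/p/text()` path for the singer are assumptions for illustration; the real Douban page may be structured differently.

```python
from lxml import etree

# Hypothetical, simplified row modeled on what the XPaths in the post imply.
# The real Douban markup may differ.
SAMPLE_HTML = """
<table>
  <tr class="item">
    <td><a href="https://music.douban.com/subject/0000000/" title="Example Artist - Example Album"></a></td>
    <td>
      <div>
        <a href="https://music.douban.com/subject/0000000/">Example Album</a>
        <p>Example Artist / 2000-01-01 / Album / CD / Pop</p>
        <div><span></span><span>9.1</span></div>
      </div>
    </td>
  </tr>
</table>
"""


def parse_row(info):
    """Extract fields from one <tr class="item"> row (hypothetical helper)."""
    # Original expression: selects the <a> element itself, not its title text.
    element = info.xpath('td/a[@title]')[0]
    # Selecting the attribute returns its string value instead.
    name = info.xpath('td/a/@title')[0]
    # Assumption: the first "/"-separated field of the <p> text is the singer.
    singer = info.xpath('td/div/p/text()')[0].split('/')[0].strip()
    rate = info.xpath('td/div/div/span[2]/text()')[0]
    link = info.xpath('td/div/a/@href')[0]
    return element, name, singer, rate, link


selector = etree.HTML(SAMPLE_HTML)
rows = [parse_row(info) for info in selector.xpath('//tr[@class="item"]')]

for element, name, singer, rate, link in rows:
    print(str(element))  # repr of the element, which is what ended up in MySQL
    print(name, singer, rate, link)
```

If the attribute approach works on the real page, only the `name` line in the scraper needs to change. The singer path should be checked against the actual page source before relying on it.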