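1. Preface
The NetEase Cloud Music web client encrypts its communication with the server. I'm not skilled enough to break that encryption, so I decided to crawl the site with selenium + phantomjs instead.
selenium is installed with pip install selenium; for phantomjs, just download the package from its official site.
I used Python 3.6 and a MySQL database on Win7, though in theory the code should also run on linux and mac.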
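2. Crawling approach
When you open a song's page, the comments for that song appear at the bottom of the page. One page usually cannot hold all of them, so the crawler has to simulate clicks on "next page" and collect the comments page by page. Since the goal was only to grab some data and learn along the way, I scraped just the user id and the user nickname and stored them in a MySQL table made up of userid, nickname and music. The page content that selenium has parsed is also cached locally, in case it turns out to be useful later.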
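3. Key points
A few things need attention while crawling:
- 1. When selenium + phantomjs parses a page that uses an iframe, you have to switch into that iframe with switch_to.frame(); otherwise the element you are looking for may not be found.
- 2. When the page issues asynchronous ajax requests, it is best to wait a moment after each action before looking up elements; otherwise the data may not be there yet.
- 3. Some actions, such as clicking "next page", only make the page issue another ajax request, so the URL does not need to be loaded again. Others do need a reload, for example when all comments of one song have been crawled and the crawler moves on to the next song.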
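Point 2 can also be handled by polling instead of a fixed sleep. Below is a minimal sketch of that idea in plain Python; wait_until and FakePage are names I made up for illustration, and FakePage only stands in for a real driver so the snippet runs without a browser. With selenium, the condition would be a function that looks up the comment elements and returns the (possibly empty) list.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page whose comments arrive via ajax."""

    def __init__(self, ready_after):
        self.calls = 0
        self.ready_after = ready_after

    def find_comments(self):
        # Returns an empty list until the "ajax request" has finished
        self.calls += 1
        return ["comment"] if self.calls >= self.ready_after else []


page = FakePage(ready_after=3)
comments = wait_until(page.find_comments, timeout=5.0, interval=0.01)
print(comments, page.calls)
```

Compared with time.sleep(1), this returns as soon as the data shows up and fails loudly when it never does, at the cost of a little more code.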
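4. Walking through the code
There are two files: sqlinstance.py, which handles the database, and the main program spider_main.py.
sqlinstance uses the sqlalchemy framework. For now it has a single table that records which user commented on which song.
The code is as follows: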
from sqlalchemy import create_engine, Column, Integer, String, \
    UniqueConstraint, BigInteger
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class MusicCmt(Base):
    __tablename__ = "t_musiccmts"
    id = Column(Integer, primary_key=True, autoincrement=True)
    userid = Column(BigInteger)
    nickname = Column(String(20), nullable=False)
    musicid = Column(BigInteger, nullable=False)
    # A nickname is recorded at most once per song
    __table_args__ = (UniqueConstraint('nickname', 'musicid'),)


# Set up the connection (the charset in the URL sets the connection encoding)
engine = create_engine('mysql+pymysql://pig:123456@localhost:3306/cloudmusic?charset=utf8',
                       echo=False, pool_size=50, pool_recycle=3600)
DBSession = sessionmaker(bind=engine)

# Create the table from code (skipped if it already exists)
try:
    MusicCmt.__table__.create(engine, checkfirst=True)
except Exception as e:
    print(e)


class SqlInstance:
    def addmark(self, **kwargs):
        session = DBSession()
        try:
            # Insert one row
            session.add(
                MusicCmt(userid=kwargs['userid'], musicid=kwargs['musicid'], nickname=kwargs["nickname"]))
            session.commit()
        except Exception as e:
            # Usually a duplicate (nickname, musicid) pair: undo the insert and carry on
            session.rollback()
            print(e)
        finally:
            session.close()


sqlInstance = SqlInstance()
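After reading this code you should be able to do the basics with sqlalchemy: create a MySQL table and insert rows into it. For anything more complex, see the official documentation; I won't go into it here.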
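To see what the unique constraint on (nickname, musicid) does without setting up MySQL, here is a small sketch that uses the standard library's sqlite3 as a stand-in. This is not the sqlalchemy code above: the table layout mirrors it, but the addmark function here is a simplified illustration and the ids and nicknames are made-up sample values.

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE t_musiccmts (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        userid INTEGER,
        nickname VARCHAR(20) NOT NULL,
        musicid INTEGER NOT NULL,
        UNIQUE (nickname, musicid)
    )
    """
)


def addmark(userid, musicid, nickname):
    """Insert one row; return False if this nickname/song pair already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO t_musiccmts (userid, nickname, musicid) VALUES (?, ?, ?)",
                (userid, nickname, musicid),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = addmark(1001, 186016, "alice")
second = addmark(1001, 186016, "alice")  # same nickname on the same song
third = addmark(1001, 186017, "alice")   # same nickname on a different song
row_count = conn.execute("SELECT COUNT(*) FROM t_musiccmts").fetchone()[0]
print(first, second, third, row_count)
```

The duplicate insert is rejected by the database itself, which is why the crawler can simply retry a page without checking for existing rows first.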
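Next comes the main program. It locates elements with quite a lot of xpath, so if you are not familiar with the syntax, read up on it first. Then open a NetEase Cloud Music page yourself, use Chrome's "Inspect", and locate the elements with xpath queries to check both the code and your understanding of the syntax.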
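If you want to try the xpath offline first, the sketch below runs the nickname query against a tiny hand-written snippet using the standard library's xml.etree.ElementTree. The markup is a made-up approximation of the comment list: only the class names 'cnt f-brk', 's-fc7' and 'u-page' come from the crawler code, and the rest (cmmts, itm, zpgi, the hrefs) is assumed for illustration. ElementTree supports only a subset of xpath, so functions such as contains() will not work here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is far larger
SAMPLE = """
<div class="cmmts">
  <div class="itm">
    <div class="cnt f-brk"><a class="s-fc7" href="/user/home?id=1001">alice</a>: nice song</div>
  </div>
  <div class="itm">
    <div class="cnt f-brk"><a class="s-fc7" href="/user/home?id=1002">bob</a>: on repeat</div>
  </div>
  <div class="u-page">
    <a class="zpgi">1</a><a class="zpgi">2</a><a class="zbtn">next</a>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE.strip())
# Same query shape as in the crawler: nickname links inside comment bodies
links = root.findall(".//div[@class='cnt f-brk']/a[@class='s-fc7']")

users = []
for link in links:
    # The user id is taken from the link target, as in the main program
    match = re.search(r"id=(\d+)", link.get("href", ""))
    users.append((link.text, match.group(1) if match else None))

print(users)
```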
The main program is as follows:
import os
import re
import time
import configparser

from selenium import webdriver
from selenium.webdriver import DesiredCapabilities

import sqlinstance

# The target music ids are kept in a file; for a larger crawl they could be fetched dynamically
config = configparser.ConfigParser()
config.read('myselectMusic.ini')
musiclist = config['nemusic']['id'].split(",")
# Give PhantomJS a custom userAgent so the crawler is less likely to be blocked
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
# Use a very tall window so the page loads its data without needing to be scrolled
driver.set_window_size(1920, 5000)
# Directory for the cached pages; create it on the first run
os.makedirs("saves", exist_ok=True)
for musicid in musiclist:
    # Check what has already been crawled to avoid repeating work:
    # a "<musicid>_final.txt" file means this song is finished
    savedfiles = os.listdir("saves")
    thissaved = [a for a in savedfiles if a == musicid + "_final.txt"]
    if len(thissaved) > 0:
        continue
    # Load the first page of the song
    driver.get("http://music.163.com/#/song?id=%s" % (musicid))
    # Wait for the page's ajax requests to finish
    time.sleep(1)
    ele = driver.find_element_by_class_name("g-iframe")
    # The page nests its content in an iframe. To locate elements inside it
    # we must switch to it first, otherwise they will not be found
    driver.switch_to.frame(ele)
    pagenum = 1
    # Work out the highest page number already cached for this song
    try:
        savedpages = [int(a.replace(musicid + "_", "").replace(".txt", ""))
                      for a in savedfiles if a.startswith(musicid + "_")]
    except ValueError:
        savedpages = []
    maxpage = max(savedpages) if savedpages else 0
    if maxpage > 0:
        while True:
            # This xpath finds the pager links at the bottom of the page (1, 2, 3 ...);
            # clicking one jumps to that page.
            # In the NetEase Cloud Music layout the last link is "next page", the
            # second to last jumps to the last page, and the third to last (clicked
            # below) shifts the visible range forward, e.g. from 1-10 to 3-13
            pagejumps = driver.find_elements_by_xpath(
                "//div[contains(@class,'u-page')]/a")
            target = [a for a in pagejumps if a.text == "%d" % (maxpage + 1)]
            if target:
                pagenum = maxpage + 1
                target[0].click()
                time.sleep(1)
                break
            else:
                pagejumps[-3].click()
                time.sleep(1)
    while True:
        # This xpath finds the nickname links of all comments on the current page
        cmts = driver.find_elements_by_xpath("//div[@class='cnt f-brk']/a[@class='s-fc7']")
        for aelemnt in cmts:
            urluid = aelemnt.get_attribute("href")
            print("nickname: %s url: %s" % (aelemnt.text, urluid))
            try:
                # Parse the user uid out of the nickname's link target with a regex
                saveuid = re.findall(r"id=(\d+)", urluid)[0]
            except (IndexError, TypeError):
                saveuid = None
            # Store the record in the database
            sqlinstance.sqlInstance.addmark(userid=saveuid, musicid=musicid, nickname=aelemnt.text)
        # Find the "next page" button (the text must match the link text on the site)
        nextpagebtn = driver.find_element_by_xpath("//div[contains(@class,'u-page')]/a[text()='下一页']")
        # Snapshot the page HTML (named page_html so the built-in str is not shadowed)
        page_html = driver.find_element_by_xpath("//html").get_attribute("innerHTML")
        # Check whether the current page is the last one
        if "js-disable" in nextpagebtn.get_attribute("class"):
            print("last page")
            # Cache the crawled page
            with open("saves/%s_%s.txt" % (musicid, "final"), "w", encoding='utf-8') as f:
                f.write(page_html)
            break
        else:
            nextpagebtn.click()
            with open("saves/%s_%s.txt" % (musicid, pagenum), "w", encoding='utf-8') as f:
                f.write(page_html)
            pagenum += 1
            time.sleep(2)
# Don't forget to quit the browser
driver.quit()
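The resume logic in the main program boils down to reading the file names in the saves directory. Pulled out into a standalone function it might look like the sketch below. resume_point is a hypothetical helper, not part of the original program, and the file names are sample values that follow the "<musicid>_<page>.txt" and "<musicid>_final.txt" pattern the crawler writes.

```python
import re


def resume_point(saved_files, music_id):
    """Return "done" if the song is finished, else the highest cached page (0 if none)."""
    if "%s_final.txt" % music_id in saved_files:
        return "done"
    # Match the whole name so that song 18601 does not pick up files of song 186016
    pattern = re.compile(r"^%s_(\d+)\.txt$" % re.escape(music_id))
    pages = [int(m.group(1)) for m in map(pattern.match, saved_files) if m]
    return max(pages, default=0)


files = [
    "186016_1.txt", "186016_2.txt", "186016_10.txt",
    "2001_1.txt", "2001_final.txt",
    "18601_7.txt",
]
print(resume_point(files, "186016"))
print(resume_point(files, "2001"))
print(resume_point(files, "999"))
```

Keeping this part free of selenium calls makes it easy to check on its own, which is handy because a wrong resume point silently skips or repeats pages.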
Then just run it with Python. The result is shown in the figure.
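Before the first run, myselectMusic.ini has to exist. Judging from how the main program reads it, it needs a [nemusic] section with a comma-separated id entry. The sketch below shows one such file as a string and how it parses; the song ids are made-up sample values, and stripping the spaces around each id is my addition (the original split(",") keeps them).

```python
import configparser

# Sample contents for myselectMusic.ini (ids are placeholders)
SAMPLE_INI = """
[nemusic]
id = 186016, 186017,186018
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_INI)

# Split on commas and drop surrounding whitespace and empty entries
musiclist = [s.strip() for s in config["nemusic"]["id"].split(",") if s.strip()]
print(musiclist)
```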