Parse a local web page and extract each item's title, image URL, price, review count, and star rating.
The web page looks like this:
Code
from bs4 import BeautifulSoup

# Raw string, so the backslashes in the Windows path are not read as escapes.
with open(r'D:\宣宣\homework/index.html', 'r') as wb_data:
    soup = BeautifulSoup(wb_data, 'lxml')  # parse the page content

images = soup.select('body > div > div > div.col-md-9 > div > div > div > img')
titles = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4 > a')
prices = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4.pull-right')
reviews = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p.pull-right')
stars = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')
# print(images, titles, prices, reviews, stars, sep='\n--------------\n')

# Each list holds one entry per item, so zip() pairs up the fields item by item.
for title, image, price, review, star in zip(titles, images, prices, reviews, stars):
    data = {
        'title': title.get_text(),    # text inside the <a> tag
        'image': image.get('src'),    # the src attribute holds the image URL
        'price': price.get_text(),
        'review': review.get_text(),
        # count the filled-star spans inside the rating paragraph
        'star': len(star.find_all("span", class_='glyphicon glyphicon-star'))
    }
    print(data)
# Selectors as copied from the browser, kept here for reference:
'''
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > img
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4:nth-child(2) > a
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.ratings > p:nth-child(2) > span:nth-child(3)
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.ratings > p.pull-right
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4.pull-right
'''
Output
Summary
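1. To scrape web page data with Python, you first need a basic understanding of the page itself: how to find, in the browser, the HTML behind a given image or piece of text. The useful information can then be extracted with the selector obtained from "Copy CSS selector".
2. For the star rating, the code uses stars = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)'), whereas the selector copied from the browser was body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.ratings > p:nth-child(2) > span:nth-child(3). At first I did not remove the trailing span:nth-child(3), and the result was star = 0. I later realized that to count how many stars there are in total, the selector has to stop at the parent tag, p:nth-child(2), so that every star inside it is counted. nth-child also causes an error here, so it should be changed to nth-of-type(2), which matches every element that is the second child of its type within its parent.
3. By repeatedly making mistakes, comparing against the reference answer, and checking the documentation, my understanding of the code deepened. Getting the code to run in the end was a real pleasure, and it keeps my motivation to learn going.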
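The star-counting point in item 2 can be reproduced on a small inline snippet. This is only a sketch: the markup in SAMPLE_HTML is hypothetical, modelled on the selectors used above rather than copied from the real page, and the built-in 'html.parser' is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the real page): one ratings block
# with four filled stars and one empty star.
SAMPLE_HTML = """
<div class="ratings">
  <p class="pull-right">18 reviews</p>
  <p>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
  </p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selector that still ends at a single span: find_all() only searches
# inside that span, which has no children, so nothing is counted.
single_span = soup.select("div.ratings > p:nth-of-type(2) > span:nth-of-type(3)")[0]
stars_in_span = len(single_span.find_all("span", class_="glyphicon glyphicon-star"))

# Selector that stops at the parent <p>: find_all() now sees every span,
# and the exact class string skips the "glyphicon-star-empty" one.
star_paragraph = soup.select("div.ratings > p:nth-of-type(2)")[0]
star_count = len(star_paragraph.find_all("span", class_="glyphicon glyphicon-star"))

print(stars_in_span, star_count)  # prints: 0 4
```

The first count is 0 for the same reason described in item 2: the selected element is one star, not the container of all the stars.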