Day one of the hands-on plan: I scraped a local web page.
The final result looks like this:
Paste_Image.png
My code:
from bs4 import BeautifulSoup

info = []  # collects one dict per product

# Parse the local page with the lxml parser
with open('E:/PycharmProjects/homework2/homework2/1_2_homework_required/index.html','r') as data:
    Soup = BeautifulSoup(data,'lxml')

# One CSS selector per field; each call returns a list with one entry per product
images = Soup.select('body > div > div > div.col-md-9 > div > div > div > img')
titles = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4 > a')
prices = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4.pull-right')
grades = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')
counts = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p.pull-right')
# print(images,titles,grades,prices,counts)

# Walk the five lists in parallel and build one record per product
for title,image,price,grade,count in zip(titles,images,prices,grades,counts):
    data1 = {
        'title' : title.get_text(),
        'image' : image.get('src'),
        'price' : price.get_text(),
        # rating = number of filled-star spans under this product's ratings <p>
        'grade' : len(grade.find_all("span" , class_ = "glyphicon glyphicon-star" )),
        'count' : count.get_text()
    }
    print(data1)
    info.append(data1)
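To check the star-counting step without the course's index.html, here is a minimal sketch on an inline snippet. The HTML is made up to resemble one product's rating block (an assumption, not the real page), and html.parser is used instead of lxml so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one product's rating block.
html = """
<div class="ratings">
  <p class="pull-right">15 reviews</p>
  <p>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
  </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# grades: list of matching <p> tags (here only one product, so one entry)
grades = soup.select("div.ratings > p:nth-of-type(2)")
grade = grades[0]

# class_ with a multi-word string matches the exact class attribute value,
# so the "glyphicon glyphicon-star-empty" spans are not counted.
stars = grade.find_all("span", class_="glyphicon glyphicon-star")
star_count = len(stars)

review_text = soup.select("div.ratings > p.pull-right")[0].get_text()
print(star_count, review_text)
```

Because the match is on the exact attribute string, it would miss a span whose classes are written in a different order; selecting with `grade.select("span.glyphicon-star")` is one alternative that does not depend on the order.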
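Summary
- BeautifulSoup can work with three HTML parsers: the built-in html.parser, lxml (used here), and html5lib
- A copied path containing :nth-child(1) > img pins the selector to one specific child node; to grab all elements, delete that part or change it to nth-of-type
- Steps: 1. parse the page into a soup; 2. copy the CSS path (the format has to be exact, especially the spaces); 3. pick out the information you need; 4. build a dict per item and collect it with info.append(data1)
- () tuple, [] list, {} dict
- Difference between grade and grades: grades is the list of parent nodes, one per product on the page; grade is a single parent node, and grade.find_all(...) returns the list of star spans under it, so its length is the rating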
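The nth-child point above can be sketched on a small inline page. The markup and file names are invented for illustration, and the example assumes a Beautiful Soup version recent enough (4.7 or later) that select() accepts :nth-child; older versions only accepted nth-of-type.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three product cards in a row.
html = """
<div class="row">
  <div class="thumbnail"><img src="a.jpg"></div>
  <div class="thumbnail"><img src="b.jpg"></div>
  <div class="thumbnail"><img src="c.jpg"></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Path as a browser might copy it: pinned to the first card only.
first_only = soup.select("div.row > div:nth-child(1) > img")

# Same path with the positional filter removed: matches every card.
all_images = soup.select("div.row > div > img")

first_srcs = [img.get("src") for img in first_only]
all_srcs = [img.get("src") for img in all_images]
print(first_srcs)
print(all_srcs)
```

Dropping the positional filter is what lets one selector return a list with one entry per product, which is what the zip() loop in the main script relies on.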