Some key information has been scraped from a web page. The page is shown below:
The information we need is "item title", "image", "review number", "price", and "star". The result is shown here:
The general process for web crawling can be described as follows (from the course website):
1) An HTML file can be opened for reading ('r') or writing ('w') with the open() function. There are two ways to read it:
(1) file = open('absolute or relative file path', 'r'); print(file.read()); file.close()
(2) with open('absolute or relative file path', 'r') as file: print(file.read())
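As a minimal sketch of the two ways above, the snippet below writes a small placeholder page to a temporary file and reads it back both ways. The file name sample.html, its contents, and the helper names read_explicit and read_with are made up for illustration; in practice the path would point to the saved course page.

```python
import os
import tempfile

# Hypothetical sample page; stands in for the saved course HTML file.
SAMPLE_HTML = "<html><body><h4>Item 1</h4></body></html>"


def read_explicit(path):
    """Way (1): open, read, and close the file by hand."""
    file = open(path, "r", encoding="utf-8")
    try:
        return file.read()
    finally:
        file.close()  # must be called explicitly


def read_with(path):
    """Way (2): the with statement closes the file automatically."""
    with open(path, "r", encoding="utf-8") as file:
        return file.read()


# Write the sample page to a temporary directory so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.html")
with open(sample_path, "w", encoding="utf-8") as file:
    file.write(SAMPLE_HTML)

text_explicit = read_explicit(sample_path)
text_with = read_with(sample_path)
print(text_with)
```

Way (2) is usually preferable because the file is closed even if reading raises an exception.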
2) A unique identifier (i.e., a CSS path) should be found for each piece of information in the HTML file. The relevant browser commands are Inspect and Copy selector.
One example of a CSS path looks like:
"body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)"
    Note: the selector copied from the browser uses "nth-child"; change it to "nth-of-type(n)" for BeautifulSoup, because older versions of select() support only nth-of-type.
3) Pass the CSS path to soup.select('css path') to get the list of matching tags:
stars = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')
Here "stars" is a list.
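To make steps 2 and 3 concrete, here is a small sketch that runs select() on a made-up HTML fragment. The fragment only imitates the structure implied by the selector above (a div.ratings holding two p tags); it is not the real course page, and the shortened selectors and the "html.parser" parser are choices made for this illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating two product cards; not the real course page.
html = """
<div class="ratings">
  <p class="pull-right">18 reviews</p>
  <p>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
  </p>
</div>
<div class="ratings">
  <p class="pull-right">6 reviews</p>
  <p>
    <span class="glyphicon glyphicon-star"></span>
  </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# nth-of-type(2) picks the second <p> inside each div.ratings.
stars = soup.select("div.ratings > p:nth-of-type(2)")
# A class selector picks the first <p>, which holds the review count.
reviews = soup.select("div.ratings > p.pull-right")

print(len(stars))             # how many tags matched
print(reviews[0].get_text())  # text of the first matching tag
```

Because select() returns a list even when only one tag matches, a single result still has to be taken by index or by iteration.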
4) To get the individual results out of the lists, use the zip() function with a for ... in loop to iterate through the lists in parallel:
for title, image, review, price, star in zip(titles, images, reviews, prices, stars):
5) Use the get_text(), get('src'), or get('href') methods to retrieve the desired content from each tag:
data = {
    'title': title.get_text(),    # use get_text() to extract the text
    'image': image.get('src'),    # use get() to extract the image link from the src attribute
    'review': review.get_text(),
    'price': price.get_text(),
    # find_all() returns a list of the spans styled as ★;
    # len() counts the elements in that list, which is the number of stars
    'star': len(star.find_all("span", class_='glyphicon glyphicon-star')) * '★'
}
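The five steps can be put together roughly as follows. This is a sketch under stated assumptions: the HTML is a made-up two-item page, the class names "thumbnail" and "caption" and the shortened selectors are hypothetical, and only "ratings" and "glyphicon glyphicon-star" come from the notes above. A length check is added before zip() because zip() stops at the shortest list, so a selector that misses one item would otherwise misalign the rows silently.

```python
from bs4 import BeautifulSoup

# Made-up page with two product cards; the real page layout may differ.
html = """
<div class="thumbnail">
  <img src="img/pic_0000.jpg" />
  <div class="caption">
    <h4 class="pull-right">$24.99</h4>
    <h4><a href="#">First Item</a></h4>
  </div>
  <div class="ratings">
    <p class="pull-right">65 reviews</p>
    <p>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star"></span>
    </p>
  </div>
</div>
<div class="thumbnail">
  <img src="img/pic_0001.jpg" />
  <div class="caption">
    <h4 class="pull-right">$64.99</h4>
    <h4><a href="#">Second Item</a></h4>
  </div>
  <div class="ratings">
    <p class="pull-right">12 reviews</p>
    <p>
      <span class="glyphicon glyphicon-star"></span>
    </p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One select() call per field; each returns a list of tags.
titles = soup.select("div.caption > h4 > a")
images = soup.select("div.thumbnail > img")
reviews = soup.select("div.ratings > p.pull-right")
prices = soup.select("div.caption > h4.pull-right")
stars = soup.select("div.ratings > p:nth-of-type(2)")

# zip() stops at the shortest list, so verify that all lists have equal length.
lists = [titles, images, reviews, prices, stars]
assert len({len(items) for items in lists}) == 1, "selectors matched different numbers of tags"

results = []
for title, image, review, price, star in zip(titles, images, reviews, prices, stars):
    data = {
        "title": title.get_text(),
        "image": image.get("src"),
        "review": review.get_text(),
        "price": price.get_text(),
        # Count the star spans and repeat the star character that many times.
        "star": len(star.find_all("span", class_="glyphicon glyphicon-star")) * "★",
    }
    results.append(data)
    print(data)
```

Collecting each dictionary in a list (here called results, a name chosen for this sketch) keeps the scraped rows available for later filtering or saving.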