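Scraping product information
Since 58 launched its second-hand goods platform Zhuanzhuan, the scraping approach differs in a few ways from what the instructor presented:
- After the new second-hand platform Zhuanzhuan went live, every listing is a Zhuanzhuan item
- Personal and business listings are no longer distinguished
- The view count loads together with the page and is no longer requested separately
- The new detail page has no posting time, so that field is not scraped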
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# python3.5 vs python2.7
# 58zhuanzhuan
from bs4 import BeautifulSoup
import requests
import time


def geturls(urls):
    """Walk each list page and scrape every item it links to."""
    for url in urls:
        webdata = requests.get(url)
        soup = BeautifulSoup(webdata.text, 'lxml')
        itemlist = soup.select('tr.zzinfo > td.t > a.t')
        # The last breadcrumb link holds the category name.
        nav = getemtext(soup.select('div.nav a')[-1])
        for item in itemlist:
            itemurl = item.get('href')
            title = getemtext(item)
            get_target_info(itemurl, title, nav)
            time.sleep(1)  # pause between detail-page requests


def getemtext(element):
    """Return the element's text with surrounding whitespace removed."""
    return element.get_text().strip()


def get_target_info(url, title='', nav=''):
    """Scrape one detail page; title and nav come from the list page."""
    wbdata = requests.get(url)
    soup = BeautifulSoup(wbdata.text, 'lxml')
    #title = soup.select('div.box_left > div > div > h1')
    looktime = soup.select('span.look_time')[0]
    price = soup.select('span.price_now i')[0]
    # 'palce_li' is kept exactly as written in the original selector.
    place = soup.select('div.palce_li i')[0]
    data = {
        'title': title,
        'nav': nav,
        # Strip the trailing "views" label so only the number remains.
        'looktime': getemtext(looktime).strip(u'次瀏覽'),
        'price': getemtext(price),
        'place': getemtext(place)
    }
    #print(data)
    print(data['title'])
    print('price: ' + data['price'] + ', view: ' + data['looktime'] + ' times' + ', area: ' + data['place'])


if __name__ == "__main__":
    urls = ["http://bj.58.com/pbdn/0/pn{}/".format(pageid) for pageid in range(1, 14)]
    geturls(urls)
    #http://bj.58.com/tushu/pn2
Partial output
微軟平板SURFACE RT
price: 1500, view: 2560 times, area: 北京-豐臺
三星超薄平板,
price: 1200, view: 801 times, area: 北京-通州
iPad1代
price: 680, view: 1333 times, area: 北京-朝陽
轉讓iPadmini2帶發票和包裝盒子16G配件齊全體大
price: 1512, view: 355 times, area: 北京-海淀
95成新16G IPAD4(the new ipad) 第一代高清屏的ipad,現使用無卡頓...
price: 1299, view: 1998 times, area: 北京-通州
全新ipad 沒有注冊的 零磨損 看圖吧
price: 1599, view: 1400 times, area: 北京-大興
蘋果iPad4代賤賣
price: 1200, view: 114 times, area: 北京-順義
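One detail about the looktime cleanup: str.strip() removes any of the listed characters from both ends of the string, not a literal suffix. It works here because the number sits at the front, but a regular expression states the intent more directly. Below is a minimal sketch, assuming the view-count text always contains a single run of digits; parse_view_count is a hypothetical helper name and is not part of the script above.

```python
# -*- coding: utf-8 -*-
import re


def parse_view_count(text):
    """Return the first run of digits in text as an int, or None if there is none.

    Hypothetical helper for illustration; the original script uses str.strip().
    """
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


# Sample text in the same shape as the page's view-count label.
views = parse_view_count(u'2560次瀏覽')
print(views)
```

Returning None instead of raising lets the caller decide what to do when a detail page carries no view count.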
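Summary
- The category and title are taken from the list page and passed to get_target_info() as arguments, which saves extraction time on the detail page
- When printing the scraped results, print(data) outputs the Chinese text as Unicode escape sequences. This is the Python 2.7 behavior, where printing a dict shows the repr of each string. print(data['title']) displays the Chinese characters correctly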
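If the whole record should be printed in readable form, one option is to serialize it with the standard json module and ensure_ascii=False. This is only a sketch of an alternative, not what the script above does; the sample record is copied from the partial output, and the variable name line is made up for the example.

```python
# -*- coding: utf-8 -*-
import json

# Sample record taken from the partial output above.
data = {
    'title': u'iPad1代',
    'price': u'680',
    'looktime': u'1333',
    'place': u'北京-朝陽',
}

# ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX escapes.
line = json.dumps(data, ensure_ascii=False, sort_keys=True)
print(line)
```

With sort_keys=True the fields come out in a stable order, which makes the printed lines easier to compare between runs.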