Below is a summary of what I learned while scraping second-hand housing listings for the Baoshan district from the Lianjia Shanghai site.
Preparation
Python modules used:
- requests
- bs4
- pymongo
- datetime
- time
- random
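Analyzing the page
Open http://sh.lianjia.com/ershoufang/baoshan in Chrome and launch the developer tools.
Each listing sits in its own li element. Next, let's look at the pagination links.
Click through to the next page and you can see that the URL in the browser follows a clear pattern: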
http://sh.lianjia.com/ershoufang/baoshan/d1
http://sh.lianjia.com/ershoufang/baoshan/d2
http://sh.lianjia.com/ershoufang/baoshan/d3
.........
http://sh.lianjia.com/ershoufang/baoshan/d100
Now let's try fetching the links for the first 10 pages:
import requests

for i in range(1, 11):
    # Request each of the first 10 pages and print the URL that was fetched
    r = requests.get('http://sh.lianjia.com/ershoufang/baoshan/d' + str(i))
    print(r.url)
Output
http://sh.lianjia.com/ershoufang/baoshan/d1
http://sh.lianjia.com/ershoufang/baoshan/d2
http://sh.lianjia.com/ershoufang/baoshan/d3
http://sh.lianjia.com/ershoufang/baoshan/d4
http://sh.lianjia.com/ershoufang/baoshan/d5
http://sh.lianjia.com/ershoufang/baoshan/d6
http://sh.lianjia.com/ershoufang/baoshan/d7
http://sh.lianjia.com/ershoufang/baoshan/d8
http://sh.lianjia.com/ershoufang/baoshan/d9
http://sh.lianjia.com/ershoufang/baoshan/d10
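Since the page URLs differ only in the trailing number, they can also be generated up front and fetched later. The following is a minimal sketch of that idea; BASE_URL and build_page_urls are names I made up for illustration, they are not part of the original script.

```python
# Hypothetical helper: builds the page URLs from the d1, d2, d3... pattern
BASE_URL = 'http://sh.lianjia.com/ershoufang/baoshan/d{page}'


def build_page_urls(first=1, last=10):
    """Return the listing-page URLs for pages first..last (inclusive)."""
    return [BASE_URL.format(page=page) for page in range(first, last + 1)]


urls = build_page_urls(1, 10)
print(urls[0])   # first page URL
print(urls[-1])  # last page URL
```

Each URL can then be passed to requests.get(url, timeout=10), and calling r.raise_for_status() before parsing makes a failed request show up immediately instead of producing an empty result.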
Parsing the page
The information to extract is as follows:
- Title: room_title = room.find('div', attrs={'class': 'prop-title'})
- Listing details: room_info = room.find('span', attrs={'class': 'info-col row1-text'})
- Location: room_location = room.find('span', attrs={'class': 'info-col row2-text'})
- Extra info: extra_info = room.find('div', attrs={'class': 'property-tag-container'})
- Total price: room_price = room.find('span', attrs={'class': 'total-price strong-num'})
- Unit price: room_unit_price = room.find('span', attrs={'class': 'info-col price-item minor'})
from bs4 import BeautifulSoup

# r is the response returned by requests.get() for one listing page
soup = BeautifulSoup(r.text, 'html.parser')
# All listings are li children of the ul with class js_fang_list
rooms = soup.find('ul', attrs={'class': 'js_fang_list'})
for room in rooms.find_all('li'):
    room_title = room.find('div', attrs={'class': 'prop-title'}).get_text()
    room_info = room.find('span', attrs={'class': 'info-col row1-text'}).get_text()
    room_location = room.find('span', attrs={'class': 'info-col row2-text'}).find('a').get_text()
    room_price = room.find('span', attrs={'class': 'total-price strong-num'}).get_text()
    room_unit_price = room.find('span', attrs={'class': 'info-col price-item minor'}).get_text()
    extra_info = room.find('div', attrs={'class': 'property-tag-container'}).get_text()
    print(room_title, room_info, room_location, room_price, room_unit_price, extra_info)
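Below is one listing as parsed from the page (translated here):
Kitchen and bathroom both have windows, bedroom with balcony, near the metro, high floor with good natural light
1 bedroom, 1 living room | 44.73 m²
| High floor / 6 stories
| South-facing
葑润华庭 255
Unit price: 57008 yuan/m²
698 m from Qihua Road Station on Line 7
Owned for over two years
Key available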
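The text returned by get_text() still contains line breaks and | separators, as the listing above shows. Here is a minimal sketch of how it could be tidied up before storing it. The helpers split_fields and first_number are my own illustrative names, and the sample strings are assumptions about what the raw (untranslated) page text looks like.

```python
import re


def split_fields(raw):
    """Split a '|'-separated string and collapse whitespace in each part."""
    parts = [re.sub(r'\s+', ' ', part).strip() for part in raw.split('|')]
    return [part for part in parts if part]  # drop empty pieces


def first_number(raw):
    """Return the first number found in the string as a float, or None."""
    match = re.search(r'\d+(?:\.\d+)?', raw)
    return float(match.group()) if match else None


# Assumed sample of the raw text for one listing
raw_info = '1室1厅 | 44.73平\n| 高区/6层\n| 朝南'
fields = split_fields(raw_info)
area = first_number(fields[1])                # area as a number
unit_price = first_number('单价57008元/平')    # unit price as a number
print(fields, area, unit_price)
```

Storing the area and unit price as numbers rather than strings should make later sorting and filtering easier, for example when the data is loaded into pandas.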
Storing the data in MongoDB
MongoDB stores data as documents made up of key-value pairs {key: value}, somewhat similar to JSON.
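To make the comparison with JSON concrete, here is a small sketch that uses only the standard library. The values in room_doc are placeholders, not scraped data. One difference worth knowing: MongoDB's document format (BSON) has a native date type, while plain JSON does not, so a datetime has to be converted to a string before it can be dumped as JSON.

```python
import datetime
import json

# Placeholder document with the same shape as one scraped listing
room_doc = {
    'title': 'sample title',
    'price': '255',
    'time': datetime.datetime(2018, 1, 1, 12, 0, 0),  # hypothetical timestamp
}

# json cannot serialize datetime directly, so fall back to str() for it
as_json = json.dumps(room_doc, default=str, ensure_ascii=False)
print(as_json)
```

With pymongo the same dict can be inserted as it is, and the time field comes back as a datetime object when the document is read again.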
import datetime
from pymongo import MongoClient

# Connect to the local MongoDB server
client = MongoClient('localhost', 27017)
# Select the database (MongoDB creates it on the first write)
db = client.tests
# Select the collection
homes = db.homes
rooms_list = []
# First put the scraped values into a dict
rooms_info = {
    'title': room_title,
    'info': room_info,
    'location': room_location,
    'price': room_price,
    'unit_price': room_unit_price,
    'message': extra_info,
    'time': datetime.datetime.now()
}
rooms_list.append(rooms_info)
# Insert into the database
result = homes.insert_many(rooms_list)
print(result)
Run the code and we can see the data being written to MongoDB:
<pymongo.results.InsertManyResult object at 0x00000260C536AB8>
<pymongo.results.InsertManyResult object at 0x00000260C536AAC>
<pymongo.results.InsertManyResult object at 0x00000260C536AA0>
<pymongo.results.InsertManyResult object at 0x00000260C536AB4>
<pymongo.results.InsertManyResult object at 0x00000260C536AB0>
<pymongo.results.InsertManyResult object at 0x00000260C536A28>
<pymongo.results.InsertManyResult object at 0x00000260C536AC8>
<pymongo.results.InsertManyResult object at 0x00000260C536A08>
<pymongo.results.InsertManyResult object at 0x00000260C536A88>
<pymongo.results.InsertManyResult object at 0x00000260C536A88>
<pymongo.results.InsertManyResult object at 0x00000260C536888>
<pymongo.results.InsertManyResult object at 0x00000260C536A08>
<pymongo.results.InsertManyResult object at 0x00000260C536AC8>
<pymongo.results.InsertManyResult object at 0x00000260C536A48>
<pymongo.results.InsertManyResult object at 0x00000260C536A88>
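Printing the result object only shows its memory address, which does not say much. A possible alternative, sketched below, is to collect all listings from a page and insert them in one call, then report how many documents were stored through result.inserted_ids. The save_rooms helper and the FakeCollection stand-in are hypothetical; the stand-in only exists so that the sketch runs without a MongoDB server.

```python
from types import SimpleNamespace


def save_rooms(collection, rooms):
    """Insert all room dicts in one call and return how many were stored."""
    if not rooms:  # pymongo's insert_many rejects an empty list
        return 0
    result = collection.insert_many(rooms)
    return len(result.inserted_ids)


class FakeCollection:
    """Stand-in that mimics the one pymongo method used above."""

    def __init__(self):
        self.docs = []

    def insert_many(self, documents):
        self.docs.extend(documents)
        return SimpleNamespace(inserted_ids=list(range(len(documents))))


fake_homes = FakeCollection()
stored = save_rooms(fake_homes, [{'title': 'a'}, {'title': 'b'}])
print(stored, 'documents stored')
```

In the real script the homes collection would be passed in place of fake_homes.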
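You can download a MongoDB GUI tool to check the result. I use Robo3T, and the data shows up there once it is stored.
There are 100 pages of data in total. I use time.sleep() to throttle the requests so the crawler does not get blocked, but that makes the crawl really slow. I plan to learn pandas over the next couple of days.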
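The throttling could look roughly like the sketch below, which waits a random interval between requests. The crawl function, its parameters, and the 1 to 3 second range are my own assumptions for illustration; fetch is whatever function downloads one page.

```python
import random
import time


def crawl(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Random pause so requests do not arrive at a fixed rhythm
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

For example, crawl(page_urls, requests.get) would download every URL in page_urls with a pause before each request after the first. The sleep parameter is only there so the delay can be swapped out when testing.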
The full code is on GitHub.
Jianshu
Feel free to visit Treehl's blog.