My results
MongoDB in PyCharm
My code:
from bs4 import BeautifulSoup
import requests
import pymongo

# Listing pages 1 to 3
urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1, 4)]
client = pymongo.MongoClient('localhost', 27017)  # connect to the local MongoDB server
walden = client['walden']  # database name
xiaozu_3page = walden['xiaozu_3page']  # collection name

# Collect the links to the individual rental pages from a listing page
def get_pages(url):
    data = requests.get(url)
    soup = BeautifulSoup(data.text, 'lxml')
    pages = soup.select('.resule_img_a')
    for page in pages:
        duanzufang_info(page.get('href'))

# Scrape the details from a single rental page
def duanzufang_info(href):
    web_data = requests.get(href)
    soup = BeautifulSoup(web_data.text, 'lxml')
    title = soup.select('.pho_info em')[0].get_text()
    addr = soup.select('.pho_info p')[0].get('title')
    price = soup.select('.day_l span')[0].get_text()
    # The first paragraph holds both the area and the house type
    info = soup.select('.border_none p')[0].get_text().split()
    area = info[0]
    house_type = info[1]
    for_people = soup.select('.h_ico2')[0].get_text()
    bed_num = soup.select('.h_ico3')[0].get_text()
    data = {
        'title': title,
        'address': addr,
        'price': int(price),  # store as int so that numeric comparisons work
        'area': area,
        'type': house_type,
        'people': for_people,
        'bed': '床' + bed_num,  # '床' means 'bed'; prefix it only once
    }
    print(data)
    # Write the record to MongoDB
    xiaozu_3page.insert_one(data)

# Print the listings priced at 500 or more
def find_fangzi():
    for info in xiaozu_3page.find():
        if info['price'] >= 500:
            print(info)


# Uncomment to run the crawl:
# for url in urls:
#     get_pages(url)
find_fangzi()
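The find_fangzi() function above fetches every document and filters in Python. As a minimal sketch, the same selection can probably be pushed into the query itself with MongoDB's $gte operator. The names price_filter, find_expensive and min_price are made up for this illustration, and the collection is passed in as an argument on the assumption that it is a pymongo collection such as xiaozu_3page.

```python
def price_filter(min_price):
    """Build a MongoDB filter matching documents whose price is >= min_price."""
    return {'price': {'$gte': min_price}}


def find_expensive(collection, min_price=500):
    """Return the matching documents as a list.

    `collection` is assumed to be a pymongo Collection (or any object
    whose find() accepts a filter dict).
    """
    return list(collection.find(price_filter(min_price)))
```

It would be called as find_expensive(xiaozu_3page). The comparison only behaves numerically if price was stored as a number, which is one more reason to convert it with int() before inserting.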
My thoughts:
- It took me nearly an hour to finish.
- The Xiaozhu part was quick to write: I used no headers, no proxies, not even time.sleep(). Crude.
- The MongoDB GUI looks really friendly.
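- At first I got carried away and stored price as '¥' + price, which caused trouble later: because the value was not an int, I could not filter it with >=. When I then tried to change the data type inside MongoDB, every price turned into 0 (presumably because of the ¥ symbol).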
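One way to avoid the ¥ problem described above is to keep only the number in the database and add the currency symbol when displaying it. The sketch below is a hypothetical helper, parse_price, that is not part of the original script; it assumes prices are whole numbers and simply picks the first run of digits out of the scraped text.

```python
import re


def parse_price(text):
    """Return the first run of digits in `text` as an int.

    Raises ValueError when the text contains no digits at all.
    """
    match = re.search(r'\d+', text)
    if match is None:
        raise ValueError('no digits in price: {!r}'.format(text))
    return int(match.group())
```

With this, 'price': parse_price(price) would work whether or not the scraped string carries a ¥ prefix.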