Python in Practice course, third hands-on project: scraping rental listings
Results:
This is only a small excerpt. There was far too much output, and my network has been acting up lately, so the crawl was very slow.
2016-11-15.png
Here is the code:
import requests
from bs4 import BeautifulSoup
import time


def get_links(url):
    # Collect the detail-page links from one search results page.
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('#page_list > ul > li > a')
    for link in links:
        href = link.get('href')
        get_details(href)
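

def if_sex(sex_name):
    # sex_name is the class list of the landlord's gender icon.
    if sex_name == ['member_ico']:
        return '男'  # male
    elif sex_name == ['member_ico1']:
        return '女'  # female
    else:
        # Some landlords have not filled in a gender, hence the else branch.
        return '不明'  # unknown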


def get_details(url):
    # Scrape one listing's detail page and print its fields.
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('div.con_l > div.pho_info > h4 > em')
    addresss = soup.select('div.con_l > div.pho_info > p > span')
    prices = soup.select('div.day_l > span')
    images = soup.select('img[id="curBigImage"]')
    landlords = soup.select('div.member_pic > a > img')
    names = soup.select('div.w_240 > h6 > a')
    sexs = soup.select('div.member_pic > div')
    for title, address, price, image, landlord, name, sex in zip(
            titles, addresss, prices, images, landlords, names, sexs):
        data = {
            'title': title.get_text(),
            'address': address.get_text().rstrip(),  # drop the trailing '\n'
            'price': price.get_text(),
            'image': image.get("src"),
            'landlord': landlord.get("src"),
            'name': name.get_text(),
            'sex': if_sex(sex.get('class'))
        }
        print(data)


# Search result pages 1 to 13.
urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(str(i)) for i in range(1, 14)]
for url in urls:
    get_links(url)
    time.sleep(4)  # pause between pages
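
The `if_sex` helper above compares the whole class list against a one-element list, so an icon that carries any extra class would fall through to the unknown case. A minimal sketch of a more tolerant lookup follows, assuming the two class names from the listing are the only gender markers. The names `SEX_BY_CLASS` and `sex_from_classes` are made up for illustration and are not part of the original code.

```python
# Hypothetical lookup table: class names and labels are taken from the listing above.
SEX_BY_CLASS = {
    'member_ico': '男',   # male
    'member_ico1': '女',  # female
}


def sex_from_classes(class_names):
    """Return a gender label for a list of CSS class names (or None)."""
    # `class_names or []` covers an element with no class attribute at all.
    for class_name in class_names or []:
        if class_name in SEX_BY_CLASS:
            return SEX_BY_CLASS[class_name]
    return '不明'  # unknown: the landlord did not fill in a gender


print(sex_from_classes(['member_ico']))
print(sex_from_classes(['member_ico1', 'extra']))
print(sex_from_classes(None))
```

It could replace the call in `get_details` as `sex_from_classes(sex.get('class'))`.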
Summary:
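- Right after I finished writing it, I found that the scraped address came back with a trailing '\n', so I added rstrip(), making the line 'address': address.get_text().rstrip(), and that fixed it.
- While scraping I found that some landlords had not filled in their gender, so I used else to handle the unknown-gender case.
- Some statistics code could be added; otherwise the output looks rather messy.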
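
The statistics mentioned in the last point might look something like the sketch below. It assumes `get_details` is changed to append each `data` dict to a list instead of printing it, and that `price` is a string of digits. The `summarize` function and the sample records are made up for illustration.

```python
from collections import Counter
from statistics import mean


def summarize(records):
    """Summarize scraped listings: total count, average price, landlords by gender."""
    # Skip prices that are not plain digit strings rather than failing on them.
    prices = [int(r['price']) for r in records if r['price'].strip().isdigit()]
    return {
        'listings': len(records),
        'average_price': mean(prices) if prices else None,
        'by_sex': dict(Counter(r['sex'] for r in records)),
    }


# Made-up sample records in the same shape as the scraper's `data` dict.
sample = [
    {'title': 'A', 'price': '300', 'sex': '女'},
    {'title': 'B', 'price': '500', 'sex': '男'},
    {'title': 'C', 'price': '400', 'sex': '女'},
]
print(summarize(sample))
```

Printing one summary at the end, rather than one dict per listing, keeps the output short enough to read at a glance.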