The third project in the Python practical plan: scraping rental listings.
The final result looks like this:
one_three.png
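The crawl covers 9 search pages with 24 listings each, for a total of 216 listings, i.e. 216 records.
Each record contains 7 fields: title, address, daily rent, link to the first room image, link to the host's picture, host's gender, and host's name.
The code is as follows: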
import time

import requests
from bs4 import BeautifulSoup


def get_links(url):
    # Collect the detail-page link of every listing on one search-results page
    # and scrape each detail page in turn.
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('#page_list > ul > li > a')
    for link in links:
        href = link.get('href')
        one(href)


def if_sex(sexname):
    # BeautifulSoup returns the class attribute as a list of class names,
    # which is why the comparison is against a one-element list.
    if sexname == ['member_girl_ico']:
        return '女'  # female
    elif sexname == ['member_boy_ico']:
        return '男'  # male
    else:
        return '沒填寫'  # not specified


def one(url, data=None):
    # Scrape the seven fields from a single listing detail page.
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('div.pho_info > h4 > em')
    addres = soup.select('div.pho_info > p > span.pr5')
    prices = soup.select('#pricePart > div.day_l > span')
    images = soup.select('#curBigImage')
    pictures = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')
    sexes = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > span')
    names = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')
    # print(titles,addres,prices,pictures,names)
    if data is None:
        for title, addre, price, picture, name, sex, image in zip(
                titles, addres, prices, pictures, names, sexes, images):
            data = {
                'title': title.get_text(),
                # Strip newlines and spaces from the address text.
                'addre': addre.get_text().replace('\n', '').replace(' ', ''),
                'price': price.get_text(),
                'picture': picture.get('src'),
                'name': name.get_text(),
                'sex': if_sex(sex.get('class')),
                'image': image.get('src')
            }
            print(data)


# Search-result pages 1 through 9.
urls = ['http://wh.xiaozhu.com/search-duanzufang-p{}-0/?startDate=2016-07-17&endDate=2016-08-24'.format(i)
        for i in range(1, 10)]
for url in urls:
    get_links(url)
    time.sleep(2)  # pause between pages to avoid hammering the site
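One thing worth keeping in mind about the `one` function is that it combines the seven selector results with `zip`, and `zip` stops at the shortest input. If any single selector matches nothing on a page, no record is printed for that page and no error is raised. The sketch below illustrates this with plain lists; the sample values are made up for illustration, and `itertools.zip_longest` is shown only as one possible alternative, not as something the original code uses.

```python
from itertools import zip_longest

# Hypothetical stand-ins for the text extracted by three of the selectors.
titles = ["Sunny room near the lake"]
prices = ["128"]
names = []  # pretend the host-name selector matched nothing

# zip() stops at the shortest input, so the record is silently dropped.
strict_records = [
    {"title": t, "price": p, "name": n}
    for t, p, n in zip(titles, prices, names)
]

# zip_longest() pads the short inputs, so the missing field stays visible.
padded_records = [
    {"title": t, "price": p, "name": n}
    for t, p, n in zip_longest(titles, prices, names, fillvalue=None)
]

print(strict_records)
print(padded_records)
```

If the final count comes out below 216, an empty selector result on some pages is a likely place to look first.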
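Summary:
1. Break a large task into smaller tasks wherever possible, and pay attention to the input conditions and the output of each piece.
2. replace('a', 'b'): the string method replace substitutes b for every occurrence of a.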
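Both points can be seen in a small sketch: the address cleanup from the scraper pulled out into its own function with one clear input and one clear output. The function name `clean_address` and the sample string are assumptions made for illustration; the original code performs the same two `replace` calls inline.

```python
def clean_address(raw: str) -> str:
    """Input: raw text of the address element. Output: the text without newlines or spaces."""
    # str.replace returns a new string; calls can be chained one after another.
    return raw.replace("\n", "").replace(" ", "")


# Hypothetical raw value, padded the way scraped text often is.
raw_address = "\n  Wuhan, Hongshan District \n"
cleaned = clean_address(raw_address)
print(cleaned)

# replace substitutes every occurrence, not just the first one.
all_replaced = "a-b-a".replace("a", "b")
print(all_replaced)
```

Because strings are immutable, `raw_address` itself is left unchanged; only the returned value is cleaned.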