The three core scraping libraries: requests, BeautifulSoup, and lxml.
lxml is recommended as the parser, since it is the fastest option.
Use request headers to disguise the script as a browser. Right-click the page and choose Inspect, open the Network tab, and refresh; click any request and look for User-Agent under its request headers,
scrolling to the bottom if needed.
import requests
from bs4 import BeautifulSoup  # lxml only needs to be installed, not imported; BeautifulSoup loads it by name

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
}
res = requests.get("http://bj.xiaozhu.com/", headers=headers)
soup = BeautifulSoup(res.text, "lxml")  # parse the page with the lxml parser
print(soup.prettify())                  # pretty-print the parsed document
The parsed Soup object provides the find(), find_all(), and select() methods:
1. soup.find_all("div", "item")  # finds every <div> tag with class="item"
2. find() works like find_all(), except that find_all() returns every match while find() returns only the first.
3. The select() method takes a CSS selector string:
soup.select("div.item > a > h1")  # copy the selector from Chrome: right-click the element, choose Inspect, then Copy > Copy selector
This yields something like:
page_list > ul > li:nth-child(1) > div.result_btm_con.lodgeunitname > div:nth-child(1) > span > i
Change li:nth-child(1) to li:nth-of-type(1) (older versions of BeautifulSoup implement :nth-of-type but not :nth-child), and make sure every > has a space on both sides. In my experience select() is fiddly and only worth knowing about; see the side-by-side sketch below.
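To make the differences concrete, here is a minimal, self-contained sketch of the three methods run against a made-up HTML snippet (the tags and classes below are invented for illustration):

from bs4 import BeautifulSoup

html = ('<div class="item"><a href="/a"><h1>Listing A</h1></a></div>'
        '<div class="item"><a href="/b"><h1>Listing B</h1></a></div>')
soup = BeautifulSoup(html, "lxml")

print(soup.find("div", "item").h1.get_text())   # first match only: Listing A
print(len(soup.find_all("div", "item")))        # every match: 2
print([h1.get_text() for h1 in soup.select("div.item > a > h1")])  # ['Listing A', 'Listing B']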
find_all() is the most convenient in practice; the code is as follows:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
}
res = requests.get("http://bj.xiaozhu.com/", headers=headers)
soup = BeautifulSoup(res.text, "lxml")
prices = soup.find_all("span", "result_price")
names = soup.find_all("span", "result_title hiddenTxt")
# zip pairs each title with its price; this assumes the two lists line up one-to-one
for price, name in zip(prices, names):
    print(name.get_text(), price.get_text())
Putting it all together: scraping Beijing short-term rental listings.
To scrape several pages of results, first page through manually and note each page's URL:
http://bj.xiaozhu.com/search-duanzufang-p2-0/
http://bj.xiaozhu.com/search-duanzufang-p3-0/
As you can see, only the number after "p" changes, so turning pages can be automated; this run scrapes all of the listings.
import time

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
}

def get_links(url):
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, "lxml")
    prices = soup.find_all("span", "result_price")
    names = soup.find_all("span", "result_title hiddenTxt")
    informations = soup.find_all("em", "hiddenTxt")
    links = soup.find_all("a", "resule_img_a")
    # zip pairs each listing's fields; this assumes the four lists line up one-to-one
    for name, price, link, information in zip(names, prices, links, informations):
        data = {
            "name": name.get_text().strip(),
            "price": price.get_text().strip(),
            "href": link.get("href"),
            "information": information.get_text().replace("\n", " ").strip()
        }
        print(data)

# pages 1 through 13 share one URL pattern; only the page number changes
urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(number) for number in range(1, 14)]
for url in urls:
    get_links(url)
    time.sleep(2)  # pause between requests so we do not hammer the site
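As written, get_links() prints each listing. If you would rather collect the data, a small variation (a sketch reusing the headers and urls defined above, under the same assumptions about the page structure) returns the dictionaries instead:

def get_links(url):
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, "lxml")
    fields = zip(soup.find_all("span", "result_title hiddenTxt"),
                 soup.find_all("span", "result_price"),
                 soup.find_all("a", "resule_img_a"))
    # build one dict per listing instead of printing it
    return [{"name": n.get_text().strip(),
             "price": p.get_text().strip(),
             "href": a.get("href")} for n, p, a in fields]

all_listings = []
for url in urls:
    all_listings.extend(get_links(url))  # accumulate every page's listings
    time.sleep(2)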