Getting Started with the requests Library: Hands-on Practice
- Crawling a JD product page
- Crawling an Amazon product page
- Submitting search keywords to Baidu/360
- IP address geolocation lookup
- Downloading and saving web images
1. Crawling a JD product page
import requests

def GetHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise HTTPError on 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # switch to the encoding guessed from the content
        return r.text[:1000]
    except requests.RequestException:
        return "crawl failed"

if __name__ == '__main__':
    url = "https://item.jd.com/30185690434.html"
    print(GetHTMLText(url))
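For reference, the Response attributes this function relies on can be inspected one by one. A minimal sketch to run interactively (JD may block or redirect requests that lack a browser User-Agent, so output can vary):

import requests

r = requests.get("https://item.jd.com/30185690434.html", timeout=30)
print(r.status_code)         # 200 on success; raise_for_status() raises on 4xx/5xx
print(r.encoding)            # encoding taken from the HTTP headers
print(r.apparent_encoding)   # encoding inferred from the page content itself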
2. Crawling an Amazon product page
Some sites have anti-crawling mechanisms. Common anti-crawling strategies include:
- Header-based detection (e.g., checking the User-Agent)
- User-behavior-based detection (see the pacing sketch after the code below)
- Dynamic pages (content rendered client-side)
# If the site checks the User-Agent in the headers, send a custom
# request header to masquerade as a browser
import requests

def GetHTMLText(url):
    try:
        # custom request headers
        headers = {"user-agent": "Mozilla/5.0"}
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text[:1000]
    except requests.RequestException:
        return "crawl failed"

if __name__ == '__main__':
    url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
    print(GetHTMLText(url))
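For the second strategy in the list above, behavior-based detection, a common mitigation is to reuse one Session and pace the requests. A minimal sketch, where the one-second delay and the URL list are purely illustrative:

import time
import requests

urls = ["https://www.amazon.cn/gp/product/B01M8L5Z3Y"]   # illustrative target list

# a Session keeps cookies and connections alive across requests
session = requests.Session()
session.headers.update({"user-agent": "Mozilla/5.0"})

for url in urls:
    try:
        r = session.get(url, timeout=30)
        r.raise_for_status()
        print(url, r.status_code)
    except requests.RequestException:
        print(url, "crawl failed")
    time.sleep(1)   # pause between requests so the traffic looks less bot-like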
3. Submitting search keywords to Baidu/360
Use the params argument to fill in the keyword field of each engine's submission interface:
# Baidu keyword-submission interface: http://www.baidu.com/s?wd=keyword
# 360 keyword-submission interface:   http://www.so.com/s?q=keyword
import requests

def Get(url):
    headers = {'user-agent': 'Mozilla/5.0'}
    key_word = {'wd': 'python'}   # Baidu takes the keyword in the 'wd' parameter
    try:
        r = requests.get(url, headers=headers, params=key_word, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.request.url)      # the URL actually requested, with ?wd=python appended
        return r.text
    except requests.RequestException:
        return "crawl failed"

if __name__ == '__main__':
    url = "http://www.baidu.com/s"
    print(len(Get(url)))
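Submitting to 360 works the same way; per the interface comment above, only the parameter name changes from wd to q:

import requests

headers = {'user-agent': 'Mozilla/5.0'}
try:
    # 360 takes the keyword in the 'q' parameter instead of Baidu's 'wd'
    r = requests.get("http://www.so.com/s", headers=headers,
                     params={'q': 'python'}, timeout=30)
    r.raise_for_status()
    print(r.request.url)
    print(len(r.text))
except requests.RequestException:
    print("crawl failed")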
4. IP address geolocation lookup
Use IP138's query interface:
http://m.ip138.com/ip.asp?ip=ipaddress
# IP address lookup
import requests

url = "http://m.ip138.com/ip.asp?ip="
ip = input("IP address: ")
try:
    r = requests.get(url + ip, timeout=30)
    r.raise_for_status()
    print(r.status_code)
    print(r.text[-500:])   # the query result sits near the end of the page
except requests.RequestException:
    print("failed")
5. Downloading and saving web images
# spider_for_imgs
import os
import requests

url = "http://n.sinaimg.cn/sinacn12/w495h787/20180315/1923-fyscsmv9949374.jpg"
root = "C:/Users/Administrator/Desktop/spider/first week/imgs/"
path = root + url.split('/')[-1]   # name the file after the last URL segment
try:
    if not os.path.exists(root):
        os.makedirs(root)          # create the target directory if it is missing
    if not os.path.exists(path):
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        with open(path, 'wb') as f:
            f.write(r.content)     # r.content holds the raw bytes of the image
        print("saved successfully!")
    else:
        print("file already exists!")
except (OSError, requests.RequestException):
    print("spider failed")