First, thanks to 【小甲魚】 for the course 極客Python之效率革命. It is well taught, easy to follow, and suitable for beginners.
If you are interested, you can visit https://fishc.com.cn/forum-319-1.html to support 小甲魚. Thank you all.
To learn the requests library, see: https://fishc.com.cn/forum.php?mod=viewthread&tid=95893&extra=page%3D1%26filter%3Dtypeid%26typeid%3D701
1. Find the target URL
https://s.taobao.com/search?q=XXXX(item name)XXXXXX
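As a small illustration of how a search URL of this shape is put together, the sketch below builds one with the standard library's urllib.parse.urlencode. The helper name build_search_url is hypothetical (it is not part of the original post), and sort=sale-desc is simply the sort parameter this post uses in its requests.

```python
from urllib.parse import urlencode

BASE_URL = "https://s.taobao.com/search"


def build_search_url(keyword, sort=None):
    """Hypothetical helper: return the search URL for a keyword."""
    params = {"q": keyword}
    if sort is not None:
        params["sort"] = sort
    # urlencode percent-encodes non-ASCII text as UTF-8 by default
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("Python", sort="sale-desc"))
# prints: https://s.taobao.com/search?q=Python&sort=sale-desc
```

requests does the same encoding internally when a dict is passed as params, so building the string by hand is only needed when you want to inspect or log the URL.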
Let's fetch the page source first and take a look.
# -*- coding:UTF-8 -*-
import requests


def open_url(keyword):
    # Search for the keyword (for example "零基礎入門學習Python"),
    # sorted by sales in descending order
    payload = {'q': keyword, "sort": "sale-desc"}
    url = "https://s.taobao.com/search"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
    }
    res = requests.get(url, params=payload, headers=headers)
    return res


def main():
    keyword = input("Enter a search keyword: ")
    res = open_url(keyword)
    # Save the raw page source so it can be inspected
    with open('items.txt', 'w', encoding='utf-8') as file:
        file.write(res.text)


if __name__ == '__main__':
    main()
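Looking through the source, the content we want appears to be in the part shown in the screenshot. Next we use a regular expression to pull this block out.
Figure: 源碼.png (page source)
2. Locate the element with a regular expression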
# -*- coding:UTF-8 -*-
import re


def main():
    with open("items.txt", 'r', encoding="utf-8") as file1:
        # re.search(pattern, string, flags=0)
        # .*? is non-greedy: it matches any number of characters, but as few
        # as possible while still letting the whole pattern match
        g_page_config = re.search(r"g_page_config = (.*?);\n", file1.read())
    with open("g_page_config.txt", 'w', encoding="utf-8") as file2:
        file2.write(g_page_config.group(1))


if __name__ == '__main__':
    main()
Figure: 正則摳出來的內容.png (content extracted by the regular expression)
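To see what the non-greedy .*? does, the sketch below compares it with the greedy .* on a tiny made-up string (it is a stand-in, not real page source). re.DOTALL is added to both patterns only so that . can cross line breaks and the difference becomes visible; the pattern in this post does not use that flag.

```python
import re

# Made-up stand-in for a fragment of page source (not real Taobao data)
sample = 'g_page_config = {"a": 1};\n    g_srp_loadCss();\n'

# Non-greedy: stops at the first ";\n" it can
lazy = re.search(r"g_page_config = (.*?);\n", sample, re.DOTALL)
# Greedy: runs on to the last ";\n" it can
greedy = re.search(r"g_page_config = (.*);\n", sample, re.DOTALL)

print(repr(lazy.group(1)))
print(repr(greedy.group(1)))
```

The non-greedy match captures only {"a": 1}, while the greedy one also swallows the following line, which would no longer be valid JSON.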
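There is still a lot of content: dictionaries inside dictionaries inside yet more dictionaries. It is overwhelming, so what do we do?
We use the old trick: change the file extension to .json and open the file in Firefox.
Figure: 定位.png (locating the data)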
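If you would rather explore the nesting from Python than in a browser, one option is to list the key paths of the structure. This is only a sketch: key_paths is a hypothetical helper, and the sample below is a tiny made-up object that merely reuses the key names mods, itemlist, data, and auctions from the code in this post.

```python
import json


def key_paths(obj, prefix="", max_depth=4):
    """Yield slash-separated key paths of a nested dict/list structure."""
    if max_depth == 0:
        return
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}/{key}"
            yield path
            yield from key_paths(value, path, max_depth - 1)
    elif isinstance(obj, list) and obj:
        # Only inspect the first element; list items usually share a shape
        yield from key_paths(obj[0], prefix + "[0]", max_depth - 1)


# Tiny made-up sample, not real data
sample = json.loads(
    '{"mods": {"itemlist": {"data": {"auctions": [{"nid": "1", "title": "t"}]}}}}'
)
for path in key_paths(sample, max_depth=6):
    print(path)
```

For the real file, you could load g_page_config.txt with json.load and pass the result to key_paths, raising max_depth until the fields you need show up.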
3. Extract the data we want (sort by sales and total the sales across the first 3 pages)
# -*- coding:UTF-8 -*-
import re
import json
import requests


def open_url(keyword, page=1):
    # &s=0 starts the listing at the 1st item. Each page shows 44 items,
    # so &s=44 is the second page.
    payload = {'q': keyword, 's': str((page - 1) * 44), "sort": "sale-desc"}
    url = "https://s.taobao.com/search"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
    }
    res = requests.get(url, params=payload, headers=headers)
    return res


# Get all items on a listing page
def get_items(res):
    g_page_config = re.search(r"g_page_config = (.*?);\n", res.text)
    # Decode the JSON string into a Python object
    page_config_json = json.loads(g_page_config.group(1))
    page_items = page_config_json['mods']['itemlist']['data']['auctions']
    results = []  # Collect only the fields we care about
    for each_item in page_items:
        dict1 = dict.fromkeys(('nid', 'title', 'detail_url', 'view_price', 'view_sales', 'nick'))
        dict1['nid'] = each_item['nid']
        dict1['title'] = each_item['title']
        dict1['detail_url'] = each_item['detail_url']
        dict1['view_price'] = each_item['view_price']
        dict1['view_sales'] = each_item['view_sales']
        dict1['nick'] = each_item['nick']
        results.append(dict1)
    return results


# Total the sales on this page, counting only items whose title contains '小甲魚'
def count_sales(items):
    count = 0
    for each in items:
        if '小甲魚' in each['title']:
            # view_sales is a string; take the first run of digits in it
            count += int(re.search(r'\d+', each['view_sales']).group())
    return count


def main():
    keyword = input("Enter a search keyword: ")
    page = 3  # First three pages
    total = 0
    for each in range(page):
        res = open_url(keyword, each + 1)
        items = get_items(res)
        total += count_sales(items)
    print("Total sales:", total)


if __name__ == '__main__':
    main()
Figure: 輸出.png (program output)
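The counting step above raises an exception if a view_sales value contains no digits, because re.search then returns None. A more defensive variant might look like the sketch below. The names parse_sales and must_contain are hypothetical, and the sample items (including the "人付款" suffix) are invented for illustration rather than taken from a real response.

```python
import re


def parse_sales(view_sales):
    """Return the first run of digits in view_sales as an int, or 0 if none."""
    match = re.search(r"\d+", view_sales or "")
    return int(match.group()) if match else 0


def count_sales(items, must_contain):
    """Sum the sales of items whose title contains must_contain."""
    return sum(
        parse_sales(item.get("view_sales"))
        for item in items
        if must_contain in item.get("title", "")
    )


# Invented sample items, for illustration only
sample_items = [
    {"title": "小甲魚 Python", "view_sales": "120人付款"},
    {"title": "小甲魚 C", "view_sales": ""},
    {"title": "other book", "view_sales": "999人付款"},
]
print(count_sales(sample_items, "小甲魚"))  # prints 120
```

Passing the filter word as a parameter also makes it easy to count sales for other titles without editing the function.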