1. Introduction
Recently a friend needed, for work, to scrape the full product list of a certain Tmall shop. Having studied Python for a few months, I agreed to give it a try! The very first step stumped me: logging in to Tmall requires an account, a password, and a captcha!!! I knew this could be handled with a simulated login and a Session, but as a newcomer who hadn't learned anything that advanced yet, I felt it would mean a lot of detours. So I wondered: is there a simpler way???
Sure enough, there is! The mobile version of Tmall serves all the data without any login!!! Just like that~
Let's get started!
2. Pitfalls encountered
This article builds on the post 利用Python爬蟲爬取指定天貓店鋪全店商品信息 - 晴空行 - 博客園, stepping into its pits and filling them in, then adding a few data requirements of my own~
- Pitfall 1
File "/Users/HTC/Documents/Programing/Python/WebCrawlerExample/WebCrawler/Tmall_demo.py", line 63, in get_products
writer.writerows(products)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/csv.py", line 158, in writerows
return self.writer.writerows(map(self._dict_to_list, rowdicts))
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/csv.py", line 151, in _dict_to_list
+ ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: 'titleUnderIconList'
writer.writerows could not find the 'titleUnderIconList' field among the CSV's fieldnames. This field was presumably added to the Tmall interface's response at some later point, so the code simply has to delete it:
del product['titleUnderIconList']
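A side note (not from the original post): if you would rather not delete fields by hand every time the interface grows a new one, csv.DictWriter also accepts extrasaction='ignore', which silently skips any dict key that is not listed in fieldnames. A minimal sketch:

import csv

title = ['item_id', 'price', 'title']
products = [{'item_id': 1, 'price': '188.00', 'title': 'demo', 'titleUnderIconList': []}]

with open('demo.csv', 'w', encoding='utf-8', newline='') as f:
    # extrasaction='ignore' drops unknown keys such as 'titleUnderIconList'
    # instead of raising "ValueError: dict contains fields not in fieldnames"
    writer = csv.DictWriter(f, fieldnames=title, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(products)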
- Pitfall 2
File "/Users/HTC/Documents/Programing/Python/WebCrawlerExample/WebCrawler/Tmall_demo.py", line 65, in get_products
writer.writerows(products)
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/csv.py", line 158, in writerows
return self.writer.writerows(map(self._dict_to_list, rowdicts))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 26-27: ordinal not in range(128)
Anyone familiar with the differences between Python 3 and Python 2 will recognize 'ascii' codec can't encode as an encoding problem, and here it surfaces in writer.writerows. When Python 3 processes, parses, converts, or saves text, it is best to specify utf-8 encoding explicitly, especially when Chinese characters are involved!
The fix is to pass the encoding when opening the file:
with open(self.filename, 'a', encoding="utf-8", newline='') as f:
    writer = csv.DictWriter(f, fieldnames=title)
    writer.writerows(products)
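Why this happens (my own explanation, not from the original post): in Python 3, open() in text mode without an explicit encoding falls back to locale.getpreferredencoding(False), so the same script can run fine on a UTF-8 machine and crash with UnicodeEncodeError under an ASCII (C/POSIX) locale. A minimal sketch:

import locale

# whatever open() would use when no encoding= is given
print(locale.getpreferredencoding(False))  # e.g. 'UTF-8', or 'US-ASCII' under a C locale

# passing encoding='utf-8' makes the behaviour identical everywhere
with open('demo.txt', 'w', encoding='utf-8') as f:
    f.write('天貓')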
- Pitfall 3
035009803B0
Image download error : http//img.alicdn.com/bao/uploaded/i4/821705368/TB1Sht8cfQs8KJjSZFEXXc9RpXa_!!0-item_pic.jpg Invalid URL '035009803B0': No schema supplied. Perhaps you meant http://035009803B0?
02100713003
Image download error : http//img.alicdn.com/bao/uploaded/i1/821705368/TB1_OIkXQfb_uJkSmRyXXbWxVXa_!!0-item_pic.jpg Invalid URL '02100713003': No schema supplied. Perhaps you meant http://02100713003?
02800614023
Image download error : http//img.alicdn.com/bao/uploaded/i3/821705368/TB1kKK6cInI8KJjSsziXXb8QpXa_!!0-item_pic.jpg Invalid URL '02800614023': No schema supplied. Perhaps you meant http://02800614023?
The failure messages above appear because the product data returned by the Tmall interface looks like this:
{
    item_id: 14292263734,
    title: "XXXXXX",
    img: "//img.alicdn.com/bao/uploaded/i2/821705368/TB1Us3Qcr_I8KJjy1XaXXbsxpXa_!!0-item_pic.jpg",
    sold: "3",
    quantity: 0,
    totalSoldQuantity: 2937,
    url: "http://detail.m.tmall.com/item.htm?id=xxxxx",
    price: "188.00",
    titleUnderIconList: [ ]
},
The image URL carries no protocol name!!! Who knows how far back this historical pit dates!!! Even the big companies have pits!!
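Since the interface hands back protocol-relative URLs (starting with //), the code below simply prepends the scheme before downloading. A slightly more defensive helper (hypothetical, not part of the original code) would prepend it only when the scheme is actually missing:

def normalize_img_url(url):
    '''Prepend a scheme to protocol-relative URLs like //img.alicdn.com/...'''
    if url.startswith('//'):
        return 'https:' + url
    return url

print(normalize_img_url('//img.alicdn.com/bao/uploaded/demo.jpg'))
# -> https://img.alicdn.com/bao/uploaded/demo.jpg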
3叙量、總結(jié)
具體的代碼倡蝙,可參考我github代碼:
代碼詳細的解析還是參考這位大神的 利用Python爬蟲爬取指定天貓店鋪全店商品信息 - 晴空行 - 博客園,寫的非常的詳細绞佩!
整體來說寺鸥,因為天貓的商品數(shù)據(jù)通過js來獲取猪钮,所以比較容易獲取到數(shù)據(jù),而不用大量的爬取頁面的商品胆建,這個很贊烤低!所以,爬蟲這技術(shù)活笆载,有很多方法扑馁,能找到好的方法,才是爬蟲的最高境界傲棺ぁ腻要!加油~
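For reference, the "JS interface" mentioned above returns JSONP, that is, JSON wrapped in a callback such as jsonp_12345({...}); the code below strips that wrapper with a regex before parsing. A minimal illustration with a fabricated response string:

import json
import re

# fabricated response, shaped like the shop_auction_search.do output
jsonp = 'jsonp_12345({"total_page": "3", "items": []})'
payload = re.findall(r'\(({.*})\)', jsonp)[0]  # keep only the {...} part
data = json.loads(payload)
print(data['total_page'])  # prints: 3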
Code
Python really is something: about a hundred lines of code gets the whole job done!
import csv
import json
import os
import random
import re
import time
from datetime import datetime

import requests


class TM_producs(object):
    def __init__(self, storename):
        self.storename = storename
        self.url = 'https://{}.m.tmall.com'.format(storename)
        self.headers = {
            "user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 "
                          "(KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1"
        }
        datenum = datetime.now().strftime('%Y%m%d_%H%M%S')
        self.filename = '{}_{}.csv'.format(self.storename, datenum)
        self.get_file()

    def get_file(self):
        '''Create the CSV file with a header row.'''
        title = ['item_id', 'product_id', 'price', 'quantity', 'sold', 'title', 'totalSoldQuantity', 'url', 'img']
        with open(self.filename, 'w', encoding="utf-8", newline='') as f:
            writer = csv.DictWriter(f, fieldnames=title)
            writer.writeheader()

    def get_totalpage(self):
        '''Extract the total page count.'''
        num = random.randint(83739921, 87739530)
        endurl = '/shop/shop_auction_search.do?sort=s&p=1&page_size=12&from=h5&ajson=1&_tm_source=tmallsearch&callback=jsonp_{}'
        url = self.url + endurl.format(num)
        html = requests.get(url, headers=self.headers).text
        # strip the JSONP callback wrapper and parse the JSON payload
        infos = re.findall(r'\(({.*})\)', html)[0]
        infos = json.loads(infos)
        totalpage = infos.get('total_page')
        return int(totalpage)

    def get_products(self, page):
        '''Extract one page of the product list.'''
        num = random.randint(83739921, 87739530)
        endurl = '/shop/shop_auction_search.do?sort=s&p={}&page_size=12&from=h5&ajson=1&_tm_source=tmallsearch&callback=jsonp_{}'
        url = self.url + endurl.format(page, num)
        html = requests.get(url, headers=self.headers).text
        infos = re.findall(r'\(({.*})\)', html)[0]
        infos = json.loads(infos)
        products = infos.get('items')
        for product in products:
            # Pitfall 1: drop the field that is missing from the CSV header
            del product['titleUnderIconList']
            item_id = product['item_id']
            product_id = self.get_product_spm(item_id)
            product['product_id'] = product_id
            # Pitfall 3: the img URL is protocol-relative, so prepend the scheme
            imgUrl = 'https:' + product['img']
            self.save_img(imgUrl, product_id)
        title = ['item_id', 'product_id', 'price', 'quantity', 'sold', 'title', 'totalSoldQuantity', 'url', 'img']
        # Pitfall 2: write with an explicit utf-8 encoding
        with open(self.filename, 'a', encoding="utf-8", newline='') as f:
            writer = csv.DictWriter(f, fieldnames=title)
            writer.writerows(products)

    def get_product_spm(self, item_id):
        '''Extract the merchant item number (貨號) from the product detail page.'''
        url = 'https://detail.m.tmall.com/item.htm?id={}'.format(item_id)
        html = requests.get(url, headers=self.headers).text
        # the page embeds e.g. {"貨號":"07300318000 "}
        product_id = re.findall(r'"貨號":"(.+?)"}', html)[0].strip()
        print(product_id)
        return product_id

    def save_img(self, img_url, file_name):
        try:
            # take the image's file extension from the URL
            file_suffix = os.path.splitext(img_url)[1]
            save_path = os.path.join(os.getcwd(), 'images/')
            if not os.path.exists(save_path):
                os.makedirs(save_path)
            image_path = os.path.join(save_path, file_name + file_suffix)
            # download and save the image
            image = requests.get(img_url, headers=self.headers)
            with open(image_path, 'wb') as f:
                f.write(image.content)
        except Exception as e:
            print('Image download error :', file_name, e)

    def main(self):
        '''Crawl every page of products in a loop.'''
        total_page = self.get_totalpage()
        for i in range(1, total_page + 1):
            self.get_products(i)
            print('{} pages in total, page {} extracted'.format(total_page, i))
            time.sleep(1 + random.random())


if __name__ == '__main__':
    storename = 'mgssp'
    tm = TM_producs(storename)
    tm.main()
References
- iHTCboy/WebCrawlerExample: web crawler practice examples
- 利用Python爬蟲爬取指定天貓店鋪全店商品信息 - 晴空行 - 博客園
- Hopetree/E-commerce-crawlers: a collection of e-commerce site crawlers (Taobao, JD, Amazon, etc.)
- If you have any questions, feel free to discuss them in the comments!
- If anything here is incorrect, corrections are welcome!
Note: this article was first published on iHTCboy's blog. If you repost it, please credit the source.