Date: 2016-12-01
By: Black Crow
Preface:
This assignment is for Part 4 of the course: scraping dynamically loaded data. The scraped data is stored as a CSV file, and a few simple charts were then made in Excel. About 6,800 products were scraped for the Taobao keyword 'bra'.
作業(yè)效果:
海外產(chǎn)品價格遠(yuǎn)高于大陸產(chǎn)品.png
廣東產(chǎn)品占比46%.png
廣東產(chǎn)品均價以河源最高毫别,陽江最低.png
My code:
```python
import requests, json, csv


class Taobao():
    '''Fetch Taobao search pages for a keyword and extract product
    information from the JSON data returned by each page.'''
    def __init__(self, start_page, end_page, keyword):
        # Each page holds 44 items, so the offset is the page index * 44.
        self.urls = ['https://s.taobao.com/search?data-key=s&data-value={}&ajax=true&_ksTS=1480429993551_840&callback=&q={}'.format(str(i * 44), keyword)
                     for i in range(start_page, end_page + 1)]
        self.write_file_head()  # write the header row once
        for url in self.urls:   # loop over the pages, appending each page's rows
            data = self.get_data(url)
            product_info_list = self.get_product_info(data)
            self.write_file(product_info_list)

    def get_data(self, url):
        r = requests.get(url)
        data = json.loads(r.text)  # parse the response into a dict
        return data

    def get_product_info(self, data):
        product_info_list = []
        # The JSON is large; an online JSON viewer helps when exploring its structure.
        for i in data['mods']['itemlist']['data']['auctions']:
            product_info = {
                'detail_url': i['detail_url'].replace('//', ''),
                'location': i['item_loc'].replace('//', ''),
                'shoplink': i['shopLink'].replace('//', ''),
                'reserve_price': i['reserve_price'],
                'fee': i['view_fee'],
                'raw_title': i['raw_title'],
                'view_price': i['view_price'],
                'pic_url': i['pic_url'].replace('//', ''),
                'shop_owner': i['nick'],
                'user_id': i['user_id'],
            }
            product_info_list.append(product_info)
        return product_info_list

    def write_file_head(self):
        # The header is written only once; rewriting it for every page
        # would scatter duplicate header rows through the CSV.
        with open('product_info.csv', 'a+', newline='') as file:
            fieldnames = ['raw_title', 'view_price', 'reserve_price', 'location',
                          'fee', 'detail_url', 'shoplink', 'pic_url', 'shop_owner', 'user_id']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writeheader()

    def write_file(self, product_info_list):
        # 'a+' appends at the end of the file; without newline='' the csv
        # module emits extra blank lines (the cause wasn't clear to me yet).
        with open('product_info.csv', 'a+', newline='') as file:
            fieldnames = ['raw_title', 'view_price', 'reserve_price', 'location',
                          'fee', 'detail_url', 'shoplink', 'pic_url', 'shop_owner', 'user_id']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            # writerows, unlike writerow, takes a whole list of dicts at once
            writer.writerows(product_info_list)


if __name__ == '__main__':
    keyword = input('Keyword:')            # the keyword the results are based on
    start_page = int(input('From page:'))  # 44 items per page; cast to int for page control
    end_page = int(input('To page:'))      # same as above
    file = Taobao(start_page, end_page, keyword)
    print('Done!')
```
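The header-once pattern used by `write_file_head` and `write_file` can be sketched in isolation: write the header a single time, then append rows from any number of pages. The file name and the field names below are shortened for illustration (the post's real CSV has ten columns); I also open with `'w'` for the header so repeated demo runs start from a clean file, whereas the post uses `'a+'` throughout.

```python
import csv

CSV_PATH = 'demo_products.csv'  # illustrative file name, not the post's product_info.csv
FIELDS = ['raw_title', 'view_price', 'location']  # a subset of the real columns


def write_head(path=CSV_PATH):
    # Header goes in exactly once, before any data rows.
    with open(path, 'w', newline='') as f:  # newline='' avoids blank lines on Windows
        csv.DictWriter(f, fieldnames=FIELDS).writeheader()


def append_rows(rows, path=CSV_PATH):
    # Each page of results is appended below whatever is already in the file.
    with open(path, 'a', newline='') as f:
        csv.DictWriter(f, fieldnames=FIELDS).writerows(rows)


write_head()
append_rows([{'raw_title': 'bra A', 'view_price': '59.00', 'location': 'Guangdong'}])
append_rows([{'raw_title': 'bra B', 'view_price': '89.00', 'location': 'Zhejiang'}])

with open(CSV_PATH, newline='') as f:
    rows = list(csv.reader(f))
print(len(rows))  # → 3 (1 header row + 2 data rows)
```

Keeping the header write separate from the row writes is what makes the per-page append loop safe to call any number of times.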
####總結(jié):
>1. 淘寶商品頁顯示的只有100頁(每頁44個商品)裙盾,做測試的時候曾做了300頁实胸,爬取6.8K商品時報錯了。
2. 鏈接中有&callback=json*闷煤,如果不在請求鏈接中刪除童芹,會存在返回的數(shù)據(jù)不是json格式的問題。
3. csv文件寫入的時候用了a+追加寫入鲤拿,所以代碼對頭信息單獨做了處理的假褪,寫入信息過程中在加newline=‘’之前是會出現(xiàn)亂碼,原因暫未查明近顷。
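Point 2 can also be handled after the fact: when the `callback` parameter is left in, the body comes back wrapped as JSONP, i.e. `callbackName({...});`, which `json.loads` rejects. A minimal sketch of unwrapping such a body (the callback name `jsonp840` below is made up for illustration; it stands in for whatever name the URL requests):

```python
import json
import re


def jsonp_to_dict(text):
    # Strip an optional 'callbackName(' prefix and ');' suffix, then parse as JSON.
    # Plain JSON bodies fall through unchanged.
    match = re.match(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', text, re.S)
    payload = match.group(1) if match else text
    return json.loads(payload)


# A toy JSONP body shaped like the post's data; 'jsonp840' is a hypothetical callback name.
body = 'jsonp840({"mods": {"itemlist": {"data": {"auctions": []}}}});'
data = jsonp_to_dict(body)
print(data['mods']['itemlist']['data']['auctions'])  # → []
```

Emptying the `callback=` parameter in the URL, as the post's code does, is the simpler fix; the helper above is only a fallback for responses you cannot re-request.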