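The main body of this example is the same as in the earlier post "Simple Taobao data scraping with Python3".
The only change is that the target site is now 1688 instead of Taobao (because I found that Taobao does not list raw materials).
The implementation is simple: about 60 lines of straightforward code are enough.
This is mostly because Python's package ecosystem is so complete that I barely need any knowledge of the syntax and can just call ready-made functions.
1688 also does not require a login to browse,
so the cookies part can be removed.
For what each function does, see "Simple Taobao data scraping with Python3".
One difference is that 1688 percent-encodes the search keyword in the URL instead of using the Chinese characters directly (come to think of it, sites that encode are probably the majority).
Oh, and the fetched pages are processed with bs4 rather than regular expressions,
so the libraries used differ somewhat as well.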
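The URL encoding mentioned above can be sketched on its own. This is a minimal sketch that reuses the search URL and the GB2312 encoding from the full script further down; the helper name `build_search_url` is made up for illustration, and whether 1688 still accepts this URL form has not been verified here.

```python
from urllib.parse import quote, unquote

# Base search URL copied from the full script in this post (not re-verified against 1688).
BASE_URL = ("https://www.1688.com/chanpin/-.html"
            "?spm=a261b.2187593.searchbar.2.oUjRZK&keywords=")


def build_search_url(keyword, page):
    """Hypothetical helper: percent-encode the keyword as GB2312 and add the page number."""
    # quote() encodes the text to GB2312 bytes, then writes each byte as %XX.
    encoded = quote(keyword, safe="/", encoding="gb2312")
    return BASE_URL + encoded + "&beginPage=" + str(page)


if __name__ == "__main__":
    url = build_search_url("中文", 1)
    print(url)
    # Decoding with the same encoding gives the original keyword back.
    print(unquote(url.split("keywords=")[1].split("&")[0], encoding="gb2312"))
```

Note that `quote` raises `UnicodeEncodeError` for characters that GB2312 cannot represent, so a more forgiving script might pass `errors="replace"` or fall back to another encoding.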
Software used
- vscode
- python3
Libraries used
urllib // URL handling library
requests // HTTP request library
xlwt // Excel writing library
bs4 // page content parsing library
Full code:
import urllib.parse  # import the submodule explicitly; quote() lives there
import xlwt
from bs4 import BeautifulSoup
import requests
def writeExcel(ilt, name):
    # Write the [price, title] pairs to <name>.xls, or print them if no name is given.
    if name != '':
        count = 0
        workbook = xlwt.Workbook(encoding='utf-8')
        worksheet = workbook.add_sheet('temp')
        # Header row
        worksheet.write(count, 0, 'No.')
        worksheet.write(count, 1, 'Price')
        worksheet.write(count, 2, 'Name')
        for g in ilt:
            count = count + 1
            worksheet.write(count, 0, count)
            worksheet.write(count, 1, g[0])
            worksheet.write(count, 2, g[1])
        workbook.save(name + '.xls')
        print('Saved as: ' + name + '.xls')
    else:
        printGoodsList(ilt)

def getHTMLText(url):
    # Fetch a page and return its text, or "" if the request fails.
    kv = {'user-agent': 'Mozilla/5.0'}
    try:
        r = requests.get(url, headers=kv, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

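def parsePage(ilt, html):
    # Append a [price, title] pair to ilt for every item found in the page.
    try:
        bf = BeautifulSoup(html, 'html.parser')
        price = bf.find_all(attrs={'click-item': "price"})
        title = bf.find_all(attrs={'click-item': "title"})
        # zip stops at the shorter list, so mismatched counts cannot raise IndexError
        for p, t in zip(price, title):
            ilt.append([p.text, t.text])
    except Exception as e:
        print('Failed to parse page:', e)
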
def printGoodsList(ilt):
    # Print the results as a tab-separated table.
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Name"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))

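def main():
    goods = input('Product to search for: ')
    depth = int(input('Number of pages to search: '))
    name = input('Excel file name to save as (leave empty to print): ')
    # The keyword is percent-encoded as GB2312 before being appended to the URL.
    start_url = 'https://www.1688.com/chanpin/-.html?spm=a261b.2187593.searchbar.2.oUjRZK&keywords=' + urllib.parse.quote(goods, safe='/', encoding='gb2312')
    infoList = []
    print('Processing...')
    for i in range(depth):
        url = start_url + '&beginPage=' + str(i + 1)
        html = getHTMLText(url)
        if html == "":
            # getHTMLText returns "" when the request fails, so skip this page.
            print('Page %i failed, skipping...' % (i + 1))
            continue
        parsePage(infoList, html)
        print('Page %i done...' % (i + 1))
    writeExcel(infoList, name)
    print('Done!')

main()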
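Discussion
A very simple little program. Looking back, this one seems better suited for practice than the Taobao version.
There is still a lot to improve:
- Use a database instead of writing to Excel with xlwt (although for small amounts of data Excel may be more convenient)
- Try XPath as a replacement for bs4 (I have heard that XPath is nicer to use than bs4?)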
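
As a rough idea of what the database option could look like, here is a minimal sketch using Python's built-in sqlite3 module. The table name `goods`, its columns, and the function `save_goods` are assumptions made up for this illustration; the rows are `[price, title]` pairs in the same shape that `parsePage` collects, and both values are stored as text because the script never converts prices to numbers.

```python
import sqlite3


def save_goods(conn, rows):
    """Hypothetical replacement for writeExcel: store [price, title] pairs in SQLite."""
    # Create the table on first use; id plays the role of the row-number column.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS goods ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, price TEXT, title TEXT)"
    )
    # Parameterized inserts avoid building SQL strings by hand.
    conn.executemany(
        "INSERT INTO goods (price, title) VALUES (?, ?)",
        [(price, title) for price, title in rows],
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    # An in-memory database keeps the demo from touching the disk;
    # a real run would pass a file path such as "goods.db" instead.
    conn = sqlite3.connect(":memory:")
    save_goods(conn, [["12.50", "sample item A"], ["3.00", "sample item B"]])
    for row in conn.execute("SELECT id, price, title FROM goods ORDER BY id"):
        print(row)
    conn.close()
```

Unlike the xlwt version, repeated runs against the same database file append to the existing table rather than overwriting it, which may or may not be what you want.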