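In sales, nothing matters more than data. It usually comes from websites, B2B platforms, and the exhibitor catalogs published by trade shows.
What I want to learn here is scraping websites with BeautifulSoup. A few short snippets of code, written in five minutes, can save you six hours of manual typing.
First, the code: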
# coding: utf-8
__author__ = 'lixiang'

import re
import urllib.request  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup
from openpyxl import Workbook

urls = ['', '', '']  # site URLs withheld
links = []

# Step 1: collect the exhibitor detail links from each listing page.
for url in urls:
    response = urllib.request.urlopen(url)
    source = response.read()
    response.close()
    soup = BeautifulSoup(source, 'html.parser')
    for tag in soup.find_all(href=re.compile("custom_exhibitor")):
        links.append(tag['href'])

# Step 2: visit each detail page and copy its th/td cells into a worksheet.
wb = Workbook()
ws = wb.active
header_written = False
for url in links:
    response = urllib.request.urlopen(url)
    source = response.read()
    response.close()
    soup = BeautifulSoup(source, 'html.parser')
    thtext = soup.find_all("th")
    tdtext = soup.find_all("td")
    length = len(thtext)
    # Column names come from the <th> cells.
    text = [th.string for th in thtext]
    # Values come from the first `length` <td> cells.
    # .string is None when a cell contains nested tags, so guard before lstrip().
    text1 = [td.string.lstrip() if td.string is not None else None
             for td in tdtext[:length]]
    if len(text1) > 1:
        print(text1[1])  # progress output
    # Write the header row once, then one data row per page.
    if not header_written:
        ws.append(text)
        header_written = True
    ws.append(text1)

wb.save('filename.xlsx')
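What I like about the code above is that it actually scrapes the data. A few questions remain, though: how to make the source more readable, for example by restructuring it into a class, and how to use multithreading to speed up the crawl.
That is a job for the next iteration.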
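As a rough idea of what that next iteration might look like, here is a minimal sketch that wraps the crawl in a class and runs the downloads on a thread pool. The class name `ExhibitorCrawler` and the `fetch` and `parse` callables are my own placeholders, not part of the script above. In a real run, `fetch` would do the page download and `parse` would do the th/td extraction; here they are replaced by stand-ins so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


class ExhibitorCrawler:
    """Fetch many pages in parallel and parse each one into a row.

    `fetch` and `parse` are hypothetical callables supplied by the caller:
    fetch(url) -> page source, parse(source) -> list of cell values.
    """

    def __init__(self, fetch, parse, max_workers=8):
        self.fetch = fetch
        self.parse = parse
        self.max_workers = max_workers

    def crawl_one(self, url):
        """Download and parse one page; return None if anything fails."""
        try:
            return self.parse(self.fetch(url))
        except Exception as exc:  # keep one bad page from stopping the run
            print("skipped %s: %s" % (url, exc))
            return None

    def crawl(self, urls):
        """Return parsed rows in the same order as `urls`, dropping failures."""
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            rows = list(pool.map(self.crawl_one, urls))
        return [row for row in rows if row is not None]


# Stand-in fetch/parse functions (assumptions, for illustration only).
def fake_fetch(url):
    if "bad" in url:
        raise IOError("unreachable")
    return "name=%s" % url


def fake_parse(source):
    return source.split("=")


crawler = ExhibitorCrawler(fake_fetch, fake_parse, max_workers=4)
rows = crawler.crawl(["a", "bad-link", "b"])
print(rows)
```

Threads suit this job because the time goes into waiting on the network rather than into computation. Writing the rows to the workbook is probably best left to the main thread after `crawl` returns, so only the downloads run concurrently.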
Thanks to the internet, and thanks to knowledge. I suppose this is what efficiency looks like.