Preface
1. Suitable for quick learners who already think like programmers.
2. Suitable for quick learners with a background in another programming language.
Crawler workflow
1. Clarify the requirements
Decide which site you need which information from. Here we need the merchant name, phone number, address, and page URL from Yelp's second-level (detail) pages.
2. Analyze the URLs
Locate the elements that hold the content you need.
3. Develop and debug
VS Code or PyCharm
4. Output the results
Export a CSV file
Let's get to work
1. First, the requirements
We need to search Yelp for "All You Can Eat" shops in a fixed set of US states: ['FL','GA','IL','IN','MD','MA','MI','NJ','NY','NC','OH','PA','SC','TN','TX','VA'].
As shown in the screenshot below, the search URL is https://www.yelp.com/search?find_desc=All+You+Can+Eat&find_loc=FL
There are 21 pages in total, 10 listings per page. Besides the 10 regular listings, each page also carries a variable number of ad slots. To keep the crawl simple, I ignore the ads and scrape at most 10 merchants per page.
2. Click through to page 2 to see how the URL changes from page to page:
https://www.yelp.com/search?find_desc=All+You+Can+Eat&find_loc=FL&start=10
The start parameter increases by 10 per page (see the sketch below).
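Assuming that rule holds, here is a minimal sketch of how the per-page search URLs can be generated (the parameter names come straight from the URL above; the page count of 3 is just an example):

from urllib.parse import urlencode

BASE = 'https://www.yelp.com/search?{}'

def search_urls(state, pages):
    # Yelp paginates with a `start` offset that grows by 10 per page
    for page in range(pages):
        params = {
            'find_desc': 'All You Can Eat',
            'find_loc': state,
            'start': page * 10,
        }
        yield BASE.format(urlencode(params))

# e.g. the first three result pages for Florida
for u in search_urls('FL', 3):
    print(u)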
3. Next, find where the data we need lives in the page
Press F12 to open the browser's developer tools (Chrome is recommended).
A first look shows the data we want is right here: the class names are fixed, and each listing also carries the second-level URL we want to click into.
You can also locate it this way.
On closer inspection, the data sits inside the <script> tags.
The block below holds the listing names and addresses.
This article scrapes the <script> tag data to obtain the second-level URLs, as sketched below.
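As a rough sketch of that idea (the "businessUrl" key and the ad filter are the same ones used in the full code later; html stands for a search page you have already downloaded):

import re
from bs4 import BeautifulSoup

def extract_business_urls(html):
    # Walk every <script> tag and pull out the "businessUrl":"..." values
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for script in soup.find_all('script'):
        text = script.string
        if text and 'businessUrl' in text:
            urls.extend(re.findall(r'"businessUrl":"([^"]+)"', text))
    # Drop ad entries and duplicates, as the main spider does
    return list({u for u in urls if 'ad_business_id' not in u})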
4. With the second-level URLs in hand, we analyze the detail pages.
As shown below, the phone number and address are both there.
Use F12 again to see where these fields live.
From the screenshot above you can read off name, telephone, and address directly; a parsing sketch follows.
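The full spider below digs these fields out of the <script> text with regular expressions. As an alternative sketch, assuming the detail page embeds a schema.org JSON-LD block (which is what field names like streetAddress and addressRegion suggest), the same data can be read with the json module:

import json
from bs4 import BeautifulSoup

def extract_contact(html):
    # Look for a schema.org JSON-LD <script> block; the field names match the
    # regexes used in parse_Url below (name, telephone, streetAddress, ...)
    soup = BeautifulSoup(html, 'html.parser')
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string or '')
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and 'telephone' in data:
            addr = data.get('address', {})
            address = ', '.join(filter(None, [
                addr.get('streetAddress', '').replace('\n', ' '),
                addr.get('addressLocality', ''),
                addr.get('addressRegion', ''),
                addr.get('postalCode', ''),
            ]))
            return data.get('name', ''), data.get('telephone', ''), address
    return '', '', ''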
5. Now that everything has been located, how do we extract it and write the code? If you are not comfortable with Python, ChatGPT comes in handy: whenever something errors out, ask it and keep iterating.
For example:
How do I read the content of a <script> tag?
How do I extract a specific piece of a <script> tag's content (with a regex)?
As long as you know what to ask, you will get an answer.
Next, let's write the code with ChatGPT, asking as we go.
First, a look at the overall structure of the code.
# Program structure
class xxxSpider(object):
    def __init__(self):
        # Define commonly used variables, e.g. the base URL and counters
        pass

    def get_html(self):
        # Fetch the response content, using a random User-Agent
        pass

    def parse_html(self):
        # Parse the page with regular expressions and extract the data
        pass

    def write_html(self):
        # Save the extracted data as required: CSV, MySQL database, etc.
        pass

    def run(self):
        # Main function that controls the overall logic
        pass

if __name__ == '__main__':
    # Record the program start time
    spider = xxxSpider()
    spider.run()
That is basically the whole crawler flow: define the URL, fetch its content, parse it, and write what you extracted to a file.
The full code follows.
CSV helper: csvUtil.py
import csv

def writeCsv(name, urls):
    with open(name + '.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        # Write the header row (the field names, i.e. the keys)
        writer.writerow(['url'])
        # Write one row per URL
        for url in urls:
            writer.writerow([url])

def initShopCsv(name):
    with open(name + '.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        # Write the header row (the field names, i.e. the keys)
        writer.writerow(['ShopName', "Phone", "Address", "Url"])

def writeShopCsv(data, name):
    with open(name + '.csv', 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        # Append one shop record
        writer.writerow(data)
        print("Wrote row: " + str(data))
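A quick usage sketch of the helper (the shop record here is made-up example data):

import csvUtil

# Creates GA.csv with the header row, then appends one record
csvUtil.initShopCsv('GA')
csvUtil.writeShopCsv(['Example BBQ', '(000) 000-0000',
                      '123 Example St,Atlanta,GA,30301',
                      'https://www.yelp.com/biz/example'], name='GA')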
Constants module: address_info.py (imported from the static package below)
states = ['FL', 'GA', 'IL', 'IN', 'MD', 'MA', 'MI',
          'NJ', 'NY', 'NC', 'OH', 'PA', 'SC', 'TN', 'TX', 'VA']
Main spider: yelp.py
from urllib import parse
import time
import random
from fake_useragent import UserAgent
import requests
from lxml import etree
from bs4 import BeautifulSoup
import re
import json
import csvUtil
from static.address_info import states
# Suppress SSL certificate warnings (we request with verify=False)
requests.packages.urllib3.disable_warnings()
import concurrent.futures


class openTableSpider(object):
    # 1. Initialization: set the base search URL
    def __init__(self):
        self.url = 'https://www.yelp.com/search?{}'
    # 2. Search the shop listings, adding header info; the query time should be later
    #    than the current time, otherwise it is easy to be detected as non-human traffic.
    #    Returns the list of shop URLs found by the search.
    def searchShopHtml(self, url, state, size):
        # Query parameters
        params = {
            'find_desc': "All You Can Eat",
            'find_loc': state,
            'start': size
        }
        # Build the full search URL
        full_url = url.format(parse.urlencode(params))
        print(full_url)
        ua = UserAgent()
        response = requests.get(url=full_url, headers={'User-Agent': 'AdsBot-Google'}, verify=False)
        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        # Find all the <script> tags
        script_tags = soup.find_all('script')
        # Scan the content of each <script> tag
        for script in script_tags:
            script_str = script.string
            # The listing data carries "businessUrl" keys whose values start with /biz
            if script_str and 'businessUrl' in script_str:
                # Pull out every business URL
                link_match = re.findall(r'"businessUrl":"([^"]+)"', script_str)
                if link_match:
                    # Drop the ad slots
                    format_urls = [x for x in link_match if "ad_business_id" not in x]
                    # De-duplicate
                    urls = list(set(format_urls))
                    print(str(urls))
                    # csvUtil.writeCsv(shopName, link_match)
                    return urls
    # 3. Parse each shop's URL and extract the shop name, phone number, and address.
    def parse_Url(self, url, state):
        url = "https://www.yelp.com/" + url
        print("Start doing " + url)
        ua = UserAgent()
        response = requests.get(url=url, headers={'User-Agent': ua.random}, verify=False)
        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        # Find all the <script> tags
        script_tags = soup.find_all('script')
        # The name, phone and address all sit inside a <script> tag:
        #   name    -> "name":"..."
        #   phone   -> "telephone":"..."
        #   address -> "address":{"streetAddress":...,"addressLocality":...,...}
        for script in script_tags:
            script_str = script.string
            address = name = phone = ""
            if script_str and 'telephone' in script_str:
                m_address = re.search(r'"address":{(.*?)},', script_str)
                if m_address:
                    temp = m_address.group(1)
                    try:
                        streetAddress = re.search(r'"streetAddress":"([^"]*)"', temp).group(1)
                        addressLocality = re.search(r'"addressLocality":"([^"]*)"', temp).group(1)
                        addressCountry = re.search(r'"addressCountry":"([^"]*)"', temp).group(1)
                        addressRegion = re.search(r'"addressRegion":"([^"]*)"', temp).group(1)
                        postalCode = re.search(r'"postalCode":"([^"]*)"', temp).group(1)
                        address = streetAddress.replace("\\n", "") + "," + addressLocality + "," + addressRegion + "," + postalCode
                        print(address)
                        # flag = any(item.upper() in state.upper() for item in openTableAddress)
                        # # Skip shops that are not in the address pool
                        # if flag is False:
                        #     return
                    except Exception as e:
                        # One of the address fields was missing; leave the address empty
                        print("Address parsing error:", e)
                m_name = re.search(r'"name":"(.*?)",', script_str)
                if m_name:
                    name = m_name.group(1)
                m_phone = re.search(r'telephone":"(.*?)",', script_str)
                if m_phone:
                    phone = m_phone.group(1)
                data = [name, phone, address, url]
                csvUtil.writeShopCsv(data=data, name=state)
    # 4. Run the full flow for one state
    def doTransaction(self, state, page):
        urls = self.searchShopHtml(self.url, state, page)
        if not urls:
            return
        print(len(urls))
        # A full page holds 10 listings, so keep paging while the last page was full
        while len(urls) % 10 == 0:
            print(len(urls))
            new_urls = []
            page += 10
            reptry = 0
            # Retry up to 10 times
            while reptry < 10:
                try:
                    new_urls = self.searchShopHtml(self.url, state, page)
                    break
                except Exception as e:
                    print("Unexpected exception:", e)
                    # Try again
                    reptry += 1
                    time.sleep(random.randint(5, 10))
            # If there are no new URLs, i.e. the next page came back empty, stop paging
            if not new_urls:
                break
            urls += new_urls
            time.sleep(random.randint(5, 10))
        count = 0
        # Parse every shop URL, retrying each one up to 10 times
        for url in urls:
            reptry = 0
            while reptry < 10:
                try:
                    self.parse_Url(url, state)
                    count += 1
                    print(state + " deal count " + str(count))
                    break
                except Exception as e:
                    print("Unexpected exception:", e)
                    reptry += 1
                finally:
                    time.sleep(random.randint(3, 10))
    # 5. Entry function
    def run(self):
        for state in states:
            csvUtil.initShopCsv(state)
        # Create a thread pool and run the states concurrently
        with concurrent.futures.ThreadPoolExecutor() as executor:
            # executor.map runs doTransaction for every state, each starting at page 0
            executor.map(self.doTransaction, states, [0] * len(states))
        print("All tasks have been submitted.")
        # # Alternative: start one thread per state by hand
        # threads = []
        # for state in states:
        #     print(state)
        #     csvUtil.initShopCsv(state)
        #     thread = threading.Thread(target=self.doTransaction, args=(state, 0))
        #     thread.start()
        #     threads.append(thread)
        # # Wait for all the threads to finish
        # for thread in threads:
        #     thread.join()
if __name__ == '__main__':
    start = time.time()
    spider = openTableSpider()  # instantiate a spider object
    # spider.searchShopHtml(spider.url, "New York", 0)   # single-step test calls
    # spider.parse_Url("biz/zen-korean-bbq-duluth?osq=All+You+Can+Eat+Buffet", "GEORGIA")
    spider.run()
    end = time.time()
    # Print the total running time
    print('Elapsed time: %.2f' % (end - start))
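The script leans on a few third-party packages; assuming the standard PyPI package names, they can be installed with:

pip install requests beautifulsoup4 lxml fake-useragent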
See the comments for the details; if any step is unclear, just ask GPT.