Requirements
First, the province names, province codes, and latitude/longitude coordinates need to be imported into the database, to provide data for the interfaces built later.
The page to scrape the coordinates from: (because this was the first one I found)
Approach
First pull the page down with WebDriver, then inspect its structure and parse out the table we need, and finally save the scraped data to an Excel file and import it into the database.
Preparation:
- Install Selenium WebDriver
pip install selenium
Selenium WebDriver provides programming interfaces in a variety of languages for Web automation.
After installing, start the Python interpreter and run import selenium; if no exception is raised, the installation succeeded.
- Download the browser driver
The WebDriver for the Chrome browser (chromedriver.exe) can be downloaded from:
http://npm.taobao.org/mirrors/chromedriver/
The WebDriver for Firefox (geckodriver.exe) is available here:
https://github.com/mozilla/geckodriver/releases
Drivers for other browsers are listed below:
Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver
Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Download the version that matches your browser:
Install BeautifulSoup
BeautifulSoup4 is an HTML/XML parser whose main job is to parse HTML/XML and extract data from it, much like the lxml library.
BeautifulSoup4 makes parsing HTML simple: the API is very friendly and supports CSS selectors. It can use the HTML parser from the Python standard library (html.parser), and also supports the lxml parser.
pip install beautifulsoup4
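As a quick illustration of the table-parsing step used later, here is a minimal sketch that turns `<tr>`/`<td>` elements into nested lists. The HTML here is made-up sample data; on the real page the table comes from the scraped site:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the scraped page; the real table comes from the site.
html = """
<table>
  <tr><td>北京市</td><td>110000</td><td>116.40</td><td>39.90</td></tr>
  <tr><td>上海市</td><td>310000</td><td>121.47</td><td>31.23</td></tr>
</table>
"""

# 'html.parser' ships with Python; pass 'lxml' instead if it is installed.
soup = BeautifulSoup(html, "html.parser")
rows = [[td.get_text() for td in tr.find_all("td")] for tr in soup.find_all("tr")]
print(rows)
```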
Install openpyxl
openpyxl is a Python module for working with Excel spreadsheets.
To install it:
pip install openpyxl
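A minimal round trip with openpyxl — write a cell, save, reload — shows the handful of calls the script below relies on (the file name here is arbitrary):

```python
from openpyxl import Workbook, load_workbook

wb = Workbook()
ws = wb.active            # every new workbook starts with one active sheet
ws.title = "demo"
ws.cell(row=1, column=1).value = "province"
ws.cell(row=2, column=1).value = "北京市"
wb.save("demo.xlsx")      # arbitrary file name for this sketch

ws2 = load_workbook("demo.xlsx")["demo"]
print(ws2.cell(row=2, column=1).value)
```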
Implementation steps
- First import the required modules

```python
from selenium import webdriver
from bs4 import BeautifulSoup
from openpyxl.workbook import Workbook
```
- Point at the Chrome driver and maximize the window

```python
# A raw string avoids backslash-escape problems in the Windows path
driver = webdriver.Chrome(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")  # this is my driver path, change it to yours
driver.maximize_window()
```
- Open the target URL with get

```python
driver.get(
    "https://blog.csdn.net/abcmaopao/article/details/79554904")
html = driver.execute_script("return document.documentElement.outerHTML")
```
- Parse with BeautifulSoup
Parse the HTML with the BeautifulSoup class to get a BeautifulSoup object, which can then be traversed in a structured way.

```python
soup = BeautifulSoup(html, 'lxml')
```
- Build the city-level Excel spreadsheet

```python
if soup:
    # Create a workbook, grab the active sheet, and give it a name
    wb = Workbook()
    ws = wb.active
    ws.title = u'省份經(jīng)緯度'
    # rows holds the scraped data (renamed from `list` to avoid shadowing the builtin)
    rows = []
    # Walk every row of the table, turn each row's cells into a list,
    # then append that list to rows
    for tr in soup.find_all('tr'):
        col = []
        for td in tr.find_all('td'):
            col.append(td.get_text())
        rows.append(col)
    print(rows)
    # Print it to check, then write it into the spreadsheet
    i = 0
    for r in rows:
        if i == 0:
            # header row
            j = 0
            for c in r:
                ws.cell(row=i+1, column=j+1).value = c
                print(i, j, c)
                j += 1
            i += 1
        elif(i>0 and int(r[1])%10000!=0):
            # codes not ending in 0000 are city-level rows
            j = 0
            for c in r:
                ws.cell(row=i+1, column=j+1).value = c
                print(i, j, c)
                j += 1
            i += 1
    # Save
    wb.save('市級(jí).xlsx')
    print("保存成功!")
driver.quit()
```
The structure of the saved spreadsheet
Full code
```python
from selenium import webdriver
from bs4 import BeautifulSoup
from openpyxl.workbook import Workbook

driver = webdriver.Chrome(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
driver.maximize_window()
driver.get(
    "https://blog.csdn.net/abcmaopao/article/details/79554904")
html = driver.execute_script("return document.documentElement.outerHTML")
soup = BeautifulSoup(html, 'lxml')
if soup:
    wb = Workbook()
    ws = wb.active
    ws.title = u'省份經(jīng)緯度'
    rows = []
    for tr in soup.find_all('tr'):
        col = []
        for td in tr.find_all('td'):
            col.append(td.get_text())
        rows.append(col)
    print(rows)
    i = 0
    for r in rows:
        if i == 0:
            j = 0
            for c in r:
                ws.cell(row=i+1, column=j+1).value = c
                print(i, j, c)
                j += 1
            i += 1
        elif(i>0 and int(r[1])%10000!=0):
            j = 0
            for c in r:
                ws.cell(row=i+1, column=j+1).value = c
                print(i, j, c)
                j += 1
            i += 1
    wb.save('市級(jí).xlsx')
    print("保存成功!")
driver.quit()
```
Extracting the DataV.GeoAtlas JSON map data for every province
Next we need to download the [national map JSON API](http://datav.aliyun.com/tools/atlas/#&lat=30.316551722910077&lng=104.20306438764393&zoom=3.5) locally, but this time at the province level.
Using the same approach, scrape the province-level administrative codes —
that is, change elif(i>0 and int(r[1])%10000!=0):
to elif(i>0 and int(r[1])%10000==0):
Then read each province's code and fetch and save its JSON dynamically.
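The filter works because Chinese administrative codes are hierarchical: province-level codes end in four zeros (e.g. 110000), while city-level codes do not (e.g. 110100). A pure-Python sketch of the two filters, using a few hypothetical sample codes:

```python
# Hypothetical sample codes following the administrative-code convention the table uses.
codes = [110000, 110100, 440000, 440300]

provinces = [c for c in codes if c % 10000 == 0]  # keep province-level rows
cities = [c for c in codes if c % 10000 != 0]     # keep city-level rows

print(provinces)  # → [110000, 440000]
print(cities)     # → [110100, 440300]
```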
Full code

```python
from selenium import webdriver
from bs4 import BeautifulSoup
from openpyxl import load_workbook

def getJson(code):
    path = "https://geo.datav.aliyun.com/areas/bound/geojson?code="
    driver.get(path + code + '_full')
    html = driver.execute_script("return document.documentElement.outerHTML")
    soup = BeautifulSoup(html, 'lxml')
    print(soup.get_text())
    if soup.get_text():
        f = open(code + '_full.json', 'w', encoding='utf-8')
        f.write(soup.get_text())
        f.close()
        print("保存成功" + code)

driver = webdriver.Chrome(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
driver.maximize_window()
code = "100000"
getJson(code)
ws = load_workbook('test1.xlsx')["省份經(jīng)緯度"]
i = 0
for row in ws.rows:
    if i > 0:  # skip the header row
        print(row[1].value)
        code = str(row[1].value)  # cell values may be ints; getJson needs a str
        getJson(code)
    i += 1
driver.quit()
```
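Since the GeoAtlas endpoint is a plain HTTP GET that returns JSON, the same download could also be done without a browser at all. Below is a sketch using only the standard library; the URL pattern is the one from the code above, and the helper names are my own:

```python
import json
import urllib.request

BASE = "https://geo.datav.aliyun.com/areas/bound/geojson?code="

def build_url(code):
    # Same URL pattern the browser version requests.
    return BASE + str(code) + "_full"

def fetch_geojson(code):
    # Download one region's GeoJSON and save it to disk (needs network access).
    with urllib.request.urlopen(build_url(code), timeout=30) as resp:
        data = resp.read().decode("utf-8")
    json.loads(data)  # fail early if the response is not valid JSON
    with open(str(code) + "_full.json", "w", encoding="utf-8") as f:
        f.write(data)

print(build_url("100000"))
```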