Scraping China's Five-Level Administrative Addresses with a Python Crawler

1. Scraping province-level addresses

2019 data

province.png

The most recent data on administrative divisions and urban-rural classification is from 2019; click the link above to view it. Inspecting the page shows that each province's link and name are stored in a tag like the following.

<a href="11.html">北京市<br></a>

Because the National Bureau of Statistics pages have a fairly simple structure, the data can be extracted directly with a regular expression.

pattern = re.compile("<a href='(.*?)'>(.*?)<")
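
As a quick check of what this pattern captures, the sketch below runs it against a hand-written fragment. The fragment is an assumption: it uses single-quoted href attributes, which is what the pattern expects, whereas the tag shown earlier uses double quotes, as a browser inspector would display it. The name `loose_pattern` is hypothetical and simply accepts either quote style.

```python
import re

# Hand-written sample; the real page source is assumed to look similar.
sample = "<td><a href='11.html'>北京市<br/></a></td><td><a href='13.html'>河北省<br/></a></td>"

# Pattern from the article: matches only single-quoted href attributes.
pattern = re.compile("<a href='(.*?)'>(.*?)<")
provinces = pattern.findall(sample)

# Hypothetical variant that accepts single or double quotes.
loose_pattern = re.compile(r"""<a href=['"](.*?)['"]>(.*?)<""")
double_quoted = '<a href="11.html">北京市<br></a>'
loose_matches = loose_pattern.findall(double_quoted)

print(provinces)
print(loose_matches)
```

If the strict pattern returns an empty list on a page, the quote style of the raw HTML is the first thing worth checking.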

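The code for scraping the 31 provinces is shown below. Because scraping all five levels later means hitting the server frequently, it helps to prepare several request headers to rotate through. The url is the link above, and the response encoding is set explicitly to avoid garbled text.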

import requests
import re
import random
import time
import os
import pandas as pd

# Build request headers with a randomly chosen User-Agent
def get_headers():
    user_agent = [
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
        "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
        "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
        "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
        "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
        "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
        "UCWEB7.0.2.37/28/999",
        "NOKIA5700/ UCWEB7.0.2.37/28/999",
        "Openwave/ UCWEB7.0.2.37/28/999",
        "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
        "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25"
    ]

    headers = {
        'Cookie': '_trs_uv=kfp3v12j_6_8t0e; SF_cookie_1=37059734; _trs_ua_s_1=kfxdjigi_6_4w48',
        'Host': 'www.stats.gov.cn',
        'Referer': 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/',
        'User-Agent': random.choice(user_agent),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
    }

    return headers


# Fetch the 31 provinces
def get_province():
    url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/index.html'
    response = requests.get(url, headers=get_headers())
    response.raise_for_status()  
    response.encoding = response.apparent_encoding 
    # response.encoding = 'gbk'
    response.close()
    pattern = re.compile("<a href='(.*?)'>(.*?)<")
    result = list(set(re.findall(pattern, response.text)))
    return result

# Write the provinces to a CSV file
def write_province():
    province = get_province()
    tem = []
    for i in province:
        tem.append([i[0], i[1]])
    df_province = pd.DataFrame(tem)
    df_province.to_csv('省.csv', index=0)
    return None

2. Scraping city-level addresses

image.png

Looking at the page for Hebei province, the url changes from http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/index.html to http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/13.html. In other words, the last path segment becomes "13.html", which is the href value from the province-level tag scraped earlier, e.g. <a href="13.html">河北省<br></a>
City-level data is stored slightly differently from province-level data. It still sits in tags like the ones below, but applying the province-level regular expression to a city page also returns an address code such as "130100000000". Handling this is simple: drop the address codes from the results. Because the codes are purely numeric, they are easy to filter out.

<a href="13/1301.html">130100000000</a>
<a href="13/1301.html">石家莊市</a>
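
A minimal sketch of that filtering step, using a hand-written row that mirrors the two tags above (single-quoted attributes are assumed, to match the regular expression). Note that it tests whether the text is entirely digits, which is a swapped-in alternative to the `'0' not in ...` check used in the article's code.

```python
import re

# Hand-written sample of one city-level row (single-quoted attributes assumed).
sample = ("<tr><td><a href='13/1301.html'>130100000000</a></td>"
          "<td><a href='13/1301.html'>石家莊市</a></td></tr>")

pattern = re.compile("<a href='(.*?)'>(.*?)<")
matches = pattern.findall(sample)

# Keep only entries whose text is not purely numeric, i.e. drop the address codes.
cities = [(href, text) for href, text in matches if not text.isdigit()]
print(cities)
```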

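To keep the crawl reliable, the author saved the first-level data immediately after scraping it. The saved file contains both the link and the text; for example, Hebei is stored as ['13.html', '河北省']. Scraping city-level data then only requires adjusting the url and the Referer field in the request headers, as the code below shows.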
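
The url construction at each level can also be sketched with `urllib.parse.urljoin`, under the assumption that every href is relative to the page it appears on. The district href `01/130102.html` below is an illustrative value, not taken from the article; the page a link was found on then doubles as the Referer for the next request.

```python
from urllib.parse import urljoin

BASE = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/index.html'

# Each href is resolved against the page on which it was found.
province_url = urljoin(BASE, '13.html')
city_url = urljoin(province_url, '13/1301.html')
district_url = urljoin(city_url, '01/130102.html')  # illustrative href

for url in (province_url, city_url, district_url):
    print(url)
```

This yields the same urls that the string concatenation in the article's functions builds by hand.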

# Fetch the 31 provinces
write_province()
province = pd.read_csv('省.csv').values

# Fetch the 342 cities
def get_city(province_code):
    url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/' + province_code
    headers=get_headers()
    headers['Referer'] = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/index.html'
    response = requests.get(url, headers=headers)
    response.raise_for_status()  
    response.encoding = 'gbk' 
    response.close()
    pattern = re.compile("<a href='(.*?)'>(.*?)<")
    result = list(set(re.findall(pattern, response.text)))
    res = []
    for j in result:
        if '0' not in j[1]:
            res.append(j)
    return res

def write_city():
    tem = []
    for i in province:
        city = get_city(i[0]) 
        print('Scraping:', i[1], '{} cities in total'.format(len(city)))
        time.sleep(random.random())
        for j in city:
            tem.append([i[0], i[1], j[0], j[1]])
    pd.DataFrame(tem).to_csv('市.csv', index=0)
    return None

3. Scraping third- and fourth-level (district/county, street) addresses

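Third- and fourth-level addresses are scraped much like city-level ones. The code is almost a copy of the earlier functions; the only differences are how the url and Referer are built. The code for both levels is shown below.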

# Fetch the 3068 districts/counties
def get_district(city_code):
    url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/' + city_code
    headers=get_headers()
    headers['Referer'] = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/{}.html'.format(city_code.split('/')[0])
    response = requests.get(url, headers=headers)
    response.raise_for_status()  
    response.encoding = 'gbk' 
    response.close()
    pattern = re.compile("<a href='(.*?)'>(.*?)<")
    result = list(set(re.findall(pattern, response.text)))
    res = []
    for j in result:
        if '0' not in j[1]:
            res.append(j)
    return res

def write_district():
    tem = []
    for i in city:
        district = get_district(i[2]) 
        print('Scraping:', i[1], i[3], '{} districts in total'.format(len(district)))
        time.sleep(random.random())
        for j in district:
            tem.append([i[0], i[1], i[2], i[3], j[0], j[1]])
        print(tem[-1], '\n')
    pd.DataFrame(tem).to_csv('區(qū).csv', index=0)
    return None


# Fetch the 43027 streets
def get_road(province_code, city_code, district_code):
    url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/' + province_code.split('.')[0] + '/' + district_code
    headers=get_headers()
    headers['Referer'] = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/' + city_code
    response = requests.get(url, headers=headers)
    response.raise_for_status()  
    response.encoding = 'gbk' 
    response.close()
    pattern = re.compile("<a href='(.*?)'>(.*?)<")
    result = list(set(re.findall(pattern, response.text)))
    res = []
    for j in result:
        if '0' not in j[1]:
            res.append(j)
    return res

def write_road():
    tem = []
    for i in district:
        success = False
        while not success:
            try:
                road = get_road(i[0], i[2], i[4])
                print(i[1], i[3], i[5], 'scraped successfully, {} streets in total'.format(len(road)))
                time.sleep(random.random() / 2)
                success = True
            except Exception as e:
                print(e)
                print(i[1], i[3], i[5], 'scrape failed, retrying')
        for j in road:
            tem.append([i[0], i[1], i[2], i[3], i[4], i[5], j[0], j[1]])
        print(tem[-1], '\n')
    pd.DataFrame(tem).to_csv('路.csv', index=0)
    return None

# Fetch the 342 cities
write_city()
city = pd.read_csv('市.csv').values

# Fetch the 3068 districts/counties
write_district()
district = pd.read_csv('區(qū).csv').values

# Fetch the 43027 streets
write_road()
df = pd.read_csv('路.csv')

4. Scraping fifth-level addresses

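Scraping fifth-level addresses differs in two respects.

  • The tags that hold fifth-level addresses are different
  • There are far more fifth-level addresses, so some optimization is needed

Each fifth-level row carries two codes in addition to the name, and the tag type changes from <a> to <td>. In practice, adjust the regular expression accordingly and filter the codes out of the results.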

<td>130202002001</td>
<td>111</td>
<td>友誼里社區居委會</td>
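
One way to handle these rows, sketched here as an alternative to collecting every <td> and discarding the numeric ones, is to match a whole row at once so that each name stays paired with its codes. The header row in the sample and the exact cell layout are assumptions; the 12-digit and 3-digit widths are taken from the example above.

```python
import re

# Hand-written sample: an assumed header row followed by the row shown above.
sample = ("<tr><td>code</td><td>class</td><td>名稱</td></tr>"
          "<tr><td>130202002001</td><td>111</td><td>友誼里社區居委會</td></tr>")

# One match per data row: (12-digit address code, 3-digit class code, name).
row_pattern = re.compile(r"<td>(\d{12})</td>\s*<td>(\d{3})</td>\s*<td>(.*?)</td>")
rows = row_pattern.findall(sample)

# The header row is skipped automatically because its first cell is not numeric.
names = [name for _, _, name in rows]
print(rows)
print(names)
```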

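Fifth-level scraping also adds a try/except mechanism: when a request fails, the same record is fetched again until it succeeds. To keep the crawl stable, the author scrapes one province at a time and saves each result immediately, then merges the files with pandas at the end. The fifth-level scraping code is shown below.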
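
The retry idea can be sketched in isolation as follows. This is a variation, not the author's loop: it gives up after a fixed number of attempts and waits longer after each failure, whereas the article's code retries indefinitely. The names `fetch_with_retry` and `flaky` are hypothetical, and `flaky` merely simulates a request that fails twice.

```python
import time


def fetch_with_retry(fetch, max_attempts=5, base_delay=0.5):
    """Call fetch() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait longer after each failure: base_delay, then 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))


# Simulated request that fails twice before succeeding.
calls = {'n': 0}


def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated failure')
    return ['友誼里社區居委會']


result = fetch_with_retry(flaky, base_delay=0)
print(result, calls['n'])
```

A bounded retry avoids looping forever on a page that can never be parsed, at the cost of having to record which records were skipped.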

# Fetch the 656781 fifth-level addresses
def get_community(province_code, district_code, road_code):
    url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/' + province_code.split('.')[0] + '/' + district_code.split('/')[0] + '/' + road_code
    headers=get_headers()
    headers['Referer'] = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/' + province_code.split('.')[0] + '/' + district_code
    response = requests.get(url, headers=headers)
    response.raise_for_status()  
    response.encoding = 'gbk'
    response.close()
    pattern = re.compile('<td>(.*?)</td>')
    result = list(set(re.findall(pattern, response.text)))
    res = []
    for j in result:
        if not re.findall(r'^\d*$', j):
            res.append(j)
    res.remove('名稱')
    return res

def write_community(filename):
    tem = []
    for i in road:
        success = False
        while not success:
            try:
                community = get_community(i[0], i[4], i[6])
                print(i[1], i[3], i[5], i[7], '\t------>scraped successfully, {} village committees in total'.format(len(community)))
                time.sleep(random.random() / 4)
                success = True
            except Exception as e:
                print(e)
                print(i[1], i[3], i[5], i[7], '\t------>scrape failed, retrying')
        for j in community:
            tem.append([i[1],i[3],i[5],i[7], j])
        # print(tem[-1], '\n')
    pd.DataFrame(tem).to_csv(filename, index=0)
    return None

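# Merge the per-province fifth-level address files
def merge():
    file_list = os.listdir('address/')
    # DataFrame.append was removed in pandas 2.0; collect the frames and concatenate once
    frames = [pd.read_csv('address/' + i) for i in file_list]
    data = pd.concat(frames, ignore_index=True)
    data.rename(columns={'0':'Level 1', '1':'Level 2', '2':'Level 3', '3':'Level 4', '4':'Level 5', }, inplace=True)
    return data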

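# Fetch the 656781 fifth-level addresses province by province
os.makedirs('address', exist_ok=True)  # merge() reads the per-province files from address/
lis = df['1'].unique()
for i in lis:
    road = df[df['1']==i].values
    write_community('address/' + i + '.csv')

# Merge the per-province fifth-level address files
address = merge()
address.to_csv('address.csv', index=0)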
address.head()
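
Because each province is saved to its own file, an interrupted crawl could be resumed by skipping provinces whose file already exists. A minimal sketch, assuming the per-province files are named after the province and live in one directory (as `merge()` expects of `address/`); `pending_provinces` is a hypothetical helper, and the demonstration uses a temporary directory rather than real crawl output.

```python
import os
import tempfile


def pending_provinces(names, out_dir):
    """Return the province names that have no saved CSV in out_dir yet."""
    done = {os.path.splitext(f)[0] for f in os.listdir(out_dir) if f.endswith('.csv')}
    return [name for name in names if name not in done]


# Demonstration: one province has already been saved.
with tempfile.TemporaryDirectory() as out_dir:
    open(os.path.join(out_dir, '河北省.csv'), 'w').close()
    todo = pending_provinces(['北京市', '河北省', '天津市'], out_dir)

print(todo)
```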

5. Data preview

sample.png

Full dataset: https://pan.baidu.com/s/1BAkVbjkJHipEArIrE7Ntwg
Extraction code: z6dx
