Scraping Eastmoney Stock Data with a Python 3 Crawler and Storing It in MySQL

1. Environment:

Windows 10
Python 3
MySQL 5.7

2. Scraping web data with the Python crawler and saving it to local data files

Start the MySQL database service.
First import the required modules and define the functions:

# -*- coding: utf-8 -*-
"""
Created on Fri Dec 29 15:54:40 2017

@author: JayMo
"""

import urllib.request
import re
import pandas as pd
import pymysql
import os

# Fetch a web page and return its HTML
def getHtml(url):
    html = urllib.request.urlopen(url).read()
    html = html.decode('gbk')
    return html

# Extract the stock codes from the page HTML
def getStackCode(html):
    s = r'<li><a target="_blank" 
    pat = re.compile(s)
    print(pat)
    code = pat.findall(html)
    print(code)
    return code
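
The pattern string in getStackCode is cut off in the listing above, so the exact expression the author used is not shown. As a minimal sketch only, assuming each entry on the list page is an anchor of the form `<li><a target="_blank" href="http://quote.eastmoney.com/sh600000.html">`, a pattern with one capture group could pull out the six-digit code. The pattern, the sample HTML, and the helper name `extract_codes` are hypothetical and used purely for illustration; the real page layout may differ.

```python
import re

# Hypothetical pattern: assumes links such as
# <li><a target="_blank" href="http://quote.eastmoney.com/sh600000.html">
# It is not the (truncated) pattern from the article.
CODE_PATTERN = re.compile(
    r'<li><a target="_blank" href="http://quote\.eastmoney\.com/[a-z]{2}(\d{6})\.html">'
)


def extract_codes(html):
    """Return every six-digit stock code captured by CODE_PATTERN."""
    return CODE_PATTERN.findall(html)


# Made-up sample markup, only to show the shape of the result.
sample_html = (
    '<li><a target="_blank" href="http://quote.eastmoney.com/sh600000.html">A(600000)</a></li>'
    '<li><a target="_blank" href="http://quote.eastmoney.com/sz000001.html">B(000001)</a></li>'
)
print(extract_codes(sample_html))
```

Because the pattern has a single capture group, `findall` returns a plain list of strings, which is the shape the filtering loop over codes starting with 6 expects.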

The code that does the actual work:

Url = 'http://quote.eastmoney.com/stocklist.html'  # Eastmoney stock list page
filepath = 'F:\\data\\'  # directory where the data files are saved
# Run the scrape
code = getStackCode(getHtml(Url))
# Collect all stock codes that start with 6 (these should be Shanghai market stocks)
CodeList = []
for item in code:
    if item[0]=='6':
        CodeList.append(item)
# Download the data and save it to local CSV files
for code in CodeList:
    print('Fetching data for stock %s' % code)
    url = 'http://quotes.money.163.com/service/chddata.html?code=0'+code+\
        '&end=20171228&fields=TCLOSE;HIGH;LOW;TOPEN;LCLOSE;CHG;PCHG;TURNOVER;VOTURNOVER;VATURNOVER;TCAP;MCAP'
    urllib.request.urlretrieve(url, filepath+code+'.csv')

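Change the value of end in the url to fetch stock data up to a different cutoff date. First, a look at the scraped result. CodeList is the collection of all scraped stock codes; it contains 1416 elements, that is, 1416 stocks. Because there are so many stocks, only the codes starting with 6 are fetched.
The data for each stock is stored in its own CSV file, one file per stock. In theory there will be 1416 CSV files, matching the number of stock codes.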
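
Since the cutoff date is just the end parameter of the URL, the URL construction can be pulled into a small helper. This is a sketch under the assumption that the query string has exactly the shape used in the code above; the names `build_history_url`, `BASE_URL` and `FIELDS` are introduced here for illustration and are not part of the original script.

```python
BASE_URL = 'http://quotes.money.163.com/service/chddata.html'
FIELDS = ('TCLOSE;HIGH;LOW;TOPEN;LCLOSE;CHG;PCHG;TURNOVER;'
          'VOTURNOVER;VATURNOVER;TCAP;MCAP')


def build_history_url(code, end):
    """Build the download URL for one stock code and a cutoff date (YYYYMMDD)."""
    if not (len(end) == 8 and end.isdigit()):
        raise ValueError('end must look like YYYYMMDD, got %r' % (end,))
    # The 0 in front of the code mirrors the URL used in the article.
    return '%s?code=0%s&end=%s&fields=%s' % (BASE_URL, code, end, FIELDS)


print(build_history_url('600000', '20171228'))
```

The download loop could then call `urllib.request.urlretrieve(build_history_url(code, '20171228'), filepath+code+'.csv')`, and only the date string needs to change between runs.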

[Image: the downloaded CSV files]

Open one of the local data files to see what the fetched data looks like:

[Image: contents of one CSV file]

3. Storing the data in a MySQL database

First open a connection to the local database:

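# MySQL user name and password
name = 'xxxx'
password = 'xxxx'  # replace with your own user name and password
# Connect to the local database (the database service must already be running)
db = pymysql.connect(host='localhost', user=name, password=password, charset='utf8')
cursor = db.cursor()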

Here name and password are the MySQL user name and password that were set when MySQL was installed.

Create a database dedicated to storing this stock data:

# Create the database stockDataBase; skip if it already exists
sqlSentence1 = "create database if not exists stockDataBase"
cursor.execute(sqlSentence1)
# Select the database that was just created
sqlSentence2 = "use stockDataBase;"
cursor.execute(sqlSentence2)

On the first run the database is normally created. On later runs the database already exists, so creation is skipped and execution continues. Once the database exists, select it and store the data tables in it.

Now the actual storage code:

# List the local data files
fileList = os.listdir(filepath)
# Store each data file in turn
for fileName in fileList:
    data = pd.read_csv(filepath+fileName, encoding="gbk")
    # Create the table; if it already exists, creation is skipped and the steps below still run
    print('Creating table stock_%s' % fileName[0:6])
    sqlSentence3 = "create table if not exists stock_%s" % fileName[0:6] + "(日期 date, 股票代碼 VARCHAR(10), 名稱 VARCHAR(10), 收盤價 float,\
                       最高價 float, 最低價 float, 開盤價 float, 前收盤 float, 漲跌額 float, 漲跌幅 float, 換手率 float,\
                       成交量 bigint, 成交金額 bigint, 總市值 bigint, 流通市值 bigint)"
    cursor.execute(sqlSentence3)
    # Read the rows one by one and store each in turn (storing a whole table at once has not been tried)
    print('Storing stock_%s' % fileName[0:6])
    length = len(data)
    for i in range(0, length):
        record = tuple(data.loc[i])
        # Build the insert statement
        try:
            sqlSentence4 = "insert into stock_%s" % fileName[0:6] + "(日期, 股票代碼, 名稱, 收盤價, 最高價, 最低價, 開盤價,\
                               前收盤, 漲跌額, 漲跌幅, 換手率, 成交量, 成交金額, 總市值, 流通市值) \
                               values ('%s',%s','%s',%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)" % record
            # The fetched data is messy (missing values, None, none, ...); these must become NULL before inserting
            sqlSentence4 = sqlSentence4.replace('nan','null').replace('None','null').replace('none','null')
            cursor.execute(sqlSentence4)
        except Exception:  # if the insert fails, skip this record and carry on
            continue

Result:

[Image: result of the run]

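The code is not complicated; just pay attention to a few points.

1. Logical structure:
There are two nested loops. The outer loop iterates over the stocks (one file per stock) and the inner loop iterates over every record of the current stock. Put plainly, the stocks are stored one by one, and for each stock the daily records are stored one row at a time. Simple and brute-force? Yes! No more optimized approach was considered at all.

2. Encoding used to read the local data files:
The files are read with 'gbk'. The default should be 'utf8', but that fails on the Chinese text here, presumably because the downloaded files are GBK-encoded.

3. Creating the tables:
Likewise, if a table already exists (checked with if not exists), creation is skipped and the following steps still run, so rows are still inserted. One problem is that the same data may be stored twice; one could skip storage in that case or store only the newest data. I did not add much extra handling here. In addition, the column types are specified explicitly. The last few columns, 成交量 (volume), 成交金額 (turnover amount), 總市值 (total market value) and 流通市值 (circulating market value), hold large numbers, so bigint is used.

4. No primary key is defined on the tables:
I originally planned to use the date as the primary key, but then found that the fetched data actually contains rows with duplicate dates. That breaks primary-key uniqueness and would cause errors. I did not look further into the contents of the data files and do not plan to use this data any further, so I took the easy route and defined no primary key.

5. Building the SQL statement sqlSentence4:
Each stock record is converted directly to a tuple and then interpolated with string formatting (the % operator). I did not think much about the precision issues this may cause and do not know what effect they have. Some of the %s placeholders are wrapped in ' ' so that the value is treated as a string in the SQL statement. One of them, %s', has a single quote only on the right. It corresponds to the stock code, and only the right quote is needed because the string read from the data file already contains the left single quote. This is a quirk of the data file format, which prefixes the value with a single quote to mark it as text.

6. Handling abnormal values:
The text files contain non-standard values such as empty values, None and none. These are all replaced with null, the database's null value.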
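
Points 1, 5 and 6 can be sketched together in a different style from the article's string formatting: clean each value in Python, then hand the rows to the driver as parameters in one batch. This is not the author's method. It assumes a DB-API cursor whose placeholder style is %s (as PyMySQL's is) and a table with the 15 columns created above, in that order. The names `normalize_value`, `insert_records` and `INSERT_SQL` are hypothetical.

```python
import math
import re

# 15 placeholders, one per column of the stock_XXXXXX tables, in table order.
INSERT_SQL = "insert into {table} values (" + ",".join(["%s"] * 15) + ")"


def normalize_value(value):
    """Turn NaN, None, 'None', 'none' and empty strings into None (SQL NULL)."""
    # numpy scalars (as produced by pandas) expose .item(); convert them
    # to plain Python values so the driver can encode them.
    if hasattr(value, 'item'):
        value = value.item()
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, str):
        # Drop the leading single quote that the data file puts before the code.
        text = value.strip().lstrip("'")
        return None if text in ('', 'None', 'none') else text
    return value


def insert_records(cursor, table, records):
    """Insert all records with one executemany call; return the row count."""
    # Table names cannot be passed as parameters, so validate them instead.
    if not re.fullmatch(r'stock_\d{6}', table):
        raise ValueError('unexpected table name: %r' % (table,))
    rows = [tuple(normalize_value(v) for v in record) for record in records]
    cursor.executemany(INSERT_SQL.format(table=table), rows)
    return len(rows)
```

Inside the file loop this could be called as `insert_records(cursor, 'stock_' + fileName[0:6], data.itertuples(index=False, name=None))`. Passing values as parameters leaves quoting and NULL handling to the driver, so the one-sided quote trick and the chain of replace calls are no longer needed.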

After the data has been stored in MySQL, the database connection needs to be closed:

# Close the cursor, commit, and close the database connection
cursor.close()
db.commit()
db.close()

If the connection is not closed, queries and other operations cannot be run from the MySQL side; it behaves as if the database were still in use.
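
One way to make sure the commit and close always happen, even if a statement fails halfway through, is to wrap the work in a small helper. The sketch below assumes a DB-API connection object with cursor(), commit(), rollback() and close() methods, which PyMySQL connections provide; the name `run_in_transaction` is hypothetical.

```python
from contextlib import closing


def run_in_transaction(db, work):
    """Run work(cursor); commit on success, roll back on error, always close."""
    try:
        # closing() calls cursor.close() when the block exits, even on error.
        with closing(db.cursor()) as cursor:
            result = work(cursor)
        db.commit()
        return result
    except Exception:
        db.rollback()
        raise
    finally:
        db.close()
```

For example, `run_in_transaction(db, lambda cursor: cursor.execute("use stockDataBase;"))` runs one statement and releases the connection afterwards. The order on success (close the cursor, commit, close the connection) is the same as in the three lines above.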

4. Querying the MySQL database

db = pymysql.connect(host='localhost', user=name, password=password, database='stockDataBase', charset='utf8')
cursor = db.cursor()
# Query the database and print the contents
cursor.execute('select * from stock_600000')
results = cursor.fetchall()
for row in results:
    print(row)
# Close
cursor.close()
db.commit()
db.close()

Printing every row like this is hopelessly messy. The data can also be viewed on the MySQL side: first select the database with use stockDataBase;, then query it with select * from stock_600000;
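
To avoid flooding the console, the query can ask for only a handful of rows. A minimal sketch, assuming any DB-API cursor and the stock_XXXXXX table naming used in this article; `fetch_preview` is a hypothetical helper name.

```python
import re


def fetch_preview(cursor, table, limit=5):
    """Return at most `limit` rows from one of the stock_XXXXXX tables."""
    # Table names cannot be bound as parameters, so check the name first.
    if not re.fullmatch(r'stock_\d{6}', table):
        raise ValueError('unexpected table name: %r' % (table,))
    cursor.execute('select * from %s limit %d' % (table, int(limit)))
    return cursor.fetchall()
```

Usage would look like `for row in fetch_preview(cursor, 'stock_600000'): print(row)`, which prints at most five rows instead of the whole table.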

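5. Complete code

In fact, the whole task consists of two relatively independent steps: 1. the crawler fetches stock data from the web and saves it to local files; 2. the data in the local files is stored in the MySQL database. I did not consider pushing the scraped data into the database directly (in real time or through a temporary file) and skipping the local data files. This was only an attempt to get the thing working, and the code has not been optimized in any way. I do not actually use it myself; it was just for fun, so this will do for now. Haha~~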
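
The variant the author did not try, skipping the local file, mostly comes down to parsing the downloaded bytes in memory. The sketch below only covers that parsing step and assumes the service returns GBK-encoded CSV text with a header line; `rows_from_csv_bytes` is a hypothetical name and the sample bytes are made up for illustration.

```python
import csv
import io


def rows_from_csv_bytes(raw, encoding='gbk'):
    """Decode downloaded CSV bytes and return (header, rows) without touching disk."""
    text = raw.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=''))
    header = next(reader, [])
    rows = [row for row in reader if row]  # drop empty lines
    return header, rows


# Made-up sample in the same general shape as the downloaded files.
sample = "日期,股票代码,名称,收盘价\r\n2017-12-28,'600000,示例,10.0\r\n".encode('gbk')
header, rows = rows_from_csv_bytes(sample)
print(header)
print(rows)
```

The bytes could come straight from `urllib.request.urlopen(url).read()`, and the resulting rows could then be inserted into the database in the same loop that downloads them.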

# Import the required modules
import urllib.request
import re
import pandas as pd
import pymysql
import os

# Fetch a web page and return its HTML
def getHtml(url):
    html = urllib.request.urlopen(url).read()
    html = html.decode('gbk')
    return html

# Extract the stock codes from the page HTML
def getStackCode(html):
    s = r'<li><a target="_blank" 
    pat = re.compile(s)
    code = pat.findall(html)
    return code
    
######################### Start working ############################
Url = 'http://quote.eastmoney.com/stocklist.html'  # Eastmoney stock list page
filepath = 'F:\\data\\'  # directory where the data files are saved
# Run the scrape
code = getStackCode(getHtml(Url))
# Collect all stock codes that start with 6 (these should be Shanghai market stocks)
CodeList = []
for item in code:
    if item[0]=='6':
        CodeList.append(item)
# Download the data and save it to local CSV files
for code in CodeList:
    print('Fetching data for stock %s' % code)
    url = 'http://quotes.money.163.com/service/chddata.html?code=0'+code+\
        '&end=20171228&fields=TCLOSE;HIGH;LOW;TOPEN;LCLOSE;CHG;PCHG;TURNOVER;VOTURNOVER;VATURNOVER;TCAP;MCAP'
    urllib.request.urlretrieve(url, filepath+code+'.csv')


########################## Store the stock data in the database ###########################

# MySQL user name and password
name = 'xxxx'
password = 'xxxx'  # replace with your own user name and password
# Connect to the local database (the database service must already be running)
db = pymysql.connect(host='localhost', user=name, password=password, charset='utf8')
cursor = db.cursor()
# Create the database stockDataBase; skip if it already exists
sqlSentence1 = "create database if not exists stockDataBase"
cursor.execute(sqlSentence1)
# Select the database that was just created
sqlSentence2 = "use stockDataBase;"
cursor.execute(sqlSentence2)

# List the local data files
fileList = os.listdir(filepath)
# Store each data file in turn
for fileName in fileList:
    data = pd.read_csv(filepath+fileName, encoding="gbk")
    # Create the table; if it already exists, report it and carry on with the steps below
    print('Creating table stock_%s' % fileName[0:6])
    sqlSentence3 = "create table stock_%s" % fileName[0:6] + "(日期 date, 股票代碼 VARCHAR(10), 名稱 VARCHAR(10),\
                       收盤價 float, 最高價 float, 最低價 float, 開盤價 float, 前收盤 float, 漲跌額 float,\
                       漲跌幅 float, 換手率 float, 成交量 bigint, 成交金額 bigint, 總市值 bigint, 流通市值 bigint)"
    try:
        cursor.execute(sqlSentence3)
    except Exception:
        print('Table already exists!')

    # Read the rows one by one and store each in turn (storing a whole table at once has not been tried)
    print('Storing stock_%s' % fileName[0:6])
    length = len(data)
    for i in range(0, length):
        record = tuple(data.loc[i])
        # Build the insert statement
        try:
            sqlSentence4 = "insert into stock_%s" % fileName[0:6] + "(日期, 股票代碼, 名稱, 收盤價, 最高價, 最低價, 開盤價, 前收盤, 漲跌額, 漲跌幅, 換手率, \
            成交量, 成交金額, 總市值, 流通市值) values ('%s',%s','%s',%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)" % record
            # The fetched data is messy (missing values, None, none, ...); these must become NULL before inserting
            sqlSentence4 = sqlSentence4.replace('nan','null').replace('None','null').replace('none','null')
            cursor.execute(sqlSentence4)
        except Exception:
            # If the insert fails, skip this record and carry on
            continue

# Close the cursor, commit, and close the database connection
cursor.close()
db.commit()
db.close()


########################### Query the result of the steps above ##################################

# Open a new database connection
db = pymysql.connect(host='localhost', user=name, password=password, database='stockDataBase', charset='utf8')
cursor = db.cursor()
# Query the database and print the contents
cursor.execute('select * from stock_600000')
results = cursor.fetchall()
for row in results:
    print(row)
# Close
cursor.close()
db.commit()
db.close()