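Eastmoney (东方财富网) publishes a comprehensive listing of announcements from listed companies. This project scrapes those announcements, organizes them, and delivers them to fund managers to support investment decisions. This article walks through the Python crawler that implements it.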
- Requirements analysis
  - Business analysis
  - Page data analysis
  - Program structure design
- Implementation
  - Basic utility functions
  - Logic implementation
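Requirements analysis
1.1 Business analysis
Looking at the announcement summary pages, the Shanghai/Shenzhen A-share, SME board, and ChiNext announcement pages behave almost identically, while the Third Board (NEEQ) announcements are implemented differently. The A-share, SME board, and ChiNext announcements are therefore scraped first.
1.2 Page data analysis
Paging through the list does not change the URL in the address bar, so the data is loaded dynamically via AJAX.
The data request URLs can be found in Chrome's Network panel: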
<pre>
http://data.eastmoney.com/notices/getdata.ashx?StockCode=&FirstNodeType=0&CodeType=1&PageIndex=2&PageSize=50&jsObj=LYqzjWvf&SecNodeType=0&Time=&rt=50032558
http://data.eastmoney.com/notices/getdata.ashx?StockCode=&FirstNodeType=0&CodeType=1&PageIndex=3&PageSize=50&jsObj=maFEjDLX&SecNodeType=0&Time=&rt=50032562
</pre>
The page source also contains the following template:
<pre>
dataurl: "/notices/getdata.ashx?StockCode=&FirstNodeType=0&CodeType=3&PageIndex={page}&PageSize={pageSize}&jsObj={jsname}{param}",
</pre>
Comparing the two, the following parameters can be identified with reasonable confidence:
<pre>
CodeType    announcement board category
PageIndex   page number
PageSize    page size
</pre>
The two parameters that remain unclear are <code>jsObj</code> and <code>rt</code>.
Going further through the network requests, there is a request for load_table_data.js that looks suspicious. Searching load_table_data.js for jsObj finds nothing, but searching for &rt turns up the following code:
<pre>
update: function () {
    var _t = this;
    if (_t.options.beforeupdate(_t))
        return;
    var jsname = _t.getCode(8),
        _url = _t.parperUrl();
    _t.options.code = jsname;
    _url = _url.replace("{jsname}", jsname);
    _url += (_url.indexOf('?') > -1) ? "&rt=" : "?rt=";
    _url += parseInt(parseInt(new Date().getTime()) / 30000);
    _t.loadThead();
    _t.scorllTop();
    _t.showLoading();
    _t.tools.loadJs(_url, _t.options.charset,
        function () {
            if (typeof (_t.options.load_div) != "undefined") {
                if (_t.options && _t.options.nodetemp) {
                    _t.options.nodetemp.style.position = "";
                }
                _t.options.load_div.style.display = "none"
            }
            if (!(eval("typeof " + jsname) == "undefined") || eval("typeof " + jsname == null)) {
                var loaddata = eval(jsname);
                if (jsname != _t.options.code) {
                    return
                }
                _t.options.data = loaddata;
                _t.display()
            } else {
                // alert("Failed to load data, please refresh the page and try again!")
                if (console && console.log) {
                    console.log("tools.loadJs failed, fix later");
                }
            }
        })
},
</pre>
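This is most likely the place. Set a breakpoint at line 257 and step through it. (If you are not familiar with front-end debugging: open the Sources tab in Chrome DevTools, find load_table_data.js, and click on line 257 to set the breakpoint.)
This confirms how the <code>jsObj</code> and <code>rt</code> values in the URL are generated.
The function that generates jsObj is: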
<pre>
getCode: function (num) {
    var str = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    var codes = str.split('');
    num = num || 6;
    var code = "";
    for (var i = 0; i < num; i++) {
        code += codes[Math.floor(Math.random() * 52)]
    }
    return code
},
// Call site in update():
var jsname = _t.getCode(8),
</pre>
The rt value is generated by:
<pre>
parseInt(parseInt(new Date().getTime()) / 30000)
</pre>
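Because `getTime()` returns milliseconds, dividing by 30000 means `rt` only changes once every 30 seconds, which suggests it acts as a cache-busting value rather than a signature (this is an interpretation, not something the site documents). The following is a minimal sketch of that arithmetic; the helper names `rt_from_timestamp` and `rt_to_datetime` are hypothetical and exist only for illustration.

```python
from datetime import datetime, timezone


def rt_from_timestamp(seconds):
    """Same arithmetic as parseInt(new Date().getTime() / 30000)."""
    return int(seconds * 1000) // 30000


def rt_to_datetime(rt):
    """Start (in UTC) of the 30-second window that an rt value stands for."""
    return datetime.fromtimestamp(rt * 30, tz=timezone.utc)


# The two rt values that appear in the captured request URLs
first, second = 50032558, 50032562
print(rt_to_datetime(first).isoformat())
print((second - first) * 30, "seconds between the two windows")
```

Decoding the captured values this way shows that the two sample requests fell into windows four ticks apart, so they were made roughly two minutes apart.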
1.3 Program structure design
(The diagram was drawn directly in Youdao Cloud Notes markdown, so it is a bit rough.)
Implementation
Basic utility functions
- Download function and parameter parsing
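See the earlier article on scraping the digest of important listed-company announcements (上市公司重要公告集锦抓取).
- jsObj simulation function
<pre>
import random

# Python port of getCode in load_table_data.js
def getCode(num=6):
    s = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    # Pick num random letters, like codes[Math.floor(Math.random() * 52)]
    return "".join(random.choice(s) for _ in range(num))
</pre>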
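- rt simulation function
<pre>
import time

# Mirrors: _url += parseInt(parseInt(new Date().getTime()) / 30000);
def getRightTime():
    # time.time() is in seconds, so dividing by 30 matches milliseconds / 30000
    return int(time.time() / 30)
</pre>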
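The download loop later in this article calls `getUrl(apiurl, codeType, page, code, rt)`, but the body of that helper is not shown. The sketch below is one plausible way to write it, assuming the endpoint and the fixed parameters (`FirstNodeType=0`, `SecNodeType=0`, `PageSize=50`, empty `StockCode` and `Time`) are exactly those seen in the captured URLs; the `page_size` argument and the value of `apiurl` are assumptions, not the author's code.

```python
from urllib.parse import urlencode

# Assumed endpoint, copied from the URLs captured in the Network panel
apiurl = "http://data.eastmoney.com/notices/getdata.ashx"


def getUrl(apiurl, code_type, page, js_obj, rt, page_size=50):
    """Build a request URL with the same parameter order as the captured ones."""
    params = [
        ("StockCode", ""),
        ("FirstNodeType", 0),
        ("CodeType", code_type),
        ("PageIndex", page),
        ("PageSize", page_size),
        ("jsObj", js_obj),
        ("SecNodeType", 0),
        ("Time", ""),
        ("rt", rt),
    ]
    return apiurl + "?" + urlencode(params)


# Rebuild the first captured URL from its parts
print(getUrl(apiurl, 1, 2, "LYqzjWvf", 50032558))
```

Feeding in the values from the first captured request reproduces that URL, which is a quick way to check the parameter list against what the browser sends.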
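- Email attachment upload
See the earlier article on scraping the digest of important listed-company announcements; the attachment-related part is as follows:
<pre>
import base64
from email.mime.application import MIMEApplication
from email.mime.text import MIMEText

mail_msg = read_html(now)
# Email body
# msg.attach(MIMEText(now + ": secondary-market announcements. See the attached Excel file for details", 'plain', 'utf-8'))
msg.attach(MIMEText(mail_msg, 'html', 'utf-8'))
# Build attachment 2 from the xls file in the current directory
with open(fileName, 'rb') as f:
    att2 = MIMEApplication(f.read())  # application/octet-stream, base64-encoded
# Encode the file name so that Chinese names are not garbled on download
att2.add_header('Content-Disposition', 'attachment', filename='=?utf-8?b?' +
                base64.b64encode(fileName.encode('UTF-8')).decode('ascii') + '?=')
msg.attach(att2)
</pre>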
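As an alternative to encoding the file name by hand, Python 3's `email.message.EmailMessage` can handle a non-ASCII attachment name itself. The sketch below is not the author's code: the function name `build_notice_mail`, the subject line, and the sample file name are made up for illustration, and it only builds the message without sending it.

```python
from email.message import EmailMessage


def build_notice_mail(html_body, attachment_bytes, attachment_name):
    """Build an HTML mail with one binary attachment (nothing is sent here)."""
    msg = EmailMessage()
    msg["Subject"] = "Listed-company announcements"  # hypothetical subject
    # HTML body, as in the snippet above
    msg.set_content(html_body, subtype="html")
    # The library encodes a non-ASCII file name on its own
    msg.add_attachment(
        attachment_bytes,
        maintype="application",
        subtype="octet-stream",
        filename=attachment_name,
    )
    return msg


demo = build_notice_mail("<p>demo</p>", b"\x00\x01xls-bytes", "公告.xls")
attachments = [p for p in demo.walk() if p.get_content_disposition() == "attachment"]
print(attachments[0].get_filename())
```

Whether every mail client displays the resulting name correctly is not verified here, so the hand-encoded header above may still be the safer choice for older clients.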
- Date comparison
<pre>
import time

# noticeCate is a module-level setting defined elsewhere in the script
def time_compare(notice_date):
    # Timestamp of the announcement date
    tt = time.mktime(time.strptime(notice_date, "%Y-%m-%d"))
    if noticeCate == 1:
        # A-share announcements: keep today's
        st = time.strftime("%Y-%m-%d", time.localtime(time.time()))
    else:
        # NEEQ announcements: keep those from the previous day onward
        st = time.strftime(
            "%Y-%m-%d", time.localtime(time.time() - 60 * 60 * 24))
    # Timestamp of local midnight on the cutoff day
    now_ticks = time.mktime(time.strptime(st, "%Y-%m-%d"))
    # On Mondays this needs to be a strict greater-than
    return tt >= now_ticks
</pre>
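The same cutoff rule can be written without the global `noticeCate` and without second-based arithmetic, which makes it easier to test. This is a minimal sketch under that assumption; the name `is_recent` and its `days_back` and `today` parameters are hypothetical and not part of the original script.

```python
from datetime import date, datetime, timedelta


def is_recent(notice_date, days_back=0, today=None):
    """True if notice_date (YYYY-MM-DD) is on or after today minus days_back."""
    today = today or date.today()
    cutoff = today - timedelta(days=days_back)
    return datetime.strptime(notice_date, "%Y-%m-%d").date() >= cutoff


# days_back=0 corresponds to the A-share rule, days_back=1 to the NEEQ rule
fixed_today = date(2017, 7, 25)
print(is_recent("2017-07-25", 0, fixed_today))
print(is_recent("2017-07-24", 0, fixed_today))
print(is_recent("2017-07-24", 1, fixed_today))
```

Passing `today` explicitly keeps the function deterministic, so the cutoff behaviour can be checked for any date without waiting for that day.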
- Excel read/write
<pre>
import xlrd
import xlwt

# Write one worksheet
def write_sheet(workbook, sheetName, rows):
    worksheet = workbook.add_sheet(sheetName)
    # Header row: code, name, announcement title, announcement type
    worksheet.write(0, 0, label="代码")
    worksheet.write(0, 1, label="名称")
    worksheet.write(0, 2, label="公告标题")
    worksheet.write(0, 3, label="公告类型")
    for x, row in enumerate(rows):
        for y in range(0, 4):
            if y == 2:
                # The title cell links to the detail page stored in row[4]
                alink = 'HYPERLINK("%s";"%s")' % (row[4], row[2])
                worksheet.write(x + 1, y, xlwt.Formula(alink))
            else:
                worksheet.write(x + 1, y, row[y])

# Open a workbook; logs the error and returns None on failure
def open_excel(file='file.xls'):
    try:
        return xlrd.open_workbook(file)
    except Exception as e:
        logger.debug(str(e))
</pre>
- Paged data download
<pre>
import json

def do_notice(notices, plate):
    # Fetch at most 9 pages for the given board
    for page in range(1, 10):
        rt = getRightTime()
        code = getCode(8)
        url = getUrl(apiurl, plate["codeType"], page, code, rt)
        jsdata = download_get_html(url)
        if jsdata is None:
            logger.debug("no notices %s %d" % (plate["name"], len(notices)))
            return
        # Drop the 15-character JavaScript prefix and the last character to get JSON
        json_str = jsdata[15:-1]
        datas = json.loads(json_str)["data"]
        for data in datas:
            notice = parser_data(data)
            if notice is not None:
                notices.append(notice)
            else:
                # The first notice older than the cutoff ends the crawl for this board
                logger.debug("page end notices %s %d" % (plate["name"], len(notices)))
                return
</pre>
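The fixed slice `jsdata[15:-1]` only works while the JavaScript wrapper around the JSON is exactly 15 characters long, which holds for an 8-letter `jsObj` if the response has the form `var <jsObj> = {...};`. That response form is an assumption inferred from the slice and from the `eval(jsname)` call, and the sample string below is made up. A slightly more tolerant sketch, using the hypothetical helper `extract_payload`, locates the outermost braces instead:

```python
import json


def extract_payload(jsdata):
    """Return the JSON object embedded in a 'var name = {...};' response."""
    start = jsdata.find("{")
    end = jsdata.rfind("}")
    if start == -1 or end < start:
        return None  # no JSON object found
    return json.loads(jsdata[start:end + 1])


# Made-up sample in the assumed response format
sample = 'var LYqzjWvf = {"data": [{"NOTICETITLE": "demo"}]};'
print(extract_payload(sample)["data"][0]["NOTICETITLE"])
```

This version keeps working if the variable name length or the spacing around `=` changes, at the cost of assuming the payload is a single JSON object.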
- Data parsing
<pre>
import base64
from urllib.parse import quote

def parser_data(data):
    temp = data["CDSY_SECUCODES"][0]
    noticedate = data["NOTICEDATE"]
    # Keep only the date part before the 'T'
    date = noticedate[0:noticedate.index('T')]
    code = temp["SECURITYCODE"]
    name = temp["SECURITYSHORTNAME"]
    title = data["NOTICETITLE"]
    typeName = '公司公告'  # default type: "company announcement"
    if data["ANN_RELCOLUMNS"]:
        typeName = data["ANN_RELCOLUMNS"][0]["COLUMNNAME"]
    # The detail URL ends with the base64 of the URL-quoted UTF-8 short name
    name_token = base64.b64encode(quote(name).encode("ascii")).decode("ascii")
    detailLink = (baseurl + '/notices/detail/' + code + '/' +
                  data["INFOCODE"] + ',' + name_token + '.html')
    # Return the row only if the notice passes the date filter
    if time_compare(date):
        return [code, name, title, typeName, detailLink, date]
    return None
</pre>
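The last segment of the detail link in `parser_data` is the company short name, percent-encoded as UTF-8 and then base64-encoded. The sketch below isolates that step so it can be checked on its own; the helper names `encode_name` and `decode_name` are hypothetical, and the sample name is only an example.

```python
import base64
from urllib.parse import quote, unquote


def encode_name(name):
    """Percent-encode the name as UTF-8, then base64-encode the result."""
    return base64.b64encode(quote(name).encode("ascii")).decode("ascii")


def decode_name(token):
    """Reverse encode_name: base64-decode, then undo the percent-encoding."""
    return unquote(base64.b64decode(token).decode("ascii"))


token = encode_name("平安银行")
print(token)
print(decode_name(token))
```

Decoding a token taken from a real detail URL with `decode_name` is a simple way to confirm that the link format assumed here matches what the site uses.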