Scenario
There is a customer-facing web application. The framework records access logs and archives them periodically into a specific directory, laid out as follows:
/onlinelogs/<app name>/<environment>/<year>/<month>/<day>/<hour>/
for example /onlinelogs/<app_name>/Prod/2018/08/07/01.
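Because the layout is fixed, the date portion can be recovered from any such directory path by simple slicing. A minimal sketch (`myapp` below is a made-up application name):

```python
# a sketch of recovering the day from an archive directory path
# ('myapp' is a made-up application name)
root = '/onlinelogs/myapp/Prod/2018/08/07/01'
# the last 13 characters are 'YYYY/MM/DD/HH'; dropping '/HH' leaves the day
day = root[-13:-3]
print(day)  # → 2018/08/07
```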
Inside such a directory:
- There may be multiple access log files; their names start with access-log and they are compressed into read-only .gz files.
- Some lines in the access log files record HTTP requests, in the format shown below.
- From these lines we can read the requested resource, the response code, and the client information:
222.67.225.134 - - [04/Aug/2018:01:16:44 +0000] "GET /?ref=as_cn_ags_resource_tb&ck-tparam-anchor=123067 HTTP/1.1" 200 7798 "https://gs.amazon.cn/resources.html/ref=as_cn_ags_hnav1_re_class" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15"
222.67.225.134 - - [04/Aug/2018:01:16:54 +0000] "GET /tndetails?tnid=3be3f34dee8a4bf08baa072a478fc882 HTTP/1.1" 200 9152 "https://gs.amazon.cn/sba/?ref=as_cn_ags_resource_tb&ck-tparam-anchor=123067&tnm=Offline" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15"
222.67.225.134 - - [04/Aug/2018:01:17:37 +0000] "GET /paymentinfo?oid=10763561&rtxref=164f9f8e2b5e4d7790d02d1220eae435 HTTP/1.1" 200 7138 "https://gs.amazon.cn/sba/paymentinfo?oid=10763561&rtxref=164f9f8e2b5e4d7790d02d1220eae435" "Mozilla/5.0 (Linux; Android 8.1.0; DE106 Build/OPM1.171019.026; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/62.0.3202.84 Mobile Safari/537.36 AliApp(DingTalk/4.5.3) com.alibaba.android.rimet/0 Channel/10006872 language/zh-CN"
222.67.225.134 - - [04/Aug/2018:01:17:39 +0000] "GET /paymentinfo?oid=10763561&rtxref=164f9f8e2b5e4d7790d02d1220eae435 HTTP/1.1" 200 7138 "https://gs.amazon.cn/sba/paymentinfo?oid=10763561&rtxref=164f9f8e2b5e4d7790d02d1220eae435" "Mozilla/5.0 (Linux; Android 8.1.0; DE106 Build/OPM1.171019.026; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/62.0.3202.84 Mobile Safari/537.36 AliApp(DingTalk/4.5.3) com.alibaba.android.rimet/0 Channel/10006872 language/zh-CN"
222.67.225.134 - - [04/Aug/2018:01:17:40 +0000] "GET /paymentinfo?oid=10763561&rtxref=164f9f8e2b5e4d7790d02d1220eae435 HTTP/1.1" 200 7138 "https://gs.amazon.cn/sba/paymentinfo?oid=10763561&rtxref=164f9f8e2b5e4d7790d02d1220eae435" "Mozilla/5.0 (Linux; Android 8.1.0; DE106 Build/OPM1.171019.026; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/62.0.3202.84 Mobile Safari/537.36 AliApp(DingTalk/4.5.3) com.alibaba.android.rimet/0 Channel/10006872 language/zh-CN"
140.243.121.197 - - [04/Aug/2018:01:17:41 +0000] "GET /?ref=as_cn_ags_resource_tb HTTP/1.1" 302 - "https://gs.amazon.cn/resources.html/ref=as_cn_ags_hnav1_re_class" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
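To illustrate how such a line can be parsed, here is a small sketch using a shortened version of one of the sample lines above, with the same regular expressions the script further down uses:

```python
import re

# regexes matching those used in the analysis script
http_request_regex = re.compile(r'^.*GET.*gs.amazon.cn')
request_page_regex = re.compile(r'^.*GET (.*) HTTP.*$')

# a shortened version of one of the sample log lines above
line = ('222.67.225.134 - - [04/Aug/2018:01:16:54 +0000] '
        '"GET /tndetails?tnid=3be3f34dee8a4bf08baa072a478fc882 HTTP/1.1" '
        '200 9152 "https://gs.amazon.cn/sba/" "Mozilla/5.0"')

if http_request_regex.search(line):
    # capture the requested resource, then strip the query parameters
    request_page = request_page_regex.search(line).group(1)
    if '?' in request_page:
        request_page = request_page[:request_page.find('?')]

print(request_page)  # → /tndetails
```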
Analysis goals
- Count the number of requests in different time periods
- Count the number of requests from PC and Mobile clients
- Count the number of requests for each page
Basic approach:
- Since the original log files are compressed and read-only, create a temporary directory /tmp/logs/unziped_logs and decompress the log files there.
- Use the regular expression ^access\-log.*gz+$ to select the log files by name.
- Use the regular expression ^.*GET (.*) HTTP.*$ to pick out the HTTP request lines and capture the requested page.
- Classify the client as Mobile or PC by whether the line contains Mobile.
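As a quick check of the file-name filter, a sketch (the file names are made up for illustration):

```python
import re

# the file-name filter from the approach above
log_file_name_regex = re.compile(r'^access\-log.*gz+$')

# made-up file names for illustration
names = ['access-log.2018-08-04-01.gz', 'error-log.2018-08-04-01.gz', 'access-log.1']
matches = [n for n in names if log_file_name_regex.search(n)]
print(matches)  # → ['access-log.2018-08-04-01.gz']
```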
The code is as follows (some content redacted):
#!/usr/bin/python3
import os
import os.path
import re
import shutil
import gzip
from collections import defaultdict

# define the zipped source and unzipped log file directories
source_logs_dir = '/onlinelogs/<app_name>/Prod'
unziped_logs_dir = '/tmp/logs/unziped_logs'

# clear the unzipped log file directory if it exists
if os.path.exists(unziped_logs_dir):
    shutil.rmtree(unziped_logs_dir)

# create the unzipped log file directory (including parent directories)
os.makedirs(unziped_logs_dir)

# regex used to match target log file names
log_file_name_regex = re.compile(r'^access\-log.*gz+$')
# regex used to match HTTP requests
http_request_regex = re.compile(r'^.*GET.*gs.amazon.cn')
# regex used to capture the requested page
request_page_regex = re.compile(r'^.*GET (.*) HTTP.*$')
# request_page_regex = re.compile(r'^.*GET (.*)\?.*$')

# a dictionary to store the HTTP request count of each day
day_count = defaultdict(int)
# a dictionary to store the count of each device (PC or Mobile)
device_count = defaultdict(int)
device_count['PC'] = 0
device_count['Mobile'] = 0
# a dictionary to store the count of each request page
request_page_count = defaultdict(int)

for root, dirs, files in os.walk(source_logs_dir):
    for name in files:
        # find the target log files
        if log_file_name_regex.search(name):
            # parse the day ('YYYY/MM/DD') from the directory path
            day = root[-13:-3]
            # copy the target log file into the temporary directory
            shutil.copyfile(os.path.join(root, name), os.path.join(unziped_logs_dir, name))
            # unzip the log file; 'rt' yields str lines instead of bytes
            with gzip.open(os.path.join(unziped_logs_dir, name), 'rt') as unziped_log_file:
                http_request_count = 0
                pc_count = 0
                mobile_count = 0
                for line in unziped_log_file:
                    if http_request_regex.search(line):
                        # parse the request page
                        regex_obj = request_page_regex.search(line)
                        request_page = regex_obj.group(1)
                        # remove the query parameters of the request page
                        if '?' in request_page:
                            request_page = request_page[:request_page.find('?')]
                        http_request_count += 1
                        if 'Mobile' in line:
                            mobile_count += 1
                        else:
                            pc_count += 1
                        # update the count of each request page
                        request_page_count[request_page] += 1
            # update the HTTP request count of each day
            day_count[day] += http_request_count
            # update the count of each device (PC or Mobile)
            device_count['PC'] += pc_count
            device_count['Mobile'] += mobile_count
            # remove the unzipped copy of the log file
            os.remove(os.path.join(unziped_logs_dir, name))

# print the HTTP request count of each day
total = 0
print('HTTP request count of each day')
for day, count in sorted(day_count.items()):
    print(day, ':', count)
    total = total + count
print('Total =', total)
print('###############################')

# print the count of each device (PC or Mobile)
total = 0
print('count of each device (PC or Mobile)')
for device, count in sorted(device_count.items()):
    print(device, ':', count)
    total = total + count
print('Total =', total)
print('###############################')

# print the count of each request page
total = 0
print('count of each request page')
for request_page, count in sorted(request_page_count.items()):
    print(request_page, ':', count)
    total = total + count
print('Total =', total)
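The report above lists pages in alphabetical order; if a ranking by traffic is wanted instead, the same counts can be sorted by value. A small sketch with made-up hit data:

```python
from collections import defaultdict

# made-up page hits for illustration
request_page_count = defaultdict(int)
for page in ['/', '/tndetails', '/', '/paymentinfo', '/paymentinfo', '/paymentinfo']:
    request_page_count[page] += 1

# rank pages by hit count, most visited first
top_pages = sorted(request_page_count.items(), key=lambda kv: kv[1], reverse=True)
print(top_pages)  # → [('/paymentinfo', 3), ('/', 2), ('/tndetails', 1)]
```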