Link monitoring
Previous approaches
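Full verification: logically, compare the result of select * from tcbuyer.order with the result of select * from tc.order.
Pseudo-incremental verification: compare the data from the previous hour.
Single-stream incremental verification: an event-based comparison. When the buyer database generates an order, MySQL produces a corresponding binlog entry. The single-stream incremental verification system uses that binlog entry as the trigger, parses its content, and queries the seller database in real time to check whether the matching record exists.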
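The difference between the full and the event-driven approach can be sketched as follows. This is a minimal illustration, not the production system: in-memory SQLite tables stand in for the buyer and seller databases, the table is named orders because order is a reserved word, and the dict passed to on_binlog_event is a hypothetical shape for a parsed binlog entry.

```python
import sqlite3

def make_db(rows):
    """Create an in-memory stand-in for one side (buyer or seller)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn

def full_check(buyer, seller):
    """Full verification: load both result sets and return rows found on only one side."""
    b = set(buyer.execute("SELECT order_id, amount FROM orders"))
    s = set(seller.execute("SELECT order_id, amount FROM orders"))
    return sorted(b ^ s)

def on_binlog_event(event, seller):
    """Single-stream incremental verification: one buyer-side event triggers one seller-side lookup."""
    row = seller.execute(
        "SELECT amount FROM orders WHERE order_id = ?", (event["order_id"],)
    ).fetchone()
    return row is not None and row[0] == event["amount"]

buyer = make_db([(1, 100), (2, 250)])
seller = make_db([(1, 100)])

mismatches = full_check(buyer, seller)
event_ok = on_binlog_event({"order_id": 1, "amount": 100}, seller)
event_missing = on_binlog_event({"order_id": 2, "amount": 250}, seller)
print(mismatches, event_ok, event_missing)
```

The full check scans everything on each run, while the event-driven check touches only the row named by the event, which is why it can run in near real time.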
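AMG's verification graph model: Check Graph
Suppose four business systems on the trade link need reconciliation: trade, inventory, funds, and payment. The events involved are the order-placed event, the inventory-deduction event, the red-packet-funds-used event, and the payment event. The reconciliation requirements are:
Check the trade event against the inventory event;
Check the trade event against the funds event;
Check the trade event against the payment event;
Check the funds event against the payment event.
If any one of these four checks fails, the business systems are considered to be in an abnormal state and the failure must be reported promptly.
This is clearly a graph model. For example, if event A is checked against event B, there is an edge connecting nodes A and B. Taking events as nodes (Node) and the check between two events as an edge (Edge) yields a graph (Graph). The graph for the scenario above looks like this:
Trade <-----check---->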
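The four checks listed above can be sketched as a small graph structure. This is an illustrative sketch only: the CheckGraph class, the same_order check, and the event dict shape are all hypothetical names, not AMG's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]
Check = Callable[[Event, Event], bool]

@dataclass
class CheckGraph:
    # Each edge (a, b) carries the check to run between event a and event b.
    edges: Dict[Tuple[str, str], Check] = field(default_factory=dict)

    def add_edge(self, a: str, b: str, check: Check) -> None:
        self.edges[(a, b)] = check

    def run(self, events: Dict[str, Event]) -> List[Tuple[str, str]]:
        """Return every edge whose check fails or whose events are missing."""
        failed = []
        for (a, b), check in self.edges.items():
            ea, eb = events.get(a), events.get(b)
            if ea is None or eb is None or not check(ea, eb):
                failed.append((a, b))
        return failed

def same_order(a: Event, b: Event) -> bool:
    # Hypothetical check: both events must refer to the same order.
    return a["order_id"] == b["order_id"]

graph = CheckGraph()
graph.add_edge("trade", "inventory", same_order)
graph.add_edge("trade", "funds", same_order)
graph.add_edge("trade", "payment", same_order)
graph.add_edge("funds", "payment", same_order)

events = {
    "trade": {"order_id": 1},
    "inventory": {"order_id": 1},
    "funds": {"order_id": 1},
    "payment": {"order_id": 2},  # payment recorded against a different order
}
failed = graph.run(events)
print(failed)
```

A non-empty result means at least one edge failed, which corresponds to the "report promptly" rule.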
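To decouple business logic and protect the performance and availability of the main link, the group's systems depend on each other through a mix of synchronous and asynchronous calls and of strong and weak dependencies. Network jitter, a bug in a business system, or a failure in a subsystem can each leave business data inconsistent. Take the two most critical systems, trade and inventory: if a user places an order and inventory is not deducted, overselling is likely; if a user closes an order and inventory is not restored, the result is underselling. Both are data inconsistencies between the trade and inventory systems.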
from influxdb import InfluxDBClient

json_body = [
    {
        "measurement": "cpu_load_short",
        "tags": {
            "host": "server01",
            "region": "us-west"
        },
        "time": "2009-11-10T23:00:00Z",
        "fields": {
            "value": 0.64
        }
    }
]

client = InfluxDBClient('localhost', 8086, 'root', 'root', 'example')
client.create_database('example')
client.write_points(json_body)
result = client.query('select value from cpu_load_short;')
print("Result: {0}".format(result))
insert prism_trace_log,serverApp='camel',serviceName='index.api', rt=50 '2017-09-08 13:00:01'
// The TRACE type does not output rpcId by default
==========================BaseModel
traceId
rpcId
timestamp
rpcType
hostIp
==========================RpcModel
clientApp
clientIp
clientSpan
serverApp
serverIp
serverSpan
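opName // operation name, generally determined by the kind of RPC, e.g. LOCAL, SYNC, CALLBACK, FUTURE; for databases, e.g. QUERY, UPDATE, INSERT, DELETE
opType // operation type, generally determined by the kind of RPC, e.g. the serialization method or a read/write flag; for databases, R or W for read or write
serviceName // interface name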
methodName //方法名
error //0
result // 1,2,3,3,4,5
==========================
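The field list above can be written down as a record type. The sketch below keeps the field names exactly as listed so they map onto the query columns; the Python types, the units in the comments, and the split into tags versus fields are assumptions (the queries in these notes group by serverApp, serviceName and clientApp, which suggests those are tags).

```python
from dataclasses import asdict, dataclass

@dataclass
class TraceRecord:
    # BaseModel
    traceId: str
    rpcId: str
    timestamp: int      # assumed: epoch milliseconds
    rpcType: int
    hostIp: str
    # RpcModel
    clientApp: str
    clientIp: str
    clientSpan: int     # assumed: milliseconds
    serverApp: str
    serverIp: str
    serverSpan: int     # assumed: milliseconds
    opName: str
    opType: str
    serviceName: str
    methodName: str
    error: int
    result: int

# Assumed split: dimensions used in "group by" become tags, measurements become fields.
TAG_KEYS = ("rpcType", "clientApp", "serverApp", "serviceName", "methodName")
FIELD_KEYS = ("clientSpan", "serverSpan", "error")

def to_point(rec):
    """Convert a record into a dict shaped like the json_body example in these notes."""
    data = asdict(rec)
    return {
        "measurement": "prism_trace",
        "tags": {k: str(data[k]) for k in TAG_KEYS},
        "time": rec.timestamp,
        "fields": {k: data[k] for k in FIELD_KEYS},
    }

record = TraceRecord(
    traceId="t1", rpcId="0.1", timestamp=1504875601000, rpcType=1, hostIp="10.0.0.1",
    clientApp="camel", clientIp="10.0.0.1", clientSpan=12,
    serverApp="whale", serverIp="10.0.0.2", serverSpan=10,
    opName="SYNC", opType="", serviceName="MemberQueryService", methodName="query",
    error=0, result=1,
)
point = to_point(record)
print(point)
```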
HTTP total
select count(*),sum(error),avg(serverSpan) from prism_trace where rpcType=0 and serverApp = ?
HTTP by page
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=0 and serverApp = ? group by serviceName
RPC total
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ?
RPC by service
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? group by serviceName
RPC service sources
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? and serviceName=? group by clientApp
RPC service destinations
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and clientApp = ? and clientService=? group by serverApp
RPC application sources
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? group by clientApp
RPC application destinations
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and clientApp = ? group by serverApp
DB total
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=3 and serverApp = ?
DB by table
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? group by serviceName
DB table sources
select count(*),avg(serverSpan),max(serverSpan),sum(error) from prism_trace where rpcType=1 and serverApp = ? and serviceName=? group by clientApp
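To try the shape of these aggregations without a running time-series database, the "RPC by service" query can be run against an in-memory SQLite table. SQLite is only a stand-in here, and the sample rows and the reduced column set are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prism_trace ("
    "rpcType INTEGER, serverApp TEXT, clientApp TEXT, "
    "serviceName TEXT, serverSpan INTEGER, error INTEGER)"
)
# Made-up sample rows: three RPC calls into 'whale' plus one HTTP entry.
rows = [
    (1, "whale", "camel", "MemberQueryService", 10, 0),
    (1, "whale", "camel", "MemberQueryService", 30, 1),
    (1, "whale", "tlive", "CommentService", 20, 0),
    (0, "whale", "", "/login.do", 50, 0),
]
conn.executemany("INSERT INTO prism_trace VALUES (?, ?, ?, ?, ?, ?)", rows)

def rpc_by_service(conn, server_app):
    """Call count, average and max server span, and error count per service."""
    sql = (
        "SELECT serviceName, count(*), avg(serverSpan), max(serverSpan), sum(error) "
        "FROM prism_trace WHERE rpcType = 1 AND serverApp = ? "
        "GROUP BY serviceName ORDER BY serviceName"
    )
    return conn.execute(sql, (server_app,)).fetchall()

stats = rpc_by_service(conn, "whale")
for row in stats:
    print(row)
```

The ? placeholder is bound by the driver, so the application name never has to be spliced into the SQL string.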
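Error types:
/**
 * Unknown
 */
UNKNOWN,
/**
 * Success
 */
OK,
/**
 * Business error
 */
BIZ_ERROR,
/**
 * RPC error
 */
RPC_ERROR,
/**
 * Timeout
 */
TIMEOUT,
/**
 * Soft error, generally used when a resource is not found, a lookup misses,
 * a lock is not acquired, or an update is skipped because of a version
 * mismatch; how it is determined depends on the middleware
 */
SOFT_ERROR,
/**
 * Rate-limit error
 */
LIMIT_ERROR,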
The model's fields are as follows:
OpName: DB operation name, e.g. QUERY, UPDATE, and (added in TDDL v5 and later) INSERT, DELETE
OpType: DB operation type, R or W for read or write
ServiceDim1: physical database name
ServiceDim2: tableName, e.g. JOIN:TABLE_A,TABLE_C,TABLE_B
ServiceDim3: logical SQL code
ServerName: (db@dbName), e.g. andor_mysql_group
ClientName: clientAppId
ServerDimKey: TDDL_opName@dbName:tableName
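Reading the ServerDimKey pattern literally, the key could be assembled as below. This is a guess at the intended format: the function names are hypothetical, and mapping QUERY to R and every other operation to W is an assumption based on the OpName and OpType descriptions.

```python
def op_type_for(op_name):
    """Assumed mapping: QUERY is a read, every other DB operation is a write."""
    return "R" if op_name.upper() == "QUERY" else "W"

def server_dim_key(op_name, db_name, table_name):
    """Build a key following the pattern TDDL_opName@dbName:tableName."""
    return f"TDDL_{op_name}@{db_name}:{table_name}"

key = server_dim_key("QUERY", "andor_mysql_group", "TABLE_A")
print(key, op_type_for("QUERY"), op_type_for("UPDATE"))
```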
tlive,,mtop/get.do(),500
1. tlive,fun,CommentService,save,100
2. tlive,fun,CommentService,save, 90
fun,db,"table1",100
tlive,fun,MemberService,save,200
fun,db,"table2",100
/*
 * Numeric codes for RPC types
 */
// @formatter:off
public static final int RPC_TYPE_UNKNOWN = 255;
public static final int RPC_TYPE_TRACE = 0;
public static final int RPC_TYPE_HSF = 1;
public static final int RPC_TYPE_HSF_SERVER = 2;
public static final int RPC_TYPE_NOTIFY = 3;
public static final int RPC_TYPE_TDDL = 4;
public static final int RPC_TYPE_TAIR = 5;
public static final int RPC_TYPE_SEARCH = 6;
public static final int RPC_TYPE_MASTER = 11;
public static final int RPC_TYPE_SLAVE = 12;
public static final int RPC_TYPE_METAQ = 13;
public static final int RPC_TYPE_DRDS = 14;
public static final int RPC_TYPE_TFS = 15;
public static final int RPC_TYPE_ALIPAY = 16;
public static final int RPC_TYPE_HTTP_B = 20;
public static final int RPC_TYPE_HTTP = 25;
public static final int RPC_TYPE_SENTINEL = 26;
public static final int RPC_TYPE_LOCAL = 30;
public static final int RPC_TYPE_JINGWEI = 32;
public static final int RPC_TYPE_ISEARCH = 36;
public static final int RPC_TYPE_LOCAL_NG = 40;
public static final int RPC_TYPE_CSB_SERVER = 52;
public static final int RPC_TYPE_HTTP_SERVER = 251;
public static final int RPC_TYPE_METAQ_RCV = 252;
public static final int RPC_TYPE_ACCESS = 253;
public static final int RPC_TYPE_NOTIFY_RCV = 254;
// Custom RPC types
public static final int RPC_TYPE_CUSTOM_TRACE = 90;
public static final int RPC_TYPE_CUSTOM_RPC_CLIENT = 91;
public static final int RPC_TYPE_CUSTOM_RPC_SERVER = 92;
public static final int RPC_TYPE_CUSTOM_MESSAGE_PUB = 93;
public static final int RPC_TYPE_CUSTOM_MESSAGE_SUB = 96;
public static final int RPC_TYPE_CUSTOM_DB = 94;
public static final int RPC_TYPE_CUSTOM_CACHE = 95;
public static final int RPC_TYPE_CUSTOM_PROTOCOL_CLIENT = 97;
public static final int RPC_TYPE_CUSTOM_PROTOCOL_SERVER = 98;
// @formatter:on
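When reading query results it helps to turn a numeric rpcType back into a name. A small lookup sketch follows; it copies only a subset of the constants above, and the function names are made up for illustration.

```python
# Subset of the constants above, with the RPC_TYPE_ prefix dropped.
RPC_TYPE_NAMES = {
    0: "TRACE",
    1: "HSF",
    2: "HSF_SERVER",
    3: "NOTIFY",
    4: "TDDL",
    5: "TAIR",
    25: "HTTP",
    255: "UNKNOWN",
}

def rpc_type_name(code):
    """Return the name for a code, falling back to UNKNOWN for codes not in the table."""
    return RPC_TYPE_NAMES.get(code, "UNKNOWN")

def is_custom(code):
    """The custom RPC types listed above occupy the contiguous range 90 to 98."""
    return 90 <= code <= 98

print(rpc_type_name(1), rpc_type_name(4), rpc_type_name(77), is_custom(94))
```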
->A->B->C
client, server, type
-,a,0
a,b,1
a,b,2
b,c,1
b,c,2
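In the rows above each hop shows up twice, once with type 1 and once with type 2 (RPC_TYPE_HSF and RPC_TYPE_HSF_SERVER in the constants), presumably logged by the client and the server respectively. A minimal sketch of collapsing those rows back into the chain, assuming a linear chain in which each caller has exactly one callee:

```python
rows = [
    ("-", "a", 0),
    ("a", "b", 1),
    ("a", "b", 2),
    ("b", "c", 1),
    ("b", "c", 2),
]

def call_chain(rows, root="-"):
    """Rebuild the call chain from (client, server, type) rows."""
    # Client-side and server-side records describe the same hop, so they collapse into one edge.
    next_hop = {}
    for client, server, _rpc_type in rows:
        next_hop[client] = server
    chain, node = [], root
    # Stop at the end of the chain, or if an edge would revisit a node (cycle guard).
    while node in next_hop and next_hop[node] not in chain:
        node = next_hop[node]
        chain.append(node)
    return chain

chain = call_chain(rows)
print("->" + "->".join(chain))
```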
LOCAL_IP_ADDRESS= getLocalInetAddress();
IP_16 = getIP_16(LOCAL_IP_ADDRESS);
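The body of getIP_16 is not shown in these notes. One plausible reading of the name is that it hex-encodes each octet of the IPv4 address into two characters, giving an 8-character string; the sketch below implements that assumption and should not be taken as the actual implementation.

```python
def ip_to_hex(ip):
    """Encode a dotted IPv4 address as 8 lowercase hex characters (assumed format)."""
    parts = ip.split(".")
    if len(parts) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    octets = [int(p) for p in parts]
    if any(o < 0 or o > 255 for o in octets):
        raise ValueError(f"octet out of range in {ip!r}")
    return "".join(f"{o:02x}" for o in octets)

def hex_to_ip(encoded):
    """Inverse of ip_to_hex."""
    return ".".join(str(int(encoded[i:i + 2], 16)) for i in range(0, 8, 2))

encoded = ip_to_hex("10.0.0.1")
print(encoded, hex_to_ip(encoded))
```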
1. Application overview, 2. Service details, 3. Application destinations, 4. Application sources
- Overview
Numbers
select
count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
from prism_trace
where (rpcType='0' or rpcType='2' or rpcType='3') and serverApp='camel' and time>now()-1d and time<=now() group by serverIp,time(1d)
Table
select
count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
from prism_trace
where (rpcType='0' or rpcType='2' or rpcType='3') and serverApp='camel' and time>now()-1d and time<=now() group by rpcType,serviceName,time(1d)
- Service details
Main chart
select
count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
from prism_trace
where serverApp='camel' and serviceName='?' and time>now()-1d and time<=now() group by time(1d)
Destinations
select
count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
from prism_trace
where clientApp='camel' and rpcType='1' and clientService='/login.do' and time>now()-1d and time<=now() group by serverApp,serviceName,time(1d)
Sources
select
count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
from prism_trace
where serverApp='whale' and rpcType='1' and serviceName='MemberQueryService' and time>now()-1d and time<=now() group by clientApp,clientService,time(1d)
- Application destinations
select
count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
from prism_trace
where clientApp='camel' and rpcType='1' and time>now()-1d and time<=now() group by serverApp,serviceName,time(1d)
- Application sources
select
count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount
from prism_trace
where serverApp='whale' and rpcType='1' and time>now()-1d and time<=now() group by clientApp,clientService,time(1d)
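The dashboard queries above differ only in their filters and group-by columns, so they could be generated from one template. A minimal sketch follows; build_query is a hypothetical helper, and because it splices values into the query text it rejects quotes and backslashes instead of trying to escape them.

```python
SELECT_CLAUSE = (
    "select count(srSpan) as hitCount,mean(ssSpan) as rtAvg,sum(error) as errCount "
    "from prism_trace"
)

def build_query(filters, group_by, since="1d", window="1d"):
    """Build one dashboard query from equality filters and group-by columns."""
    for value in filters.values():
        if "'" in value or "\\" in value:
            raise ValueError(f"unsafe filter value: {value!r}")
    where = " and ".join(f"{key}='{value}'" for key, value in filters.items())
    groups = ",".join(list(group_by) + [f"time({window})"])
    return (
        f"{SELECT_CLAUSE} where {where} "
        f"and time>now()-{since} and time<=now() group by {groups}"
    )

# Reproduces the "application sources" query for the 'whale' application.
sources_query = build_query(
    {"serverApp": "whale", "rpcType": "1"},
    ["clientApp", "clientService"],
)
print(sources_query)
```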