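Reposted from: http://bean-li.github.io/atop-exit-code/
1. Preface
A daemon process exited in the early hours of the morning for no apparent reason, and the log contained nothing useful for working out why. QA asked me to determine the cause: was the process killed by a signal, or did it exit abnormally on its own?
Fortunately atop was running, and it records the exit code of a process or the number of the signal that terminated it.
2. Method
See the figure below: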
image.png
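In the figure above, `#exit 20305` on the first line means that 20305 processes exited during the past 10 minutes.
One of the lines shows that ceph-osd exited between the two sampling points; a process name wrapped in <> marks a process that has exited. How do we tell whether it exited normally or was terminated by a signal? If it exited on its own, what was the return value? If it was killed, which signal did it receive?
The ST and EXC fields in atop give us the answer.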
ST
The status of a process.
The first position indicates if the process has been started during the last interval (the value N means 'new process').
The second position indicates if the process has been finished during the last interval.
The value E means 'exit' on the process' own initiative; the exit code is displayed in the column 'EXC'.
The value S means that the process has been terminated unvoluntarily by a signal; the signal number is displayed in the column 'EXC'.
The value C means that the process has been terminated unvoluntarily by a signal, producing a core dump in its current directory; the signal number is displayed in the column 'EXC'.
Both S and C mean that the process received a signal and was forced to exit; in that case the EXC field records the number of the signal that terminated it.
EXC
The exit code of a terminated process (second position of column 'ST' is E) or the fatal signal number (second position of column 'ST' is S or C).
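The E-versus-S distinction described above is the same one a parent process sees when it waits on a child. As a minimal sketch (not how atop itself is implemented), the following Python maps a `subprocess` return code to an atop-like pair of ST second position and EXC. The helper names `describe_exit` and `run_and_describe` are hypothetical, and the sketch assumes a POSIX system, where `subprocess` reports death by signal N as a return code of -N. It does not try to detect core dumps, so it never reports C.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Map a subprocess return code to an atop-like (ST position 2, EXC) pair.

    On POSIX, subprocess reports a child killed by signal N as returncode -N.
    Core dumps are not detected here, so 'C' is never returned.
    """
    if returncode < 0:
        return ("S", -returncode)   # terminated by a signal
    return ("E", returncode)        # exited on its own initiative


def run_and_describe(argv):
    """Run a command to completion and classify how it ended."""
    proc = subprocess.run(argv)
    return describe_exit(proc.returncode)


# A child that exits voluntarily with code 3.
exited = run_and_describe([sys.executable, "-c", "import sys; sys.exit(3)"])

# A child that sends itself SIGUSR1, whose default action is to terminate.
killed = run_and_describe(
    [sys.executable, "-c",
     "import os, signal; os.kill(os.getpid(), signal.SIGUSR1)"]
)

print(exited)
print(killed)
```

The second child is reported as S with the numeric value of SIGUSR1, which is the situation atop recorded for ceph-osd.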
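In this case ST = NS. The S in the second position means the process was terminated by a signal (the N in the first position means it had also been started during that interval), and EXC = 10 means the signal was number 10.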
kill -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5) SIGTRAP
6) SIGABRT 7) SIGBUS 8) SIGFPE 9) SIGKILL 10) SIGUSR1
11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM
16) SIGSTKFLT 17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO 30) SIGPWR
31) SIGSYS 34) SIGRTMIN 35) SIGRTMIN+1 36) SIGRTMIN+2 37) SIGRTMIN+3
38) SIGRTMIN+4 39) SIGRTMIN+5 40) SIGRTMIN+6 41) SIGRTMIN+7 42) SIGRTMIN+8
43) SIGRTMIN+9 44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9 56) SIGRTMAX-8 57) SIGRTMAX-7
58) SIGRTMAX-6 59) SIGRTMAX-5 60) SIGRTMAX-4 61) SIGRTMAX-3 62) SIGRTMAX-2
63) SIGRTMAX-1 64) SIGRTMAX
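Instead of reading the name off the `kill -l` table by hand, the number in EXC can also be looked up programmatically. The sketch below uses Python's standard `signal.Signals` enum; `signal_name` is a hypothetical helper. Signal numbers are platform-specific, so the lookup should run on the same kind of system that atop was recording on. Real-time signals other than the ones Python names explicitly fall through to the "unknown" branch.

```python
import signal


def signal_name(signum):
    """Return the symbolic name for a signal number on this platform."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # Not a signal number that Python has a name for on this platform.
        return f"unknown signal {signum}"


# EXC = 10 from the atop output; on x86 Linux this corresponds to the
# SIGUSR1 entry in the table above.
print(signal_name(10))
```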
3. Epilogue
So who sent SIGUSR1 to the ceph-osd process? This is where systemtap can help.
Write the following script to monitor every signal sent to a given process:
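# sigmon.stp: print every signal sent to processes whose name matches the first argument
probe begin
{
    printf("%-30s%-8s %-16s %-8s %-16s %6s %-16s\n",
           "TIME", "SPID", "SNAME", "RPID", "RNAME", "SIGNUM", "SIGNAME")
}

probe signal.send
{
    # @1 is the first command-line argument as a string (the target process name)
    if (pid_name == @1)
        printf("%-30s%-8d %-16s %-8d %-16s %6d %-16s\n",
               ctime(gettimeofday_s()), pid(), execname(), sig_pid, pid_name, sig, sig_name)
}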
stap sigmon.stp ceph-osd
A test run produces the following output:
$ stap sigmon.stp ceph-osd
TIME SPID SNAME RPID RNAME SIGNUM SIGNAME
Wed Nov 2 14:21:15 2016 19977 sh 19884 ceph-osd 17 SIGCHLD
Wed Nov 2 14:21:15 2016 19992 sh 19884 ceph-osd 17 SIGCHLD
Wed Nov 2 14:21:20 2016 21218 sh 19884 ceph-osd 17 SIGCHLD
Wed Nov 2 14:21:20 2016 21224 sh 19884 ceph-osd 17 SIGCHLD
Wed Nov 2 14:21:22 2016 9786 bash 19884 ceph-osd 10 SIGUSR1
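On a busy machine the output is dominated by routine signals such as SIGCHLD, so it can help to filter for the one signal of interest. The sketch below does that over the sample lines shown above. `parse_line` and `senders_of` are hypothetical helpers. Because the TIME column itself contains spaces, each line is split from the right, which assumes process names contain no whitespace.

```python
SAMPLE_OUTPUT = """\
Wed Nov 2 14:21:15 2016 19977 sh 19884 ceph-osd 17 SIGCHLD
Wed Nov 2 14:21:15 2016 19992 sh 19884 ceph-osd 17 SIGCHLD
Wed Nov 2 14:21:20 2016 21218 sh 19884 ceph-osd 17 SIGCHLD
Wed Nov 2 14:21:20 2016 21224 sh 19884 ceph-osd 17 SIGCHLD
Wed Nov 2 14:21:22 2016 9786 bash 19884 ceph-osd 10 SIGUSR1
"""


def parse_line(line):
    """Split one sigmon output line into its seven columns.

    The timestamp has embedded spaces, so split off the six rightmost
    fields and treat whatever remains as TIME.
    """
    time_str, spid, sname, rpid, rname, signum, signame = line.rsplit(None, 6)
    return {
        "time": time_str,
        "spid": int(spid),
        "sname": sname,
        "rpid": int(rpid),
        "rname": rname,
        "signum": int(signum),
        "signame": signame,
    }


def senders_of(output, signame):
    """Return (sender pid, sender name) for every line carrying the given signal."""
    records = (parse_line(l) for l in output.splitlines() if l.strip())
    return [(r["spid"], r["sname"]) for r in records if r["signame"] == signame]


print(senders_of(SAMPLE_OUTPUT, "SIGUSR1"))
```

For the sample above, the only SIGUSR1 sender is the bash process with PID 9786.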