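1: Why do we need a watchdog?
Watchdog: I first came across the term in a microcontroller textbook at university, which discussed the watchdog timer. In the early days of microcontrollers, a chip was easily disturbed by its operating environment, and its program could run away as a result. The watchdog was introduced as a protection mechanism: the program has to "feed the dog" within every fixed interval, and if it fails to, the watchdog triggers a reset. Roughly, once the system is running, the watchdog counter is started and counts automatically. If the counter is not cleared within a certain time, it overflows, which raises a watchdog interrupt and resets the system.
A phone is really a vastly more powerful microcontroller: it runs N times faster, has N times the storage, runs many threads, and has all kinds of software and hardware working together. Better safe than sorry: what if the system deadlocks, or what if the phone is hit by strong interference and a program runs away? Either can end very badly, so we need a watchdog mechanism here too.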
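The counter-and-feed principle can be sketched in a few lines of plain Java. This is only a software model for illustration, not how any particular chip works: the class name WatchdogTimer, its methods, and the injectable clock are all my own assumptions.

```java
import java.util.function.LongSupplier;

/** Hypothetical software model of a hardware watchdog timer. */
class WatchdogTimer {
    private final long timeoutMillis;
    private final LongSupplier clock; // injectable so the sketch can run without sleeping
    private long lastFedAt;

    WatchdogTimer(long timeoutMillis, LongSupplier clock) {
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        this.lastFedAt = clock.getAsLong();
    }

    /** "Feed the dog": clear the counter. */
    void feed() {
        lastFedAt = clock.getAsLong();
    }

    /** True once the full timeout has passed without a feed. */
    boolean hasExpired() {
        return clock.getAsLong() - lastFedAt >= timeoutMillis;
    }
}

class WatchdogTimerDemo {
    public static void main(String[] args) {
        long[] now = {0}; // fake clock, in milliseconds
        WatchdogTimer dog = new WatchdogTimer(1000, () -> now[0]);

        now[0] = 900;
        dog.feed(); // fed in time
        now[0] = 1800;
        System.out.println(dog.hasExpired()); // only 900 ms since the last feed

        now[0] = 1900;
        System.out.println(dog.hasExpired()); // 1000 ms without a feed: a real chip would reset here
    }
}
```

A healthy program calls feed() from its main loop; a program that has run away or deadlocked stops calling it, and that silence is what the watchdog detects.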
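2: The Android system-level watchdog
Watchdogs come in hardware and software forms. A hardware watchdog is a timer circuit like the one in a microcontroller; a software watchdog is a similar mechanism that we implement ourselves. To keep the system stable, Android also has such a watchdog. To make sure the system services keep working, it monitors many services, restarts when a core service misbehaves, and saves the state at the time of the failure.
Next, let's look at how the Android Watchdog is designed.
Note: this article is based on the Android 6.0 source.
The Watchdog source is at: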
frameworks/base/services/core/java/com/android/server/Watchdog.java
Watchdog is initialized in SystemServer:
/frameworks/base/services/java/com/android/server/SystemServer.java
Slog.i(TAG, "Init Watchdog");
final Watchdog watchdog = Watchdog.getInstance();
watchdog.init(context, mActivityManagerService);
At this point Watchdog runs the following initialization code: first the constructor, then the init method.
private Watchdog() {
    super("watchdog");
    // Initialize handler checkers for each common thread we want to check.  Note
    // that we are not currently checking the background thread, since it can
    // potentially hold longer running operations with no guarantees about the timeliness
    // of operations there.

    // The shared foreground thread is the main checker.  It is where we
    // will also dispatch monitor checks and do other work.
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread.  We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread.
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread.
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));

    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());
}

public void init(Context context, ActivityManagerService activity) {
    mResolver = context.getContentResolver();
    mActivity = activity;
    // Register the receiver for the reboot broadcast
    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);
}
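Reading the source, however, we find that Watchdog extends Thread, so it also has to be started somewhere. That is the line below, which runs as part of ActivityManagerService's systemReady handling.
Watchdog.getInstance().start();
TAG: HandlerChecker
The code above uses an important class, HandlerChecker. It is the mechanism Watchdog uses to check the main thread, the I/O thread, the display thread, and the UI thread. The code is not long, so I'll paste it in full. It works by looking at the MessageQueue of each Handler's looper to decide whether that thread is stuck. The threads in question are, of course, threads running in the SystemServer process.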
public final class HandlerChecker implements Runnable {
88 private final Handler mHandler;
89 private final String mName;
90 private final long mWaitMax;
91 private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
92 private boolean mCompleted;
93 private Monitor mCurrentMonitor;
94 private long mStartTime;
95
96 HandlerChecker(Handler handler, String name, long waitMaxMillis) {
97 mHandler = handler;
98 mName = name;
99 mWaitMax = waitMaxMillis;
100 mCompleted = true;
101 }
102
103 public void addMonitor(Monitor monitor) {
104 mMonitors.add(monitor);
105 }
106 // Record the start time of this check
107 public void scheduleCheckLocked() {
108 if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
109 // If the target looper has recently been polling, then
110 // there is no reason to enqueue our checker on it since that
111 // is as good as it not being deadlocked. This avoid having
112 // to do a context switch to check the thread. Note that we
113 // only do this if mCheckReboot is false and we have no
114 // monitors, since those would need to be executed at this point.
115 mCompleted = true;
116 return;
117 }
118
119 if (!mCompleted) {
120 // we already have a check in flight, so no need
121 return;
122 }
123
124 mCompleted = false;
125 mCurrentMonitor = null;
126 mStartTime = SystemClock.uptimeMillis();
127 mHandler.postAtFrontOfQueue(this);
128 }
129
130 public boolean isOverdueLocked() {
131 return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
132 }
133 // Get the completion state
134 public int getCompletionStateLocked() {
135 if (mCompleted) {
136 return COMPLETED;
137 } else {
138 long latency = SystemClock.uptimeMillis() - mStartTime;
139 if (latency < mWaitMax/2) {
140 return WAITING;
141 } else if (latency < mWaitMax) {
142 return WAITED_HALF;
143 }
144 }
145 return OVERDUE;
146 }
147
148 public Thread getThread() {
149 return mHandler.getLooper().getThread();
150 }
151
152 public String getName() {
153 return mName;
154 }
155
156 public String describeBlockedStateLocked() {
157 if (mCurrentMonitor == null) {
158 return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
159 } else {
160 return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
161 + " on " + mName + " (" + getThread().getName() + ")";
162 }
163 }
164
165 @Override
166 public void run() {
167 final int size = mMonitors.size();
168 for (int i = 0 ; i < size ; i++) {
169 synchronized (Watchdog.this) {
170 mCurrentMonitor = mMonitors.get(i);
171 }
172 mCurrentMonitor.monitor();
173 }
174
175 synchronized (Watchdog.this) {
176 mCompleted = true;
177 mCurrentMonitor = null;
178 }
179 }
180 }
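From the code above, we can see that one core method is
mHandler.getLooper().getQueue().isPolling()
This method is implemented in MessageQueue, and its code is pasted below. Its comment says that it returns whether the looper's thread is currently polling for more work to do, which is a good signal that the loop is still alive. The HandlerChecker source shows that if it returns true and the checker has no monitors, scheduleCheckLocked marks the check as completed and returns immediately.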
/**
 * Returns whether this looper's thread is currently polling for more work to do.
 * This is a good signal that the loop is still alive rather than being stuck
 * handling a callback.  Note that this method is intrinsically racy, since the
 * state of the loop can change before you get the result back.
 *
 * <p>This method is safe to call from any thread.
 *
 * @return True if the looper is currently polling for events.
 * @hide
 */
public boolean isPolling() {
    synchronized (this) {
        return isPollingLocked();
    }
}
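If it does not return true, the looper is currently busy. The checker then sets mCompleted to false, marking that a message has been sent and is awaiting processing, and posts itself to the front of the handler's queue. If the looper is not blocked, the checker's own run method is called very soon.
What does that run method do? It is at line 166 of the TAG: HandlerChecker source: it iterates over the checker's monitors and calls monitor() on each one (monitors are explained below). If any monitor blocks, mCompleted stays false.
When the system then asks for the completion state, the code takes the else branch, computes the elapsed time, and returns the corresponding state code.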
133 // Get the completion state
134 public int getCompletionStateLocked() {
135 if (mCompleted) {
136 return COMPLETED;
137 } else {
138 long latency = SystemClock.uptimeMillis() - mStartTime;
139 if (latency < mWaitMax/2) {
140 return WAITING;
141 } else if (latency < mWaitMax) {
142 return WAITED_HALF;
143 }
144 }
145 return OVERDUE;
146 }
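To make the time buckets concrete, here is the same decision restated as a small standalone Java function. It is a sketch under assumptions: the class name CompletionState and the method of() are mine, the numeric constant values are chosen only so that a larger number means a worse state, and the 60-second timeout is taken from the timings discussed in this article.

```java
/** Standalone restatement of the time buckets; names and constant values are illustrative. */
class CompletionState {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    static int of(boolean completed, long startMillis, long nowMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        long latency = nowMillis - startMillis;
        if (latency < waitMaxMillis / 2) {
            return WAITING;      // less than half the timeout has elapsed
        } else if (latency < waitMaxMillis) {
            return WAITED_HALF;  // between half and the full timeout
        }
        return OVERDUE;          // the full timeout has elapsed
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // assumed 60-second timeout
        System.out.println(of(true, 0, 90_000, waitMax));  // 0: completed, elapsed time is ignored
        System.out.println(of(false, 0, 10_000, waitMax)); // 1: WAITING
        System.out.println(of(false, 0, 30_000, waitMax)); // 2: WAITED_HALF
        System.out.println(of(false, 0, 60_000, waitMax)); // 3: OVERDUE
    }
}
```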
So now we know how Watchdog decides whether a thread is stuck:
- MessageQueue.isPolling
- Monitor.monitor
TAG:Monitor
public interface Monitor {
    void monitor();
}
Monitor is an interface, and quite a few classes implement it, as a search of the source tree shows. Without even guessing, we know that they must all register themselves with Watchdog. The code below shows where they are registered.
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
        "foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);

public void addMonitor(Monitor monitor) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Monitors can't be added once the Watchdog is running");
        }
        mMonitorChecker.addMonitor(monitor);
    }
}
So every class that implements the interface only needs to call the addMonitor method above. Let's look at how ActivityManagerService does it. Its source is at:
/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
2381 Watchdog.getInstance().addMonitor(this);
19655 /** In this method we try to acquire our lock to make sure that we have not deadlocked */
19656 public void monitor() {
19657 synchronized (this) { }
19658 }
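We can see that AMS implements the interface and registers itself with Watchdog at line 2381. Its monitor method merely synchronizes on itself, to confirm that it is not deadlocked.
That is not much work, but it is enough: it lets an outside caller find out through this method whether AMS is stuck.
Now that we know how Watchdog determines whether other services are deadlocked, let's see how its run method ties the whole mechanism together.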
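The lock trick is easy to try outside Android. The sketch below is not the framework code: LockedService, MonitorProbe, and respondsWithin are names I made up, and a single-thread executor stands in for the handler thread that runs the monitors. It shows that a monitor() which only takes the service's lock returns at once when the lock is free and does not return while another thread holds it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical service whose monitor() only tries to take the service's own lock. */
class LockedService {
    private final Object lock = new Object();

    void monitor() {
        synchronized (lock) { } // returns at once unless another thread holds the lock
    }

    /** Simulates a stuck operation: holds the lock until 'release' is counted down. */
    void holdLock(CountDownLatch acquired, CountDownLatch release) throws InterruptedException {
        synchronized (lock) {
            acquired.countDown();
            release.await();
        }
    }
}

class MonitorProbe {
    /** Runs monitor() on a checker thread and reports whether it returned within the timeout. */
    static boolean respondsWithin(LockedService service, long timeoutMillis) throws Exception {
        // Daemon thread: a probe blocked on the lock must not keep the JVM alive.
        ExecutorService checker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "checker");
            t.setDaemon(true);
            return t;
        });
        try {
            Runnable task = service::monitor;
            Future<?> probe = checker.submit(task);
            probe.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        } finally {
            checker.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        LockedService service = new LockedService();
        System.out.println(respondsWithin(service, 1000)); // lock is free, monitor() returns

        CountDownLatch acquired = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread stuck = new Thread(() -> {
            try {
                service.holdLock(acquired, release);
            } catch (InterruptedException ignored) {
                // demo thread: nothing to clean up
            }
        });
        stuck.start();
        acquired.await(); // make sure the lock is really held before probing
        System.out.println(respondsWithin(service, 200)); // lock is held, monitor() cannot return

        release.countDown();
        stuck.join();
    }
}
```

The real Watchdog does not block waiting for the monitor the way respondsWithin does. It posts the check, goes back to sleep, and only compares timestamps later, so one stuck service cannot stall the watchdog thread itself.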
TAG: Watchdog.run
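The run method is an endless loop: it repeatedly iterates over all the HandlerCheckers, schedules their checks, waits 30 seconds, and evaluates the result. See the comments in the code below for details: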
341 @Override
342 public void run() {
343 boolean waitedHalf = false;
344 while (true) {
345 final ArrayList<HandlerChecker> blockedCheckers;
346 final String subject;
347 final boolean allowRestart;
348 int debuggerWasConnected = 0;
349 synchronized (this) {
350 long timeout = CHECK_INTERVAL;
351 // Make sure we (re)spin the checkers that have become idle within
352 // this wait-and-check interval
// Here we iterate over all the HandlerCheckers, schedule their checks, and record the start time
353 for (int i=0; i<mHandlerCheckers.size(); i++) {
354 HandlerChecker hc = mHandlerCheckers.get(i);
355 hc.scheduleCheckLocked();
356 }
357
358 if (debuggerWasConnected > 0) {
359 debuggerWasConnected--;
360 }
361
362 // NOTE: We use uptimeMillis() here because we do not want to increment the time we
363 // wait while asleep. If the device is asleep then the thing that we are waiting
364 // to timeout on is asleep as well and won't have a chance to run, causing a false
365 // positive on when to kill things.
366 long start = SystemClock.uptimeMillis();
// Wait 30 seconds. uptimeMillis() is used so that time spent asleep is not counted; while the phone sleeps, the system services sleep too
367 while (timeout > 0) {
368 if (Debug.isDebuggerConnected()) {
369 debuggerWasConnected = 2;
370 }
371 try {
372 wait(timeout);
373 } catch (InterruptedException e) {
374 Log.wtf(TAG, e);
375 }
376 if (Debug.isDebuggerConnected()) {
377 debuggerWasConnected = 2;
378 }
379 timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
380 }
381 // Evaluate the checkers' state: this iterates over every HandlerChecker and takes the largest return value.
382 final int waitState = evaluateCheckerCompletionLocked();
// There are four possible values. COMPLETED: the message has been handled and no thread is blocked
383 if (waitState == COMPLETED) {
384 // The monitors have returned; reset
385 waitedHalf = false;
386 continue;
// WAITING: message handling has taken 0 to 29 seconds so far; keep going
387 } else if (waitState == WAITING) {
388 // still waiting but within their configured intervals; back off and recheck
389 continue;
// WAITED_HALF: message handling has taken 30 to 59 seconds; the thread may be blocked, so dump the current stack traces via AMS
390 } else if (waitState == WAITED_HALF) {
391 if (!waitedHalf) {
392 // We've waited half the deadlock-detection interval. Pull a stack
393 // trace and wait another half.
394 ArrayList<Integer> pids = new ArrayList<Integer>();
395 pids.add(Process.myPid());
396 ActivityManagerService.dumpStackTraces(true, pids, null, null,
397 NATIVE_STACKS_OF_INTEREST);
398 waitedHalf = true;
399 }
400 continue;
401 }
402 // OVERDUE: message handling has taken more than 60 seconds. Reaching this point means the 60-second timeout has occurred; everything below deals with that timeout
403 // something is overdue!
404 blockedCheckers = getBlockedCheckersLocked();
405 subject = describeCheckersLocked(blockedCheckers);
406 allowRestart = mAllowRestart;
407 }
408
409 // If we got here, that means that the system is most likely hung.
410 // First collect stack traces from all threads of the system process.
411 // Then kill this process so that the system will restart.
412 EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
413
....... (saving of various records)
468
469 // Only kill the process if the debugger is not attached.
470 if (Debug.isDebuggerConnected()) {
471 debuggerWasConnected = 2;
472 }
473 if (debuggerWasConnected >= 2) {
474 Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
475 } else if (debuggerWasConnected > 0) {
476 Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
477 } else if (!allowRestart) {
478 Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
479 } else {
480 Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
481 for (int i=0; i<blockedCheckers.size(); i++) {
482 Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
483 StackTraceElement[] stackTrace
484 = blockedCheckers.get(i).getThread().getStackTrace();
485 for (StackTraceElement element: stackTrace) {
486 Slog.w(TAG, " at " + element);
487 }
488 }
489 Slog.w(TAG, "*** GOODBYE!");
490 Process.killProcess(Process.myPid());
491 System.exit(10);
492 }
493
494 waitedHalf = false;
495 }
496 }
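As shown above, reaching line 412 means the watchdog is preparing to restart the system.
It does the following:
- Writes to the event log
- Dumps the stacks of system_server and three native processes, appending to the existing output
- Dumps the kernel stacks
- Dumps all blocked threads
- Writes a dropbox entry
- Checks whether a debugger is attached; if not, restarts the system and logs: *** WATCHDOG KILLING SYSTEM PROCESS: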
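The decision made on each pass of the loop can be boiled down to a small state machine. The sketch below is my own simplification, not the framework code: the WatchdogDecision class, the Action enum, and the constant values are illustrative, and the debugger and allowRestart checks are left out.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of one pass of the watchdog loop; names and values are illustrative. */
class WatchdogDecision {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    enum Action { RESET, KEEP_WAITING, DUMP_STACKS, KILL }

    private boolean waitedHalf = false;

    /** The overall state is the worst (largest) state reported by any checker. */
    static int worstState(List<Integer> checkerStates) {
        int state = COMPLETED;
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }

    /** Maps the worst state to what the watchdog does next. */
    Action step(List<Integer> checkerStates) {
        int state = worstState(checkerStates);
        if (state == COMPLETED) {
            waitedHalf = false;
            return Action.RESET;
        } else if (state == WAITING) {
            return Action.KEEP_WAITING;
        } else if (state == WAITED_HALF) {
            if (!waitedHalf) {
                waitedHalf = true; // dump the stacks only once per incident
                return Action.DUMP_STACKS;
            }
            return Action.KEEP_WAITING;
        }
        waitedHalf = false;
        return Action.KILL; // the real code first checks the debugger and allowRestart
    }

    public static void main(String[] args) {
        WatchdogDecision dog = new WatchdogDecision();
        System.out.println(dog.step(Arrays.asList(COMPLETED, WAITING)));     // KEEP_WAITING
        System.out.println(dog.step(Arrays.asList(COMPLETED, WAITED_HALF))); // DUMP_STACKS
        System.out.println(dog.step(Arrays.asList(COMPLETED, WAITED_HALF))); // KEEP_WAITING
        System.out.println(dog.step(Arrays.asList(WAITING, OVERDUE)));       // KILL
    }
}
```

Taking the maximum means one blocked checker is enough to escalate, no matter how many other threads are healthy.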
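3: Summary
That is how the Android system-level Watchdog works. It is well designed; had I designed it myself, I would not have thought of using the Monitor lock mechanism for the check.
To summarize:
- Watchdog is a thread that monitors whether the system services are running normally and have not deadlocked
- HandlerChecker checks Handlers and monitors
- A monitor detects deadlock by acquiring a lock
- After a 30-second timeout the watchdog dumps stack traces; after a 60-second timeout it restarts the system (unless a debugger is attached)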
Author: Anderson/Jerey_Jobs
Blog: http://jerey.cn/
Jianshu: Anderson大碼渣
GitHub: https://github.com/Jerey-Jobs