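The watchdog originated in early embedded devices, where programs often ran away (because of electromagnetic interference, for example). A dedicated hardware watchdog was therefore added: at regular intervals it checks whether a certain parameter has been set, and if it finds that the parameter has not been set, it concludes that the system has failed and forces a reboot.
Watchdog is the watchdog Android uses to monitor this kind of parameter setting in SystemServer. So which doors does it guard? Mainly those of a few important services: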
- ActivityManagerService
- PowerManagerService
- WindowManagerService
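As soon as it finds that one of these services has a problem, it kills system_server. That in turn makes zygote kill itself as well, which ends up restarting the whole Java world.
So how does system_server put Watchdog to work?
The interaction between system_server and Watchdog boils down to three steps: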
- Watchdog.getInstance().init()
- Watchdog.getInstance().start()
- Watchdog.getInstance().addMonitor()
All three steps are simple. Start with the first.
1. Creating and initializing Watchdog
getInstance creates the Watchdog:
public static Watchdog getInstance() {
    if (sWatchdog == null) {
        sWatchdog = new Watchdog();
    }
    return sWatchdog;
}
private Watchdog() {
    super("watchdog");
    // Initialize handler checkers for each common thread we want to check. Note
    // that we are not currently checking the background thread, since it can
    // potentially hold longer running operations with no guarantees about the timeliness
    // of operations there.
    // The shared foreground thread is the main checker. It is where we
    // will also dispatch monitor checks and do other work.
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread. We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread.
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread.
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));
    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());
}
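The idea behind each HandlerChecker can be tried out without any Android classes: post a tiny task to the thread being watched and see whether it runs before a deadline. What follows is only a minimal sketch under that assumption, not the AOSP implementation. `QueueChecker`, `isResponsive` and the demo class are hypothetical names, and a single-threaded `ExecutorService` stands in for the Handler and its message queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a HandlerChecker: it probes one worker thread.
class QueueChecker {
    private final ExecutorService worker;
    private final String name;

    QueueChecker(ExecutorService worker, String name) {
        this.worker = worker;
        this.name = name;
    }

    String getName() {
        return name;
    }

    // Posts an empty task and reports whether the worker ran it within timeoutMs.
    boolean isResponsive(long timeoutMs) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        worker.execute(done::countDown);
        return done.await(timeoutMs, TimeUnit.MILLISECONDS);
    }
}

class QueueCheckerDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService ui = Executors.newSingleThreadExecutor();
        QueueChecker checker = new QueueChecker(ui, "ui thread");
        System.out.println(checker.getName() + " responsive: " + checker.isResponsive(500));

        // Block the worker; the next probe queues up behind the stuck task.
        CountDownLatch release = new CountDownLatch(1);
        ui.execute(() -> {
            try {
                release.await();
            } catch (InterruptedException ignored) {
                // nothing to clean up in this demo
            }
        });
        System.out.println(checker.getName() + " responsive: " + checker.isResponsive(200));

        release.countDown(); // unblock the worker so the JVM can exit
        ui.shutdown();
    }
}
```

The probe never inspects the worker directly. It only observes whether the worker's queue keeps moving, which is why one long-running task is enough to make the thread look hung.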
Next, look at what the init function does:
public void init(Context context, ActivityManagerService activity) {
    mResolver = context.getContentResolver();
    mActivity = activity;
    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);
}
2. Getting the Watchdog running
SystemServer calls Watchdog's start function, which causes Watchdog's run method to execute on a separate thread.
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final ArrayList<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            long timeout = CHECK_INTERVAL;
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }
            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }
            // NOTE: We use uptimeMillis() here because we do not want to increment the time we
            // wait while asleep. If the device is asleep then the thing that we are waiting
            // to timeout on is asleep as well and won't have a chance to run, causing a false
            // positive on when to kill things.
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
                // The monitors have returned; reset
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) {
                // still waiting but within their configured intervals; back off and recheck
                continue;
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    // We've waited half the deadlock-detection interval. Pull a stack
                    // trace and wait another half.
                    ArrayList<Integer> pids = new ArrayList<Integer>();
                    pids.add(Process.myPid());
                    ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            NATIVE_STACKS_OF_INTEREST);
                    waitedHalf = true;
                }
                continue;
            }
            // something is overdue!
            blockedCheckers = getBlockedCheckersLocked();
            subject = describeCheckersLocked(blockedCheckers);
            allowRestart = mAllowRestart;
        }
        // If we got here, that means that the system is most likely hung.
        // First collect stack traces from all threads of the system process.
        // Then kill this process so that the system will restart.
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
        ArrayList<Integer> pids = new ArrayList<Integer>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // Pass !waitedHalf so that just in case we somehow wind up here without having
        // dumped the halfway stacks, we properly re-initialize the trace file.
        final File stack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
        // Give some extra time to make sure the stack traces get written.
        // The system's been hanging for a minute, another second or two won't hurt much.
        SystemClock.sleep(2000);
        // Pull our own kernel thread stacks as well if we're configured for that
        if (RECORD_KERNEL_THREADS) {
            dumpKernelStackTraces();
        }
        // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
        doSysRq('w');
        doSysRq('l');
        // Try to add the error to the dropbox, but assuming that the ActivityManager
        // itself may be deadlocked. (which has happened, causing this statement to
        // deadlock and the watchdog as a whole to be ineffective)
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
            public void run() {
                mActivity.addErrorToDropBox(
                        "watchdog", null, "system_server", null, null,
                        subject, null, stack, null);
            }
        };
        dropboxThread.start();
        try {
            dropboxThread.join(2000); // wait up to 2 seconds for it to return.
        } catch (InterruptedException ignored) {}
        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }
        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            for (int i=0; i<blockedCheckers.size(); i++) {
                Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                StackTraceElement[] stackTrace
                        = blockedCheckers.get(i).getThread().getStackTrace();
                for (StackTraceElement element: stackTrace) {
                    Slog.w(TAG, " at " + element);
                }
            }
            Slog.w(TAG, "*** GOODBYE!");
            // This time something really is wrong, so kill ourselves.
            Process.killProcess(Process.myPid());
            System.exit(10);
        }
        waitedHalf = false;
    }
}
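In short, at regular intervals the watchdog sends a monitor message to another thread, and that thread checks the health of each service. The watchdog waits for the result of the check, and if no result ever comes back, it kills system_server.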
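The four states that run() branches on can be illustrated with a small classifier. This is a sketch only: the constant names come from the listing, but the thresholds (half the deadline, then the full deadline) are inferred from the "waited half the deadlock-detection interval" comment, and `classify`, `worst` and the 60-second deadline are assumptions made for illustration.

```java
// Sketch of how a checker's elapsed time could map onto the four states.
class CompletionStateSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // completed: the probe task has already run.
    // elapsedMs: time since the probe was posted.
    // timeoutMs: the checker's full deadline (assumed parameter).
    static int classify(boolean completed, long elapsedMs, long timeoutMs) {
        if (completed) return COMPLETED;
        if (elapsedMs < timeoutMs / 2) return WAITING;
        if (elapsedMs < timeoutMs) return WAITED_HALF;
        return OVERDUE;
    }

    // The watchdog reacts to the worst state among all of its checkers.
    static int worst(int[] states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);
        }
        return state;
    }

    public static void main(String[] args) {
        long timeout = 60_000; // assumed 60 s deadline, for illustration only
        System.out.println(classify(true, 70_000, timeout));   // prints 0 (COMPLETED)
        System.out.println(classify(false, 10_000, timeout));  // prints 1 (WAITING)
        System.out.println(classify(false, 45_000, timeout));  // prints 2 (WAITED_HALF)
        System.out.println(classify(false, 60_000, timeout));  // prints 3 (OVERDUE)
        System.out.println(worst(new int[] {COMPLETED, WAITED_HALF, WAITING})); // prints 2
    }
}
```

Ordering the constants from best to worst lets the combined state be a plain maximum, so a single overdue checker is enough to send the loop down the kill path.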
3. Lining up for inspection
To be checked by the watchdog, a service has to implement the Monitor interface:
public interface Monitor {
    void monitor();
}
For example, WindowManagerService:
public class WindowManagerService extends IWindowManager.Stub
        implements Watchdog.Monitor, WindowManagerPolicy.WindowManagerFuncs
Watchdog can then call their monitor functions to check them.
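The registration pattern itself is small enough to model in plain Java. The sketch below is hypothetical throughout: `MiniWatchdog`, `runMonitors` and `ToyService` are illustration names and not AOSP classes. It only shows the shape of the contract, where a service hands itself to the watchdog and the watchdog later calls monitor() on everything it was given.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the registration pattern; not the AOSP classes.
class MiniWatchdog {
    interface Monitor {
        void monitor();
    }

    private final List<Monitor> monitors = new ArrayList<>();

    synchronized void addMonitor(Monitor m) {
        monitors.add(m);
    }

    // Calls every registered monitor in order and returns how many were called.
    int runMonitors() {
        List<Monitor> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(monitors);
        }
        for (Monitor m : snapshot) {
            m.monitor();
        }
        return snapshot.size();
    }
}

// A toy service that registers itself in its constructor.
class ToyService implements MiniWatchdog.Monitor {
    private final Object lock = new Object();
    int probes = 0;

    ToyService(MiniWatchdog watchdog) {
        watchdog.addMonitor(this);
    }

    @Override
    public void monitor() {
        // Taking the service's own lock is the whole health check.
        synchronized (lock) {
            probes++;
        }
    }
}

class MiniWatchdogDemo {
    public static void main(String[] args) {
        MiniWatchdog watchdog = new MiniWatchdog();
        ToyService service = new ToyService(watchdog);
        int called = watchdog.runMonitors();
        System.out.println("monitors called: " + called + ", probes seen: " + service.probes);
    }
}
```

The monitors are copied before they are called so that a monitor which blocks does not also hold the watchdog's own lock while it is stuck.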
So how is a service's health actually judged? Take WindowManagerService as an example, and first look at how it hands itself over to the watchdog for inspection:
// Add ourself to the Watchdog monitors.
// The constructor adds this service to Watchdog's list of monitors to check
Watchdog.getInstance().addMonitor(this);
And what exactly does Watchdog check when it calls each monitor function? Look at the monitor function that the service implements.
WindowManagerService:
@Override
public void monitor() {
    // So what monitor() checks is whether the service has deadlocked
    synchronized (mWindowMap) { }
}
So what the watchdog fears most is a deadlock in a system service, and when that happens the only remedy it has is to kill the system process.
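Why an empty synchronized block works as a health check can be seen in a small experiment. The sketch below is an illustration under assumptions, not how Watchdog is implemented: `LockProbeSketch` and `canAcquire` are hypothetical names, and `windowMap` is a plain object standing in for mWindowMap. A probe thread tries to enter the lock, and the caller only waits a bounded time for it to get through.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class LockProbeSketch {
    // Returns true if a probe thread could enter synchronized(lock) within timeoutMs.
    static boolean canAcquire(Object lock, long timeoutMs) throws InterruptedException {
        CountDownLatch entered = new CountDownLatch(1);
        Thread probe = new Thread(() -> {
            synchronized (lock) { } // same shape as monitor() in the listing
            entered.countDown();
        }, "probe");
        probe.setDaemon(true); // a stuck probe must not keep the JVM alive
        probe.start();
        return entered.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        Object windowMap = new Object(); // stands in for mWindowMap
        System.out.println("free lock acquired: " + canAcquire(windowMap, 500));

        // Simulate a function that takes the lock and does not return.
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (windowMap) {
                held.countDown();
                try {
                    release.await();
                } catch (InterruptedException ignored) {
                    // nothing to clean up in this demo
                }
            }
        }, "holder");
        holder.start();
        held.await(); // make sure the lock is really taken before probing
        System.out.println("held lock acquired: " + canAcquire(windowMap, 200));

        release.countDown();
        holder.join();
    }
}
```

The probe cannot tell a true deadlock from a lock that is merely held for too long. Both look the same from outside, which matches the case described in the note that follows.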
Note: I have run into this situation only once. A function was holding a lock and did not return for a long time, because it had to talk to the hardware and the hardware did not respond in time.