Watchdog mainly monitors the threads of the services running in system_server, such as AMS. As soon as it detects a blocked thread it dumps the call stacks, and it may even restart the system_server process.
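1. Watchdog framework
a. Watchdog extends Thread.
b. HandlerChecker is an inner class of Watchdog that implements Runnable. It checks the state of the Handler thread it holds and calls back the monitor methods.
c. RebootRequestReceiver listens for the reboot broadcast (Intent.ACTION_REBOOT). When the broadcast arrives, it calls PowerManagerService (PowerMS) to reboot the system.
d. BinderThreadMonitor checks whether a binder thread is available. The monitor blocks until a binder thread is free to handle incoming IPC requests, which ensures that other processes can still talk to the services.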
2. Watchdog initialization
2.1 Starting Watchdog
#SystemServer.java
private void startOtherServices() {
...
mSystemServiceManager.startBootPhase(SystemService.PHASE_WAIT_FOR_DEFAULT_DISPLAY);
...
final Watchdog watchdog = Watchdog.getInstance();
watchdog.init(context, mActivityManagerService); // registers RebootRequestReceiver
...
mActivityManagerService.systemReady(new Runnable() {
@Override
public void run() {
Slog.i(TAG, "Making services ready");
mSystemServiceManager.startBootPhase(
SystemService.PHASE_ACTIVITY_MANAGER_READY);
...
startSystemUi(context); // start SystemUI
...
Watchdog.getInstance().start(); // start the watchdog thread
...
mSystemServiceManager.startBootPhase(
SystemService.PHASE_THIRD_PARTY_APPS_CAN_START);
...
}
});
}
2.2 Watchdog constructor initialization
static final long DEFAULT_TIMEOUT = DB ? 10*1000 : 60*1000; // default blocking timeout: 60s (10s when DB is true)
private Watchdog() {
// The shared foreground thread is the main checker. It is where we
// will also dispatch monitor checks and do other work.
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add checker for main thread. We only do a quick check since there
// can be UI running on the thread.
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
// Add checker for shared UI thread.
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
// And also check IO thread.
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
// And the display thread.
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
// Initialize monitor for Binder threads.
addMonitor(new BinderThreadMonitor());
}
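This phase registers the threads to monitor in mHandlerCheckers: the foreground thread (foreground thread), the main thread (main thread), the UI thread (ui thread), the I/O thread (i/o thread) and the display thread (display thread). The Binder monitor (BinderThreadMonitor) is registered through addMonitor(), which attaches it to mMonitorChecker, the foreground-thread checker.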
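The registration order above can be modelled in plain Java. This is only a sketch under stated assumptions: `CheckerRegistrySketch`, `Checker` and `Monitor` are hypothetical stand-ins for Watchdog, HandlerChecker and Watchdog.Monitor, with no Android classes involved, and the empty lambda is a placeholder for BinderThreadMonitor.

```java
import java.util.ArrayList;
import java.util.List;

final class CheckerRegistrySketch {
    /** Hypothetical stand-in for Watchdog.Monitor. */
    interface Monitor {
        void monitor();
    }

    /** Hypothetical stand-in for HandlerChecker: a name, a timeout and its monitors. */
    static final class Checker {
        final String name;
        final long waitMaxMillis;
        final List<Monitor> monitors = new ArrayList<>();

        Checker(String name, long waitMaxMillis) {
            this.name = name;
            this.waitMaxMillis = waitMaxMillis;
        }
    }

    static final long DEFAULT_TIMEOUT = 60_000; // non-debug value from the text

    final List<Checker> checkers = new ArrayList<>();
    final Checker monitorChecker;

    CheckerRegistrySketch() {
        // The foreground checker comes first and also hosts all monitors.
        monitorChecker = new Checker("foreground thread", DEFAULT_TIMEOUT);
        checkers.add(monitorChecker);
        String[] others = {"main thread", "ui thread", "i/o thread", "display thread"};
        for (String name : others) {
            checkers.add(new Checker(name, DEFAULT_TIMEOUT));
        }
        addMonitor(() -> { }); // placeholder for BinderThreadMonitor
    }

    /** Monitors are attached to the foreground checker only. */
    void addMonitor(Monitor monitor) {
        monitorChecker.monitors.add(monitor);
    }

    public static void main(String[] args) {
        CheckerRegistrySketch registry = new CheckerRegistrySketch();
        for (Checker c : registry.checkers) {
            System.out.println(c.name + " monitors=" + c.monitors.size());
        }
    }
}
```

The point of the sketch is that five threads are watched, but only the foreground checker carries monitor callbacks.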
3. Watchdog monitoring mechanism
The Watchdog thread runs inside SystemServer.
public void run() {
boolean waitedHalf = false;
boolean mSFHang = false;
while (true) {
final ArrayList<HandlerChecker> blockedCheckers;
String subject;
mSFHang = false;
final boolean allowRestart;
synchronized (this) {
long timeout = CHECK_INTERVAL; // (DB ? 10*1000 : 60*1000) /2
long SFHangTime;
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
hc.scheduleCheckLocked(); // 3.1 post a message to the Handler of each monitored thread
}
long start = SystemClock.uptimeMillis();
while (timeout > 0) { // wait 30s before continuing
try {
wait(timeout);
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start); // remaining wait; <= 0 once 30s have elapsed
}
final int waitState = evaluateCheckerCompletionLocked(); // 3.3 evaluate the HandlerChecker states
if (waitState == COMPLETED) {
// The monitors have returned; reset
continue;
} else if (waitState == WAITING) {
// still waiting but within their configured intervals; back off and recheck
continue;
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) { // first time a checker has been blocked for more than 30s
// We've waited half the deadlock-detection interval. Pull a stack
// trace and wait another half.
ArrayList<Integer> pids = new ArrayList<Integer>();
pids.add(Process.myPid());
ActivityManagerService.dumpStackTraces(true, pids, null, null,
NATIVE_STACKS_OF_INTEREST); // first stack dump
waitedHalf = true;
}
continue;
}
blockedCheckers = getBlockedCheckersLocked(); // 3.5 collect the checkers that timed out (debug ? 10s : 60s)
subject = describeCheckersLocked(blockedCheckers); // 3.7 build the description
allowRestart = mAllowRestart;
}
// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
// The system is probably hung: collect the stack traces, then restart system_server
ArrayList<Integer> pids = new ArrayList<Integer>();
pids.add(Process.myPid());
if (mPhonePid > 0) pids.add(mPhonePid);
// Pass !waitedHalf so that just in case we somehow wind up here without having
// dumped the halfway stacks, we properly re-initialize the trace file.
final File stack = ActivityManagerService.dumpStackTraces(
!waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
// Also dumps the stacks of the native processes listed in NATIVE_STACKS_OF_INTEREST
/*
// Which native processes to dump into dropbox's stack traces
public static final String[] NATIVE_STACKS_OF_INTEREST = new String[] {
"/system/bin/audioserver",
"/system/bin/cameraserver",
"/system/bin/drmserver",
"/system/bin/mediadrmserver",
"/system/bin/mediaserver",
"/system/bin/sdcard",
"/system/bin/surfaceflinger",
"media.codec", // system/bin/mediacodec
"media.extractor", // system/bin/mediaextractor
"com.android.bluetooth", // Bluetooth service
};
*/
// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(2000); // make sure the stack traces are fully written
// Pull our own kernel thread stacks as well if we're configured for that
if (RECORD_KERNEL_THREADS) {
dumpKernelStackTraces();
}
// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
// dump the blocked-thread information from the kernel
doSysRq('w');
doSysRq('l');
// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked. (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null,
name, null, stack, null);
}
};
dropboxThread.start();
Slog.v(TAG, "** save all info before killnig system server **");
mActivity.addErrorToDropBox("watchdog", null, "system_server", null, null, subject, null, null, null);
if ((mSFHang == false) && (controller != null)) {
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
Slog.i(TAG, "Binder.setDumpDisabled");
// 1 = keep waiting, -1 = kill system
int res = controller.systemNotResponding(subject);
if (res >= 0) {
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue;
}
Slog.i(TAG, "Activity controller requested to reboot");
} catch (RemoteException e) {
}
}
// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
// print the stack trace of every timed-out thread
for (int i=0; i<blockedCheckers.size(); i++) {
Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
StackTraceElement[] stackTrace
= blockedCheckers.get(i).getThread().getStackTrace();
for (StackTraceElement element: stackTrace) {
Slog.w(TAG, " at " + element);
}
}
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid()); // kill system_server
System.exit(10);
}
}
}
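The inner `while (timeout > 0)` loop in run() is worth isolating: wait() may return early (a notify or a spurious wakeup), so the remaining time is recomputed after every wakeup until the full interval has passed. A minimal plain-Java sketch of that pattern follows; `WaitLoopSketch` and `waitFullInterval` are illustrative names, and `System.nanoTime()` stands in for SystemClock.uptimeMillis().

```java
final class WaitLoopSketch {
    /**
     * Blocks on the given lock until at least intervalMs have elapsed,
     * re-arming the wait after every early wakeup.
     * Returns the time actually spent, in milliseconds.
     */
    static long waitFullInterval(Object lock, long intervalMs) {
        final long start = System.nanoTime();
        long remaining = intervalMs;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Like the original loop, note the interruption and keep waiting.
                }
                remaining = intervalMs - elapsedMillis(start);
            }
        }
        return elapsedMillis(start);
    }

    private static long elapsedMillis(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }

    public static void main(String[] args) {
        Object lock = new Object();
        // An early notify must not shorten the interval.
        Thread notifier = new Thread(() -> {
            synchronized (lock) {
                lock.notifyAll();
            }
        });
        notifier.start();
        System.out.println("waited ms: " + waitFullInterval(lock, 200));
    }
}
```

Because the deadline is derived from a fixed start time, any number of early wakeups still adds up to one full check interval.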
3.1 HandlerChecker::scheduleCheckLocked
public void scheduleCheckLocked() {
if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
// If the target looper has recently been polling, then
// there is no reason to enqueue our checker on it since that
// is as good as it not being deadlocked. This avoid having
// to do a context switch to check the thread. Note that we
// only do this if mCheckReboot is false and we have no
// monitors, since those would need to be executed at this point.
mCompleted = true;
return;
}
if (!mCompleted) {
// we already have a check in flight, so no need
// only one check may be in flight at a time
return;
}
mCompleted = false; // check not completed yet
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis(); // record the start time
mHandler.postAtFrontOfQueue(this); // post to the monitored thread's MessageQueue; HandlerChecker.run() will handle it
}
3.2 HandlerChecker::run
public void run() {
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
mCurrentMonitor.monitor(); // call the monitor() method of the concrete service
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null;
}
}
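Sections 3.1 and 3.2 together form a small handshake: scheduleCheckLocked() posts the checker once, and run() marks it completed on the monitored thread. The sketch below reproduces only that handshake in plain Java, under these assumptions: a `java.util.concurrent.Executor` stands in for the Handler, the class name `InFlightCheckSketch` is hypothetical, and the isPolling() shortcut and the monitor loop are left out.

```java
import java.util.ArrayDeque;
import java.util.concurrent.Executor;

final class InFlightCheckSketch implements Runnable {
    private final Executor target; // stand-in for the monitored thread's Handler
    private boolean completed = true;
    private long startTime;

    InFlightCheckSketch(Executor target) {
        this.target = target;
    }

    /** Posts a check unless one is already pending. */
    synchronized void scheduleCheck() {
        if (!completed) {
            return; // a check is in flight; its original start time is kept
        }
        completed = false;
        startTime = System.currentTimeMillis();
        target.execute(this);
    }

    /** Runs on the monitored thread; reaching this point proves it is alive. */
    @Override
    public void run() {
        synchronized (this) {
            completed = true;
        }
    }

    synchronized boolean isCompleted() {
        return completed;
    }

    synchronized long startTime() {
        return startTime;
    }

    public static void main(String[] args) {
        // A queue that is drained by hand plays the role of a (possibly stuck) looper.
        ArrayDeque<Runnable> queue = new ArrayDeque<>();
        InFlightCheckSketch checker = new InFlightCheckSketch(queue::add);
        checker.scheduleCheck();
        checker.scheduleCheck(); // ignored: the first check is still pending
        System.out.println("queued=" + queue.size() + " completed=" + checker.isCompleted());
        queue.poll().run();
        System.out.println("completed=" + checker.isCompleted());
    }
}
```

Keeping the original start time while a check is pending is what lets the measured latency grow across several watchdog rounds until it crosses the half and full timeouts.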
3.2.1 AMS::monitor
/** In this method we try to acquire our lock to make sure that we have not deadlocked */
public void monitor() {
synchronized (this) { }
}
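monitor() detects a deadlock by trying to acquire the AMS lock: if another thread never releases it, monitor() never returns and the checker eventually times out.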
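The empty synchronized block can be tried out in isolation. In this hedged sketch, `LockProbeSketch` and `probe` are hypothetical names; `probe` runs a monitor on a helper thread and reports whether it returned within a timeout, which is roughly what the watchdog observes through mCompleted.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LockProbeSketch {
    final Object lock = new Object(); // stand-in for the service lock

    /** Same shape as AMS::monitor: returns only once the lock can be taken. */
    void monitor() {
        synchronized (lock) { }
    }

    /** Returns true if the monitor finished within timeoutMs. */
    static boolean probe(Runnable monitor, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread checker = new Thread(() -> {
            monitor.run();
            done.countDown();
        });
        checker.setDaemon(true); // do not keep the JVM alive if the lock is never freed
        checker.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        LockProbeSketch service = new LockProbeSketch();
        System.out.println("lock free: " + probe(service::monitor, 1000));
        synchronized (service.lock) {
            // While this thread holds the lock, the monitor cannot return.
            System.out.println("lock held: " + probe(service::monitor, 200));
        }
    }
}
```

The check costs almost nothing when the lock is free, which is why it can run on every watchdog round.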
3.3 Watchdog::evaluateCheckerCompletionLocked
private int evaluateCheckerCompletionLocked() {
int state = COMPLETED;
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
state = Math.max(state, hc.getCompletionStateLocked()); // get the checker's state
// state: the largest state among all threads monitored by watchdog
}
return state;
}
a. COMPLETED or WAITING: nothing to do;
b. WAITED_HALF (more than 30s) for the first time: dump the system_server stacks for the first time;
c. OVERDUE: output more information (AMS::dumpStackTraces, Kernel::dumpKernelStackTraces, dropbox entry)
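Taking the maximum works because the state constants are ordered by severity. The sketch below assumes the values COMPLETED=0, WAITING=1, WAITED_HALF=2, OVERDUE=3 (the snippets above do not show them), and `EvaluateSketch` with its `action` strings is purely illustrative.

```java
final class EvaluateSketch {
    // Assumed values: ordered so that a larger number means a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** The overall state is the worst state reported by any checker. */
    static int evaluate(int... checkerStates) {
        int state = COMPLETED;
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }

    /** Maps the overall state to the three cases a, b, c listed above. */
    static String action(int state, boolean waitedHalf) {
        switch (state) {
            case COMPLETED:
            case WAITING:
                return "continue";
            case WAITED_HALF:
                return waitedHalf ? "continue" : "dump system_server stacks";
            default:
                return "dump everything, then kill system_server";
        }
    }

    public static void main(String[] args) {
        int state = evaluate(COMPLETED, WAITED_HALF, WAITING);
        System.out.println(state + " -> " + action(state, false));
    }
}
```

One slow checker is enough to push the whole watchdog into WAITED_HALF or OVERDUE, regardless of how healthy the other threads are.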
3.4 HandlerChecker::getCompletionStateLocked
public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTime; // mStartTime: when the check message was posted to the Handler
if (latency < mWaitMax/2) { // 0~30s
return WAITING;
} else if (latency < mWaitMax) { // 30~60s
return WAITED_HALF;
}
}
return OVERDUE; // > 60s
}
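The thresholds are easier to verify as a pure function of the latency. This is a sketch of the same branching with the time source passed in as a parameter; the class name `CompletionStateSketch` and the numeric constants are assumptions for illustration.

```java
final class CompletionStateSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Same branching as getCompletionStateLocked, with latency supplied by the caller. */
    static int completionState(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMs < waitMaxMs / 2) {
            return WAITING;      // first half of the timeout
        }
        if (latencyMs < waitMaxMs) {
            return WAITED_HALF;  // second half of the timeout
        }
        return OVERDUE;          // timeout reached or exceeded
    }

    public static void main(String[] args) {
        long waitMax = 60_000;
        long[] samples = {10_000, 30_000, 59_999, 60_000};
        for (long latency : samples) {
            System.out.println(latency + " ms -> " + completionState(false, latency, waitMax));
        }
    }
}
```

With the 60s default, a pending check is WAITING below 30s, WAITED_HALF from 30s up to but not including 60s, and OVERDUE from 60s on.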
3.5 Watchdog::getBlockedCheckersLocked (collect the blocked checkers)
private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
if (hc.isOverdueLocked()) {
checkers.add(hc);
}
}
return checkers;
}
3.6 HandlerChecker::isOverdueLocked (timeout check)
public boolean isOverdueLocked() {
// mWaitMax = (DEFAULT_TIMEOUT = DB ? 10*1000 : 60*1000)
return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
}
3.7 Watchdog::describeCheckersLocked
private String describeCheckersLocked(ArrayList<HandlerChecker> checkers) {
StringBuilder builder = new StringBuilder(128);
for (int i=0; i<checkers.size(); i++) {
if (builder.length() > 0) {
builder.append(", ");
}
builder.append(checkers.get(i).describeBlockedStateLocked()); // description of every blocked HandlerChecker
}
return builder.toString();
}
3.8 HandlerChecker::describeBlockedStateLocked
public String describeBlockedStateLocked() {
if (mCurrentMonitor == null) { // no monitor is running (e.g. the non-foreground checkers)
return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
} else { // mMonitorChecker(foreground thread)
return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
+ " on " + mName + " (" + getThread().getName() + ")";
}
}
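Sections 3.7 and 3.8 produce the subject string that appears in the "WATCHDOG KILLING SYSTEM PROCESS" log line. The sketch below rebuilds that string from plain values; `DescribeSketch` and `Blocked` are hypothetical types, and the thread and class names in main() are illustrative sample data rather than output captured from a device.

```java
import java.util.Arrays;
import java.util.List;

final class DescribeSketch {
    /** Hypothetical snapshot of one blocked checker. */
    static final class Blocked {
        final String checkerName;
        final String threadName;
        final String monitorClass; // null when the handler itself is blocked

        Blocked(String checkerName, String threadName, String monitorClass) {
            this.checkerName = checkerName;
            this.threadName = threadName;
            this.monitorClass = monitorClass;
        }
    }

    /** Same two message shapes as describeBlockedStateLocked. */
    static String describe(Blocked b) {
        if (b.monitorClass == null) {
            return "Blocked in handler on " + b.checkerName + " (" + b.threadName + ")";
        }
        return "Blocked in monitor " + b.monitorClass
                + " on " + b.checkerName + " (" + b.threadName + ")";
    }

    /** Joins the descriptions with ", " as describeCheckersLocked does. */
    static String describeAll(List<Blocked> blocked) {
        StringBuilder builder = new StringBuilder(128);
        for (Blocked b : blocked) {
            if (builder.length() > 0) {
                builder.append(", ");
            }
            builder.append(describe(b));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        List<Blocked> blocked = Arrays.asList(
                new Blocked("foreground thread", "android.fg",
                        "com.android.server.am.ActivityManagerService"),
                new Blocked("main thread", "main", null));
        System.out.println(describeAll(blocked));
    }
}
```

When reading a real watchdog report, "Blocked in monitor" points at a lock held inside the named service, while "Blocked in handler" points at a message that is stuck on the named thread.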
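4. Customizing a watchdog to keep a process alive
- The custom watchdog's lifecycle depends on the native Watchdog
a. When Watchdog is started during systemReady, the custom thread is started and stays in the waiting state (wait)
b. The intent of the process to launch is configured in the custom watchdog in advance
- Monitoring the process lifecycle
a. ams::startProcessLocked
b. ams::handleAppDiedLocked
- Restart
a. When AMS reports the death of the process through handleAppDiedLocked, a task is added to the queue of the thread started in step 1, and the watchdog is woken up (notifyAll) to restart the process.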
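The wait/notifyAll flow described in this section can be sketched in plain Java. Everything here is an assumption made for illustration: `KeepAliveSketch`, `onProcessDied`, `restart` and `awaitRestarts` are hypothetical names, the restart step only records the process name instead of launching an intent, and no AMS hook is involved.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class KeepAliveSketch implements Runnable {
    private final Object lock = new Object();
    private final ArrayDeque<String> pending = new ArrayDeque<>();
    final List<String> restarted = new CopyOnWriteArrayList<>();

    /** Would be called from the death callback (handleAppDiedLocked in the text). */
    void onProcessDied(String processName) {
        synchronized (lock) {
            pending.add(processName);
            lock.notifyAll(); // wake the waiting watchdog thread
        }
    }

    /** Watchdog thread body: sleep in wait() until a restart task arrives. */
    @Override
    public void run() {
        while (true) {
            String next;
            synchronized (lock) {
                while (pending.isEmpty()) {
                    try {
                        lock.wait();
                    } catch (InterruptedException e) {
                        return; // interruption is used here as the stop signal
                    }
                }
                next = pending.poll();
            }
            restart(next);
        }
    }

    /** Placeholder: real code would launch the preconfigured intent. */
    private void restart(String processName) {
        restarted.add(processName);
        synchronized (lock) {
            lock.notifyAll(); // lets awaitRestarts() observe the progress
        }
    }

    /** Helper for demos: waits until n restarts happened or the timeout expired. */
    boolean awaitRestarts(int n, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000;
        synchronized (lock) {
            while (restarted.size() < n) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000;
                if (remainingMs <= 0) {
                    return false;
                }
                try {
                    lock.wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        KeepAliveSketch watchdog = new KeepAliveSketch();
        Thread thread = new Thread(watchdog, "custom-watchdog");
        thread.setDaemon(true);
        thread.start();
        watchdog.onProcessDied("com.example.keepalive"); // hypothetical process name
        watchdog.awaitRestarts(1, 1000);
        System.out.println("restarted: " + watchdog.restarted);
    }
}
```

The restart runs outside the lock so that a slow launch does not block the death callback from queuing further tasks.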