[This tutorial was written by 無心追求]
Watchdog in Android
- In Android, Watchdog monitors critical system services for deadlocks. If it detects one, it kills the process, which restarts SystemServer.
- Watchdog is initialized in SystemServer, so it runs inside the SystemServer process.
- Watchdog runs on its own dedicated thread and starts a check after each 30 s wait. If the system goes to sleep, the wait is suspended with it, and monitoring resumes only after the system wakes up.
- An object that wants to be monitored implements the monitor() method of the Watchdog.Monitor interface and then registers itself with addMonitor().
- Besides thread deadlocks, the framework's Watchdog can also detect stalled threads: addMonitor() covers deadlocks, and addThread() covers stalls.
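The lock-probe idea behind monitor() can be tried outside Android with a small plain-Java sketch. The names `LockMonitor`, `GuardedService`, and `LockProbeDemo` are hypothetical and exist only for this illustration; they are not framework classes, and the real Watchdog runs its probes through a Handler rather than a helper thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Stand-in for Watchdog.Monitor: monitor() returns quickly unless a lock is stuck. */
interface LockMonitor {
    void monitor();
}

/** Hypothetical service that guards its state with its own intrinsic lock. */
class GuardedService implements LockMonitor {
    @Override
    public void monitor() {
        // Acquire and immediately release the lock; blocks while another thread holds it.
        synchronized (this) { }
    }
}

class LockProbeDemo {
    /** Waits on a latch without propagating InterruptedException. */
    private static boolean awaitQuietly(CountDownLatch latch, long timeoutMs) {
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Runs monitor() on a helper thread and reports whether it returned within timeoutMs. */
    static boolean probeReturns(LockMonitor m, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread probe = new Thread(() -> { m.monitor(); done.countDown(); }, "probe");
        probe.setDaemon(true);
        probe.start();
        return awaitQuietly(done, timeoutMs);
    }

    /** Probes the service while another thread holds its lock, imitating a stuck lock. */
    static boolean probeWhileLockHeld(GuardedService s, long timeoutMs) {
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (s) {
                held.countDown();
                awaitQuietly(release, 60_000); // keep the lock until told to release it
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();
        awaitQuietly(held, 5_000);
        try {
            return probeReturns(s, timeoutMs);
        } finally {
            release.countDown();
        }
    }
}
```

With a free lock the probe returns almost immediately; while the lock is held, the probe stays blocked until the holder lets go, which is the signal a watchdog looks for.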
Watchdog thread deadlock monitoring
- To have one of its locks monitored for deadlock, an object implements monitor() from the Watchdog.Monitor interface and then calls addMonitor(). ActivityManagerService is one example:
public final class ActivityManagerService extends ActivityManagerNative
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
    public ActivityManagerService(Context systemContext) {
        Watchdog.getInstance().addMonitor(this);
    }
    public void monitor() {
        synchronized (this) { }
    }
    // ...
}
The code above is the part of ActivityManagerService that lets Watchdog monitor the ActivityManagerService object lock. The monitoring itself is implemented as follows. Watchdog is a Thread; once started, it waits 30 s, runs a check, and repeats this loop indefinitely:
public void addMonitor(Monitor monitor) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Monitors can't be added once the Watchdog is running");
        }
        mMonitorChecker.addMonitor(monitor);
    }
}
@Override
public void run() {
boolean waitedHalf = false;
while (true) {
final ArrayList<HandlerChecker> blockedCheckers;
final String subject;
final boolean allowRestart;
int debuggerWasConnected = 0;
synchronized (this) {
long timeout = CHECK_INTERVAL;
// Make sure we (re)spin the checkers that have become idle within
// this wait-and-check interval
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
hc.scheduleCheckLocked();
}
if (debuggerWasConnected > 0) {
debuggerWasConnected--;
}
// NOTE: We use uptimeMillis() here because we do not want to increment the time we
// wait while asleep. If the device is asleep then the thing that we are waiting
// to timeout on is asleep as well and won't have a chance to run, causing a false
// positive on when to kill things.
long start = SystemClock.uptimeMillis();
while (timeout > 0) {
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
try {
wait(timeout);
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
// The monitors have returned; reset
waitedHalf = false;
continue;
} else if (waitState == WAITING) {
// still waiting but within their configured intervals; back off and recheck
continue;
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) {
// We've waited half the deadlock-detection interval. Pull a stack
// trace and wait another half.
ArrayList<Integer> pids = new ArrayList<Integer>();
pids.add(Process.myPid());
ActivityManagerService.dumpStackTraces(true, pids, null, null,
NATIVE_STACKS_OF_INTEREST);
waitedHalf = true;
}
continue;
}
// something is overdue!
blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers);
allowRestart = mAllowRestart;
}
// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
ArrayList<Integer> pids = new ArrayList<Integer>();
pids.add(Process.myPid());
if (mPhonePid > 0) pids.add(mPhonePid);
// Pass !waitedHalf so that just in case we somehow wind up here without having
// dumped the halfway stacks, we properly re-initialize the trace file.
final File stack = ActivityManagerService.dumpStackTraces(
!waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(2000);
// Pull our own kernel thread stacks as well if we're configured for that
if (RECORD_KERNEL_THREADS) {
dumpKernelStackTraces();
}
String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
String traceFileNameAmendment = "_SystemServer_WDT" + mTraceDateFormat.format(new Date());
if (tracesPath != null && tracesPath.length() != 0) {
File traceRenameFile = new File(tracesPath);
String newTracesPath;
int lpos = tracesPath.lastIndexOf (".");
if (-1 != lpos)
newTracesPath = tracesPath.substring (0, lpos) + traceFileNameAmendment + tracesPath.substring (lpos);
else
newTracesPath = tracesPath + traceFileNameAmendment;
traceRenameFile.renameTo(new File(newTracesPath));
tracesPath = newTracesPath;
}
final File newFd = new File(tracesPath);
// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked. (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null,
subject, null, newFd, null);
}
};
dropboxThread.start();
try {
dropboxThread.join(2000); // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}
// At times, when user space watchdog traces don't give an indication on
// which component held a lock, because of which other threads are blocked,
// (thereby causing Watchdog), crash the device to analyze RAM dumps
boolean crashOnWatchdog = SystemProperties
.getBoolean("persist.sys.crashOnWatchdog", false);
if (crashOnWatchdog) {
// Trigger the kernel to dump all blocked threads, and backtraces
// on all CPUs to the kernel log
Slog.e(TAG, "Triggering SysRq for system_server watchdog");
doSysRq('w');
doSysRq('l');
// wait until the above blocked threads be dumped into kernel log
SystemClock.sleep(3000);
// now try to crash the target
doSysRq('c');
}
IActivityController controller;
synchronized (this) {
controller = mController;
}
if (controller != null) {
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
// 1 = keep waiting, -1 = kill system
int res = controller.systemNotResponding(subject);
if (res >= 0) {
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue;
}
} catch (RemoteException e) {
}
}
// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
for (int i=0; i<blockedCheckers.size(); i++) {
Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
StackTraceElement[] stackTrace
= blockedCheckers.get(i).getThread().getStackTrace();
for (StackTraceElement element: stackTrace) {
Slog.w(TAG, " at " + element);
}
}
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
First, ActivityManagerService calls addMonitor() to add itself to Watchdog's mMonitorChecker. This is a member field of Watchdog that is created in the Watchdog constructor and added to mHandlerCheckers, the ArrayList<HandlerChecker> that holds all checkers. mMonitorChecker is an instance of the HandlerChecker class:
public final class HandlerChecker implements Runnable {
private final Handler mHandler;
private final String mName;
private final long mWaitMax;
private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
private boolean mCompleted;
private Monitor mCurrentMonitor;
private long mStartTime;
HandlerChecker(Handler handler, String name, long waitMaxMillis) {
mHandler = handler;
mName = name;
mWaitMax = waitMaxMillis;
mCompleted = true;
}
public void addMonitor(Monitor monitor) {
mMonitors.add(monitor);
}
public void scheduleCheckLocked() {
if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
// If the target looper has recently been polling, then
// there is no reason to enqueue our checker on it since that
// is as good as it not being deadlocked. This avoid having
// to do a context switch to check the thread. Note that we
// only do this if mCheckReboot is false and we have no
// monitors, since those would need to be executed at this point.
mCompleted = true;
return;
}
if (!mCompleted) {
// we already have a check in flight, so no need
return;
}
mCompleted = false;
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis();
mHandler.postAtFrontOfQueue(this);
}
public boolean isOverdueLocked() {
return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
}
public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTime;
if (latency < mWaitMax/2) {
return WAITING;
} else if (latency < mWaitMax) {
return WAITED_HALF;
}
}
return OVERDUE;
}
public Thread getThread() {
return mHandler.getLooper().getThread();
}
public String getName() {
return mName;
}
public String describeBlockedStateLocked() {
if (mCurrentMonitor == null) {
return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
} else {
return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
+ " on " + mName + " (" + getThread().getName() + ")";
}
}
@Override
public void run() {
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null;
}
}
}
mMonitors in HandlerChecker is also a list of monitored objects: it holds every object that implements Watchdog.Monitor. A thread that is registered without a Watchdog.Monitor gets its own HandlerChecker, which is added to Watchdog's mHandlerCheckers list. When the Watchdog thread starts a monitoring pass, it iterates over mHandlerCheckers and calls scheduleCheckLocked() on each HandlerChecker:
public void scheduleCheckLocked() {
    if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
        // If the target looper has recently been polling, then
        // there is no reason to enqueue our checker on it since that
        // is as good as it not being deadlocked. This avoid having
        // to do a context switch to check the thread. Note that we
        // only do this if mCheckReboot is false and we have no
        // monitors, since those would need to be executed at this point.
        mCompleted = true;
        return;
    }
    if (!mCompleted) {
        // we already have a check in flight, so no need
        return;
    }
    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    mHandler.postAtFrontOfQueue(this);
}
HandlerChecker has a few important fields. mCompleted records whether the current check has finished, mStartTime records when the current check started, and mHandler is the Handler of the monitored thread. scheduleCheckLocked() starts a check of that thread, so it naturally sets mCompleted to false and records the start time. The principle is visible here: the check posts a task, the HandlerChecker itself, to the message queue of the monitored thread's Handler, and the task then runs on the monitored thread when the queue reaches it. If the queue is stuck behind some other task, the HandlerChecker cannot run in time, and once the allowed time has passed the monitored thread is considered hung (either because of a deadlock or because of a long-running task). The HandlerChecker task itself is:
@Override
public void run() {
    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) {
        synchronized (Watchdog.this) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor();
    }
    synchronized (Watchdog.this) {
        mCompleted = true;
        mCurrentMonitor = null;
    }
}
The task first walks the mMonitors list and calls monitor() on each monitored object. A monitored object's monitor() is usually implemented like this:
public void monitor() {
    synchronized (this) { }
}
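That is, it probes one particular lock: monitor() blocks for as long as another thread holds that lock. Once every monitor() has returned, the check is complete and mCompleted is set to true. After scheduleCheckLocked() has been called on every checker, Watchdog starts to wait, and it must wait the full 30 s. There is an implementation detail here: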
long start = SystemClock.uptimeMillis();
while (timeout > 0) {
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    try {
        wait(timeout);
    } catch (InterruptedException e) {
        Log.wtf(TAG, e);
    }
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
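When I first read this code, I noticed that SystemClock.uptimeMillis() does not advance while the device is asleep. So I wondered what happens if the device goes to sleep while wait() is in progress. Suppose Watchdog has waited 15 s when the device goes to sleep, and the device stays asleep for 30 minutes before it wakes up. Does wait() return immediately on wake-up? The answer is that normally the wait simply continues and returns only after the remaining 15 s have elapsed. That left me puzzled, so I looked up the documentation of wait() (the method is defined on Object) and found this explanation: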
    A thread can also wake up without being notified, interrupted, or
    timing out, a so-called spurious wakeup. While this will rarely
    occur in practice, applications must guard against it by testing for
    the condition that should have caused the thread to be awakened, and
    continuing to wait if the condition is not satisfied. In other words,
    waits should always occur in loops, like this one:

        synchronized (obj) {
            while (<condition does not hold>)
                obj.wait(timeout);
            ... // Perform action appropriate to condition
        }
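Watchdog's wait loop applies the same guarded-wait idiom to elapsed time: after every return from wait() it recomputes how much of the interval is left and waits again if any remains. Below is a minimal plain-Java sketch of that pattern. `FullIntervalWait` is a hypothetical name, and `System.nanoTime()` stands in for `SystemClock.uptimeMillis()`, which is only available on Android.

```java
class FullIntervalWait {
    /**
     * Blocks on the given lock for at least intervalMs, waiting again after any early return
     * (notification, interrupt, or spurious wakeup).
     * Returns how many times wait() came back, including interrupted waits.
     */
    static int waitFullInterval(Object lock, long intervalMs) {
        int wakeups = 0;
        synchronized (lock) {
            long start = System.nanoTime();
            long remaining = intervalMs;
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Watchdog logs the exception and keeps waiting; this sketch just keeps waiting.
                }
                wakeups++;
                // Recompute the remaining time from the clock instead of trusting wait().
                long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
                remaining = intervalMs - elapsedMs;
            }
        }
        return wakeups;
    }
}
```

A return value greater than 1 means wait() came back early at least once and the loop absorbed it, so the caller still gets the full interval.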
In short, a thread blocked in wait() can wake up not only because it was notified (notify() or notifyAll()), interrupted (interrupt()), or timed out, but also spuriously. Spurious wakeups are rare in practice, but a program must still guard against them by rechecking the wake-up condition and going back to waiting if the condition does not hold. Details like this matter in real development. The problem may not occur 99% of the time, but when the remaining 1% happens it is very obscure and hard to track down: why did a thread that was waiting normally suddenly wake up? One might even start to doubt one's understanding of how wait() behaves while the device is asleep. Enough of the digression; back to the Watchdog mechanism. After waiting 30 s, Watchdog calls evaluateCheckerCompletionLocked() to check the state of the monitored objects:
private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}
It calls getCompletionStateLocked() on each HandlerChecker to obtain that checker's state:
public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}
So the mCompleted flag is what distinguishes the state before the 30 s wait from the state after it. Before the wait, a HandlerChecker task was posted to the message queue of the monitored thread's Handler and mCompleted was set to false. After the wait, mCompleted == true means the task ran in time, while mCompleted == false means the HandlerChecker has still not finished. In that case Watchdog looks at how long the task has been pending, counted in uptime, against the thresholds mWaitMax / 2 and mWaitMax (30 s and 60 s here). Under 30 s, the state is WAITING and Watchdog simply waits again; the check is still in flight, so scheduleCheckLocked() does not post it a second time. Between 30 s and 60 s, the state is WAITED_HALF; Watchdog dumps stack traces once and waits again. At 60 s or more, the state is OVERDUE and the situation is treated as abnormal (either a deadlock or a task that runs far too long). Watchdog then collects diagnostic information such as stack traces, kernel information, and CPU information, writes the trace file, saves the report to the dropbox, and kills the process. That ends the monitoring.
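The classification and the Math.max aggregation can be checked in isolation with a small plain-Java sketch. `CheckerStates`, `stateOf`, and `worstOf` are hypothetical names. The numeric values of the four constants are an assumption: the code quoted in this article only shows that states are combined with Math.max, which requires a worse state to have a larger value.

```java
class CheckerStates {
    // Assumed values; only their ordering matters for the Math.max aggregation.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Mirrors the branches of getCompletionStateLocked() for a single checker. */
    static int stateOf(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) return COMPLETED;
        if (latencyMs < waitMaxMs / 2) return WAITING;
        if (latencyMs < waitMaxMs) return WAITED_HALF;
        return OVERDUE;
    }

    /** Mirrors evaluateCheckerCompletionLocked(): the worst checker decides the result. */
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }
}
```

For a 60 s limit, a check pending for 10 s is WAITING, one pending for 30 s is WAITED_HALF, and one pending for 60 s is OVERDUE. A single slow checker is enough to move the overall result away from COMPLETED.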
Watchdog thread stall monitoring
As described earlier, Watchdog monitors a thread by posting a HandlerChecker to the message queue of that thread's Handler, and the objects monitored for deadlock are stored in a HandlerChecker's mMonitors list. Every external call to addMonitor() therefore ends up in the list held by Watchdog's mMonitorChecker field, so mMonitorChecker is responsible for all deadlock monitoring. For long-running tasks on a thread, Watchdog provides addThread():
public void addThread(Handler thread) {
    addThread(thread, DEFAULT_TIMEOUT);
}
public void addThread(Handler thread, long timeoutMillis) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Threads can't be added once the Watchdog is running");
        }
        final String name = thread.getLooper().getThread().getName();
        mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
    }
}
addThread() creates a new HandlerChecker that monitors the thread for long-running tasks. The mMonitors list of this HandlerChecker is empty, so when the task runs it calls no monitor() and simply sets the mCompleted flag. In other words, the component that does the monitoring is HandlerChecker, and it implements both deadlock monitoring and long-task monitoring. A HandlerChecker with Monitor objects checks for both deadlocks and long-running tasks, while one without any Monitor only detects stalls caused by long-running tasks on its thread.
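The same structure can be sketched in plain Java, with an Executor standing in for the Handler. `MiniChecker` and its methods are hypothetical names for this illustration. Two differences from the framework code are deliberate simplifications: a plain Executor appends the probe to the end of its queue, whereas HandlerChecker uses postAtFrontOfQueue(), and time is measured with System.nanoTime() instead of SystemClock.uptimeMillis().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;

/** Plain-Java stand-in for HandlerChecker; the Executor plays the role of the Handler. */
class MiniChecker implements Runnable {
    private final Executor queue;
    private final List<Runnable> monitors = new ArrayList<>();
    private boolean completed = true;
    private long startNanos;

    MiniChecker(Executor queue) {
        this.queue = queue;
    }

    synchronized void addMonitor(Runnable monitor) {
        monitors.add(monitor);
    }

    /** Posts this checker to the monitored queue unless a check is already in flight. */
    synchronized void scheduleCheck() {
        if (!completed) return;
        completed = false;
        startNanos = System.nanoTime();
        queue.execute(this);
    }

    /** Runs on the monitored thread: lock probes first (possibly none), then mark completion. */
    @Override
    public void run() {
        List<Runnable> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(monitors);
        }
        for (Runnable m : snapshot) {
            m.run();
        }
        synchronized (this) {
            completed = true;
            notifyAll();
        }
    }

    /** Milliseconds the current check has been pending, or 0 if none is in flight. */
    synchronized long pendingMillis() {
        return completed ? 0 : (System.nanoTime() - startNanos) / 1_000_000L;
    }

    /** Waits up to timeoutMs for the in-flight check to finish; true if it has completed. */
    synchronized boolean awaitCompletion(long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!completed) {
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) return false;
            try {
                wait(remainingMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return completed;
            }
        }
        return true;
    }
}
```

With no monitors registered, the checker behaves like one created by addThread(): it only reports whether the queue got around to running the probe. Adding monitors makes the same probe block on stuck locks as well.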
Watchdog monitoring flow
- Having understood how Watchdog monitors, can we apply the same mechanism in our own projects to detect deadlocks on important threads in multithreaded code, and to detect ANRs on the main thread as they happen? Yes. In the framework, Watchdog's main job is exactly to detect deadlocks or stalls in the key system services, such as ActivityManagerService. When something goes wrong, Watchdog kills the process so that it restarts, which lets important system services recover from such problems. Watchdog is effectively a last line of defense: it dumps the diagnostic information in time and restores the process to a working state.
- In an application, deadlock monitoring for important threads can follow the same principle as Watchdog.
- ANR (stall) monitoring in an application can also borrow from Watchdog, with a slightly different implementation. The ANR timeout is 5 s for an Activity, 10 s for a Broadcast, and 20 s for a Service, but all four component types run on the main thread. So, like Watchdog (which waits 30 s), a monitor can wait for a fixed interval and then start a check: it uses an mCompleted flag to detect whether the task posted to the MessageQueue is stuck and has not run in time, and uses mStartTime to compute how long the task has been pending. That pending time shows whether other tasks in the MessageQueue are taking too long. If it exceeds 5 s, the queue contains a long-running task and there is a risk of an ANR, so the monitor should dump and save the thread stacks immediately and report them to a backend for analysis. Remember to measure only the time during which the device is awake: while the device is asleep the MessageQueue is paused anyway, and that is neither a deadlock nor a stall.
[Figure: anr1.jpg]
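A single application-level check of this kind might look like the following plain-Java sketch. `StallReporter` and `checkOnce` are hypothetical names. The target queue is modeled as an Executor; on Android one would presumably pass something like a main-thread Handler's post method together with Looper.getMainLooper().getThread(), but that wiring is an assumption and is not shown here. The sketch also measures ordinary elapsed time, so it does not implement the "device awake time only" rule described above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

class StallReporter {
    /**
     * Posts a no-op probe to the target thread's queue and waits up to thresholdMs for it to run.
     * Returns null if the probe ran in time, otherwise a snapshot of the target thread's stack.
     */
    static String checkOnce(Executor targetQueue, Thread targetThread, long thresholdMs) {
        CountDownLatch ran = new CountDownLatch(1);
        targetQueue.execute(ran::countDown);
        boolean inTime;
        try {
            inTime = ran.await(thresholdMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null; // the check was abandoned, so nothing is reported
        }
        if (inTime) {
            return null;
        }
        // The probe is still queued: capture what the target thread is doing right now.
        StringBuilder report = new StringBuilder();
        report.append("Stalled for over ").append(thresholdMs).append(" ms: ")
              .append(targetThread.getName()).append('\n');
        for (StackTraceElement element : targetThread.getStackTrace()) {
            report.append("    at ").append(element).append('\n');
        }
        return report.toString();
    }
}
```

The returned text is what a monitor would save and upload; calling checkOnce() from a dedicated thread in a loop gives the periodic behavior described in the list above.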
Watchdog mechanism summary
- Each thread can have one Looper, and each Looper has one MessageQueue. Posting a check task to the MessageQueue and seeing whether it runs in time therefore detects stalled tasks on that thread, provided the thread has created a Looper first.
- Watchdog must run on its own separate thread so that it can monitor other threads without being affected by them.
- Online ANR monitoring built on the Watchdog mechanism may not be 100% accurate. Take a 5 s ANR threshold. In one case, the long-running task finishes just before the 5 s limit and the check task then starts to run; if the Watchdog thread's wait expires while the check task is still running, it finds the check unfinished and reports an ANR, which is a false positive. In the other case, a 5 s ANR has already occurred while the Watchdog thread is still waiting, so the ANR and the wait do not line up; by the time the Watchdog thread starts its next wait, the ANR is over and the main thread may be back to normal, so the ANR is never recorded. An ANR is therefore reliably caught and recorded only when the stall lasts at least twice the Watchdog wait interval, which for a 5 s threshold means a wait interval of 2.5 s. That is rather frequent in practice: if the device never sleeps, Watchdog runs every 2.5 s, which may cost battery.
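The "twice the wait interval" claim can be checked with a small simulation under a simplified model. The model is an assumption made for this sketch, not Watchdog's actual code: probes are posted at times 0, I, 2I, ... (in milliseconds), a probe posted during a stall finishes when the stall ends, a probe posted outside a stall finishes immediately, and a stall is detected when some probe is still pending one interval after it was posted. `ProbeIntervalModel` and its methods are hypothetical names.

```java
class ProbeIntervalModel {
    /** True if a stall covering [stallStart, stallStart + stallLen) leaves a probe pending at its check. */
    static boolean detected(long interval, long stallStart, long stallLen) {
        long stallEnd = stallStart + stallLen;
        for (long post = 0; post < stallEnd; post += interval) {
            // Probes posted before the stall run at once; probes posted during it run when it ends.
            long done = (post >= stallStart) ? stallEnd : post;
            if (done > post + interval) {
                return true;
            }
        }
        return false;
    }

    /** True if the stall is detected for every start offset within one probe interval. */
    static boolean alwaysDetected(long interval, long stallLen) {
        for (long offset = 0; offset < interval; offset++) {
            if (!detected(interval, offset, stallLen)) {
                return false;
            }
        }
        return true;
    }
}
```

In this model a 5000 ms stall is caught at every offset with a 2500 ms interval, but can be missed with a 5000 ms interval, and a 4999 ms stall can already slip past a 2500 ms interval. This matches the rule that the stall must last at least twice the wait interval.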