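1. Background
During routine testing one afternoon, a tester saw the system reboot unexpectedly. After the reboot, the boot screen kept spinning and the system never reached the Launcher.
2. Investigation
2.1 Initial diagnosis
Unfortunately, the head unit's adb port was not working. I first logged in to the Android system over the serial port and configured the USB mode so that adb worked, then connected to the head unit over ADB.
Once connected, I ran `uptime` to check how long the system had been up, in order to tell whether Android or QNX had rebooted.
The uptime was only a little over 10 minutes, so QNX had clearly rebooted. That split the original issue into two problems:
Problem 1: a QNX fault caused the whole system to reboot.
Problem 2: after the reboot, the system could not reach the Launcher.
This article covers problem 2: a deadlock among system services prevented the system from reaching the Launcher.
2.2 In-depth investigation
2.2.1 Checking which processes started
`ps -A` lists all running processes: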
......
mdnsr 733 1 5628 1184 poll_sche+ 0 S mdnsd
root 745 2 0 0 worker_th+ 0 I [kworker/4:3]
radio 747 1 13924 3064 poll_sche+ 0 S ipacm-diag
radio 751 1 22164 6476 futex_wai+ 0 S ipacm
# system_server started
system 761 341 4523728 162136 ep_poll 0 S system_server
root 866 2 0 0 worker_th+ 0 I [kworker/u13:2]
root 949 2 0 0 worker_th+ 0 I [kworker/1:4]
root 950 2 0 0 worker_th+ 0 I [kworker/2:3]
root 951 2 0 0 worker_th+ 0 I [kworker/2:4]
root 952 2 0 0 worker_th+ 0 I [kworker/2:5]
root 954 2 0 0 worker_th+ 0 I [kworker/2:6]
root 955 2 0 0 worker_th+ 0 I [kworker/1:5]
root 957 2 0 0 worker_th+ 0 I [kworker/1:6]
root 964 2 0 0 worker_th+ 0 I [kworker/1:7]
root 1006 2 0 0 worker_th+ 0 I [kworker/0:3]
root 1007 2 0 0 worker_th+ 0 I [kworker/1:8]
root 1008 2 0 0 worker_th+ 0 I [kworker/0:4]
root 1012 2 0 0 worker_th+ 0 I [kworker/0:5]
root 1015 2 0 0 worker_th+ 0 I [kworker/0:6]
root 1017 2 0 0 worker_th+ 0 I [kworker/0:7]
root 1020 2 0 0 worker_th+ 0 I [kworker/0:8]
root 1021 2 0 0 worker_th+ 0 I [kworker/0:9]
root 1022 2 0 0 worker_th+ 0 I [kworker/0:10]
root 1023 2 0 0 worker_th+ 0 I [kworker/0:11]
root 1025 2 0 0 worker_th+ 0 I [kworker/0:12]
# scene service started
system 1045 341 4361960 67808 ep_poll 0 S com.gxa.car.scene
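SystemServer has started, and startOtherServices has already brought up the service that manages the upper-layer scenes (com.gxa.car.scene), but no further services start after that. My initial judgment was therefore that some system service was stuck.
Use `kill -3 761` to dump the stacks of all threads in the process.
This generates a file named **"trace_00"** under `/data/anr`. I exported the file and started checking which threads were stuck.
2.2.2 Analyzing the trace log
Normally you start reading a trace from the main thread. In this case, though, the head unit stayed on the boot screen and never rebooted, so my initial guess was that the main thread was not stuck: if it were, the watchdog would restart Android. I checked the main thread's state anyway.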
"main" prio=5 tid=1 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x74726b48 self=0x7b90614c00
| sysTid=729 nice=-2 cgrp=default sched=0/0 handle=0x7c16087548
| state=S schedstat=( 822945383 262318239 1924 ) utm=56 stm=25 core=1 HZ=100
| stack=0x7fdac06000-0x7fdac08000 stackSize=8MB
| held mutexes=
at com.android.server.am.ActivityManagerService.registerReceiver(ActivityManagerService.java:20857)
- waiting to lock <0x033008f9> (a com.android.server.am.ActivityManagerService) held by thread 86
at android.app.ContextImpl.registerReceiverInternal(ContextImpl.java:1488)
at android.app.ContextImpl.registerReceiver(ContextImpl.java:1449)
at com.android.server.net.NetworkStatsService.systemReady(NetworkStatsService.java:398)
at com.android.server.SystemServer.lambda$startOtherServices$4(SystemServer.java:1851)
at com.android.server.-$$Lambda$SystemServer$s9erd2iGXiS7bbg_mQJUxyVboQM.run(lambda:-1)
at com.android.server.am.ActivityManagerService.systemReady(ActivityManagerService.java:15282)
at com.android.server.SystemServer.startOtherServices(SystemServer.java:1777)
at com.android.server.SystemServer.run(SystemServer.java:444)
at com.android.server.SystemServer.main(SystemServer.java:303)
at java.lang.reflect.Method.invoke(Native method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:493)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:838)
The main thread is waiting for lock <0x033008f9> inside registerReceiver(). Following the trail, let's see which thread holds that lock.
"Binder:729_7" prio=5 tid=86 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x12e80020 self=0x7b87c51000
| sysTid=1259 nice=-2 cgrp=default sched=0/0 handle=0x7b6e7df4f0
| state=S schedstat=( 3735363 3189481 17 ) utm=0 stm=0 core=2 HZ=100
| stack=0x7b6e6e4000-0x7b6e6e6000 stackSize=1009KB
| held mutexes=
at com.android.server.wm.WindowManagerService.deferSurfaceLayout(WindowManagerService.java:2894)
- waiting to lock <0x09464241> (a com.android.server.wm.WindowHashMap) held by thread 20
at com.android.server.am.ActivityManagerService.handleAppDiedLocked(ActivityManagerService.java:5928)
at com.android.server.am.ActivityManagerService.appDiedLocked(ActivityManagerService.java:6107)
at com.android.server.am.ActivityManagerService$AppDeathRecipient.binderDied(ActivityManagerService.java:1885)
- locked <0x033008f9> (a com.android.server.am.ActivityManagerService)
at android.os.BinderProxy.sendDeathNotice(Binder.java:1193)
Thread 86, named Binder:729_7, holds the <0x033008f9> lock that main is waiting for, but it is itself blocked in deferSurfaceLayout(), waiting for lock <0x09464241> (an instance of WindowHashMap; in the source this is the mWindowMap object). Keep following the trail.
"android.display" prio=5 tid=20 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x12c40f88 self=0x7b907ddc00
| sysTid=806 nice=-4 cgrp=default sched=0/0 handle=0x7b74e0a4f0
| state=S schedstat=( 28085321 36714886 210 ) utm=1 stm=1 core=3 HZ=100
| stack=0x7b74d07000-0x7b74d09000 stackSize=1041KB
| held mutexes=
at com.android.server.policy.PhoneWindowManager.canDismissBootAnimation(PhoneWindowManager.java:7637)
- waiting to lock <0x0a6033e6> (a java.lang.Object) held by thread 13
at com.android.server.wm.WindowManagerService.performEnableScreen(WindowManagerService.java:3455)
- locked <0x09464241> (a com.android.server.wm.WindowHashMap)
at com.android.server.wm.WindowManagerService.access$1100(WindowManagerService.java:272)
at com.android.server.wm.WindowManagerService$H.handleMessage(WindowManagerService.java:4861)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loop(Looper.java:193)
at android.os.HandlerThread.run(HandlerThread.java:65)
at com.android.server.ServiceThread.run(ServiceThread.java:44)
The android.display thread holds <0x09464241>. Before it can release that lock, it has to acquire lock <0x0a6033e6> (named mLock in the source) inside canDismissBootAnimation(). Keep following the trail.
"android.ui" prio=5 tid=13 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x12c408b0 self=0x7b907d7c00
| sysTid=799 nice=-2 cgrp=default sched=0/0 handle=0x7b756324f0
| state=S schedstat=( 24267703 15746873 101 ) utm=2 stm=0 core=0 HZ=100
| stack=0x7b7552f000-0x7b75531000 stackSize=1041KB
| held mutexes=
at com.android.server.wm.WindowManagerService$LocalService.waitForAllWindowsDrawn(WindowManagerService.java:7370)
- waiting to lock <0x09464241> (a com.android.server.wm.WindowHashMap) held by thread 20
at com.android.server.policy.PhoneWindowManager.finishKeyguardDrawn(PhoneWindowManager.java:6885)
at com.android.server.policy.PhoneWindowManager.screenTurningOn(PhoneWindowManager.java:6938)
- locked <0x0a6033e6> (a java.lang.Object)
at com.android.server.policy.PhoneWindowManager.systemBooted(PhoneWindowManager.java:7631)
at com.android.server.wm.WindowManagerService.enableScreenAfterBoot(WindowManagerService.java:3392)
at com.android.server.am.ActivityManagerService.enableScreenAfterBoot(ActivityManagerService.java:7993)
at com.android.server.am.ActivityManagerService.ensureBootCompleted(ActivityManagerService.java:8162)
at com.android.server.am.ActivityManagerService$UiHandler.handleMessage(ActivityManagerService.java:2033)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loop(Looper.java:193)
at android.os.HandlerThread.run(HandlerThread.java:65)
at com.android.server.ServiceThread.run(ServiceThread.java:44)
at com.android.server.UiThread.run(UiThread.java:43)
The android.ui thread holds <0x0a6033e6> and, inside waitForAllWindowsDrawn(), is waiting for <0x09464241> to be released!
The problem is finally clear:
android.display holds the mWindowMap lock <0x09464241> and waits for the mLock lock <0x0a6033e6>.
android.ui holds the mLock lock <0x0a6033e6> and waits for the mWindowMap lock <0x09464241>.
This is a circular dependency, which means a deadlock.
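The manual "follow the trail" steps above can also be sketched as a small program that builds a wait-for graph from the `waiting to lock ... held by thread N` lines and walks it until a thread repeats. This is only an illustrative sketch: the class `WaitForGraph` and its methods are hypothetical names, the regular expressions assume the exact trace layout shown in this article, and only the header and "waiting to lock" lines of each thread are fed in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Android: finds a lock cycle in trace text.
public class WaitForGraph {
    // Thread header, e.g.: "android.ui" prio=5 tid=13 Blocked
    private static final Pattern HEADER =
            Pattern.compile("^\"(.+)\" prio=\\d+ tid=(\\d+) ");
    // e.g.: - waiting to lock <0x09464241> (...) held by thread 20
    private static final Pattern WAITING =
            Pattern.compile("waiting to lock <(0x[0-9a-f]+)>.* held by thread (\\d+)");

    /** Maps each blocked thread id to the id of the thread holding the lock it wants. */
    public static Map<Integer, Integer> parse(List<String> lines, Map<Integer, String> names) {
        Map<Integer, Integer> waitsFor = new HashMap<>();
        int current = -1;
        for (String line : lines) {
            Matcher header = HEADER.matcher(line);
            if (header.find()) {
                current = Integer.parseInt(header.group(2));
                names.put(current, header.group(1));
                continue;
            }
            Matcher waiting = WAITING.matcher(line);
            if (current != -1 && waiting.find()) {
                waitsFor.put(current, Integer.parseInt(waiting.group(2)));
            }
        }
        return waitsFor;
    }

    /** Follows wait-for edges from start; returns the cycle reached, or an empty list. */
    public static List<Integer> findCycle(Map<Integer, Integer> waitsFor, int start) {
        List<Integer> path = new ArrayList<>();
        Integer tid = start;
        while (tid != null && !path.contains(tid)) {
            path.add(tid);
            tid = waitsFor.get(tid);
        }
        if (tid == null) {
            return new ArrayList<>(); // chain ended at a thread that is not waiting
        }
        return new ArrayList<>(path.subList(path.indexOf(tid), path.size()));
    }

    public static void main(String[] args) {
        List<String> trace = List.of(
                "\"main\" prio=5 tid=1 Blocked",
                "- waiting to lock <0x033008f9> (a com.android.server.am.ActivityManagerService) held by thread 86",
                "\"Binder:729_7\" prio=5 tid=86 Blocked",
                "- waiting to lock <0x09464241> (a com.android.server.wm.WindowHashMap) held by thread 20",
                "\"android.display\" prio=5 tid=20 Blocked",
                "- waiting to lock <0x0a6033e6> (a java.lang.Object) held by thread 13",
                "\"android.ui\" prio=5 tid=13 Blocked",
                "- waiting to lock <0x09464241> (a com.android.server.wm.WindowHashMap) held by thread 20");
        Map<Integer, String> names = new HashMap<>();
        Map<Integer, Integer> waitsFor = parse(trace, names);
        for (int tid : findCycle(waitsFor, 1)) {
            System.out.println(names.get(tid) + " (tid=" + tid + ") waits for tid=" + waitsFor.get(tid));
        }
    }
}
```

Starting from the main thread (tid 1), the walk goes 1, 86, 20, 13 and then returns to 20, so the reported cycle consists of android.display (tid 20) and android.ui (tid 13). The main and Binder threads are victims of the cycle rather than part of it.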
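Fix approach: the usual way to resolve a deadlock like this is to go through the source and remove the locking from one of the methods involved, so that the cycle is broken.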
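To see why dropping one lock is enough, here is a minimal desktop-JVM sketch of the same two-lock pattern. It is not Android code: `BootDeadlockSketch`, `windowMap` and `policyLock` are hypothetical stand-ins for the real objects, and `java.lang.management` is assumed to be available (it is on a standard JVM, but this sketch is not meant to run on the device). A latch forces both threads to hold their first lock before either asks for the second, so the outcome does not depend on timing.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical reproduction of the lock cycle; names are illustrative only.
public class BootDeadlockSketch {

    /**
     * Runs the two-thread pattern once and reports whether the JVM sees these
     * threads in a monitor deadlock. Pass true to mimic the locked getter from
     * Android P, false to mimic a getter that takes no lock.
     */
    public static boolean deadlocks(boolean getterTakesLock) {
        final Object windowMap = new Object();  // stands in for mWindowMap
        final Object policyLock = new Object(); // stands in for mLock
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        Thread display = new Thread(() -> {
            synchronized (windowMap) {
                arriveAndAwait(bothHoldFirstLock);
                if (getterTakesLock) {
                    synchronized (policyLock) {
                        // the flag would be read here
                    }
                }
            }
        }, "sketch.display");

        Thread ui = new Thread(() -> {
            synchronized (policyLock) {
                arriveAndAwait(bothHoldFirstLock);
                synchronized (windowMap) {
                    // the wait for drawn windows would be set up here
                }
            }
        }, "sketch.ui");

        // Daemon threads, so a deadlocked pair does not keep the JVM alive.
        display.setDaemon(true);
        ui.setDaemon(true);
        display.start();
        ui.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + 2000;
        while (System.currentTimeMillis() < deadline) {
            long[] ids = mx.findMonitorDeadlockedThreads();
            if (ids != null) {
                for (long id : ids) {
                    if (id == display.getId()) {
                        return true; // this run's threads are in a lock cycle
                    }
                }
            }
            if (!display.isAlive() && !ui.isAlive()) {
                return false; // both threads finished, so there was no deadlock
            }
            pause(20);
        }
        return false;
    }

    private static void arriveAndAwait(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("locked getter deadlocks: " + deadlocks(true));
        System.out.println("lock-free getter deadlocks: " + deadlocks(false));
    }
}
```

With the locked getter the two threads should end up waiting on each other, exactly like android.display and android.ui in the trace. With the lock-free getter the "display" thread releases its lock without ever asking for the second one, so both threads finish.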
2.3 Source investigation
From the analysis above, the circular dependency arises between canDismissBootAnimation() and waitForAllWindowsDrawn(), so I looked at these two methods directly.
/frameworks/base/services/core/java/com/android/server/policy/PhoneWindowManager.java
    @Override
    public boolean canDismissBootAnimation() {
        synchronized (mLock) {
            return mKeyguardDrawComplete;
        }
    }
/frameworks/base/services/core/java/com/android/server/wm/WindowManagerService.java
    @Override
    public void waitForAllWindowsDrawn(Runnable callback, long timeout) {
        boolean allWindowsDrawn = false;
        synchronized (mWindowMap) {
            mWaitingForDrawnCallback = callback;
            getDefaultDisplayContentLocked().waitForAllWindowsDrawn();
            mWindowPlacerLocked.requestTraversal();
            mH.removeMessages(H.WAITING_FOR_DRAWN_TIMEOUT);
            if (mWaitingForDrawn.isEmpty()) {
                allWindowsDrawn = true;
            } else {
                mH.sendEmptyMessageDelayed(H.WAITING_FOR_DRAWN_TIMEOUT, timeout);
                checkDrawnWindowsLocked();
            }
        }
        if (allWindowsDrawn) {
            callback.run();
        }
    }
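At first glance, mLock in canDismissBootAnimation() only guards the read of the mKeyguardDrawComplete field, whereas mWindowMap in waitForAllWindowsDrawn() protects a much larger block of code that has to run atomically.
The fix is simple: remove the lock around the read of mKeyguardDrawComplete.
2.4 Code after the fix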
/frameworks/base/services/core/java/com/android/server/policy/PhoneWindowManager.java
    @Override
    public boolean canDismissBootAnimation() {
        return mKeyguardDrawComplete;
    }
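One thing to keep in mind with this change: in Java, a `synchronized` block provides memory visibility as well as mutual exclusion, so a plain field read outside the lock is not guaranteed to see the latest write from another thread. A conservative option is to declare the flag `volatile`. The sketch below shows that idea in isolation. It is an assumption on my part rather than part of the patch above: `KeyguardDrawState` and `setKeyguardDrawComplete()` are hypothetical names, and how the real field is declared and written in PhoneWindowManager is not shown in this article.

```java
// Hypothetical stand-alone model of the flag; not the real PhoneWindowManager.
public class KeyguardDrawState {
    private final Object mLock = new Object();

    // volatile is an assumption of this sketch: it keeps writes visible to
    // readers that no longer synchronize on mLock.
    private volatile boolean mKeyguardDrawComplete;

    /** Writers can still update the flag under mLock together with other state. */
    public void setKeyguardDrawComplete(boolean complete) {
        synchronized (mLock) {
            mKeyguardDrawComplete = complete;
        }
    }

    /** Lock-free read, so this method can never be part of a lock cycle. */
    public boolean canDismissBootAnimation() {
        return mKeyguardDrawComplete;
    }

    public static void main(String[] args) throws InterruptedException {
        KeyguardDrawState state = new KeyguardDrawState();
        state.setKeyguardDrawComplete(true);
        synchronized (state.mLock) {
            // Another thread can still read the flag while mLock is held here.
            Thread reader = new Thread(() ->
                    System.out.println("canDismiss=" + state.canDismissBootAnimation()));
            reader.start();
            reader.join(1000);
        }
    }
}
```

The reader thread completes even though the main thread holds mLock the whole time, which is the property the fix relies on.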
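3. Retrospective
At first I was surprised that Google's stock code had a deadlock at all, and I wondered whether I could submit a bug report and a fix to Google. Imagine if they accepted it; that would be awesome.
Our project is based on Android P, so I quickly checked whether Google had already fixed this code in Android Q.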
    @Override
    public boolean canDismissBootAnimation() {
        return mDefaultDisplayPolicy.isKeyguardDrawComplete();
    }
    # frameworks/base/services/core/java/com/android/server/wm/DisplayPolicy.java
    public boolean isKeyguardDrawComplete() {
        return mKeyguardDrawComplete;
    }
Well, I was too young, too simple. It turns out this Android P bug had already been fixed in Android Q, and the fix is the same one I arrived at during my own investigation, which was still rather pleasing. So I submitted the change to our project without any trouble.