一次trouble-shooting
最近在查看應(yīng)用的線上日志統(tǒng)計時百匆,發(fā)現(xiàn)一個 MessageQueue.nativePollOnce()
的記錄,具體信息如下:
at android.os.MessageQueue.nativePollOnce(Native method)
at android.os.MessageQueue.next(MessageQueue.java:325)
at android.os.Looper.loop(Looper.java:142)
at android.os.HandlerThread.run(HandlerThread.java:65)
因為是java層的異常收集瞧毙,stack trace到這里就結(jié)束了胧华,很明顯問題實際上是在native層,所以只能通過traces.txt查看更多native層的信息了宙彪。
但是又拿不到用戶的traces.txt矩动,只好嘗試拿自己的手機(jī)來看看,打開終端释漆,通過以下命令獲取traces.txt:
adb pull /data/anr/traces.txt C:\Users\wsj\Downloads\traces.txt
結(jié)果發(fā)現(xiàn)其中鋪天蓋地的都是nativePollOnce的日志悲没,各種應(yīng)用的不同線程都有報,摘錄一段淘寶的日志:
// 顯示進(jìn)程id、ANR發(fā)生時間點示姿、ANR發(fā)生進(jìn)程包名
----- pid 32617 at 2018-08-17 19:55:27 -----
Cmd line: com.taobao.taobao
// 一些設(shè)備甜橱,內(nèi)存,GC等信息栈戳,通称癜粒可以忽略
...
// 發(fā)生ANR的堆棧信息,重要部分
DALVIK THREADS (196):
// 其他一些錯誤信息
...
"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x73d92550 self=0xebc74000
| sysTid=32617 nice=-10 cgrp=default sched=0/0 handle=0xf01ae4bc
| state=S schedstat=( 6367184163 1210771881 20365 ) utm=458 stm=178 core=5 HZ=100
| stack=0xff4ae000-0xff4b0000 stackSize=8MB
| held mutexes=
kernel: (couldn't read /proc/self/task/32617/stack)
native: #00 pc 00019178 /system/lib/libc.so (syscall+28)
native: #01 pc 000b3e35 /system/lib/libart.so (_ZN3art17ConditionVariable16WaitHoldingLocksEPNS_6ThreadE+88)
native: #02 pc 0026302f /system/lib/libart.so (_ZN3art3JNI16CallObjectMethodEP7_JNIEnvP8_jobjectP10_jmethodIDz+358)
native: #03 pc 00002b57 /system/lib/libnativehelper.so (jniGetReferent+90)
native: #04 pc 000c8b09 /system/lib/libandroid_runtime.so (_ZN7android26NativeDisplayEventReceiver13dispatchVsyncExij+20)
native: #05 pc 00032555 /system/lib/libandroidfw.so (_ZN7android22DisplayEventDispatcher11handleEventEiiPv+104)
native: #06 pc 000105cd /system/lib/libutils.so (_ZN7android6Looper9pollInnerEi+572)
native: #07 pc 000102f9 /system/lib/libutils.so (_ZN7android6Looper8pollOnceEiPiS1_PPv+32)
native: #08 pc 000e47dd /system/lib/libandroid_runtime.so (_ZN7android18NativeMessageQueue8pollOnceEP7_JNIEnvP8_jobjecti+24)
native: #09 pc 001a8df5 /system/framework/arm/boot-framework.oat (Java_android_os_MessageQueue_nativePollOnce__JI+92)
at android.os.MessageQueue.nativePollOnce(Native method)
at android.os.MessageQueue.next(MessageQueue.java:325)
at android.os.Looper.loop(Looper.java:142)
at android.app.ActivityThread.main(ActivityThread.java:6938)
at java.lang.reflect.Method.invoke(Native method)
at com.android.internal.os.Zygote$MethodAndArgsCaller.run(Zygote.java:327)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1374)
"com.taobao.taobao_mytaobao_preferences.sp" prio=5 tid=66 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x148ddde8 self=0xb9c7da00
| sysTid=2852 nice=0 cgrp=default sched=0/0 handle=0xbce0f970
| state=S schedstat=( 999426 52084 4 ) utm=0 stm=0 core=5 HZ=100
| stack=0xbcd0d000-0xbcd0f000 stackSize=1038KB
| held mutexes=
kernel: (couldn't read /proc/self/task/2852/stack)
native: #00 pc 0004a4d0 /system/lib/libc.so (__epoll_pwait+20)
native: #01 pc 0001bcd5 /system/lib/libc.so (epoll_pwait+60)
native: #02 pc 0001bd05 /system/lib/libc.so (epoll_wait+12)
native: #03 pc 00010407 /system/lib/libutils.so (_ZN7android6Looper9pollInnerEi+118)
native: #04 pc 000102f9 /system/lib/libutils.so (_ZN7android6Looper8pollOnceEiPiS1_PPv+32)
native: #05 pc 000e47dd /system/lib/libandroid_runtime.so (_ZN7android18NativeMessageQueue8pollOnceEP7_JNIEnvP8_jobjecti+24)
native: #06 pc 001a8df5 /system/framework/arm/boot-framework.oat (Java_android_os_MessageQueue_nativePollOnce__JI+92)
at android.os.MessageQueue.nativePollOnce(Native method)
at android.os.MessageQueue.next(MessageQueue.java:325)
at android.os.Looper.loop(Looper.java:142)
at android.os.HandlerThread.run(HandlerThread.java:65)
// 其他一些錯誤信息
...
----- end 32617 -----
以上是我手機(jī)中traces.txt記錄的淘寶的ANR日志摘錄子檀,我只摘出了兩條nativePollOnce
相關(guān)的镊掖,第一個是發(fā)生在 main
線程,第二個是發(fā)生在 com.taobao.taobao_mytaobao_preferences.sp
線程褂痰。來看看第二個亩进,即便是native層的部分似乎也沒有什么錯誤信息。直接看第一行native信息:
native: #00 pc 0004a4d0 /system/lib/libc.so (__epoll_pwait+20)
epoll_wait其實是native層用于實現(xiàn)等待的缩歪,在管道上等待消息寫入归薛,一有消息到來時立馬從管道中讀取出來并返回結(jié)果,這會在后面的 消息循環(huán) 中講到匪蝙。這么看來主籍,這段日志也只是說明:
目標(biāo)線程在等待下一條消息的到來。
既然這樣逛球,只好先找谷歌了崇猫,能搜到的內(nèi)容不多,stackoverflow上面有一個解釋:
Short answer:
The nativePollOnce method is used to "wait" till the next Message becomes available. If the time spent during this call is long, your main (UI) thread has no real work to do and waits for next events to process. There's no need to worry about that.
Explanation:
Because the "main" thread is responsible for drawing UI and handling various events, it's Runnable has a loop which processes all these events. The loop is managed by a Looper and its job is quite straightforward: it processes all Messages in the MessageQueue.
A Message is added to the queue for example in response to input events, as frame rendering callback or even your own Handler.post calls. Sometimes the main thread has no work to do (that is, no messages in the queue), which may happen e.g. just after finishing rendering single frame (the thread has just drawn one frame and is ready for the next one, just waits for a proper time). Two Java methods in the MessageQueue class are interesting to us: Message next() and boolean enqueueMessage(Message, long). Message next(), as its name suggest, takes and returns the next Message from the queue. If the queue is empty (and there's nothing to return), the method calls native void nativePollOnce(long, int) which blocks until a new message is added. At this point you might ask how does nativePollOnce know when to wake up. That's a very good question. When a Message is added to the queue, the framework calls the enqueueMessage method, which not only inserts the message into the queue, but also calls native static void nativeWake(long), if there's need to wake up the queue. The core magic of nativePollOnce and nativeWake happens in the native (actually, C++) code. Native MessageQueue utilizes a Linux system call named epoll, which allows to monitor a file descriptor for IO events. nativePollOnce calls epoll_wait on a certain file descriptor, whereas nativeWake writes to the descriptor, which is one of the IO operations, epoll_wait waits for. The kernel then takes out the epoll-waiting thread from the waiting state and the thread proceeds with handling the new message. If you're familiar with Java's Object.wait() and Object.notify() methods, you can imagine that nativePollOnce is a rough equivalent for Object.wait() and nativeWake for Object.notify(), except they're implemented completely differently: nativePollOnce uses epoll and Object.wait() uses futex Linux call. It's worth noticing that neither nativePollOnce nor Object.wait() waste CPU cycles, as when a thread enters either method, it becomes disabled for thread scheduling purposes (quoting the javadoc for the Object class). However, some profilers may mistakenly recognize epoll-waiting (or even Object-waiting) threads as running and consuming CPU time, which is incorrect. If those methods actually wasted CPU cycles, all idle apps would use 100% of the CPU, heating and slowing down the device.
Conclusion:
You shouldn't worry about nativePollOnce. It just indicates that processing of all Messages has been finished and the thread waits for the next one. Well, that simply means you don't give too much work to your main thread ;)
android - what is message queue native poll once in android?
似乎不需要太關(guān)注這個問題需忿,但又不是很放心,索性再看下 MessageQueue.next()
的源碼蜡歹,了解一下這個 nativePollOnce()
究竟是干嘛的屋厘,查了一番,大概是這樣:
Java層和Native層其實各有一個MessageQueue月而,Java層的MessageQueue實際上是從Native層去獲取其中的消息汗洒。而next方法顯然就是用于獲取下一條消息的,其中主要通過nativePollOnce從native層的MessageQueue中獲取父款,并且該方法會阻塞線程溢谤,如果獲取不到消息(比如消息隊列中是空的),就一直阻塞憨攒,直到獲取到消息為止世杀。
可還是沒有確切的證據(jù)證明這個異常信息不會導(dǎo)致ANR。只知道在系統(tǒng)的 traces.txt
中肝集,有無數(shù)條 nativePollOnce
的異常信息瞻坝,淘寶、微信等等各大應(yīng)用都有杏瞻,可平時使用時很少遇到真的崩潰的情況所刀。
后來發(fā)現(xiàn)衙荐,使用Android Studio開發(fā)時,debug模式下的 Thread Dump
浮创,其dump的thread信息中忧吟,居然能看到一些 nativePollOnce
的異常堆棧,發(fā)現(xiàn)以下幾個點:
-
main Thread
,LeakCanary-Heap-Dump thread
等都有報出該問題斩披,但我的應(yīng)用運行很正常 - 所有報了
nativePollOnce
的線程溜族,都處于runnable
狀態(tài)
似乎很大程度上可以證明前面 stackoverflow 上說的沒錯。
本次調(diào)查的結(jié)論
nativePollOnce
確實沒有導(dǎo)致ANR雏掠,也不會過度消耗cpu cycle斩祭,它只是說明當(dāng)前handlerthread中沒有需要處理的message,線程在等待下一條message的到來乡话,之所以會收集到這種異常堆棧摧玫,可能是系統(tǒng)錯誤的認(rèn)為這個等待時間過長,將其列入了ANR绑青。
還有一個推測诬像,以前調(diào)查過一個線上日志報的
listFiles()
采集到navite層crash的問題,非utf-8編碼的文件名闸婴,會導(dǎo)致listFiles在native層崩潰坏挠,本地也確實復(fù)現(xiàn)了該問題,調(diào)查后確定邪乍,這個問題只會在Manifest文件
中Application
下的debuggable=true
時發(fā)生降狠,當(dāng)其值為false
時,不會crash庇楞,只是listFiles()
接口返回的那個有問題的文件名不一定能準(zhǔn)確識別出來榜配,我想Google做這種機(jī)制的目的在于,開發(fā)時通過crash提示開發(fā)者有非utf-8的文件名存在吕晌,你應(yīng)該檢查是否代碼有bug蛋褥,或者考慮出現(xiàn)這種異常文件名是否正常,如果你認(rèn)為你的業(yè)務(wù)需求需要對應(yīng)這種case睛驳,并且你能接受api返回的結(jié)果中無法準(zhǔn)確識別完整的文件名烙心,那么當(dāng)你的應(yīng)用上線后(debuggable=false),系統(tǒng)不會上報該異常乏沸。雖然IDE在出release版本的apk時淫茵,會強制將
debuggable
設(shè)置為true,但有些應(yīng)用的發(fā)包流程可能使用的腳本編譯打包蹬跃,有可能忘記將開發(fā)模式下的true
改為false
痘昌,我解包看了一些線上的應(yīng)用,發(fā)現(xiàn)有不少應(yīng)用就是設(shè)的true
。所以猜測這個
nativePollOnce
的異常同樣只會在debuggable=true
時統(tǒng)計到辆苔,但我沒有這種環(huán)境做驗證算灸,來采集線上日志,所以驻啤,如果你的應(yīng)用菲驴,采集到線上日志有nativePollOnce
的異常,又想要解決的話骑冗,可以先確認(rèn)一下你們的線上應(yīng)用是否設(shè)置了debuggable=true
赊瞬,如果是的話,可以優(yōu)化出包流程贼涩,將其改為false
巧涧,然后再統(tǒng)計一波線上日志,看還會不會收到nativePollOnce
的異常遥倦,如果有人驗證過了谤绳,也請告知我一下結(jié)果,我也想知道這個推測到底對不對袒哥,不勝感激缩筛。
這個問題的調(diào)查中,明白MessageQueue的工作原理也是很重要的堡称。接下來的內(nèi)容轉(zhuǎn)載自 Android應(yīng)用程序消息處理機(jī)制 瞎抛,對于MessageQueue講的非常簡單明了。
Android消息處理機(jī)制概述
Android的消息處理機(jī)制主要分為四個部分:
- 創(chuàng)建消息隊列
- 消息循環(huán)
- 消息發(fā)送
- 消息處理
主要涉及三個類:
- MessageQueue
- Looper
- Handler
創(chuàng)建消息隊列
整個創(chuàng)建過程涉及到兩個類:MessageQueue
和 Looper
却紧。它們在C++層有兩個對應(yīng)的類:NativeMessageQueue
和 Looper
桐臊。其關(guān)系如下圖所示:
+------------+ +------+
|MessageQueue+----^+Looper|
+-----+------+ +------+
|
|
|
+-----------+------+ +------+
|NativeMessageQueue+^----+Looper|
+------------------+ +------+
A----^B表示B中保存A的引用
創(chuàng)建過程如下所示:
- Looper的prepare或者prepareMainLooper靜態(tài)方法被調(diào)用,將一個Looper對象保存在ThreadLocal里面晓殊。
- Looper對象的初始化方法里豪硅,首先會新建一個MessageQueue對象。
- MessageQueue對象的初始化方法通過JNI初始化C++層的NativeMessageQueue對象挺物。
- NativeMessageQueue對象在創(chuàng)建過程中,會初始化一個C++層的Looper對象飘弧。
- C++層的Looper對象在創(chuàng)建的過程中识藤,會在內(nèi)部創(chuàng)建一個管道(pipe),并將這個管道的讀寫fd都保存在mWakeReadPipeFd和mWakeWritePipeFd中次伶。
然后新建一個epoll實例痴昧,并將兩個fd注冊進(jìn)去。 - 利用epoll的機(jī)制冠王,可以做到當(dāng)管道沒有消息時赶撰,線程睡眠在讀端的fd上,當(dāng)其他線程往管道寫數(shù)據(jù)時,本線程便會被喚醒以進(jìn)行消息處理豪娜。
消息循環(huán)
+------+ +------------+ +------------------+ +--------------+
|Looper| |MessageQueue| |NativeMessageQueue| |Looper(Native)|
+--+---+ +------+-----+ +---------+--------+ +-------+------+
| | | |
| | | |
+-------------------------------------------------------------------------------+
|[msg loop] | next() | | | |
| +------------> | | | |
| | | | | |
| | | | | |
| | | nativePollOnce() | | |
| | | pollOnce() | | |
| | +----------------> | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | pollOnce() | |
| | | +-----------------> | |
| | | | | |
| | | | | epoll_wait()
| | | | +--------+ |
| | | | | | |
| | | | | | |
| | | | | <------+ |
| | | | | awoken() |
| + + + + |
| |
| |
+-------------------------------------------------------------------------------+
- 首先通過調(diào)用Looper的loop方法開始消息監(jiān)聽餐胀。loop方法里會調(diào)用MessageQueue的next方法。next方法會堵塞線程直到有消息到來為止瘤载。
- next方法通過調(diào)用nativePollOnce方法來監(jiān)聽事件否灾。next方法內(nèi)部邏輯如下所示(簡化):
- 進(jìn)入死循環(huán),以參數(shù)timout=0調(diào)用nativePollOnce方法鸣奔。
- 如果消息隊列中有消息墨技,nativePollOnce方法會將消息保存在mMessage成員中。nativePollOnce方法返回后立刻檢查mMessage成員是否為空挎狸。
- 如果mMessage不為空龄毡,那么檢查它指定的運行時間跋核。如果比當(dāng)前時間要前,那么馬上返回這個mMessage,否則設(shè)置timeout為兩者之差搞坝,進(jìn)入下一次循環(huán)。
- 如果mMessage為空表牢,那么設(shè)置timeout為-1污抬,即下次循環(huán)nativePollOnce永久堵塞。
- nativePollOnce方法內(nèi)部利用epoll機(jī)制在之前建立的管道上等待數(shù)據(jù)寫入侥蒙。接收到數(shù)據(jù)后馬上讀取并返回結(jié)果暗膜。
消息發(fā)送
+-------+ +------------+ +------------------+ +--------------+
|Handler| |MessageQueue| |NativeMessageQueue| |Looper(Native)|
+--+----+ +-----+------+ +---------+--------+ +-------+------+
| | | |
| | | |
sendMessage()| | | |
+----------> | | | |
| | | |
|enqueueMessage()| | |
+--------------> | | |
| | | |
| | | |
| | | |
| | nativeWake() | |
| | wake() | |
| +------------------> | |
| | | |
| | | wake() |
| | +------------------> |
| | | |
| | | |
| | | |write(mWakeWritePipeFd, "W", 1)
| | | |
| | | |
| | | |
| | | |
| | | |
+ + + +
消息發(fā)送過程主要由 Handler
對象來驅(qū)動。
- Handler對象在創(chuàng)建時會保存當(dāng)前線程的looper和MessageQueue鞭衩,如果傳入Callback的話也會保存起來学搜。
- 用戶調(diào)用handler對象的sendMessage方法,傳入msg對象论衍。handler通過調(diào)用MessageQueue的enqueueMessage方法將消息壓入MessageQueue瑞佩。
- enqueueMessage方法會將傳入的消息對象根據(jù)觸發(fā)時間(when)插入到message queue中。然后判斷是否要喚醒等待中的隊列坯台。
- 如果插在隊列中間炬丸。說明該消息不需要馬上處理,不需要由這個消息來喚醒隊列蜒蕾。
- 如果插在隊列頭部(或者when=0)稠炬,則表明要馬上處理這個消息。如果當(dāng)前隊列正在堵塞咪啡,則需要喚醒它進(jìn)行處理首启。
- 如果需要喚醒隊列,則通過nativeWake方法撤摸,往前面提到的管道中寫入一個"W"字符毅桃,令nativePollOnce方法返回褒纲。
消息處理
+------+ +-------+
|Looper| |Handler|
+--+---+ +---+---+
| |
| |
loop() | |
[after next()] |
+---------> | |
| |
|dispatchMessage()
+-------------> |
| |
| |
| | handleMessage()
| +-------+
| | |
| | |
| | <-----+
| | (callback or subclass)
| |
+ +
Looper對象的loop方法里面的queue.next方法如果返回了message,那么handler的dispatchMessage會被調(diào)用钥飞。
- 如果新建Handler的時候傳入了callback實例莺掠,那么callback的handleMessage方法會被調(diào)用。
- 如果是通過post方法向handler傳入runnable對象的代承,那么runnable對象的run方法會被調(diào)用汁蝶。
- 其他情況下,handler方法的handleMessage會被調(diào)用论悴。