Every time an ANR occurs, Android writes a traces.txt file under /data/anr/. The file records virtual machine information and the stack of every thread in the affected process, so it shows what each thread was doing at the time and lets us work back to the cause of the ANR. Its generation is closely tied to the Signal Catcher thread: every child process forked from zygote has one. Running "ps -t <pid>" in a shell lists all threads of the process with that pid, as shown below:
USER PID PPID VSIZE RSS WCHAN PC NAME
system 2953 2646 2784184 223904 SyS_epoll_ 7f92d20520 S system_server
system 2958 2953 2784184 223904 do_sigtime 7f92d20700 S Signal Catcher
system 2960 2953 2784184 223904 futex_wait 7f92cd3f20 S ReferenceQueueD
system 2961 2953 2784184 223904 futex_wait 7f92cd3f20 S FinalizerDaemon
system 2962 2953 2784184 223904 futex_wait 7f92cd3f20 S FinalizerWatchd
system 2963 2953 2784184 223904 futex_wait 7f92cd3f20 S HeapTaskDaemon
system 2970 2953 2784184 223904 binder_thr 7f92d20610 S Binder_1
system 2972 2953 2784184 223904 binder_thr 7f92d20610 S Binder_2
system 2985 2953 2784184 223904 SyS_epoll_ 7f92d20520 S android.bg
system 2986 2953 2784184 223904 SyS_epoll_ 7f92d20520 S ActivityManager
system 2987 2953 2784184 223904 SyS_epoll_ 7f92d20520 S android.ui
system 2988 2953 2784184 223904 SyS_epoll_ 7f92d20520 S android.fg
system 2989 2953 2784184 223904 inotify_re 7f92d20fe8 S FileObserver
system 2990 2953 2784184 223904 SyS_epoll_ 7f92d20520 S android.io
system 2991 2953 2784184 223904 SyS_epoll_ 7f92d20520 S android.display
system 2992 2953 2784184 223904 futex_wait 7f92cd3f20 S CpuTracker
system 2993 2953 2784184 223904 SyS_epoll_ 7f92d20520 S PowerManagerSer
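The listing above shows the threads of system_server; thread 2958 is the "Signal Catcher" thread. A signal here is what the kernel delivers to a process when something happens to it, and the Signal Catcher thread is the place where such signals are handled in user space.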
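A Linux software interrupt signal (a signal, for short) is how the system notifies a process that an asynchronous event has occurred. It is a software-level imitation of the interrupt mechanism: in principle, a process receiving a signal is much like a processor receiving an interrupt request. Signals are the only asynchronous mechanism among the inter-process communication mechanisms. A process does not have to do anything to wait for a signal, and in fact it does not know when one will arrive. Processes can send signals to each other through the kill system call, and the kernel can also send a signal to a process because of an internal event. Besides plain notification, a signal can carry additional information.
Linux gives every signal a default action, so a process that does not care about a signal can rely on the default behaviour. A process that does care about a particular signal, for example SIGSEGV (sent by the system to the offending process, typically on a null-pointer dereference or an out-of-bounds memory access), has to supply its own signal handler to override the default. This is an important debugging tool, because a segmentation fault is unpredictable: it can happen anywhere, so application code cannot handle it like an ordinary exception, and locating the problem depends on the signal mechanism.
The application does not know when a segmentation fault happens, but the kernel does. When the kernel detects an access to an illegal address, it sends SIGSEGV to the process. As the process returns from kernel space to user space, it checks whether any signals are pending; if a handler has been registered, that function is called at this point. The handler can then do useful work, such as dumping the stacks of the current process or collecting system-wide information (memory, IO, CPU), all of which is very valuable when analysing the problem.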
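To make the idea of overriding a default action concrete, here is a minimal sketch that registers a handler with sigaction. It uses SIGUSR1 instead of SIGSEGV because returning from a real segmentation fault is not safe in a demo; RecordSignal, InstallHandler and DeliverAndObserve are names made up for this illustration.

```cpp
#include <signal.h>
#include <cstring>

// Written by the handler. sig_atomic_t is the type that is safe to write
// from an asynchronous signal handler.
static volatile sig_atomic_t g_last_signal = 0;

// Hypothetical handler: only records which signal arrived.
extern "C" void RecordSignal(int signo) {
  g_last_signal = signo;
}

// Replaces the default action for `signo` with RecordSignal.
// Returns true on success.
bool InstallHandler(int signo) {
  struct sigaction action;
  std::memset(&action, 0, sizeof(action));
  action.sa_handler = RecordSignal;
  sigemptyset(&action.sa_mask);
  return sigaction(signo, &action, nullptr) == 0;
}

// Installs the handler for SIGUSR1, sends that signal to the calling thread
// and returns the signal number the handler saw (-1 if installation failed).
int DeliverAndObserve() {
  g_last_signal = 0;
  if (!InstallHandler(SIGUSR1)) {
    return -1;
  }
  raise(SIGUSR1);  // the handler has finished by the time raise() returns
  return g_last_signal;
}
```

Without the call to InstallHandler, the default action for SIGUSR1 would terminate the process, which is the "default system behaviour" the paragraph refers to.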
Back to the topic. The Signal Catcher thread is created by the Android Runtime. When a new application process is started, system_server talks to zygote over a socket and zygote creates the child process. On Linux a process is normally created with fork, and Android is no exception. Once the child of zygote is up it has a main thread, and that main thread calls DidForkFromZygote@Runtime.cc, which in turn calls StartSignalCatcher@Runtime.cc. That function creates a SignalCatcher object, and this is where the Signal Catcher thread originates.
void Runtime::StartSignalCatcher() {
  if (!is_zygote_) {
    signal_catcher_ = new SignalCatcher(stack_trace_file_);
  }
}
The SignalCatcher constructor calls pthread_create to create an ordinary Linux thread. Android is, after all, a Linux-based system, and ART reuses the Linux notion of a thread directly: Linux threading is mature, so there is no reason for ART to reinvent the wheel and implement its own threading in user space. pthread_create is the thread-creation function on Unix-like systems (Unix, Linux, Mac OS X and so on). Its prototype is:
int pthread_create(pthread_t *tidp, const pthread_attr_t *attr, void *(*start_rtn)(void *), void *arg);
tidp points to the location that receives the thread identifier, attr sets the thread attributes, start_rtn is the function the thread starts running, and arg is the argument passed to start_rtn. The SignalCatcher constructor makes the call like this:
CHECK_PTHREAD_CALL(pthread_create, (&pthread_, nullptr,&Run, this), "signal catcher thread");
CHECK_PTHREAD_CALL is a macro that ends up calling pthread_create to start a new Linux thread. From the arguments we can see that the new thread executes Run@SignalCatcher.cc and that the this pointer, that is, the SignalCatcher object being constructed, is passed to Run as its argument. Here is the implementation of Run:
void* SignalCatcher::Run(void* arg) {
  ......
  Runtime* runtime = Runtime::Current();
  // Attach this Linux thread to the runtime so that it can call JNI functions.
  CHECK(runtime->AttachCurrentThread("Signal Catcher", true, runtime->GetSystemThreadGroup(),
                                     ......));
  Thread* self = Thread::Current();
  ......
  // Set up mask with signals we want to handle.
  SignalSet signals;
  signals.Add(SIGQUIT);  // listen for SIGQUIT
  signals.Add(SIGUSR1);
  while (true) {
    // Wait for a signal to be delivered to the process.
    int signal_number = signal_catcher->WaitForSignal(self, signals);
    if (signal_catcher->ShouldHalt()) {
      runtime->DetachCurrentThread();
      return nullptr;
    }
    switch (signal_number) {
      case SIGQUIT:
        signal_catcher->HandleSigQuit();  // handle SIGQUIT
        break;
      ......
      default:
        LOG(ERROR) << "Unexpected signal %d" << signal_number;
        break;
    }
  }
}
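The hand-off from the constructor to Run, where pthread_create receives a static entry point plus the this pointer, can be sketched as follows. Worker and RunWorkerOnce are hypothetical names used only for this illustration; the real SignalCatcher does considerably more.

```cpp
#include <pthread.h>

// Hypothetical stand-in for SignalCatcher: the constructor starts a thread
// and hands it `this`; the destructor joins the thread.
class Worker {
 public:
  Worker() : ran_(false), started_(false) {
    // Argument order as in the prototype: thread id, attributes,
    // entry point, argument passed to the entry point.
    started_ = pthread_create(&pthread_, nullptr, &Run, this) == 0;
  }

  ~Worker() { Join(); }

  // Waits for the thread to finish; safe to call more than once.
  void Join() {
    if (started_) {
      pthread_join(pthread_, nullptr);
      started_ = false;
    }
  }

  bool ran() const { return ran_; }

 private:
  // A pthread entry point must have the type void* (*)(void*), so the
  // object pointer travels through the void* argument.
  static void* Run(void* arg) {
    Worker* self = static_cast<Worker*>(arg);
    self->ran_ = true;
    return nullptr;
  }

  pthread_t pthread_;
  bool ran_;
  bool started_;
};

// Creates a worker, waits for its thread and reports whether Run() executed.
bool RunWorkerOnce() {
  Worker worker;
  worker.Join();
  return worker.ran();
}
```

A static member function is used because a non-static member function has a hidden this parameter and therefore cannot be passed as a plain C function pointer.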
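SignalCatcher::Run first calls runtime->AttachCurrentThread to attach the current thread, then sets up the signals it wants to handle, and finally enters an infinite loop that waits for a signal. When a signal is delivered to the virtual machine process, the corresponding handling code runs. This article only looks at how SIGQUIT is handled; the sections below go through these steps one at a time.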
- AttachCurrentThread
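This step is done by calling Runtime's AttachCurrentThread function, which simply calls Attach on the Thread class. The Thread class may look as if it creates a thread, but it does not: on Android a thread is created through pthread_create, and Thread is just a class inside the Android Runtime. Once a Thread object is created it is stored in the thread's TLS area, so each attached Linux thread has exactly one Thread object, and Thread::Current() returns the object associated with the calling thread. That object gives access to important information such as the Java thread state, the Java stack frames and the table of JNI function pointers. We say "Java" thread state and "Java" stack frames because the Android runtime has no threading mechanism of its own: every Java thread is a Linux thread underneath. A Linux thread, however, has no states such as Waiting or Blocked and no Java stack, so the Java thread state and the Java stack frames need somewhere to live, and the Thread object is that "locker". The description of how a Thread object is created, below, comes back to this.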
bool Runtime::AttachCurrentThread(const char* thread_name, bool as_daemon, jobject thread_group,
                                  bool create_peer) {
  return Thread::Attach(thread_name, as_daemon, thread_group, create_peer) != nullptr;
}

Thread* Thread::Attach(const char* thread_name, bool as_daemon, jobject thread_group,
                       bool create_peer) {
  Runtime* runtime = Runtime::Current();
  ......
  Thread* self;
  {
    MutexLock mu(nullptr, *Locks::runtime_shutdown_lock_);
    if (runtime->IsShuttingDownLocked()) {
      ......
    } else {
      Runtime::Current()->StartThreadBirth();
      self = new Thread(as_daemon);  // create a new Thread object
      // Run the Init step described below.
      bool init_success = self->Init(runtime->GetThreadList(), runtime->GetJavaVM());
      Runtime::Current()->EndThreadBirth();
      if (!init_success) {
        delete self;
        return nullptr;
      }
    }
  }
  ......
  self->InitStringEntryPoints();
  CHECK_NE(self->GetState(), kRunnable);
  self->SetState(kNative);
  ......
  return self;
}
Thread::Attach first creates a Thread object, then runs that object's Init, and finally calls self->SetState(kNative) to put the Java thread into the kNative state. SetState is the simpler of the two, so we look at it first. It sets the Java thread state:
inline ThreadState Thread::SetState(ThreadState new_state) {
  // Cannot use this code to change into Runnable as changing to Runnable should fail if
  // old_state_and_flags.suspend_request is true.
  DCHECK_NE(new_state, kRunnable);
  if (kIsDebugBuild && this != Thread::Current()) {
    std::string name;
    GetThreadName(name);
    LOG(FATAL) << "Thread \"" << name << "\"(" << this << " != Thread::Current()="
               << Thread::Current() << ") changing state to " << new_state;
  }
  union StateAndFlags old_state_and_flags;
  old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
  tls32_.state_and_flags.as_struct.state = new_state;
  return static_cast<ThreadState>(old_state_and_flags.as_struct.state);
}
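The Java thread state is stored in the Thread object, specifically in its tls32_ structure, and the state is changed by modifying that structure. The Java thread states that ART currently supports are listed below; the comment after each state gives a rough idea of when a thread enters it.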
enum ThreadState {
  //                                 Thread.State   JDWP state
  kTerminated = 66,                  // TERMINATED     TS_ZOMBIE    Thread.run has returned, but Thread* still around
  kRunnable,                         // RUNNABLE       TS_RUNNING   runnable
  kTimedWaiting,                     // TIMED_WAITING  TS_WAIT      in Object.wait() with a timeout
  kSleeping,                         // TIMED_WAITING  TS_SLEEPING  in Thread.sleep()
  kBlocked,                          // BLOCKED        TS_MONITOR   blocked on a monitor
  kWaiting,                          // WAITING        TS_WAIT      in Object.wait()
  kWaitingForGcToComplete,           // WAITING        TS_WAIT      blocked waiting for GC
  kWaitingForCheckPointsToRun,       // WAITING        TS_WAIT      GC waiting for checkpoints to run
  kWaitingPerformingGc,              // WAITING        TS_WAIT      performing GC
  kWaitingForDebuggerSend,           // WAITING        TS_WAIT      blocked waiting for events to be sent
  kWaitingForDebuggerToAttach,       // WAITING        TS_WAIT      blocked waiting for debugger to attach
  kWaitingInMainDebuggerLoop,        // WAITING        TS_WAIT      blocking/reading/processing debugger events
  kWaitingForDebuggerSuspension,     // WAITING        TS_WAIT      waiting for debugger suspend all
  kWaitingForJniOnLoad,              // WAITING        TS_WAIT      waiting for execution of dlopen and JNI on load code
  kWaitingForSignalCatcherOutput,    // WAITING        TS_WAIT      waiting for signal catcher IO to complete
  kWaitingInMainSignalCatcherLoop,   // WAITING        TS_WAIT      blocking/reading/processing signals
  kWaitingForDeoptimization,         // WAITING        TS_WAIT      waiting for deoptimization suspend all
  kWaitingForMethodTracingStart,     // WAITING        TS_WAIT      waiting for method tracing to start
  kWaitingForVisitObjects,           // WAITING        TS_WAIT      waiting for visiting objects
  kWaitingForGetObjectsAllocated,    // WAITING        TS_WAIT      waiting for getting the number of allocated objects
  kStarting,                         // NEW            TS_WAIT      native thread started, not yet ready to run managed code
  kNative,                           // RUNNABLE       TS_RUNNING   running in a JNI native method
  kSuspended,                        // RUNNABLE       TS_RUNNING   suspended by GC or debugger
};
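SetState above reads and writes a single 32-bit word that holds both the state and a set of request flags. The sketch below shows that packing idea with explicit shifts and masks instead of ART's union. The class name, the flag values and the choice of which half holds the flags are assumptions made for illustration; the three state values follow from the enum above.

```cpp
#include <cstdint>

// Hypothetical flag bits; only the two flag names come from the article.
enum SketchFlag : uint16_t { kSuspendRequest = 1, kCheckpointRequest = 2 };

// Values as they fall out of the enum above (kTerminated = 66, then +1 each).
enum SketchState : uint16_t { kRunnable = 67, kStarting = 86, kNative = 87 };

// One 32-bit word: flags in the low 16 bits, state in the high 16 bits
// (an assumption of this sketch).
class StateAndFlags {
 public:
  explicit StateAndFlags(uint16_t state)
      : word_(static_cast<uint32_t>(state) << 16) {}

  uint16_t state() const { return static_cast<uint16_t>(word_ >> 16); }
  uint16_t flags() const { return static_cast<uint16_t>(word_ & 0xFFFFu); }

  // Stores the new state, keeps the flags, and returns the previous state,
  // which is the same contract as Thread::SetState.
  uint16_t SetState(uint16_t new_state) {
    const uint16_t old_state = state();
    word_ = (static_cast<uint32_t>(new_state) << 16) | flags();
    return old_state;
  }

  void SetFlag(uint16_t flag) { word_ |= flag; }
  void ClearFlag(uint16_t flag) { word_ &= ~static_cast<uint32_t>(flag); }
  bool ReadFlag(uint16_t flag) const { return (flags() & flag) != 0; }

 private:
  uint32_t word_;
};
```

Keeping both fields in one word is what allows a runtime to test "is any request pending" and "what state am I in" with a single load.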
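Within Attach, the part that matters most is Init. Before going through it, it helps to know roughly how ART executes code. An important change from Dalvik is that ART no longer executes bytecode directly but first translates it into native machine code. This happens when an application is installed: the dex2oat process runs and produces an oat file, usually stored under /data/app/<application name>/oat/. The oat file contains the compiled machine code; "compiling" here simply means translating the methods of the Java classes in the dex file into native machine code. At run time ART finds and executes the corresponding machine code instead of interpreting bytecode, which is more efficient.
This machine code cannot stand alone. Some operations need the ART runtime, for example allocating an object on the heap or calling a JNI method, so the compiled code refers to functions inside the runtime, much as a .so library refers to external functions. In fact an oat file, like a .so file, is an ELF file, just with a few extra sections compared with a standard one. Loading an oat file therefore requires resolving those external functions. A standard .so file is usually opened with dlopen, which loads any libraries that are not yet loaded and relocates the external functions. An oat file is opened differently: to load it quickly, ART keeps a set of function pointers in the thread's TLS area, and the compiled machine code reaches the runtime by calling through those pointers. They are initialised during Thread's Init: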
void Thread::InitTlsEntryPoints() {
  // Insert a placeholder so we can easily tell if we call an unimplemented entry point.
  uintptr_t* begin = reinterpret_cast<uintptr_t*>(&tlsPtr_.interpreter_entrypoints);
  uintptr_t* end = reinterpret_cast<uintptr_t*>(
      reinterpret_cast<uint8_t*>(&tlsPtr_.quick_entrypoints) + sizeof(tlsPtr_.quick_entrypoints));
  for (uintptr_t* it = begin; it != end; ++it) {
    *it = reinterpret_cast<uintptr_t>(UnimplementedEntryPoint);
  }
  InitEntryPoints(&tlsPtr_.interpreter_entrypoints, &tlsPtr_.jni_entrypoints,
                  &tlsPtr_.quick_entrypoints);
}
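These function pointers are stored in the Thread object, and the Thread object is stored in the thread's TLS area, so native machine code can reach the pointers through TLS. Only after Attach has run is a Linux thread really associated with the virtual machine runtime: the Linux thread becomes a Java thread and gains its own Java thread state, Java stack frames and related data structures. Purely native threads cannot execute Java code. This is why, when the stacks of a process are dumped, some threads show no Java stack and only native and kernel stacks.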
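The two ideas above, a per-thread runtime record kept in TLS and a table of entry points reached through that record, can be sketched with C++ thread_local storage. This is only an illustration under simplifying assumptions: SketchThread, Attach, Detach, AllocObject and the single entry point are invented here, and ART's real Thread object and entry point tables are far larger.

```cpp
#include <thread>

// Hypothetical, heavily simplified per-thread runtime record.
struct SketchThread {
  int state;
  // One "entry point": a runtime helper that compiled code reaches through
  // the thread record rather than through a relocated symbol.
  int (*alloc_entrypoint)(int);
};

// Placeholder installed first so a call to a missing entry point is visible.
static int UnimplementedEntryPoint(int) { return -1; }

// Stub runtime helper: pretends to allocate and returns the size it was given.
static int AllocObject(int size) { return size; }

// One slot per native thread; nullptr means "not attached to the runtime".
static thread_local SketchThread* tls_self = nullptr;

SketchThread* Current() { return tls_self; }

// Creates the record for the calling thread, fills in the entry point table
// and stores the record in TLS.
SketchThread* Attach() {
  SketchThread* self = new SketchThread{0, &UnimplementedEntryPoint};
  self->alloc_entrypoint = &AllocObject;
  tls_self = self;
  return self;
}

void Detach() {
  delete tls_self;
  tls_self = nullptr;
}

// Stand-in for compiled code: it calls the runtime only through the table of
// the current thread.
int CompiledCodeAllocates(int size) {
  return Current()->alloc_entrypoint(size);
}

// Starts a plain native thread and checks that it has no record until it
// attaches, and that the entry point works once it has.
bool OtherThreadStartsDetached() {
  bool detached_at_start = false;
  bool works_after_attach = false;
  std::thread t([&] {
    detached_at_start = (Current() == nullptr);
    Attach();
    works_after_attach = (Current() != nullptr) && CompiledCodeAllocates(16) == 16;
    Detach();
  });
  t.join();
  return detached_at_start && works_after_attach;
}
```

A thread that never calls Attach has no record at all, which mirrors why purely native threads have no Java stack to dump.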
- Installing the signal handlers
As discussed above, a process that wants to handle a signal itself has to add signal-handling code. ART wraps this in a SignalSet class, which underneath uses the standard Linux interfaces sigaddset, sigemptyset and sigwait. Calling signals.Add(SIGQUIT); signals.Add(SIGUSR1); is what gives SIGQUIT and SIGUSR1 their custom handling. After that comes an infinite loop that calls sigwait to wait for a signal:
while (true) {
  int signal_number = signal_catcher->WaitForSignal(self, signals);
  if (signal_catcher->ShouldHalt()) {
    runtime->DetachCurrentThread();
    return nullptr;
  }
  switch (signal_number) {
    case SIGQUIT:
      signal_catcher->HandleSigQuit();
      break;
    case SIGUSR1:
      signal_catcher->HandleSigUsr1();
      break;
    default:
      LOG(ERROR) << "Unexpected signal %d" << signal_number;
      break;
  }
}
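The wait loop above rests on sigemptyset, sigaddset and sigwait. The following sketch shows that pattern on its own: the signals are blocked first, a dedicated thread waits for them synchronously, and one SIGQUIT is delivered. WaitForOneSignal and CatchOneSigQuit are hypothetical names. To keep the demo contained, the signal is sent with pthread_kill to the waiting thread, whereas system_server sends SIGQUIT to the whole target process.

```cpp
#include <pthread.h>
#include <signal.h>

// Signal number reported by the waiting thread (0 means nothing received).
static int g_received_signal = 0;

// Thread body: waits synchronously for one signal from the given set.
static void* WaitForOneSignal(void* arg) {
  const sigset_t* signals = static_cast<const sigset_t*>(arg);
  int signal_number = 0;
  if (sigwait(signals, &signal_number) == 0) {
    g_received_signal = signal_number;
  }
  return nullptr;
}

// Blocks SIGQUIT and SIGUSR1, starts a waiter thread, delivers SIGQUIT to it
// and returns the signal number the waiter observed (0 on failure).
int CatchOneSigQuit() {
  sigset_t signals;
  sigemptyset(&signals);
  sigaddset(&signals, SIGQUIT);
  sigaddset(&signals, SIGUSR1);

  // Block the signals before creating the thread: the new thread inherits
  // the mask, so the default action (terminating the process) never runs.
  sigset_t old_mask;
  if (pthread_sigmask(SIG_BLOCK, &signals, &old_mask) != 0) {
    return 0;
  }

  g_received_signal = 0;
  pthread_t waiter;
  if (pthread_create(&waiter, nullptr, &WaitForOneSignal, &signals) != 0) {
    pthread_sigmask(SIG_SETMASK, &old_mask, nullptr);
    return 0;
  }

  // The signal stays pending until the waiter reaches sigwait, so it does
  // not matter which side gets there first.
  pthread_kill(waiter, SIGQUIT);
  pthread_join(waiter, nullptr);

  pthread_sigmask(SIG_SETMASK, &old_mask, nullptr);
  return g_received_signal;
}
```

Because sigwait returns the signal as an ordinary value, the handling code runs in normal thread context and is free to take locks and write files, which an asynchronous signal handler could not safely do.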
- Handling SIGQUIT
When an ANR occurs, the system_server process runs dumpStackTraces. This function sends SIGQUIT to the process concerned in order to collect runtime information from it, and the information ends up in /data/anr/traces.txt.
public static File dumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids,
        ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
    String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
    if (tracesPath == null || tracesPath.length() == 0) {
        return null;
    }
    File tracesFile = new File(tracesPath);
    try {
        File tracesDir = tracesFile.getParentFile();
        if (!tracesDir.exists()) {
            tracesDir.mkdirs();
            if (!SELinux.restorecon(tracesDir)) {
                return null;
            }
        }
        FileUtils.setPermissions(tracesDir.getPath(), 0775, -1, -1);  // drwxrwxr-x
        if (clearTraces && tracesFile.exists()) tracesFile.delete();
        tracesFile.createNewFile();
        FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1);  // -rw-rw-rw-
    } catch (IOException e) {
        Slog.w(TAG, "Unable to prepare ANR traces file: " + tracesPath, e);
        return null;
    }
    dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs);
    return tracesFile;
}
When a process receives SIGQUIT, the statement signal_catcher->WaitForSignal(self, signals); in its Signal Catcher thread returns, and HandleSigQuit@SignalCatcher.cc is then called to handle the signal.
void SignalCatcher::HandleSigQuit() {
  Runtime* runtime = Runtime::Current();
  std::ostringstream os;
  ......
  DumpCmdLine(os);
  ......
  runtime->DumpForSigQuit(os);
  ......
  Output(os.str());
}
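The job of the Signal Catcher thread is to print the stacks of the current process (Java, native and kernel) together with some state of the virtual machine; this is the content we see in traces.txt. HandleSigQuit first creates a std::ostringstream and writes everything into it, which means the report is built up in memory. When the dump is complete, Output writes the contents of the stream to the file. Most of the content is produced by Runtime::DumpForSigQuit: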
void Runtime::DumpForSigQuit(std::ostream& os) {
  GetClassLinker()->DumpForSigQuit(os);  // loaded and initialised classes, methods, etc.
  GetInternTable()->DumpForSigQuit(os);
  GetJavaVM()->DumpForSigQuit(os);
  GetHeap()->DumpForSigQuit(os);  // GC information
  TrackedAllocators::Dump(os);  // object allocation information
  os << "\n";
  thread_list_->DumpForSigQuit(os);  // thread stacks
  BaseMutex::DumpAll(os);
}
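The buffer-then-write pattern that HandleSigQuit follows, where every dump step appends to one in-memory stream and the file is touched only once at the end, might look like the sketch below. The helper names and the report text are made up for illustration and do not reproduce the real traces.txt format.

```cpp
#include <fstream>
#include <ostream>
#include <sstream>
#include <string>

// Stand-ins for the individual dump steps; the text they emit is invented.
void DumpCmdLineSketch(std::ostream& os) { os << "Cmd line: demo\n"; }
void DumpRuntimeSketch(std::ostream& os) { os << "threads: (none in this sketch)\n"; }

// Collects the whole report in memory, the way HandleSigQuit collects it in
// a std::ostringstream.
std::string BuildReport(int pid) {
  std::ostringstream os;
  os << "----- pid " << pid << " -----\n";
  DumpCmdLineSketch(os);
  DumpRuntimeSketch(os);
  os << "----- end " << pid << " -----\n";
  return os.str();
}

// Appends the finished report to `path` in one go. Returns false on error.
bool OutputSketch(const std::string& path, const std::string& report) {
  std::ofstream out(path, std::ios::app);
  if (!out) {
    return false;
  }
  out << report;
  out.flush();
  return out.good();
}
```

Building the report in memory first means the file is never left with a half-written record if one of the dump steps takes a long time.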
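Runtime's DumpForSigQuit function shows roughly which pieces of runtime information are dumped. Which information is read matters less than when it is read, that is, under what conditions the dump is taken so that it really captures what we need. GC information, the number of objects currently allocated and the thread stacks generally require every thread in the process to be suspended, so the suspend procedure is what we analyse next. SuspendAll is implemented in Thread_list.cc. It suspends all other threads in the current process and is typically used during operations such as GC and DumpForSigQuit.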
void ThreadList::SuspendAll(const char* cause, bool long_suspend) {
  Thread* self = Thread::Current();
  ......
  ++suspend_all_count_;
  // Increment everybody's suspend count (except our own).
  for (const auto& thread : list_) {
    if (thread == self) {
      continue;
    }
    VLOG(threads) << "requesting thread suspend: " << *thread;
    thread->ModifySuspendCount(self, +1, false);
    ......
  }
}
The implementation of SuspendAll is quite simple. The most important statement is thread->ModifySuspendCount(self, +1, false);, which changes the suspend count of the given Thread object. The core of that function is:
void Thread::ModifySuspendCount(Thread* self, int delta, bool for_debugger) {
  ......
  tls32_.suspend_count += delta;
  ......
  if (tls32_.suspend_count == 0) {
    AtomicClearFlag(kSuspendRequest);
  } else {
    AtomicSetFlag(kSuspendRequest);
    TriggerSuspend();
  }
}
Because the delta we pass in is +1, execution takes the else branch. It first sets the kSuspendRequest flag atomically, which marks this Thread object as having a pending suspend request. So when does a thread get around to checking the flag? CheckSuspend is called from several places in the runtime; here are two of them:
static void GoToRunnable(Thread* self) NO_THREAD_SAFETY_ANALYSIS {
  ArtMethod* native_method = *self->GetManagedStack()->GetTopQuickFrame();
  bool is_fast = native_method->IsFastNative();
  if (!is_fast) {
    self->TransitionFromSuspendedToRunnable();
  } else if (UNLIKELY(self->TestAllFlags())) {
    // In fast JNI mode we never transitioned out of runnable. Perform a suspend check if there
    // is a flag raised.
    DCHECK(Locks::mutator_lock_->IsSharedHeld(self));
    self->CheckSuspend();
  }
}

extern "C" void artTestSuspendFromCode(Thread* self)
    SHARED_LOCKS_REQUIRED(Locks::mutator_lock_) {
  // Called when suspend count check value is 0 and thread->suspend_count_ != 0
  ScopedQuickEntrypointChecks sqec(self);
  self->CheckSuspend();
}
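GoToRunnable is called when a thread switches to the Runnable state, while artTestSuspendFromCode is, as described earlier, one of the functions made available to compiled native code. Both call Thread's CheckSuspend, so once kSuspendRequest is set on a thread's Thread object the thread can almost always be brought to a stop. The exception is a thread that is blocked for some reason while holding the Locks::mutator_lock_ reader-writer lock: the thread calling SuspendAll then blocks on that lock and the suspend eventually times out, as this part of SuspendAll shows: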
void ThreadList::SuspendAll(const char* cause, bool long_suspend) {
  ......
#if HAVE_TIMED_RWLOCK
  while (true) {
    if (Locks::mutator_lock_->ExclusiveLockWithTimeout(self, kThreadSuspendTimeoutMs, 0)) {
      break;
    } else if (!long_suspend_) {
      ......
      UnsafeLogFatalForThreadSuspendAllTimeout();
    }
  }
#else
  Locks::mutator_lock_->ExclusiveLock(self);
#endif
  ......
}
Next we look closely at Thread's CheckSuspend, which is where the current thread is actually suspended.
inline void Thread::CheckSuspend() {
  DCHECK_EQ(Thread::Current(), this);
  for (;;) {
    if (ReadFlag(kCheckpointRequest)) {
      RunCheckpointFunction();
    } else if (ReadFlag(kSuspendRequest)) {
      FullSuspendCheck();
    } else {
      break;
    }
  }
}
If the kCheckpointRequest flag is set, RunCheckpointFunction is executed; if the kSuspendRequest flag is set, FullSuspendCheck is executed. The kCheckpointRequest flag is the one used for dumping thread stacks, and we return to it after finishing SuspendAll. For now we continue with FullSuspendCheck:
void Thread::FullSuspendCheck() {
  VLOG(threads) << this << " self-suspending";
  ATRACE_BEGIN("Full suspend check");
  // Make thread appear suspended to other threads, release mutator_lock_.
  tls32_.suspended_at_suspend_check = true;
  TransitionFromRunnableToSuspended(kSuspended);
  // Transition back to runnable noting requests to suspend, re-acquire share on mutator_lock_.
  TransitionFromSuspendedToRunnable();
  tls32_.suspended_at_suspend_check = false;
  ATRACE_END();
  VLOG(threads) << this << " self-reviving";
}
After TransitionFromRunnableToSuspended is called, the current Java thread is in the kSuspended state. When it then calls TransitionFromSuspendedToRunnable to go back from suspended to Runnable, it blocks on a condition variable. These threads stay blocked until the thread that called SuspendAll goes on to call ResumeAll.
void ThreadList::ResumeAll() {
  Thread* self = Thread::Current();
  ......
  Locks::mutator_lock_->ExclusiveUnlock(self);
  {
    ......
    --suspend_all_count_;
    // Decrement the suspend counts for all threads.
    for (const auto& thread : list_) {
      if (thread == self) {
        continue;
      }
      thread->ModifySuspendCount(self, -1, false);  // lower the thread's suspend count
    }
    ......
    Thread::resume_cond_->Broadcast(self);  // wake the threads waiting on this condition variable
  }
  ......
}
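The suspend and resume handshake described above is cooperative: the requester only raises a flag, and the target thread parks itself the next time it polls the flag. The sketch below reproduces that shape with standard C++ primitives. It is a simplification under stated assumptions: SuspendSketch and its members are invented names, and the waiting that ART does inside TransitionFromSuspendedToRunnable is folded into CheckSuspend here.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

class SuspendSketch {
 public:
  // Requester: raise the request, in the spirit of ModifySuspendCount(+1).
  void RequestSuspend() {
    std::lock_guard<std::mutex> lock(mu_);
    ++suspend_count_;
    suspend_request_.store(true);
  }

  // Requester: wait until the worker has actually parked itself.
  void WaitUntilSuspended() {
    std::unique_lock<std::mutex> lock(mu_);
    cond_.wait(lock, [this] { return parked_; });
  }

  // Requester: drop the request and wake the worker, like ResumeAll's
  // ModifySuspendCount(-1) followed by the broadcast.
  void Resume() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      if (--suspend_count_ == 0) {
        suspend_request_.store(false);
      }
    }
    cond_.notify_all();
  }

  // Worker: cheap poll at a safe point; parks only if a request is pending.
  void CheckSuspend() {
    if (!suspend_request_.load()) {
      return;
    }
    std::unique_lock<std::mutex> lock(mu_);
    parked_ = true;
    cond_.notify_all();  // tell the requester we have stopped
    cond_.wait(lock, [this] { return suspend_count_ == 0; });
    parked_ = false;
  }

 private:
  std::mutex mu_;
  std::condition_variable cond_;
  std::atomic<bool> suspend_request_{false};
  int suspend_count_ = 0;
  bool parked_ = false;
};

// Runs a counting worker, suspends it, checks that the counter stands still
// while it is suspended, then resumes it and lets it exit.
bool WorkerStopsWhileSuspended() {
  SuspendSketch control;
  std::atomic<long> counter{0};
  std::atomic<bool> done{false};
  std::thread worker([&] {
    while (!done.load()) {
      control.CheckSuspend();
      counter.fetch_add(1);
    }
  });
  control.RequestSuspend();
  control.WaitUntilSuspended();
  const long before = counter.load();
  std::this_thread::sleep_for(std::chrono::milliseconds(20));
  const long after = counter.load();
  done.store(true);
  control.Resume();
  worker.join();
  return before == after;
}
```

The sketch also makes the failure mode visible: a worker that never reaches CheckSuspend never parks, and the requester waits forever unless it uses a timeout.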
That completes the analysis of SuspendAll. As mentioned above, dumping thread stacks is not triggered by the kSuspendRequest flag; it is tied to a different flag, kCheckpointRequest. Next we look at the Dump function of Thread_list, which is called from Thread_list's DumpForSigQuit, that is, while the Signal Catcher thread is handling SIGQUIT.
void ThreadList::Dump(std::ostream& os) {
  ......
  DumpCheckpoint checkpoint(&os);
  size_t threads_running_checkpoint = RunCheckpoint(&checkpoint);
  if (threads_running_checkpoint != 0) {
    checkpoint.WaitForThreadsToRunThroughCheckpoint(threads_running_checkpoint);
  }
}
This function first creates a DumpCheckpoint object, checkpoint, and passes it to RunCheckpoint. RunCheckpoint returns the number of threads currently in the Runnable state. DumpCheckpoint's WaitForThreadsToRunThroughCheckpoint is then called to wait until all of those Runnable threads have executed DumpCheckpoint's Run function. If the wait times out, an error is logged, and in a debug build that error is fatal, as shown below:
  void WaitForThreadsToRunThroughCheckpoint(size_t threads_running_checkpoint) {
    Thread* self = Thread::Current();
    ScopedThreadStateChange tsc(self, kWaitingForCheckPointsToRun);
    bool timed_out = barrier_.Increment(self, threads_running_checkpoint, kDumpWaitTimeout);
    if (timed_out) {
      // Avoid a recursive abort.
      LOG((kIsDebugBuild && (gAborting == 0)) ? FATAL : ERROR)
          << "Unexpected time out during dump checkpoint.";
    }
  }
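The wait above is a counting barrier with a timeout: the requester says how many threads it expects, each thread reports in once, and the requester gives up after a deadline. Below is a minimal sketch of that idea using a mutex and a condition variable. BarrierSketch and WaitTimesOut are hypothetical names, and ART's own Barrier class is not reproduced here.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class BarrierSketch {
 public:
  // Called by each thread once it has finished running its checkpoint.
  void Pass() {
    std::lock_guard<std::mutex> lock(mu_);
    --count_;
    cond_.notify_all();
  }

  // Called by the requester: expect `delta` passes and wait up to `timeout`
  // for the count to return to zero. Returns true if the wait timed out.
  bool Increment(int delta, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mu_);
    count_ += delta;
    return !cond_.wait_for(lock, timeout, [this] { return count_ == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cond_;
  int count_ = 0;  // may go negative if threads pass before Increment runs
};

// Starts `workers` threads that each pass the barrier once while the caller
// expects `expected` passes. Returns whether the caller timed out.
bool WaitTimesOut(int expected, int workers, std::chrono::milliseconds timeout) {
  BarrierSketch barrier;
  std::vector<std::thread> threads;
  for (int i = 0; i < workers; ++i) {
    threads.emplace_back([&barrier] { barrier.Pass(); });
  }
  const bool timed_out = barrier.Increment(expected, timeout);
  for (std::thread& t : threads) {
    t.join();
  }
  return timed_out;
}
```

Letting the count go negative is deliberate: a thread that finishes before the requester starts waiting is still accounted for.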
We now turn to RunCheckpoint. The function is fairly long, so we analyse it in two parts.
size_t ThreadList::RunCheckpoint(Closure* checkpoint_function) {
  ......
  for (const auto& thread : list_) {
    if (thread != self) {
      while (true) {
        if (thread->RequestCheckpoint(checkpoint_function)) {
          count++;
          break;
        } else {
          if (thread->GetState() == kRunnable) {
            continue;
          }
          thread->ModifySuspendCount(self, +1, false);
          suspended_count_modified_threads.push_back(thread);
          break;
        }
      }
    }
  }
  ......
  return count;
}
For a thread in the Runnable state, RequestCheckpoint returns true; for a thread in any other state it returns false. Threads of the second kind get the kSuspendRequest flag set, just as in SuspendAll, so that if they later become Runnable they check the flag first and enter the suspended state. RunCheckpoint also records these threads in the vector suspended_count_modified_threads, and for every thread in that vector the Signal Catcher thread triggers the stack dump itself. We come back to this when analysing the second part of RunCheckpoint; first we look at Thread's RequestCheckpoint.
bool Thread::RequestCheckpoint(Closure* function) {
  ......
  // A thread that is not Runnable cannot run a checkpoint itself.
  if (old_state_and_flags.as_struct.state != kRunnable) {
    return false;  // Fail, thread is suspended and so can't run a checkpoint.
  }
  uint32_t available_checkpoint = kMaxCheckpoints;
  for (uint32_t i = 0 ; i < kMaxCheckpoints; ++i) {
    if (tlsPtr_.checkpoint_functions[i] == nullptr) {  // look for a free slot in the array
      available_checkpoint = i;
      break;
    }
  }
  ......
  tlsPtr_.checkpoint_functions[available_checkpoint] = function;  // store the closure in the slot
  // Checkpoint function installed now install flag bit.
  // We must be runnable to request a checkpoint.
  DCHECK_EQ(old_state_and_flags.as_struct.state, kRunnable);
  union StateAndFlags new_state_and_flags;
  new_state_and_flags.as_int = old_state_and_flags.as_int;
  new_state_and_flags.as_struct.flags |= kCheckpointRequest;  // set the kCheckpointRequest flag
  ......
}
As Thread's CheckSuspend showed earlier, a thread with the kCheckpointRequest flag set executes RunCheckpointFunction. RunCheckpointFunction goes through the checkpoint_functions array and calls Run on every entry that is not null.
void Thread::RunCheckpointFunction() {
  ......
  for (uint32_t i = 0; i < kMaxCheckpoints; ++i) {
    if (checkpoints[i] != nullptr) {
      checkpoints[i]->Run(this);
      found_checkpoint = true;
    }
  }
  ......
}
In effect this runs DumpCheckpoint's Run function, because the function argument of RequestCheckpoint(Closure* function) is a DumpCheckpoint object passed down from Thread_list's Dump function. Here is the implementation of DumpCheckpoint's Run:
  void Run(Thread* thread) OVERRIDE {
    Thread* self = Thread::Current();
    std::ostringstream local_os;
    {
      ScopedObjectAccess soa(self);
      thread->Dump(local_os);  // call Thread's Dump function
    }
    ......
  }
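After this long detour, what finally gets called is Thread's Dump function. We do not analyse it further here; it is where a thread's Java, native and kernel stacks are printed, and interested readers can study it on their own. To recap, threads in the Runnable state are handled by calling their RequestCheckpoint function, after which they dump their own stacks, while threads that are not Runnable are added to the suspended_count_modified_threads vector. That brings us to the second part of RunCheckpoint: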
size_t ThreadList::RunCheckpoint(Closure* checkpoint_function) {
  Thread* self = Thread::Current();
  ......
  // Run DumpCheckpoint's Run directly, with the Signal Catcher's own Thread object as argument.
  checkpoint_function->Run(self);
  // Run the checkpoint on the suspended threads.
  for (const auto& thread : suspended_count_modified_threads) {
    ......
    checkpoint_function->Run(thread);  // run DumpCheckpoint's Run on the thread's behalf
    {
      MutexLock mu2(self, *Locks::thread_suspend_count_lock_);
      thread->ModifySuspendCount(self, -1, false);  // lower the suspend count again
    }
  }
  ......
}
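Threads that are not in the Runnable state may never get around to calling Run themselves, so the Signal Catcher thread has to do the dump for them. DumpCheckpoint's Run does the same thing in this case as it does for Runnable threads: it prints the thread's stacks.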
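The two-part behaviour of RunCheckpoint can be condensed into a single-threaded simulation: runnable threads accept the closure and run it later themselves, and all other threads are held back with a suspend count and dumped by the requester. Everything in this sketch (SketchThread, the boolean runnable field, the std::function closure) is a simplification invented for illustration and leaves out all locking and real threads.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct SketchThread;
using Closure = std::function<void(const SketchThread&)>;

// Hypothetical, simplified thread record.
struct SketchThread {
  std::string name;
  bool runnable;
  int suspend_count;
  Closure pending_checkpoint;  // empty when no checkpoint has been requested
};

// In the spirit of Thread::RequestCheckpoint: only a runnable thread accepts
// the closure; any other thread refuses.
bool RequestCheckpoint(SketchThread& thread, const Closure& fn) {
  if (!thread.runnable) {
    return false;
  }
  thread.pending_checkpoint = fn;
  return true;
}

// What the target thread does at its next suspend check: run the pending
// closure on itself and clear the slot.
void RunPendingCheckpoint(SketchThread& thread) {
  if (thread.pending_checkpoint) {
    Closure fn = thread.pending_checkpoint;
    thread.pending_checkpoint = nullptr;
    fn(thread);
  }
}

// In the spirit of ThreadList::RunCheckpoint: returns how many threads will
// run the closure themselves and runs it directly for all the others.
std::size_t RunCheckpoint(std::vector<SketchThread>& threads, const Closure& fn) {
  std::size_t count = 0;
  std::vector<SketchThread*> suspended;
  for (SketchThread& thread : threads) {
    if (RequestCheckpoint(thread, fn)) {
      ++count;
    } else {
      ++thread.suspend_count;  // keep it from becoming runnable meanwhile
      suspended.push_back(&thread);
    }
  }
  for (SketchThread* thread : suspended) {
    fn(*thread);              // the requester dumps on the thread's behalf
    --thread->suspend_count;  // then lets it go again
  }
  return count;
}
```

The returned count is what the requester later waits for, since only the threads that accepted the closure will report back on their own.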