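Among Android stability problems there is one class that we will call Android Native Errors for now. Traditionally they go by names such as segmentation fault or memory access violation. Anyone who has worked on stability knows that this class of problem is hard to analyze. Android does produce a Tombstone log for the crashing process, but it only shows the line of code where the fault finally occurred. Working out why it faulted is usually harder: you need to know that code well and be comfortable with common debugging techniques and tools such as GDB, Crash and coredump files. Later articles will cover those. This article first looks at how a Native Error arises and how ARM and Linux handle it at the low level, so that we understand the why behind it. Once the underlying principles are clear, these problems can be analyzed with much more confidence.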
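- ARMv8-A has two execution states, AArch32 and AArch64. This article only analyzes exception handling in the AArch64 execution state.
- ARMv8-A supports four Exception Levels (EL0, EL1, EL2, EL3). Android applications run at EL0 and Linux runs at EL1; virtualization and Secure software run at EL2 and EL3 respectively. This article only analyzes exceptions taken from EL0. For exception handling and routing configuration at the other levels, readers can dig further into the official ARMv8-A documentation and online material.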
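ARMv8-A Basics - Exception Level
The Exception Level is an important concept in ARMv8-A. Application code and kernel code alike always run at some level, and different levels mean different privileges and different views of the system:
- EL0 code has less privilege than EL1, EL1 less than EL2, and so on. Android applications therefore normally run at EL0, while Linux kernel code runs at EL1 or EL2. It generally does not run at EL3, because EL3 is normally reserved for a small piece of trusted firmware and something like Linux is not suited to that level.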
EL0: Normal user applications. EL0 corresponds to the lowest privilege level and is often described as unprivileged, whereas execution at any Exception level above EL0 is often referred to as privileged execution.
EL1: An operating system kernel typically described as privileged.
EL2: Hypervisor.
EL3: Low-level firmware, including the Secure Monitor.
- The registers visible at each Exception Level also differ, and a register with the same name plays a separate role at each level. For example, every level that can take an exception has its own SPSR_ELx register (x = 1, 2, 3; there is no SPSR_EL0 because exceptions are never taken to EL0), which holds the PSTATE that was in effect when the exception was taken to ELx.
- Exception Levels can be switched. An application that makes a system call deliberately moves to EL1, and the exceptions this article focuses on, such as data aborts and instruction aborts, also cause an Exception Level switch.
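To make the role of SPSR_ELx more concrete, the following is a minimal sketch of how the saved mode field of an AArch64 SPSR value can be decoded in C. The constant names follow the ones Linux uses for this field; the two helper functions are hypothetical names introduced only for illustration.

```c
#include <stdint.h>

/* Mode field of an AArch64 SPSR_ELx value, bits [4:0]. */
#define PSR_MODE32_BIT 0x10u  /* M[4]: 1 = exception was taken from AArch32 */
#define PSR_MODE_EL0t  0x00u  /* EL0, using SP_EL0 */
#define PSR_MODE_EL1h  0x05u  /* EL1, using SP_EL1 */

/* Exception Level the core was at when the exception was taken (M[3:2]).
 * Returns -1 for an AArch32 source, whose mode encoding is different. */
static int spsr_source_el(uint64_t spsr)
{
    if (spsr & PSR_MODE32_BIT)
        return -1;
    return (int)((spsr >> 2) & 0x3u);
}

/* 1 if the interrupted code used SP_ELx (the "h" suffix), 0 for SP_EL0 ("t"). */
static int spsr_uses_sp_elx(uint64_t spsr)
{
    return (int)(spsr & 0x1u);
}
```

For an exception taken from a 64-bit user process, the saved mode is `PSR_MODE_EL0t`, so `spsr_source_el()` yields 0.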
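ARMv8-A Basics - Execution State
ARMv8 has two execution states, AArch64 and AArch32. AArch64 is the new one and provides 31 64-bit general-purpose registers. AArch32 exists for compatibility with ARMv7 and can only use 32-bit general-purpose registers. The core can switch between the two states, for example when switching from a 64-bit process to a 32-bit process: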
- Changing to AArch32 requires going from a higher to a lower Exception level. This is the result of exiting an exception handler by executing the ERET instruction.
- Changing to AArch64 requires going from a lower to a higher Exception level. The exception can be the result of an instruction execution or an external signal.
- If, when taking an exception or returning from an exception, the Exception level remains the same, then the Execution state also cannot change.
- Both AArch64 and AArch32 Execution states have Exception levels that are similar, but there are some differences between Secure and Non-secure operation. The Execution state the processor is in when the exception is generated can limit the Exception levels available to the other Execution state.
- Where an ARMv8-A processor operates in AArch32 Execution state at a particular Exception level, it uses the same exception model as in ARMv7-A for exceptions that are taken to that Exception level.
- Code at EL3 cannot take an exception to a higher Exception level, so cannot change Execution state, except by going through a reset.
- When the processor moves from a higher to a lower Exception level, the Execution state can stay the same, or it can switch from AArch64 to AArch32.
- When moving from a lower to a higher Exception level, the Execution state can stay the same or switch from AArch32 to AArch64.
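The rules in the list above can be condensed into a small predicate. This is a simplified sketch: it only encodes the direction rules quoted above and ignores further constraints (for example Secure versus Non-secure differences). The type and function names are hypothetical.

```c
#include <stdbool.h>

enum exec_state { AARCH32, AARCH64 };

/* Returns true if moving from (from_el, from) to (to_el, to) respects the
 * direction rules for Execution state changes. */
static bool state_change_allowed(int from_el, enum exec_state from,
                                 int to_el, enum exec_state to)
{
    if (from == to)
        return true;                                  /* no state change */
    if (to_el > from_el)                              /* taking an exception */
        return from == AARCH32 && to == AARCH64;
    if (to_el < from_el)                              /* exception return (ERET) */
        return from == AARCH64 && to == AARCH32;
    return false;                                     /* same EL: state is fixed */
}
```

For instance, a 32-bit process at EL0 that traps into a 64-bit kernel at EL1 corresponds to `state_change_allowed(0, AARCH32, 1, AARCH64)`.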
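The two figures below illustrate the Exception Levels available in each of the two execution states.
ARMv8-A Basics - Important Registers
The article "ARMv8 Architecture and Instruction Set: Study Notes" (ARMv8 架构与指令集.学习笔记) already compares the registers of the AArch32 and AArch64 execution states in detail, so two tables from its author are borrowed here: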
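ARM64 Exception Types
With this background in place, we can turn to the main topic. As noted above, a system call changes the CPU's Exception Level. System calls, interrupts and similar events share one technical name: exceptions. An exception means that, while code is executing, some condition or system event requires the current execution to be suspended and control to move to another code path; once that has been handled, the original code resumes. Exceptions can therefore occur at any time. The main types are listed below:
Exception type | Description |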
---|---|
Aborts | Aborts can be generated either on failed instruction fetches (Instruction Aborts) or failed data accesses (Data Aborts). They can come from the external memory system giving an error response on a memory access (indicating perhaps that the specified address does not correspond to real memory in the system). Alternatively, the Memory Management Unit (MMU) of the core generates the abort. An OS can use MMU aborts to allocate memory to applications dynamically. An instruction that cannot be fetched causes an abort. The Instruction Abort exception is taken only if the core then tries to execute it. A Data Abort exception is caused by a load or store instruction and happens after the data read or write has been attempted. An abort is described as being synchronous if it is generated by direct execution of instructions and the return address indicates the instruction which caused it. Otherwise, an abort is described as asynchronous. In AArch64, synchronous aborts cause a Synchronous exception. Asynchronous aborts cause an SError interrupt exception. |
Reset | Reset is treated as a special case because it has its own vector that always targets the highest implemented Exception level. This vector uses an IMPLEMENTATION DEFINED address which is typically set by configuration input signals. The address can be read from the Reset Vector Base Address Register RVBAR_ELn, where n is the number of the highest implemented Exception level. All cores have a reset input and take the reset exception after they have been reset. It is the highest priority exception and cannot be masked. This exception is used to execute code on the core to initialize it, after the system has powered up. |
Exception generating instructions | Execution of these instructions can generate exceptions. They are typically executed to request a service from software that runs at a higher privilege level: The Supervisor Call (SVC) instruction enables User mode programs to request an OS service. The Hypervisor Call (HVC) instruction enables the guest OS to request hypervisor services. The Secure monitor Call (SMC) instruction enables the Normal world to request Secure world services. |
Interrupts | There are three types of interrupts, IRQ, FIQ and SError. IRQ and FIQ are general purpose compared to SError, which is associated specifically with external asynchronous Data Aborts. So typically, the term 'interrupts' refers only to IRQ and FIQ. FIQ is higher priority than IRQ. Both of these interrupts are typically associated with individual input pins for each core. External hardware asserts an interrupt request line and the corresponding exception type is raised when the current instruction finishes executing (although some instructions, those that can load multiple values, can be interrupted), assuming that the interrupt is not disabled. On almost all systems, various interrupt sources are connected using an interrupt controller. The interrupt controller arbitrates and prioritizes interrupts, and in turn, provides a serialized single signal that is then connected to the FIQ or IRQ signal of the core. Because IRQ and FIQ interrupts are not directly related to the software running on the core at any given time, they are classified as asynchronous exceptions. |
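In Android stability work, the exceptions we meet most often are the Instruction Aborts and Data Aborts in the Aborts row. As the table says, an Instruction Abort is taken when the core tries to execute an instruction it could not fetch. For example, when a function is called through a function pointer and that pointer has been corrupted to a bad address, the target may be unmapped or may be a data region rather than legal, executable code, and an Instruction Abort can result. A Data Abort is raised when a Load or Store instruction accesses data, for example when a variable is read or written through a pointer that holds an illegal address.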
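The data-abort case can be observed from user space. The sketch below assumes a Linux system; the helper names (`make_inaccessible_page`, `read_faults`) are hypothetical. It maps a page with no access rights, reads from it, and catches the resulting SIGSEGV so that the faulting address reported by the kernel can be inspected instead of crashing the process.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>

static sigjmp_buf fault_jmp;              /* where to resume after a fault */
static void *volatile fault_addr;         /* si_addr reported by the kernel */
static volatile sig_atomic_t fault_code;  /* si_code, e.g. SEGV_ACCERR */

static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    fault_addr = info->si_addr;
    fault_code = info->si_code;
    siglongjmp(fault_jmp, 1);             /* skip the faulting load */
}

/* Maps one page that may be neither read nor written; NULL on failure. */
static void *make_inaccessible_page(void)
{
    void *p = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Returns 1 if reading *p raised SIGSEGV, 0 if the read succeeded,
 * -1 if the handler could not be installed. */
static int read_faults(const volatile char *p)
{
    struct sigaction sa, old;
    int faulted;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    if (sigsetjmp(fault_jmp, 1) == 0) {
        char c = *p;                      /* the load that may abort */
        (void)c;
        faulted = 0;
    } else {
        faulted = 1;                      /* we arrived here from on_segv */
    }
    sigaction(SIGSEGV, &old, NULL);
    return faulted;
}
```

Calling `read_faults()` on the page returned by `make_inaccessible_page()` leaves the page address in `fault_addr`, which is the same kind of fault address a tombstone reports.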
Hardware Behaviour on an ARM64 Exception
When an exception occurs, the ARM core automatically does the following:
- The SPSR_ELn is updated (where n is the Exception level where the exception is taken), to store the PSTATE information that is required to correctly return at the end of the exception.
- PSTATE is updated to reflect the new processor status (and this can mean that the Exception level is raised, or it can stay the same).
- The address to return to at the end of the exception is stored in ELR_ELn.
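The three steps above, and how ERET later undoes them, can be modelled in a few lines of C. This is a toy model for an exception taken from EL0 to EL1 only; the struct, function and constant names are hypothetical, and the new PSTATE value assumes the usual entry state of EL1 with SP_EL1 selected and D, A, I, F masked.

```c
#include <stdint.h>

#define PSTATE_EL0t        0x000u  /* EL0, SP_EL0, nothing masked */
#define PSTATE_EL1h_MASKED 0x3c5u  /* EL1, SP_EL1, D/A/I/F masked */

struct cpu_model {
    uint64_t pc;
    uint32_t pstate;
    uint64_t elr_el1;   /* return address saved on entry */
    uint32_t spsr_el1;  /* PSTATE saved on entry */
};

/* What the hardware does when an exception is taken to EL1. */
static void take_exception_to_el1(struct cpu_model *cpu, uint64_t return_addr)
{
    cpu->spsr_el1 = cpu->pstate;         /* 1. save the old PSTATE */
    cpu->pstate   = PSTATE_EL1h_MASKED;  /* 2. switch to the new status */
    cpu->elr_el1  = return_addr;         /* 3. remember where to return */
}

/* What ERET does at the end of the handler. */
static void eret_from_el1(struct cpu_model *cpu)
{
    cpu->pstate = cpu->spsr_el1;
    cpu->pc     = cpu->elr_el1;
}
```

For a synchronous abort the return address is the instruction that caused it, which is why the faulting access is retried after the kernel has repaired a page fault.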
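When an exception occurs, the processor must respond to it by running exception handling code. On ARM64 this code is reached through the exception vector table, which is stored in memory. Every Exception Level except EL0 (exceptions are never taken to EL0) has its own vector table, and the base addresses of these tables are held in the VBAR_EL3, VBAR_EL2 and VBAR_EL1 registers. A typical vector table is shown below; see also the article "The ARM64 Boot Process (6): Setting Up the Exception Vector Table" (ARM64的启动过程之(六):异常向量表的设定).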
Address | Exception type | Description |
---|---|---|
VBAR_ELn + 0x000 | Synchronous | Current EL with SP0 |
0x080 | IRQ/vIRQ | Current EL with SP0 |
0x100 | FIQ/vFIQ | Current EL with SP0 |
0x180 | SError/vSError | Current EL with SP0 |
0x200 | Synchronous | Current EL with SPx |
0x280 | IRQ/vIRQ | Current EL with SPx |
0x300 | FIQ/vFIQ | Current EL with SPx |
0x380 | SError/vSError | Current EL with SPx |
0x400 | Synchronous | Lower EL using AArch64 |
0x480 | IRQ/vIRQ | Lower EL using AArch64 |
0x500 | FIQ/vFIQ | Lower EL using AArch64 |
0x580 | SError/vSError | Lower EL using AArch64 |
0x600 | Synchronous | Lower EL using AArch32 |
0x680 | IRQ/vIRQ | Lower EL using AArch32 |
0x700 | FIQ/vFIQ | Lower EL using AArch32 |
0x780 | SError/vSError | Lower EL using AArch32 |
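The table has a regular layout: each of the four source groups occupies 0x200 bytes and each of the four exception types within a group occupies 0x80 bytes. A small sketch of the address calculation, with hypothetical enum and function names:

```c
#include <stdint.h>

/* Row groups of the vector table, in table order. */
enum vec_source { CUR_EL_SP0, CUR_EL_SPX, LOWER_EL_A64, LOWER_EL_A32 };

/* Exception types within each group, in table order. */
enum vec_type { VEC_SYNC, VEC_IRQ, VEC_FIQ, VEC_SERROR };

/* Address the core branches to: VBAR_ELn plus the offset from the table. */
static uint64_t vector_address(uint64_t vbar, enum vec_source src,
                               enum vec_type type)
{
    return vbar + (uint64_t)src * 0x200u + (uint64_t)type * 0x80u;
}
```

A data abort raised by a 64-bit application is a synchronous exception from a lower EL using AArch64, so it lands at `VBAR_EL1 + 0x400`.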
ARM64 Linux Exception Handling
The ARM64 Linux exception vector table is defined in kernel-src/arch/arm64/kernel/entry.S, as shown below. If a Data Abort or Instruction Abort occurs at EL0, that is, in user space, the CPU suspends the current code and jumps to el0_sync. Likewise, if an interrupt arrives while the CPU is at EL0, it is redirected to el0_irq.
/*
* Exception vectors.
*/
.align 11
ENTRY(vectors)
ventry el1_sync_invalid // Synchronous EL1t
ventry el1_irq_invalid // IRQ EL1t
ventry el1_fiq_invalid // FIQ EL1t
ventry el1_error_invalid // Error EL1t
ventry el1_sync // Synchronous EL1h
ventry el1_irq // IRQ EL1h
ventry el1_fiq_invalid // FIQ EL1h
ventry el1_error_invalid // Error EL1h
ventry el0_sync // Synchronous 64-bit EL0
ventry el0_irq // IRQ 64-bit EL0
ventry el0_fiq_invalid // FIQ 64-bit EL0
ventry el0_error_invalid // Error 64-bit EL0
#ifdef CONFIG_COMPAT
ventry el0_sync_compat // Synchronous 32-bit EL0
ventry el0_irq_compat // IRQ 32-bit EL0
ventry el0_fiq_invalid_compat // FIQ 32-bit EL0
ventry el0_error_invalid_compat // Error 32-bit EL0
#else
ventry el0_sync_invalid // Synchronous 32-bit EL0
ventry el0_irq_invalid // IRQ 32-bit EL0
ventry el0_fiq_invalid // FIQ 32-bit EL0
ventry el0_error_invalid // Error 32-bit EL0
#endif
END(vectors)
/*
* EL0 mode handlers.
*/
.align 6
el0_sync:
kernel_entry 0
mrs x25, esr_el1 // read the syndrome register
lsr x24, x25, #ESR_ELx_EC_SHIFT // exception class: extract the exception type from ESR to pick the right handler
cmp x24, #ESR_ELx_EC_SVC64 // SVC in 64-bit state: system calls go to el0_svc
b.eq el0_svc
cmp x24, #ESR_ELx_EC_DABT_LOW // data abort in EL0: a bad data access address at EL0 goes to el0_da
b.eq el0_da
cmp x24, #ESR_ELx_EC_IABT_LOW // instruction abort in EL0
b.eq el0_ia
cmp x24, #ESR_ELx_EC_FP_ASIMD // FP/ASIMD access
b.eq el0_fpsimd_acc
cmp x24, #ESR_ELx_EC_FP_EXC64 // FP/ASIMD exception
b.eq el0_fpsimd_exc
cmp x24, #ESR_ELx_EC_SYS64 // configurable trap
b.eq el0_sys
cmp x24, #ESR_ELx_EC_SP_ALIGN // stack alignment exception
b.eq el0_sp_pc
cmp x24, #ESR_ELx_EC_PC_ALIGN // pc alignment exception
b.eq el0_sp_pc
cmp x24, #ESR_ELx_EC_UNKNOWN // unknown exception in EL0
b.eq el0_undef
cmp x24, #ESR_ELx_EC_BREAKPT_LOW // debug exception in EL0
b.ge el0_dbg
b el0_inv
el0_dbg:
/*
* Debug exception handling
*/
tbnz x24, #0, el0_inv // EL0 only
mrs x0, far_el1
mov x1, x25
mov x2, sp
bl do_debug_exception
enable_dbg
ct_user_exit
b ret_to_user
el0_inv:
enable_dbg
ct_user_exit
mov x0, sp
mov x1, #BAD_SYNC
mov x2, x25
bl bad_mode
b ret_to_user
el0_da: // faulting data (memory) accesses normally take this path
/*
* Data abort handling
*/
mrs x26, far_el1
// enable interrupts before calling the main handler
enable_dbg_and_irq
ct_user_exit
bic x0, x26, #(0xff << 56) // x0, x1 and x2 carry the arguments to do_mem_abort; bic clears the top-byte address tag
mov x1, x25
mov x2, sp
bl do_mem_abort // call do_mem_abort for further handling
b ret_to_user // return to user space
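The dispatch performed by el0_sync can be mirrored in C to decode an ESR value taken from a kernel log. This is a reduced sketch covering only three of the classes; the `ESR_ELx_*` values are the architectural exception class encodings, while the enum and the two function names are hypothetical.

```c
#include <stdint.h>

#define ESR_ELx_EC_SHIFT     26     /* exception class lives in bits [31:26] */
#define ESR_ELx_EC_SVC64     0x15u  /* SVC executed in AArch64 state */
#define ESR_ELx_EC_IABT_LOW  0x20u  /* instruction abort from a lower EL */
#define ESR_ELx_EC_DABT_LOW  0x24u  /* data abort from a lower EL */

enum el0_path { PATH_SVC, PATH_DATA_ABORT, PATH_INSN_ABORT, PATH_OTHER };

/* Same decision el0_sync makes with lsr/cmp/b.eq, for three classes only. */
static enum el0_path el0_sync_dispatch(uint32_t esr)
{
    uint32_t ec = esr >> ESR_ELx_EC_SHIFT;

    switch (ec) {
    case ESR_ELx_EC_SVC64:    return PATH_SVC;
    case ESR_ELx_EC_DABT_LOW: return PATH_DATA_ABORT;
    case ESR_ELx_EC_IABT_LOW: return PATH_INSN_ABORT;
    default:                  return PATH_OTHER;
    }
}

/* Equivalent of "bic x0, x26, #(0xff << 56)": drop the top-byte tag of FAR. */
static uint64_t strip_address_tag(uint64_t far)
{
    return far & ~(0xffULL << 56);
}
```

An ESR of 0x92000007, for example, has exception class 0x24, so it would be routed to el0_da.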
do_mem_abort is defined in kernel-src/arch/arm64/mm/fault.c:
asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
struct pt_regs *regs)
{
const struct fault_info *inf = fault_info + (esr & 63);
struct siginfo info;
if (!inf->fn(addr, esr, regs)) // try the handler selected from the fault_info table; 0 means handled
return;
pr_alert("Unhandled fault: %s (0x%08x) at 0x%016lx\n",
inf->name, esr, addr);
info.si_signo = inf->sig;
info.si_errno = 0;
info.si_code = inf->code;
info.si_addr = (void __user *)addr;
arm64_notify_die("", regs, &info, esr); // the handler above failed, so send a signal to the faulting process
}
static struct fault_info {
int (*fn)(unsigned long addr, unsigned int esr, struct pt_regs *regs);
int sig;
int code;
const char *name;
} fault_info[] = {
{ do_bad, SIGBUS, 0, "ttbr address size fault" },
{ do_bad, SIGBUS, 0, "level 1 address size fault" },
{ do_bad, SIGBUS, 0, "level 2 address size fault" },
{ do_bad, SIGBUS, 0, "level 3 address size fault" },
{ do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 0 translation fault" },
{ do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 1 translation fault" },
{ do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 2 translation fault" },
{ do_page_fault, SIGSEGV, SEGV_MAPERR, "level 3 translation fault" },
{ do_bad, SIGBUS, 0, "unknown 8" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 1 access flag fault" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 2 access flag fault" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 3 access flag fault" },
{ do_bad, SIGBUS, 0, "unknown 12" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 1 permission fault" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 2 permission fault" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 3 permission fault" },
{ do_bad, SIGBUS, 0, "synchronous external abort" },
{ do_bad, SIGBUS, 0, "unknown 17" },
{ do_bad, SIGBUS, 0, "unknown 18" },
{ do_bad, SIGBUS, 0, "unknown 19" },
{ do_bad, SIGBUS, 0, "synchronous external abort (translation table walk)" },
{ do_bad, SIGBUS, 0, "synchronous external abort (translation table walk)" },
{ do_bad, SIGBUS, 0, "synchronous external abort (translation table walk)" },
{ do_bad, SIGBUS, 0, "synchronous external abort (translation table walk)" },
{ do_bad, SIGBUS, 0, "synchronous parity error" },
{ do_bad, SIGBUS, 0, "unknown 25" },
{ do_bad, SIGBUS, 0, "unknown 26" },
{ do_bad, SIGBUS, 0, "unknown 27" },
{ do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
{ do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
{ do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
{ do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
{ do_bad, SIGBUS, 0, "unknown 32" },
{ do_alignment_fault, SIGBUS, BUS_ADRALN, "alignment fault" },
{ do_bad, SIGBUS, 0, "unknown 34" },
{ do_bad, SIGBUS, 0, "unknown 35" },
{ do_bad, SIGBUS, 0, "unknown 36" },
{ do_bad, SIGBUS, 0, "unknown 37" },
{ do_bad, SIGBUS, 0, "unknown 38" },
{ do_bad, SIGBUS, 0, "unknown 39" },
{ do_bad, SIGBUS, 0, "unknown 40" },
{ do_bad, SIGBUS, 0, "unknown 41" },
{ do_bad, SIGBUS, 0, "unknown 42" },
{ do_bad, SIGBUS, 0, "unknown 43" },
{ do_bad, SIGBUS, 0, "unknown 44" },
{ do_bad, SIGBUS, 0, "unknown 45" },
{ do_bad, SIGBUS, 0, "unknown 46" },
{ do_bad, SIGBUS, 0, "unknown 47" },
{ do_bad, SIGBUS, 0, "TLB conflict abort" },
{ do_bad, SIGBUS, 0, "unknown 49" },
{ do_bad, SIGBUS, 0, "unknown 50" },
{ do_bad, SIGBUS, 0, "unknown 51" },
{ do_bad, SIGBUS, 0, "implementation fault (lockdown abort)" },
{ do_bad, SIGBUS, 0, "implementation fault (unsupported exclusive)" },
{ do_bad, SIGBUS, 0, "unknown 54" },
{ do_bad, SIGBUS, 0, "unknown 55" },
{ do_bad, SIGBUS, 0, "unknown 56" },
{ do_bad, SIGBUS, 0, "unknown 57" },
{ do_bad, SIGBUS, 0, "unknown 58" },
{ do_bad, SIGBUS, 0, "unknown 59" },
{ do_bad, SIGBUS, 0, "unknown 60" },
{ do_bad, SIGBUS, 0, "section domain fault" },
{ do_bad, SIGBUS, 0, "page domain fault" },
{ do_bad, SIGBUS, 0, "unknown 63" },
};
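The table is indexed by the low six bits of ESR (`esr & 63`). A compact user-space sketch of the same mapping is shown below, keeping only the signal and si_code columns. The struct and function names are hypothetical, and the result only matters when the selected handler fails: a translation fault that demand paging resolves never produces a signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>

struct fault_class {
    int sig;   /* signal sent if the fault cannot be resolved */
    int code;  /* si_code delivered with it */
};

/* Condensed view of fault_info[]: index = fault status code (esr & 63). */
static struct fault_class classify_fault(uint32_t esr)
{
    unsigned int fsc = esr & 63;
    struct fault_class fc = { SIGBUS, 0 };       /* default: do_bad entries */

    if (fsc >= 4 && fsc <= 7) {                  /* translation faults */
        fc.sig = SIGSEGV;
        fc.code = SEGV_MAPERR;
    } else if ((fsc >= 9 && fsc <= 11) ||        /* access flag faults */
               (fsc >= 13 && fsc <= 15)) {       /* permission faults */
        fc.sig = SIGSEGV;
        fc.code = SEGV_ACCERR;
    } else if (fsc == 33) {                      /* alignment fault */
        fc.sig = SIGBUS;
        fc.code = BUS_ADRALN;
    }
    return fc;
}
```

With this mapping, an ESR ending in 0x07 (level 3 translation fault) corresponds to SIGSEGV with SEGV_MAPERR, and one ending in 0x0f (level 3 permission fault) to SIGSEGV with SEGV_ACCERR.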
A typical call stack for page-fault handling looks like this:
[<ffffff800808bbfc>] bug_handler+0x60/0x90
[<ffffff80080839f4>] brk_handler+0xf4/0x208
[<ffffff800808255c>] do_debug_exception+0x4c/0x114
[<ffffff8008085708>] el1_dbg+0x18/0x8c
[<ffffff8008b2a280>] aee_wdt_atf_entry+0xdc/0xe8
[<ffffff8008166110>] smp_call_function_many+0x254/0x2f4
[<ffffff8008166404>] on_each_cpu_mask+0x48/0xec
[<ffffff80081d5454>] drain_all_pages+0xfc/0x118
[<ffffff80081da110>] __alloc_pages_nodemask+0x764/0xc54
[<ffffff80081df0ac>] __do_page_cache_readahead+0x164/0x314
[<ffffff80081cfcc4>] filemap_fault+0x374/0x45c
[<ffffff80082bf72c>] ext4_filemap_fault+0x34/0x50
[<ffffff80081fedac>] __do_fault+0x48/0xdc
[<ffffff8008202e14>] handle_mm_fault+0x85c/0x1160
[<ffffff800809c6b4>] do_page_fault+0x2ec/0x3c4
[<ffffff8008082354>] do_mem_abort+0x50/0x10c
[<ffffff8008085c24>] el0_da+0x18/0x1c
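For the detailed meaning of the code above, see the article "armv8 Linux内核异常处理相关文件" (ARMv8 Linux kernel exception handling files), which covers it thoroughly, so it is not repeated here. One small detail is worth noting: if a signal such as signal 11 is sent to a process directly with the shell kill command, the exception handling path is never taken, which is why the fault addr in its tombstone log is shown as -------- .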
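The difference between a signal 11 raised by a real fault and one sent with kill is visible in the si_code field of siginfo. The following sketch assumes a Linux or other POSIX system and uses hypothetical helper names. It sends SIGSEGV to the current process with kill() and records the si_code the handler sees; for a signal that originates from kill() this is SI_USER rather than SEGV_MAPERR or SEGV_ACCERR, and no fault address is attached.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t seen_code;    /* si_code observed by the handler */
static volatile sig_atomic_t handler_ran;  /* set once the handler has run */

static void record_origin(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    seen_code = info->si_code;
    handler_ran = 1;
}

/* Sends SIGSEGV to this process with kill() and stores the si_code the
 * handler observed in *code. Returns 0 on success, -1 otherwise. */
static int si_code_of_killed_segv(int *code)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = record_origin;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    handler_ran = 0;

    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;
    if (kill(getpid(), SIGSEGV) != 0) {
        sigaction(SIGSEGV, &old, NULL);
        return -1;
    }
    sigaction(SIGSEGV, &old, NULL);       /* restore the previous handler */

    if (!handler_ran)
        return -1;
    *code = seen_code;
    return 0;
}
```

Because no memory access actually failed, returning from the handler is safe here, which would not be the case for a genuine data abort.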