現(xiàn)象
- java應(yīng)用的web服務(wù)器突然掛掉骇扇,無(wú)任何jvm相關(guān)日志摔竿,重啟后不久再次掛掉
- 再次重啟,不久后機(jī)器掛掉【機(jī)器為虛擬機(jī)】
相關(guān)日志
-
到宿主機(jī)器重啟機(jī)器后少孝,查看dmesg继低,可看到有相關(guān)信息:
[46019.223344] 3065881 pages non-shared [46019.223348] Out of memory: kill process 16211 (java) score 1135790 or a child [46019.225305] Killed process 16211 (java) [46019.293729] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0 [46019.293734] java cpuset=/ mems_allowed=0 [46019.293750] Pid: 2187, comm: java Not tainted 2.6.32-5-amd64 #1 [46019.293752] Call Trace: [46019.293761] [<ffffffff810b643c>] ? oom_kill_process+0x7f/0x23f [46019.293765] [<ffffffff8106bb5e>] ? timekeeping_get_ns+0xe/0x2e [46019.293768] [<ffffffff810b6960>] ? __out_of_memory+0x12a/0x141 [46019.293771] [<ffffffff810b6ab7>] ? out_of_memory+0x140/0x172 [46019.293775] [<ffffffff810ba81c>] ? __alloc_pages_nodemask+0x4ec/0x5fc [46019.293780] [<ffffffff810bbd85>] ? __do_page_cache_readahead+0x9b/0x1b4 [46019.293784] [<ffffffff810bbeba>] ? ra_submit+0x1c/0x20 [46019.293787] [<ffffffff810b4b87>] ? filemap_fault+0x17d/0x2f6 [46019.293793] [<ffffffff810cab26>] ? __do_fault+0x54/0x3c3 [46019.293796] [<ffffffff810cce7a>] ? handle_mm_fault+0x3b8/0x80f [46019.293801] [<ffffffff812ff306>] ? do_page_fault+0x2e0/0x2fc [46019.293805] [<ffffffff812fd1a5>] ? page_fault+0x25/0x30
說(shuō)明進(jìn)程占用內(nèi)存過(guò)高,被系統(tǒng)的oom_killer強(qiáng)行kill所致
猜測(cè)機(jī)器會(huì)掛的原因是:該進(jìn)程占用內(nèi)存過(guò)快過(guò)高稍走,oom_killer來(lái)不及動(dòng)作便已慘遭殃及
初步分析及方案擬定
嘗試再次復(fù)現(xiàn)袁翁,用jdb attach 到j(luò)ava進(jìn)程進(jìn)行remote debug柴底,【久經(jīng)波折后】發(fā)現(xiàn)某個(gè)分頁(yè)請(qǐng)求數(shù)據(jù)的接口會(huì)間歇性的觸發(fā)該現(xiàn)象,用vmstat觀察機(jī)器內(nèi)存使用粱胜,發(fā)現(xiàn)發(fā)生問(wèn)題時(shí)內(nèi)存下降異常迅速柄驻,約每秒100M,數(shù)十秒內(nèi)就會(huì)吃盡機(jī)器所有內(nèi)存
該java進(jìn)程Xmx配置為6G年柠,且獨(dú)占該機(jī)器凿歼。機(jī)器的內(nèi)存有11G
結(jié)合之前的經(jīng)驗(yàn),進(jìn)程占用內(nèi)存遠(yuǎn)超Xmx的現(xiàn)象冗恨,初步認(rèn)為很可能是jni所致答憔。將工程中所有最新用到j(luò)ni的地方review,未能找到明確線索掀抹。
在內(nèi)存急劇下降期間對(duì)java進(jìn)程取mem dump和jstack都未能看到異撑巴兀現(xiàn)象
-
考慮到內(nèi)存消耗的速度,決定用strace來(lái)看故障期間的系統(tǒng)調(diào)用傲武,命令為:
strace -f -t -T -e trace=all -p 20390 2<&1 | tee -a 20390.strace.log
故障發(fā)生時(shí)蓉驹,需要盡快殺掉進(jìn)程,以避免殃及系統(tǒng)掛掉揪利。而gdb正好可以將進(jìn)程suspend态兴,并同時(shí)觀察線程堆棧,可以用來(lái)輔助分析
-
嘗試后發(fā)現(xiàn)strace和gdb不能同時(shí)attach到進(jìn)程疟位,于是初步將分析方案定為:
- 出現(xiàn)故障時(shí)瞻润,先用strace打出系統(tǒng)調(diào)用,打十秒左右
- 停掉strace甜刻,用gdb attach到進(jìn)程绍撞,使進(jìn)程掛起,一方面阻止內(nèi)存的消耗得院,另一方面可用于分析
復(fù)現(xiàn)及分析
-
如上所述方案傻铣,得到故障期間系統(tǒng)調(diào)用異常的地方為:
[pid 21832] 17:15:26 clock_gettime(CLOCK_MONOTONIC, <unfinished ...> [pid 21751] 17:15:26 futex(0x42564324, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1419585326, 152497000}, ffffffff <unfinished ...> [pid 21747] 17:15:26 mprotect(0x7ff8e1e5d000, 32768, PROT_READ|PROT_WRITE <unfinished ...> [pid 21832] 17:15:26 <... clock_gettime resumed> {19389, 378140695}) = 0 <0.000111> [pid 21747] 17:15:26 <... mprotect resumed> ) = 0 <0.000096> [pid 21832] 17:15:26 clock_gettime(CLOCK_MONOTONIC, {19389, 378329426}) = 0 <0.000045> [pid 21832] 17:15:26 clock_gettime(CLOCK_MONOTONIC, {19389, 378442545}) = 0 <0.000044> [pid 21832] 17:15:26 gettimeofday({1419585326, 103206}, NULL) = 0 <0.000046> [pid 21832] 17:15:26 futex(0x7ff9b878d1e4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1419585326, 153206000}, ffffffff <unfinished ...> [pid 21747] 17:15:26 mprotect(0x7ff8e1e65000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000060> [pid 21747] 17:15:26 mprotect(0x7ff8e1e6d000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000045> [pid 21747] 17:15:26 mprotect(0x7ff8e1e75000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000062> [pid 21747] 17:15:26 mprotect(0x7ff8e1e7d000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000043> [pid 21747] 17:15:26 mprotect(0x7ff8e1e85000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000056> [pid 21747] 17:15:26 mprotect(0x7ff8e1e8d000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000104> [pid 21747] 17:15:26 mprotect(0x7ff8e1e95000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000044> [pid 21747] 17:15:26 mprotect(0x7ff8e1e9d000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000062> [pid 21747] 17:15:26 mprotect(0x7ff8e1ea5000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000044> [pid 21747] 17:15:26 mprotect(0x7ff8e1ead000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000055> [pid 21747] 17:15:26 mprotect(0x7ff8e1eb5000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000057> [pid 21747] 17:15:26 mprotect(0x7ff8e1ebd000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000045> [pid 21747] 17:15:26 mprotect(0x7ff8e1ec5000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000045> [pid 21747] 17:15:26 mprotect(0x7ff8e1ecd000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000043> [pid 21747] 17:15:26 mprotect(0x7ff8e1ed5000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000044> [pid 21747] 17:15:26 mprotect(0x7ff8e1edd000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000055>
可以發(fā)現(xiàn)mprotect方法調(diào)用頻繁【結(jié)合故障出現(xiàn)之間的系統(tǒng)調(diào)用進(jìn)行對(duì)比】,且全在21747的線程內(nèi)祥绞,查看doc可知每次會(huì)malloc 32K的內(nèi)存
-
用gdb suspend進(jìn)程后李命,查看21747對(duì)應(yīng)的線程搀庶,并執(zhí)行dt得到其堆棧:
Breakpoint 1, 0x00007ff9e12b14e0 in mprotect () from /lib/x86_64-linux-gnu/libc.so.6 (gdb) bt #0 0x00007ff9e12b14e0 in mprotect () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007ff9e1254671 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x00007ff9e1255b90 in malloc () from /lib/x86_64-linux-gnu/libc.so.6 #3 0x00007ff9e0cd83f8 in os::malloc(unsigned long) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #4 0x00007ff9e07f5f8c in ChunkPool::allocate(unsigned long) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #5 0x00007ff9e07f572a in Chunk::operator new(unsigned long, unsigned long) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #6 0x00007ff9e07f5d11 in Arena::grow(unsigned long) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #7 0x00007ff9e0cbffa8 in Node::out_grow(unsigned int) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #8 0x00007ff9e07c63e1 in Node::add_out(Node*) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #9 0x00007ff9e0c4b785 in PhaseIdealLoop::clone_loop(IdealLoopTree*, Node_List&, int, Node*) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #10 0x00007ff9e0c4fa59 in PhaseIdealLoop::partial_peel(IdealLoopTree*, Node_List&) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #11 0x00007ff9e0c32dac in IdealLoopTree::iteration_split_impl(PhaseIdealLoop*, Node_List&) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #12 0x00007ff9e0c33000 in IdealLoopTree::iteration_split(PhaseIdealLoop*, Node_List&) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #13 0x00007ff9e0c32f68 in IdealLoopTree::iteration_split(PhaseIdealLoop*, Node_List&) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #14 0x00007ff9e0c41095 in PhaseIdealLoop::build_and_optimize(bool, bool) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #15 0x00007ff9e097134f in Compile::Optimize() () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #16 0x00007ff9e096de84 in Compile::Compile(ciEnv*, C2Compiler*, ciMethod*, int, bool, bool) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #17 0x00007ff9e08f0d3e in C2Compiler::compile_method(ciEnv*, ciMethod*, int) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #18 0x00007ff9e09786aa in CompileBroker::invoke_compiler_on_method(CompileTask*) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #19 0x00007ff9e0977f95 in CompileBroker::compiler_thread_loop() () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #20 0x00007ff9e0df0539 in compiler_thread_entry(JavaThread*, Thread*) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #21 0x00007ff9e0de9a41 in JavaThread::run() () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #22 0x00007ff9e0ce0d1f in java_start(Thread*) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so #23 0x00007ff9e176eb50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #24 0x00007ff9e12b4a7d in clone () from /lib/x86_64-linux-gnu/libc.so.6 #25 0x0000000000000000 in ?? () (gdb) cont Continuing.
由各frame中各方法名:CompileBroker::compiler_thread_loop(), Compile::Optimize()等排抬,可以推斷出此時(shí)應(yīng)該與jvm類編譯優(yōu)化相關(guān)邏輯有關(guān)娩践。結(jié)合已有先驗(yàn)知識(shí),知道jvm對(duì)類的編譯會(huì)與調(diào)用次數(shù)等因素有關(guān)丧荐。
在mprotect處下斷點(diǎn)再cont發(fā)現(xiàn)會(huì)繼續(xù)進(jìn)入該斷點(diǎn)缆瓣,多次cont依舊如此
-
至此基本可認(rèn)定是jvm的bug,查看機(jī)器的jdk版本:
java -version java version "1.6.0_35" Java(TM) SE Runtime Environment (build 1.6.0_35-b10) Java HotSpot(TM) 64-Bit Server VM (build 20.10-b01, mixed mode)
而1.6最新的為1.6.0_45, 因此先不深究jvm具體的bug所在虹统,先做升級(jí)
升級(jí)是否修復(fù)該故障弓坞,且聽下回分解:)
后續(xù)
- 現(xiàn)在回頭來(lái)分析該bug觸發(fā)的邏輯隧甚,發(fā)現(xiàn)是測(cè)試同學(xué)為了便于測(cè)試,將該分頁(yè)獲取數(shù)據(jù)接口由每次獲取20條改為了每次獲取200條渡冻。
- 在web啟動(dòng)之后戚扳,若先用20的分頁(yè)進(jìn)地調(diào)用,則會(huì)讓jvm“預(yù)熱”的優(yōu)化該編譯優(yōu)化邏輯族吻,不會(huì)觸發(fā)帽借。而若在web啟動(dòng)之后立刻用200的分頁(yè)請(qǐng)求,則必然會(huì)觸發(fā)該bug
- 有必要對(duì)jdk源碼不同版本進(jìn)行對(duì)比以確認(rèn)相關(guān)邏輯是否已經(jīng)優(yōu)化超歌,當(dāng)然不排隊(duì)jdk還存在類似隱藏較深的bug
下回分解
很悲傷?嘲!巍举!升級(jí)jdk(1.6)后該bug依舊存在脆荷,說(shuō)明該bug未被修復(fù)
決定嘗試設(shè)置jit相關(guān)參數(shù)來(lái)試圖繞過(guò)該bug, 于是設(shè)置-XX:CompileThreshold=1, 發(fā)現(xiàn)bug不再能復(fù)現(xiàn)。但同時(shí)發(fā)現(xiàn)web服務(wù)啟動(dòng)時(shí)間變長(zhǎng)懊悯,原因應(yīng)該是做了大量的編譯工作蜓谋,并且web啟動(dòng)之后,cpu占用波動(dòng)較大炭分,持續(xù)較長(zhǎng)時(shí)間后才回歸平穩(wěn)————依舊是不斷在做編譯
-
另外桃焕,根據(jù)doc:
-Xint, -Xcomp, and -Xmixed The two flags -Xint and -Xcomp are not too relevant for our everyday work, but highly interesting in order to learn something about the JVM. The -Xint flag forces the JVM to execute all bytecode in interpreted mode, which comes along with a considerable slowdown, usually factor 10 or higher. On the contrary, the flag -Xcomp forces exactly the opposite behavior, that is, the JVM compiles all bytecode into native code on first use, thereby applying maximum optimization level. This sounds nice, because it completely avoids the slow interpreter. However, many applications will also suffer at least a bit from the use of -Xcomp, even if the drop in performance is not comparable with the one resulting from -Xint. The reason is that by setting-Xcomp we prevent the JVM from making use of its JIT compiler to full effect. The JIT compiler creates method usage profiles at run time and then optimizes single methods (or parts of them) step by step, and sometimes speculatively, to the actual application behavior. Some of these optimization techniques, e.g., optimistic branch prediction, cannot be applied effectively without first profiling the application. Another aspect is that methods are only getting compiled at all when they prove themselves relevant, i.e., constitute some kind of hot spot in the application. Methods that are called rarely (or even only once) are continued to be executed in interpreted mode, thus saving the compilation and optimization cost.
可以知道上述配置或是直接用Xcomp雖然看似可以避免該bug的觸發(fā),但會(huì)對(duì)性能有較大的損傷捧毛,因?yàn)閖it會(huì)根據(jù)調(diào)用次數(shù)及性能的統(tǒng)計(jì)信息來(lái)優(yōu)化bytecode观堂,如果直接comp就得不到這些統(tǒng)計(jì)信息優(yōu)化的不夠好了
處理方案
- 不做處理:考慮到之前線上服務(wù)表現(xiàn)平穩(wěn),從未觸發(fā)該bug岖妄,以及該bug要么在服務(wù)啟動(dòng)之后盡快出現(xiàn)型将,要么不會(huì)出現(xiàn)寂祥,因此暫不做優(yōu)化調(diào)整荐虐,而是上線后觀察兩三分鐘,不觸發(fā)該bug才認(rèn)為上線成功
- 給oracle報(bào)bug
- 考慮升級(jí)jdk到1.7或1.8
相關(guān)閱讀
- JVM性能優(yōu)化1-JVM簡(jiǎn)介
- JVM性能優(yōu)化2-編譯器
- JVM性能優(yōu)化3-垃圾回收
- JVM性能優(yōu)化4-C4垃圾回收
- JVM性能優(yōu)化5-Java的伸縮性
- Useful JVM Flags 1-JVM Types and Compiler Modes
- Useful JVM Flags 2-Flag Categories and JIT Compiler Diagnostics
- Useful JVM Flags 3-Printing all XX Flags and their Values
- Useful JVM Flags 4-Heap Tuning
- Useful JVM Flags 5-Young Generation Garbage Collection
- Useful JVM Flags 6-Throughput Collector
- Useful JVM Flags 7-CMS Collector
- Useful JVM Flags 8-GC Logging
致謝
- 感謝北京新觀念技術(shù)服務(wù)有限公司CEO李斯寧提供技術(shù)支持與分析討論