用gdb & strace追蹤 jdk OOM bug

現(xiàn)象

  • java應(yīng)用的web服務(wù)器突然掛掉骇扇,無(wú)任何jvm相關(guān)日志摔竿,重啟后不久再次掛掉
  • 再次重啟,不久后機(jī)器掛掉【機(jī)器為虛擬機(jī)】

相關(guān)日志

  • 到宿主機(jī)器重啟機(jī)器后少孝,查看dmesg继低,可看到有相關(guān)信息:

      [46019.223344] 3065881 pages non-shared
      [46019.223348] Out of memory: kill process 16211 (java) score 1135790 or a child
      [46019.225305] Killed process 16211 (java)
      [46019.293729] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
      [46019.293734] java cpuset=/ mems_allowed=0
      [46019.293750] Pid: 2187, comm: java Not tainted 2.6.32-5-amd64 #1
      [46019.293752] Call Trace:
      [46019.293761]  [<ffffffff810b643c>] ? oom_kill_process+0x7f/0x23f
      [46019.293765]  [<ffffffff8106bb5e>] ? timekeeping_get_ns+0xe/0x2e
      [46019.293768]  [<ffffffff810b6960>] ? __out_of_memory+0x12a/0x141
      [46019.293771]  [<ffffffff810b6ab7>] ? out_of_memory+0x140/0x172
      [46019.293775]  [<ffffffff810ba81c>] ? __alloc_pages_nodemask+0x4ec/0x5fc
      [46019.293780]  [<ffffffff810bbd85>] ? __do_page_cache_readahead+0x9b/0x1b4
      [46019.293784]  [<ffffffff810bbeba>] ? ra_submit+0x1c/0x20
      [46019.293787]  [<ffffffff810b4b87>] ? filemap_fault+0x17d/0x2f6
      [46019.293793]  [<ffffffff810cab26>] ? __do_fault+0x54/0x3c3
      [46019.293796]  [<ffffffff810cce7a>] ? handle_mm_fault+0x3b8/0x80f
      [46019.293801]  [<ffffffff812ff306>] ? do_page_fault+0x2e0/0x2fc
      [46019.293805]  [<ffffffff812fd1a5>] ? page_fault+0x25/0x30
    
  • 說(shuō)明進(jìn)程占用內(nèi)存過(guò)高,被系統(tǒng)的oom_killer強(qiáng)行kill所致

  • 猜測(cè)機(jī)器會(huì)掛的原因是:該進(jìn)程占用內(nèi)存過(guò)快過(guò)高稍走,oom_killer來(lái)不及動(dòng)作便已慘遭殃及

初步分析及方案擬定

  • 嘗試再次復(fù)現(xiàn)袁翁,用jdb attach 到j(luò)ava進(jìn)程進(jìn)行remote debug柴底,【久經(jīng)波折后】發(fā)現(xiàn)某個(gè)分頁(yè)請(qǐng)求數(shù)據(jù)的接口會(huì)間歇性的觸發(fā)該現(xiàn)象,用vmstat觀察機(jī)器內(nèi)存使用粱胜,發(fā)現(xiàn)發(fā)生問(wèn)題時(shí)內(nèi)存下降異常迅速柄驻,約每秒100M,數(shù)十秒內(nèi)就會(huì)吃盡機(jī)器所有內(nèi)存

  • 該java進(jìn)程Xmx配置為6G年柠,且獨(dú)占該機(jī)器凿歼。機(jī)器的內(nèi)存有11G

  • 結(jié)合之前的經(jīng)驗(yàn),進(jìn)程占用內(nèi)存遠(yuǎn)超Xmx的現(xiàn)象冗恨,初步認(rèn)為很可能是jni所致答憔。將工程中所有最新用到j(luò)ni的地方review,未能找到明確線索掀抹。

  • 在內(nèi)存急劇下降期間對(duì)java進(jìn)程取mem dump和jstack都未能看到異撑巴兀現(xiàn)象

  • 考慮到內(nèi)存消耗的速度,決定用strace來(lái)看故障期間的系統(tǒng)調(diào)用傲武,命令為:

      strace -f -t -T -e trace=all -p 20390 2<&1 | tee -a 20390.strace.log
    
  • 故障發(fā)生時(shí)蓉驹,需要盡快殺掉進(jìn)程,以避免殃及系統(tǒng)掛掉揪利。而gdb正好可以將進(jìn)程suspend态兴,并同時(shí)觀察線程堆棧,可以用來(lái)輔助分析

  • 嘗試后發(fā)現(xiàn)strace和gdb不能同時(shí)attach到進(jìn)程疟位,于是初步將分析方案定為:

    • 出現(xiàn)故障時(shí)瞻润,先用strace打出系統(tǒng)調(diào)用,打十秒左右
    • 停掉strace甜刻,用gdb attach到進(jìn)程绍撞,使進(jìn)程掛起,一方面阻止內(nèi)存的消耗得院,另一方面可用于分析

復(fù)現(xiàn)及分析

  • 如上所述方案傻铣,得到故障期間系統(tǒng)調(diào)用異常的地方為:

      [pid 21832] 17:15:26 clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
      [pid 21751] 17:15:26 futex(0x42564324, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1419585326, 152497000}, ffffffff <unfinished ...>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1e5d000, 32768, PROT_READ|PROT_WRITE <unfinished ...>
      [pid 21832] 17:15:26 <... clock_gettime resumed> {19389, 378140695}) = 0 <0.000111>
      [pid 21747] 17:15:26 <... mprotect resumed> ) = 0 <0.000096>
      [pid 21832] 17:15:26 clock_gettime(CLOCK_MONOTONIC, {19389, 378329426}) = 0 <0.000045>
      [pid 21832] 17:15:26 clock_gettime(CLOCK_MONOTONIC, {19389, 378442545}) = 0 <0.000044>
      [pid 21832] 17:15:26 gettimeofday({1419585326, 103206}, NULL) = 0 <0.000046>
      [pid 21832] 17:15:26 futex(0x7ff9b878d1e4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1419585326, 153206000}, ffffffff <unfinished ...>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1e65000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000060>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1e6d000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000045>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1e75000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000062>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1e7d000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000043>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1e85000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000056>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1e8d000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000104>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1e95000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000044>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1e9d000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000062>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1ea5000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000044>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1ead000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000055>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1eb5000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000057>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1ebd000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000045>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1ec5000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000045>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1ecd000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000043>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1ed5000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000044>
      [pid 21747] 17:15:26 mprotect(0x7ff8e1edd000, 32768, PROT_READ|PROT_WRITE) = 0 <0.000055>
    
  • 可以發(fā)現(xiàn)mprotect方法調(diào)用頻繁【結(jié)合故障出現(xiàn)之間的系統(tǒng)調(diào)用進(jìn)行對(duì)比】,且全在21747的線程內(nèi)祥绞,查看doc可知每次會(huì)malloc 32K的內(nèi)存

  • 用gdb suspend進(jìn)程后李命,查看21747對(duì)應(yīng)的線程搀庶,并執(zhí)行dt得到其堆棧:

      Breakpoint 1, 0x00007ff9e12b14e0 in mprotect () from /lib/x86_64-linux-gnu/libc.so.6
      (gdb) bt
      #0  0x00007ff9e12b14e0 in mprotect () from /lib/x86_64-linux-gnu/libc.so.6
      #1  0x00007ff9e1254671 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
      #2  0x00007ff9e1255b90 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
      #3  0x00007ff9e0cd83f8 in os::malloc(unsigned long) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #4  0x00007ff9e07f5f8c in ChunkPool::allocate(unsigned long) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #5  0x00007ff9e07f572a in Chunk::operator new(unsigned long, unsigned long) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #6  0x00007ff9e07f5d11 in Arena::grow(unsigned long) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #7  0x00007ff9e0cbffa8 in Node::out_grow(unsigned int) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #8  0x00007ff9e07c63e1 in Node::add_out(Node*) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #9  0x00007ff9e0c4b785 in PhaseIdealLoop::clone_loop(IdealLoopTree*, Node_List&, int, Node*) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #10 0x00007ff9e0c4fa59 in PhaseIdealLoop::partial_peel(IdealLoopTree*, Node_List&) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #11 0x00007ff9e0c32dac in IdealLoopTree::iteration_split_impl(PhaseIdealLoop*, Node_List&) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #12 0x00007ff9e0c33000 in IdealLoopTree::iteration_split(PhaseIdealLoop*, Node_List&) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #13 0x00007ff9e0c32f68 in IdealLoopTree::iteration_split(PhaseIdealLoop*, Node_List&) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #14 0x00007ff9e0c41095 in PhaseIdealLoop::build_and_optimize(bool, bool) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #15 0x00007ff9e097134f in Compile::Optimize() () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #16 0x00007ff9e096de84 in Compile::Compile(ciEnv*, C2Compiler*, ciMethod*, int, bool, bool) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #17 0x00007ff9e08f0d3e in C2Compiler::compile_method(ciEnv*, ciMethod*, int) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #18 0x00007ff9e09786aa in CompileBroker::invoke_compiler_on_method(CompileTask*) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #19 0x00007ff9e0977f95 in CompileBroker::compiler_thread_loop() () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #20 0x00007ff9e0df0539 in compiler_thread_entry(JavaThread*, Thread*) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #21 0x00007ff9e0de9a41 in JavaThread::run() () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #22 0x00007ff9e0ce0d1f in java_start(Thread*) () from /global/install/jdk1.6.0_35/jre/lib/amd64/server/libjvm.so
      #23 0x00007ff9e176eb50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
      #24 0x00007ff9e12b4a7d in clone () from /lib/x86_64-linux-gnu/libc.so.6
      #25 0x0000000000000000 in ?? ()
      (gdb) cont
      Continuing.
    
  • 由各frame中各方法名:CompileBroker::compiler_thread_loop(), Compile::Optimize()等排抬,可以推斷出此時(shí)應(yīng)該與jvm類編譯優(yōu)化相關(guān)邏輯有關(guān)娩践。結(jié)合已有先驗(yàn)知識(shí),知道jvm對(duì)類的編譯會(huì)與調(diào)用次數(shù)等因素有關(guān)丧荐。

  • 在mprotect處下斷點(diǎn)再cont發(fā)現(xiàn)會(huì)繼續(xù)進(jìn)入該斷點(diǎn)缆瓣,多次cont依舊如此

  • 至此基本可認(rèn)定是jvm的bug,查看機(jī)器的jdk版本:

      java -version
      java version "1.6.0_35"
      Java(TM) SE Runtime Environment (build 1.6.0_35-b10)
      Java HotSpot(TM) 64-Bit Server VM (build 20.10-b01, mixed mode)
    
  • 而1.6最新的為1.6.0_45, 因此先不深究jvm具體的bug所在虹统,先做升級(jí)

  • 升級(jí)是否修復(fù)該故障弓坞,且聽下回分解:)

后續(xù)

  • 現(xiàn)在回頭來(lái)分析該bug觸發(fā)的邏輯隧甚,發(fā)現(xiàn)是測(cè)試同學(xué)為了便于測(cè)試,將該分頁(yè)獲取數(shù)據(jù)接口由每次獲取20條改為了每次獲取200條渡冻。
  • 在web啟動(dòng)之后戚扳,若先用20的分頁(yè)進(jìn)地調(diào)用,則會(huì)讓jvm“預(yù)熱”的優(yōu)化該編譯優(yōu)化邏輯族吻,不會(huì)觸發(fā)帽借。而若在web啟動(dòng)之后立刻用200的分頁(yè)請(qǐng)求,則必然會(huì)觸發(fā)該bug
  • 有必要對(duì)jdk源碼不同版本進(jìn)行對(duì)比以確認(rèn)相關(guān)邏輯是否已經(jīng)優(yōu)化超歌,當(dāng)然不排隊(duì)jdk還存在類似隱藏較深的bug

下回分解

  • 很悲傷?嘲!巍举!升級(jí)jdk(1.6)后該bug依舊存在脆荷,說(shuō)明該bug未被修復(fù)

  • 決定嘗試設(shè)置jit相關(guān)參數(shù)來(lái)試圖繞過(guò)該bug, 于是設(shè)置-XX:CompileThreshold=1, 發(fā)現(xiàn)bug不再能復(fù)現(xiàn)。但同時(shí)發(fā)現(xiàn)web服務(wù)啟動(dòng)時(shí)間變長(zhǎng)懊悯,原因應(yīng)該是做了大量的編譯工作蜓谋,并且web啟動(dòng)之后,cpu占用波動(dòng)較大炭分,持續(xù)較長(zhǎng)時(shí)間后才回歸平穩(wěn)————依舊是不斷在做編譯

  • 另外桃焕,根據(jù)doc:

          -Xint, -Xcomp, and -Xmixed
      The two flags -Xint and -Xcomp are not too relevant for our everyday work, but highly interesting in order to learn something about the JVM. 
      The -Xint flag forces the JVM to execute all bytecode in interpreted mode, which comes along with a considerable slowdown, usually factor 10 or higher. 
      On the contrary, the flag -Xcomp forces exactly the opposite behavior, that is, the JVM compiles all bytecode into native code on first use, thereby applying maximum optimization level. 
      This sounds nice, because it completely avoids the slow interpreter. 
      However, many applications will also suffer at least a bit from the use of -Xcomp, even if the drop in performance is not comparable with the one resulting from -Xint. 
      The reason is that by setting-Xcomp we prevent the JVM from making use of its JIT compiler to full effect.
      The JIT compiler creates method usage profiles at run time and then optimizes single methods (or parts of them) step by step, and sometimes speculatively, to the actual application behavior. 
      Some of these optimization techniques, e.g., optimistic branch prediction, cannot be applied effectively without first profiling the application. 
      Another aspect is that methods are only getting compiled at all when they prove themselves relevant, i.e., constitute some kind of hot spot in the application. 
      Methods that are called rarely (or even only once) are continued to be executed in interpreted mode, thus saving the compilation and optimization cost.
    
  • 可以知道上述配置或是直接用Xcomp雖然看似可以避免該bug的觸發(fā),但會(huì)對(duì)性能有較大的損傷捧毛,因?yàn)閖it會(huì)根據(jù)調(diào)用次數(shù)及性能的統(tǒng)計(jì)信息來(lái)優(yōu)化bytecode观堂,如果直接comp就得不到這些統(tǒng)計(jì)信息優(yōu)化的不夠好了

處理方案

  • 不做處理:考慮到之前線上服務(wù)表現(xiàn)平穩(wěn),從未觸發(fā)該bug岖妄,以及該bug要么在服務(wù)啟動(dòng)之后盡快出現(xiàn)型将,要么不會(huì)出現(xiàn)寂祥,因此暫不做優(yōu)化調(diào)整荐虐,而是上線后觀察兩三分鐘,不觸發(fā)該bug才認(rèn)為上線成功
  • 給oracle報(bào)bug
  • 考慮升級(jí)jdk到1.7或1.8

相關(guān)閱讀

致謝

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末丸凭,一起剝皮案震驚了整個(gè)濱河市福扬,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌惜犀,老刑警劉巖铛碑,帶你破解...
    沈念sama閱讀 217,542評(píng)論 6 504
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場(chǎng)離奇詭異虽界,居然都是意外死亡汽烦,警方通過(guò)查閱死者的電腦和手機(jī),發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 92,822評(píng)論 3 394
  • 文/潘曉璐 我一進(jìn)店門莉御,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)撇吞,“玉大人俗冻,你說(shuō)我怎么就攤上這事‰咕保” “怎么了迄薄?”我有些...
    開封第一講書人閱讀 163,912評(píng)論 0 354
  • 文/不壞的土叔 我叫張陵,是天一觀的道長(zhǎng)煮岁。 經(jīng)常有香客問(wèn)我讥蔽,道長(zhǎng),這世上最難降的妖魔是什么画机? 我笑而不...
    開封第一講書人閱讀 58,449評(píng)論 1 293
  • 正文 為了忘掉前任冶伞,我火速辦了婚禮,結(jié)果婚禮上步氏,老公的妹妹穿的比我還像新娘碰缔。我一直安慰自己,他們只是感情好戳护,可當(dāng)我...
    茶點(diǎn)故事閱讀 67,500評(píng)論 6 392
  • 文/花漫 我一把揭開白布金抡。 她就那樣靜靜地躺著,像睡著了一般腌且。 火紅的嫁衣襯著肌膚如雪梗肝。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 51,370評(píng)論 1 302
  • 那天铺董,我揣著相機(jī)與錄音巫击,去河邊找鬼。 笑死精续,一個(gè)胖子當(dāng)著我的面吹牛坝锰,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播重付,決...
    沈念sama閱讀 40,193評(píng)論 3 418
  • 文/蒼蘭香墨 我猛地睜開眼顷级,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼!你這毒婦竟也來(lái)了确垫?” 一聲冷哼從身側(cè)響起弓颈,我...
    開封第一講書人閱讀 39,074評(píng)論 0 276
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤,失蹤者是張志新(化名)和其女友劉穎删掀,沒想到半個(gè)月后翔冀,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 45,505評(píng)論 1 314
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡披泪,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 37,722評(píng)論 3 335
  • 正文 我和宋清朗相戀三年纤子,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點(diǎn)故事閱讀 39,841評(píng)論 1 348
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡控硼,死狀恐怖跌捆,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情象颖,我是刑警寧澤佩厚,帶...
    沈念sama閱讀 35,569評(píng)論 5 345
  • 正文 年R本政府宣布,位于F島的核電站说订,受9級(jí)特大地震影響抄瓦,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜陶冷,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,168評(píng)論 3 328
  • 文/蒙蒙 一钙姊、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧埂伦,春花似錦煞额、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,783評(píng)論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)。三九已至基跑,卻和暖如春婚温,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背媳否。 一陣腳步聲響...
    開封第一講書人閱讀 32,918評(píng)論 1 269
  • 我被黑心中介騙來(lái)泰國(guó)打工栅螟, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留,地道東北人篱竭。 一個(gè)月前我還...
    沈念sama閱讀 47,962評(píng)論 2 370
  • 正文 我出身青樓力图,卻偏偏與公主長(zhǎng)得像,于是被迫代替她去往敵國(guó)和親掺逼。 傳聞我的和親對(duì)象是個(gè)殘疾皇子吃媒,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 44,781評(píng)論 2 354

推薦閱讀更多精彩內(nèi)容