現(xiàn)象
接口頻繁499, 域名報警逐漸增多.
排查
- 查看線上日志:調(diào)用服務(wù)B接口,超時非常多,最長響應(yīng)時間幾千ms.
- 找到服務(wù)B負(fù)責(zé)人進(jìn)行查看,因為沒有采用相同trace方式,只能給出大概時間范圍,但是響應(yīng)時間并未有超過100ms的.考慮是網(wǎng)絡(luò)原因.
- 找OP查看網(wǎng)絡(luò)是否出現(xiàn)問題,自己查看ping延遲不高.等待OP回復(fù).
- OP說網(wǎng)絡(luò)沒問題,但是暫時沒其他思路,沒考慮應(yīng)用內(nèi)部導(dǎo)致.
- 偶然間查看調(diào)用服務(wù)B接口的配置,發(fā)現(xiàn)兩個參數(shù)
minConnection
與maxConnection
后產(chǎn)生新思路,因為引用的是服務(wù)B的sdk,內(nèi)部是不是采用了連接池,當(dāng)時配置數(shù)量較小,難道是流量上來之后獲取連接導(dǎo)致了阻塞? - 找到服務(wù)B的同學(xué)確認(rèn)內(nèi)部實現(xiàn),經(jīng)確認(rèn)后并未采用連接池,這兩個參數(shù)沒作用,是保留參數(shù).
- 又沒什么其他思路了,考慮可能應(yīng)用內(nèi)部問題,這塊補充說明下在開始是看了gc log的發(fā)現(xiàn)一次minor gc時間在50-100ms所以排除是gc導(dǎo)致的.
- 實在沒其他猜測,gclog再看一遍吧!把gc log拿下來,使用gceasy.io進(jìn)行解析.
-
分析后發(fā)現(xiàn)產(chǎn)生了,幾次full gc,我們使用的是G1 gc,產(chǎn)生full gc是比較嚴(yán)重的現(xiàn)象.
fullgc.png - 找到發(fā)生full gc之前逐行查看,可以看到堆內(nèi)存基本上已經(jīng)滿了,對于G1 gc來說已經(jīng)是非常不健康的狀態(tài)了.
2020-03-20T11:08:24.499+0800: 1433856.095: [SoftReference, 0 refs, 0.0031353 secs]2020-03-20T11:08:24.502+0800: 1433856.098: [WeakReference, 0 refs, 0.0010686 secs]2020-03-20T11:08:24.503+0800: 1433856.099: [FinalReference, 0 refs, 0.0012377 secs]2020-03-20T11:08:24.504+0800: 1433856.101: [PhantomReference, 0 refs, 0 refs, 0.0023456 secs]2020-03-20T11:08:24.507+0800: 1433856.103: [JNI Weak Reference, 0.0000457 secs], 0.0284242 secs]
[Parallel Time: 15.4 ms, GC Workers: 18]
[GC Worker Start (ms): Min: 1433856077.7, Avg: 1433856077.9, Max: 1433856078.1, Diff: 0.3]
[Ext Root Scanning (ms): Min: 7.9, Avg: 8.8, Max: 15.2, Diff: 7.3, Sum: 157.8]
[Update RS (ms): Min: 0.0, Avg: 0.1, Max: 1.0, Diff: 1.0, Sum: 1.1]
[Processed Buffers: Min: 0, Avg: 0.1, Max: 1, Diff: 1, Sum: 2]
[Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Object Copy (ms): Min: 0.0, Avg: 0.2, Max: 0.4, Diff: 0.4, Sum: 3.0]
[Termination (ms): Min: 0.0, Avg: 6.0, Max: 6.8, Diff: 6.8, Sum: 107.2]
[Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
[GC Worker Total (ms): Min: 14.9, Avg: 15.1, Max: 15.2, Diff: 0.3, Sum: 271.2]
[GC Worker End (ms): Min: 1433856093.0, Avg: 1433856093.0, Max: 1433856093.0, Diff: 0.0]
[Code Root Fixup: 0.2 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.2 ms]
[Other: 12.6 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 9.7 ms]
[Ref Enq: 0.2 ms]
[Redirty Cards: 0.3 ms]
[Humongous Register: 0.4 ms]
[Humongous Reclaim: 0.1 ms]
[Free CSet: 0.0 ms]
[Eden: 0.0B(204.0M)->0.0B(204.0M) Survivors: 0.0B->0.0B Heap: 4010.2M(4096.0M)->4010.2M(4096.0M)]
# 之前指關(guān)注了時間30ms還不錯,但是沒關(guān)注內(nèi)存的分布情況 因為設(shè)置目標(biāo)暫停時間最大為200ms,為了實現(xiàn)這個暫停時間eden區(qū)域已經(jīng)減少了非常多
[Times: user=0.20 sys=0.00, real=0.03 secs]
2020-03-20T11:08:24.510+0800: 1433856.106: Total time for which application threads were stopped: 0.0454852 seconds, Stopping threads took: 0.0002360 seconds
# 嘗試擴(kuò)展堆大小 但是已經(jīng)是最大了不能在擴(kuò)展了
1433856.121: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: allocation request failed, allocation request: 24 bytes]
1433856.121: [G1Ergonomics (Heap Sizing) expand the heap, requested expansion amount: 2097152 bytes, attempted expansion amount: 2097152 bytes]
1433856.121: [G1Ergonomics (Heap Sizing) did not expand the heap, reason: heap already fully expanded]
# 發(fā)生full gc原因是Allocation Failure分配失敗.
2020-03-20T11:08:24.525+0800: 1433856.122: [Full GC (Allocation Failure) 2020-03-20T11:08:25.194+0800: 1433856.790: [SoftReference, 4841 refs, 0.0011466 secs]2020-03-20T11:08:25.195+0800: 1433856.791: [WeakReference, 136779 refs, 0.0211710 secs]2020-03-20T11:08:25.216+0800: 1433856.813: [FinalReference, 2718 refs, 0.0040877 secs]2020-03-20T11:08:25.220+0800: 1433856.817: [PhantomReference, 0 refs, 1140 refs, 0.0003886 secs]2020-03-20T11:08:25.221+0800: 1433856.817: [JNI Weak Reference, 0.0000759 secs] 4010M->788M(4096M), 2.6614008 secs]
- 繼續(xù)向上尋找蛛絲馬跡
2020-03-20T11:08:22.214+0800: 1433853.810: [SoftReference, 0 refs, 0.0065309 secs]2020-03-20T11:08:22.221+0800: 1433853.817: [WeakReference, 0 refs, 0.0045773 secs]2020-03-20T11:08:22.225+0800: 1433853.822: [FinalReference, 0 refs, 0.0010537 secs]2020-03-20T11:08:22.226+0800: 1433853.823: [PhantomReference, 0 refs, 0 refs, 0.0019795 secs]2020-03-20T11:08:22.228+0800: 1433853.825: [JNI Weak Reference, 0.0000575 secs] 1433854.362: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 3837788160 bytes, allocation request: 0 bytes, threshold: 2147483600 bytes (50.00 %), source: end of GC]
(to-space exhausted), 1.3927102 secs]
# 比較要命的一句話,代表已經(jīng)沒有region給我們copy對象了
[Parallel Time: 837.7 ms, GC Workers: 18]
[GC Worker Start (ms): Min: 1433852972.1, Avg: 1433852972.2, Max: 1433852972.3, Diff: 0.2]
[Ext Root Scanning (ms): Min: 4.9, Avg: 6.1, Max: 13.7, Diff: 8.8, Sum: 110.1]
[Update RS (ms): Min: 0.9, Avg: 13.2, Max: 20.0, Diff: 19.2, Sum: 237.1]
[Processed Buffers: Min: 2, Avg: 88.1, Max: 159, Diff: 157, Sum: 1586]
[Scan RS (ms): Min: 1.7, Avg: 3.3, Max: 3.8, Diff: 2.1, Sum: 60.3]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[Object Copy (ms): Min: 799.5, Avg: 814.1, Max: 827.7, Diff: 28.1, Sum: 14653.6]
[Termination (ms): Min: 0.0, Avg: 0.6, Max: 0.9, Diff: 0.9, Sum: 10.5]
[Termination Attempts: Min: 1, Avg: 1.2, Max: 3, Diff: 2, Sum: 22]
[GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 1.5]
[GC Worker Total (ms): Min: 837.2, Avg: 837.4, Max: 837.6, Diff: 0.3, Sum: 15073.2]
[GC Worker End (ms): Min: 1433853809.5, Avg: 1433853809.6, Max: 1433853809.7, Diff: 0.2]
[Code Root Fixup: 0.1 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.7 ms]
[Other: 554.2 ms]
[Evacuation Failure: 533.0 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 14.8 ms]
[Ref Enq: 0.5 ms]
[Redirty Cards: 0.8 ms]
[Humongous Register: 0.2 ms]
[Humongous Reclaim: 0.6 ms]
[Free CSet: 0.7 ms]
[Eden: 2312.0M(2316.0M)->0.0B(204.0M) Survivors: 38.0M->0.0B Heap: 3941.0M(4096.0M)->3610.3M(4096.0M)]
[Times: user=7.13 sys=0.00, real=1.39 secs]
1433854.363: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 3839885312 bytes, allocation request: 0 bytes, threshold: 2147483600 bytes (50.00 %), source: STW humongous allocation]
For G1 GC, an evacuation failure is very expensive -
- For successfully copied objects, G1 needs to update the references and the regions have to be tenured.
- For unsuccessfully copied objects, G1 will self-forward them and tenure the regions in place.
- 繼續(xù)向上尋找蛛絲馬跡
2020-03-20T11:08:27.709+0800: 1433859.305: [SoftReference, 0 refs, 0.0019565 secs]2020-03-20T11:08:27.711+0800: 1433859.307: [WeakReference, 13 refs, 0.0009893 secs]2020-03-20T11:08:27.712+0800: 1433859.308: [FinalReference, 102 refs, 0.0014151 secs]2020-03-20T11:08:27.713+0800: 1433859.309: [PhantomReference, 0 refs, 64 refs, 0.0021785 secs]2020-03-20T11:08:27.715+0800: 1433859.312: [JNI Weak Reference, 0.0000454 secs] 1433859.315: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 68.78 %, threshold: 10.00 %, uncommitted: 0 bytes, calculated expansion amount: 0 bytes (20.00 %)]
, 0.0483084 secs]
# GC占用時間已經(jīng)到68%這么高的地步了,證明應(yīng)用執(zhí)行時間很短了,此時吞吐量是嚴(yán)重下降的.
總結(jié)
基于以上現(xiàn)象,考慮到幾種方式解決:
- 增加堆內(nèi)存大小, 給GC“喘息”的時間;
- 減小
XX:InitiatingHeapOccupancyPercent
參數(shù),我們設(shè)置的是50,可以減小來提前進(jìn)行concurrent marking cycle
,提前mix gc
; - 增加處理GC的線程數(shù)
XX:ConcGCThreads
,加快處理GC速度,減少時間; - 避免發(fā)生
to-space exhausted
問題,增加預(yù)留空間XX:G1ReservePercent
- 因為發(fā)現(xiàn)問題時正好時線上摘掉了一臺機(jī)器,那么整體來看是單臺機(jī)器流量變大導(dǎo)致的對象分配速度大于回收的速度,最終導(dǎo)致了full gc,那么我們采用了最簡單粗暴的方式擴(kuò)容,問題消失;