背景
線上有個流量比較大的服務(wù)恋拷,qps日常在2000-3000之間敢订,請求方對耗時要求比較高王污,響應(yīng)時間要求在300ms以內(nèi),服務(wù)上線以后楚午,運行也比較平穩(wěn)昭齐。某天夜里,突然有幾千個dubbo請求超時矾柜,找了幾個請求查看調(diào)用鏈路發(fā)現(xiàn)超時的地方各不相同阱驾,服務(wù)依賴的接口方均表示服務(wù)沒有異常。
這種情況下怪蔑,一般就是服務(wù)自身的問題了里覆,仔細(xì)看了下dubbo的超時,發(fā)現(xiàn)超時的都是同一個provider缆瓣,于是看了下這個出問題的provider租谈,看了下這臺機(jī)器的gc情況,發(fā)現(xiàn)超時時有一次full gc捆愁,gc耗時達(dá)到了1s以上割去。
full gc時間過長問題排查
gc日志查看
發(fā)現(xiàn)full gc時間過長以后,就去對應(yīng)的機(jī)器上把gc log下載了下來昼丑,對gc日志進(jìn)行分析呻逆。
36402.792: [GC (Allocation Failure) 36402.792: [ParNew: 2105440K->7584K(2359296K), 0.0124564 secs] 4517143K->2419327K(4980736K), 0.0129547 secs] [Times: user=0.13 sys=0.00, real=0.01 secs]
36402.810: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2411743K(2621440K)] 2419601K(4980736K), 0.0037437 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
36402.814: [CMS-concurrent-mark-start]
36403.622: [CMS-concurrent-mark: 0.800/0.808 secs] [Times: user=5.73 sys=0.26, real=0.81 secs]
36403.623: [CMS-concurrent-preclean-start]
36403.801: [CMS-concurrent-preclean: 0.177/0.178 secs] [Times: user=0.69 sys=0.05, real=0.18 secs]
36403.801: [CMS-concurrent-abortable-preclean-start]
36403.998: [GC (Allocation Failure) 36403.998: [ParNew: 2104736K->7273K(2359296K), 0.0122722 secs] 4516479K->2419078K(4980736K), 0.0127518 secs] [Times: user=0.13 sys=0.01, real=0.02 secs]
36404.672: [CMS-concurrent-abortable-preclean: 0.826/0.870 secs] [Times: user=3.39 sys=0.26, real=0.87 secs]
36404.677: [GC (CMS Final Remark) [YG occupancy: 1097431 K (2359296 K)]36404.677: [GC (CMS Final Remark) 36404.678: [ParNew: 1097431K->7946K(2359296K), 0.0146413 secs] 3509235K->2419797K(4980736K), 0.0151127 secs] [Times: user=0.17 sys=0.00, real=0.02 secs]
36404.693: [Rescan (parallel) , 0.0044661 secs]36404.697: [weak refs processing, 1.6796757 secs]36406.377: [class unloading, 0.1835045 secs]36406.560: [scrub symbol table, 0.0291043 secs]36406.590: [scrub string table, 0.0029833 secs][1 CMS-remark: 2411851K(2621440K)] 2419797K(4980736K), 2.1253433 secs] [Times: user=2.33 sys=0.00, real=2.13 secs]
36406.803: [CMS-concurrent-sweep-start]
36409.458: [GC (Allocation Failure) 36409.458: [ParNew: 2105098K->53837K(2359296K), 0.0675226 secs] 4328721K->2277488K(4980736K), 0.0681548 secs] [Times: user=0.86 sys=0.00, real=0.07 secs]
36410.035: [GC (Allocation Failure) 36410.035: [ParNew: 2150989K->63917K(2359296K), 0.0566114 secs] 4128382K->2041367K(4980736K), 0.0572665 secs] [Times: user=0.71 sys=0.00, real=0.06 secs]
36410.966: [GC (Allocation Failure) 36410.966: [ParNew: 2161069K->51123K(2359296K), 0.0144844 secs] 3640286K->1530366K(4980736K), 0.0149783 secs] [Times: user=0.16 sys=0.00, real=0.02 secs]
36411.552: [CMS-concurrent-sweep: 2.287/4.749 secs] [Times: user=23.03 sys=1.41, real=4.75 secs]
36411.552: [CMS-concurrent-reset-start]
36411.558: [CMS-concurrent-reset: 0.005/0.005 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
gc日志里面各項代表的意義可以看這篇文章:http://www.reibang.com/p/a5b3905bf225
我們的老年代收集算法用的是cms算法,現(xiàn)在主要看一下里面會stw的兩個階段耗時菩帝,Initial Mark階段耗時real是0.01秒咖城,耗時很低。然后就是重新標(biāo)記Final Remark階段呼奢,耗時達(dá)到了2.13s宜雀,這里面看具體的耗時,weak refs processing, 1.6796757 secs這個占用了比較久的時間握础,那么這個弱引用處理辐董,究竟在處理什么呢,就要看具體dump出來的堆了禀综。
weak refs processing處理邏輯
weak refs processing的源碼如下:
void ReferenceProcessor::process_discovered_references(
BoolObjectClosure* is_alive,
OopClosure* keep_alive,
VoidClosure* complete_gc,
AbstractRefProcTaskExecutor* task_executor) {
NOT_PRODUCT(verify_ok_to_handle_reflists());
assert(!enqueuing_is_done(), "If here enqueuing should not be complete");
// Stop treating discovered references specially.
disable_discovery();
bool trace_time = PrintGCDetails && PrintReferenceGC;
// Soft references
{
TraceTime tt("SoftReference", trace_time, false, gclog_or_tty);
process_discovered_reflist(_discoveredSoftRefs, _current_soft_ref_policy, true,
is_alive, keep_alive, complete_gc, task_executor);
}
update_soft_ref_master_clock();
// Weak references
{
TraceTime tt("WeakReference", trace_time, false, gclog_or_tty);
process_discovered_reflist(_discoveredWeakRefs, NULL, true,
is_alive, keep_alive, complete_gc, task_executor);
}
// Final references
{
TraceTime tt("FinalReference", trace_time, false, gclog_or_tty);
process_discovered_reflist(_discoveredFinalRefs, NULL, false,
is_alive, keep_alive, complete_gc, task_executor);
}
// Phantom references
{
TraceTime tt("PhantomReference", trace_time, false, gclog_or_tty);
process_discovered_reflist(_discoveredPhantomRefs, NULL, false,
is_alive, keep_alive, complete_gc, task_executor);
}
// Weak global JNI references. It would make more sense (semantically) to
// traverse these simultaneously with the regular weak references above, but
// that is not how the JDK1.2 specification is. See #4126360. Native code can
// thus use JNI weak references to circumvent the phantom references and
// resurrect a "post-mortem" object.
{
TraceTime tt("JNI Weak Reference", trace_time, false, gclog_or_tty);
if (task_executor != NULL) {
task_executor->set_single_threaded_mode();
}
process_phaseJNI(is_alive, keep_alive, complete_gc);
}
}
看JVM源碼简烘,weak refs processing主要包括SoftReference苔严、WeakReference、FinalReference孤澎、PhantomReference以及JNI Weak Reference這五種Reference對象的處理届氢,處理的主要內(nèi)容是對之前標(biāo)記的Reference對象重新處理,重新判斷是否需要標(biāo)記(不標(biāo)記就是要回收的)覆旭,如果不標(biāo)記就需要放到refqueue里退子,等待java ReferenceHandler線程處理。
所以從工作的主要內(nèi)容看各種Reference的處理時間跟reference的個數(shù)成正比型将。
dump出堆進(jìn)行分析
怎么查看堆內(nèi)都有哪些對象呢
初步定位
簡單查看對象數(shù)目可以使用下面命令寂祥,初步定位問題。
// 這里面7代表要查看進(jìn)程id
jmap -histo:live 7 > dump_7
這個dump_7中就會有堆中對象的個數(shù)和占用空間大小信息茶敏,這個由于我沒有線上機(jī)權(quán)限壤靶,并沒有執(zhí)行。
dump出堆進(jìn)行分析
這里使用如下命令可以dump出堆惊搏,這個操作會影響生產(chǎn)環(huán)境應(yīng)用贮乳,慎用
jmap -dump:format=b,file=headdump_7.hprof 7
dump出來堆文件后,將這個文件用mat打開恬惯,可以看到內(nèi)存泄漏暴露大概的地方:
里面的內(nèi)容復(fù)制出來如下:
The class "com.mysql.cj.jdbc.AbandonedConnectionCleanupThread", loaded by "org.apache.catalina.loader.ParallelWebappClassLoader @ 0x721856020", occupies 119,594,176 (45.21%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node[]", loaded by "<system class loader>", which occupies 119,594,016 (45.21%) bytes.
Keywords:
com.mysql.cj.jdbc.AbandonedConnectionCleanupThread
org.apache.catalina.loader.ParallelWebappClassLoader @ 0x721856020
java.util.concurrent.ConcurrentHashMap$Node[]
查看詳情可以看到向拆,connectionFinalizerPhantomRefs占用了其中大部分空間,下一節(jié)會分析下AbandonedConnectionCleanupThread類中的connectionFinalizerPhantomRefs:
AbandonedConnectionCleanupThread類分析
AbandonedConnectionCleanupThread類主要變量
AbandonedConnectionCleanupThread類維護(hù)了一個單例的單線程池酪耳,負(fù)責(zé)定期清理已回收的connection連接浓恳。
public class AbandonedConnectionCleanupThread implements Runnable {
// 存放connection對應(yīng)的幻象引用集合
private static final Set<ConnectionFinalizerPhantomReference> connectionFinalizerPhantomRefs = ConcurrentHashMap.newKeySet();
// 引用隊列,從這個里面取連接出來進(jìn)行釋放
private static final ReferenceQueue<MysqlConnection> referenceQueue = new ReferenceQueue<>();
private static final ExecutorService cleanupThreadExcecutorService;
private static Thread threadRef = null;
private static Lock threadRefLock = new ReentrantLock();
static {
cleanupThreadExcecutorService = Executors.newSingleThreadExecutor(r -> {
Thread t = new Thread(r, "mysql-cj-abandoned-connection-cleanup");
t.setDaemon(true);
// Tie the thread's context ClassLoader to the ClassLoader that loaded the class instead of inheriting the context ClassLoader from the current
// thread, which would happen by default.
// Application servers may use this information if they attempt to shutdown this thread. By leaving the default context ClassLoader this thread
// could end up being shut down even when it is shared by other applications and, being it statically initialized, thus, never restarted again.
ClassLoader classLoader = AbandonedConnectionCleanupThread.class.getClassLoader();
if (classLoader == null) {
// This class was loaded by the Bootstrap ClassLoader, so lets tie the thread's context ClassLoader to the System ClassLoader instead.
classLoader = ClassLoader.getSystemClassLoader();
}
t.setContextClassLoader(classLoader);
return threadRef = t;
});
cleanupThreadExcecutorService.execute(new AbandonedConnectionCleanupThread());
}
}
加入connectionFinalizerPhantomRefs分析
下面看下連接是什么時候加入到connectionFinalizerPhantomRefs的碗暗。
// com.mysql.cj.jdbc.AbandonedConnectionCleanupThread
/**
* Tracks the finalization of a {@link MysqlConnection} object and keeps a reference to its {@link NetworkResources} so that they can be later released.
*
* @param conn
* the Connection object to track for finalization
* @param io
* the network resources to close on the connection finalization
*/
protected static void trackConnection(MysqlConnection conn, NetworkResources io) {
threadRefLock.lock();
try {
if (isAlive()) {
ConnectionFinalizerPhantomReference reference = new ConnectionFinalizerPhantomReference(conn, io, referenceQueue);
connectionFinalizerPhantomRefs.add(reference);
}
} finally {
threadRefLock.unlock();
}
}
這個trackConnection方法是在生產(chǎn)連接的時候會被調(diào)用颈将,代碼如下:
// com.mysql.cj.jdbc.ConnectionImpl
/**
* Creates a connection to a MySQL Server.
*
* @param hostInfo
* the {@link HostInfo} instance that contains the host, user and connections attributes for this connection
* @exception SQLException
* if a database access error occurs
*/
public ConnectionImpl(HostInfo hostInfo) throws SQLException {
try {
// 省去連接生成過程
} catch (CJException e1) {
throw SQLExceptionsMapping.translateException(e1, getExceptionInterceptor());
}
try {
createNewIO(false);
unSafeQueryInterceptors();
// 將連接加入到清理線程中的幻象引用集合中
AbandonedConnectionCleanupThread.trackConnection(this, this.getSession().getNetworkResources());
} catch (SQLException ex) {
cleanup(ex);
// don't clobber SQL exceptions
throw ex;
} catch (Exception ex) {
cleanup(ex);
throw SQLError
.createSQLException(
this.propertySet.getBooleanProperty(PropertyKey.paranoid).getValue() ? Messages.getString("Connection.0")
: Messages.getString("Connection.1",
new Object[] { this.session.getHostInfo().getHost(), this.session.getHostInfo().getPort() }),
MysqlErrorNumbers.SQL_STATE_COMMUNICATION_LINK_FAILURE, ex, getExceptionInterceptor());
}
}
ConnectionFinalizerPhantomReference類
下面看下ConnectionFinalizerPhantomReference連接清理幻象引用的實現(xiàn),這個類的構(gòu)造方法中有MysqlConnection 數(shù)據(jù)庫連接對象,NetworkResources 數(shù)據(jù)庫連接資源言疗,ReferenceQueue引用隊列晴圾,用于在被PhantomReference引用的connection被gc時,JVM會將PhantomReference對象扔到ReferenceQueue中噪奄。
/**
* {@link PhantomReference} subclass to track {@link MysqlConnection} objects finalization.
* This class holds a reference to the Connection's {@link NetworkResources} so they it can be later closed.
*/
private static class ConnectionFinalizerPhantomReference extends PhantomReference<MysqlConnection> {
private NetworkResources networkResources;
ConnectionFinalizerPhantomReference(MysqlConnection conn, NetworkResources networkResources, ReferenceQueue<? super MysqlConnection> refQueue) {
super(conn, refQueue);
this.networkResources = networkResources;
}
void finalizeResources() {
if (this.networkResources != null) {
try {
this.networkResources.forceClose();
} finally {
this.networkResources = null;
}
}
}
}
連接關(guān)閉的操作
放到引用隊列以后死姚,AbandonedConnectionCleanupThread中前面提到的線程池會去執(zhí)行關(guān)閉連接的操作。
public void run() {
for (;;) {
try {
checkThreadContextClassLoader();
Reference<? extends MysqlConnection> reference = referenceQueue.remove(5000);
if (reference != null) {
// 釋放連接資源
finalizeResource((ConnectionFinalizerPhantomReference) reference);
}
} catch (InterruptedException e) {
threadRefLock.lock();
try {
threadRef = null;
// Finalize remaining references.
Reference<? extends MysqlConnection> reference;
while ((reference = referenceQueue.poll()) != null) {
finalizeResource((ConnectionFinalizerPhantomReference) reference);
}
connectionFinalizerPhantomRefs.clear();
} finally {
threadRefLock.unlock();
}
return;
} catch (Exception ex) {
// Nowhere to really log this.
}
}
}
finalizeResource實現(xiàn):
/**
* Release resources from the given {@link ConnectionFinalizerPhantomReference} and remove it from the references set.
*
* @param reference
* the {@link ConnectionFinalizerPhantomReference} to finalize.
*/
private static void finalizeResource(ConnectionFinalizerPhantomReference reference) {
try {
// 調(diào)用引用中回收資源的方法
reference.finalizeResources();
reference.clear();
} finally {
// 引用從集合中刪除
connectionFinalizerPhantomRefs.remove(reference);
}
}
ConnectionFinalizerPhantomReference#finalizeResources方法實現(xiàn):
/**
* {@link PhantomReference} subclass to track {@link MysqlConnection} objects finalization.
* This class holds a reference to the Connection's {@link NetworkResources} so they it can be later closed.
*/
private static class ConnectionFinalizerPhantomReference extends PhantomReference<MysqlConnection> {
private NetworkResources networkResources;
ConnectionFinalizerPhantomReference(MysqlConnection conn, NetworkResources networkResources, ReferenceQueue<? super MysqlConnection> refQueue) {
super(conn, refQueue);
this.networkResources = networkResources;
}
void finalizeResources() {
if (this.networkResources != null) {
try {
// 關(guān)閉其中的網(wǎng)絡(luò)連接資源
this.networkResources.forceClose();
} finally {
this.networkResources = null;
}
}
}
}
看到這里大家應(yīng)該明白勤篮,是jdbc為每個connection都生成了一個ConnectionFinalizerPhantomReference都毒,目的是為了當(dāng)connection對象回收時,順便回收相關(guān)資源碰缔。這其實是一個保底操作账劲,是怕connnection資源被上層的連接池或者使用者忘記close,從而導(dǎo)致資源泄漏。
問題解決方式
現(xiàn)在知道了為什么ConnectionFinalizerPhantomReference會產(chǎn)生涤垫,下面分析下為什么會有這么多的ConnectionFinalizerPhantomReference姑尺,我們使用的是hikari線程池竟终,hikari連接池參數(shù)配置介紹見http://www.reibang.com/p/eb85103f29d6蝠猬。
我們現(xiàn)有的配置如下:
- 最小連接數(shù)沒設(shè)置,默認(rèn)會和最大連接數(shù)一樣统捶,是100
- maxLifeTime用的是默認(rèn)值榆芦,30min,30分鐘后連接會被回收
也就是說每30分鐘都會產(chǎn)生100個無效的連接喘鸟,一天會產(chǎn)生4800個ConnectionFinalizerPhantomReference匆绣。
這個問題優(yōu)化的方式有兩種:
調(diào)整數(shù)據(jù)庫連接池配置
現(xiàn)在連接數(shù)100個過多,默認(rèn)的10個就夠用了什黑,這樣連接數(shù)的產(chǎn)生就會小10倍崎淳,然后可以調(diào)整maxLifeTime,數(shù)據(jù)庫默認(rèn)的連接空閑時間是8小時愕把,maxLifeTime可以調(diào)整為6個小時拣凹,這樣每天只會產(chǎn)生40個連接,這個ConnectionFinalizerPhantomReference過多的問題也就能解決了恨豁。
刪除connectionFinalizerPhantomRefs集合
數(shù)據(jù)庫連接的釋放有連接池保證嚣镜,這個保底機(jī)制其實是多余的,可以起個定時任務(wù)線程橘蜜,去清理這個connectionFinalizerPhantomRefs集合菊匿。
下面是一個舊版本的清理示例代碼。
參考連接
PhantomReference導(dǎo)致CMS GC耗時嚴(yán)重
PhantomReference 引發(fā)的GC問題