At 00:18 on June 13 the Zookeeper process died. The JVM error file it wrote on exit, hs_err_pid5829.log, showed that the zookeeper process had died from an out-of-memory error.
In theory zookeeper stores very little, so it should not die from running out of memory.
The end of that file records the server's memory state at the time. This server has 256G of physical memory, yet only about 1.7G was free when zookeeper died. That was surprising: when memory on this server was allocated, about 50G was held in reserve, so there should not have been so little left.
The Diting (諦聽) monitoring showed that memory usage on this server kept climbing until all physical memory was consumed, then dropped sharply, then slowly climbed back to the maximum, over and over.
Our guess was that some process on the server was leaking memory and consuming everything the server had. As a result, that process and others (for example the zookeeper process and the JN process that had died earlier) could not obtain the memory they were configured for and died reporting out-of-memory errors, even though none of them had genuinely outgrown its own limits.
The top command showed a process owned by the yarn user using more than 80% of memory, which means nearly 190G was held by that one process. Looking up the process ID with ps identified it as the timelineserver process. The puzzle was that we had given this process 100G of heap plus 10G of off-heap memory, so in theory it should use at most about 110G, yet it was actually using close to 200G. Where had the other 80G gone?
(The sawtooth in server memory came from a crontab script that periodically checks whether timelineserver is alive and restarts it when it has died.)
The first guess was an off-heap memory leak, but how could that be investigated?
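We added -XX:NativeMemoryTracking=detail to the startup options to track the JVM's off-heap memory (this option costs roughly 5%~10% in performance), and then ran
jcmd 22215 VM.native_memory summary scale=GB
to see the JVM's own view of its memory consumption (including the memory used by NMT itself), while the linux command ps -p 22215 -o rss,vsz reports the memory the process actually occupies as seen by linux, as shown in the figure below: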
The figure shows that even with native memory tracking enabled, the JVM could account for only 105G, which fits within the configured 110G. So the process had neither a heap leak nor a direct memory leak. Where, then, had the other nearly 80G gone?
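The same comparison, JVM-reported memory against what the operating system charges to the process, can be made from inside a running JVM. The following is a minimal sketch, not the tooling used in this investigation: the class name RssVsJvm and its helpers are hypothetical, it assumes a Linux /proc filesystem, and the JVM beans it reads cover only heap and JVM-managed non-heap pools, so the difference is a rough indicator at best.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class RssVsJvm {
    /** Extracts the VmRSS value (in kB) from the lines of /proc/<pid>/status; -1 if absent. */
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // Line format: "VmRSS:     123456 kB"
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1L;
    }

    /** Resident set size of this JVM process in kB, or -1 when /proc is unavailable. */
    static long currentRssKb() {
        Path status = Paths.get("/proc/self/status");
        if (!Files.isReadable(status)) {
            return -1L;
        }
        try {
            return parseVmRssKb(Files.readAllLines(status));
        } catch (IOException e) {
            return -1L;
        }
    }

    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        long heapKb = mem.getHeapMemoryUsage().getCommitted() / 1024;
        long nonHeapKb = mem.getNonHeapMemoryUsage().getCommitted() / 1024;
        long rssKb = currentRssKb();
        System.out.println("heap committed (kB):     " + heapKb);
        System.out.println("non-heap committed (kB): " + nonHeapKb);
        System.out.println("process RSS (kB):        " + rssKb);
        if (rssKb >= 0) {
            // Rough gap between what the OS charges and what the JVM beans report.
            // Committed memory is not always resident, so this can even be negative.
            System.out.println("difference (kB):         " + (rssKb - heapKb - nonHeapKb));
        }
    }
}
```

A gap that keeps growing while the JVM-side numbers stay flat points toward memory the JVM does not manage, which is the situation described here.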
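Studying the JVM memory model turned up one more category: memory allocated by code called through JNI, which the JVM does not account for. JNI calls into C or C++ code, and the allocations and frees that code performs internally are neither controlled nor counted by the JVM. timelineserver stores its data in leveldb, and leveldb is implemented in C++, so the leak was almost certainly off-heap memory allocated through JNI.
I do not know C++, let alone how to hunt memory leaks in it, so how to proceed? The C++ experts I asked said to debug the C++ program step by step and look for memory that is never freed. The JVM experts said to read the implementation and look for anything suspicious.
That left me at a loss. JNI off-heap leaks are hard to track down, so in the end I went through the recently modified code looking for clues, and one problem did turn up. Consider the following code: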
  public TimelineEvents getEntityTimelines(String entityType,
      SortedSet<String> entityIds, Long limit, Long windowStart,
      Long windowEnd, Set<String> eventType) throws IOException {
    TimelineEvents events = new TimelineEvents();
    if (entityIds == null || entityIds.isEmpty()) {
      return events;
    }
    // create a lexicographically-ordered map from start time to entities
    Map<byte[], List<EntityIdentifier>> startTimeMap =
        new TreeMap<byte[], List<EntityIdentifier>>(
        new Comparator<byte[]>() {
          @Override
          public int compare(byte[] o1, byte[] o2) {
            return WritableComparator.compareBytes(o1, 0, o1.length, o2, 0,
                o2.length);
          }
        });
    DBIterator iterator = null;
    try {
      // look up start times for the specified entities
      // skip entities with no start time
      for (String entityId : entityIds) {
        byte[] startTime = getStartTime(entityId, entityType);
        if (startTime != null) {
          List<EntityIdentifier> entities = startTimeMap.get(startTime);
          if (entities == null) {
            entities = new ArrayList<EntityIdentifier>();
            startTimeMap.put(startTime, entities);
          }
          entities.add(new EntityIdentifier(entityId, entityType));
        }
      }
      for (Entry<byte[], List<EntityIdentifier>> entry : startTimeMap
          .entrySet()) {
        // look up the events matching the given parameters (limit,
        // start time, end time, event types) for entities whose start times
        // were found and add the entities to the return list
        byte[] revStartTime = entry.getKey();
        for (EntityIdentifier entityIdentifier : entry.getValue()) {
          EventsOfOneEntity entity = new EventsOfOneEntity();
          entity.setEntityId(entityIdentifier.getId());
          entity.setEntityType(entityType);
          events.addEvent(entity);
          KeyBuilder kb = KeyBuilder.newInstance().add(entityType)
              .add(revStartTime).add(entityIdentifier.getId())
              .add(EVENTS_COLUMN);
          byte[] prefix = kb.getBytesForLookup();
          if (windowEnd == null) {
            windowEnd = Long.MAX_VALUE;
          }
          byte[] revts = writeReverseOrderedLong(windowEnd);
          kb.add(revts);
          byte[] first = kb.getBytesForLookup();
          byte[] last = null;
          if (windowStart != null) {
            last = KeyBuilder.newInstance().add(prefix)
                .add(writeReverseOrderedLong(windowStart)).getBytesForLookup();
          }
          if (limit == null) {
            limit = DEFAULT_LIMIT;
          }
          DB db = entitydb.getDBForStartTime(readReverseOrderedLong(
              revStartTime, 0));
          if (db == null) {
            continue;
          }
          iterator = db.iterator();
          for (iterator.seek(first); entity.getEvents().size() < limit
              && iterator.hasNext(); iterator.next()) {
            byte[] key = iterator.peekNext().getKey();
            if (!prefixMatches(prefix, prefix.length, key)
                || (last != null && WritableComparator.compareBytes(key, 0,
                    key.length, last, 0, last.length) > 0)) {
              break;
            }
            TimelineEvent event = getEntityEvent(eventType, key, prefix.length,
                iterator.peekNext().getValue());
            if (event != null) {
              entity.addEvent(event);
            }
          }
        }
      }
    } finally {
      IOUtils.cleanup(LOG, iterator);
    }
    return events;
  }
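The part to focus on is how iterator is handled: it is declared before the try block, assigned by iterator = db.iterator() inside the inner loop, and cleaned up in finally. The call IOUtils.cleanup(LOG, iterator) in finally makes it look as if the iterator is closed, but it only closes the last DBIterator object that iterator points to. Because the assignment sits inside nested loops, iterator is repeatedly pointed at a new object, and close() is never called on the DBIterator objects created along the way. So the code above was changed to the following: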
  public TimelineEvents getEntityTimelines(String entityType,
      SortedSet<String> entityIds, Long limit, Long windowStart,
      Long windowEnd, Set<String> eventType) throws IOException {
    TimelineEvents events = new TimelineEvents();
    if (entityIds == null || entityIds.isEmpty()) {
      return events;
    }
    // create a lexicographically-ordered map from start time to entities
    Map<byte[], List<EntityIdentifier>> startTimeMap =
        new TreeMap<byte[], List<EntityIdentifier>>(
        new Comparator<byte[]>() {
          @Override
          public int compare(byte[] o1, byte[] o2) {
            return WritableComparator.compareBytes(o1, 0, o1.length, o2, 0,
                o2.length);
          }
        });
    // look up start times for the specified entities
    // skip entities with no start time
    for (String entityId : entityIds) {
      byte[] startTime = getStartTime(entityId, entityType);
      if (startTime != null) {
        List<EntityIdentifier> entities = startTimeMap.get(startTime);
        if (entities == null) {
          entities = new ArrayList<EntityIdentifier>();
          startTimeMap.put(startTime, entities);
        }
        entities.add(new EntityIdentifier(entityId, entityType));
      }
    }
    for (Entry<byte[], List<EntityIdentifier>> entry : startTimeMap
        .entrySet()) {
      // look up the events matching the given parameters (limit,
      // start time, end time, event types) for entities whose start times
      // were found and add the entities to the return list
      byte[] revStartTime = entry.getKey();
      for (EntityIdentifier entityIdentifier : entry.getValue()) {
        EventsOfOneEntity entity = new EventsOfOneEntity();
        entity.setEntityId(entityIdentifier.getId());
        entity.setEntityType(entityType);
        events.addEvent(entity);
        KeyBuilder kb = KeyBuilder.newInstance().add(entityType)
            .add(revStartTime).add(entityIdentifier.getId())
            .add(EVENTS_COLUMN);
        byte[] prefix = kb.getBytesForLookup();
        if (windowEnd == null) {
          windowEnd = Long.MAX_VALUE;
        }
        byte[] revts = writeReverseOrderedLong(windowEnd);
        kb.add(revts);
        byte[] first = kb.getBytesForLookup();
        byte[] last = null;
        if (windowStart != null) {
          last = KeyBuilder.newInstance().add(prefix)
              .add(writeReverseOrderedLong(windowStart)).getBytesForLookup();
        }
        if (limit == null) {
          limit = DEFAULT_LIMIT;
        }
        DB db = entitydb.getDBForStartTime(readReverseOrderedLong(
            revStartTime, 0));
        if (db == null) {
          continue;
        }
        try (DBIterator iterator = db.iterator()) {
          for (iterator.seek(first); entity.getEvents().size() < limit
              && iterator.hasNext(); iterator.next()) {
            byte[] key = iterator.peekNext().getKey();
            if (!prefixMatches(prefix, prefix.length, key)
                || (last != null && WritableComparator.compareBytes(key, 0,
                    key.length, last, 0, last.length) > 0)) {
              break;
            }
            TimelineEvent event = getEntityEvent(eventType, key, prefix.length,
                iterator.peekNext().getValue());
            if (event != null) {
              entity.addEvent(event);
            }
          }
        }
      }
    }
    return events;
  }
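The difference between the two versions can be reproduced in isolation. The sketch below is only an illustration: FakeIterator is a hypothetical stand-in for DBIterator that counts how many instances are still open, and the loops are reduced to the bare shape of the real method.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicInteger;

class ReassignedResourceLeak {
    /** Number of FakeIterator instances created but not yet closed. */
    static final AtomicInteger OPEN = new AtomicInteger();

    /** Hypothetical stand-in for a DBIterator that holds native memory until close(). */
    static final class FakeIterator implements Closeable {
        private boolean closed;

        FakeIterator() {
            OPEN.incrementAndGet();
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                OPEN.decrementAndGet();
            }
        }
    }

    /** Buggy shape: one variable, many assignments, a single cleanup. Returns the leaked count. */
    static int leakyLoop(int passes) {
        OPEN.set(0);
        FakeIterator iterator = null;
        try {
            for (int i = 0; i < passes; i++) {
                iterator = new FakeIterator(); // the previous instance is dropped unclosed
            }
        } finally {
            if (iterator != null) {
                iterator.close(); // closes only the last instance
            }
        }
        return OPEN.get();
    }

    /** Fixed shape: every pass owns its iterator and closes it before the next pass starts. */
    static int fixedLoop(int passes) {
        OPEN.set(0);
        for (int i = 0; i < passes; i++) {
            try (FakeIterator iterator = new FakeIterator()) {
                // the iterator would be used here
            }
        }
        return OPEN.get();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leakyLoop(5)); // prints "leaky: 4"
        System.out.println("fixed: " + fixedLoop(5)); // prints "fixed: 0"
    }
}
```

With N passes the buggy shape leaves N - 1 iterators open per call, so a frequently called method can accumulate a large number of unreleased native objects.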
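Again, focus on how iterator is handled: it is now declared in a try-with-resources statement (available since Java 7) inside the innermost entity loop. Each time that try block exits, the compiler-generated code calls close() on the object iterator refers to, without any explicit close() call in the source. That raises the question: why does calling close() here stop the memory leak?
The answer is in the close() method of the DBIterator implementation: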
    public void close() {
        iterator.delete();
    }
The delete() method it calls looks like this:
    public void delete() {
        assertAllocated();
        IteratorJNI.delete(self);
        self = 0;
    }
And IteratorJNI.delete() is declared as follows:
    @JniMethod(flags={CPP_DELETE})
    public static final native void delete(long self);
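As this shows, close() ends up calling the underlying C++ delete through JNI to release the native object. In other words, because the release was never triggered explicitly, the memory allocated in the underlying C++ code could not be freed, and that is what caused the JNI memory leak.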
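The relationship between the small Java wrapper and the native memory behind it can be sketched as follows. This is a simulation under stated assumptions, not leveldbjni code: the native heap is modelled as a plain map, and the names NativeHandleSketch, NativeIteratorLike and NATIVE_HEAP are hypothetical. It also assumes the wrapper has no finalizer or cleaner that would release the native object on its behalf.

```java
import java.util.HashMap;
import java.util.Map;

class NativeHandleSketch {
    /** Simulated native heap: handle -> bytes. Stands in for memory allocated by C++ code. */
    static final Map<Long, Integer> NATIVE_HEAP = new HashMap<>();
    static long nextHandle = 1;

    static long nativeBytesInUse() {
        long total = 0;
        for (int size : NATIVE_HEAP.values()) {
            total += size;
        }
        return total;
    }

    /** Hypothetical wrapper shaped like the delete() method quoted above. */
    static final class NativeIteratorLike {
        private long self; // "pointer" to the native object; 0 means already released

        NativeIteratorLike(int nativeSize) {
            self = nextHandle++;
            NATIVE_HEAP.put(self, nativeSize); // simulated native allocation
        }

        void delete() {
            if (self == 0) {
                throw new IllegalStateException("already deleted");
            }
            NATIVE_HEAP.remove(self); // simulated C++ delete
            self = 0;
        }
    }

    /** Creates two wrappers, releases one, drops the other; returns native bytes still held. */
    static long demo() {
        NATIVE_HEAP.clear();
        NativeIteratorLike kept = new NativeIteratorLike(1024);
        new NativeIteratorLike(4096); // reference dropped without delete()
        kept.delete();
        return nativeBytesInUse();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 4096
    }
}
```

On the Java heap each wrapper costs only a few bytes (one long field), so garbage collection pressure gives no hint of how much native memory is still allocated behind it.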
After the code change we watched the process for several hours and the memory leak no longer occurred. With that, the problem was solved.