????6月13號,凌晨 00:18分Zookeeper進程掛掉酌呆,查看zookeeper進程在掛掉時輸出的JVM相關的錯誤文件hs_err_pid5829.log,可以看到zookeeper進程是因為內存溢出掛掉的,
????理論上顶捷,zookeeper里面沒存什么東西达椰,不應該會因為內存溢出掛掉翰蠢!
????仔細查看該文件,在末尾啰劲,看到了當時服務器的內存的狀態(tài)梁沧,發(fā)現該臺物理內存為256G的服務器,在zookeeper掛掉的時候只有1.7G左右的空閑內存蝇裤,感到很神奇廷支,因為在分配該臺服務器上的內存的時候是預留了50G左右,不應該只剩下那么少的內存
????通過查看服務器監(jiān)控發(fā)現栓辜,該臺服務器內存不斷的升高恋拍,直到物理內存全部消耗完,然后內存直接降下來藕甩,然后又慢慢的上高到最大施敢,如此重復,
????猜測該臺服務器上有某個進程發(fā)生了內存泄露狭莱,消耗了服務器的所有內存僵娃,導致自己和別的進程(比如zookeeper進程和之前掛掉的JN進程)因為申請不到配置的內存資源,最終因為內存不足腋妙,導致報內存溢出掛掉默怨,實際上不是真正的內存溢出,通過使用top命令骤素,發(fā)現匙睹,yarn用戶有一個進程愚屁,使用內存達到80%多,也就是將近190G的內存都被該進程占用痕檬,通過ps命令查找到這個進程ID最終定位該進程為timelineserver進程集绰,但是很好奇的是,這個進程谆棺,我們分配了100G的堆內存+10G的堆外內存栽燕,理論上最多只會消耗110G左右的內存,但是實際上卻消耗了200G內存改淑,另外的80G內存消耗在哪了碍岔?(至于服務器內存的波動是因為crontab中有一個定時檢測timelineserver的存活狀態(tài)的腳本,會主動的將掛掉的timelineserver啟動)
????通過在啟動參數中加上-XX:NativeMemoryTracking=detail來追蹤JVM堆外內存(該參數會帶來5%~10%的性能消耗)朵夏,通過命令
jcmd 22215 VM.native_memory summary scale=GB來查看JVM的內存消耗(包括NMT內存)蔼啦,
而使用linux的命令 ps -p 22215 -o rss,vsz是查看進程在linux中真實消耗的內存大小,如下圖:
????通過圖中發(fā)現仰猖,即便是追蹤JVM的堆外內存捏肢,發(fā)現能夠追蹤到的內存消耗只有105G,這符合設置的110G內存饥侵,說明該進程沒有發(fā)生堆內存泄露鸵赫,也沒有發(fā)生directory內存泄露,那么另外將近80G內存哪去了躏升?
????通過對JVM內存分配模型發(fā)現辩棒,還有一部分JNI調用的內存,是不在JVM統(tǒng)計的范圍之內的膨疏,因為JNI調用的是C或C++代碼一睁,C或C++代碼內部自己的內存申請和釋放是不受JVM管控的,不在JVM的統(tǒng)計范圍之內佃却,而timelineserver存儲數據是使用leveldb者吁,leveldb是使用C++ 語言實現的,因此現在確定是 JNI導致的堆外內存泄露饲帅,不會C++ 語言复凳,更不會檢測C++ 語言的內存泄露檢測,如何排查洒闸?請教C++ 語言大牛染坯,得到的結果均芽,自己慢慢調試C++ 程序丘逸,找沒有釋放的內存的地方,找JVM大牛掀宋,得到的結果是看程序實現的地方深纲,看能不能找到可疑的地方仲锄。
????一臉蒙圈,無從下手湃鹊,JNI堆外內存泄露很難排查儒喊,最終選擇查看最近修改的代碼,看能不能找到什么蛛絲馬跡币呵,最終發(fā)現了一點問題怀愧,看如下的代碼:
public TimelineEvents getEntityTimelines(String entityType,
??????SortedSet<String> entityIds, Long limit, Long windowStart,
??????Long windowEnd, Set<String> eventType) throws IOException {
????TimelineEvents events = new TimelineEvents();
????if (entityIds == null || entityIds.isEmpty()) {
??????return events;
????}
????// create a lexicographically-ordered map from start time to entities
????Map<byte[], List<EntityIdentifier>> startTimeMap =
????????new TreeMap<byte[], List<EntityIdentifier>>(
????????new Comparator<byte[]>() {
??????????@Override
??????????public int compare(byte[] o1, byte[] o2) {
????????????return WritableComparator.compareBytes(o1, 0, o1.length, o2, 0,
????????????????o2.length);
??????????}
????????});
????DBIterator iterator = null;
????try {
??????// look up start times for the specified entities
??????// skip entities with no start time
??????for (String entityId : entityIds) {
????????byte[] startTime = getStartTime(entityId, entityType);
????????if (startTime != null) {
??????????List<EntityIdentifier> entities = startTimeMap.get(startTime);
??????????if (entities == null) {
????????????entities = new ArrayList<EntityIdentifier>();
????????????startTimeMap.put(startTime, entities);
??????????}
??????????entities.add(new EntityIdentifier(entityId, entityType));
????????}
??????}
??????for (Entry<byte[], List<EntityIdentifier>> entry : startTimeMap
??????????.entrySet()) {
????????// look up the events matching the given parameters (limit,
????????// start time, end time, event types) for entities whose start times
????????// were found and add the entities to the return list
????????byte[] revStartTime = entry.getKey();
????????for (EntityIdentifier entityIdentifier : entry.getValue()) {
??????????EventsOfOneEntity entity = new EventsOfOneEntity();
??????????entity.setEntityId(entityIdentifier.getId());
??????????entity.setEntityType(entityType);
??????????events.addEvent(entity);
??????????KeyBuilder kb = KeyBuilder.newInstance().add(entityType)
??????????????.add(revStartTime).add(entityIdentifier.getId())
??????????????.add(EVENTS_COLUMN);
??????????byte[] prefix = kb.getBytesForLookup();
??????????if (windowEnd == null) {
????????????windowEnd = Long.MAX_VALUE;
??????????}
??????????byte[] revts = writeReverseOrderedLong(windowEnd);
??????????kb.add(revts);
??????????byte[] first = kb.getBytesForLookup();
??????????byte[] last = null;
??????????if (windowStart != null) {
????????????last = KeyBuilder.newInstance().add(prefix)
????????????????.add(writeReverseOrderedLong(windowStart)).getBytesForLookup();
??????????}
??????????if (limit == null) {
????????????limit = DEFAULT_LIMIT;
??????????}
??????????DB db = entitydb.getDBForStartTime(readReverseOrderedLong(
??????????????revStartTime, 0));
??????????if (db == null) {
????????????continue;
??????????}
??????????iterator = db.iterator();
??????????for (iterator.seek(first); entity.getEvents().size() < limit
??????????????&& iterator.hasNext(); iterator.next()) {
????????????byte[] key = iterator.peekNext().getKey();
????????????if (!prefixMatches(prefix, prefix.length, key)
????????????????|| (last != null && WritableComparator.compareBytes(key, 0,
????????????????????key.length, last, 0, last.length) > 0)) {
??????????????break;
????????????}
????????????TimelineEvent event = getEntityEvent(eventType, key, prefix.length,
????????????????iterator.peekNext().getValue());
????????????if (event != null) {
??????????????entity.addEvent(event);
????????????}
??????????}
????????}
??????}
????} finally {
??????IOUtils.cleanup(LOG, iterator);
????}
????return events;
??}
????主要看標紅的部分,看著iterator這個變量在finally中通過調用IOUtils.cleanup(LOG, iterator); 好像是被關閉了余赢,但實際上finally中關閉的只是最后一個iterator指向的DBIterator對象芯义,因為里面還有一層循環(huán),不斷的給iterator賦值別的變量妻柒,但是這些遍歷途中的DBIterator對象卻沒有被調用close()方法扛拨,因此將上面的代碼改成如下的方式:
??public TimelineEvents getEntityTimelines(String entityType,
??????SortedSet<String> entityIds, Long limit, Long windowStart,
??????Long windowEnd, Set<String> eventType) throws IOException {
????TimelineEvents events = new TimelineEvents();
????if (entityIds == null || entityIds.isEmpty()) {
??????return events;
????}
????// create a lexicographically-ordered map from start time to entities
????Map<byte[], List<EntityIdentifier>> startTimeMap =
????????new TreeMap<byte[], List<EntityIdentifier>>(
????????new Comparator<byte[]>() {
??????????@Override
??????????public int compare(byte[] o1, byte[] o2) {
????????????return WritableComparator.compareBytes(o1, 0, o1.length, o2, 0,
????????????????o2.length);
??????????}
????????});
??????// look up start times for the specified entities
??????// skip entities with no start time
????for (String entityId : entityIds) {
??????byte[] startTime = getStartTime(entityId, entityType);
??????if (startTime != null) {
????????List<EntityIdentifier> entities = startTimeMap.get(startTime);
????????if (entities == null) {
??????????entities = new ArrayList<EntityIdentifier>();
??????????startTimeMap.put(startTime, entities);
????????}
????????entities.add(new EntityIdentifier(entityId, entityType));
??????}
????}
????for (Entry<byte[], List<EntityIdentifier>> entry : startTimeMap
??????????.entrySet()) {
??????// look up the events matching the given parameters (limit,
??????// start time, end time, event types) for entities whose start times
??????// were found and add the entities to the return list
??????byte[] revStartTime = entry.getKey();
??????for (EntityIdentifier entityIdentifier : entry.getValue()) {
????????EventsOfOneEntity entity = new EventsOfOneEntity();
????????entity.setEntityId(entityIdentifier.getId());
????????entity.setEntityType(entityType);
????????events.addEvent(entity);
????????KeyBuilder kb = KeyBuilder.newInstance().add(entityType)
????????????.add(revStartTime).add(entityIdentifier.getId())
????????????.add(EVENTS_COLUMN);
????????byte[] prefix = kb.getBytesForLookup();
????????if (windowEnd == null) {
??????????windowEnd = Long.MAX_VALUE;
????????}
????????byte[] revts = writeReverseOrderedLong(windowEnd);
????????kb.add(revts);
????????byte[] first = kb.getBytesForLookup();
????????byte[] last = null;
????????if (windowStart != null) {
??????????last = KeyBuilder.newInstance().add(prefix)
??????????????.add(writeReverseOrderedLong(windowStart)).getBytesForLookup();
????????}
????????if (limit == null) {
??????????limit = DEFAULT_LIMIT;
????????}
????????DB db = entitydb.getDBForStartTime(readReverseOrderedLong(
????????????revStartTime, 0));
????????if (db == null) {
??????????continue;
????????}
????????try (DBIterator iterator = db.iterator()) {
??????????for (iterator.seek(first); entity.getEvents().size() < limit
??????????????&& iterator.hasNext(); iterator.next()) {
????????????byte[] key = iterator.peekNext().getKey();
????????????if (!prefixMatches(prefix, prefix.length, key)
????????????????|| (last != null && WritableComparator.compareBytes(key, 0,
????????????????key.length, last, 0, last.length) > 0)) {
??????????????break;
????????????}
????????????TimelineEvent event = getEntityEvent(eventType, key, prefix.length,
????????????????iterator.peekNext().getValue());
????????????if (event != null) {
??????????????entity.addEvent(event);
????????????}
??????????}
????????}
??????}
????}
????return events;
??}
同樣主要看標紅的部分,使用jdk8的寫法举塔,在最內部的每次循環(huán)結束之后绑警,JVM都會主動調用iterator.close()方法(jdk8的寫法,無需顯示指明調用close()方法)將iterator引用的對象的close()方法執(zhí)行關閉的操作央渣,現在問題來了计盒,為啥這個地方執(zhí)行close()方法,就不會發(fā)生內存泄露了芽丹,
通過分析DBIterator對象的close()方法
????public void close() {
????????iterator.delete();
????}
里面調用的delete()方法如下:
????public void delete() {
????????assertAllocated();
????????IteratorJNI.delete(self);
????????self = 0;
????}
IteratorJNI.delete()方法如下:
????????@JniMethod(flags={CPP_DELETE})
????????public static final native void delete(long self);
????可以看到章郁,最終通過jni調用了底層的C++的delete()方法,做釋放的操作志衍,這也就是說由于沒有主動調用釋放操作暖庄,導致底層C++的代碼中申請的內存不能夠釋放,最終導致JNI內存泄露楼肪。
????通過修改代碼后培廓,觀察幾個小時,發(fā)現不在發(fā)生內存泄露現象了春叫,至此問題解決肩钠。