This repository has been archived by the owner on Apr 15, 2024. It is now read-only.

ISSUE-2806: OOM as 1 million ledgers per entry log #409

Open
sijie opened this issue Sep 24, 2021 · 1 comment


sijie commented Sep 24, 2021

Original Issue: apache#2806


BUG REPORT

Describe the bug
There are about 1M ledgers per entry log. After running for a while, an OOM occurs, even though there is still plenty of free memory.
The OOM shows up at two positions:
Position 1:
2021-09-21 02:22:08,323 [SyncThread-7-1] ERROR org.apache.bookkeeper.bookie.SyncThread - Exception in SyncThread
java.lang.OutOfMemoryError: Java heap space
at org.apache.bookkeeper.bookie.storage.ldb.WriteCache.forEach(WriteCache.java:222) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]

Position 2:
2021-09-21 02:24:14,987 [SyncThread-7-1] ERROR org.apache.bookkeeper.bookie.SyncThread - Exception in SyncThread
java.lang.OutOfMemoryError: Java heap space
at org.apache.bookkeeper.util.collections.ConcurrentLongLongHashMap$Section.rehash(ConcurrentLongLongHashMap.java:673) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.util.collections.ConcurrentLongLongHashMap$Section.addAndGet(ConcurrentLongLongHashMap.java:456) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.util.collections.ConcurrentLongLongHashMap.addAndGet(ConcurrentLongLongHashMap.java:186) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.bookie.EntryLogMetadata.addLedgerSize(EntryLogMetadata.java:47) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]

main GC log:
2021-09-21T02:22:08.437+0800: 105453.194: [GC pause (G1 Humongous Allocation)
2021-09-21T02:22:08.449+0800: 105453.206: [Full GC (Allocation Failure) 10G->10G(20G), 1.9652874 secs]
[Eden: 0.0B(992.0M)->0.0B(1024.0M) Survivors: 32.0M->0.0B Heap: 10.3G(20.0G)->10.3G(20.0G)], [Metaspace: 35551K->35539K(1081344K)]
[Times: user=4.94 sys=0.00, real=1.96 secs]
2021-09-21T02:22:10.415+0800: 105455.172: [Full GC (Allocation Failure) 10G->10G(20G), 1.6151095 secs]
[Eden: 0.0B(1024.0M)->0.0B(1024.0M) Survivors: 0.0B->0.0B Heap: 10.3G(20.0G)->10.3G(20.0G)], [Metaspace: 35539K->35539K(1081344K)]
[Times: user=4.32 sys=0.00, real=1.62 secs]

The common feature is that both are allocating a humongous block of contiguous memory.
Position 1:
In WriteCache.forEach, with about 1M entries per minute, the sortedEntries array size should be 1M * 4 * 2 * 8 = 64 MB.

Position 2:
With 1M ledgers per entry log, the table size of the ConcurrentLongLongHashMap should be 1M * 2 * 2 * 8 = 32 MB. Sometimes there are more than 1M ledgers, so the allocation can be even larger than 32 MB.
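The two estimates above can be checked with a few lines of arithmetic (a sketch; the per-entry factors and 8 bytes per long follow the reasoning above, not measured values):

```java
// Back-of-the-envelope check of the two humongous allocation sizes.
// Assumes 8-byte longs; the per-entry factors follow the estimates above.
public class AllocSizes {
    // Position 1: sortedEntries in WriteCache.forEach
    // (4 longs per entry, doubled buffer)
    static long sortedEntriesBytes(long entries) {
        return entries * 4 * 2 * 8;
    }

    // Position 2: ConcurrentLongLongHashMap table
    // (key + value longs per slot, table sized at ~2x the entry count)
    static long hashTableBytes(long ledgers) {
        return ledgers * 2 * 2 * 8;
    }

    public static void main(String[] args) {
        System.out.println(sortedEntriesBytes(1_000_000) / 1_000_000 + " MB"); // 64 MB
        System.out.println(hashTableBytes(1_000_000) / 1_000_000 + " MB");     // 32 MB
    }
}
```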

Since we use G1 and G1HeapRegionSize is 32m (the maximum value), there may be no contiguous regions available for such a humongous allocation (in G1, any allocation of at least half a region size, i.e. >= 16 MB here, is treated as humongous and must be placed in contiguous regions). After pre-allocating a large array for sortedEntries and increasing the concurrencyLevel of the ConcurrentLongLongHashMap in EntryLogMetadata, the issue no longer appears. How about adding two configuration options for these?
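An alternative to pre-allocation (a different technique from the two configs proposed above, not BookKeeper code) is to split the one large array into fixed-size chunks, so each individual allocation stays below G1's humongous threshold. A minimal sketch:

```java
// Minimal sketch: a long[] split into fixed-size chunks so that each
// allocation is 8 MB, safely below the 16 MB humongous threshold that
// applies when G1HeapRegionSize=32m (threshold = half the region size).
public class ChunkedLongArray {
    private static final int CHUNK = 1 << 20; // 1M longs = 8 MB per chunk
    private final long[][] chunks;
    private final long size;

    public ChunkedLongArray(long size) {
        this.size = size;
        int n = (int) ((size + CHUNK - 1) / CHUNK);
        chunks = new long[n][];
        for (int i = 0; i < n; i++) {
            long remaining = size - (long) i * CHUNK;
            // Last chunk may be smaller than CHUNK
            chunks[i] = new long[(int) Math.min(CHUNK, remaining)];
        }
    }

    public long get(long idx) {
        return chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)];
    }

    public void set(long idx, long v) {
        chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)] = v;
    }

    public long size() {
        return size;
    }
}
```

The trade-off is one extra array indirection per access, in exchange for never needing a single contiguous multi-region block.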

Since the metadata of every entry log is kept in memory, the total footprint is very large: at 32 MB of EntryLogMetadata per entry log, the memory used would be several GB with hundreds of entry logs. In my deployment, ledgers are deleted by time: a ledger is deleted once it expires, and an entry log is not deleted until it has expired, so the per-ledger metadata does not need to be loaded into memory at all. How about adding a feature like this?
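The proposed feature could look roughly like the following check (a hypothetical sketch, not an existing BookKeeper API; `lastEntryTimestampMs` is an assumed per-log field): when ledgers are only ever deleted by TTL, a whole entry log becomes reclaimable once its newest entry has expired, with no per-ledger bookkeeping in memory.

```java
// Hypothetical sketch of expiry-driven entry log GC. Instead of tracking
// per-ledger remaining sizes in an in-memory EntryLogMetadata, only one
// timestamp per entry log is needed: the log is reclaimable once every
// entry in it (i.e. the newest one) has passed the retention TTL.
public class ExpiryBasedGc {
    public static boolean canDeleteEntryLog(long lastEntryTimestampMs,
                                            long ttlMs, long nowMs) {
        // The whole log may be deleted only after its newest entry expires.
        return nowMs - lastEntryTimestampMs > ttlMs;
    }
}
```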

@sijie sijie added the type/bug label Sep 24, 2021
@hangc0276 hangc0276 self-assigned this Sep 26, 2021
@hangc0276

I will address this issue
