This repository has been archived by the owner on Apr 15, 2024. It is now read-only.

ISSUE-2806: OOM as 1 million ledgers per entry log #409

Open
sijie opened this issue Sep 24, 2021 · 1 comment


sijie commented Sep 24, 2021

Original Issue: apache#2806


BUG REPORT

Describe the bug
There are about 1M ledgers per entry log. After running for a while, an OOM occurs, even though there is still plenty of free memory.
The OOM shows up at two positions:
Position 1:
2021-09-21 02:22:08,323 [SyncThread-7-1] ERROR org.apache.bookkeeper.bookie.SyncThread - Exception in SyncThread
java.lang.OutOfMemoryError: Java heap space
at org.apache.bookkeeper.bookie.storage.ldb.WriteCache.forEach(WriteCache.java:222) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]

Position 2:
2021-09-21 02:24:14,987 [SyncThread-7-1] ERROR org.apache.bookkeeper.bookie.SyncThread - Exception in SyncThread
java.lang.OutOfMemoryError: Java heap space
at org.apache.bookkeeper.util.collections.ConcurrentLongLongHashMap$Section.rehash(ConcurrentLongLongHashMap.java:673) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.util.collections.ConcurrentLongLongHashMap$Section.addAndGet(ConcurrentLongLongHashMap.java:456) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.util.collections.ConcurrentLongLongHashMap.addAndGet(ConcurrentLongLongHashMap.java:186) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.bookie.EntryLogMetadata.addLedgerSize(EntryLogMetadata.java:47) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]

main GC log:
2021-09-21T02:22:08.437+0800: 105453.194: [GC pause (G1 Humongous Allocation)
2021-09-21T02:22:08.449+0800: 105453.206: [Full GC (Allocation Failure) 10G->10G(20G), 1.9652874 secs]
[Eden: 0.0B(992.0M)->0.0B(1024.0M) Survivors: 32.0M->0.0B Heap: 10.3G(20.0G)->10.3G(20.0G)], [Metaspace: 35551K->35539K(1081344K)]
[Times: user=4.94 sys=0.00, real=1.96 secs]
2021-09-21T02:22:10.415+0800: 105455.172: [Full GC (Allocation Failure) 10G->10G(20G), 1.6151095 secs]
[Eden: 0.0B(1024.0M)->0.0B(1024.0M) Survivors: 0.0B->0.0B Heap: 10.3G(20.0G)->10.3G(20.0G)], [Metaspace: 35539K->35539K(1081344K)]
[Times: user=4.32 sys=0.00, real=1.62 secs]

The common feature is that both are allocating a humongous block of contiguous memory.
Position 1:
In WriteCache.forEach, with about 1M entries per minute, the sortedEntries array size should be 1M * 4 * 2 * 8 = 64 MB.

Position 2:
With 1M ledgers per entry log, the table size of the ConcurrentLongLongHashMap should be 1M * 2 * 2 * 8 = 32 MB. Sometimes there are more than 1M ledgers, so the allocation can be even larger than 32 MB.
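The two estimates above can be checked with a few lines of arithmetic (a sketch; the per-entry factors and 8 bytes per long follow the reasoning above, not measured values):

```java
// Back-of-the-envelope check of the two humongous allocation sizes.
// Assumes 8-byte longs; the per-entry factors follow the estimates above.
public class AllocSizes {
    // Position 1: sortedEntries in WriteCache.forEach
    // (4 longs per entry, doubled buffer)
    static long sortedEntriesBytes(long entries) {
        return entries * 4 * 2 * 8;
    }

    // Position 2: ConcurrentLongLongHashMap table
    // (key + value longs per slot, table sized at ~2x the entry count)
    static long hashTableBytes(long ledgers) {
        return ledgers * 2 * 2 * 8;
    }

    public static void main(String[] args) {
        System.out.println(sortedEntriesBytes(1_000_000) / 1_000_000 + " MB"); // 64 MB
        System.out.println(hashTableBytes(1_000_000) / 1_000_000 + " MB");     // 32 MB
    }
}
```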

Since we use G1 and G1HeapRegionSize is 32m (the maximum value), there may be no contiguous regions available for such a humongous allocation (in G1, any allocation of at least half a region size, i.e. >= 16 MB here, is treated as humongous and must be placed in contiguous regions). After pre-allocating a large array for sortedEntries and increasing the concurrencyLevel of the ConcurrentLongLongHashMap in EntryLogMetadata, the issue no longer appears. How about adding two configuration options for these?
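An alternative to pre-allocation (a different technique from the two configs proposed above, not BookKeeper code) is to split the one large array into fixed-size chunks, so each individual allocation stays below G1's humongous threshold. A minimal sketch:

```java
// Minimal sketch: a long[] split into fixed-size chunks so that each
// allocation is 8 MB, safely below the 16 MB humongous threshold that
// applies when G1HeapRegionSize=32m (threshold = half the region size).
public class ChunkedLongArray {
    private static final int CHUNK = 1 << 20; // 1M longs = 8 MB per chunk
    private final long[][] chunks;
    private final long size;

    public ChunkedLongArray(long size) {
        this.size = size;
        int n = (int) ((size + CHUNK - 1) / CHUNK);
        chunks = new long[n][];
        for (int i = 0; i < n; i++) {
            long remaining = size - (long) i * CHUNK;
            // Last chunk may be smaller than CHUNK
            chunks[i] = new long[(int) Math.min(CHUNK, remaining)];
        }
    }

    public long get(long idx) {
        return chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)];
    }

    public void set(long idx, long v) {
        chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)] = v;
    }

    public long size() {
        return size;
    }
}
```

The trade-off is one extra array indirection per access, in exchange for never needing a single contiguous multi-region block.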

Since the metadata of every entry log is kept in memory, the total footprint is very large: at 32 MB of EntryLogMetadata per entry log, the memory used would be several GB with hundreds of entry logs. In my deployment, ledgers are deleted by time: a ledger is deleted once it expires, and an entry log is not deleted until it has expired, so the per-ledger metadata does not need to be loaded into memory at all. How about adding a feature like this?
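The proposed feature could look roughly like the following check (a hypothetical sketch, not an existing BookKeeper API; `lastEntryTimestampMs` is an assumed per-log field): when ledgers are only ever deleted by TTL, a whole entry log becomes reclaimable once its newest entry has expired, with no per-ledger bookkeeping in memory.

```java
// Hypothetical sketch of expiry-driven entry log GC. Instead of tracking
// per-ledger remaining sizes in an in-memory EntryLogMetadata, only one
// timestamp per entry log is needed: the log is reclaimable once every
// entry in it (i.e. the newest one) has passed the retention TTL.
public class ExpiryBasedGc {
    public static boolean canDeleteEntryLog(long lastEntryTimestampMs,
                                            long ttlMs, long nowMs) {
        // The whole log may be deleted only after its newest entry expires.
        return nowMs - lastEntryTimestampMs > ttlMs;
    }
}
```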

@sijie sijie added the type/bug label Sep 24, 2021
@hangc0276 hangc0276 self-assigned this Sep 26, 2021
@hangc0276

I will address this issue
