Skip to content

LMDB freelist

ledgerwatch edited this page Oct 12, 2020 · 12 revisions

In LMDB, data can be deleted in two main ways:

  1. deleting keys from a cursor
  2. removing an entire table (DBI)

Deleting keys from a cursor may cause rebalancing of the B+-tree, because one of the invariants that need to be maintained in a B+-tree says that all pages, except for the root page, must be at least half-full. So, if deleting keys makes a page less than half-full, rebalancing happens - keys and values are moved between pages. Some of these rebalancing cause some pages become empty.

Removing an entire table (DBI) is a simpler operation than deleting keys from a cursor, because it does not require any rebalancing. Any table always starts with its own root page, and pages are never shared between tables. Therefore, removing an entire table requires finding all the pages that belong to that table.

As we see from above, both ways may produce new empty pages. These empty pages are usually in middle of the database file, and we would like to fill them up with the new data, to prevent the database file from growing too fast.

Since any changes to the database can only occur as a result of a writeable transaction, a transaction object should have some transient data structures to keep the list of pages this transaction emptied. At the time of the transaction commit, these structures are converted into persistent form inside the database. There are two transient data structure each transaction object has to keep the list of the emptied page IDs:

  1. MDB_IDL mt_free_pgs; Type MDB_IDL stands for "ID list", and it is a pointer for a list of 8-byte Page IDs, starting with the length of the list at the index 0. Therefore, you can often see txn->mt_free_pgs[0] as the expression to take the length of the page ID list. This list is pre-allocated to certain size, and therefore, if something needs to be added, such list sometimes needs to be reallocated and moved.
  2. MDB_page *mt_loose_pgs; Type MDB_page is the structure describing page header. At the beginning of any page, there is a space for either page number of a pointer to another page header. Therefore, mt_loose_pgs is the head of a linked list of page headers, which is convenient for adding and remove pages one by one.

Conceptually, these two data structures of a transaction object contain the same kind of information - pages emptied by the transaction, but they are kept and maintained in different forms (slice of IDs and linked list of page headers). The reason for this is as follows. Loose pages from mt_loose_pgs are allowed to be re-used during the same transaction, whereas the pages from mt_free_pgs are not allowed to be re-used. The re-use happens in the function mdb_page_alloc. This distinction about allowed re-use drives the decisions on when emptied pages are added to one list or another. Pages that have been allocated by this current transaction (either from the expansion of the database or from recycling of the freelist), are known not to be accessed by any other transactions. Such pages end up being marked with P_DIRTY flag. Therefore, having such flag becomes necessary (but not sufficient) criteria for a page to end up in the mt_loose_pgs linked list when emptied.

  1. When an entire table is removed, the emptied pages always get added to the mt_free_pgs, ostensibly for efficiency. If there are many page IDs that were emptied, it is very unlikely that all of them have been allocated by the current transaction, so they are all added to the mt_free_pgs, even those some of them might be eligible for mt_loose_pgs. The function where appending to mt_free_pgs is performed is mdb_drop0.
  2. When rebalancing is performed, if a page has too few entries, and its neighbour page is also just at the threshold or below, the two pages are merged. This merging creates an empty page, and it is added to the mt_loose_pgs if it is marked with P_DIRTY flag and it is not part of the "special" freelist table FREE_DBI. Otherwise, it is added to the mt_free_pgs

At the transaction commit, both structures mt_free_pgs and mt_loose_pgs are effectively merged together and persisted as a record in the "special" freelist table FREE_DBI. Such record has transaction IDs as a key, and the value is the encoding of the sorted list of page IDs, 8 byte per ID. Transaction IDs increase monotonously throughout the life of the database, so they can be used as "versions" for the purposes of MVCC (Multi Version Concurrency Control). The function that persists these two data structures is mdb_freelist_save, and it has some nuances that cause performance issues when freelists become large. I will describe the mechanics of this function and the issues next...