Event Log Rotation and Memory Growth #1581

Open
Quantumplation opened this issue Aug 21, 2024 · 2 comments
Labels
💭 idea An idea or feature request

Comments

@Quantumplation
Contributor

Why

While working on the hydra-doom project, we noticed that both the on-disk state and the in-memory state grew without bound (see #1572).

At the sustained load the hydra-doom demo was producing, this made nodes inoperable after just a few hours. The hack in #1572 helped, but the on-disk state still needed to be rotated regularly, by hand.

Rotating consisted of stopping the nodes, renaming the data directory, bringing the nodes back up, and then shipping the old data directory off to archival storage. This only worked because we were using offline nodes and didn't mind interrupting the head.

What

I'd like to propose that the hydra head implement checkpointing for the event log.

How

This is just a proposed implementation; feel free to adapt it to better fit the intricacies of the hydra codebase.

  • The hydra node's default file sink will be updated to write into event log files or directories that are named by a starting sequence number, such as data/seq-0/state or data/seq-12345/state
  • The first message in each event log will be a "checkpoint" event, which contains any state needed to recover from that point in time without regard to any messages that came before
  • After a certain number of messages have been written (or after a time interval, a number of bytes, etc.), the hydra node will close the previous files, create a new file/directory, write the checkpoint event to the new file, drop any previous events from memory, and then emit this checkpoint event to the websocket API
  • On startup, the default file source would identify the "latest" event log directory and begin consuming events from that log; the initial checkpoint event would allow it to recover any state, such as the current UTXO, etc.
  • A new websocket message, "trigger checkpoint", would allow external orchestration to request a checkpoint when one is required, for example for maintenance windows, file backpressure, etc.
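To make the steps above concrete, here is a rough, self-contained sketch in Python (not Haskell, and not based on the actual hydra code). Everything in it is an assumption for illustration: the `RotatingEventLog` class, the JSON-lines file format, the `max_events` threshold, and the toy key/value `state` that stands in for real head state such as the UTXO set. Only the data/seq-N/state naming comes from the proposal.

```python
import json
from pathlib import Path


class RotatingEventLog:
    """Toy event log that rotates into <root>/seq-<N>/state segments (illustrative only)."""

    def __init__(self, root, max_events=1000):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.max_events = max_events  # hypothetical rotation threshold (events per segment)
        self.state = {}               # stand-in for head state, e.g. the current UTXO set
        self.memory = []              # events kept in memory since the last checkpoint
        self.next_seq = 0
        self.file = None
        # Startup: find the latest segment by its starting sequence number.
        segments = sorted(
            (d for d in self.root.glob("seq-*") if d.is_dir()),
            key=lambda d: int(d.name.split("-")[1]),
        )
        if segments:
            self._resume(segments[-1])
        else:
            self.checkpoint()  # a fresh log starts with an empty checkpoint

    def _resume(self, segment):
        # Replay only the latest segment; its first event is a checkpoint.
        with open(segment / "state") as f:
            for line in f:
                self._apply(json.loads(line))
        self.file = open(segment / "state", "a")

    def _apply(self, event):
        if event["type"] == "checkpoint":
            self.state = dict(event["state"])
            self.memory = []  # drop everything that came before the checkpoint
        else:
            self.state[event["key"]] = event["value"]
        self.memory.append(event)
        self.next_seq = event["seq"] + 1

    def _write(self, event):
        self.file.write(json.dumps(event) + "\n")
        self.file.flush()
        self._apply(event)

    def checkpoint(self):
        """Close the current segment and start a new one; also usable for 'trigger checkpoint'."""
        if self.file is not None:
            self.file.close()
        segment = self.root / f"seq-{self.next_seq}"
        segment.mkdir(parents=True)
        self.file = open(segment / "state", "w")
        event = {"type": "checkpoint", "seq": self.next_seq, "state": dict(self.state)}
        self._write(event)
        return event  # the caller would emit this to API clients

    def append(self, key, value):
        self._write({"type": "update", "seq": self.next_seq, "key": key, "value": value})
        if len(self.memory) >= self.max_events:
            self.checkpoint()

    def close(self):
        self.file.close()
```

In this sketch the checkpoint event itself consumes a sequence number, so with `max_events=3` the segments come out as seq-0, seq-3, and so on. Whether a real implementation would number checkpoints this way is an open design question.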

This would allow a 3rd party agent to detect the checkpoint and trigger any appropriate archival / backup / cleanup without interrupting the hydra head. Hydra heads would also be able to recover faster after a failure, and memory usage would be kept within a bounded limit.
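As a rough idea of what such a 3rd party agent could do, the sketch below treats every segment directory except the newest as closed and moves it to an archive location. The function names and the `archive` directory are hypothetical, and it assumes the data/seq-N layout proposed above; a real agent would more likely react to the checkpoint event on the websocket API than scan the directory.

```python
import shutil
from pathlib import Path


def closed_segments(root):
    """Return every seq-<N> directory except the newest one, oldest first."""
    segments = sorted(
        (d for d in Path(root).glob("seq-*") if d.is_dir()),
        key=lambda d: int(d.name.split("-")[1]),  # numeric sort, so seq-9 < seq-10
    )
    return segments[:-1]  # the last segment is still being written to


def archive_closed(root, archive):
    """Move closed segments into `archive` and return the names that were moved."""
    archive = Path(archive)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for segment in closed_segments(root):
        shutil.move(str(segment), str(archive / segment.name))
        moved.append(segment.name)
    return moved
```

Because the node only ever appends to the newest segment, the agent never touches a file the node still has open, which is what makes this safe to run without stopping the head.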

Again, I'm super unfamiliar with the hydra codebase, so there might be more subtleties that are needed, but I just wanted to get the ball rolling on a discussion :)

@Quantumplation Quantumplation added the 💭 idea An idea or feature request label Aug 21, 2024
@ch1bo ch1bo mentioned this issue Aug 21, 2024
4 tasks
@ch1bo
Collaborator

ch1bo commented Aug 22, 2024

As it was only mentioned in passing in this item, we might want to scope separate item(s) about the memory growth in:

  • The API server, which keeps an ever-growing history of output events. This could be addressed by projecting it from the base event stream (stored in state) and re-reading the persisted events on demand. The checkpointing proposed in this item would truncate the API history too.

  • The network reliability component, which keeps an ever-growing outbound buffer of sent messages. The algorithm must be changed to have only a bounded resilience against network faults and consequently a bounded buffer of messages it can resend.
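To illustrate what "bounded resilience" could mean for the second point, here is a minimal Python sketch of an outbound buffer with a fixed capacity. It is not how the hydra network reliability component works; the `BoundedOutbox` class, the per-message index, and the `capacity` parameter are all assumptions made for the example.

```python
from collections import deque


class BoundedOutbox:
    """Keeps only the last `capacity` sent messages available for resending."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries are evicted automatically
        self.next_index = 0

    def send(self, message):
        self.buffer.append((self.next_index, message))
        self.next_index += 1

    def resend_from(self, index):
        """Return messages with index >= `index`, or None if some were already evicted."""
        oldest = self.buffer[0][0] if self.buffer else self.next_index
        if index < oldest:
            return None  # the peer fell too far behind; it needs another recovery path
        return [message for i, message in self.buffer if i >= index]
```

The trade-off is explicit: memory stays constant, but a peer that misses more than `capacity` messages can no longer be caught up by resending alone.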

@ch1bo
Collaborator

ch1bo commented Sep 10, 2024

Created #1618 to cover the API server part of tackling memory growth.

Projects
Status: Later