Estimate radius at <100% when under-utilizing space #1574

Open
carver opened this issue Oct 31, 2024 · 1 comment

Comments

@carver (Collaborator) commented Oct 31, 2024

Problem

When running trin before storage hits the allocation limit, we currently run at a 100% radius. If the steady-state radius for a particular storage allocation will land at 3%, then 97% of the stored content we receive early on (in the period before the radius starts to shrink) will end up being deleted. During that period, 97% of the network transfer, CPU spent on verification, and storage I/O is wasted.

Proposal

Pre-shrink the radius

We can't know the size of the network with arbitrary precision ahead of time, but we can bake a rough guess into the binary. Even a rough guess can massively shrink the amount of wasted data.

Asymmetry in the guess

The effect of incorrectly estimating the radius is highly asymmetrical:

  • If we don't pre-shrink the radius enough, we are still much better off than before.
  • If we pre-shrink the radius too much, the node never stores as much as it is willing to store (unless we add a more complicated feature to bump the radius back up).

So we probably want to take our radius estimate and double it. We still get most of the benefit, with only a small downside when our baked-in numbers get stale over time (assuming it is quite rare for the network's total storage to shrink).
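
A minimal sketch of the arithmetic, assuming a hypothetical helper `estimated_starting_radius` and a baked-in network-size constant (neither is an existing trin API; trin also represents the radius as a 256-bit distance internally, not a fraction):

```rust
/// Hypothetical baked-in guess of the total data stored across the history
/// network, in bytes (~1.75 TB, per the fleet measurement in the example below).
const ESTIMATED_TOTAL_NETWORK_BYTES: u64 = 1_750_000_000_000;

/// Safety buffer applied to the estimate, per the asymmetry argument above.
const RADIUS_SAFETY_FACTOR: f64 = 2.0;

/// Estimate a starting radius fraction (0.0..=1.0) for a node that has
/// `storage_limit_bytes` of space allocated. Sketch only.
fn estimated_starting_radius(storage_limit_bytes: u64) -> f64 {
    let naive = storage_limit_bytes as f64 / ESTIMATED_TOTAL_NETWORK_BYTES as f64;
    (naive * RADIUS_SAFETY_FACTOR).min(1.0)
}
```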

Real example

Our current fleet of history nodes at steady state shows a ~2% radius with 35 GB allocated each, which implies a ~1.75 TB total network size.

If you launch a fresh trin instance for history with 17.5 GB of space, we can estimate that the radius will end up at 1%. With a 2x buffer, we could pre-shrink the radius to 2%. That means we waste ~1% of total resources during the period before the radius shrinks below 100%, compared to ~99% with the status quo. A ~99x reduction in waste is a good gain. The less storage you allocate, the more there is to gain from this change, and those are the folks who probably care the most about performance.
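
Plugging the numbers from this example into the hypothetical helper sketched above:

```rust
fn main() {
    // 17.5 GB allocated against a ~1.75 TB network: naive estimate ~1%,
    // doubled to ~2% by the safety factor.
    let radius = estimated_starting_radius(17_500_000_000);
    println!("starting radius: {:.1}%", radius * 100.0); // -> 2.0%
}
```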

History approaches the correct radius relatively quickly in practice, but state takes a while, and is more resource-intensive. So state is probably where we will feel the benefit of this the most.

@pipermerriam (Member) commented

This is another one that will probably cause hive test failures. A way around this would be to make the value configurable at the CLI so that it can effectively be disabled in hive tests. It seems potentially as simple as something like --history.total_network_data=600_000, which would let us take the storage limit the user provided (or the default) and extrapolate the appropriate starting radius. For hive, we might also want to introduce something like --history.max_allowed_radius=auto as the default, meaning the value is extrapolated from the total network size. In hive, we can then pass --history.max_allowed_radius=100 to get the current behavior of starting the radius at 100%.

Note that this approach suffers from the problem that network growth makes the estimate less accurate as time goes on, but in practice my guess is that this will be a non-issue. We could choose a more complex metric to account for it, like --history.network_growth_rate, but I would be surprised if the complexity of such an approach were worth it. We'll be making releases regularly and can update the baked-in estimate as we go.
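
A minimal sketch of how the flags proposed above might be declared with clap; the flag names come from the comment and are not existing trin options, and trin's actual CLI wiring may differ:

```rust
use clap::Parser;

/// Hypothetical flags sketched from the proposal above; not existing trin options.
#[derive(Parser, Debug)]
struct RadiusEstimateArgs {
    /// Rough estimate of the total data stored across the history network, in MB.
    /// Used together with the local storage limit to extrapolate a starting radius.
    #[arg(long = "history.total_network_data", default_value_t = 600_000u64)]
    history_total_network_data: u64,

    /// "auto" to extrapolate the starting radius from the total network size,
    /// or a percentage such as "100" to keep today's behavior of starting at 100%.
    #[arg(long = "history.max_allowed_radius", default_value = "auto")]
    history_max_allowed_radius: String,
}

fn main() {
    let args = RadiusEstimateArgs::parse();
    println!("{args:?}");
}
```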
