Estimate radius at <100% when under-utilizing space #1574

Open
carver opened this issue Oct 31, 2024 · 1 comment

Comments

@carver (Collaborator) commented Oct 31, 2024

Problem

When running trin before storage hits the allocation limit, we currently run at a 100% radius. If the steady-state radius for a particular storage allocation will land at 3%, then 97% of the stored content we receive early on (in the period before the radius starts to shrink) will end up being deleted. During that period, 97% of the network transfer, CPU spent on verification, and storage I/O is wasted.

Proposal

Pre-shrink the radius

We can't know the size of the network with arbitrary precision ahead of time, but we can bake a rough guess into the binary. Even a rough guess can massively shrink the amount of wasted data.

Asymmetry in the guess

The effect of incorrectly estimating the radius is highly asymmetrical:

  • If we don't pre-shrink the radius enough, we are still much better off than before.
  • If we pre-shrink the radius too much, the node never stores as much as it is willing to store (unless we add a more complicated feature to bump the radius back up).

So we probably want to take our radius estimate and double it. We still get most of the benefit, with only a small downside when our baked-in numbers get stale over time (assuming it is quite rare for the network's total storage to shrink).
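
A minimal sketch of the arithmetic, assuming a hypothetical helper `estimated_starting_radius` and a baked-in network-size constant (neither is an existing trin API; trin also represents the radius as a 256-bit distance internally, not a fraction):

```rust
/// Hypothetical baked-in guess of the total data stored across the history
/// network, in bytes (~1.75 TB, per the fleet measurement in the example below).
const ESTIMATED_TOTAL_NETWORK_BYTES: u64 = 1_750_000_000_000;

/// Safety buffer applied to the estimate, per the asymmetry argument above.
const RADIUS_SAFETY_FACTOR: f64 = 2.0;

/// Estimate a starting radius fraction (0.0..=1.0) for a node that has
/// `storage_limit_bytes` of space allocated. Sketch only.
fn estimated_starting_radius(storage_limit_bytes: u64) -> f64 {
    let naive = storage_limit_bytes as f64 / ESTIMATED_TOTAL_NETWORK_BYTES as f64;
    (naive * RADIUS_SAFETY_FACTOR).min(1.0)
}
```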

Real example

Our current fleet of history nodes at steady state shows a ~2% radius with 35 GB allocated each, which implies a ~1.75 TB total network size.

If you launch a fresh trin instance for history with 17.5 GB of space, we can estimate that the radius will end up at 1%. With a 2x buffer, we could pre-shrink the radius to 2%. That means we waste ~1% of total resources during the period before the radius shrinks below 100%, compared to ~99% with the status quo. A ~99x reduction in waste is a good gain. The less storage you allocate, the more there is to gain from this change, and those are the folks who probably care the most about performance.
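
Plugging the numbers from this example into the hypothetical helper sketched above:

```rust
fn main() {
    // 17.5 GB allocated against a ~1.75 TB network: naive estimate ~1%,
    // doubled to ~2% by the safety factor.
    let radius = estimated_starting_radius(17_500_000_000);
    println!("starting radius: {:.1}%", radius * 100.0); // -> 2.0%
}
```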

History approaches the correct radius relatively quickly in practice, but state takes a while, and is more resource-intensive. So state is probably where we will feel the benefit of this the most.

@pipermerriam (Member) commented

This is another one that will probably cause hive test failures. A way around this would be to make the value configurable at the CLI so that it can effectively be disabled in hive tests. It seems potentially as simple as something like --history.total_network_data=600_000, which would let us take the storage limit the user provided (or the default) and extrapolate the appropriate starting radius. For hive, we might also want to introduce something like --history.max_allowed_radius=auto as the default, meaning the value is extrapolated from the total network size. In hive, we can then pass --history.max_allowed_radius=100 to get the current behavior of starting the radius at 100%.

Note that this approach suffers from the problem that network growth makes the estimate less accurate as time goes on, but in practice my guess is that this will be a non-issue. We could choose a more complex metric to account for it, like --history.network_growth_rate, but I would be surprised if the complexity of such an approach were worth it. We'll be making releases regularly and can update the baked-in estimate as we go.
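
A minimal sketch of how the flags proposed above might be declared with clap; the flag names come from the comment and are not existing trin options, and trin's actual CLI wiring may differ:

```rust
use clap::Parser;

/// Hypothetical flags sketched from the proposal above; not existing trin options.
#[derive(Parser, Debug)]
struct RadiusEstimateArgs {
    /// Rough estimate of the total data stored across the history network, in MB.
    /// Used together with the local storage limit to extrapolate a starting radius.
    #[arg(long = "history.total_network_data", default_value_t = 600_000u64)]
    history_total_network_data: u64,

    /// "auto" to extrapolate the starting radius from the total network size,
    /// or a percentage such as "100" to keep today's behavior of starting at 100%.
    #[arg(long = "history.max_allowed_radius", default_value = "auto")]
    history_max_allowed_radius: String,
}

fn main() {
    let args = RadiusEstimateArgs::parse();
    println!("{args:?}");
}
```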
