test(vm): Refactor VM benchmarks #2668
Conversation
Force-pushed from 7672cfd to dcf97b2
Couple of general comments:
- The Iai improvements are probably caused by reduced setup complexity. They could also have to do with removing some deps that have static initialization logic, like RocksDB (although that should be filtered out by Iai's calibration logic).
- If removing the DIY benchmark / extending Criterion is controversial, I can split it off from this PR. IMO, having a uniform benchmark framework for local runs and Prometheus reporting is good since it avoids defining the same benches multiple times (see the sketch after this comment). The API looks relatively straightforward. It has its drawbacks, though; e.g., the collected stats include warm-up iterations, so reported means differ slightly between Criterion and Prometheus.
Metrics for a test run are in the stage Prometheus (vm_benchmark_mean_timing_seconds etc.).
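To illustrate the "define once" point, here is a minimal sketch of how a single bench list could feed both a Criterion runner and a Prometheus reporter. Everything in it (the BENCHES array, the fibonacci stand-in workload, the runner name) is hypothetical and not the PR's actual API:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Stand-in workload; in the PR the workloads are VM programs.
fn fibonacci(n: u64) -> u64 {
    (1..=n).fold((0_u64, 1_u64), |(a, b), _| (b, a.wrapping_add(b))).0
}

// Hypothetical single source of truth: each bench is declared exactly once
// and can be consumed by both a Criterion runner and a Prometheus reporter.
const BENCHES: &[(&str, fn() -> u64)] = &[("fibonacci/20", || fibonacci(20))];

fn criterion_runner(c: &mut Criterion) {
    for (name, workload) in BENCHES {
        c.bench_function(name, |b| b.iter(|| black_box(workload())));
    }
}

criterion_group!(benches, criterion_runner);
criterion_main!(benches);
```

A Prometheus reporter could iterate over the same BENCHES array and export per-bench timings instead of (or in addition to) running Criterion, so no bench is defined twice.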
Regarding Criterion vs. DIY: the DIY approach gives more useful data. The average time reported by Criterion is a very poor metric, especially in noisy CI. Because our programs don't use any randomized algorithms (except the hash map in the old VM), the minimum time is a good measurement. The newer Divan framework is as pleasant to use as Criterion and reports minima; I use it in the vm2 repo.
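For context, a minimal Divan bench looks like this (a sketch with a stand-in workload, not one of the actual VM benchmarks); Divan discovers functions annotated with #[divan::bench] and reports the fastest time among its stats:

```rust
// benches/example.rs, with `divan` as a dev-dependency.
fn main() {
    // Runs all functions annotated with `#[divan::bench]`.
    divan::main();
}

#[divan::bench]
fn fibonacci_20() -> u64 {
    fn fib(n: u64) -> u64 {
        if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
    }
    // `black_box` prevents the input from being constant-folded away.
    fib(divan::black_box(20))
}
```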
My understanding is that in CI, we would primarily rely on metrics exported to Prometheus, which do include the minimum iteration time (along with the maximum, mean, median, and number of iterations; see the sketch below). They can be printed to stdout as well (added that in the latest commit). So if it's only a question of whether these values are displayed, it's not an issue.
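For illustration, a sketch of how those per-bench stats could be derived from raw iteration timings. The helper is hypothetical; the actual reporter in the PR differs (e.g., it includes warm-up iterations, as noted above):

```rust
use std::time::{Duration, Instant};

// Hypothetical helper: time a workload for a fixed number of iterations and
// derive the stats exported to Prometheus (min, max, mean, median).
fn collect_stats(mut workload: impl FnMut(), iterations: usize) -> [Duration; 4] {
    assert!(iterations > 0);
    let mut timings: Vec<Duration> = (0..iterations)
        .map(|_| {
            let started_at = Instant::now();
            workload();
            started_at.elapsed()
        })
        .collect();
    timings.sort();
    let min = timings[0];
    let max = timings[iterations - 1];
    let mean = timings.iter().sum::<Duration>() / iterations as u32;
    let median = timings[iterations / 2];
    [min, max, mean, median]
}
```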
Detected VM performance changes
Changes in the number of opcodes executed indicate that the gas price of the benchmark has changed, causing it to run out of gas at a different point, or that it is behaving completely differently.
What ❔
Why ❔
Makes VM benchmarks more maintainable.
Checklist
Code has been formatted via zk fmt and zk lint.