Trace API plugin response times vary by block height #1219

Closed
aaroncox opened this issue May 31, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@aaroncox
Contributor

During some history data processing on Jungle4, I came across some pretty consistent API performance differences while using the v1/trace_api/get_block API call. It appears that the block height being queried significantly affects the amount of time/processing required to return the data for the call.

I wrote a script to run some additional tests to reproduce my findings, and below is a chart that illustrates the performance degradation I'm seeing consistently based on block height.

https://docs.google.com/spreadsheets/d/1DRaY4NsE4nU_nZ1Al8x926aVeYYthuq4qP3RumUc9xA/edit?usp=sharing

[Chart: response time in milliseconds by block height, trace_api (red) vs. chain_api (blue)]

The red line is the trace_api and the blue line is the chain_api; both are measured as the milliseconds it took to respond to the API call. The test sequentially queried 100,000 blocks, using v1/chain/get_block as a baseline and then running the same sequence of blocks against the v1/trace_api/get_block endpoint. The test starts at height 39890000 and ends at 39990000.

It's worth noting that any sequence will yield the same results; this was just a random sampling.

You can see the results from the chain_api are consistently very fast (<10ms), while the results from the trace_api range from very fast (<10ms) to very slow (~200ms). The performance appears to be correlated with the block number, cycling every 10,000 blocks: the first block of each 10,000-block range is the fastest, the last is the slowest, and then the pattern repeats.

A few things to note about these results:

  • These results are from a trace_api-enabled server running Jungle4, tested against both versions 3.1.3 and 4.0.1.
  • This is a bare metal server with NVMe drives, not running any heavy services.
  • The HTTP worker threads are running at 100% utilization during these tests.
  • The vast majority of blocks in Jungle4 are empty, so I think it's safe to assume this isn't related to serialization.
  • All configuration for the trace_api plugin is default, except for trace-no-abis = true (a config sketch follows this list).
  • The test script I ran performed its queries against localhost to eliminate any network latency.
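For reference, here is a hypothetical config.ini excerpt matching the setup described above (assuming the standard eosio::trace_api_plugin plugin name; everything else is left at its defaults):

plugin = eosio::trace_api_plugin
# only non-default setting used in these tests:
trace-no-abis = true
# slice size left at its default of 10,000 blocks per trace file:
# trace-slice-stride = 10000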

Theories

After talking about this with the team at length and trying to rule out anything in our setup/configuration, one thing we noticed was that the trace_api stride size lines up exactly with which blocks perform best and which perform worst. Each file the trace_api outputs contains 10,000 blocks by default.

Our working theory at this point is that a block like 10,000 is fast because it's at the beginning of its file, while a block like 19,999 sits at the end of the file - and seeking through that file could be the reason for the delay in API response times. It's only a guess, since we haven't looked into how these files and their pointers work.
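To make the theory concrete, here is a small illustrative sketch (plain Node, not the plugin's actual code) of why a block at the start of a slice would be cheap to serve and one at the end expensive, assuming lookups scan slice entries from the start of the file:

// Illustration of the working theory only: cost of a lookup grows with the
// block's offset inside its 10,000-block slice file when entries are scanned
// from the start of the file.
function entriesScanned(blockNum, stride = 10000) {
  const offsetInSlice = blockNum % stride
  return offsetInSlice + 1
}

console.log(entriesScanned(39890000)) // 1      -> fast end of the cycle
console.log(entriesScanned(39899999)) // 10000  -> slow end of the cycle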

Expected results

In this situation, with ~99% empty blocks, our expectation is that the trace_api should return results in a consistent amount of time regardless of the block number (much like the chain_api response times).

Reproduction

After discovering this issue with our history services, I then reproduced the results using this script:

https://github.com/aaroncox/getblocktest

This script has the IP address of the Jungle4 server I was using (which runs the trace_api plugin) hardcoded, so it can be run by anyone looking to see the results in real time. The APIClient defined on line 3 can also be edited to point at any other server running a similar configuration. Using nodejs v18+, these tests can be run with:

git clone https://github.com/aaroncox/getblocktest
cd getblocktest
npm install
node index.js
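For context, here is a minimal sketch of the kind of timing loop index.js performs (this is not the actual script: the host and block count are placeholders, it relies on the global fetch available in Node 18+, and the request bodies assume the usual block_num_or_id / block_num parameters of the two get_block endpoints):

// Hypothetical timing loop; the real test ran the two endpoints as separate
// sequential passes over 100,000 blocks rather than interleaving them.
const HOST = 'http://127.0.0.1:8888' // placeholder: a trace_api-enabled node
const START = 39890000
const COUNT = 1000

async function timeCall(path, body) {
  const t0 = process.hrtime.bigint()
  const res = await fetch(HOST + path, { method: 'POST', body: JSON.stringify(body) })
  await res.json()
  return Number(process.hrtime.bigint() - t0) / 1e6 // elapsed milliseconds
}

async function main() {
  for (let block = START; block < START + COUNT; block++) {
    const chainMs = await timeCall('/v1/chain/get_block', { block_num_or_id: block })
    const traceMs = await timeCall('/v1/trace_api/get_block', { block_num: block })
    console.log(`${block},${chainMs.toFixed(1)},${traceMs.toFixed(1)}`)
  }
}

main()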
@heifner
Member

heifner commented May 31, 2023

This is not unexpected, although the actual times are larger than I would have expected. The trace_api_plugin does a scan from beginning to end of the file searching for the block number. The default trace-slice-stride of 10,000 means that in the worst case it has to scan the entire file to find the block number. You can mitigate this by using a smaller trace-slice-stride setting.
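For example, the mitigation would be a config.ini change along these lines (the stride value here is only an illustration; smaller slices mean less data to scan per lookup, at the cost of more trace files):

plugin = eosio::trace_api_plugin
trace-no-abis = true
trace-slice-stride = 1000   # default is 10000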

@heifner heifner added enhancement New feature or request and removed triage labels May 31, 2023
@aaroncox
Contributor Author

aaroncox commented May 31, 2023

Yeah, the addition of something like 200ms is a lot more than I expected would occur - especially on faster drives like nvme.

Would it be possible to use an index in the trace_index_XXX_YYY.log file to avoid this scan through the file, thus eliminating the performance drop?

I'm not opposed to dropping the trace-slice-stride size to 100 or even 10 to improve performance - but requiring operators to alter the default settings for a performance mitigation feels like just a temporary solution.

FWIW - The reason we'd like this improvement is that our history solution (Roborovski) processes data sequentially from the trace_api like this. In real time, as it's keeping up with head/lib, this isn't too much of a problem - but when replaying the entire chain to create a new instance, this bottleneck significantly slows down processing time (e.g. going from 3 days to 1 month to replay).

@heifner
Member

heifner commented Aug 10, 2023

I do think using the block number from the file name would provide quick determination of the correct file. Or we could create a meta index of block ranges to files.
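A hypothetical sketch of the first suggestion, deriving the slice directly from the block number and the stride instead of scanning (the file-name format below is only illustrative, not necessarily the plugin's exact naming):

// Given a block number and the slice stride, compute which slice file covers it.
function sliceFileFor(blockNum, stride = 10000) {
  const start = Math.floor(blockNum / stride) * stride // slice covers [start, start + stride)
  const pad = (n) => String(n).padStart(10, '0')
  return `trace_${pad(start)}-${pad(start + stride)}.log`
}

console.log(sliceFileFor(39895000)) // trace_0039890000-0039900000.log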

@bhazzard

bhazzard commented Nov 2, 2023

Options to address this:

  • Faster: Could set a maximum stride size to limit performance impact.
  • Better: Introduce an index to achieve scalable performance.

However, given the limited audience this issue impacts and the available workaround of setting a smaller stride size, we're going to close this issue for now, pending additional reports.

@bhazzard bhazzard closed this as not planned Nov 2, 2023
@bhazzard bhazzard removed the triage label Nov 2, 2023