
test(nns): Analyze list_neurons performance, and look for promising areas for optimization. #1712

Draft · wants to merge 24 commits into master from why-list_neurons-expensive-daniel-wong
Conversation

@daniel-wong-dfinity-org (Contributor) commented Sep 26, 2024

Usage

```
bazel test \
    --test_env=SSH_AUTH_SOCK \
    --test_output=streamed \
    --test_arg=--nocapture \
    //rs/nns/integration_tests:why_list_neurons_expensive \
&& cp -i \
  bazel-bin/rs/nns/integration_tests/test-127765624/why_list_neurons_expensive.runfiles/ic/list_neurons_*.svg \
  ./
```

(You need to be able to scp from [email protected].)

Limitations

  1. The method that you want to measure must be an update, not a query. You can temporarily change a query to an update in order to measure it; however, make sure that you do not merge such a change, because it is very much NOT backwards compatible!

  2. You need to disable heartbeat; otherwise, your flame graphs will have extraneous garbage in them. (A sketch of one way to do this follows this list.)
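
A minimal sketch of one way to do item 2, assuming a Cargo feature named `disable_heartbeat_for_profiling` (the feature name and structure here are illustrative, not the actual NNS code):

```rust
// Illustrative only: gate heartbeat work behind a profiling feature, so
// instrumented runs contain nothing but the update call under measurement.
#[ic_cdk::heartbeat]
fn canister_heartbeat() {
    if cfg!(feature = "disable_heartbeat_for_profiling") {
        return;
    }
    // ... normal heartbeat work ...
}
```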

Next Steps

  1. Prime Candid's cache by calling ListNeuronsResponse::ty() in post_upgrade, as suggested by Yan. (A sketch follows this list.)

  2. I'm not using all the space that I allocated to ic-wasm instrument. Earlier, when I tried to use all of it, I think it stomped on some other VirtualMemory instances. (I guess the ideal solution is for ic-wasm instrument to support MemoryManager + VirtualMemory, rather than asking for raw stable memory space, but this might be difficult.)

  3. Check the thread in the #dev channel where I am mainly talking to Yan Chen about getting this to work. I might be able to mine that thread for additional tasks to put into this "Next Steps" section...

  4. Make as much of this sharable as possible, because every canister developer wants this (not just NNS).
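
For item 1, a minimal sketch of what the cache priming could look like (the stand-in type below exists only so the sketch compiles; the real ListNeuronsResponse lives in the governance API):

```rust
use candid::CandidType;

// Stand-in for the real governance response type, just so this compiles.
#[derive(CandidType, Default)]
struct ListNeuronsResponse {
    full_neurons_len: u64,
}

// Call this from post_upgrade (the canister hook itself is elided here).
fn prime_candid_type_cache() {
    // Computing the Candid type of the response once at upgrade time
    // populates Candid's internal type cache, so the first real
    // list_neurons call does not pay that one-time cost.
    let _ = <ListNeuronsResponse as CandidType>::ty();
}
```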

Analysis

One of the things that the bazel test command outputs is instructions vs. neurons. Using linear regression, I developed a very strong model (R^2 = 0.973):

Instructions = (1.03 * HeapNeurons + 5.08 * StableMemoryNeurons + 47.5) * 1e5

This tells us that the incremental cost of fetching and serving a neuron from stable memory is about 5x as much as from heap.

This also tells us that the fixed cost/overhead of list_neurons is about 4.75 million instructions.
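
As a worked example, the model predicts that a principal with 10 heap neurons and 10 stable memory neurons costs (1.03 * 10 + 5.08 * 10 + 47.5) * 1e5 ≈ 10.9 million instructions, of which the fixed overhead alone accounts for nearly half.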

See attached Jupyter notebook for source code: list_neurons_instructions_vs_neuron_count.ipynb.txt

Observations

When the model underestimates, the principal tends to have many neurons in stable memory. Conversely, when the model overestimates, the principal tends to have very few neurons in stable memory. Not sure why that is; it does not really make sense to me... (warning: I'm not a data scientist.)

Anecdotally

Seems like there can be BIG variance in the number of instructions needed to load a neuron from stable memory: in some cases about 100k instructions, in others about 1.5M. What seems to be happening is that for "fast" stable memory neurons, no time is spent fetching auxiliary data, whereas for "slow" stable memory neurons, most of the time is spent loading auxiliary data. E.g. the neurons of principal ID an2fu-wmt2w-7fhvx-2g6qp-pq5pc-gdxu3-jfcr5-epiid-qhdzc-xnv3a-mae. We might want to dig into this, but I don't think there's time for that if we want to wrap up this investigation this week...

I think principal ID rxnbx-mhkez-7ckve-yboel-dnguy-dos76-hpdlg-fz5bf-xm5ou-72fre-wae is a good example of some (unsurprising) phenomena that seem to generalize:

  1. A large chunk of instructions is spent in a few areas:
     - Looking for the caller's neurons. This requires a StableBTreeMap range scan. We could perhaps speed this up if (some of) the index were in heap. It might make sense to split the index so that neurons in heap are also indexed in heap; however, this would introduce complexity. (A sketch of the range scan shape follows this list.)
     - When a neuron is in stable memory, it takes much longer to load compared to heap (already noted earlier when discussing the linear model).
     - Candid serialization of the response. This is significant when most of a principal's neurons are in heap. (This fact was hinted at earlier when the overhead of list_neurons was discussed in the context of the linear model.)
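
To illustrate the range scan shape, here is a plain-Rust stand-in (std BTreeMap instead of StableBTreeMap, and hypothetical u64 keys instead of the real principal/neuron ID types):

```rust
use std::collections::BTreeMap;

// Hypothetical index layout: (principal, neuron_id) composite keys keep all
// of one principal's neurons contiguous, so a single range scan finds them.
fn neuron_ids_of(index: &BTreeMap<(u64, u64), ()>, principal: u64) -> Vec<u64> {
    index
        .range((principal, u64::MIN)..=(principal, u64::MAX))
        .map(|(&(_principal, neuron_id), _)| neuron_id)
        .collect()
}

fn main() {
    let mut index = BTreeMap::new();
    index.insert((1, 10), ()); // principal 1, neuron 10
    index.insert((1, 11), ()); // principal 1, neuron 11
    index.insert((2, 12), ()); // principal 2, neuron 12
    assert_eq!(neuron_ids_of(&index, 1), vec![10, 11]);
}
```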

Possible Solutions (not necessarily recommended!)

These three areas seem to be outside the control of the NNS. We could perhaps speed up our use of stable memory by using a more efficient serialization scheme, but that would introduce more complexity. (A toy illustration of the trade-off follows.)
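
Purely to illustrate that trade-off (this is not the NNS neuron type, nor its actual encoding), a hand-rolled fixed layout avoids per-field schema overhead at the cost of rigidity:

```rust
// Toy two-field "neuron" with a fixed 16-byte layout: no field tags, no
// schema, cheap to decode, but every layout change is a manual migration.
struct MiniNeuron {
    id: u64,
    stake_e8s: u64,
}

fn encode(n: &MiniNeuron) -> [u8; 16] {
    let mut buf = [0u8; 16];
    buf[..8].copy_from_slice(&n.id.to_le_bytes());
    buf[8..].copy_from_slice(&n.stake_e8s.to_le_bytes());
    buf
}

fn decode(buf: &[u8; 16]) -> MiniNeuron {
    MiniNeuron {
        id: u64::from_le_bytes(buf[..8].try_into().unwrap()),
        stake_e8s: u64::from_le_bytes(buf[8..].try_into().unwrap()),
    }
}

fn main() {
    let n = MiniNeuron { id: 42, stake_e8s: 1_000_000 };
    let back = decode(&encode(&n));
    assert_eq!((back.id, back.stake_e8s), (42, 1_000_000));
}
```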

When Rust + WASM 64 support becomes mature, that might also help us avoid the cost of stable memory by keeping more data in heap. This has an important drawback, though: we would have to "freeze dry" heap data when upgrading, and if there is a large amount of it, upgrades become expensive. In any case, it does not seem like Rust + WASM 64 support is coming to the IC soon.

Examples

https://drive.google.com/file/d/1eBEuEKtKlmw4hpDkFYgrkX-h40jr9s_H/view?usp=sharing

💡 You will have to download this example ☝️ in order to click on boxes and zoom in on them, because the Google Drive viewer does not seem to support this.

Related Work

Max's range_neurons optimization. IIUC, this is currently only used in background (data validation) tasks, so it would not help us here.

@daniel-wong-dfinity-org marked this pull request as ready for review September 27, 2024 13:20
@daniel-wong-dfinity-org marked this pull request as draft September 27, 2024 13:20
@daniel-wong-dfinity-org changed the title from "test(nns): Trying to generate flame graph(s) for list_neurons." to "test(nns): Trying to generate flame graph(s) for." on Oct 7, 2024
@daniel-wong-dfinity-org changed the title from "test(nns): Trying to generate flame graph(s) for." to "test(nns): Generate canister flame graph(s)." on Oct 8, 2024
…). Running into a problem where ic-wasm instrument uses the 32-bit stable memory API, but we are already using more than 4 GiB of stable memory. Yan Chen confirmed that ic-wasm does not yet support the 64-bit API, but that can be fixed pretty easily.
… does not support the 64-bit stable memory API (the 32-bit API is deprecated).

This required us to upgrade our version of walrus from 0.21.1 to 0.22.0. Not sure yet if this has bad side effects...
…g from more lax from_utf8_lossy to stricter from_utf8 + unwrap.
…al changes:

1. start_address is a byte offset, not a page number.
2. Read "name" metadata from the ic-wasm instrumented WASM. This is later used for visualization.
3. Visualize. Output is at list_neurons.svg. This involved copying fn render_profiling from the ic-repl repo.

(I will upload an example list_neurons.svg to GitHub shortly.)
…ection = True` to our rust_canister rules in BUILD files.
@daniel-wong-dfinity-org force-pushed the why-list_neurons-expensive-daniel-wong branch from 5bd9d89 to b6d88a9 on October 8, 2024
@github-actions github-actions bot added the test label Oct 8, 2024
@daniel-wong-dfinity-org changed the title from "test(nns): Generate canister flame graph(s)." to "test(nns): Analyze list_neurons performance, and look for promising areas for optimization." on Oct 16, 2024
… Some unused imports, but I think everything builds at this point.

My favorite change here: do not tell inferno how wide to make the SVG. This is super nice, because then it is displayed using as much horizontal space as you give it.
…format used by inferno. Includes tests that show behavior is preserved. (The next commit is going to get rid of the original impl, which was copy n' pasted, and not very self-explanatory.)
…t used by ic-wasm to the format used by inferno. Previous commit shows that the new impl is equivalent.