
test(nns): Analyze list_neurons performance, and look for promising areas for optimization. #1712

Draft · wants to merge 24 commits into master from why-list_neurons-expensive-daniel-wong
Conversation

@daniel-wong-dfinity-org (Contributor) commented Sep 26, 2024

Usage

```
bazel test \
    --test_env=SSH_AUTH_SOCK \
    --test_output=streamed \
    --test_arg=--nocapture \
    //rs/nns/integration_tests:why_list_neurons_expensive \
&& cp -i \
  bazel-bin/rs/nns/integration_tests/test-127765624/why_list_neurons_expensive.runfiles/ic/list_neurons_*.svg \
  ./
```

(You need to be able to scp from [email protected].)

Limitations

  1. The method that you want to measure must be an update, not a query. You can temporarily change a query to an update in order to measure it; however, make sure that you do not merge such a change, because it is very much NOT backwards compatible!

  2. You need to disable heartbeat; otherwise, your flame graphs will have extraneous garbage in them. (A sketch of one way to do this follows this list.)
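
A minimal sketch of one way to do item 2, assuming a Cargo feature named `disable_heartbeat_for_profiling` (the feature name and structure here are illustrative, not the actual NNS code):

```rust
// Illustrative only: gate heartbeat work behind a profiling feature, so
// instrumented runs contain nothing but the update call under measurement.
#[ic_cdk::heartbeat]
fn canister_heartbeat() {
    if cfg!(feature = "disable_heartbeat_for_profiling") {
        return;
    }
    // ... normal heartbeat work ...
}
```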

Next Steps

  1. Prime Candid's cache by calling ListNeuronsResponse::ty() in post_upgrade, as suggested by Yan. (A sketch follows this list.)

  2. I'm not using all the space that I allocated to ic-wasm instrument. Earlier, when I tried to use all of it, I think it stomped on some other VirtualMemory instances. (I guess the ideal solution is for ic-wasm instrument to support MemoryManager + VirtualMemory, rather than asking for raw stable memory space, but this might be difficult.)

  3. Check the thread in the #dev channel where I am mainly talking to Yan Chen about getting this to work. I might be able to mine that thread for additional tasks to put into this "Next Steps" section...

  4. Make as much of this sharable as possible, because every canister developer wants this (not just NNS).
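
For item 1, a minimal sketch of what the cache priming could look like (the stand-in type below exists only so the sketch compiles; the real ListNeuronsResponse lives in the governance API):

```rust
use candid::CandidType;

// Stand-in for the real governance response type, just so this compiles.
#[derive(CandidType, Default)]
struct ListNeuronsResponse {
    full_neurons_len: u64,
}

// Call this from post_upgrade (the canister hook itself is elided here).
fn prime_candid_type_cache() {
    // Computing the Candid type of the response once at upgrade time
    // populates Candid's internal type cache, so the first real
    // list_neurons call does not pay that one-time cost.
    let _ = <ListNeuronsResponse as CandidType>::ty();
}
```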

Analysis

One of the things that the bazel test command outputs is instructions vs. neurons. Using linear regression, I developed a very strong model (R^2 = 0.973):

Instructions = (1.03 * HeapNeurons + 5.08 * StableMemoryNeurons + 47.5) * 1e5

This tells us that the incremental cost of fetching and serving a neuron from stable memory is about 5x as much as from heap.

This also tells us that the fixed cost/overhead of list_neurons is about 4.75 million instructions.
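
As a worked example, the model predicts that a principal with 10 heap neurons and 10 stable memory neurons costs (1.03 * 10 + 5.08 * 10 + 47.5) * 1e5 ≈ 10.9 million instructions, of which the fixed overhead alone accounts for nearly half.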

See attached Jupyter notebook for source code: list_neurons_instructions_vs_neuron_count.ipynb.txt

Observations

When the model underestimates, the principal tends to have many neurons in stable memory. Conversely, when the model overestimates, the principal tends to have very few neurons in stable memory. Not sure why that is; it does not really make sense to me... (warning: I'm not a data scientist.)

Anecdotally

Seems like there can be BIG variance in the number of instructions needed to load a neuron from stable memory: in some cases about 100k instructions, in others about 1.5M. What seems to be happening is that for "fast" stable memory neurons, no time is spent fetching auxiliary data, whereas for "slow" stable memory neurons, most of the time is spent loading auxiliary data. E.g. the neurons of principal ID an2fu-wmt2w-7fhvx-2g6qp-pq5pc-gdxu3-jfcr5-epiid-qhdzc-xnv3a-mae. We might want to dig into this, but I don't think there's time for that if we want to wrap up this investigation this week...

I think principal ID rxnbx-mhkez-7ckve-yboel-dnguy-dos76-hpdlg-fz5bf-xm5ou-72fre-wae is a good example of some (unsurprising) phenomena that seem to generalize:

  1. A large chunk of instructions is spent in a few areas:
     - Looking for the caller's neurons. This requires a StableBTreeMap range scan. We could perhaps speed this up if (some of) the index were in heap. It might make sense to split the index so that neurons in heap are also indexed in heap; however, this would introduce complexity. (A sketch of the range scan shape follows this list.)
     - When a neuron is in stable memory, it takes much longer to load compared to heap (already noted earlier when discussing the linear model).
     - Candid serialization of the response. This is significant when most of a principal's neurons are in heap. (This fact was hinted at earlier when the overhead of list_neurons was discussed in the context of the linear model.)
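
To illustrate the range scan shape, here is a plain-Rust stand-in (std BTreeMap instead of StableBTreeMap, and hypothetical u64 keys instead of the real principal/neuron ID types):

```rust
use std::collections::BTreeMap;

// Hypothetical index layout: (principal, neuron_id) composite keys keep all
// of one principal's neurons contiguous, so a single range scan finds them.
fn neuron_ids_of(index: &BTreeMap<(u64, u64), ()>, principal: u64) -> Vec<u64> {
    index
        .range((principal, u64::MIN)..=(principal, u64::MAX))
        .map(|(&(_principal, neuron_id), _)| neuron_id)
        .collect()
}

fn main() {
    let mut index = BTreeMap::new();
    index.insert((1, 10), ()); // principal 1, neuron 10
    index.insert((1, 11), ()); // principal 1, neuron 11
    index.insert((2, 12), ()); // principal 2, neuron 12
    assert_eq!(neuron_ids_of(&index, 1), vec![10, 11]);
}
```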

Possible Solutions (not necessarily recommended!)

These three areas seem to be outside the control of the NNS. We could perhaps speed up our use of stable memory by using a more efficient serialization scheme, but that would introduce more complexity. (A toy illustration of the trade-off follows.)
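
Purely to illustrate that trade-off (this is not the NNS neuron type, nor its actual encoding), a hand-rolled fixed layout avoids per-field schema overhead at the cost of rigidity:

```rust
// Toy two-field "neuron" with a fixed 16-byte layout: no field tags, no
// schema, cheap to decode, but every layout change is a manual migration.
struct MiniNeuron {
    id: u64,
    stake_e8s: u64,
}

fn encode(n: &MiniNeuron) -> [u8; 16] {
    let mut buf = [0u8; 16];
    buf[..8].copy_from_slice(&n.id.to_le_bytes());
    buf[8..].copy_from_slice(&n.stake_e8s.to_le_bytes());
    buf
}

fn decode(buf: &[u8; 16]) -> MiniNeuron {
    MiniNeuron {
        id: u64::from_le_bytes(buf[..8].try_into().unwrap()),
        stake_e8s: u64::from_le_bytes(buf[8..].try_into().unwrap()),
    }
}

fn main() {
    let n = MiniNeuron { id: 42, stake_e8s: 1_000_000 };
    let back = decode(&encode(&n));
    assert_eq!((back.id, back.stake_e8s), (42, 1_000_000));
}
```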

When Rust + WASM 64 support becomes mature, that might also help us avoid the cost of stable memory by keeping more data in heap. This has an important drawback, though: we would have to "freeze dry" heap data when upgrading, and if there is a large amount of it, upgrades become expensive. In any case, it does not seem like Rust + WASM 64 support is coming to the IC soon.

Examples

https://drive.google.com/file/d/1eBEuEKtKlmw4hpDkFYgrkX-h40jr9s_H/view?usp=sharing

💡 You will have to download this example ☝️ in order to click on boxes and zoom in on them, because the Google Drive viewer does not seem to support this.

Related Work

Max's range_neurons optimization. IIUC, this is currently only used in background (data validation) tasks, so it would not help us here.

@daniel-wong-dfinity-org marked this pull request as ready for review September 27, 2024 13:20
@daniel-wong-dfinity-org marked this pull request as draft September 27, 2024 13:20
@daniel-wong-dfinity-org changed the title from "test(nns): Trying to generate flame graph(s) for list_neurons." to "test(nns): Trying to generate flame graph(s) for." on Oct 7, 2024
@daniel-wong-dfinity-org changed the title from "test(nns): Trying to generate flame graph(s) for." to "test(nns): Generate canister flame graph(s)." on Oct 8, 2024
…). Running into a problem where ic-wasm instrument uses the 32-bit stable memory API, but we are already using more than 4 GiB of stable memory. Yan Chen confirmed that ic-wasm does not yet support the 64-bit API, but that can be fixed pretty easily.
… does not support the 64-bit stable memory API (the 32-bit API is deprecated).

This required us to upgrade our version of walrus from 0.21.1 to 0.22.0. Not sure yet if this has bad side effects...
…g from more lax from_utf8_lossy to stricter from_utf8 + unwrap.
…al changes:

1. start_address is a byte offset, not a page number.
2. Read "name" metadata from the ic-wasm instrumented WASM. This is later used for visualization.
3. Visualize. Output is at list_neurons.svg. This involved copying fn render_profiling from the ic-repl repo.

(I will upload an example list_neurons.svg to GitHub shortly.)
…ection = True` to our rust_canister rules in BUILD files.
@daniel-wong-dfinity-org force-pushed the why-list_neurons-expensive-daniel-wong branch from 5bd9d89 to b6d88a9 on October 8, 2024
@github-actions github-actions bot added the test label Oct 8, 2024
@daniel-wong-dfinity-org changed the title from "test(nns): Generate canister flame graph(s)." to "test(nns): Analyze list_neurons performance, and look for promising areas for optimization." on Oct 16, 2024
… Some unused imports, but I think everything builds at this point.

My favorite change here: do not tell inferno how wide to make the SVG. This is super nice, because then it is displayed using as much horizontal space as you give it.
…format used by inferno. Includes tests that show behavior is preserved. (The next commit is going to get rid of the original impl, which was copy n' pasted, and not very self-explanatory.)
…t used by ic-wasm to the format used by inferno. Previous commit shows that the new impl is equivalent.