With PR #496 just about wrapped up, I wanted to jot down some ideas for future improvements to the performance of the MapForwardSimulator. There is a chance one of these changes (the first one) may wind up included in that PR, but the rest will likely be left for future enterprising devs. Where I have an estimate of the speedup a change would give, I've listed it.
If other folks have ideas on how to do things better with the map forward simulator, feel free to chime in on this discussion.
Circuit parameter dependence (20-33% speedup): In the context of running GST circuits, it is often the case that many circuits depend on only a subset of the model's parameters. As a corollary, when computing Jacobians it is wasteful to compute derivatives with respect to parameters upon which a circuit does not depend. The infrastructure for determining this parameter dependence information is being added in Making 2Q GST Go Vroom Vroom #496, where it was also successfully implemented for the slower Python implementation of the forward simulator. I did try to implement this for the faster Cython version (you can see that here if curious: 4b44ed5), but wasn't able to get it working in a manner that reused rho vector caches, so it incurred a lot of additional memory allocations that made the change an overall wash. I have strong reason to believe this is mostly just a skill issue on my part, and that with an hour of pair programming with someone more proficient in C++ (i.e. proficient at all) we could get this working in a way that avoids the extra allocations and thus gives a win. The 20-33% speedup figure comes from the average number of parameters a GST circuit depends upon in experiment designs for the one-qubit XYI gate set and the two-qubit XYCPHASE gate set, respectively. In the Python implementation (which does use this trick and doesn't have the memory allocation issues), my profiling shows we more or less saturate this speedup in practice.
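The idea above can be sketched with a toy finite-difference Jacobian that only touches the columns for parameters a circuit actually depends on. This is illustrative only; `circuit_jacobian`, `probs_fn`, and `dependent_indices` are hypothetical stand-ins, not the dependence infrastructure from #496:

```python
import numpy as np

def circuit_jacobian(probs_fn, model_params, dependent_indices, eps=1e-7):
    """Finite-difference Jacobian of one circuit's outcome probabilities,
    computed only with respect to the parameters the circuit depends on.
    Columns for all other parameters are known to be zero and are skipped."""
    base = probs_fn(model_params)
    jac = np.zeros((len(base), len(model_params)))
    for i in dependent_indices:  # skip parameters the circuit ignores
        shifted = model_params.copy()
        shifted[i] += eps
        jac[:, i] = (probs_fn(shifted) - base) / eps
    return jac
```

If a typical circuit depends on only a fraction of the model's parameters, the loop body runs proportionally fewer times, which is where the estimated 20-33% comes from.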
Dense effect representation caching (33% speedup): In the newest implementation of the map forward simulator code (i.e. the version in Making 2Q GST Go Vroom Vroom #496), I was surprised to discover that for error generator parameterized models, roughly a third of the total time for probability calculations is spent in the final inner product between the effects and the propagated state. Deeper profiling with py-spy (which has the really cool ability to profile down into native C/C++ extensions; cProfile and related tools can't break these calls out) found that this inner product actually contained one more, unaccounted-for state propagation: through the error generator associated with the composed effects. Right now the caching here happens at the level of the propagated state (i.e. the propagated state is reused for each of the effects sharing the same error generator), but it would be much more efficient to instead cache the action of the composed effect's error generator on the effect vectors and reuse those, as this only needs to happen once per batch of circuits rather than once per circuit as currently. This change will be more involved, as we'd need to modify the C evotype/rep code to add support for caching dense effect vectors. Based on my profiling, we should expect roughly a 33% speedup to probability calculations (in the 2Q case) from this change.
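A minimal sketch of the trick, with made-up names (the real change would live in the C evotype/rep code): since ⟨E| e^L |rho⟩ equals ⟨(e^L)ᵀE | rho⟩, the effect's error generator can be folded into a dense effect vector once per batch instead of re-propagating every circuit's rho through e^L:

```python
import numpy as np
from scipy.linalg import expm

def batch_probabilities(effects, errgen, rhos):
    """Fold the composed effect's error generator into dense effect vectors
    once per batch of circuits, then take plain inner products per circuit."""
    eL_T = expm(errgen).T                       # transpose moves e^L onto the effects
    dense_effects = [eL_T @ E for E in effects]  # computed once per batch
    return np.array([[E @ rho for E in dense_effects] for rho in rhos])
```

The per-circuit work drops to one inner product per effect, which matches the roughly-a-third of runtime the profiling attributed to the extra propagation.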
Use normalization (5-10%): Right now, when performing the final probability calculation, we take the inner product between the propagated state rho and all of the effect vectors. We are almost always working with models where the probabilities are normalized, though, so in principle we could save a bit of work by computing all but one of the inner products and recovering the last probability from normalization. This of course has diminishing benefit as the number of qubits increases. Based on the same profiling described in item 2, I'd ballpark this as a ~5% savings (relative to current runtime) for two qubits, and proportionally a bit more for one qubit.
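As a sketch (assuming, as stated above, that the model's outcome probabilities sum to one):

```python
import numpy as np

def probs_with_normalization(effects, rho):
    """Take inner products for all but one effect and recover the last
    probability as 1 minus the sum, skipping one inner product per circuit.
    Only valid when the model's probabilities are normalized."""
    partial = np.array([E @ rho for E in effects[:-1]])
    return np.append(partial, 1.0 - partial.sum())
```

With n outcomes this saves 1/n of the inner products, consistent with the benefit shrinking as the number of qubits (and hence outcomes) grows.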
Hessian implementation updates: I presently have not directly changed anything in the implementation of the hessian calculation for the map forward simulators. It will benefit partially from the updates to the jacobian calculations, but it still uses the more expensive from_vector calls to do parameter updates in its outermost loop, instead of the newer, more efficient few-parameter update codepath. I don't know offhand how big a savings this would give, but it should be nontrivial. Another bit of low-hanging fruit is that the hessian code's outer loop still happens in pure Python. We should be able to get a nontrivial benefit from porting this to a cythonized implementation, as is done for the probability and jacobian calculations currently. Both of these changes would be relatively low-effort, but I'd expect them to have big benefits.
Audit memory allocations: We're at a point now where ~20% of the total runtime for probability calculations (in the 2Q case) is going into memory allocations. We should audit the code and identify any unnecessary allocations or otherwise inefficient memory management practices. We may well not find anything meaningful, but it'd probably be worth doing at some point.
Hybridize the map forward simulator: This one is an idea from @kevincyoung. The idea is to effectively hybridize the map and matrix forward simulators by adding logic to the evaluation strategy that detects very long repeated gate sequences and then either performs the direct matrix-vector state propagation as is done now, or, if determined to be more efficient, first uses repeated squaring to construct the effective process matrix for the repeated block before performing the propagation. The main benefit would likely come from very long GST-like circuits. This would require a fairly large modification to the simulator implementation to get working.
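The core trade-off can be sketched in a few lines; the `crossover` threshold here is an arbitrary placeholder, and a real implementation would pick it from the relative costs of k matrix-vector products (O(k·d²)) versus O(log k) matrix-matrix products (O(d³·log k)):

```python
import numpy as np

def propagate_repeated(G, rho, k, crossover=32):
    """Propagate rho through k repetitions of the gate map G: direct
    mat-vec products for short runs, exponentiation by squaring to build
    G^k first when the run is long enough to amortize the d^3 cost."""
    if k < crossover:
        for _ in range(k):  # k mat-vecs, cheap for small k
            rho = G @ rho
        return rho
    result = np.eye(G.shape[0])
    base = G.copy()
    while k:                 # repeated squaring: O(log k) mat-mat products
        if k & 1:
            result = base @ result
        base = base @ base
        k >>= 1
    return result @ rho
```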
More intelligent evaluation strategies with limited cache sizes: Right now the logic for the map simulator's evaluation strategy (handled by PrefixTable) when the cache size is limited is pretty unintelligent. It simply greedily adds prefixes to the cache until it is full, and then skips caching the outputs of any other circuits. We could do a lot better here. Idea 1: take into account the number of cache hits for a prefix state; right now we only check that it is hit at least once when making the cache determination, but we could instead give preference to hotter prefixes. Idea 2: add prefix expiration logic that detects when a cached prefix is no longer required by any remaining circuit and frees up that slot in the cache for reuse. The use of caching makes a huge difference to the performance of the simulator, so doing this caching better in memory-limited settings could matter a lot.
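Idea 2 can be sketched as follows; this is illustrative pseudocode in Python, not PrefixTable's actual API, and in the real simulator cached states would be arrays that need copying before being stored:

```python
from collections import Counter

def eval_with_expiring_cache(circuits, apply_op, init_state, max_slots):
    """Evaluate circuits reusing cached prefix states; track how many
    remaining circuits still use each prefix and free its slot once that
    count hits zero, so the slot can be reused by later prefixes."""
    remaining = Counter()
    for c in circuits:
        for i in range(1, len(c)):
            remaining[c[:i]] += 1
    cache, results = {}, []
    for c in circuits:
        # resume from the longest cached proper prefix of this circuit
        start, state = 0, init_state
        for i in range(len(c) - 1, 0, -1):
            if c[:i] in cache:
                start, state = i, cache[c[:i]]
                break
        for i in range(start, len(c)):
            state = apply_op(c[i], state)
            prefix = c[:i + 1]
            if (i + 1 < len(c) and remaining[prefix] > 1
                    and prefix not in cache and len(cache) < max_slots):
                cache[prefix] = state
        results.append(state)
        # expiration: this circuit's uses of its prefixes are now spent
        for i in range(1, len(c)):
            remaining[c[:i]] -= 1
            if remaining[c[:i]] == 0:
                cache.pop(c[:i], None)
    return results
```

Idea 1 would slot into the same structure by ordering candidate prefixes by their `remaining` counts rather than admitting them first-come-first-served.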
Make use of BLAS primitives: Right now the low-level matrix-vector and vector-vector operations use some handwritten C++ code. We could instead re-implement these using BLAS primitives. Doing this portably would be a huge PITA, but we might be able to use the C API for NumPy or SciPy or something to manage that aspect. If there were benefits here, they would most likely come from taking advantage of platform-dependent SIMD extensions, CPU cache behavior, etc. that linear algebra package authors have thought about carefully, without needing to figure this out ourselves.
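For reference, the two kernels in question map directly onto BLAS level-2 and level-1 calls; SciPy exposes its bundled BLAS through `scipy.linalg.blas` (the C-level story would of course look different, but the primitives are the same):

```python
import numpy as np
from scipy.linalg.blas import dgemv, ddot

# The hand-written mat-vec and dot kernels correspond to BLAS dgemv and ddot.
G = np.asfortranarray(np.random.default_rng(1).standard_normal((4, 4)))
rho = np.random.default_rng(2).standard_normal(4)

propagated = dgemv(1.0, G, rho)         # y = 1.0 * G @ rho  (BLAS level 2)
overlap = ddot(propagated, propagated)  # <y, y>             (BLAS level 1)
```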
Faster matrix exponentials: This one is easy because we don't actually need to do anything. With the updated implementation of error generator parameter updates, we're now at a point where one of the main bottlenecks is the matrix exponential calculation itself (on the order of 50% of that subroutine in the two-qubit case). Thanks to contributions by @eendebakpt and collaborators, an updated matrix exponential implementation is coming to SciPy, likely in the next release, so all we need to do is recommend folks update their installation (and, if we wanted, consider bumping the minimum version requirement down the line).