Accel/expval #481

Merged
merged 13 commits into from
Aug 25, 2023

Conversation

vincentmr
Contributor

@vincentmr vincentmr commented Aug 23, 2023

Before submitting

Please complete the following checklist when submitting a PR:

  • All new features must include a unit test.
    If you've fixed a bug or added code that should be tested, add a test to the
    tests directory!

  • All new functions and code must be clearly commented and documented.
    If you do make documentation changes, make sure that the docs build and
    render correctly by running make docs.

  • Ensure that the test suite passes, by running make test.

  • Add a new entry to the .github/CHANGELOG.md file, summarizing the
    change, and including a link back to the PR.

  • Ensure that code is properly formatted by running make format.

When all the above are checked, delete everything above the dashed
line and fill in the pull request template.


Context:
In the LKokkos backend, there are two ways to implement expectation values:

  1. Copy the statevector, apply the observable to the copy, and compute the expectation value with a BLAS-like inner product.
  2. Accumulate the expectation value on the fly, applying the observable to a portion of the statevector at a time.

The first method is currently in use, but it is wasteful in a couple of ways:

  • The statevector copy can require a lot of memory, so simulations run out of memory with one or two fewer qubits, depending on the system.
  • The computation is asymptotically memory-bound, and reduce operations can achieve roughly twice the bandwidth of for-loop operations (on Nvidia devices, for example), which favors on-the-fly implementations.
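To make the contrast between the two methods concrete, here is a minimal serial sketch in pure Python for a PauliZ observable. It is an illustration only, not the Kokkos implementation: the function names, the random test state, and the convention that wire 0 is the most significant bit are all assumptions made for this example.

```python
import random


def random_state(n_wires, seed=0):
    """Return a normalized random statevector as a list of complex amplitudes."""
    rng = random.Random(seed)
    psi = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2**n_wires)]
    norm = sum(abs(a) ** 2 for a in psi) ** 0.5
    return [a / norm for a in psi]


def expval_z_copy(psi, n_wires, wire):
    """Method 1: copy the state, apply PauliZ to the copy, take <psi|copy>."""
    shift = n_wires - 1 - wire  # assumed convention: wire 0 is the top bit
    phi = list(psi)  # full statevector copy (doubles the memory footprint)
    for i in range(len(phi)):
        if (i >> shift) & 1:
            phi[i] = -phi[i]  # PauliZ flips the sign where the target bit is 1
    return sum(a.conjugate() * b for a, b in zip(psi, phi)).real


def expval_z_reduce(psi, n_wires, wire):
    """Method 2: accumulate the expectation value in a single pass, no copy."""
    shift = n_wires - 1 - wire
    acc = 0.0
    for i, a in enumerate(psi):
        prob = a.real * a.real + a.imag * a.imag
        acc += -prob if (i >> shift) & 1 else prob
    return acc


psi = random_state(4)
print(expval_z_copy(psi, 4, 2), expval_z_reduce(psi, 4, 2))
```

Both functions return the same value up to rounding; the second reads each amplitude once and never allocates a second statevector, which is the property the on-the-fly approach relies on.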

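A simple benchmarking script like the one below yields the following average times per call for 29 qubits (on Perlmutter's A100):

  1. 0.330 sec. (statevector copy and inner product)
  2. 0.012 sec. (on-the-fly accumulation)

i.e. a 27.5x speed-up.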

import pennylane as qml
import time

n_wires = 29
n_repeat = 100

dev = qml.device("lightning.kokkos", wires=n_wires)

@qml.qnode(dev)
def circuit():
    return qml.expval(qml.PauliZ(n_wires//2))

t0 = time.time()
for _ in range(n_repeat):
    circuit()
dt = (time.time() - t0)/n_repeat
print(f"{dt}")
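The script above includes the first call in the average, which may carry one-time setup costs. A possible variant, sketched here with a hypothetical `time_call` helper (not part of PennyLane), runs a warm-up call first and uses `time.perf_counter`, which is intended for interval timing:

```python
import time


def time_call(fn, n_repeat=100, n_warmup=1):
    """Return the average wall-clock time per call of fn, after warm-up calls."""
    for _ in range(n_warmup):
        fn()  # untimed call to absorb one-time setup costs
    t0 = time.perf_counter()
    for _ in range(n_repeat):
        fn()
    return (time.perf_counter() - t0) / n_repeat


# Stand-in workload; with PennyLane installed one would pass the QNode instead,
# e.g. time_call(circuit).
print(f"{time_call(lambda: sum(range(1000))):.3e} s per call")
```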

Description of the Change:

Benefits:

Possible Drawbacks:

Related GitHub Issues:

@vincentmr vincentmr changed the base branch from master to bugfix/cuda12 August 23, 2023 19:24
@codecov

codecov bot commented Aug 23, 2023

Codecov Report

❗ No coverage uploaded for pull request base (bugfix/cuda12@6dc7883). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff                @@
##             bugfix/cuda12     #481   +/-   ##
================================================
  Coverage                 ?   96.84%           
================================================
  Files                    ?      142           
  Lines                    ?    16275           
  Branches                 ?        0           
================================================
  Hits                     ?    15761           
  Misses                   ?      514           
  Partials                 ?        0           

@vincentmr vincentmr marked this pull request as ready for review August 24, 2023 12:42
@vincentmr vincentmr requested a review from mlxd August 24, 2023 12:42
Contributor

@AmintorDusko AmintorDusko left a comment

Amazing work. Do we know if these changes will allow us to reach a larger number of qubits?
If yes, what are the numbers we have?

@vincentmr
Contributor Author

vincentmr commented Aug 24, 2023

Amazing work. Do we know if these changes will allow us to reach a larger number of qubits? If yes, what are the numbers we have?

From what I could see on Perlmutter, we can reach 30 qubits instead of 29 on an A100 card. The main thing is that 1- and 2-qubit expvals are much faster.
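A rough back-of-the-envelope check of why dropping the copy can buy one extra qubit, assuming double-precision complex amplitudes (16 bytes each). The precision and the helper name are assumptions for this sketch; the real footprint also includes other allocations, and the card's memory capacity is not stated in this thread.

```python
def statevector_gib(n_qubits, bytes_per_amplitude=16):
    """Size in GiB of a statevector with 2**n_qubits amplitudes."""
    return (2**n_qubits) * bytes_per_amplitude / 2**30


for n in (29, 30):
    single = statevector_gib(n)
    # Method 1 needs the state plus a full copy; method 2 needs only the state.
    print(f"{n} qubits: state only = {single:.0f} GiB, state + copy = {2 * single:.0f} GiB")
```

Under these assumptions, a 30-qubit state without a copy takes the same space as a 29-qubit state with one.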

@vincentmr
Contributor Author

vincentmr commented Aug 24, 2023

In support of this PR, I made a gist that benchmarks expval on a CPU (OMP_NUM_THREADS=64) and a GPU (A100). The new implementation is generally faster, especially for 1- and 2-qubit operators above 20 qubits. In the following figures, inner and team stand for the inner-product-based and TeamPolicy (this PR) algorithms, respectively; first stands for targeting low-index qubits (only first is shown, since timings are similar across first, mid and last targets); and 1, 2, 3 stand for the number of wires targeted by the Hermitian unitary. The timings for 1-qubit Hermitian unitaries are similar to those of named gates (e.g. PauliZ) with an equal number of wires.

benchmarks_CPU
benchmarks_GPU
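For readers unfamiliar with what an on-the-fly expval of an arbitrary 1-qubit Hermitian matrix involves, the following serial pure-Python sketch may help. It is not the TeamPolicy kernel from this PR: the function name and the convention that wire 0 is the most significant bit are assumptions for illustration.

```python
def expval_hermitian_1q(psi, n_wires, wire, mat):
    """Accumulate <psi|H|psi> for a 2x2 Hermitian mat acting on one wire.

    The statevector is visited in pairs of amplitudes that differ only in the
    target bit, so no copy of psi is needed.
    """
    mask = 1 << (n_wires - 1 - wire)  # assumed convention: wire 0 is the top bit
    acc = 0.0
    for i0 in range(len(psi)):
        if i0 & mask:
            continue  # each pair is handled once, from its target-bit-0 index
        i1 = i0 | mask
        a0, a1 = psi[i0], psi[i1]
        acc += (
            a0.conjugate() * (mat[0][0] * a0 + mat[0][1] * a1)
            + a1.conjugate() * (mat[1][0] * a0 + mat[1][1] * a1)
        ).real
    return acc


pauli_x = [[0, 1], [1, 0]]
pauli_z = [[1, 0], [0, -1]]
s = 2**-0.5
plus_zero = [s + 0j, 0j, s + 0j, 0j]  # |+> on wire 0, |0> on wire 1
print(expval_hermitian_1q(plus_zero, 2, 0, pauli_x))
print(expval_hermitian_1q(plus_zero, 2, 1, pauli_z))
```

The loop body is a reduction over independent pairs, which is the kind of structure that maps onto a parallel reduce on a CPU or GPU.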

@vincentmr
Contributor Author

Bonus: LKokkos benchmarks run on LUMI's AMD cards.
benchmarks_HIP

Contributor

@AmintorDusko AmintorDusko left a comment

Nothing more to ask. You did a great job here!

mlxd
mlxd previously approved these changes Aug 25, 2023
Member

@mlxd mlxd left a comment

💯

Member

@mlxd mlxd left a comment

Happy with the revision. Thanks @vincentmr

@vincentmr vincentmr merged commit fb82dea into bugfix/cuda12 Aug 25, 2023
59 checks passed
@vincentmr vincentmr deleted the accel/expval branch August 25, 2023 14:48
vincentmr added a commit that referenced this pull request Aug 25, 2023
* M  pennylane_lightning/core/src/bindings/Bindings.hpp; hack `JacobianData` to work with devices.
M  pennylane_lightning/core/src/simulators/lightning_kokkos/StateVectorKokkos.hpp; `applyMatrix` bugfix: use intermediate hostview to copy matrix data; same bugfix for `getDataVector`.
M  pennylane_lightning/core/src/simulators/lightning_kokkos/algorithms/AdjointJacobianKokkos.hpp; use copy constructor.
M  pennylane_lightning/core/src/simulators/lightning_kokkos/measurements/MeasurementsKokkos.hpp; use copy constructor.
M  pennylane_lightning/core/src/simulators/lightning_kokkos/observables/ObservablesKokkos.hpp; use copy constructor.
M  requirements-dev.txt; add clang-format-14.

* Auto update version

* Update changelog.

* Auto update version

* Auto update version

* Add an argument to adjointJacobian to avoid syncing and copying state vector data in adjoint-diff.

* Reformat

* trigger CI

* [skip ci] Update changelog.

* Auto update version

* Auto update version

* Accel/expval (#481)

* Introduce std::unordered_map<std::string, ExpValFunc> expval_funcs_.

* Introduce applyExpectationValueFunctor.

* Add binding to LKokkos expval(matrix, wires). Combine expval functor calls into two templated methods. Call specialized expval methods when possible. Remove obsolete 'Apply directly' tests.

* Update changelog.

* Add test for arbitrary expval(Hermitian).

* Add getExpectationValueMultiQubitOpFunctor.

* Add typename hint for macos.

* Add typename macos.

* Use Kokkos::ThreadVectorRange policy for innerloop in getExpectationValueMultiQubitOpFunctor.

* Couple fix for HIP.

* Use inner product scheme instead of getExpectationValueMultiQubitOpFunctor to compute multi-qubit expval.

---------

Co-authored-by: Dev version update bot <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Amintor Dusko <[email protected]>
@vincentmr vincentmr mentioned this pull request Aug 28, 2023
5 tasks
3 participants