The speedup using GPU over CPU execution seems unexpectedly low #1189

Open
mtarabkhah opened this issue Oct 7, 2024 · 3 comments

Comments

@mtarabkhah

Issue description

I am benchmarking quantum circuits using Catalyst on a GPU. However, the speedup over CPU execution seems unexpectedly low.

  • Expected behavior: Much higher speedup (on the order of ~1000x)

  • Actual behavior: ~5x speedup using GPU over CPU

  • Reproduces how often: always

  • System information:

Name: PennyLane
Version: 0.38.0
Summary: PennyLane is a cross-platform Python library for quantum computing, quantum machine learning, and quantum chemistry. Train a quantum computer the same way as a neural network.
Home-page: https://github.com/PennyLaneAI/pennylane
Author: 
Author-email: 
License: Apache License 2.0
Location: /home/mei/.local/lib/python3.12/site-packages
Requires: appdirs, autograd, autoray, cachetools, networkx, numpy, packaging, pennylane-lightning, requests, rustworkx, scipy, toml, typing-extensions
Required-by: PennyLane-Catalyst, PennyLane_Lightning, PennyLane_Lightning_GPU, PennyLane_Lightning_Kokkos

Platform info:           Linux-6.8.0-45-generic-x86_64-with-glibc2.39
Python version:          3.12.4
Numpy version:           1.26.4
Scipy version:           1.12.0
Installed devices:
- lightning.gpu (PennyLane_Lightning_GPU-0.38.0)
- nvidia.custatevec (PennyLane-Catalyst-0.8.1)
- nvidia.cutensornet (PennyLane-Catalyst-0.8.1)
- oqc.cloud (PennyLane-Catalyst-0.8.1)
- softwareq.qpp (PennyLane-Catalyst-0.8.1)
- default.clifford (PennyLane-0.38.0)
- default.gaussian (PennyLane-0.38.0)
- default.mixed (PennyLane-0.38.0)
- default.qubit (PennyLane-0.38.0)
- default.qubit.autograd (PennyLane-0.38.0)
- default.qubit.jax (PennyLane-0.38.0)
- default.qubit.legacy (PennyLane-0.38.0)
- default.qubit.tf (PennyLane-0.38.0)
- default.qubit.torch (PennyLane-0.38.0)
- default.qutrit (PennyLane-0.38.0)
- default.qutrit.mixed (PennyLane-0.38.0)
- default.tensor (PennyLane-0.38.0)
- null.qubit (PennyLane-0.38.0)
- lightning.kokkos (PennyLane_Lightning_Kokkos-0.39.0.dev11)
- lightning.qubit (PennyLane_Lightning-0.38.0)

Source code and tracebacks

I have provided two code samples, along with more information on the execution times, in the Catalyst-GPU-QS repo.

Additional information

Here are some sample execution times for a 26-qubit GHZ circuit:

  • GHZ1 (using lightning.qubit on CPU):
    Execution time: 2.6811 seconds
  • GHZ2 (using lightning.gpu on GPU):
    Execution time: 0.5751 seconds

This results in a 4.66x speedup with the GPU version, which seems relatively low for GPU acceleration.
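As a quick check of that figure, the ratio can be recomputed from the two timings reported above. This is only a sketch of the arithmetic; the helper name `speedup` is made up for illustration and is not part of PennyLane or Catalyst.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run is than the baseline."""
    return baseline_seconds / accelerated_seconds


# Execution times reported above for the 26-qubit GHZ circuit, in seconds.
cpu_seconds = 2.6811  # GHZ1, lightning.qubit on CPU
gpu_seconds = 0.5751  # GHZ2, lightning.gpu on GPU

print(f"GPU speedup: {speedup(cpu_seconds, gpu_seconds):.2f}x")  # prints "GPU speedup: 4.66x"
```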

For comparison, running this quantum circuit in Qiskit yielded the following:

  • GPU execution in Qiskit was ~760x faster than the CPU version.
  • While Catalyst showed better CPU performance than Qiskit, its GPU performance lagged behind.

P.S. I have not used for loops in the creation of the circuits, as I am using code to automatically generate the circuits based on a provided list of gates. I assume this should not affect the performance.

@josh146
Member

josh146 commented Oct 7, 2024

Hi @mtarabkhah! Thanks for the benchmarking information.

Catalyst doesn't yet support lightning.gpu, but this is a work in progress and coming shortly. I'm curious how you are benchmarking Catalyst with GPU support?

@josh146
Member

josh146 commented Oct 7, 2024

> P.S. I have not used for loops in the creation of the circuits, as I am using code to automatically generate the circuits based on a provided list of gates. I assume this should not affect the performance.

Note that Catalyst-compatible for loops (using either qml.for_loop or qjit(autograph=True)) will in fact improve performance, as the circuit will have a compressed representation :)

@mtarabkhah
Author

Hi @josh146,

Thanks for your reply.

I'm currently using lightning.gpu from PennyLane for the GPU version. Is there another way to use Catalyst for GPU execution?

Could you please review the provided code and suggest ways to improve performance, particularly using Catalyst on GPU?

I also appreciate the comment about "Catalyst-compatible for loops" and will look into that.
