From 183bd44327ad52302c5f1dbd0ba15e4150c24927 Mon Sep 17 00:00:00 2001 From: Shuli Shu <31480676+multiphaseCFD@users.noreply.github.com> Date: Tue, 31 Oct 2023 10:50:14 -0400 Subject: [PATCH] Add v0.33 rc changes (#538) * Create v0.33.0 RC branch. * Update changelog [skip ci] * Update pennylane_requires [skip ci] * Change master => v0.33.0-rc0 in wheels and fix gpu ci files (removing latest pl checkouts). * Trigger multiple GPU CI * add sync in Hamiltonian obs * add sync to MPILinearAlg * add sync for adjoint tests * sync for gate ops in MPI backend * add more sync * add more sync before upData * add more syncs * update adj unit tests * Add LGPU docs (#525) * init commit * Auto update version * add changelog * Update readme. * shush CI [skip ci] * Auto update version * Fix README links and code-blocks. [skip ci] * Fix card links. * Fix obs signature to match LK/LQ * Revert card links. [skip ci] * update docs * update readme * Reorder cards and add docker support section. [skip ci] * Build with CUDA on the CI for correct API gen * Add docker.rst [skip ci]. * Add Cuda 11.8 install * Lower CUDA version * Fix typo in name and paths * Disable CUDA checks for RTD * update readme * update LGPU installation steps * Turn off GPU runners. * Update CUDA wheel builder * add docstring in lightning_gpu.py * Change kokkos gpu order. [skip ci] * Fix some headings and toctrees [skip ci]. * Add GPU test workflows to plugin test matrix [sc-48529] (#528) * update measurement * add openmp to adjgpu * Auto update version * Add support for building multiple backend simulators (#497) * Add PL_BACKEND_LIST * Update the support * Exclude Python bindings * Update HermitianObs name scope conflicts * Auto update version * Cleanup * Update CI to build and check C++ tests of multiple backends (Linux) * Update changelog * Update .github/workflows/tests_linux.yml Co-authored-by: Vincent Michaud-Rioux * Apply code review suggestions * Update .github/workflows/tests_linux.yml Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com> --------- Co-authored-by: Dev version update bot Co-authored-by: Vincent Michaud-Rioux Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com> * add python layer & isingxy gate in the cpp layer * add batched adjoint method * Update DefaultQubit to DefaultQubitLegacy (#500) * Update DefaultQubit to DefaultQubitLegacy * Update changelog * update pylint disable on fallback * Auto update version * add batch support for adjoint method * add gitignore * tidy up code * Auto update version * make format * revert complexT delete in LKokkosBingds * make format * update based on tidy * fix tidy format * add_gpu_runners_tests * add cuquantum_sdk path to ci workflow * debug * add path to cuquantum sdk * add python layer tests in ci workflow * ci tests * quick fix * skip pr ci for some workflows * quick fix * quick fix * update python ci tests * remove dependency on lightning_qubit in ci * fix directory * fix directory * quick fix * quick fix * test for cuda-12 * update measurement * updata cu12 workflows * add getDataVector support to LQubitRaw * install lightning.qubit before lightning.gpu in ci * update test_obs * activate all CI checks * quick fix * tidy up code * tidy up code * make format * update ci for more tests * tidy up code * tidy up code * tidy up code * make format * fix for codecov * codecov fix * quick fix * quick fix * quick fix * quick test * fix test * fix tests * another quick fix * coverage fix * update ci tests * update ci for no binary * 
codecov fix * update adj tests for no binary case * update python layer tests * fix codecov * make format * initial commit for MPI * revert to cu11 * enable more py tests * update CI * upload codecov ci * add more tests for statevectorcudamanaged * add more unit tests * add more tests * make format * add more cpp tests * skip cpp tests pauli param gates * make format * add more files to gitignore * Auto update version * init commit * Trigger CI * update gpu runner * quick fix * update fix * add cpp layer for LGPU-MPI backend * add py layer * quick fix * make format * fix for fp32 support in expval calculation * quick fix * fix for cray_mpich_serialize_py * copy to move for hamiltonian operation * add unit tests for adjoint method * add more tests * resolve comments py layer * remove omp support in LGPU * update version * Auto update version * fix based on comments * Add L-GPU and L-Kokkos as package extras (#515) * Add L-GPU and L-Kokkos as package extras * Auto update version * Update changelog * Temp enable the x86 wheel cache * Return wheel storage functionality to normal * Update readme * Auto update version * Trigger CI * Update README.rst Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com> --------- Co-authored-by: Dev version update bot Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com> * Auto update version * make format * remove sparseH * remove applyHostMatrixGate * Add wheel_linux_x86_64_cu11.yml (#517) * Add wheel_linux_x86_64_cu11.yml * echo COMPILER=g++ >> * python3.9 => python * reinstall g++11 * Try that * Use env vars for versions. * Fix var syntax. * Hardcode versions * Fix custatevec-cu11 * Revert triggers. * Update changelog [skip ci] * resolve more comments * add more tests to non_param gates * resolve cpp tests comments * remove unused methods in measurement class * remove unused methods * resolve more comments * add changelog and matrixhasher * quick update * add more tests and merge base branch * add mpi unit tests for algorithm base class * add more unit tests for utils * ctor test for MPIManager * Add mpi tests to LGPU (#519) * Initial commit mpi tests * Remove label guards * Fix PL_DEVICE * Install lightning_qubit. * Fix ENABLE_MPI * print cuquantum * export cu_sdk * revert define * Debug cpp tests. * Debug cpp tests. * Fix cmake options. * Compile with mpicxx * Specify backend. * Specify backend. * Remove obsolete line. * Specify cov backend * Merge test/cov & try simplifying python * if-no-files-found: error and fix python tests. * Fix mpi find * Install real lightning. * Revert python tests. * Hardcode backend values in python tests * Install lightning_qubit with gpu in python tests * Remove explicit mpich mentions. * Parametrize mpilib name. * Add openmpi tests. * Build only openmpi python tests. * Add timeouts * test/test_apply.py * Revert pull triggers. * Clean gpu-mpi test workflows. * Revert to 804ed24. * Revert back. * Update tests_linux_x86_mpi.yml [ci skip] * Add jobs dep. * Remove module unload * Simplify mpi-gpu tests. * trigger CI * unset CFLAGS. * set CFLAGS * Revert triggers. 
* Fix pull_request: [skip ci] * trigger CI * Rename test_gpu_cu11.yml -> tests_gpu_cu11.yml [skip ci] * add CI checks for cpp unit tests * add cpp layer ci check for mpi backend * Auto update version * remove redundant blank lines * tidy up code * Trigger CI * remove single GPU backend tests in mpi ci * upload codecov results * add more unit tests * add tests for pauli word based expval * add more docs * add more tests * skip lcov for native gates * add mpi_helpers * add more docstrings * add change log * Auto update version * Auto update version * fix failures caused by merging * add changelog * Trigger multi-GPU runner * add more fp32 tests to the measurement class * add number of devices and mpi procs check * Add coverage for py-mpitests. (#522) * Initial commit mpi tests * Remove label guards * Fix PL_DEVICE * Install lightning_qubit. * Fix ENABLE_MPI * print cuquantum * export cu_sdk * revert define * Debug cpp tests. * Debug cpp tests. * Fix cmake options. * Compile with mpicxx * Specify backend. * Specify backend. * Remove obsolete line. * Specify cov backend * Merge test/cov & try simplifying python * if-no-files-found: error and fix python tests. * Fix mpi find * Install real lightning. * Revert python tests. * Hardcode backend values in python tests * Install lightning_qubit with gpu in python tests * Remove explicit mpich mentions. * Parametrize mpilib name. * Add openmpi tests. * Build only openmpi python tests. * Add timeouts * test/test_apply.py * Revert pull triggers. * Clean gpu-mpi test workflows. * Revert to 804ed24. * Revert back. * Update tests_linux_x86_mpi.yml [ci skip] * Add jobs dep. * Remove module unload * Simplify mpi-gpu tests. * trigger CI * unset CFLAGS. * set CFLAGS * Revert triggers. * Fix pull_request: [skip ci] * trigger CI * Rename test_gpu_cu11.yml -> tests_gpu_cu11.yml [skip ci] * Add coverage for py-mpitests. * Upload mpi-gpu test coverage. * Try other paths. * trigger CI * Add mpi tests. * Fix couple tests. * Fixx test_apply tests? * Add MPI sparse measurements. * Fix format. * Add MPI_Init checks in MPIManager constructors. * Reformat mpitests and add cov for proc > dev error. * Refactor makefile. * Revert to full mpirun path. * Fix couple tests. * Name coverage after matrix.mpilib. * Remove oversubscribe MPI test. * Update changelog [skip ci]. --------- Co-authored-by: Shuli <08cnbj@gmail.com> * add more tests in obs base class * Revert "Merge branch 'add_LGPUMPI' into add_py_LGPUMPI" This reverts commit d3af81987fa6553d1975abf9b5aa9c17bd0edf63, reversing changes made to 6ad1c7c8fd4cee21d7ca3b91aa349e7d1dd2e8ed. * Fix pylint [skip ci] * resolve comments on source codes and tidy up code * Use CRTP to define initSV and remove initSV_MPI * resolve more typos * resolve more typoes * resolve adjoint class * remove py&pybind layer * resolve more comments * Remove redundant blank line * add num mpi & ngpudevice proc check * fix typo * remove unused lines * add more tests * remove initsv_mpi * add reset * make format * use_mpi as _use_mpi in QuantumScriptSerializer * resolve more comments * check->require * make format * rename mpi workflow * Update license. * Add GPU tests in compat workflows. * Add pull_request triggers. * Comment pull_request triggers except compat. * Comment pull_request triggers except compat-latest-latest. * shush CI [skip ci] * Add sparseH for LGPU (#526) * Init commit * Fix std::endl; * Use more generic indices in base std::size_t. * add pybind layer * add python layer * Quick and dirty spham bindings. 
* Add sparse_ham serialization. * Add sparse_ham tests in tests/test_adjoint_jacobian.py' * Bug fix sparse product. * add sparseH * Trigger CI * Fix python bindings LGPU idxT * Fix serial tests and update changelog. * add more unit tests for sparseH base class * Fix tidy & sparse adjoint test device name. * Fix tidy warning for sparse_ham. * Send backend-specific ops in respective modules. * Fix sparse_hamiltonianmpi_c and add getWires test. * Add sparseH diff capability in LQ. * Add sparse Hamiltonian support for Lightning-Kokkos (#527) * Use more generic indices in base std::size_t. * Quick and dirty spham bindings. * Add sparse_ham serialization. * Add sparse_ham tests in tests/test_adjoint_jacobian.py' * Bug fix sparse product. * Fix python bindings LGPU idxT * Fix serial tests and update changelog. * Fix tidy & sparse adjoint test device name. * Fix tidy warning for sparse_ham. * Send backend-specific ops in respective modules. * Fix sparse_hamiltonianmpi_c and add getWires test. * Fix clang tidy * Comment workflows but tidy. * Fix tidy warn * Add override to sp::getWires * Restore triggers * Update tests_linux_x86_mpi.yml * Add constructibility tests. * Move L-Kokkos-CUDA tests to workflow call, called from tests_gpu_cu11.yml. * Remove GPU deadlock. * Bug fix Python MPI. * Upload both outputs. * Update gcc version in format.yml. * Update .github/CHANGELOG.md [skip ci] Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com> * Update .github/workflows/tests_gpu_kokkos.yml [skip ci] Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com> * rename argn [skip ci] * Remove unused lines [skip ci] * Fix SparseHamiltonianBase::isEqual. [skip ci] * Trigger CI * Auto update version * Trigger CI * resolve comments * rename dev_kokkos to dev * Fix tidy. --------- Co-authored-by: Vincent Michaud-Rioux Co-authored-by: Vincent Michaud-Rioux Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com> Co-authored-by: Dev version update bot * update work flow * resolve comments for unit tests * add more unit tests for sparseH * quick fix * add fp32 tests * tidy up code * remove redundant lines * add pylintrc to mpitests * add mpitests dir to commit-config * add mpitests to .coveragerc * add mpitests path to coveragerc * Add LGPU cu11 workflow to compat. * Add all workflows to latest-latest. * Fix jobs names. * Fix mpitests/test_adjoint_jacobian.py * Fix pylint in mpitests/test_apply [skip ci]. * pylint fix for mpi py_d_e_m_p tets * tidy up cpp code * fix codefactor * revert skipp condition for openfermionpyscf * codefactor fix * add sparseH tests for mpi backend * Install openfermion in CI workflows and fix H2 QChem integration test. * Add LGPU_MPI tests to compat. * update changelog * Add gpu workflows to all compat [skip ci]. * Trigger CI * Change cron time. * Fix tests_lgpu_gpu_mpi name * Fix gpu runner * Turn off compat CPP tests. * workflow_call => pull_request temp * rm -rf Kokkos before mkdir * Dont' run cpp-tests * Use random parameters in test_integration_H2_Hamiltonian * Use 2 contains * Use pytest-rerunfailures in mpi_gpu step * Change cov.xml name * Remove rerun-failures * Try symmetry-breaking mol close to eq. * Add parallel True in .cov * Revert params and add diff names for cov.xml * Add barrier. * Test runscript openmpi * Fix yml format * call bash * Revert couple changes and remove MPI from compat workflows. * Revert changes to src and tests. * Revert triggers. * Auto update version * Remove pull_req trigger from compats. 
* Revert changes to MPI workflow. [skip ci] * Trigger CI --------- Co-authored-by: Shuli Shu <08cnbj@gmail.com> Co-authored-by: Dev version update bot Co-authored-by: Ali Asadi Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com> Co-authored-by: Lee James O'Riordan Co-authored-by: Shuli Shu <31480676+multiphaseCFD@users.noreply.github.com> --------- Co-authored-by: Dev version update bot Co-authored-by: Vincent Michaud-Rioux Co-authored-by: Lee J. O'Riordan Co-authored-by: Lee James O'Riordan Co-authored-by: Vincent Michaud-Rioux Co-authored-by: Ali Asadi Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com> * Download wheels. * Add pytest-bench dep in requirements-dev.txt * Fix RST formatting in README * Fix matrix in GPU wheels. * Build linux aarch ppc * Add sdist in noarch wheels. * Fix backend name in wheel_linux_x86_64_cu11.yml * Do not upload sdist. * quick fix py unit tests for adj mpi * Uncomment mpi tests trigger * Add block-list for auditwheel builds with L-GPU (#534) * Add block-list for auditwheel builds with L-GPU * Update mode permission for auditwheel * Update .github/CHANGELOG.md [skip ci] Co-authored-by: Lee James O'Riordan * Update .github/CHANGELOG.md [skip ci] Co-authored-by: Lee James O'Riordan * Remove custatevec from req [skip ci] * remove unneccessary barrier in cpp backend * turn on H2_Ham tests for LK&LQ * Fix changelog and link in README. [skip ci] * Use long lightning titles. [skip ci] * Remove broken links [skip ci]. * Remove stray div end * Revert wheel files triggers. * test (#535) * test * Trigger MPI CI * Use runscript with openmpi. * Install pytest-xdist. * Use coverage directly in mpi tests. * Revert trigger comment. * Remove placeholder diff.md [skip ci] --------- Co-authored-by: Shuli Shu <08cnbj@gmail.com> * udpate changelog * Trigger MGPU CI * Auto update version * update typo * update changelog and mpi_gpu.yml --------- Co-authored-by: Vincent Michaud-Rioux Co-authored-by: Dev version update bot Co-authored-by: Lee J. 
O'Riordan Co-authored-by: Lee James O'Riordan Co-authored-by: Vincent Michaud-Rioux Co-authored-by: Ali Asadi Co-authored-by: Amintor Dusko <87949283+AmintorDusko@users.noreply.github.com> --- .coveragerc | 3 + .github/CHANGELOG.md | 38 +- .github/workflows/tests_gpu_cu11.yml | 1 - .github/workflows/tests_linux_x86_mpi_gpu.yml | 25 +- .github/workflows/wheel_linux_x86_64_cu11.yml | 6 +- .github/workflows/wheel_noarch.yml | 37 +- .readthedocs.yml | 5 +- README.rst | 336 +++++++++++------- bin/auditwheel | 25 ++ doc/code/__init__.rst | 9 +- doc/docker.rst | 3 + doc/index.rst | 9 +- doc/installation.rst | 5 + doc/lightning_gpu/device.rst | 284 +++++++++++++++ doc/lightning_gpu/installation.rst | 3 + doc/lightning_gpu/package.rst | 19 + doc/lightning_qubit/development/index.rst | 2 +- doc/lightning_qubit/device.rst | 3 - doc/requirements.txt | 2 + mpitests/test_adjoint_jacobian.py | 45 ++- pennylane_lightning/core/_version.py | 2 +- pennylane_lightning/core/lightning_base.py | 4 +- .../lightning_gpu/StateVectorCudaMPI.hpp | 8 + .../algorithms/AdjointJacobianGPUMPI.hpp | 4 +- .../observables/ObservablesGPUMPI.hpp | 4 +- .../lightning_gpu/utils/MPILinearAlg.hpp | 3 +- .../lightning_gpu/lightning_gpu.py | 95 ++++- requirements-dev.txt | 4 +- requirements.txt | 2 +- tests/test_adjoint_jacobian.py | 2 +- 30 files changed, 788 insertions(+), 200 deletions(-) create mode 100755 bin/auditwheel create mode 100644 doc/docker.rst create mode 100644 doc/lightning_gpu/device.rst create mode 100644 doc/lightning_gpu/installation.rst create mode 100644 doc/lightning_gpu/package.rst diff --git a/.coveragerc b/.coveragerc index e9d7866fff..e7744687dd 100644 --- a/.coveragerc +++ b/.coveragerc @@ -5,6 +5,9 @@ omit = tests/* mpitests/* +[coverage:run] +parallel = true + [report] # Regexes for lines to exclude from consideration exclude_lines = diff --git a/.github/CHANGELOG.md b/.github/CHANGELOG.md index 0ce1d532ef..9eb46730fc 100644 --- a/.github/CHANGELOG.md +++ b/.github/CHANGELOG.md @@ -6,6 +6,9 @@ ### Improvements +* Add MPI synchronization in places to safely handle communicated data. + [(#538)](https://github.com/PennyLaneAI/pennylane-lightning/pull/538) + * Add release option in compatibility cron jobs to test the release candidates of PennyLane and the Lightning plugins against one another. [(#531)] (https://github.com/PennyLaneAI/pennylane-lightning/pull/531) @@ -16,18 +19,22 @@ ### Bug fixes +* Fix MPI Python unit tests for the adjoint method. + [(#538)](https://github.com/PennyLaneAI/pennylane-lightning/pull/538) + ### Contributors This release contains contributions from (in alphabetical order): -Vincent Michaud-Rioux - ---- +Vincent Michaud-Rioux, Shuli Shu # Release 0.33.0 ### New features since last release +* Add documentation updates for the `lightning.gpu` backend. + [(#525)] (https://github.com/PennyLaneAI/pennylane-lightning/pull/525) + * Add `SparseHamiltonian` support for Lightning-Qubit and Lightning-GPU. [(#526)] (https://github.com/PennyLaneAI/pennylane-lightning/pull/526) @@ -40,7 +47,7 @@ Vincent Michaud-Rioux * Integrate the distributed C++ backend of Lightning-GPU into the Lightning monorepo. [(#514)] (https://github.com/PennyLaneAI/pennylane-lightning/pull/514) -* Integrate Lightning-GPU into the Lightning monorepo. The new backend is named `lightning_gpu` and includes all single-GPU features. +* Integrate Lightning-GPU into the Lightning monorepo. The new backend is named `lightning.gpu` and includes all single-GPU features. 
[(#499)] (https://github.com/PennyLaneAI/pennylane-lightning/pull/499) * Build Linux wheels for Lightning-GPU (CUDA-11). @@ -54,10 +61,10 @@ Vincent Michaud-Rioux ### Breaking changes -* Add `tests_gpu.yml` workflow to test the Lightning-Kokkos backend with CUDA-12. +* Add `tests_gpu.yml` workflow to test the Lightning-Kokkos backend with CUDA-12. [(#494)](https://github.com/PennyLaneAI/pennylane-lightning/pull/494) -* Implement `LM::GeneratorDoubleExcitation`, `LM::GeneratorDoubleExcitationMinus`, `LM::GeneratorDoubleExcitationPlus` kernels. L-Qubit default kernels are now strictly from the `LM` implementation, which requires less memory and is faster for large state vectors. +* Implement `LM::GeneratorDoubleExcitation`, `LM::GeneratorDoubleExcitationMinus`, `LM::GeneratorDoubleExcitationPlus` kernels. Lightning-Qubit default kernels are now strictly from the `LM` implementation, which requires less memory and is faster for large state vectors. [(#512)](https://github.com/PennyLaneAI/pennylane-lightning/pull/512) * Add workflows validating compatibility between PennyLane and Lightning's most recent stable releases and development (latest) versions. @@ -70,7 +77,7 @@ Vincent Michaud-Rioux * Cast integral-valued arrays to the device's complex type on entry in `_preprocess_state_vector` to ensure the state is correctly represented with floating-point numbers. [(#501)](https://github.com/PennyLaneAI/pennylane-lightning/pull/501) -* Update DefaultQubit to DefaultQubitLegacy on Lightning fallback. +* Update `DefaultQubit` to `DefaultQubitLegacy` on Lightning fallback. [(#500)](https://github.com/PennyLaneAI/pennylane-lightning/pull/500) * Enums defined in `GateOperation.hpp` start at `1` (previously `0`). `::BEGIN` is introduced in a few places where it was assumed `0` accordingly. @@ -87,16 +94,16 @@ Vincent Michaud-Rioux * Add support for `pip install pennylane-lightning[kokkos]` for the OpenMP backend. [(#515)](https://github.com/PennyLaneAI/pennylane-lightning/pull/515) -* Update setup.py to allow for multi-package co-existence. The PennyLane_Lightning package now is the responsible for the core functionality, and will be depended upon by all other extensions. +* Update `setup.py` to allow for multi-package co-existence. The `PennyLane_Lightning` package now is the responsible for the core functionality, and will be depended upon by all other extensions. [(#504)] (https://github.com/PennyLaneAI/pennylane-lightning/pull/504) -* Refactor LKokkos `StateVectorKokkos` class to use Kokkos `RangePolicy` together with special functors in `applyMultiQubitOp` to apply 1- to 4-wire generic unitary gates. For more than 4 wires, the general implementation using Kokkos `TeamPolicy` is employed to yield the best all-around performance. +* Redesign Lightning-Kokkos `StateVectorKokkos` class to use Kokkos `RangePolicy` together with special functors in `applyMultiQubitOp` to apply 1- to 4-wire generic unitary gates. For more than 4 wires, the general implementation using Kokkos `TeamPolicy` is employed to yield the best all-around performance. [(#490)] (https://github.com/PennyLaneAI/pennylane-lightning/pull/490) -* Refactor LKokkos `Measurements` class to use Kokkos `RangePolicy` together with special functors to obtain the expectation value of 1- to 4-wire generic unitary gates. For more than 4 wires, the general implementation using Kokkos `TeamPolicy` is employed to yield the best all-around performance. 
+* Redesign Lightning-Kokkos `Measurements` class to use Kokkos `RangePolicy` together with special functors to obtain the expectation value of 1- to 4-wire generic unitary gates. For more than 4 wires, the general implementation using Kokkos `TeamPolicy` is employed to yield the best all-around performance. [(#489)] (https://github.com/PennyLaneAI/pennylane-lightning/pull/489) -* Add tests to increase LKokkos coverage. +* Add tests to increase Lightning-Kokkos coverage. [(#485)](https://github.com/PennyLaneAI/pennylane-lightning/pull/485) * Add memory locality tag reporting and adjoint diff dispatch for `lightning.qubit` statevector classes. @@ -112,13 +119,16 @@ Vincent Michaud-Rioux ### Bug fixes +* Fix CI issues running python-cov with MPI. + [(#535)](https://github.com/PennyLaneAI/pennylane-lightning/pull/535) + * Re-add support for `pip install pennylane-lightning[gpu]`. [(#515)](https://github.com/PennyLaneAI/pennylane-lightning/pull/515) -* Switch most L-Qubit default kernels to `LM`. Add `LM::multiQubitOp` tests, failing when targeting out-of-order wires clustered close to `num_qubits-1`. Fix the `LM::multiQubitOp` kernel implementation by introducing a generic `revWireParity` routine and replacing the `bitswap`-based implementation. Mimic the changes fixing the corresponding `multiQubitOp` and `expval` functors in L-Kokkos. +* Switch most Lightning-Qubit default kernels to `LM`. Add `LM::multiQubitOp` tests, failing when targeting out-of-order wires clustered close to `num_qubits-1`. Fix the `LM::multiQubitOp` kernel implementation by introducing a generic `revWireParity` routine and replacing the `bitswap`-based implementation. Mimic the changes fixing the corresponding `multiQubitOp` and `expval` functors in Lightning-Kokkos. [(#511)](https://github.com/PennyLaneAI/pennylane-lightning/pull/511) -* Fix RTD builds by removing unsupported `sytem_packages` configuration option. +* Fix RTD builds by removing unsupported `system_packages` configuration option. [(#491)](https://github.com/PennyLaneAI/pennylane-lightning/pull/491) ### Contributors @@ -133,7 +143,7 @@ Ali Asadi, Amintor Dusko, Vincent Michaud-Rioux, Lee J. O'Riordan, Shuli Shu ### New features since last release -* The `lightning_kokkos` backend supports Nvidia GPU execution (with Kokkos v4 and CUDA v12). +* The `lightning.kokkos` backend supports Nvidia GPU execution (with Kokkos v4 and CUDA v12). [(#477)](https://github.com/PennyLaneAI/pennylane-lightning/pull/477) * Complete overhaul of repository structure to facilitates integration of multiple backends. Refactoring efforts we directed to improve development performance, code reuse and decrease overall overhead to propagate changes through backends. New C++ modular build strategy allows for faster test builds restricted to a module. Update CI/CD actions concurrency strategy. Change minimal Python version to 3.9. 
diff --git a/.github/workflows/tests_gpu_cu11.yml b/.github/workflows/tests_gpu_cu11.yml index ab875864fa..f530f9d69d 100644 --- a/.github/workflows/tests_gpu_cu11.yml +++ b/.github/workflows/tests_gpu_cu11.yml @@ -79,7 +79,6 @@ jobs: uses: actions/checkout@v3 with: path: main - fetch-depth: 2 - uses: actions/setup-python@v4 name: Install Python diff --git a/.github/workflows/tests_linux_x86_mpi_gpu.yml b/.github/workflows/tests_linux_x86_mpi_gpu.yml index e879415492..ec28514672 100644 --- a/.github/workflows/tests_linux_x86_mpi_gpu.yml +++ b/.github/workflows/tests_linux_x86_mpi_gpu.yml @@ -91,13 +91,15 @@ jobs: - name: Install required packages run: | - python -m pip install ninja cmake custatevec-cu11 + python -m pip install -r requirements-dev.txt + python -m pip install cmake custatevec-cu11 - name: Validate GPU version and installed compiler run: | source /etc/profile.d/modules.sh && module use /opt/modules && module load cuda/11.8 which -a nvcc nvcc --version + - name: Validate Multi-GPU packages run: | source /etc/profile.d/modules.sh && module use /opt/modules/ && module load ${{ matrix.mpilib }} @@ -107,9 +109,6 @@ jobs: which -a mpicxx mpicxx --version module unload ${{ matrix.mpilib }} - - name: Install Latest PennyLane - if: inputs.pennylane-version == 'latest' - run: python -m pip install git+https://github.com/PennyLaneAI/pennylane.git@master - name: Build and run unit tests run: | @@ -222,16 +221,11 @@ jobs: echo "PIP Path => $pip_path" echo "pip=$pip_path" >> $GITHUB_OUTPUT - - name: Install Latest PennyLane - # We want to install the latest PL on non workflow_call events - if: inputs.pennylane-version == 'latest' || inputs.pennylane-version == '' - run: python -m pip install git+https://github.com/PennyLaneAI/pennylane.git@master - - name: Install required packages run: | source /etc/profile.d/modules.sh && module use /opt/modules/ && module load ${{ matrix.mpilib }} - python -m pip install pip~=22.0 - python -m pip install ninja cmake custatevec-cu11 pytest pytest-mock flaky pytest-cov mpi4py openfermionpyscf + python -m pip install -r requirements-dev.txt + python -m pip install custatevec-cu11 mpi4py openfermionpyscf SKIP_COMPILATION=True PL_BACKEND=lightning_qubit python -m pip install -e . -vv - name: Build and install package @@ -242,12 +236,15 @@ jobs: CMAKE_ARGS="-DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DENABLE_MPI=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_CUDA_ARCHITECTURES=${{ env.CI_CUDA_ARCH }} -DPython_EXECUTABLE=${{ steps.python_path.outputs.python }}" \ PL_BACKEND=lightning_gpu python -m pip install -e . --verbose + # There are issues running py-cov with MPI. 
A solution is to use coverage as reported + # [here](https://github.com/pytest-dev/pytest-cov/issues/237#issuecomment-544824228) - name: Run unit tests for MPI-enabled lightning.gpu device run: | source /etc/profile.d/modules.sh && module use /opt/modules/ && module load ${{ matrix.mpilib }} - PL_DEVICE=lightning.gpu /opt/mpi/${{ matrix.mpilib }}/bin/mpirun -np 2 python -m pytest ./mpitests $COVERAGE_FLAGS - mv coverage.xml coverage-${{ github.job }}-lightning_gpu_${{ matrix.mpilib }}-main.xml - # PL_DEVICE=lightning.gpu /opt/mpi/${{ matrix.mpilib }}/bin/mpirun --oversubscribe -n 4 pytest -s -x mpitests/test_device.py -k test_create_device $COVERAGE_FLAGS + PL_DEVICE=lightning.gpu /opt/mpi/${{ matrix.mpilib }}/bin/mpirun -np 2 \ + coverage run --rcfile=.coveragerc --source=pennylane_lightning -p -m mpi4py -m pytest ./mpitests --tb=native + coverage combine + coverage xml -o coverage-${{ github.job }}-lightning_gpu_${{ matrix.mpilib }}-main.xml - name: Upload code coverage results uses: actions/upload-artifact@v3 diff --git a/.github/workflows/wheel_linux_x86_64_cu11.yml b/.github/workflows/wheel_linux_x86_64_cu11.yml index 2b270c4b0e..301d4c70e0 100644 --- a/.github/workflows/wheel_linux_x86_64_cu11.yml +++ b/.github/workflows/wheel_linux_x86_64_cu11.yml @@ -87,7 +87,7 @@ jobs: LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:$CUQUANTUM_SDK \ PL_BACKEND="${{ matrix.pl_backend }}" - CIBW_REPAIR_WHEEL_COMMAND_LINUX: "auditwheel repair -w {dest_dir} {wheel}" + CIBW_REPAIR_WHEEL_COMMAND_LINUX: "./bin/auditwheel repair -w {dest_dir} {wheel}" CIBW_MANYLINUX_X86_64_IMAGE: manylinux2014 @@ -114,6 +114,10 @@ jobs: upload-pypi: needs: linux-wheels-x86-64 + strategy: + matrix: + arch: [x86_64] + pl_backend: ["lightning_gpu"] runs-on: ubuntu-latest if: ${{ github.event_name == 'release' || github.ref == 'refs/heads/master'}} steps: diff --git a/.github/workflows/wheel_noarch.yml b/.github/workflows/wheel_noarch.yml index d3e6622730..d30145161d 100644 --- a/.github/workflows/wheel_noarch.yml +++ b/.github/workflows/wheel_noarch.yml @@ -22,7 +22,7 @@ jobs: strategy: matrix: os: [ubuntu-latest] - pl_backend: ["lightning_kokkos", "lightning_qubit"] + pl_backend: ["lightning_gpu", "lightning_kokkos", "lightning_qubit"] timeout-minutes: 30 name: ${{ matrix.os }} - Pure Python wheels - ${{ matrix.pl_backend }} (Python 3.9) runs-on: ${{ matrix.os }} @@ -46,41 +46,66 @@ jobs: python -m pip install --upgrade cmake ninja - name: Build wheels + if: ${{ matrix.pl_backend == 'lightning_qubit'}} run: | python -m pip install --upgrade pip wheel cd main - python setup.py bdist_wheel + PL_BACKEND="${{ matrix.pl_backend }}" python setup.py bdist_wheel env: SKIP_COMPILATION: True - name: Validate wheels + if: ${{ matrix.pl_backend == 'lightning_qubit'}} run: | python -m pip install twine python -m twine check main/dist/*.whl - uses: actions/upload-artifact@v3 - if: ${{ github.event_name == 'release' || github.ref == 'refs/heads/master' }} + if: ${{ matrix.pl_backend == 'lightning_qubit' && (github.event_name == 'release' || github.ref == 'refs/heads/master') }} with: name: pure-python-wheels-${{ matrix.pl_backend }}.zip path: main/dist/*.whl + - name: Build source dist + if: ${{ matrix.pl_backend != 'lightning_qubit'}} + run: | + python -m pip install --upgrade pip wheel + cd main + PL_BACKEND="${{ matrix.pl_backend }}" python setup.py sdist + env: + SKIP_COMPILATION: True + + - uses: actions/upload-artifact@v3 + if: ${{ matrix.pl_backend != 'lightning_qubit' && (github.event_name == 'release' || github.ref == 
'refs/heads/master') }} + with: + name: pure-source-dist-${{ matrix.pl_backend }}.tar.gz + path: main/dist/*.tar.gz + upload-pypi: needs: build-pure-python-wheel strategy: matrix: - pl_backend: ["lightning_qubit"] + pl_backend: ["lightning_gpu", "lightning_kokkos", "lightning_qubit"] runs-on: ubuntu-latest - if: ${{ github.event_name == 'release' }} steps: - uses: actions/download-artifact@v3 + if: ${{ matrix.pl_backend == 'lightning_qubit' && github.event_name == 'release' }} with: name: pure-python-wheels-${{ matrix.pl_backend }}.zip path: dist - name: Upload wheels to PyPI + if: ${{ matrix.pl_backend == 'lightning_qubit' && github.event_name == 'release' }} uses: pypa/gh-action-pypi-publish@release/v1 with: user: __token__ password: ${{ secrets.TEST_PYPI_API_TOKEN }} - repository-url: https://test.pypi.org/legacy/ \ No newline at end of file + repository-url: https://test.pypi.org/legacy/ + + - uses: actions/download-artifact@v3 + if: ${{ matrix.pl_backend != 'lightning_qubit' && github.event_name == 'release' }} + with: + name: pure-source-dist-${{ matrix.pl_backend }}.tar.gz + path: dist + diff --git a/.readthedocs.yml b/.readthedocs.yml index 00a2f7e4b8..e4d85ee56b 100644 --- a/.readthedocs.yml +++ b/.readthedocs.yml @@ -21,8 +21,11 @@ build: - libopenblas-base - libopenblas-dev - graphviz + - nvidia-cuda-toolkit jobs: pre_install: - echo "setuptools~=66.0\npip~=22.0" >> ci_build_requirements.txt post_install: - - PL_BACKEND="lightning_kokkos" pip install -e . -vv + - rm -rf ./build && PL_BACKEND="lightning_kokkos" python setup.py bdist_wheel + - rm -rf ./build && PL_BACKEND="lightning_gpu" python setup.py build_ext --define="PL_DISABLE_CUDA_SAFETY=1" && PL_BACKEND="lightning_gpu" python setup.py bdist_wheel + - python -m pip install ./dist/*.whl diff --git a/README.rst b/README.rst index 3e128524fd..e1aca73ac6 100644 --- a/README.rst +++ b/README.rst @@ -41,60 +41,103 @@ The Lightning plugin ecosystem provides fast state-vector simulators written in learning, automatic differentiation, and optimization of hybrid quantum-classical computations. PennyLane supports Python 3.9 and above. -.. header-end-inclusion-marker-do-not-remove +Features +******** +PennyLane-Lightning high performance simulators include the following backends: -Features -======== +* ``lightning.qubit``: is a fast state-vector simulator written in C++. +* ``lightning.gpu``: is a state-vector simulator based on the `NVIDIA cuQuantum SDK `_. It notably implements a distributed state-vector simulator based on MPI. +* ``lightning.kokkos``: is a state-vector simulator written with `Kokkos `_. It can exploit the inherent parallelism of modern processing units supporting the `OpenMP `_, `CUDA `_ or `HIP `_ programming models. + +.. 
header-end-inclusion-marker-do-not-remove + +The following table summarizes the supported platforms and the primary installation mode: + ++-----------+---------+--------+-------------+----------------+-----------------+----------------+ +| | L-Qubit | L-GPU | L-GPU (MPI) | L-Kokkos (OMP) | L-Kokkos (CUDA) | L-Kokkos (HIP) | ++===========+=========+========+=============+================+=================+================+ +| Linux x86 | pip | pip | source | pip | source | source | ++-----------+---------+--------+-------------+----------------+-----------------+----------------+ +| Linux ARM | pip | source | | pip | source | source | ++-----------+---------+--------+-------------+----------------+-----------------+----------------+ +| Linux PPC | pip | source | | pip | source | source | ++-----------+---------+--------+-------------+----------------+-----------------+----------------+ +| MacOS x86 | pip | | | pip | | | ++-----------+---------+--------+-------------+----------------+-----------------+----------------+ +| MacOS ARM | pip | | | pip | | | ++-----------+---------+--------+-------------+----------------+-----------------+----------------+ +| Windows | pip | | | | | | ++-----------+---------+--------+-------------+----------------+-----------------+----------------+ -* Combine Lightning's high performance simulators with PennyLane's - automatic differentiation and optimization. .. installation_LQubit-start-inclusion-marker-do-not-remove +Lightning-Qubit installation +**************************** -Lightning Qubit installation -============================ +PyPI wheels (pip) +================= -Lightning Qubit can be installed using ``pip``: +Lightning plugins can be installed using ``pip`` as follows .. code-block:: console $ pip install pennylane-lightning -To build Lightning from source you can run +The above command will install the Lightning-Qubit plugin (the default since it is most broadly supported). +In order to install the Lightning-GPU and Lightning-Kokkos (OpenMP) backends, you can respectively use the following commands: + +.. code-block:: console + + $ pip install pennylane-lightning[gpu] + $ pip install pennylane-lightning[kokkos] + + +Install from source +=================== + +To build Lightning plugins from source you can run .. code-block:: console - $ pip install pybind11 pennylane-lightning --no-binary :all: + $ PL_BACKEND=${PL_BACKEND} pip install pybind11 pennylane-lightning --no-binary :all: + +where ``${PL_BACKEND}`` can be ``lightning_qubit`` (default), ``lightning_gpu`` or ``lightning_kokkos``. +The `pybind11 `_ library is required to bind the C++ functionality to Python. A C++ compiler such as ``g++``, ``clang++``, or ``MSVC`` is required. On Debian-based systems, this can be installed via ``apt``: .. code-block:: console - $ sudo apt install g++ + $ sudo apt -y update && + $ sudo apt install g++ libomp-dev +where ``libomp-dev`` is included to also install OpenMP. On MacOS, we recommend using the latest version of ``clang++`` and ``libomp``: .. code-block:: console $ brew install llvm libomp -The `pybind11 `_ library is also used for binding the -C++ functionality to Python. +The Lightning-GPU backend has several dependencies (e.g. ``CUDA``, ``custatevec-cu11``, etc.), and hence we recommend referring to Lightning-GPU installation section. +Similarly, for Lightning-Kokkos it is recommended to configure and install Kokkos independently as prescribed in the Lightning-Kokkos installation section. 
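
As a quick sanity check after installation, a backend can be loaded and exercised on a small circuit. The following is a minimal sketch, assuming PennyLane and the chosen Lightning backend are installed; the circuit is illustrative only, and the device name should be swapped for the backend you installed:

.. code-block:: python

    import pennylane as qml

    # Load the installed Lightning backend; use "lightning.gpu" or
    # "lightning.kokkos" here if you installed one of those extras instead.
    dev = qml.device("lightning.qubit", wires=2)

    @qml.qnode(dev)
    def bell_parity():
        qml.Hadamard(wires=0)
        qml.CNOT(wires=[0, 1])
        return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))

    print(bell_parity())  # expected output: 1.0 for the Bell state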
+ +Development installation +======================== -Alternatively, for development and testing, you can install by cloning the repository: +For development and testing, you can install by cloning the repository: .. code-block:: console $ git clone https://github.com/PennyLaneAI/pennylane-lightning.git $ cd pennylane-lightning $ pip install -r requirements.txt - $ pip install -e . + $ PL_BACKEND=${PL_BACKEND} pip install -e . -vv Note that subsequent calls to ``pip install -e .`` will use cached binaries stored in the -``build`` folder. Run ``make clean`` if you would like to recompile. +``build`` folder. Run ``make clean`` if you would like to recompile from scratch. You can also pass ``cmake`` options with ``CMAKE_ARGS`` as follows: @@ -109,26 +152,35 @@ or with ``build_ext`` and the ``--define`` flag as follows: $ python3 setup.py build_ext -i --define="ENABLE_OPENMP=OFF;ENABLE_BLAS=OFF" $ python3 setup.py develop +where ``-D`` must not be included before ``;``-separated options. -Testing -------- +Compile MSVC (Windows) +====================== + +Lightning-Qubit can be compiled on Windows using the +`Microsoft Visual C++ `_ compiler. +You need `cmake `_ and appropriate Python environment +(e.g. using `Anaconda `_). -To test that the plugin is working correctly you can test the Python code within the cloned -repository: +We recommend using ``[x64 (or x86)] Native Tools Command Prompt for VS [version]`` to compile the library. +Be sure that ``cmake`` and ``python`` can be called within the prompt. .. code-block:: console - $ make test-python + $ cmake --version + $ python --version -while the C++ code can be tested with +Then a common command will work. .. code-block:: console - $ make test-cpp + $ pip install -r requirements.txt + $ pip install -e . +Note that OpenMP and BLAS are disabled on this platform. -CMake Support -------------- +CMake support +============= One can also build the plugin using CMake: @@ -137,184 +189,208 @@ One can also build the plugin using CMake: $ cmake -S. -B build $ cmake --build build -To test the C++ code: +Supported options are -.. code-block:: console +- ``-DENABLE_WARNINGS:BOOL=ON`` +- ``-DENABLE_NATIVE:BOOL=ON`` (for ``-march=native``) +- ``-DENABLE_BLAS:BOOL=ON`` +- ``-DENABLE_OPENMP:BOOL=ON`` +- ``-DENABLE_CLANG_TIDY:BOOL=ON`` - $ mkdir build && cd build - $ cmake -DBUILD_TESTS=ON -DCMAKE_BUILD_TYPE=Debug .. - $ make +Testing +======= -Other supported options are +To test that a plugin is working correctly, test the Python code with: -- ``-DENABLE_WARNINGS=ON`` -- ``-DENABLE_NATIVE=ON`` (for ``-march=native``) -- ``-DENABLE_BLAS=ON`` -- ``-DENABLE_OPENMP=ON`` -- ``-DENABLE_CLANG_TIDY=ON`` +.. code-block:: console -Compile on Windows with MSVC ----------------------------- + $ make test-python device=${PL_DEVICE} -You can also compile Lightning on Windows using -`Microsoft Visual C++ `_ compiler. -You need `cmake `_ and appropriate Python environment -(e.g. using `Anaconda `_). +where ``${PL_DEVICE}`` can be ``lightning.qubit`` (default), ``lightning.gpu`` or ``lightning.kokkos``. +These differ from ``${PL_BACKEND}`` by replacing the underscore by a dot. +The C++ code can be tested with +.. code-block:: console -We recommend to use ``[x64 (or x86)] Native Tools Command Prompt for VS [version]`` for compiling the library. -Be sure that ``cmake`` and ``python`` can be called within the prompt. + $ PL_BACKEND=${PL_BACKEND} make test-cpp +.. installation_LQubit-end-inclusion-marker-do-not-remove -.. code-block:: console +.. 
installation_LGPU-start-inclusion-marker-do-not-remove

-    $ cmake --version
-    $ python --version
+Lightning-GPU installation
+**************************

-Then a common command will work.
+Lightning-GPU can be installed using ``pip``:

 .. code-block:: console

-    $ pip install -r requirements.txt
-    $ pip install -e .
-
-Note that OpenMP and BLAS are disabled in this setting.
+    pip install pennylane-lightning[gpu]

+Lightning-GPU requires the `cuQuantum SDK `_ (only the `cuStateVec `_ library is required).
+The SDK may be installed within the Python environment ``site-packages`` directory using ``pip`` or ``conda``, or the SDK library path may be appended to the ``LD_LIBRARY_PATH`` environment variable.
+Please see the `cuQuantum SDK`_ install guide for more information.

-.. installation_LQubit-end-inclusion-marker-do-not-remove

+Install Lightning-GPU from source
+=================================
+
+To install Lightning-GPU from the package sources using the direct SDK path, Lightning-Qubit should be installed before Lightning-GPU:

-.. installation_LKokkos-start-inclusion-marker-do-not-remove

 .. code-block:: console

-Lightning Kokkos installation
-=============================
+    git clone https://github.com/PennyLaneAI/pennylane-lightning.git
+    cd pennylane-lightning
+    pip install -r requirements.txt
+    PL_BACKEND="lightning_qubit" pip install -e . -vv

-For linux systems, `lightning.kokkos` and be readily installed with an OpenMP backend by providing the optional ``[kokkos]`` tag:
+The `cuStateVec`_ library can then be installed and the ``CUQUANTUM_SDK`` environment variable set:

 .. code-block:: console

-    $ pip install pennylane-lightning[kokkos]
+    python -m pip install wheel custatevec-cu11
+    export CUQUANTUM_SDK=$(python -c "import site; print( f'{site.getsitepackages()[0]}/cuquantum/lib')")

-This can be explicitly installed through PyPI as:
+Lightning-GPU can then be installed with ``pip``:

 .. code-block:: console

-    $ pip install pennylane-lightning-kokkos
+    PL_BACKEND="lightning_gpu" python -m pip install -e .

+To simplify the build, we recommend using the containerized build process described in the Docker support section.

-Building from source
---------------------

+Install Lightning-GPU with MPI
+==============================

-As Kokkos enables support for many different HPC-targetted hardware platforms, `lightning.kokkos` can be built to support any of these platforms when building from source.
+Building Lightning-GPU with MPI also requires the ``NVIDIA cuQuantum SDK`` (currently supported version: `custatevec-cu11 `_), ``mpi4py`` and ``CUDA-aware MPI`` (Message Passing Interface).
+``CUDA-aware MPI`` allows data exchange between GPU memory spaces of different nodes without the need for CPU-mediated transfers.
+Both the ``MPICH`` and ``OpenMPI`` libraries are supported, provided they are compiled with CUDA support.
+The path to ``libmpi.so`` should be found in ``LD_LIBRARY_PATH``.
+It is recommended to install the ``NVIDIA cuQuantum SDK`` and the ``mpi4py`` Python package with ``pip`` or ``conda`` inside a virtual environment.
+Please consult the `cuQuantum SDK`_, `mpi4py `_,
+`MPICH `_, or `OpenMPI `_ install guide for more information.

-We suggest first installing Kokkos with the wanted configuration following the instructions found in the `Kokkos documentation `_.
-Next, append the install location to ``CMAKE_PREFIX_PATH``.
-If an installation is not found, our builder will clone and install it during the build process.
-
-The simplest way to install PennyLane-Lightning-Kokkos (OpenMP backend) is using ``pip``.
+Before installing Lightning-GPU with MPI support using the direct SDK path, please ensure Lightning-Qubit, ``CUDA-aware MPI`` and ``custatevec`` are installed and the environment variable ``CUQUANTUM_SDK`` is set properly.
+Lightning-GPU with MPI support can then be installed with ``pip``:

 .. code-block:: console

-    CMAKE_ARGS="-DKokkos_ENABLE_OPENMP=ON" PL_BACKEND="lightning_kokkos" python -m pip install .
-
-or for an editable ``pip`` installation with:
+    CMAKE_ARGS="-DENABLE_MPI=ON" PL_BACKEND="lightning_gpu" python -m pip install -e .

-.. code-block:: console

-    CMAKE_ARGS="-DKokkos_ENABLE_OPENMP=ON" PL_BACKEND="lightning_kokkos" python -m pip install -e .
+Test L-GPU with MPI
+===================

-Alternatively, you can install the Python interface with:
+You may test the Python layer of the MPI-enabled plugin as follows:

 .. code-block:: console

-    CMAKE_ARGS="-DKokkos_ENABLE_OPENMP=ON" PL_BACKEND="lightning_kokkos" python setup.py build_ext
-    python setup.py bdist_wheel
-    pip install ./dist/PennyLane*.whl --force-reinstall
+    mpirun -np 2 python -m pytest mpitests --tb=short

-To build the plugin directly with CMake:
+The C++ code is tested with:

 .. code-block:: console

-    cmake -B build -DKokkos_ENABLE_OPENMP=ON -DPLKOKKOS_BUILD_TESTS=ON -DPL_BACKEND=lightning_kokkos -G Ninja
-    cmake --build build
+    rm -rf ./BuildTests
+    cmake . -BBuildTests -DBUILD_TESTS=1 -DENABLE_MPI=ON -DCUQUANTUM_SDK=
+    cmake --build ./BuildTests --verbose
+    cd ./BuildTests
+    for file in *runner_mpi ; do mpirun -np 2 ./$file ; done;

-The supported backend options are "SERIAL", "OPENMP", "THREADS", "HIP" and "CUDA" and the corresponding build options are ``-DKokkos_ENABLE_XXX=ON``, where ``XXX`` needs be replaced by the backend name, for instance ``OPENMP``.
-One can activate simultaneously one serial, one parallel CPU host (e.g. "OPENMP", "THREADS") and one parallel GPU device backend (e.g. "HIP", "CUDA"), but not two of any category at the same time.
-For "HIP" and "CUDA", the appropriate software stacks are required to enable compilation and subsequent use.
-Similarly, the CMake option ``-DKokkos_ARCH_{...}=ON`` must also be specified to target a given architecture.
-A list of the architectures is found on the `Kokkos wiki `_.
-Note that "THREADS" backend is not recommended since `Kokkos `_ does not guarantee its safety.
+.. installation_LGPU-end-inclusion-marker-do-not-remove

 .. installation_LKokkos-start-inclusion-marker-do-not-remove

-Testing
-=======
+Lightning-Kokkos installation
+*****************************

-To test with the ROCm stack using a manylinux2014 container we must first mount the repository into the container:
+On Linux systems, ``lightning.kokkos`` with the OpenMP backend can be installed by providing the optional ``[kokkos]`` tag:

 .. code-block:: console

-    docker run -v `pwd`:/io -it quay.io/pypa/manylinux2014_x86_64 bash
+    $ pip install pennylane-lightning[kokkos]

-Next, within the container, we install the ROCm software stack:
+Install Lightning-Kokkos from source
+====================================
+
+As Kokkos enables support for many different HPC-targeted hardware platforms, ``lightning.kokkos`` can be built to support any of these platforms when building from source.
-    yum install -y https://repo.radeon.com/amdgpu-install/21.40.2/rhel/7.9/amdgpu-install-21.40.2.40502-1.el7.noarch.rpm
-    amdgpu-install --usecase=hiplibsdk,rocm --no-dkms
-
-We next build the test suite, with a given AMD GPU target in mind, as listed `here `_.
+We suggest first installing Kokkos with the wanted configuration following the instructions found in the `Kokkos documentation `_.
+For example, the following will build Kokkos for NVIDIA A100 cards:

 .. code-block:: console

-    cd /io
-    export PATH=$PATH:/opt/rocm/bin/
-    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib
-    export CXX=/opt/rocm/hip/bin/hipcc
-    cmake -B build -DCMAKE_CXX_COMPILER=/opt/rocm/hip/bin/hipcc -DKokkos_ENABLE_HIP=ON -DPLKOKKOS_BUILD_TESTS=ON -DKokkos_ARCH_VEGA90A=ON
-    cmake --build build --verbose
+    cmake -S . -B build -G Ninja \
+        -DCMAKE_BUILD_TYPE=RelWithDebug \
+        -DCMAKE_INSTALL_PREFIX=/opt/kokkos/4.1.00/AMPERE80 \
+        -DCMAKE_CXX_STANDARD=20 \
+        -DBUILD_SHARED_LIBS:BOOL=ON \
+        -DBUILD_TESTING:BOOL=OFF \
+        -DKokkos_ENABLE_SERIAL:BOOL=ON \
+        -DKokkos_ENABLE_CUDA:BOOL=ON \
+        -DKokkos_ARCH_AMPERE80:BOOL=ON \
+        -DKokkos_ENABLE_EXAMPLES:BOOL=OFF \
+        -DKokkos_ENABLE_TESTS:BOOL=OFF \
+        -DKokkos_ENABLE_LIBDL:BOOL=OFF
+    cmake --build build && cmake --install build
+    echo export CMAKE_PREFIX_PATH=/opt/kokkos/4.1.00/AMPERE80:\$CMAKE_PREFIX_PATH

-We may now leave the container, and run the built test suite on a machine with access to the targeted GPU.
+Next, append the install location to ``CMAKE_PREFIX_PATH``.
+Note that the C++20 standard is required (``-DCMAKE_CXX_STANDARD=20`` option), and hence CUDA v12 is required for the CUDA backend.
+If an installation is not found, our builder will clone and install it during the build process.

-For a system with access to the ROCm stack outside of a manylinux container, an editable ``pip`` installation can be built and installed as:
+The simplest way to install Lightning-Kokkos (OpenMP backend) is through ``pip``:

 .. code-block:: console

-    CMAKE_ARGS="-DKokkos_ENABLE_HIP=ON -DKokkos_ARCH_VEGA90A=ON" PL_BACKEND="lightning_kokkos" python -m pip install -e .
+    CMAKE_ARGS="-DKokkos_ENABLE_OPENMP=ON" PL_BACKEND="lightning_kokkos" python -m pip install .

-.. installation_LKokkos-end-inclusion-marker-do-not-remove
+To build the plugin directly with CMake as above:

-Please refer to the `plugin documentation `_ as
-well as to the `PennyLane documentation `_ for further reference.
+.. code-block:: console

+    cmake -B build -DKokkos_ENABLE_OPENMP=ON -DPL_BACKEND=lightning_kokkos -G Ninja
+    cmake --build build

-GPU support
------------

-For GPU support, `PennyLane-Lightning-GPU `_
-can be installed by providing the optional ``[gpu]`` tag:
+The supported backend options are ``SERIAL``, ``OPENMP``, ``THREADS``, ``HIP`` and ``CUDA`` and the corresponding build options are ``-DKokkos_ENABLE_XXX=ON``, where ``XXX`` needs to be replaced by the backend name, for instance ``OPENMP``.
+One can activate simultaneously one serial, one parallel CPU host (e.g. ``OPENMP``, ``THREADS``) and one parallel GPU device backend (e.g. ``HIP``, ``CUDA``), but not two of any category at the same time.
+For ``HIP`` and ``CUDA``, the appropriate software stacks are required to enable compilation and subsequent use.
+Similarly, the CMake option ``-DKokkos_ARCH_{...}=ON`` must also be specified to target a given architecture.
+A list of the architectures is found on the `Kokkos wiki `_.
+Note that the ``THREADS`` backend is not recommended since `Kokkos does not guarantee its safety `_.

 ..
installation_LKokkos-end-inclusion-marker-do-not-remove - $ pip install pennylane-lightning[gpu] +Please refer to the `plugin documentation `_ as +well as to the `PennyLane documentation `_ for further reference. -For more information, please refer to the PennyLane Lightning GPU `documentation `_. +.. docker-start-inclusion-marker-do-not-remove -Docker Support --------------- +Docker support +************** -One can also build the Lightning image using Docker: +Docker images for the various backends are found on the +`PennyLane Docker Hub `_ page, where there is also a detailed description about PennyLane Docker support. +Briefly, one can build the Docker Lightning images using: .. code-block:: console $ git clone https://github.com/PennyLaneAI/pennylane-lightning.git $ cd pennylane-lightning - $ docker build -t lightning/base -f docker/Dockerfile . + $ docker build -f docker/Dockerfile --target ${TARGET} . + +where ``${TARGET}`` is one of the following -Please refer to the `PennyLane installation `_ for detailed description about PennyLane Docker support. +* ``wheel-lightning-qubit`` +* ``wheel-lightning-gpu`` +* ``wheel-lightning-kokkos-openmp`` +* ``wheel-lightning-kokkos-cuda`` +* ``wheel-lightning-kokkos-rocm`` +.. docker-end-inclusion-marker-do-not-remove Contributing -============ +************ We welcome contributions - simply fork the repository of this plugin, and then make a `pull request `_ containing your contribution. @@ -333,9 +409,8 @@ The Python code is statically analyzed with `Pylint `_) to run both of these on `git commit`. Please make your best effort to comply with `black` and `pylint` before using disabling pragmas (e.g. `# pylint: disable=missing-function-docstring`). - Authors -======= +******* Lightning is the work of `many contributors `_. @@ -348,9 +423,8 @@ If you are doing research using PennyLane and Lightning, please cite `our paper .. support-start-inclusion-marker-do-not-remove - Support -======= +******* - **Source Code:** https://github.com/PennyLaneAI/pennylane-lightning - **Issue Tracker:** https://github.com/PennyLaneAI/pennylane-lightning/issues @@ -362,22 +436,24 @@ by asking a question in the forum. .. support-end-inclusion-marker-do-not-remove .. license-start-inclusion-marker-do-not-remove - License -======= +******* -The PennyLane lightning plugin is **free** and **open source**, released under +The Lightning plugins are **free** and **open source**, released under the `Apache License, Version 2.0 `_. +The Lightning-GPU plugin makes use of the NVIDIA cuQuantum SDK headers to +enable the device bindings to PennyLane, which are held to their own respective license. .. license-end-inclusion-marker-do-not-remove .. acknowledgements-start-inclusion-marker-do-not-remove Acknowledgements -================ +**************** PennyLane Lightning makes use of the following libraries and tools, which are under their own respective licenses: - **pybind11:** https://github.com/pybind/pybind11 - **Kokkos Core:** https://github.com/kokkos/kokkos +- **NVIDIA cuQuantum:** https://developer.nvidia.com/cuquantum-sdk -.. acknowledgements-end-inclusion-marker-do-not-remove \ No newline at end of file +.. 
acknowledgements-end-inclusion-marker-do-not-remove diff --git a/bin/auditwheel b/bin/auditwheel new file mode 100755 index 0000000000..e7142fa4ab --- /dev/null +++ b/bin/auditwheel @@ -0,0 +1,25 @@ +#!/usr/bin/env python3 + +# Patch to not ship CUDA system libraries +# Follows https://github.com/DIPlib/diplib/tree/master/tools/travis +import sys + +from auditwheel.main import main +from auditwheel.policy import _POLICIES as POLICIES + +# Do not include licensed dynamic libraries +libs = [ + "libcudart.so.11.0", + "libcublasLt.so.11", + "libcublas.so.11", + "libcusparse.so.11", + "libcustatevec.so.1", +] + +print(f"Excluding {libs}") + +for p in POLICIES: + p["lib_whitelist"].extend(libs) + +if __name__ == "__main__": + sys.exit(main()) diff --git a/doc/code/__init__.rst b/doc/code/__init__.rst index bf68bf024a..1e4eb7d3c8 100644 --- a/doc/code/__init__.rst +++ b/doc/code/__init__.rst @@ -1,5 +1,5 @@ -pennylane_lightning -=================== +Python API +========== This section contains the API documentation for the Lightning packages. @@ -18,6 +18,10 @@ This section contains the API documentation for the Lightning packages. :description: API documentation for the lightning_qubit package :link: ../lightning_qubit/package.html +.. title-card:: + :name: lightning_gpu + :description: API documentation for the lightning_gpu package + :link: ../lightning_gpu/package.html .. title-card:: :name: lightning_kokkos @@ -33,4 +37,5 @@ This section contains the API documentation for the Lightning packages. :hidden: ../lightning_qubit/package + ../lightning_gpu/package ../lightning_kokkos/package diff --git a/doc/docker.rst b/doc/docker.rst new file mode 100644 index 0000000000..85ae81ba73 --- /dev/null +++ b/doc/docker.rst @@ -0,0 +1,3 @@ +.. include:: ../README.rst + :start-after: docker-start-inclusion-marker-do-not-remove + :end-before: docker-end-inclusion-marker-do-not-remove diff --git a/doc/index.rst b/doc/index.rst index c9316bd782..f48d86c567 100644 --- a/doc/index.rst +++ b/doc/index.rst @@ -14,7 +14,7 @@ Lightning plugins Devices -------- +******* The Lightning ecosystem provides the following devices: @@ -23,6 +23,11 @@ The Lightning ecosystem provides the following devices: :description: A fast state-vector qubit simulator written in C++ :link: lightning_qubit/device.html +.. title-card:: + :name: 'lightning.gpu' + :description: A heterogeneous backend state-vector simulator with NVIDIA cuQuantum library support. + :link: lightning_gpu/device.html + .. title-card:: :name: 'lightning.kokkos' :description: A heterogeneous backend state-vector simulator with Kokkos library support. @@ -39,6 +44,7 @@ The Lightning ecosystem provides the following devices: :hidden: installation + docker support .. toctree:: @@ -47,6 +53,7 @@ The Lightning ecosystem provides the following devices: :hidden: lightning_qubit/device + lightning_gpu/device lightning_kokkos/device .. toctree:: diff --git a/doc/installation.rst b/doc/installation.rst index d89f62a24c..c0b056f5c0 100644 --- a/doc/installation.rst +++ b/doc/installation.rst @@ -8,6 +8,10 @@ Each device in the Lightning ecosystem is a separate Python package. Select the :description: Guidelines to installing and testing the Lightning Qubit device. :link: ./lightning_qubit/installation.html +.. title-card:: + :name: Lightning GPU + :description: Guidelines to installing and testing the Lightning GPU device + :link: ./lightning_gpu/installation.html .. 
title-card:: :name: Lightning Kokkos @@ -23,4 +27,5 @@ Each device in the Lightning ecosystem is a separate Python package. Select the :hidden: lightning_qubit/installation + lightning_gpu/installation lightning_kokkos/installation diff --git a/doc/lightning_gpu/device.rst b/doc/lightning_gpu/device.rst new file mode 100644 index 0000000000..b1abc9cf92 --- /dev/null +++ b/doc/lightning_gpu/device.rst @@ -0,0 +1,284 @@ +Lightning GPU device +==================== + +The ``lightning.gpu`` device is an extension of PennyLane's built-in ``lightning.qubit`` device. +It extends the CPU-focused Lightning simulator to run using the NVIDIA cuQuantum SDK, enabling GPU-accelerated simulation of quantum state-vector evolution. + +A ``lightning.gpu`` device can be loaded using: + +.. code-block:: python + + import pennylane as qml + dev = qml.device("lightning.gpu", wires=2) + +If the NVIDIA cuQuantum libraries are available, the above device will allow all operations to be performed on a CUDA capable GPU of generation SM 7.0 (Volta) and greater. If the libraries are not correctly installed, or available on path, the device will fall-back to ``lightning.qubit`` and perform all simulation on the CPU. + +The ``lightning.gpu`` device also directly supports quantum circuit gradients using the adjoint differentiation method. This can be enabled at the PennyLane QNode level with: + +.. code-block:: python + + qml.qnode(dev, diff_method="adjoint") + def circuit(params): + ... + +Check out the :doc:`/lightning_gpu/installation` guide for more information. + +Supported operations and observables +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**Supported operations:** + +.. raw:: html + +
+
+.. autosummary::
+    :nosignatures:
+
+    ~pennylane.BasisState
+    ~pennylane.CNOT
+    ~pennylane.ControlledPhaseShift
+    ~pennylane.ControlledQubitUnitary
+    ~pennylane.CPhase
+    ~pennylane.CRot
+    ~pennylane.CRX
+    ~pennylane.CRY
+    ~pennylane.CRZ
+    ~pennylane.CSWAP
+    ~pennylane.CY
+    ~pennylane.CZ
+    ~pennylane.DiagonalQubitUnitary
+    ~pennylane.DoubleExcitation
+    ~pennylane.DoubleExcitationMinus
+    ~pennylane.DoubleExcitationPlus
+    ~pennylane.ECR
+    ~pennylane.Hadamard
+    ~pennylane.Identity
+    ~pennylane.IsingXX
+    ~pennylane.IsingXY
+    ~pennylane.IsingYY
+    ~pennylane.IsingZZ
+    ~pennylane.ISWAP
+    ~pennylane.MultiControlledX
+    ~pennylane.MultiRZ
+    ~pennylane.OrbitalRotation
+    ~pennylane.PauliX
+    ~pennylane.PauliY
+    ~pennylane.PauliZ
+    ~pennylane.PhaseShift
+    ~pennylane.PSWAP
+    ~pennylane.QFT
+    ~pennylane.QubitCarry
+    ~pennylane.QubitStateVector
+    ~pennylane.QubitSum
+    ~pennylane.QubitUnitary
+    ~pennylane.Rot
+    ~pennylane.RX
+    ~pennylane.RY
+    ~pennylane.RZ
+    ~pennylane.S
+    ~pennylane.SingleExcitation
+    ~pennylane.SingleExcitationMinus
+    ~pennylane.SingleExcitationPlus
+    ~pennylane.SISWAP
+    ~pennylane.SQISW
+    ~pennylane.SWAP
+    ~pennylane.SX
+    ~pennylane.T
+    ~pennylane.Toffoli
+
+.. raw:: html
+
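For concreteness, a minimal circuit touching a few of the operations listed above might look as follows; the wire count, the parameter value, and the use of the adjoint differentiation method are arbitrary choices made for illustration:

.. code-block:: python

    import pennylane as qml
    from pennylane import numpy as np

    dev = qml.device("lightning.gpu", wires=3)

    @qml.qnode(dev, diff_method="adjoint")
    def circuit(theta):
        qml.Hadamard(wires=0)
        qml.CNOT(wires=[0, 1])
        qml.IsingXY(theta, wires=[1, 2])
        qml.RZ(theta, wires=0)
        return qml.expval(qml.PauliZ(0))

    theta = np.array(0.5, requires_grad=True)
    print(circuit(theta))             # expectation value
    print(qml.grad(circuit)(theta))   # gradient via the adjoint method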
+
+**Supported observables:**
+
+.. raw:: html
+
+
+.. autosummary::
+    :nosignatures:
+
+    ~pennylane.ops.op_math.Exp
+    ~pennylane.Hadamard
+    ~pennylane.Hamiltonian
+    ~pennylane.Hermitian
+    ~pennylane.Identity
+    ~pennylane.PauliX
+    ~pennylane.PauliY
+    ~pennylane.PauliZ
+    ~pennylane.ops.op_math.Prod
+    ~pennylane.Projector
+    ~pennylane.SparseHamiltonian
+    ~pennylane.ops.op_math.SProd
+    ~pennylane.ops.op_math.Sum
+
+.. raw:: html
+
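As a brief, illustrative sketch (the coefficients and circuit are arbitrary), composite observables such as a ``Hamiltonian`` built from the Pauli terms listed above can be measured like any other observable:

.. code-block:: python

    import pennylane as qml

    dev = qml.device("lightning.gpu", wires=2)

    # A small two-term Hamiltonian assembled from supported Pauli observables.
    ham = qml.Hamiltonian([0.5, -0.25], [qml.PauliZ(0) @ qml.PauliX(1), qml.PauliY(0)])

    @qml.qnode(dev)
    def energy():
        qml.Hadamard(wires=0)
        qml.CNOT(wires=[0, 1])
        return qml.expval(ham)

    print(energy())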
+ + + +**Parallel adjoint differentiation support:** + +The ``lightning.gpu`` device directly supports the `adjoint differentiation method `__, and enables parallelization over the requested observables. This supports direct controlling of observable batching, which can be used to run concurrent calculations across multiple available GPUs. + +If you are computing a large number of expectation values, or if you are using a large number of wires on your device, it may be best to evenly divide the number of expectation value calculations across all available GPUs. This will reduce the overall memory cost of the observables per GPU, at the cost of additional compute time. Assuming `m` observables, and `n` GPUs, the default behaviour is to pre-allocate all storage for `n` observables on a single GPU. To divide the workload amongst many GPUs, initialize a ``lightning.gpu`` device with the ``batch_obs=True`` keyword argument, as: + +.. code-block:: python + + import pennylane as qml + dev = qml.device("lightning.gpu", wires=20, batch_obs=True) + +With the above, each GPU will see at most `m/n` observables to process, reducing the preallocated memory footprint. + +Additionally, there can be situations where even with the above distribution, and limited GPU memory, the overall problem does not fit on the requested GPU devices. You can further reduce the concurrent allocations on available GPUs by providing an integer value to the `batch_obs` keyword. For example, to batch evaluate observables with at most 1 observable allocation per GPU, define the device as: + +.. code-block:: python + + import pennylane as qml + dev = qml.device("lightning.gpu", wires=27, batch_obs=1) + +Each problem is unique, so it can often be best to choose the default behaviour up-front, and tune with the above only if necessary. + +**Multi-GPU/multi-node support:** + +The ``lightning.gpu`` device allows users to leverage the computational power of many GPUs sitting on separate nodes for running large-scale simulations. +Provided that NVIDIA ``cuQuantum`` libraries, a ``CUDA-aware MPI`` library and ``mpi4py`` are properly installed and the path to the ``libmpi.so`` is +added to the ``LD_LIBRARY_PATH`` environment variable, the following requirements should be met to enable multi-node and multi-GPU simulations: + +1. The ``mpi`` keyword argument should be set as ``True`` when initializing a ``lightning.gpu`` device. +2. Both the total number of MPI processes and MPI processes per node must be powers of 2. For example, 2, 4, 8, 16, etc.. Each MPI process is responsible for managing one GPU. + +The workflow for the multi-node/GPUs feature is as follows: + +.. code-block:: python + + from mpi4py import MPI + import pennylane as qml + dev = qml.device('lightning.gpu', wires=8, mpi=True) + @qml.qnode(dev) + def circuit_mpi(): + qml.PauliX(wires=[0]) + return qml.state() + local_state_vector = circuit_mpi() + +Currently, a ``lightning.gpu`` device with the MPI multi-GPU backend supports all the ``gate operations`` and ``observables`` that a ``lightning.gpu`` device with a single GPU/node backend supports. + +By default, each MPI process will return the overall simulation results, except for the ``qml.state()`` and ``qml.prob()`` methods for which each MPI process only returns the local simulation +results for the ``qml.state()`` and ``qml.prob()`` methods to avoid buffer overflow. It is the user's responsibility to ensure correct data collection for those two methods. 
Here are examples of collecting +the local simulation results for ``qml.state()`` and ``qml.prob()`` methods: + +The workflow for collecting local state vector (using the ``qml.state()`` method) to ``rank 0`` is as follows: + +.. code-block:: python + + from mpi4py import MPI + import pennylane as qml + comm = MPI.COMM_WORLD + rank = comm.Get_rank() + dev = qml.device('lightning.gpu', wires=8, mpi=True) + @qml.qnode(dev) + def circuit_mpi(): + qml.PauliX(wires=[0]) + return qml.state() + local_state_vector = circuit_mpi() + #rank 0 will collect the local state vector + state_vector = comm.gather(local_state_vector, root=0) + if rank == 0: + print(state_vector) + +The workflow for collecting local probability (using the ``qml.prob()`` method) to ``rank 0`` is as follows: + +.. code-block:: python + + from mpi4py import MPI + import pennylane as qml + import numpy as np + + comm = MPI.COMM_WORLD + rank = comm.Get_rank() + dev = qml.device('lightning.gpu', wires=8, mpi=True) + prob_wires = [0, 1] + + @qml.qnode(dev) + def mpi_circuit(): + qml.Hadamard(wires=1) + return qml.probs(wires=prob_wires) + + local_probs = mpi_circuit() + + #For data collection across MPI processes. + recv_counts = comm.gather(len(local_probs),root=0) + if rank == 0: + probs = np.zeros(2**len(prob_wires)) + else: + probs = None + + comm.Gatherv(local_probs,[probs,recv_counts],root=0) + if rank == 0: + print(probs) + +Then the python script can be executed with the following command: + +.. code-block:: console + + $ mpirun -np 4 python yourscript.py + +Furthermore, users can optimize the performance of their applications by allocating the appropriate amount of GPU memory for MPI operations with the ``mpi_buf_size`` keyword argument. To allocate ``n`` mebibytes (MiB, `2^20` bytes) of GPU memory for MPI operations, initialize a ``lightning.gpu`` device with the ``mpi_buf_size=n`` keyword argument, as follows: + +.. code-block:: python + + from mpi4py import MPI + import pennylane as qml + n = 8 + dev = qml.device("lightning.gpu", wires=20, mpi=True, mpi_buf_size=n) + +Note the value of ``mpi_buf_size`` should also be a power of ``2``. Remember to carefully manage the ``mpi_buf_size`` parameter, taking into account the available GPU memory and the memory +requirements of the local state vector, to prevent memory overflow issues and ensure optimal performance. By default (``mpi_buf_size=0``), the GPU memory allocated for MPI operations +will match the size of the local state vector, with a limit of ``64 MiB``. Please be aware that a runtime warning will occur if the local GPU memory buffer for MPI operations exceeds +the GPU memory allocated to the local state vector. + +**Multi-GPU/multi-node support for adjoint method:** + +The ``lightning.gpu`` device with the multi-GPU/multi-node backend also directly supports the `adjoint differentiation method `__. Instead of batching observables across the multiple GPUs available within a node, the state vector is distributed among the available GPUs with the multi-GPU/multi-node backend. +By default, the adjoint method with MPI support follows the performance-oriented implementation of the single GPU backend. This means that a separate ``bra`` is created for each observable and the ``ket`` is updated only once for each operation, regardless of the number of observables. + +The workflow for the default adjoint method with MPI support is as follows: + +.. 
code-block:: python + + from mpi4py import MPI + import pennylane as qml + from pennylane import numpy as np + + comm = MPI.COMM_WORLD + rank = comm.Get_rank() + n_wires = 20 + n_layers = 2 + + dev = qml.device('lightning.gpu', wires= n_wires, mpi=True) + @qml.qnode(dev, diff_method="adjoint") + def circuit_adj(weights): + qml.StronglyEntanglingLayers(weights, wires=list(range(n_wires))) + return qml.math.hstack([qml.expval(qml.PauliZ(i)) for i in range(n_wires)]) + + if rank == 0: + params = np.random.random(qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_wires)) + else: + params = None + + params = comm.bcast(params, root=0) + jac = qml.jacobian(circuit_adj)(params) + +If users aim to handle larger system sizes with limited hardware resources, the memory-optimized adjoint method with MPI support is more appropriate. The memory-optimized adjoint method with MPI support employs a single ``bra`` object that is reused for all observables. +This approach results in a notable reduction in the required GPU memory when dealing with a large number of observables. However, it's important to note that the reduction in memory requirement may come at the expense of slower execution due to the multiple ``ket`` updates per gate operation. + +To enable the memory-optimized adjoint method with MPI support, ``batch_obs`` should be set as ``True`` and the workflow follows: + +.. code-block:: python + + dev = qml.device('lightning.gpu', wires= n_wires, mpi=True, batch_obs=True) + +For the adjoint method, each MPI process will provide the overall simulation results. \ No newline at end of file diff --git a/doc/lightning_gpu/installation.rst b/doc/lightning_gpu/installation.rst new file mode 100644 index 0000000000..9754aae396 --- /dev/null +++ b/doc/lightning_gpu/installation.rst @@ -0,0 +1,3 @@ +.. include:: ../../README.rst + :start-after: installation_LGPU-start-inclusion-marker-do-not-remove + :end-before: installation_LGPU-end-inclusion-marker-do-not-remove \ No newline at end of file diff --git a/doc/lightning_gpu/package.rst b/doc/lightning_gpu/package.rst new file mode 100644 index 0000000000..1b0b96b84d --- /dev/null +++ b/doc/lightning_gpu/package.rst @@ -0,0 +1,19 @@ +lightning_gpu +============= + +.. automodapi:: pennylane_lightning.lightning_gpu + :no-heading: + :include-all-objects: + +.. raw:: html + +
+
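In most workflows the device is constructed through the standard PennyLane device loader using the ``lightning.gpu`` short name rather than by importing the package directly; a minimal sketch (the wire count is arbitrary):

.. code-block:: python

    import pennylane as qml

    # Loads the LightningGPU device registered under the "lightning.gpu" short name.
    dev = qml.device("lightning.gpu", wires=2)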
+ +Directly importing the device class: +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python3 + + from pennylane_lightning.lightning_gpu import LightningGPU + diff --git a/doc/lightning_qubit/development/index.rst b/doc/lightning_qubit/development/index.rst index 7eb4a918e4..90489e166f 100644 --- a/doc/lightning_qubit/development/index.rst +++ b/doc/lightning_qubit/development/index.rst @@ -20,5 +20,5 @@ Lightning Qubit .. toctree:: :hidden: - avx_kernels/index add_gate_kernel + avx_kernels/index diff --git a/doc/lightning_qubit/device.rst b/doc/lightning_qubit/device.rst index 3aba8e5e64..53a1e5d9a2 100644 --- a/doc/lightning_qubit/device.rst +++ b/doc/lightning_qubit/device.rst @@ -140,9 +140,6 @@ If you are computing a large number of expectation values, or if you are using a import pennylane as qml dev = qml.device("lightning.qubit", wires=2, batch_obs=True) -.. raw:: html - - **Markov Chain Monte Carlo sampling support:** diff --git a/doc/requirements.txt b/doc/requirements.txt index 3acc41f79a..f6d4fac3d1 100644 --- a/doc/requirements.txt +++ b/doc/requirements.txt @@ -6,3 +6,5 @@ pybind11 sphinx sphinx-automodapi pennylane-sphinx-theme +custatevec-cu11 +wheel diff --git a/mpitests/test_adjoint_jacobian.py b/mpitests/test_adjoint_jacobian.py index 3657c336f8..ced09124f5 100644 --- a/mpitests/test_adjoint_jacobian.py +++ b/mpitests/test_adjoint_jacobian.py @@ -1161,24 +1161,36 @@ def test_integration_H2_Hamiltonian( create_xyz_file, batches ): # pylint: disable=redefined-outer-name """Tests getting the total energy and its derivatives for an H2 Hamiltonian.""" + comm = MPI.COMM_WORLD _ = pytest.importorskip("openfermionpyscf") n_electrons = 2 np.random.seed(1337) - str_path = create_xyz_file - symbols, coordinates = qml.qchem.read_structure(str(str_path), outpath=str(str_path.parent)) - - H, qubits = qml.qchem.molecular_hamiltonian( - symbols, - coordinates, - method="pyscf", - basis="6-31G", - active_electrons=n_electrons, - name="h2", - outpath=str(str_path.parent), - load_data=True, - ) + if comm.Get_rank() == 0: + str_path = create_xyz_file + symbols, coordinates = qml.qchem.read_structure(str(str_path), outpath=str(str_path.parent)) + H, qubits = qml.qchem.molecular_hamiltonian( + symbols, + coordinates, + method="pyscf", + basis="6-31G", + active_electrons=n_electrons, + name="h2", + outpath=str(str_path.parent), + load_data=True, + ) + else: + symbols = None + coordinates = None + H = None + qubits = None + + symbols = comm.bcast(symbols, root=0) + coordinates = comm.bcast(coordinates, root=0) + H = comm.bcast(H, root=0) + qubits = comm.bcast(qubits, root=0) + hf_state = qml.qchem.hf_state(n_electrons, qubits) _, doubles = qml.qchem.excitations(n_electrons, qubits) @@ -1211,9 +1223,12 @@ def circuit_compare(params, excitations): jac_func_comp = qml.jacobian(circuit_compare) params = qml.numpy.array([0.0] * len(doubles), requires_grad=True) + jacs = jac_func(params, excitations=doubles) jacs_comp = jac_func_comp(params, excitations=doubles) + comm.Barrier() + assert np.allclose(jacs, jacs_comp) @@ -1273,6 +1288,8 @@ def circuit(params): j_gpu = qml.jacobian(qnode_gpu)(params) j_cpu = qml.jacobian(qnode_cpu)(params) + comm.Barrier() + assert np.allclose(j_cpu, j_gpu) @@ -1361,4 +1378,6 @@ def circuit(params): j_gpu = qml.jacobian(qnode_gpu)(params) j_cpu = qml.jacobian(qnode_cpu)(params) + comm.Barrier() + assert np.allclose(j_cpu, j_gpu) diff --git a/pennylane_lightning/core/_version.py b/pennylane_lightning/core/_version.py index c37a035029..11eee0a9ee 100644 --- 
a/pennylane_lightning/core/_version.py +++ b/pennylane_lightning/core/_version.py @@ -16,4 +16,4 @@ Version number (major.minor.patch[-label]) """ -__version__ = "0.34.0-dev" +__version__ = "0.34.0-dev1" diff --git a/pennylane_lightning/core/lightning_base.py b/pennylane_lightning/core/lightning_base.py index 9587ce3dd2..163f7cdeb4 100644 --- a/pennylane_lightning/core/lightning_base.py +++ b/pennylane_lightning/core/lightning_base.py @@ -61,7 +61,7 @@ class LightningBase(QubitDevice): OpenMP. """ - pennylane_requires = ">=0.30" + pennylane_requires = ">=0.32" version = __version__ author = "Xanadu Inc." short_name = "lightning.base" @@ -394,7 +394,7 @@ def processing_fns(tapes): class LightningBaseFallBack(DefaultQubitLegacy): # pragma: no cover # pylint: disable=missing-class-docstring, too-few-public-methods - pennylane_requires = ">=0.30" + pennylane_requires = ">=0.32" version = __version__ author = "Xanadu Inc." _CPP_BINARY_AVAILABLE = False diff --git a/pennylane_lightning/core/src/simulators/lightning_gpu/StateVectorCudaMPI.hpp b/pennylane_lightning/core/src/simulators/lightning_gpu/StateVectorCudaMPI.hpp index 452e2f638a..db77384d71 100644 --- a/pennylane_lightning/core/src/simulators/lightning_gpu/StateVectorCudaMPI.hpp +++ b/pennylane_lightning/core/src/simulators/lightning_gpu/StateVectorCudaMPI.hpp @@ -353,6 +353,7 @@ class StateVectorCudaMPI final wires.begin() + ctrl_offset}; const std::vector tgts{wires.begin() + ctrl_offset, wires.end()}; + if (opName == "Identity") { return; } else if (native_gates_.find(opName) != native_gates_.end()) { @@ -1640,6 +1641,9 @@ class StateVectorCudaMPI final PL_CUDA_IS_SUCCESS(cudaStreamSynchronize(localStream_.get())); PL_CUDA_IS_SUCCESS(cudaDeviceSynchronize()); } + // Ensure sync for all local target wires scenarios + PL_CUDA_IS_SUCCESS(cudaDeviceSynchronize()); + mpi_manager_.Barrier(); } /** @@ -1796,6 +1800,9 @@ class StateVectorCudaMPI final PL_CUDA_IS_SUCCESS(cudaStreamSynchronize(localStream_.get())); PL_CUDA_IS_SUCCESS(cudaDeviceSynchronize()); } + // Ensure sync for all local target wires scenarios + PL_CUDA_IS_SUCCESS(cudaDeviceSynchronize()); + mpi_manager_.Barrier(); } /** @@ -1920,6 +1927,7 @@ class StateVectorCudaMPI final PL_CUDA_IS_SUCCESS(cudaStreamSynchronize(localStream_.get())); PL_CUDA_IS_SUCCESS(cudaDeviceSynchronize()); } + PL_CUDA_IS_SUCCESS(cudaDeviceSynchronize()); auto expect = mpi_manager_.allreduce(local_expect, "sum"); return expect; } diff --git a/pennylane_lightning/core/src/simulators/lightning_gpu/algorithms/AdjointJacobianGPUMPI.hpp b/pennylane_lightning/core/src/simulators/lightning_gpu/algorithms/AdjointJacobianGPUMPI.hpp index 07cd3997e0..419b2221c6 100644 --- a/pennylane_lightning/core/src/simulators/lightning_gpu/algorithms/AdjointJacobianGPUMPI.hpp +++ b/pennylane_lightning/core/src/simulators/lightning_gpu/algorithms/AdjointJacobianGPUMPI.hpp @@ -90,7 +90,6 @@ class AdjointJacobianMPI final sv1.getDataBuffer().getDevTag().getDeviceID(), sv1.getDataBuffer().getDevTag().getStreamID(), sv1.getCublasCaller(), &result); - auto jac_single_param = sv2.getMPIManager().template allreduce(result, "sum"); @@ -235,6 +234,7 @@ class AdjointJacobianMPI final if (!jd.hasTrainableParams()) { return; } + const OpsData &ops = jd.getOperations(); const std::vector &ops_name = ops.getOpsName(); @@ -302,6 +302,7 @@ class AdjointJacobianMPI final break; // All done } mu.updateData(lambda); + BaseType::applyOperationAdj(lambda, ops, op_idx); if (ops.hasParams(op_idx)) { @@ -325,6 +326,7 @@ class AdjointJacobianMPI final 
} current_param_idx--; } + for (size_t obs_idx = 0; obs_idx < num_observables; obs_idx++) { BaseType::applyOperationAdj(*H_lambda[obs_idx], ops, op_idx); } diff --git a/pennylane_lightning/core/src/simulators/lightning_gpu/observables/ObservablesGPUMPI.hpp b/pennylane_lightning/core/src/simulators/lightning_gpu/observables/ObservablesGPUMPI.hpp index 94f5e45739..955365915a 100644 --- a/pennylane_lightning/core/src/simulators/lightning_gpu/observables/ObservablesGPUMPI.hpp +++ b/pennylane_lightning/core/src/simulators/lightning_gpu/observables/ObservablesGPUMPI.hpp @@ -191,6 +191,7 @@ class HamiltonianMPI final : public HamiltonianBase { // to work with void applyInPlace(StateVectorT &sv) const override { + auto mpi_manager = sv.getMPIManager(); using CFP_t = typename StateVectorT::CFP_t; DataBuffer buffer(sv.getDataBuffer().getLength(), sv.getDataBuffer().getDevTag()); @@ -209,8 +210,9 @@ class HamiltonianMPI final : public HamiltonianBase { tmp.getDataBuffer().getDevTag().getStreamID(), tmp.getCublasCaller()); } - sv.CopyGpuDataToGpuIn(buffer.getData(), buffer.getLength()); + PL_CUDA_IS_SUCCESS(cudaDeviceSynchronize()); + mpi_manager.Barrier(); } }; diff --git a/pennylane_lightning/core/src/simulators/lightning_gpu/utils/MPILinearAlg.hpp b/pennylane_lightning/core/src/simulators/lightning_gpu/utils/MPILinearAlg.hpp index cd2afd426b..9f0f3972a6 100644 --- a/pennylane_lightning/core/src/simulators/lightning_gpu/utils/MPILinearAlg.hpp +++ b/pennylane_lightning/core/src/simulators/lightning_gpu/utils/MPILinearAlg.hpp @@ -113,8 +113,9 @@ inline void SparseMV_cuSparseMPI( length_local, reduce_root_rank, "sum"); } + PL_CUDA_IS_SUCCESS(cudaDeviceSynchronize()); + mpi_manager.Barrier(); } - mpi_manager.Barrier(); } } // namespace Pennylane::LightningGPU::Util \ No newline at end of file diff --git a/pennylane_lightning/lightning_gpu/lightning_gpu.py b/pennylane_lightning/lightning_gpu/lightning_gpu.py index 98de0e9512..177275ec13 100644 --- a/pennylane_lightning/lightning_gpu/lightning_gpu.py +++ b/pennylane_lightning/lightning_gpu/lightning_gpu.py @@ -204,9 +204,17 @@ def _mebibytesToBytes(mebibytes): } class LightningGPU(LightningBase): # pylint: disable=too-many-instance-attributes - """PennyLane-Lightning-GPU device. + """PennyLane Lightning GPU device. + + A GPU-backed Lightning device using NVIDIA cuQuantum SDK. + + Use of this device requires pre-built binaries or compilation from source. Check out the + :doc:`/lightning_gpu/installation` guide for more details. + Args: wires (int): the number of wires to initialize the device with + mpi (bool): enable MPI support. MPI support will be enabled if ``mpi`` is set as``True``. + mpi_buf_size (int): size of GPU memory (in MiB) set for MPI operation and its default value is 64 MiB. sync (bool): immediately sync with host-sv after applying operations c_dtype: Datatypes for statevector representation. Must be one of ``np.complex64`` or ``np.complex128``. 
shots (int): How many times the circuit should be evaluated (or sampled) to estimate @@ -216,7 +224,7 @@ class LightningGPU(LightningBase): # pylint: disable=too-many-instance-attribut batch_obs (Union[bool, int]): determine whether to use multiple GPUs within the same node or not """ - name = "PennyLane plugin for GPU-backed Lightning device using NVIDIA cuQuantum SDK" + name = "Lightning GPU PennyLane plugin" short_name = "lightning.gpu" operations = allowed_operations @@ -283,6 +291,7 @@ def __init__( self._create_basis_state(0) def _mpi_init_helper(self, num_wires): + """Set up MPI checks.""" if not MPI_SUPPORT: raise ImportError("MPI related APIs are not found.") # initialize MPIManager and config check in the MPIManager ctor @@ -545,6 +554,7 @@ def apply_lightning(self, operations): # pylint: disable=unused-argument def apply(self, operations, rotations=None, **kwargs): + """Applies a list of operations to the state tensor.""" # State preparation is currently done in Python if operations: # make sure operations[0] exists if isinstance(operations[0], StatePrep): @@ -635,6 +645,12 @@ def _init_process_jacobian_tape(self, tape, starting_state, use_device_state): return self._gpu_state def adjoint_jacobian(self, tape, starting_state=None, use_device_state=False): + """Implements the adjoint method outlined in + `Jones and Gacon `__ to differentiate an input tape. + + After a forward pass, the circuit is reversed by iteratively applying adjoint + gates to scan backwards through the circuit. + """ if self.shots is not None: warn( "Requested adjoint differentiation to be computed with finite shots." @@ -697,7 +713,42 @@ def adjoint_jacobian(self, tape, starting_state=None, use_device_state=False): # pylint: disable=inconsistent-return-statements, line-too-long, missing-function-docstring def vjp(self, measurements, grad_vec, starting_state=None, use_device_state=False): - """Generate the processing function required to compute the vector-Jacobian products of a tape.""" + """Generate the processing function required to compute the vector-Jacobian products + of a tape. + + This function can be used with multiple expectation values or a quantum state. + When a quantum state is given, + + .. code-block:: python + + vjp_f = dev.vjp([qml.state()], grad_vec) + vjp = vjp_f(tape) + + computes :math:`w = (w_1,\\cdots,w_m)` where + + .. math:: + + w_k = \\langle v| \\frac{\\partial}{\\partial \\theta_k} | \\psi_{\\pmb{\\theta}} \\rangle. + + Here, :math:`m` is the total number of trainable parameters, + :math:`\\pmb{\\theta}` is the vector of trainable parameters and + :math:`\\psi_{\\pmb{\\theta}}` is the output quantum state. + + Args: + measurements (list): List of measurement processes for vector-Jacobian product. + Now it must be expectation values or a quantum state. + grad_vec (tensor_like): Gradient-output vector. Must have shape matching the output + shape of the corresponding tape, i.e. number of measurements if the return + type is expectation or :math:`2^N` if the return type is statevector + starting_state (tensor_like): post-forward pass state to start execution with. + It should be complex-valued. Takes precedence over ``use_device_state``. + use_device_state (bool): use current device state to initialize. + A forward pass of the same circuit should be the last thing the device + has executed. If a ``starting_state`` is provided, that takes precedence. + + Returns: + The processing function required to compute the vector-Jacobian products of a tape. 
+ """ if self.shots is not None: warn( "Requested adjoint differentiation to be computed with finite shots." @@ -742,6 +793,7 @@ def processing_fn(tape): # pylint: disable=attribute-defined-outside-init def sample(self, observable, shot_range=None, bin_size=None, counts=False): + """Return samples of an observable.""" if observable.name != "PauliZ": self.apply_lightning(observable.diagonalizing_gates()) self._samples = self.generate_samples() @@ -763,6 +815,19 @@ def generate_samples(self): # pylint: disable=protected-access, missing-function-docstring def expval(self, observable, shot_range=None, bin_size=None): + """Expectation value of the supplied observable. + + Args: + observable: A PennyLane observable. + shot_range (tuple[int]): 2-tuple of integers specifying the range of samples + to use. If not specified, all samples are used. + bin_size (int): Divides the shot range into bins of size ``bin_size``, and + returns the measurement statistic separately over each bin. If not + provided, the entire shot range is treated as a single bin. + + Returns: + Expectation value of the observable + """ if self.shots is not None: # estimate the expectation value samples = self.sample(observable, shot_range=shot_range, bin_size=bin_size) @@ -814,6 +879,15 @@ def expval(self, observable, shot_range=None, bin_size=None): return self.measurements.expval(observable.name, observable_wires) def probability_lightning(self, wires=None): + """Return the probability of each computational basis state. + + Args: + wires (Iterable[Number, str], Number, str, Wires): wires to return + marginal probabilities for. Wires not provided are traced out of the system. + + Returns: + array[float]: list of the probabilities + """ # translate to wire labels used by device observable_wires = self.map_wires(wires) # Device returns as col-major orderings, so perform transpose on data for bit-index shuffle for now. @@ -825,6 +899,19 @@ def probability_lightning(self, wires=None): # pylint: disable=missing-function-docstring def var(self, observable, shot_range=None, bin_size=None): + """Variance of the supplied observable. + + Args: + observable: A PennyLane observable. + shot_range (tuple[int]): 2-tuple of integers specifying the range of samples + to use. If not specified, all samples are used. + bin_size (int): Divides the shot range into bins of size ``bin_size``, and + returns the measurement statistic separately over each bin. If not + provided, the entire shot range is treated as a single bin. 
+ + Returns: + Variance of the observable + """ if self.shots is not None: # estimate the var # Lightning doesn't support sampling yet @@ -858,7 +945,7 @@ def var(self, observable, shot_range=None, bin_size=None): class LightningGPU(LightningBaseFallBack): # pragma: no cover # pylint: disable=missing-class-docstring, too-few-public-methods - name = "PennyLane plugin for GPU-backed Lightning device using NVIDIA cuQuantum SDK: [No binaries found - Fallback: default.qubit]" + name = "Lightning GPU PennyLane plugin: [No binaries found - Fallback: default.qubit]" short_name = "lightning.gpu" def __init__(self, wires, *, c_dtype=np.complex128, **kwargs): diff --git a/requirements-dev.txt b/requirements-dev.txt index 0d86e268fa..7582deeebb 100644 --- a/requirements-dev.txt +++ b/requirements-dev.txt @@ -10,4 +10,6 @@ pytest-mock pre-commit>=2.19.0 black==23.7.0 clang-format==14 -pylint \ No newline at end of file +cmake +custatevec-cu11 +pylint diff --git a/requirements.txt b/requirements.txt index af606a496a..e9a63ac0e5 100644 --- a/requirements.txt +++ b/requirements.txt @@ -4,4 +4,4 @@ pennylane>=0.32 pybind11 pytest pytest-cov -pytest-mock +pytest-mock \ No newline at end of file diff --git a/tests/test_adjoint_jacobian.py b/tests/test_adjoint_jacobian.py index 41a9784dc0..51e41d61db 100644 --- a/tests/test_adjoint_jacobian.py +++ b/tests/test_adjoint_jacobian.py @@ -1265,7 +1265,7 @@ def create_xyz_file(tmp_path_factory): @pytest.mark.skipif( - device_name != "lightning.gpu" or not ld._CPP_BINARY_AVAILABLE, + not ld._CPP_BINARY_AVAILABLE, reason="Tests only for lightning.gpu", ) @pytest.mark.parametrize(