Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tiled_cholesky w/o openmp #27

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

Conversation

hcq9102
Copy link
Contributor

@hcq9102 hcq9102 commented Dec 1, 2023

tiled_cholesky implementation.

BUILD NOTE: build openblas library as needed. build OpenBlas script

include openblas library when build or $ export OPENBLAS_DIR=/openblas/path

TODO:

  • add args : matrix_size, num_tiles, int verifycorrectness = 1; bool layRow = true;

Copy link
Collaborator

@weilewei weilewei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution, I put some comments around coding styles.

apps/choleskey/CMakeLists.txt Outdated Show resolved Hide resolved
apps/choleskey/CMakeLists.txt Show resolved Hide resolved
apps/choleskey/help.hpp Outdated Show resolved Hide resolved
apps/choleskey/cholesky_tiled.cpp Outdated Show resolved Hide resolved
apps/choleskey/cholesky_tiled.cpp Outdated Show resolved Hide resolved
apps/choleskey/cholesky_tiled.cpp Outdated Show resolved Hide resolved
apps/choleskey/cholesky_tiled.cpp Outdated Show resolved Hide resolved
apps/choleskey/help.hpp Outdated Show resolved Hide resolved
apps/choleskey/help.hpp Outdated Show resolved Hide resolved
apps/choleskey/cholesky_tiled.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@mhaseeb123 mhaseeb123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few questions:

  1. Is this code taken/modified from lapack or blas examples or written from scratch. I see a few C style code and _WIN32 relics that seem to be coming from somewhere else?

  2. Maybe I am not in the loop but I thought we were implementing a stdexec/omp version of the cholesky? From the last meeting, I remember you had a design ready for at least 3x3 matrices that you were implementing?

  3. I don't get the point of using lapack or blas to build cholesky as those libraries already contain methods to do that directly: https://nalgebra.org/docs/user_guide/decompositions_and_lapack/

  4. Does this code run in serial or parallel fashion? If serial, what's the advantage of doing this over the serial version we already have?

  5. Generally, I am not in favor of adding so many external dependencies (lapack/blas) in this repository unless they are absolutely needed for functionality of our code.

Minor:

  1. I think this PR has a conflict with main branch that will need to be resolved before merging.

  2. I agree with @weilewei comments and want to emphasize that there needs to be documentation of what this code is doing, why and how is this different from the serial version?

apps/choleskey/cholesky_tiled.cpp Outdated Show resolved Hide resolved
apps/choleskey/help.hpp Outdated Show resolved Hide resolved
@hcq9102
Copy link
Contributor Author

hcq9102 commented Dec 2, 2023

Hi @mhaseeb123 Thanks for asking:

  1. Is this code taken/modified from lapack or blas examples or written from scratch. I see a few C style code and _WIN32 relics that seem to be coming from somewhere else?

This code modified a lot mainly from Intel open source project and other online sources.

  1. Maybe I am not in the loop but I thought we were implementing a stdexec/omp version of the cholesky? From the last meeting, I remember you had a design ready for at least 3x3 matrices that you were implementing?

Yes, we were trying to implement a stdexec/omp version of the cholesky.
And if you remember the paper "Task-Based Cholesky Decomposition on Knights Corner using OpenMP" and it talks about the OpenMP Task-Based Cholesky Decomposition. and other papers about the tiled_cholesky decomposition.

So according to its pseudo code in the paper, first, we have to have the tiled_cholesky decomposition algrithm for constructing task-based graph(for openmp).

Once we have this tiled_cholesky decomposition algorithm works, then we can implementing 'omp' version by adding openmp syntax.

And last time meeting, I had a design ready for at least 3x3 matrices works for this tiled_cholesky decomposition, i was working on verifying the correctness and generalize the code as much as possible.

  1. I don't get the point of using lapack or blas to build cholesky as those libraries already contain methods to do that directly: https://nalgebra.org/docs/user_guide/decompositions_and_lapack/

As I mention in Question2, this implementation is for tiled_cholesky using functions in lapack or blas.
Yes, lapack or blas provide the LAPACKE_dpotrf() to do this directly but not in sub_blocks way.

As we have seen, the nvdia's cholesky implementation using their advanced library #include <experimental/linalg>, this is also a block cholesky algorithm which with fix 4 sub_blocks.

I cannot find a way to do (task_based )'omp' if we only have previous implemented code.

  1. Does this code run in serial or parallel fashion? If serial, what's the advantage of doing this over the serial version we already have?

currently, I am running the code with gpu.
I didn't test the performance between this and serial.

  1. Generally, I am not in favor of adding so many external dependencies (lapack/blas) in this repository unless they are absolutely needed for functionality of our code.

Totally understand. I also worry about this external dependencies issue at the very beginning.
but I didnt find a good way to support this kind of implementation after two weeks literature reviews on this.
This algorithm has heavy data dependencies, I didnt see much innovation methods to implement.
and cannot write these support functions(LAPACKE_dpotrf, cblas_dtrsm, cblas_dsyrk, cblas_dgemm) easily from scratch.

Of course, we can discuss wether we need to continue 'omp' implementation based on this PR and using external dependencies (lapack/blas) as well or not. I appreciate any suggestions on task-based implementation for this algorithm

@mhaseeb123
Copy link
Collaborator

Hi @hcq9102, thank you for taking the time to respond to questions.

  1. That's fine but please include original licenses in files as well if required for transparency and legal reasons.

  2. I would recommend also using #include <experimental/linalg/..> as it is natively available in our NVC++ compiler instead of adding external lapack and blas libraries.

  3. As you have a task-based parallel design ready (at least for 3x3 matrix), I would strong recommend implementing it with stdexec and #include <experimental/linalg> first. Doing so will allow you to work with stdexec and use CPUs/GPUs similar to other apps without using external dependencies.

  4. If possible, I would like to see some performance and correctness results before merging this into the main branch

@hcq9102
Copy link
Contributor Author

hcq9102 commented Dec 5, 2023

Hi @mhaseeb123
Thanks for your comments!

  1. I would recommend also using #include <experimental/linalg/..> as it is natively available in our NVC++ compiler instead of adding external lapack and blas libraries.

Weile and I have discussed the possibility of using <experimental/linalg/..> in task-graph imp one month ago, but there is one function missing in <experimental/linalg/..>, so have to turn to lapack. And its not that good if mixed <experimental/linalg/..> and lapack.
But I will check the possibility of using #include <experimental/linalg/..> again.

  1. As you have a task-based parallel design ready (at least for 3x3 matrix), I would strong recommend implementing it with stdexec and #include <experimental/linalg> first. Doing so will allow you to work with stdexec and use CPUs/GPUs similar to other apps without using external dependencies.

I will try to do senders/receivers if possible, but there is external lib involved issue.

  1. If possible, I would like to see some performance and correctness results before merging this into the main branch.

I have verified the correctness with lapack function which calculate cholesky decomposition directly.
Yes, Will have Performance test.

reference the implementation from Intel open source project, hetero-streams which will no longer be maintained by Intel.
https://github.com/intel/hetero-streams/tree/master/ref_code/cholesky

4. Additionally, include openblas library when build:
Copy link
Collaborator

@mhaseeb123 mhaseeb123 Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to install/setup OpenBLAS with CPM during CMake? Personally I don't want to add these extra manual steps of installing and export OPENBLAS_DIR=.. as they would make general usage as well as CI/CD really tedious. @weilewei

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhaseeb123 agreed not introducing extra flags. I just tried with CPMadd, and building openblas takes a while for the first-time building it.
@hcq9102 try the following in the CMakeLists files. Additionally, I request to add a compilation flag something like USE_OPENBLAS to cmake build to separate out if we should use cpm to add openblas.

in main CmakeLists.txt

if (USE_OPENBLAS)
cpmaddpackage(NAME openblas GITHUB_REPOSITORY OpenMathLib/OpenBLAS GIT_TAG
              v0.3.25)
endif()

in your application CMakeLists.txt:

target_link_libraries(myapp openblas)

@mhaseeb123
Copy link
Collaborator

General question: I noticed that we are accessing matrices via C-style indexing at several locations. For example: matrix[row * n + col] = syntax. Would it make more sense to use mdspans for better syntax and readability?

@weilewei
Copy link
Collaborator

weilewei commented Dec 5, 2023

General question: I noticed that we are accessing matrices via C-style indexing at several locations. For example: matrix[row * n + col] = syntax. Would it make more sense to use mdspans for better syntax and readability?

I believe the openblas library is C-style and cannot figure out mdspan.

apps/choleskey/cholesky_tiled.cpp Show resolved Hide resolved
apps/choleskey/cholesky_tiled.cpp Outdated Show resolved Hide resolved
apps/choleskey/cholesky_tiled.cpp Outdated Show resolved Hide resolved
apps/choleskey/cholesky_tiled.cpp Outdated Show resolved Hide resolved
apps/choleskey/cholesky_tiled.cpp Show resolved Hide resolved
apps/choleskey/matrixutil.hpp Outdated Show resolved Hide resolved
apps/choleskey/matrixutil.hpp Outdated Show resolved Hide resolved
apps/choleskey/matrixutil.hpp Outdated Show resolved Hide resolved
apps/choleskey/matrixutil.hpp Outdated Show resolved Hide resolved
apps/choleskey/matrixutil.hpp Outdated Show resolved Hide resolved
@weilewei weilewei mentioned this pull request Dec 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants