Convert DG-RePlAce algorithm to Kokkos #5352

kamilrakoczy · 2024-07-08T09:41:44Z

This MR converts DG-RePlAce algorithm that was originally written for CUDA to Kokkos.

Kokkos provides abstraction for writing parallel code that can be translated into several backends including CUDA, OpenMP and C++ threads.

Tested on single run with RTX 3090 and i7-8700 CPU @ 3.20GHz using ariane133 design.

	original placer	CUDA implementation	Kokkos (CUDA backend)	Kokkos (OpenMP backend)	Kokkos (Threads backend)
ariane133 global place time	11:27.39	0:57.70	1:33.49	3:24.12	6:08.94

Signed-off-by: ZhiangWang033 <[email protected]>

Co-authored-by: Kamil Rakoczy <[email protected]> Signed-off-by: Krzysztof Bieganski <[email protected]> Signed-off-by: Kamil Rakoczy <[email protected]>

github-actions

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 52. Check the log or trigger a new build to see more.

github-actions · 2024-07-08T09:49:01Z

src/gpl2/src/MakeDgReplace.cpp

+// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+///////////////////////////////////////////////////////////////////////////////
+
+#include "gpl2/MakeDgReplace.h"


warning: 'gpl2/MakeDgReplace.h' file not found [clang-diagnostic-error]

#include "gpl2/MakeDgReplace.h" ^

github-actions · 2024-07-08T09:49:01Z

src/gpl2/src/dct.cpp

+//
+///////////////////////////////////////////////////////////////////////////////
+
+#include <Kokkos_Core.hpp>


warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]

#include <Kokkos_Core.hpp> ^

github-actions · 2024-07-08T09:49:01Z

src/gpl2/src/dct.h

+//
+//
+///////////////////////////////////////////////////////////////////////////////
+#include <Kokkos_Core.hpp>


warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]

#include <Kokkos_Core.hpp> ^

github-actions · 2024-07-08T09:49:01Z

src/gpl2/src/dct.h

+///////////////////////////////////////////////////////////////////////////////
+#include <Kokkos_Core.hpp>
+
+void dct_2d_fft(const int M,


warning: parameter 'M' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change

void dct_2d_fft(const int M,

void dct_2d_fft(int M,

github-actions · 2024-07-08T09:49:01Z

src/gpl2/src/dct.h

+#include <Kokkos_Core.hpp>
+
+void dct_2d_fft(const int M,
+                const int N,


warning: parameter 'N' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change

const int N,

int N,

github-actions · 2024-07-08T09:49:03Z

src/gpl2/src/placerBase.cpp

+    binCntY_ = 512;
+  }
+
+  binSizeX_ = ceil(static_cast<float>((ux_ - lx_)) / binCntX_);


warning: call to 'ceil' promotes float to double [performance-type-promotion-in-math-fn]

src/gpl2/src/placerBase.cpp:40:

- #include <cstdio> + #include <cmath> + #include <cstdio>

Suggested change

binSizeX_ = ceil(static_cast<float>((ux_ - lx_)) / binCntX_);

binSizeX_ = std::ceil(static_cast<float>((ux_ - lx_)) / binCntX_);

github-actions · 2024-07-08T09:49:03Z

src/gpl2/src/placerBase.cpp

+  }
+
+  binSizeX_ = ceil(static_cast<float>((ux_ - lx_)) / binCntX_);
+  binSizeY_ = ceil(static_cast<float>((uy_ - ly_)) / binCntY_);


warning: call to 'ceil' promotes float to double [performance-type-promotion-in-math-fn]

Suggested change

binSizeY_ = ceil(static_cast<float>((uy_ - ly_)) / binCntY_);

binSizeY_ = std::ceil(static_cast<float>((uy_ - ly_)) / binCntY_);

github-actions · 2024-07-08T09:49:03Z

src/gpl2/src/placerBase.h

+#include <string>
+#include <vector>
+
+#include "db_sta/dbNetwork.hh"


warning: 'db_sta/dbNetwork.hh' file not found [clang-diagnostic-error]

#include "db_sta/dbNetwork.hh" ^

github-actions · 2024-07-08T09:49:03Z

src/gpl2/src/placerBase.h

+  int64_t nesterovInstsArea() const
+  {
+    return stdInstsArea_
+           + static_cast<int64_t>(round(macroInstsArea_ * targetDensity_));


warning: call to 'round' promotes float to double [performance-type-promotion-in-math-fn]

src/gpl2/src/placerBase.h:38:

- #include <memory> + #include <cmath> + #include <memory>

Suggested change

+ static_cast<int64_t>(round(macroInstsArea_ * targetDensity_));

+ static_cast<int64_t>(std::round(macroInstsArea_ * targetDensity_));

github-actions · 2024-07-08T09:49:03Z

src/gpl2/src/placerObjects.cpp

+///////////////////////////////////////////////////////////////
+// Instance
+Instance::Instance()
+    : inst_(nullptr),


warning: member initializer for 'inst_' is redundant [modernize-use-default-member-init]

Suggested change

: inst_(nullptr),

: ,

maliberty · 2024-07-08T15:30:55Z

Earlier it was reported the runtime difference to be minimal but 0:57.70 vs 1:33.49 is more substantial. Is this expected?

kamilrakoczy · 2024-07-09T09:03:40Z

Earlier it was reported the runtime difference to be minimal but 0:57.70 vs 1:33.49 is more substantial. Is this expected?

Earlier measurements were done when some parts was still using native CUDA and using different design (black-parrot).
This measurements are single run on local machine while using it for other things too, so they are not very accurate.

I'd expect, it should be possible to achieve similar runtime using Kokkos, This results might suggest, that there are some unnecessary memory copies between host/device, but this needs to be investigated further.

maliberty · 2024-07-09T17:30:25Z

Please try to get a more precise measure of the runtime difference as this is important in deciding whether Kokkos is a good alternative to direct CUDA coding.

Do all the various versions produce the same result? That is also important.

maliberty · 2024-07-09T17:33:28Z

What was the thinking behind making kokkos a dependency but kokkos-fft a submodule? It seems like they could both be build dependencies (and added to the DependencyInstaller with an option).

QuantamHD · 2024-07-09T18:31:57Z

Please try to get a more precise measure of the runtime difference as this is important in deciding whether Kokkos is a good alternative to direct CUDA coding.

I think I would say direct CUDA coding isn't really a viable option. I would be personally opposed to its inclusion. I think Kokkos or something like it is the only viable path forward. The runtime differences don't look significant if you compare it to the overall speedup achieved.

We're going for a pragmatic path forward, and to me this meets my bar for the goals we set out.

Do all the various versions produce the same result? That is also important.

Agree that this is important to check. We may need to order the floats to get identical/sufficiently similar results.

maliberty · 2024-07-09T20:01:25Z

I think I would say direct CUDA coding isn't really a viable option. I would be personally opposed to its inclusion.

You personally pushed for the inclusion of gpuSolver.cu and said its was valuable as a template for future development. Shall we delete it? I was never in favor.

A 50% overhead is worth exploring to at least understand if not eliminate.

QuantamHD · 2024-07-09T21:50:06Z

You personally pushed for the inclusion of gpuSolver.cu and said its was valuable as a template for future development. Shall we delete it? I was never in favor.

I think that seems like the right move at this point. With more time and context I don't think it's viable for us to maintain two codebases.

A 50% overhead is worth exploring to at least understand if not eliminate.

+1 I just want to point out if this is the fastest we could go that seems fast enough for me.

kamilrakoczy · 2024-07-18T13:56:48Z

Do all the various versions produce the same result? That is also important.

No they don't and it was quite surprising, as I expected that original code and Kokkos with CUDA backend will produce the same result.
We investigated this and it turned out that it is because Kokkos passes all files that depends on it through nvcc_wrapper. This wrapper converts host compiler options (g++) to nvcc options and uses nvcc to compile all Kokkos-dependent sources. This is done to allow device code in single .cpp file instead of separate .cu file for it.

NVCC should do pre-processing and compilation for device code and produce CUDA binary and it should leave host code for host compiler.

We checked that when nvcc is used to compile InitialPlace, Eigen solveWithGuess returns different results with exactly the same inputs comparing to using g++ directly.

I suspect that this issue isn't only related to Eigen: when I disabled initial placement, runtime of Kokkos and original code were almost the same, but results were still different (I haven't investigated reason for this).

What was the thinking behind making kokkos a dependency but kokkos-fft a submodule? It seems like they could both be build dependencies (and added to the DependencyInstaller with an option).

kokkos-fft is header only interface library that translates FFT calls into proper backend by detecting enabled backends in Kokkos, but I agree, if preferred, both kokkos and kokkos-fft could be dependencies.

A 50% overhead is worth exploring to at least understand if not eliminate.

I think this overhead is due to different initial placement, when initial placement is disabled runtime is very similar:

	CUDA implementation	Kokkos (CUDA backend)
ariane133 global place time without initial placement	0:55.52	0:58.25

I also did precise measurements using RTX 3080, 8 vCPU i9-12900 @ 2.42 GHz and 32GB of RAM with 10 runs using ariane133 design:

	min time [min]	avg time [min]	med time [min]	max time [min]
CUDA implementation	0:45	0:48	0:47	0:53
Kokkos (CUDA backend)	1:53	1:57	1:57	2:00
Kokkos (OpenMP backend)	1:50	2:04	1:54	2:37
Kokkos (threads backend)	3:42	3:43	3:43	3:45

maliberty · 2024-07-18T15:10:28Z

Thanks for the analysis. It would be good to get to the bottom of the difference as it will make regression testing hard otherwise. Is nvcc calling g++ with different flags?

kamilrakoczy · 2024-07-19T07:33:04Z

Is nvcc calling g++ with different flags?

Arguments that are passed to nvcc and that nvcc should pass to g++ are the same.
I haven't investigated yet how (with what flags) g++ is invoked from nvcc.

maliberty · 2024-07-19T15:38:07Z

another possibility is that it is invoking a different g++ binary from another path

maliberty · 2024-10-14T04:42:11Z

Converted to a draft due to no progress.

ZhiangWang033 and others added 2 commits July 8, 2024 11:30

DGRePlAce

01e8318

Signed-off-by: ZhiangWang033 <[email protected]>

Replace CUDA with Kokkos library

bafe535

Co-authored-by: Kamil Rakoczy <[email protected]> Signed-off-by: Krzysztof Bieganski <[email protected]> Signed-off-by: Kamil Rakoczy <[email protected]>

github-actions bot reviewed Jul 8, 2024

View reviewed changes

maliberty marked this pull request as draft October 14, 2024 04:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Convert DG-RePlAce algorithm to Kokkos #5352

Convert DG-RePlAce algorithm to Kokkos #5352

kamilrakoczy commented Jul 8, 2024 •

edited

Loading

github-actions bot left a comment

github-actions bot Jul 8, 2024

github-actions bot Jul 8, 2024

github-actions bot Jul 8, 2024

github-actions bot Jul 8, 2024

github-actions bot Jul 8, 2024

github-actions bot Jul 8, 2024

github-actions bot Jul 8, 2024

github-actions bot Jul 8, 2024

github-actions bot Jul 8, 2024

github-actions bot Jul 8, 2024

maliberty commented Jul 8, 2024

kamilrakoczy commented Jul 9, 2024

maliberty commented Jul 9, 2024

maliberty commented Jul 9, 2024

QuantamHD commented Jul 9, 2024 •

edited

Loading

maliberty commented Jul 9, 2024

QuantamHD commented Jul 9, 2024 •

edited

Loading

kamilrakoczy commented Jul 18, 2024

maliberty commented Jul 18, 2024

kamilrakoczy commented Jul 19, 2024

maliberty commented Jul 19, 2024

maliberty commented Oct 14, 2024

	binSizeX_ = ceil(static_cast<float>((ux_ - lx_)) / binCntX_);
	binSizeX_ = std::ceil(static_cast<float>((ux_ - lx_)) / binCntX_);

	binSizeY_ = ceil(static_cast<float>((uy_ - ly_)) / binCntY_);
	binSizeY_ = std::ceil(static_cast<float>((uy_ - ly_)) / binCntY_);

	+ static_cast<int64_t>(round(macroInstsArea_ * targetDensity_));
	+ static_cast<int64_t>(std::round(macroInstsArea_ * targetDensity_));

Convert DG-RePlAce algorithm to Kokkos #5352

Are you sure you want to change the base?

Convert DG-RePlAce algorithm to Kokkos #5352

Conversation

kamilrakoczy commented Jul 8, 2024 • edited Loading

github-actions bot left a comment

Choose a reason for hiding this comment

github-actions bot Jul 8, 2024

Choose a reason for hiding this comment

github-actions bot Jul 8, 2024

Choose a reason for hiding this comment

github-actions bot Jul 8, 2024

Choose a reason for hiding this comment

github-actions bot Jul 8, 2024

Choose a reason for hiding this comment

github-actions bot Jul 8, 2024

Choose a reason for hiding this comment

github-actions bot Jul 8, 2024

Choose a reason for hiding this comment

github-actions bot Jul 8, 2024

Choose a reason for hiding this comment

github-actions bot Jul 8, 2024

Choose a reason for hiding this comment

github-actions bot Jul 8, 2024

Choose a reason for hiding this comment

github-actions bot Jul 8, 2024

Choose a reason for hiding this comment

maliberty commented Jul 8, 2024

kamilrakoczy commented Jul 9, 2024

maliberty commented Jul 9, 2024

maliberty commented Jul 9, 2024

QuantamHD commented Jul 9, 2024 • edited Loading

maliberty commented Jul 9, 2024

QuantamHD commented Jul 9, 2024 • edited Loading

kamilrakoczy commented Jul 18, 2024

maliberty commented Jul 18, 2024

kamilrakoczy commented Jul 19, 2024

maliberty commented Jul 19, 2024

maliberty commented Oct 14, 2024

kamilrakoczy commented Jul 8, 2024 •

edited

Loading

QuantamHD commented Jul 9, 2024 •

edited

Loading

QuantamHD commented Jul 9, 2024 •

edited

Loading