[libc] Add Multithreaded GPU Benchmarks #98964

jameshu15869 · 2024-07-15T21:19:25Z

This PR runs benchmarks on 32 threads (a single warp on NVPTX) by default and adds an option for single-threaded benchmarks. We can specify that a benchmark should run on a single thread using the SINGLE_THREADED_BENCHMARK() macro.

I chose to use a flag here so that other options could be added in the future.
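
To illustrate the flag-based gating, here is a minimal, host-side sketch of the check this PR adds to `Benchmark::run_benchmarks()`. The helper names `should_run` and `count_running_threads` are hypothetical and exist only for this illustration; the real code runs on the GPU and reads the thread id from the runtime rather than from a loop.

```cpp
#include <cstdint>

// Stand-in for the enum added in LibcGpuBenchmark.h. Because the flags are
// bit values, a future option could take the next bit (for example 0x2).
enum BenchmarkFlags : uint8_t { SINGLE_THREADED = 0x1 };

// Hypothetical helper mirroring the condition in the diff: every thread runs
// the benchmark unless it is flagged single-threaded, in which case only
// thread 0 runs it.
bool should_run(uint8_t flags, uint32_t thread_id) {
  return !(flags & SINGLE_THREADED) || thread_id == 0;
}

// Hypothetical helper: counts how many of `num_threads` simulated threads
// would execute the benchmark body for the given flags.
uint32_t count_running_threads(uint8_t flags, uint32_t num_threads) {
  uint32_t running = 0;
  for (uint32_t id = 0; id < num_threads; ++id)
    if (should_run(flags, id))
      ++running;
  return running;
}
```

With 32 launched threads, a benchmark registered with flags `0` would run on all 32, while one registered with `SINGLE_THREADED` would run only on thread 0.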

llvmbot · 2024-07-15T21:19:57Z

@llvm/pr-subscribers-libc

Author: None (jameshu15869)

Changes

This PR runs benchmarks on 32 threads (a single warp on NVPTX) by default and adds an option for single-threaded benchmarks. We can specify that a benchmark should run on a single thread using the SINGLE_THREADED_BENCHMARK() macro.

I chose to use a flag here so that other options could be added in the future.

Full diff: https://github.com/llvm/llvm-project/pull/98964.diff

4 Files Affected:

(modified) libc/benchmarks/gpu/CMakeLists.txt (+6)
(modified) libc/benchmarks/gpu/LibcGpuBenchmark.cpp (+4-2)
(modified) libc/benchmarks/gpu/LibcGpuBenchmark.h (+11-3)
(modified) libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp (+2)

diff --git a/libc/benchmarks/gpu/CMakeLists.txt b/libc/benchmarks/gpu/CMakeLists.txt
index eaeecbdacd23e..8c409bc6ef3ea 100644
--- a/libc/benchmarks/gpu/CMakeLists.txt
+++ b/libc/benchmarks/gpu/CMakeLists.txt
@@ -10,6 +10,10 @@ function(add_benchmark benchmark_name)
     "LINK_LIBRARIES" # Multi-value arguments
     ${ARGN}
   )
+  # We run benchmarks on a single warp by default and give the
+  # option to run only a single thread
+  set(BENCHMARK_NUM_THREADS 32)
+
   if(NOT libc.src.time.clock IN_LIST TARGET_LLVMLIBC_ENTRYPOINTS)
     message(FATAL_ERROR "target does not support clock")
   endif()
@@ -19,6 +23,8 @@ function(add_benchmark benchmark_name)
     LINK_LIBRARIES
       LibcGpuBenchmark.hermetic
       ${BENCHMARK_LINK_LIBRARIES}
+    LOADER_ARGS
+      --threads ${BENCHMARK_NUM_THREADS}
     ${BENCHMARK_UNPARSED_ARGUMENTS}
   )
   get_fq_target_name(${benchmark_name} fq_target_name)
diff --git a/libc/benchmarks/gpu/LibcGpuBenchmark.cpp b/libc/benchmarks/gpu/LibcGpuBenchmark.cpp
index 23fff3e8180f7..2094d33e1e9e7 100644
--- a/libc/benchmarks/gpu/LibcGpuBenchmark.cpp
+++ b/libc/benchmarks/gpu/LibcGpuBenchmark.cpp
@@ -114,8 +114,10 @@ void Benchmark::run_benchmarks() {
       all_results.reset();
 
     gpu::sync_threads();
-    auto current_result = b->run();
-    all_results.update(current_result);
+    if (!(b->flags & BenchmarkFlags::SINGLE_THREADED) || id == 0) {
+      auto current_result = b->run();
+      all_results.update(current_result);
+    }
     gpu::sync_threads();
 
     if (id == 0)
diff --git a/libc/benchmarks/gpu/LibcGpuBenchmark.h b/libc/benchmarks/gpu/LibcGpuBenchmark.h
index 1f813f8655de6..53f35768e1bf1 100644
--- a/libc/benchmarks/gpu/LibcGpuBenchmark.h
+++ b/libc/benchmarks/gpu/LibcGpuBenchmark.h
@@ -74,16 +74,19 @@ struct BenchmarkResult {
   clock_t total_time = 0;
 };
 
+enum BenchmarkFlags { SINGLE_THREADED = 0x1 };
+
 BenchmarkResult benchmark(const BenchmarkOptions &options,
                           cpp::function<uint64_t(void)> wrapper_func);
 
 class Benchmark {
   const cpp::function<uint64_t(void)> func;
   const cpp::string_view name;
+  const uint8_t flags;
 
 public:
-  Benchmark(cpp::function<uint64_t(void)> func, char const *name)
-      : func(func), name(name) {
+  Benchmark(cpp::function<uint64_t(void)> func, char const *name, uint8_t flags)
+      : func(func), name(name), flags(flags) {
     add_benchmark(this);
   }
 
@@ -104,6 +107,11 @@ class Benchmark {
 
 #define BENCHMARK(SuiteName, TestName, Func)                                   \
   LIBC_NAMESPACE::benchmarks::Benchmark SuiteName##_##TestName##_Instance(     \
-      Func, #SuiteName "." #TestName)
+      Func, #SuiteName "." #TestName, 0)
+
+#define SINGLE_THREADED_BENCHMARK(SuiteName, TestName, Func)                   \
+  LIBC_NAMESPACE::benchmarks::Benchmark SuiteName##_##TestName##_Instance(     \
+      Func, #SuiteName "." #TestName,                                          \
+      LIBC_NAMESPACE::benchmarks::BenchmarkFlags::SINGLE_THREADED)
 
 #endif
diff --git a/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp b/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp
index 6f8d247902f76..d9c1a804ec506 100644
--- a/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp
+++ b/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp
@@ -7,6 +7,8 @@ uint64_t BM_IsAlnum() {
   return LIBC_NAMESPACE::latency(LIBC_NAMESPACE::isalnum, x);
 }
 BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnum, BM_IsAlnum);
+SINGLE_THREADED_BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnumSingleThread,
+                          BM_IsAlnum);
 
 uint64_t BM_IsAlnumCapital() {
   char x = 'A';

jhuber6

Hmm, I feel like it's easy enough to just set this on each one. It's also difficult because the warp size varies with the hardware (and compilation settings) for AMDGPU. Having a helper for a single-threaded run is probably fine.
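
The concern about a hard-coded 32 can be sketched as follows. Everything here is hypothetical and not part of the PR: `Arch`, `Target`, and `lanes_per_warp` are invented names, and the `amdgpu_wave64` field is an assumption that models whether an AMDGPU build uses 64-wide wavefronts instead of 32-wide ones.

```cpp
#include <cstdint>

// Hypothetical description of a launch target; invented for illustration.
enum class Arch { NVPTX, AMDGPU };

struct Target {
  Arch arch;
  bool amdgpu_wave64; // assumption: true when the AMDGPU build is wave64
};

// Returns the number of lanes in one warp/wavefront for the target. NVPTX
// warps are 32 lanes wide; for AMDGPU the width depends on the hardware and
// on compilation settings, which is why a fixed 32 is not portable.
constexpr uint32_t lanes_per_warp(Target t) {
  if (t.arch == Arch::NVPTX)
    return 32;
  return t.amdgpu_wave64 ? 64 : 32;
}
```

A thread count chosen per benchmark, or derived from a helper like this one, avoids baking a single warp size into every benchmark.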

jameshu15869 · 2024-07-15T21:42:43Z

Ah, do you mean have the macro say how many threads to use? e.g. BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnum, BM_IsAlnum, 32);?

jhuber6 · 2024-07-15T21:43:38Z

> Ah, do you mean have the macro say how many threads to use? e.g. BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnum, BM_IsAlnum, 32);?

Nah I just mean each time we register a benchmark it should just say how many it wants. One thread is a reasonable default since that's what the loader defaults to.

jameshu15869 · 2024-07-15T21:49:54Z

How many threads should we launch the loader with? I mean like if we always run benchmarks with --threads 32, what should happen if the user requests 64 threads for a benchmark?

jhuber6 · 2024-07-15T21:51:04Z

> How many threads should we launch the loader with? I mean like if we always run benchmarks with --threads 32, what should happen if the user requests 64 threads for a benchmark?

It should launch with whatever the user requested when they wrote the add_libc_benchmark and default to no arguments (1 thread, 1 block for the loader), the same way the integration tests work.

jhuber6

It's probably easier to just forward the loader args through the unparsed arguments, but either way works.

jameshu15869 · 2024-07-17T04:09:09Z

Do you mean you would prefer something that looks more like this?

add_benchmark(
  isalpha_benchmark
  SUITE
    libc-gpu-ctype-benchmarks
  SRCS
    isalpha_benchmark.cpp
  DEPENDS
    libc.src.ctype.isalpha
  LOADER_ARGS
    --threads 32
)

I was thinking that an explicit NUM_THREADS would be a shorthand and might be a little clearer than using LOADER_ARGS, but I think I can see what you mean.

run benchmarks on warps by default, adding the option for single threaded benchmarks

Differential Revision: https://phabricator.intern.facebook.com/D60250873

25bcbd1

llvmbot added the libc label Jul 15, 2024

jhuber6 reviewed Jul 15, 2024

specify threads when registering benchmarks in cmake

b96f564

jameshu15869 changed the title ~~[libc] Run Benchmarks on 32 Threads by Default~~ [libc] Add Multithreaded GPU Benchmarks Jul 16, 2024

jhuber6 approved these changes Jul 16, 2024

correctly handle default arg for num threads

c49436c

jhuber6 reviewed Jul 17, 2024

jhuber6 reviewed Jul 17, 2024

make threads a loader arg and add single wave helper

6a2bc11

jhuber6 approved these changes Jul 17, 2024

jhuber6 merged commit 8badfcc into llvm:main Jul 18, 2024
6 checks passed
