[libc] Add Multithreaded GPU Benchmarks #98964

Merged
merged 4 commits into from
Jul 18, 2024

jameshu15869

This PR runs benchmarks on 32 threads (a single warp on NVPTX) by default and adds the option to run a benchmark on a single thread. To request a single-threaded run, register the benchmark with the SINGLE_THREADED_BENCHMARK() macro.

I chose to use a flag here so that other options could be added in the future.
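
As a rough illustration of why a bit flag leaves room for later options, here is a minimal host-side sketch. Only `SINGLE_THREADED = 0x1` comes from this PR; the `std::uint8_t` underlying type, `HYPOTHETICAL_OPTION`, `has_flag`, `combine`, and `demo_flags` are assumptions made up for the example and are not part of the benchmark harness.

```cpp
#include <cassert>
#include <cstdint>

// One bit per option. SINGLE_THREADED mirrors the PR; HYPOTHETICAL_OPTION
// is invented here only to show how a second option could be added.
enum BenchmarkFlags : std::uint8_t {
  SINGLE_THREADED = 0x1,
  HYPOTHETICAL_OPTION = 0x2, // assumption: not part of the PR
};

// Returns true when `flag` is set in the `flags` bitmask.
constexpr bool has_flag(std::uint8_t flags, BenchmarkFlags flag) {
  return (flags & flag) != 0;
}

// Combines two options into a single bitmask.
constexpr std::uint8_t combine(BenchmarkFlags a, BenchmarkFlags b) {
  return static_cast<std::uint8_t>(a | b);
}

// Small usage example: a benchmark registered with both options set.
inline void demo_flags() {
  const std::uint8_t flags = combine(SINGLE_THREADED, HYPOTHETICAL_OPTION);
  assert(has_flag(flags, SINGLE_THREADED));
  assert(has_flag(flags, HYPOTHETICAL_OPTION));
}
```

Because each option occupies its own bit, a new option would not need to change the `Benchmark` constructor signature; it would only add an enumerator and a check.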

@llvmbot llvmbot added the libc label Jul 15, 2024
llvmbot commented Jul 15, 2024

@llvm/pr-subscribers-libc

Author: None (jameshu15869)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/98964.diff

4 Files Affected:

  • (modified) libc/benchmarks/gpu/CMakeLists.txt (+6)
  • (modified) libc/benchmarks/gpu/LibcGpuBenchmark.cpp (+4-2)
  • (modified) libc/benchmarks/gpu/LibcGpuBenchmark.h (+11-3)
  • (modified) libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp (+2)
diff --git a/libc/benchmarks/gpu/CMakeLists.txt b/libc/benchmarks/gpu/CMakeLists.txt
index eaeecbdacd23e..8c409bc6ef3ea 100644
--- a/libc/benchmarks/gpu/CMakeLists.txt
+++ b/libc/benchmarks/gpu/CMakeLists.txt
@@ -10,6 +10,10 @@ function(add_benchmark benchmark_name)
     "LINK_LIBRARIES" # Multi-value arguments
     ${ARGN}
   )
+  # We run benchmarks on a single warp (32 threads) and give the
+  # option to run on only a single thread
+  set(BENCHMARK_NUM_THREADS 32)
+
   if(NOT libc.src.time.clock IN_LIST TARGET_LLVMLIBC_ENTRYPOINTS)
     message(FATAL_ERROR "target does not support clock")
   endif()
@@ -19,6 +23,8 @@ function(add_benchmark benchmark_name)
     LINK_LIBRARIES
       LibcGpuBenchmark.hermetic
       ${BENCHMARK_LINK_LIBRARIES}
+    LOADER_ARGS
+      --threads ${BENCHMARK_NUM_THREADS}
     ${BENCHMARK_UNPARSED_ARGUMENTS}
   )
   get_fq_target_name(${benchmark_name} fq_target_name)
diff --git a/libc/benchmarks/gpu/LibcGpuBenchmark.cpp b/libc/benchmarks/gpu/LibcGpuBenchmark.cpp
index 23fff3e8180f7..2094d33e1e9e7 100644
--- a/libc/benchmarks/gpu/LibcGpuBenchmark.cpp
+++ b/libc/benchmarks/gpu/LibcGpuBenchmark.cpp
@@ -114,8 +114,10 @@ void Benchmark::run_benchmarks() {
       all_results.reset();
 
     gpu::sync_threads();
-    auto current_result = b->run();
-    all_results.update(current_result);
+    if (!(b->flags & BenchmarkFlags::SINGLE_THREADED) || id == 0) {
+      auto current_result = b->run();
+      all_results.update(current_result);
+    }
     gpu::sync_threads();
 
     if (id == 0)
diff --git a/libc/benchmarks/gpu/LibcGpuBenchmark.h b/libc/benchmarks/gpu/LibcGpuBenchmark.h
index 1f813f8655de6..53f35768e1bf1 100644
--- a/libc/benchmarks/gpu/LibcGpuBenchmark.h
+++ b/libc/benchmarks/gpu/LibcGpuBenchmark.h
@@ -74,16 +74,19 @@ struct BenchmarkResult {
   clock_t total_time = 0;
 };
 
+enum BenchmarkFlags { SINGLE_THREADED = 0x1 };
+
 BenchmarkResult benchmark(const BenchmarkOptions &options,
                           cpp::function<uint64_t(void)> wrapper_func);
 
 class Benchmark {
   const cpp::function<uint64_t(void)> func;
   const cpp::string_view name;
+  const uint8_t flags;
 
 public:
-  Benchmark(cpp::function<uint64_t(void)> func, char const *name)
-      : func(func), name(name) {
+  Benchmark(cpp::function<uint64_t(void)> func, char const *name, uint8_t flags)
+      : func(func), name(name), flags(flags) {
     add_benchmark(this);
   }
 
@@ -104,6 +107,11 @@ class Benchmark {
 
 #define BENCHMARK(SuiteName, TestName, Func)                                   \
   LIBC_NAMESPACE::benchmarks::Benchmark SuiteName##_##TestName##_Instance(     \
-      Func, #SuiteName "." #TestName)
+      Func, #SuiteName "." #TestName, 0)
+
+#define SINGLE_THREADED_BENCHMARK(SuiteName, TestName, Func)                   \
+  LIBC_NAMESPACE::benchmarks::Benchmark SuiteName##_##TestName##_Instance(     \
+      Func, #SuiteName "." #TestName,                                          \
+      LIBC_NAMESPACE::benchmarks::BenchmarkFlags::SINGLE_THREADED)
 
 #endif
diff --git a/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp b/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp
index 6f8d247902f76..d9c1a804ec506 100644
--- a/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp
+++ b/libc/benchmarks/gpu/src/ctype/isalnum_benchmark.cpp
@@ -7,6 +7,8 @@ uint64_t BM_IsAlnum() {
   return LIBC_NAMESPACE::latency(LIBC_NAMESPACE::isalnum, x);
 }
 BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnum, BM_IsAlnum);
+SINGLE_THREADED_BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnumSingleThread,
+                          BM_IsAlnum);
 
 uint64_t BM_IsAlnumCapital() {
   char x = 'A';
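
The gating condition in the `run_benchmarks()` hunk can be checked on the host with a small simulation. This is only a sketch under stated assumptions: `should_run` and `count_running_threads` are hypothetical helpers, the loop over thread ids stands in for a real GPU launch, and the `std::uint64_t` id type is a guess. The predicate itself is copied from the diff.

```cpp
#include <cassert>
#include <cstdint>

enum BenchmarkFlags : std::uint8_t { SINGLE_THREADED = 0x1 };

// Same predicate as in the diff: every thread runs a normal benchmark,
// but only thread 0 runs a single-threaded one.
constexpr bool should_run(std::uint8_t flags, std::uint64_t id) {
  return !(flags & SINGLE_THREADED) || id == 0;
}

// Host-side stand-in for a launch of `num_threads` threads: counts how
// many thread ids would execute the benchmark body.
inline std::uint64_t count_running_threads(std::uint8_t flags,
                                           std::uint64_t num_threads) {
  assert(num_threads > 0 && "a launch needs at least one thread");
  std::uint64_t running = 0;
  for (std::uint64_t id = 0; id < num_threads; ++id)
    if (should_run(flags, id))
      ++running;
  return running;
}
```

With a 32-thread launch, a benchmark registered with flags `0` runs on all 32 threads, while one registered with `SINGLE_THREADED` runs on thread 0 only. In the diff, both `gpu::sync_threads()` calls sit outside the `if`, so the threads that skip the body still reach the barriers.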

@jhuber6 jhuber6 left a comment

Hmm, I feel like it's easy enough to just set this on each one. Hard-coding a warp is also difficult because the warp size varies with the hardware (and compilation settings) on AMDGPU. Having a helper for a single-threaded run is probably fine.

@jameshu15869

Ah, do you mean have the macro say how many threads to use? e.g. BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnum, BM_IsAlnum, 32);?

jhuber6 commented Jul 15, 2024

> Ah, do you mean have the macro say how many threads to use? e.g. BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnum, BM_IsAlnum, 32);?

Nah I just mean each time we register a benchmark it should just say how many it wants. One thread is a reasonable default since that's what the loader defaults to.

@jameshu15869

How many threads should we launch the loader with? I mean like if we always run benchmarks with --threads 32, what should happen if the user requests 64 threads for a benchmark?

jhuber6 commented Jul 15, 2024

> How many threads should we launch the loader with? I mean like if we always run benchmarks with --threads 32, what should happen if the user requests 64 threads for a benchmark?

It should launch with whatever the user requested when they wrote the add_libc_benchmark call, and default to no arguments (1 thread, 1 block for the loader), the same way the integration tests work.

@jameshu15869 jameshu15869 changed the title [libc] Run Benchmarks on 32 Threads by Default [libc] Add Multithreaded GPU Benchmarks Jul 16, 2024

@jhuber6 jhuber6 left a comment

It's probably easier to just forward the loader args through the unparsed arguments, but either way works.

@jameshu15869

Do you mean you would prefer something that looks more like

add_benchmark(
  isalpha_benchmark
  SUITE
    libc-gpu-ctype-benchmarks
  SRCS
    isalpha_benchmark.cpp
  DEPENDS
    libc.src.ctype.isalpha
  LOADER_ARGS
    --threads 32
)

I was thinking that an explicit NUM_THREADS argument would act as a shorthand and might be a little clearer than LOADER_ARGS, but I think I see what you mean.

@jhuber6 jhuber6 merged commit 8badfcc into llvm:main Jul 18, 2024
6 checks passed
Harini0924 pushed a commit to Harini0924/llvm-project that referenced this pull request Jul 22, 2024
sgundapa pushed a commit to sgundapa/upstream_effort that referenced this pull request Jul 23, 2024
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
Differential Revision: https://phabricator.intern.facebook.com/D60250873