Support simpleomp on MSVC #5691
Jiang-Weibo started this conversation in Show and tell
The content of this article pertains to NCNN version 1.2.0.
The goal of this project is to implement a simple OpenMP runtime for the MSVC compiler in the simpleomp.cpp file of NCNN. NCNN has already implemented a simple OpenMP runtime for the Clang and GCC compilers, and nihui, the author of NCNN and also the mentor of this project, has already written most of the content (see Platform-independent Components) and provided implementation ideas, so the implementation process was relatively smooth.

Brief Introduction to OpenMP Implementation
OpenMP is a standard that defines a series of multi-threading operations based on shared memory, and the standard specifies a set of directives. These directives can produce various effects, such as ensuring that a specific block of code is executed only once in a multi-threaded environment, enabling threads executing a code block to perform reductions, or distributing iterations of a for loop across different threads for parallel execution, and so on. Among these, NCNN only needs to support one directive, which also happens to be the most commonly used one. That directive is #pragma omp parallel for num_threads(X), where X is an integer constant representing the number of threads. The purpose of this directive is to parallelize the for loop that follows, assigning its iterations to different threads for execution.

For example, suppose we create 4 threads; by default, each thread is assigned the same number of iterations. If there are 40 iterations, each thread gets 40/4 = 10 iterations. However, OpenMP also specifies other ways to allocate iteration tasks, known as thread scheduling. Common thread scheduling methods include static scheduling, where tasks are evenly distributed across threads, dynamic scheduling (using a thread pool and a task pool), and guided scheduling (which, simply put, starts by assigning a lot of tasks and then gradually reduces the number of tasks), among others.
In NCNN, we only implement static scheduling and dynamic scheduling. More specifically, we implement dynamic scheduling for GNU, and static scheduling for Clang and MSVC. Note: Although we use a thread pool and task pool in Clang and MSVC, since the number of tasks each thread receives is determined from the beginning, it is still considered static scheduling according to the OpenMP standard.
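In code, the directive we need to support looks like this, using the 4-thread, 40-iteration example from above (the loop body is just a placeholder):

```cpp
#include <cstdio>

int main()
{
    float data[40] = {0.f};

    // 4 threads, 40 iterations: by default each thread handles 10 consecutive iterations
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 40; i++)
    {
        data[i] = i * 2.f; // placeholder work
    }

    printf("%f\n", data[39]);
    return 0;
}
```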
Why do we want to do this kind of job? Generally speaking, common compilers for C/C++/Fortran (naturally including the "big three" GNU/Clang/MSVC) have built-in support for most of the OpenMP standard, so we normally don't need to implement OpenMP ourselves. However, in some cases, NCNN might not be able to use the compiler's internal OpenMP support, or it might not want to link the OpenMP library (using OpenMP requires linking certain runtime libraries). Therefore, we need to implement a small portion of OpenMP functionality ourselves, specifically #pragma omp parallel for num_threads(X).

How does this work? When the compiler encounters directives like #pragma omp parallel for num_threads(X), it translates these directives into individual function calls. These function calls are then handled by the OpenMP runtime library to implement multi-threading operations that conform to the OpenMP standard. In fact, this "translation of directives into individual functions" happens during the syntax analysis phase of compilation, which we don't need to worry about. Our focus is on the second step, the "runtime function calls" emitted by the compiler. This means that instead of handling the parsing of OpenMP directives, we only need to implement the functions that carry out the threading behavior, ensuring they align with what OpenMP requires. By doing so, we can manage the thread parallelization ourselves without relying on the compiler's OpenMP runtime libraries.
First, the signatures, calling order, and timing of these functions are all determined by the compiler. For example, to implement a certain directive, the compiler might first call a function with the signature A(int, int), then call another function with the signature B(double, double). Based on the result of B(), the compiler creates X threads, and these threads then execute a function with the signature C(type1 P, type2 Q) in parallel. Finally, the main thread calls a function with the signature D(int, int) to finish the process. Second, since these functions need to be cross-platform, they must conform to different platform-specific ABIs (in C/C++, this involves using certain macro definitions, and the compiler will handle these macros once they are added).
So, our task includes two parts: implementing the functions that the compiler calls, with the signatures, calling order, and behavior it expects; and making these implementations conform to each platform's ABI. By focusing on these tasks, we can essentially recreate the functionality of #pragma omp parallel for num_threads(X) without relying on the compiler's internal OpenMP library, while still ensuring cross-platform compatibility.

Concretely, our task is to support NCNN built with the MSVC compiler with this simple OpenMP runtime. Specifically, after enabling OpenMP compilation and using cmake -DNCNN_SIMPLEOMP=ON, there should be no need to link vcompXXX.lib, and deployment should not require vcompXXX.dll, while still achieving effective multithreading acceleration. vcompXXX is the library MSVC uses to implement OpenMP, so based on the previous example, our job is to replace the logic of the library's functions with our own. (For those familiar with game development in Unity, this is similar to writing custom logic for the update() and awake() functions.) Once again, NCNN only needs to implement #pragma omp parallel for num_threads(X).

Platform-independent Components
The various classes mentioned here mainly implement a thread pool and a task queue. If you are already familiar with these concepts, feel free to skip to the next section.
KMPTask Class
Purpose
Represents a single task that can be executed by a thread.
Attributes
fn: A function pointer or callable object representing the task to be executed. The implementation varies based on the compiler (clang vs. others like GCC).
data/argv: Arguments or data needed by the task function.
num_threads: The total number of threads involved in executing this task.
thread_num: The specific thread number executing this task.
num_threads_to_wait: Pointer to an integer that tracks how many threads need to finish before proceeding.
finish_lock and finish_condition: Synchronization primitives used to coordinate task completion.
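To make the description concrete, here is a rough sketch of what such a task record could look like (the field names follow the list above; the actual definition in simpleomp.cpp may differ in types and details):

```cpp
#include <pthread.h>

// Hedged sketch of a task record; not the literal definition in simpleomp.cpp.
struct KMPTask
{
    void (*fn)(void*);                // function to execute (the outlined loop body)
    void* data;                       // argument block passed to fn
    int num_threads;                  // total number of threads working on this parallel region
    int thread_num;                   // which thread this task is meant for
    int* num_threads_to_wait;         // shared counter of tasks still running
    pthread_mutex_t* finish_lock;     // protects the counter
    pthread_cond_t* finish_condition; // signaled when the counter reaches zero
};
```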
KMPTaskQueue Class
Purpose
Manages a queue of KMPTask objects, providing a mechanism for threads to dispatch and retrieve tasks. It uses a circular buffer (ring buffer) to manage the tasks.
Methods
dispatch: Adds multiple tasks to the queue, ensuring that the queue does not exceed its maximum size. It locks the queue during this operation.
put: Adds a single task to the queue, waiting if the queue is full. It also handles signaling other threads when tasks are available.
get: Retrieves a task from the queue for a thread to execute, waiting if the queue is empty.
Attributes
lock: A mutex used to ensure exclusive access to the queue.
condition: A condition variable used for signaling between threads.
max_size, tasks, size, front, back: Variables that manage the circular buffer of tasks.
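As a hedged sketch (simplified from the description above, not the exact code), put and get over a ring buffer guarded by a mutex and a condition variable could look like this:

```cpp
#include <pthread.h>

struct KMPTask; // task record as sketched above

// Simplified sketch of a blocking ring-buffer task queue; the real KMPTaskQueue differs in details.
struct KMPTaskQueueSketch
{
    pthread_mutex_t lock;
    pthread_cond_t condition;
    KMPTask** tasks; // circular buffer of task pointers
    int max_size;
    int size;
    int front;
    int back;

    void put(KMPTask* task)
    {
        pthread_mutex_lock(&lock);
        while (size >= max_size)                 // wait while the queue is full
            pthread_cond_wait(&condition, &lock);
        tasks[back] = task;
        back = (back + 1) % max_size;
        size++;
        pthread_cond_signal(&condition);         // wake a waiting consumer
        pthread_mutex_unlock(&lock);
    }

    KMPTask* get()
    {
        pthread_mutex_lock(&lock);
        while (size == 0)                        // wait while the queue is empty
            pthread_cond_wait(&condition, &lock);
        KMPTask* task = tasks[front];
        front = (front + 1) % max_size;
        size--;
        pthread_cond_signal(&condition);         // wake a waiting producer
        pthread_mutex_unlock(&lock);
        return task;
    }
};
```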
KMPGlobal Class
Purpose
Manages the global state and initialization of the thread pool and task queue. It is responsible for setting up and tearing down the threading infrastructure.
Methods
try_init(): Ensures that the global thread pool and task queue are initialized only once (using pthread_once).
init(): Initializes the thread pool based on the number of CPU cores available, and creates a KMPTaskQueue with a size proportional to the number of threads.
deinit(): Cleans up resources, ensures all tasks are completed, and joins any running threads before deleting them.
Attributes
kmp_max_threads: The maximum number of threads to be used.
kmp_threads: Array of thread objects.
kmp_threads_tid: Array of thread IDs.
kmp_task_queue: Pointer to the global task queue.
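Here is a hedged sketch of the one-time initialization (via pthread_once) and of the loop each worker thread runs, building on the KMPTask and queue sketches above; the real init() and worker logic in simpleomp.cpp has more bookkeeping:

```cpp
#include <pthread.h>
#include <thread> // only for hardware_concurrency() in this sketch

static pthread_once_t g_once = PTHREAD_ONCE_INIT;
static int g_max_threads = 1;
static KMPTaskQueueSketch* g_task_queue = 0;

static void* worker_thread_entry(void*)
{
    for (;;)
    {
        KMPTask* task = g_task_queue->get();     // block until a task is available
        task->fn(task->data);                    // run the outlined loop body

        pthread_mutex_lock(task->finish_lock);   // report completion to the dispatcher
        *task->num_threads_to_wait -= 1;
        if (*task->num_threads_to_wait == 0)
            pthread_cond_signal(task->finish_condition);
        pthread_mutex_unlock(task->finish_lock);
    }
    return 0;
}

static void global_init()
{
    g_max_threads = (int)std::thread::hardware_concurrency();
    if (g_max_threads < 1)
        g_max_threads = 1;

    g_task_queue = new KMPTaskQueueSketch();
    g_task_queue->max_size = g_max_threads * 4;  // capacity proportional to the thread count (factor is illustrative)
    g_task_queue->tasks = new KMPTask*[g_task_queue->max_size];
    g_task_queue->size = g_task_queue->front = g_task_queue->back = 0;
    pthread_mutex_init(&g_task_queue->lock, 0);
    pthread_cond_init(&g_task_queue->condition, 0);

    // spawn the worker threads; the calling thread acts as thread 0
    for (int i = 0; i < g_max_threads - 1; i++)
    {
        pthread_t tid;
        pthread_create(&tid, 0, worker_thread_entry, 0);
    }
}

static void try_init()
{
    pthread_once(&g_once, global_init); // guarantees init runs exactly once
}
```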
Key Concepts
Thread Management
The code manually manages threads and tasks, implementing a thread pool-like system where tasks can be distributed among multiple worker threads.
Synchronization
The use of mutexes and condition variables ensures that access to shared resources (like the task queue) is thread-safe, and threads can be coordinated effectively.
Platform-Specific Code
The code includes preprocessor directives to handle differences between compilers (clang vs. others), indicating it’s designed to be portable across different environments.
How It Works
Initialization: The KMPGlobal class initializes a set of worker threads and a task queue. The number of threads is based on the available CPU cores.
Task Dispatching: Tasks are created and dispatched to the task queue, where worker threads can retrieve and execute them.
Task Execution: Each worker thread runs in a loop, fetching tasks from the queue and executing them until there are no more tasks left or the program is shutting down.
Cleanup: The deinit method ensures that all threads are joined and resources are freed when the application is done with the threading system.
Other Features
On Windows, thread-local storage (TLS) is handled through APIs such as TlsAlloc(), TlsSetValue(), and TlsGetValue(), among others. Note: TLS stores information in key-value pairs. Different threads can store information using the same key, and the operating system allocates separate space depending on which thread is storing the data. Therefore, when different threads use the same key to read information, they won't read each other's data.
The Specific Implementation and ABI of OpenMP in Different Compilers

The recommended reading order is GNU -> Clang -> MSVC. For each implementation, we first focus on the function call chain, and then go into detail about what each function needs to do. Notably, the MSVC implementation is quite similar to Clang's.
GNU
There are many references for GNU, and the following articles are well-written.
OpenMP Parallel Construct 实现原理与源码分析 (Implementation Principles and Source Code Analysis of the OpenMP Parallel Construct) - 一无是处的研究僧 - 博客园 (cnblogs.com)
compilation - How OpenMP macros work behind the scenes in collaboration with the preprocessor/compiler and the library itself? - Stack Overflow
call chain
1. When the compiler encounters parallel for, a parallel region is constructed (by calling GOMP_parallel()).
2. The worker threads of the team are created and started (by calling GOMP_parallel_start()).
3. After the loop body has been executed by all threads, the parallel region is closed (by calling GOMP_parallel_end()).

In summary, the GNU workflow consists of three sequential function calls: GOMP_parallel() -> GOMP_parallel_start() -> GOMP_parallel_end(). However, in reality, we found that only GOMP_parallel() is called.

Important Functions
GOMP_parallel_start()

- Initialize the global KMPGlobal variable; this creates the worker threads and immediately starts them running (the task for each thread is obtained from the KMPTaskQueue).
- If only one thread is needed, execute the work directly (without going through the KMPTaskQueue), then exit.
- Otherwise, dispatch num_threads - 1 tasks to the KMPTaskQueue. Each task is identical in terms of data and the function to execute, except for the tls_thread_num.

GOMP_parallel()

- Initialize the global KMPGlobal variable; this creates the worker threads and immediately starts them running (each thread's task is obtained from the KMPTaskQueue).
- If only one thread is needed, execute the work directly (without going through the KMPTaskQueue), then exit.
- Otherwise, dispatch num_threads - 1 tasks to the KMPTaskQueue. Each task is identical in terms of data and the function to execute, except for the thread_num.
- The main thread executes its own share of the work; if the other threads have not finished yet (tracked by the num_threads_to_wait variable), the thread will sleep until they are finished.

GOMP_parallel_end()
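For reference, these libgomp entry points are declared roughly as follows (based on libgomp's conventions; worth double-checking against the GCC sources for the exact types):

```cpp
// Approximate declarations of the GOMP entry points implemented by simpleomp
// for the GNU ABI (argument names are descriptive, not authoritative).
extern "C" void GOMP_parallel_start(void (*fn)(void*), void* data, unsigned num_threads);
extern "C" void GOMP_parallel_end(void);
extern "C" void GOMP_parallel(void (*fn)(void*), void* data, unsigned num_threads, unsigned flags);
```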
Clang
call chain
1. When the compiler encounters the parallel for loop, it first determines how many threads need to be generated for the current loop based on the value of X in num_threads(X) (by calling __kmpc_push_num_threads()).
2. The iterations of the loop are split among the threads (by calling __kmpc_for_static_init_XX(), where XX corresponds to various types). For example, if there are 100 iterations and 10 threads, the index range [0, 9] is assigned to thread 0, [10, 19] to thread 1, and so on.
3. The threads are forked and execute the loop body in parallel (by calling __kmpc_fork_call()).
4. The static loop is finished and cleaned up (by calling __kmpc_for_static_fini()).

Important Functions
void __kmpc_push_num_threads(void* /*loc*/, int32_t /*gtid*/, int32_t num_threads)
When the compiler encounters num_threads(X), this function is triggered; num_threads indicates the number of threads to be used within the loop.

void __kmpc_for_static_init_4(void* /*loc*/, int32_t gtid, int32_t /*sched*/, int32_t* last, int32_t* lower, int32_t* upper, int32_t* /*stride*/, int32_t /*incr*/, int32_t /*chunk*/)
gtid indicates the index of a thread, last indicates whether the current thread is the last thread in the team, and lower and upper indicate the interval of iterations assigned to this thread, i.e., the iterations in [lower, upper].
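As an illustration of the static split, here is a sketch of the arithmetic (not necessarily the exact formula used in simpleomp.cpp):

```cpp
// Sketch: divide the iteration range [*lower, *upper] evenly among num_threads
// and keep only the slice belonging to thread gtid.
void static_init_sketch(int gtid, int num_threads, int* last, int* lower, int* upper)
{
    int count = *upper - *lower + 1;                      // total number of iterations
    int chunk = (count + num_threads - 1) / num_threads;  // iterations per thread, rounded up

    int new_lower = *lower + gtid * chunk;
    int new_upper = new_lower + chunk - 1;
    if (new_upper > *upper)
        new_upper = *upper;                               // the last thread may get fewer iterations

    *last = (new_upper == *upper);                        // does this thread own the tail of the range?
    *lower = new_lower;
    *upper = new_upper;
}
```

With 100 iterations and 10 threads, this gives thread 0 the range [0, 9], thread 1 the range [10, 19], and so on, matching the example in the call chain above.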
void __kmpc_fork_call(void* /*loc*/, int32_t argc, kmpc_micro fn, ...)

- ... is a variable argument list, and we have to use va_list-related macros to build up the arguments.
- If only one thread is needed, call fn directly (without going through the KMPTaskQueue), then exit.
- Otherwise, dispatch num_threads - 1 tasks to the KMPTaskQueue. Each task is identical in terms of data and the function to execute, except for the thread_num.
- The main thread executes its own share of the work; if the other threads have not finished yet (tracked by the num_threads_to_wait variable), the thread will sleep until they are finished.

static int kmp_invoke_microtask(kmpc_micro fn, int gtid, int tid, int argc, void** argv)
This helper is called by __kmpc_fork_call, and it uses argc to determine which instance (i.e., how many arguments to pass to fn) should be called.

void __kmpc_for_static_fini(void* /*loc*/, int32_t gtid)
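Putting the Clang pieces together, here is a conceptual sketch of the code the compiler effectively generates for the 40-iteration example from earlier. The declarations, the kmpc_micro typedef, and the argument values are illustrative assumptions; the real codegen uses an ident_t location structure and other details omitted here:

```cpp
#include <stdint.h>

// Assumed shape of the outlined "microtask" pointer used by the Clang ABI.
typedef void (*kmpc_micro)(int32_t* gtid, int32_t* btid, ...);

extern "C" void __kmpc_push_num_threads(void* loc, int32_t gtid, int32_t num_threads);
extern "C" void __kmpc_fork_call(void* loc, int32_t argc, kmpc_micro fn, ...);
extern "C" void __kmpc_for_static_init_4(void* loc, int32_t gtid, int32_t sched,
                                         int32_t* last, int32_t* lower, int32_t* upper,
                                         int32_t* stride, int32_t incr, int32_t chunk);
extern "C" void __kmpc_for_static_fini(void* loc, int32_t gtid);

// The loop body is outlined into a function that every thread runs.
static void outlined_loop(int32_t* gtid, int32_t* /*btid*/, float* data)
{
    int32_t last = 0, lower = 0, upper = 39, stride = 1;
    __kmpc_for_static_init_4(0, *gtid, /*sched, ignored here*/ 0, &last, &lower, &upper, &stride, 1, 1);
    for (int i = lower; i <= upper; i++)
        data[i] = i * 2.f;                       // each thread only runs its own slice
    __kmpc_for_static_fini(0, *gtid);
}

void run_parallel_loop(float* data)
{
    __kmpc_push_num_threads(0, /*gtid*/ 0, /*num_threads*/ 4);
    __kmpc_fork_call(0, /*argc*/ 1, (kmpc_micro)outlined_loop, data);
}
```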
MSVC
call chain
1. When the compiler encounters the parallel for loop, it first determines how many threads need to be generated for the current loop based on the value of X in num_threads(X) (by calling _vcomp_set_num_threads()).
2. The iterations of the loop are split among the threads (by calling _vcomp_for_static_simple_init()).
3. The threads are forked and execute the loop body in parallel (by calling _vcomp_fork()).
4. The static loop is finished and cleaned up (by calling _vcomp_for_static_end()).

Important Functions
void CDECL _vcomp_set_num_threads(int num_threads)
When the compiler encounters num_threads(X), this function is triggered; num_threads indicates the number of threads to be used within the loop.

void CDECL _vcomp_for_static_simple_init(unsigned int first, unsigned int last, int step, BOOL increment, unsigned int* begin, unsigned int* end)
- This function is similar to __kmpc_for_static_init_4 but more advanced.
- Each thread obtains its thread_num. Each thread calculates its begin and end values based on its assigned portion of iterations.
- first and last are the previous begin and end, i.e., the bounds of the whole loop before splitting.
- step and increment describe the loop's stride and direction, covering loops such as for (int i = 0; i < 1000; i += step) and for (int i = 1000; i >= 0; i -= step).
void _vcomp_fork(BOOL ifval, int nargs, void* wrapper, ...)
- ... is a variable argument list, and we have to use va_list-related macros to build up the arguments.
- If only one thread is needed, call the wrapper directly (without going through the KMPTaskQueue), then exit.
- Otherwise, dispatch num_threads - 1 tasks to the KMPTaskQueue. Each task is identical in terms of data and the function to execute, except for the thread_num.
- The main thread executes its own share of the work; if the other threads have not finished yet (tracked by the num_threads_to_wait variable), the thread will sleep until they are finished.

void CDECL _vcomp_fork_call_wrapper(void* wrapper, int nargs, void** args)
This helper is called by _vcomp_fork, and it uses nargs to determine which instance (i.e., how many arguments to pass to wrapper) should be called.

void CDECL _vcomp_for_static_end(void)
Challenges
The main challenge is translating _vcomp_fork_call_wrapper() from assembly to C. A very simple approach (as seen in NCNN) is to limit the number of parameters passed to 32. Then, create 32 functions, each with a corresponding number of parameters, and use a switch statement to determine how many parameters need to be passed, calling the function with the appropriate number of parameters. We will also take this approach. Another, seemingly more "elegant", method is to write the assembly code as vcomp does. A final "possible" approach is to calculate the parameter offsets: since we are passing pointers, it only involves shifting the byte offset accordingly, which can be implemented in C.
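To illustrate the switch-based approach, here is a trimmed-down sketch (only 4 argument counts for brevity; the real implementation enumerates up to 32):

```cpp
// Sketch of the switch-based dispatch: cast the wrapper to a function pointer
// with the right number of void* parameters and call it. Trimmed to 4 cases.
typedef void (*fn0_t)();
typedef void (*fn1_t)(void*);
typedef void (*fn2_t)(void*, void*);
typedef void (*fn3_t)(void*, void*, void*);
typedef void (*fn4_t)(void*, void*, void*, void*);

static void fork_call_wrapper_sketch(void* wrapper, int nargs, void** args)
{
    switch (nargs)
    {
    case 0: ((fn0_t)wrapper)(); break;
    case 1: ((fn1_t)wrapper)(args[0]); break;
    case 2: ((fn2_t)wrapper)(args[0], args[1]); break;
    case 3: ((fn3_t)wrapper)(args[0], args[1], args[2]); break;
    case 4: ((fn4_t)wrapper)(args[0], args[1], args[2], args[3]); break;
    default: break; // argument counts beyond this sketch's limit are not handled here
    }
}
```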
Summary

Simpleomp defines three custom classes: KMPTask, which wraps the function to be executed; KMPTaskQueue, which schedules tasks; and KMPGlobal, which initializes and records the information for the entire multithreading schedule (such as the number of threads, thread IDs, and other details). These three custom classes will be used in the runtime functions defined by the compilers (Clang, GNU, and MSVC) for OpenMP. Since the OpenMP runtime functions differ across these compilers, we need to understand each compiler's ABI and clarify the usage and workflow of these functions.

When the compiler encounters directives like #pragma omp parallel for, it generates and calls the corresponding runtime functions. Therefore, the compiler-defined runtime functions are the starting point (bootstrap), and our task is to implement these runtime functions with our own logic, while ensuring that the function signatures, usage, and ABI remain consistent with what the compiler expects.

How to take advantage of our work
Basically, the only thing you need to do is add the -DNCNN_SIMPLEOMP=ON option when you build NCNN. However, after several tests (again, up to NCNN version 1.2.0), we found that this feature does not work well with a shared library in a Visual Studio setup, so we suggest not enabling the -DNCNN_SHARED_LIB option; hopefully we will solve this issue in the future.