Support simpleomp on MSVC #5691
Jiang-Weibo started this conversation in Show and tell
The content of this article pertains to NCNN version 1.2.0.
The goal of this project is to implement a simple OpenMP runtime for the MSVC compiler in the simpleomp.cpp file of NCNN. NCNN has already implemented a simple OpenMP runtime for the Clang and GCC compilers, and nihui, the author of NCNN and also the mentor of this project, has already written most of the content (see Platform-independent Components) and provided implementation ideas, so the implementation process was relatively smooth.

Brief Introduction to OpenMP Implementation
OpenMP is a standard that defines a series of multi-threading operations based on shared memory, and the standard specifies a set of directives. These directives can produce various effects, such as ensuring that a specific block of code is executed only once in a multi-threaded environment, enabling threads executing a code block to perform reductions, or distributing iterations of a for loop across different threads for parallel execution, and so on. Among these, NCNN only needs to support one directive, which also happens to be the most commonly used one. That directive is #pragma omp parallel for num_threads(X), where X is an integer constant representing the number of threads. The purpose of this directive is to parallelize the for loop that follows, assigning its iterations to different threads for execution.

For example, suppose we create 4 threads; by default, each thread is assigned the same number of iterations. If there are 40 iterations, each thread gets 40/4 = 10 iterations. However, OpenMP also specifies other ways to allocate iteration tasks, known as thread scheduling. Common thread scheduling methods include static scheduling, where tasks are evenly distributed across threads, dynamic scheduling (using a thread pool and a task pool), and guided scheduling (which, simply put, starts by assigning a lot of tasks and then gradually reduces the number of tasks), among others.
In NCNN, we only implement static scheduling and dynamic scheduling. More specifically, we implement dynamic scheduling for GNU, and static scheduling for Clang and MSVC. Note: Although we use a thread pool and task pool in Clang and MSVC, since the number of tasks each thread receives is determined from the beginning, it is still considered static scheduling according to the OpenMP standard.
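In code, the directive we need to support looks like this, using the 4-thread, 40-iteration example from above (the loop body is just a placeholder):

```cpp
#include <cstdio>

int main()
{
    float data[40] = {0.f};

    // 4 threads, 40 iterations: by default each thread handles 10 consecutive iterations
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 40; i++)
    {
        data[i] = i * 2.f; // placeholder work
    }

    printf("%f\n", data[39]);
    return 0;
}
```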
Why do we want to do this kind of job? Generally speaking, common compilers for C/C++/Fortran (naturally including the "big three" GNU/Clang/MSVC) have built-in support for most of the OpenMP standard, so we normally don't need to implement OpenMP ourselves. However, in some cases, NCNN might not be able to use the compiler's internal OpenMP support, or it might not want to link the OpenMP library (using OpenMP requires linking certain runtime libraries). Therefore, we need to implement a small portion of OpenMP functionality ourselves, specifically #pragma omp parallel for num_threads(X).

How does this work? When the compiler encounters directives like #pragma omp parallel for num_threads(X), it translates these directives into individual function calls. These function calls are then handled by the OpenMP runtime library to implement multi-threading operations that conform to the OpenMP standard. In fact, this "translation of directives into individual functions" happens during the syntax analysis phase of compilation, which we don't need to worry about. Our focus is on the second step, the "runtime function calls" emitted by the compiler. This means that instead of handling the parsing of OpenMP directives, we only need to implement the functions that carry out the threading behavior, ensuring they align with what OpenMP requires. By doing so, we can manage the thread parallelization ourselves without relying on the compiler's OpenMP runtime libraries.
First, the signatures, calling order, and timing of these functions are all determined by the compiler. For example, to implement a certain directive, the compiler might first call a function with the signature A(int, int), then call another function with the signature B(double, double). Based on the result of B(), the compiler creates X threads, and these threads then execute a function with the signature C(type1 P, type2 Q) in parallel. Finally, the main thread calls a function with the signature D(int, int) to finish the process. Second, since these functions need to be cross-platform, they must conform to different platform-specific ABIs (in C/C++, this involves using certain macro definitions, and the compiler will handle these macros once they are added).
So, our task includes two parts: implementing the functions that the compiler calls, with the signatures, calling order, and behavior it expects; and making these implementations conform to each platform's ABI. By focusing on these tasks, we can essentially recreate the functionality of #pragma omp parallel for num_threads(X) without relying on the compiler's internal OpenMP library, while still ensuring cross-platform compatibility.

Concretely, our task is to support NCNN built with the MSVC compiler with this simple OpenMP runtime. Specifically, after enabling OpenMP compilation and using cmake -DNCNN_SIMPLEOMP=ON, there should be no need to link vcompXXX.lib, and deployment should not require vcompXXX.dll, while still achieving effective multithreading acceleration. vcompXXX is the library MSVC uses to implement OpenMP, so based on the previous example, our job is to replace the logic of the library's functions with our own. (For those familiar with game development in Unity, this is similar to writing custom logic for the update() and awake() functions.) Once again, NCNN only needs to implement #pragma omp parallel for num_threads(X).

Platform-independent Components
The various classes mentioned here mainly implement a thread pool and a task queue. If you are already familiar with these concepts, feel free to skip to the next section.
KMPTask Class
Purpose
Represents a single task that can be executed by a thread.
Attributes
fn: A function pointer or callable object representing the task to be executed. The implementation varies based on the compiler (clang vs. others like GCC).
data/argv: Arguments or data needed by the task function.
num_threads: The total number of threads involved in executing this task.
thread_num: The specific thread number executing this task.
num_threads_to_wait: Pointer to an integer that tracks how many threads need to finish before proceeding.
finish_lock and finish_condition: Synchronization primitives used to coordinate task completion.
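To make the description concrete, here is a rough sketch of what such a task record could look like (the field names follow the list above; the actual definition in simpleomp.cpp may differ in types and details):

```cpp
#include <pthread.h>

// Hedged sketch of a task record; not the literal definition in simpleomp.cpp.
struct KMPTask
{
    void (*fn)(void*);                // function to execute (the outlined loop body)
    void* data;                       // argument block passed to fn
    int num_threads;                  // total number of threads working on this parallel region
    int thread_num;                   // which thread this task is meant for
    int* num_threads_to_wait;         // shared counter of tasks still running
    pthread_mutex_t* finish_lock;     // protects the counter
    pthread_cond_t* finish_condition; // signaled when the counter reaches zero
};
```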
KMPTaskQueue Class
Purpose
Manages a queue of KMPTask objects, providing a mechanism for threads to dispatch and retrieve tasks. It uses a circular buffer (ring buffer) to manage the tasks.
Methods
dispatch: Adds multiple tasks to the queue, ensuring that the queue does not exceed its maximum size. It locks the queue during this operation.
put: Adds a single task to the queue, waiting if the queue is full. It also handles signaling other threads when tasks are available.
get: Retrieves a task from the queue for a thread to execute, waiting if the queue is empty.
Attributes
lock: A mutex used to ensure exclusive access to the queue.
condition: A condition variable used for signaling between threads.
max_size, tasks, size, front, back: Variables that manage the circular buffer of tasks.
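As a hedged sketch (simplified from the description above, not the exact code), put and get over a ring buffer guarded by a mutex and a condition variable could look like this:

```cpp
#include <pthread.h>

struct KMPTask; // task record as sketched above

// Simplified sketch of a blocking ring-buffer task queue; the real KMPTaskQueue differs in details.
struct KMPTaskQueueSketch
{
    pthread_mutex_t lock;
    pthread_cond_t condition;
    KMPTask** tasks; // circular buffer of task pointers
    int max_size;
    int size;
    int front;
    int back;

    void put(KMPTask* task)
    {
        pthread_mutex_lock(&lock);
        while (size >= max_size)                 // wait while the queue is full
            pthread_cond_wait(&condition, &lock);
        tasks[back] = task;
        back = (back + 1) % max_size;
        size++;
        pthread_cond_signal(&condition);         // wake a waiting consumer
        pthread_mutex_unlock(&lock);
    }

    KMPTask* get()
    {
        pthread_mutex_lock(&lock);
        while (size == 0)                        // wait while the queue is empty
            pthread_cond_wait(&condition, &lock);
        KMPTask* task = tasks[front];
        front = (front + 1) % max_size;
        size--;
        pthread_cond_signal(&condition);         // wake a waiting producer
        pthread_mutex_unlock(&lock);
        return task;
    }
};
```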
KMPGlobal Class
Purpose
Manages the global state and initialization of the thread pool and task queue. It is responsible for setting up and tearing down the threading infrastructure.
Methods
try_init(): Ensures that the global thread pool and task queue are initialized only once (using pthread_once).
init(): Initializes the thread pool based on the number of CPU cores available, and creates a KMPTaskQueue with a size proportional to the number of threads.
deinit(): Cleans up resources, ensures all tasks are completed, and joins any running threads before deleting them.
Attributes
kmp_max_threads: The maximum number of threads to be used.
kmp_threads: Array of thread objects.
kmp_threads_tid: Array of thread IDs.
kmp_task_queue: Pointer to the global task queue.
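Here is a hedged sketch of the one-time initialization (via pthread_once) and of the loop each worker thread runs, building on the KMPTask and queue sketches above; the real init() and worker logic in simpleomp.cpp has more bookkeeping:

```cpp
#include <pthread.h>
#include <thread> // only for hardware_concurrency() in this sketch

static pthread_once_t g_once = PTHREAD_ONCE_INIT;
static int g_max_threads = 1;
static KMPTaskQueueSketch* g_task_queue = 0;

static void* worker_thread_entry(void*)
{
    for (;;)
    {
        KMPTask* task = g_task_queue->get();     // block until a task is available
        task->fn(task->data);                    // run the outlined loop body

        pthread_mutex_lock(task->finish_lock);   // report completion to the dispatcher
        *task->num_threads_to_wait -= 1;
        if (*task->num_threads_to_wait == 0)
            pthread_cond_signal(task->finish_condition);
        pthread_mutex_unlock(task->finish_lock);
    }
    return 0;
}

static void global_init()
{
    g_max_threads = (int)std::thread::hardware_concurrency();
    if (g_max_threads < 1)
        g_max_threads = 1;

    g_task_queue = new KMPTaskQueueSketch();
    g_task_queue->max_size = g_max_threads * 4;  // capacity proportional to the thread count (factor is illustrative)
    g_task_queue->tasks = new KMPTask*[g_task_queue->max_size];
    g_task_queue->size = g_task_queue->front = g_task_queue->back = 0;
    pthread_mutex_init(&g_task_queue->lock, 0);
    pthread_cond_init(&g_task_queue->condition, 0);

    // spawn the worker threads; the calling thread acts as thread 0
    for (int i = 0; i < g_max_threads - 1; i++)
    {
        pthread_t tid;
        pthread_create(&tid, 0, worker_thread_entry, 0);
    }
}

static void try_init()
{
    pthread_once(&g_once, global_init); // guarantees init runs exactly once
}
```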
Key Concepts
Thread Management
The code manually manages threads and tasks, implementing a thread pool-like system where tasks can be distributed among multiple worker threads.
Synchronization
The use of mutexes and condition variables ensures that access to shared resources (like the task queue) is thread-safe, and threads can be coordinated effectively.
Platform-Specific Code
The code includes preprocessor directives to handle differences between compilers (clang vs. others), indicating it’s designed to be portable across different environments.
How It Works
Initialization: The KMPGlobal class initializes a set of worker threads and a task queue. The number of threads is based on the available CPU cores.
Task Dispatching: Tasks are created and dispatched to the task queue, where worker threads can retrieve and execute them.
Task Execution: Each worker thread runs in a loop, fetching tasks from the queue and executing them until there are no more tasks left or the program is shutting down.
Cleanup: The deinit method ensures that all threads are joined and resources are freed when the application is done with the threading system.
Other Features
On Windows, thread-local storage (TLS) is handled through APIs such as TlsAlloc(), TlsSetValue(), and TlsGetValue(), among others. Note: TLS stores information in key-value pairs. Different threads can store information using the same key, and the operating system allocates separate space depending on which thread is storing the data. Therefore, when different threads use the same key to read information, they won't read each other's data.
The Specific Implementation and ABI of OpenMP in Different Compilers

The recommended reading order is GNU -> Clang -> MSVC. For each implementation, we first focus on the function call chain, and then go into detail about what each function needs to do. Notably, the MSVC implementation is quite similar to Clang's.
GNU
There are many references for GNU, and the following articles are well-written.
OpenMP Parallel Construct 实现原理与源码分析 (Implementation Principles and Source Code Analysis of the OpenMP Parallel Construct) - 一无是处的研究僧 - 博客园 (cnblogs.com)
compilation - How OpenMP macros work behind the scenes in collaboration with the preprocessor/compiler and the library itself? - Stack Overflow
call chain
1. When the compiler encounters parallel for, a parallel region is constructed (by calling GOMP_parallel()).
2. The worker threads of the team are created and started (by calling GOMP_parallel_start()).
3. After the loop body has been executed by all threads, the parallel region is closed (by calling GOMP_parallel_end()).

In summary, the GNU workflow consists of three sequential function calls: GOMP_parallel() -> GOMP_parallel_start() -> GOMP_parallel_end(). However, in reality, we found that only GOMP_parallel() is called.

Important Functions
GOMP_parallel_start()

- Initialize the global KMPGlobal variable; this creates the worker threads and immediately starts them running (the task for each thread is obtained from the KMPTaskQueue).
- If only one thread is needed, execute the work directly (without going through the KMPTaskQueue), then exit.
- Otherwise, dispatch num_threads - 1 tasks to the KMPTaskQueue. Each task is identical in terms of data and the function to execute, except for the tls_thread_num.

GOMP_parallel()

- Initialize the global KMPGlobal variable; this creates the worker threads and immediately starts them running (each thread's task is obtained from the KMPTaskQueue).
- If only one thread is needed, execute the work directly (without going through the KMPTaskQueue), then exit.
- Otherwise, dispatch num_threads - 1 tasks to the KMPTaskQueue. Each task is identical in terms of data and the function to execute, except for the thread_num.
- The main thread executes its own share of the work; if the other threads have not finished yet (tracked by the num_threads_to_wait variable), the thread will sleep until they are finished.

GOMP_parallel_end()
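For reference, these libgomp entry points are declared roughly as follows (based on libgomp's conventions; worth double-checking against the GCC sources for the exact types):

```cpp
// Approximate declarations of the GOMP entry points implemented by simpleomp
// for the GNU ABI (argument names are descriptive, not authoritative).
extern "C" void GOMP_parallel_start(void (*fn)(void*), void* data, unsigned num_threads);
extern "C" void GOMP_parallel_end(void);
extern "C" void GOMP_parallel(void (*fn)(void*), void* data, unsigned num_threads, unsigned flags);
```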
Clang
call chain
1. When the compiler encounters the parallel for loop, it first determines how many threads need to be generated for the current loop based on the value of X in num_threads(X) (by calling __kmpc_push_num_threads()).
2. The iterations of the loop are split among the threads (by calling __kmpc_for_static_init_XX(), where XX corresponds to various types). For example, if there are 100 iterations and 10 threads, the index range [0, 9] is assigned to thread 0, [10, 19] to thread 1, and so on.
3. The threads are forked and execute the loop body in parallel (by calling __kmpc_fork_call()).
4. The static loop is finished and cleaned up (by calling __kmpc_for_static_fini()).

Important Functions
void __kmpc_push_num_threads(void* /*loc*/, int32_t /*gtid*/, int32_t num_threads)
When the compiler encounters num_threads(X), this function is triggered; num_threads indicates the number of threads to be used within the loop.

void __kmpc_for_static_init_4(void* /*loc*/, int32_t gtid, int32_t /*sched*/, int32_t* last, int32_t* lower, int32_t* upper, int32_t* /*stride*/, int32_t /*incr*/, int32_t /*chunk*/)
gtid indicates the index of a thread, last indicates whether the current thread is the last thread in the team, and lower and upper indicate the interval of iterations assigned to this thread, i.e., the iterations in [lower, upper].
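As an illustration of the static split, here is a sketch of the arithmetic (not necessarily the exact formula used in simpleomp.cpp):

```cpp
// Sketch: divide the iteration range [*lower, *upper] evenly among num_threads
// and keep only the slice belonging to thread gtid.
void static_init_sketch(int gtid, int num_threads, int* last, int* lower, int* upper)
{
    int count = *upper - *lower + 1;                      // total number of iterations
    int chunk = (count + num_threads - 1) / num_threads;  // iterations per thread, rounded up

    int new_lower = *lower + gtid * chunk;
    int new_upper = new_lower + chunk - 1;
    if (new_upper > *upper)
        new_upper = *upper;                               // the last thread may get fewer iterations

    *last = (new_upper == *upper);                        // does this thread own the tail of the range?
    *lower = new_lower;
    *upper = new_upper;
}
```

With 100 iterations and 10 threads, this gives thread 0 the range [0, 9], thread 1 the range [10, 19], and so on, matching the example in the call chain above.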
void __kmpc_fork_call(void* /*loc*/, int32_t argc, kmpc_micro fn, ...)

- ... is a variable argument list, and we have to use va_list-related macros to build up the arguments.
- If only one thread is needed, call fn directly (without going through the KMPTaskQueue), then exit.
- Otherwise, dispatch num_threads - 1 tasks to the KMPTaskQueue. Each task is identical in terms of data and the function to execute, except for the thread_num.
- The main thread executes its own share of the work; if the other threads have not finished yet (tracked by the num_threads_to_wait variable), the thread will sleep until they are finished.

static int kmp_invoke_microtask(kmpc_micro fn, int gtid, int tid, int argc, void** argv)
This helper is called by __kmpc_fork_call, and it uses argc to determine which instance (i.e., how many arguments to pass to fn) should be called.

void __kmpc_for_static_fini(void* /*loc*/, int32_t gtid)
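Putting the Clang pieces together, here is a conceptual sketch of the code the compiler effectively generates for the 40-iteration example from earlier. The declarations, the kmpc_micro typedef, and the argument values are illustrative assumptions; the real codegen uses an ident_t location structure and other details omitted here:

```cpp
#include <stdint.h>

// Assumed shape of the outlined "microtask" pointer used by the Clang ABI.
typedef void (*kmpc_micro)(int32_t* gtid, int32_t* btid, ...);

extern "C" void __kmpc_push_num_threads(void* loc, int32_t gtid, int32_t num_threads);
extern "C" void __kmpc_fork_call(void* loc, int32_t argc, kmpc_micro fn, ...);
extern "C" void __kmpc_for_static_init_4(void* loc, int32_t gtid, int32_t sched,
                                         int32_t* last, int32_t* lower, int32_t* upper,
                                         int32_t* stride, int32_t incr, int32_t chunk);
extern "C" void __kmpc_for_static_fini(void* loc, int32_t gtid);

// The loop body is outlined into a function that every thread runs.
static void outlined_loop(int32_t* gtid, int32_t* /*btid*/, float* data)
{
    int32_t last = 0, lower = 0, upper = 39, stride = 1;
    __kmpc_for_static_init_4(0, *gtid, /*sched, ignored here*/ 0, &last, &lower, &upper, &stride, 1, 1);
    for (int i = lower; i <= upper; i++)
        data[i] = i * 2.f;                       // each thread only runs its own slice
    __kmpc_for_static_fini(0, *gtid);
}

void run_parallel_loop(float* data)
{
    __kmpc_push_num_threads(0, /*gtid*/ 0, /*num_threads*/ 4);
    __kmpc_fork_call(0, /*argc*/ 1, (kmpc_micro)outlined_loop, data);
}
```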
MSVC
call chain
1. When the compiler encounters the parallel for loop, it first determines how many threads need to be generated for the current loop based on the value of X in num_threads(X) (by calling _vcomp_set_num_threads()).
2. The iterations of the loop are split among the threads (by calling _vcomp_for_static_simple_init()).
3. The threads are forked and execute the loop body in parallel (by calling _vcomp_fork()).
4. The static loop is finished and cleaned up (by calling _vcomp_for_static_end()).

Important Functions
void CDECL _vcomp_set_num_threads(int num_threads)
When the compiler encounters num_threads(X), this function is triggered; num_threads indicates the number of threads to be used within the loop.

void CDECL _vcomp_for_static_simple_init(unsigned int first, unsigned int last, int step, BOOL increment, unsigned int* begin, unsigned int* end)
- This function is similar to __kmpc_for_static_init_4 but more advanced.
- Each thread obtains its thread_num. Each thread calculates its begin and end values based on its assigned portion of iterations.
- first and last are the previous begin and end, i.e., the bounds of the whole loop before splitting.
- step and increment describe the loop's stride and direction, covering loops such as for (int i = 0; i < 1000; i += step) and for (int i = 1000; i >= 0; i -= step).
void _vcomp_fork(BOOL ifval, int nargs, void* wrapper, ...)
- ... is a variable argument list, and we have to use va_list-related macros to build up the arguments.
- If only one thread is needed, call the wrapper directly (without going through the KMPTaskQueue), then exit.
- Otherwise, dispatch num_threads - 1 tasks to the KMPTaskQueue. Each task is identical in terms of data and the function to execute, except for the thread_num.
- The main thread executes its own share of the work; if the other threads have not finished yet (tracked by the num_threads_to_wait variable), the thread will sleep until they are finished.

void CDECL _vcomp_fork_call_wrapper(void* wrapper, int nargs, void** args)
This helper is called by _vcomp_fork, and it uses nargs to determine which instance (i.e., how many arguments to pass to wrapper) should be called.

void CDECL _vcomp_for_static_end(void)
Challenges
The main challenge is translating _vcomp_fork_call_wrapper() from assembly to C. A very simple approach (as seen in NCNN) is to limit the number of parameters passed to 32. Then, create 32 functions, each with a corresponding number of parameters, and use a switch statement to determine how many parameters need to be passed, calling the function with the appropriate number of parameters. We will also take this approach. Another, seemingly more "elegant", method is to write the assembly code as vcomp does. A final "possible" approach is to calculate the parameter offsets: since we are passing pointers, it only involves shifting the byte offset accordingly, which can be implemented in C.
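To illustrate the switch-based approach, here is a trimmed-down sketch (only 4 argument counts for brevity; the real implementation enumerates up to 32):

```cpp
// Sketch of the switch-based dispatch: cast the wrapper to a function pointer
// with the right number of void* parameters and call it. Trimmed to 4 cases.
typedef void (*fn0_t)();
typedef void (*fn1_t)(void*);
typedef void (*fn2_t)(void*, void*);
typedef void (*fn3_t)(void*, void*, void*);
typedef void (*fn4_t)(void*, void*, void*, void*);

static void fork_call_wrapper_sketch(void* wrapper, int nargs, void** args)
{
    switch (nargs)
    {
    case 0: ((fn0_t)wrapper)(); break;
    case 1: ((fn1_t)wrapper)(args[0]); break;
    case 2: ((fn2_t)wrapper)(args[0], args[1]); break;
    case 3: ((fn3_t)wrapper)(args[0], args[1], args[2]); break;
    case 4: ((fn4_t)wrapper)(args[0], args[1], args[2], args[3]); break;
    default: break; // argument counts beyond this sketch's limit are not handled here
    }
}
```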
Summary

Simpleomp defines three custom classes: KMPTask, which wraps the function to be executed; KMPTaskQueue, which schedules tasks; and KMPGlobal, which initializes and records the information for the entire multithreading schedule (such as the number of threads, thread IDs, and other details). These three custom classes will be used in the runtime functions defined by the compilers (Clang, GNU, and MSVC) for OpenMP. Since the OpenMP runtime functions differ across these compilers, we need to understand each compiler's ABI and clarify the usage and workflow of these functions.

When the compiler encounters directives like #pragma omp parallel for, it generates and calls the corresponding runtime functions. Therefore, the compiler-defined runtime functions are the starting point (bootstrap), and our task is to implement these runtime functions with our own logic, while ensuring that the function signatures, usage, and ABI remain consistent with what the compiler expects.

How to take advantage of our work
Basically, the only thing you need to do is add the -DNCNN_SIMPLEOMP=ON option when you build NCNN. However, after several tests (again, up to NCNN version 1.2.0), we found that this feature does not work well with a shared library in a Visual Studio setup, so we suggest not enabling the -DNCNN_SHARED_LIB option; hopefully we will solve this issue in the future.