This repo contains multiple asynchronous implementations of the sieve of Eratosthenes.
The sieve of Eratosthenes is a well-known (and ancient) algorithm for finding prime numbers. The algorithm maintains a list of all numbers up to a certain value. Starting at 2, it finds the next number in the list that has not been crossed off and then crosses off all multiples of that number. If we maintain the list of numbers as a bitmap, with `true` meaning a number is still presumed prime, crossing off the multiples of a prime amounts to iterating through the list with a stride equal to that prime and setting each visited location to `false`.
A simple sequential implementation might look like the following:
```cpp
#include <cstddef>
#include <vector>

template <class bool_t>
auto sieve_seq(size_t n) {
  std::vector<bool_t> sieve(n, true);
  sieve[0] = sieve[1] = false;  // 0 and 1 are not prime
  for (size_t i = 2; i * i < n; ++i) {
    if (sieve[i]) {
      for (size_t j = i * i; j < n; j += i) {
        sieve[j] = false;
      }
    }
  }
  return sieve;
}
```
For the parallel implementations in this repository, instead of processing the entire list of numbers at once, we first sequentially determine all the primes up to $\sqrt{n}$. Using that initial set of primes, the algorithm then finds primes in fixed-size blocks of numbers, each block delimited by multiples of the block size.
The computation is broken into five tasks:

- `input_body()` generates `p`, a sequence of integers starting at 0
- `gen_range()` creates a bitmap for indicating primality (or not)
- `range_sieve()` applies the sieve to block $p$, using the initial set of primes up to $\sqrt{n}$, and records the results in the bitmap obtained from `gen_range()`
- `sieve_to_primes_part()` generates a list of prime numbers from the bitmap generated by `range_sieve()`
- `output_body()` saves the list of primes in a vector at location $p+1$
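To make the block step concrete, here is a hypothetical sketch of sieving one block against the initial set of primes. The function name, signature, and all details are illustrative assumptions, not the repo's actual `range_sieve()` code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch: mark primality for the block
// [block_begin, block_begin + block_size) using a precomputed set of
// base primes covering all values up to sqrt of the overall limit.
std::vector<uint8_t> range_sieve_sketch(size_t block_begin, size_t block_size,
                                        const std::vector<size_t>& base_primes) {
  std::vector<uint8_t> bitmap(block_size, 1);  // 1 = presumed prime
  for (size_t p : base_primes) {
    // First multiple of p inside the block, but never below p*p
    // (smaller multiples were crossed off by smaller primes).
    size_t start = std::max(p * p, ((block_begin + p - 1) / p) * p);
    for (size_t j = start; j < block_begin + block_size; j += p) {
      bitmap[j - block_begin] = 0;
    }
  }
  if (block_begin == 0) {  // 0 and 1 are not prime
    bitmap[0] = bitmap[1] = 0;
  }
  return bitmap;
}
```

Each block depends only on the shared base primes and its own bitmap, which is what makes the blocks independent units of work for the task frameworks below.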
The various implementations included here are based on the same block algorithm and use essentially the same functions. They differ, however, in how concurrency and parallelism are effected.
- async: Uses `std::async` and `std::future` for concurrency and parallelism. Algorithmic steps are chained together via `std::async()` and `std::future::get()`.
- cc: Uses the `concurrencpp` library, based on C++20 coroutines, for concurrency and parallelism. Algorithmic steps are chained together via `co_return` and `co_await`.
- direct: Algorithmic steps are chained together by each function directly using the results of the previous one. Function call chains are launched as separate `std::async` tasks.
- p2300: Uses WG21 P2300 `std::execution` for concurrency and parallelism. Algorithmic steps are chained together with `std::execution` algorithms and `operator|`. The version used for this benchmark is a fork of Kirk Shoop's fork (which added support for `async_scope`); the fork of the fork adds support for `spawn_on`, which is necessary for parallel execution.
- tbb: Uses Intel Threading Building Blocks (oneTBB) for concurrency and parallelism. Algorithmic steps are embedded in `tbb::flow` task-graph nodes.
- unifex: Uses Facebook's `libunifex` for concurrency and parallelism. Algorithmic steps are chained together with `unifex::then` and `operator|`.
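As an illustration of the chaining style used by the async variant, the pattern looks roughly like the following. The stage functions here are simplified stand-ins invented for this sketch, not the repo's actual pipeline stages:

```cpp
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical stand-ins for two pipeline stages.
std::vector<uint8_t> make_bitmap(size_t n) {
  return std::vector<uint8_t>(n, 1);  // placeholder: everything marked "prime"
}

std::vector<size_t> bitmap_to_primes(std::vector<uint8_t> bitmap) {
  std::vector<size_t> primes;
  for (size_t i = 2; i < bitmap.size(); ++i) {
    if (bitmap[i]) primes.push_back(i);
  }
  return primes;
}

// Chain stage two onto stage one: the second task blocks on
// future::get() of the first, so the stages run in order while
// independent chains (one per block) can run in parallel.
std::future<std::vector<size_t>> chain(size_t n) {
  auto bitmap_fut = std::async(std::launch::async, make_bitmap, n);
  return std::async(
      std::launch::async,
      [](std::future<std::vector<uint8_t>> f) {
        return bitmap_to_primes(f.get());
      },
      std::move(bitmap_fut));
}
```

The other variants express the same dependency structure with their own vocabulary: coroutine `co_await` in concurrencpp, sender composition with `operator|` in p2300 and unifex, and graph edges in `tbb::flow`.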
The associated driver programs are named `sieve_<framework>_fun.cpp`. (The "fun" is due not only to this being fun, but also because each driver is based on composing free functions together. There are also "obj" variants, based on function objects, not yet copied here.)
Pull in the concurrencpp, libunifex, tbb, and wg21_p2300_std_execution submodules with git:

```shell
$ git submodule update --init --recursive
```
This will clone concurrencpp, libunifex, oneTBB, and wg21_p2300_std_execution into the "external" subdirectory. Note that wg21_p2300_std_execution should be on the "async_scope" branch (the submodule update should handle this automatically).
```shell
$ mkdir build
$ cd build
$ cmake ..
```
Options available for this project in CMakeLists.txt include:

| Option | Description | Default |
|---|---|---|
| EC_BUILD_SIEVE | Build the sieve demo programs | ON |
| EC_USE_CONCURRENCPP | Use the concurrencpp library | ON |
| EC_USE_WG21_P2003 | Use the std::execution (P2300) library | ON |
| EC_USE_LIBUNIFEX | Use libunifex | ON |
| EC_USE_TBB | Use TBB | ON |
| EC_USE_LOCAL_TBB | Use the TBB submodule (rather than system TBB) | ON |
Important: several submodules in the project (concurrencpp and wg21_p2300) depend on C++20 and other features supported by clang compilers but not by GCC. The examples all compile with Apple clang, but will not compile with g++; cmake will disable those libraries if it detects g++. I haven't tried any clang compilers other than Apple clang, but since Apple clang is usually behind mainstream clang, I expect mainstream clang will work (or will require only small changes). Everything should work with taskflow and with TBB.
Right now, cmake just picks the default compilation flags for the chosen build type. If you want a different set of flags, you will (for now) need to set them yourself. For example:

```shell
$ cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-Ofast -march=native"
```
By default, all of the demos (kite, sieve, and eval) will be built. You can turn each off with the DAG_BUILD_EVAL, DAG_BUILD_KITE, or DAG_BUILD_SIEVE options; the default for each is ON.
A few of the executables depend on TBB, which is included as a submodule. You can either use a system install of TBB (via your favorite package manager) or use the submodule. To use the submodule:

```shell
$ cmake .. -DDAG_USE_LOCAL_TBB=ON
```

The default is ON.
The suite of executables can be built with:

```shell
$ make
```

This will build all of the `sieve_<framework>_fun.exe` executables and put them in `./build/bin`.
From the build subdirectory, the executables can be invoked as:

```shell
$ ./bin/sieve_<framework>_fun.exe [problem_size] [block_size]
```

where `problem_size` is an optional argument specifying the upper limit of numbers to search for primes and `block_size` is how many thousands of numbers to process together in a block. The default `problem_size` is `100'000'000` and the default `block_size` is `100`.
(In initial benchmarks, a `block_size` of `100` seems to offer the best performance.)
The program will run the sieve twice: once using a bitmap of `std::vector<bool>` and once using a bitmap of `std::vector<uint8_t>`. (This remains for historical reasons, to compare the efficiency of `bool` versus `uint8_t` for bitmaps.)
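For context on the `bool` versus `uint8_t` comparison: `std::vector<bool>` is a bit-packed specialization whose `operator[]` returns a proxy object rather than a real reference, which affects both memory footprint and element-access cost. A small illustration:

```cpp
#include <cstdint>
#include <type_traits>
#include <vector>

// std::vector<bool> packs eight flags per byte and hands out proxy
// references; std::vector<uint8_t> stores one byte per flag and hands
// out ordinary references.
static_assert(!std::is_same_v<std::vector<bool>::reference, bool&>,
              "vector<bool>::reference is a proxy class, not bool&");
static_assert(std::is_same_v<std::vector<uint8_t>::reference, uint8_t&>,
              "vector<uint8_t>::reference is a plain reference");
```

The bit-packed version uses roughly one eighth the memory, at the price of masking and shifting on every access; which wins in practice is exactly what the two benchmark runs compare.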
A Jupyter notebook is available in the benchmarks subdirectory:

```shell
$ cd benchmarks
$ jupyter notebook sieve_benchmark.ipynb
```
Executing the notebook will build, run, and plot the benchmarks. Assuming everything builds and runs properly, it takes about 10 minutes to run through all the benchmarking. NB: if you have already run the notebook, you should manually delete the build, log, and img subdirectories; I decided against doing this in the notebook for safety reasons.
The following results were obtained on a Mac Mini M1 (2020) with 8 cores (4 performance, 4 efficiency). The programs were compiled with Apple Clang version 13.0.0 for arm64-apple-darwin20.6.0, with optimization flags "-O3 -DNDEBUG -arch arm64". The TBB used was oneTBB 2021.5.0, the concurrencpp version was v0.1.4, and the std_execution version was hash d219000 from the async_scope branch of https://github.com/lums658/wg21_p2300_std_execution.
Shown from left to right are execution times for sieve implementations using wg21_p2300_std_async, concurrencpp, libunifex, TBB, direct function calls, and std::async. These results were obtained with a block size of 100k numbers. Each bar shows mean and standard deviation over a total of 16 runs for each implementation.