MPI benchmark driver #138
base: develop
Conversation
Force-pushed from ffe986d to 468d002
Please see above for comments on this. I'm also nervous about merging this without some evidence that it works with all supported models.
Can you also think about how we can update the CI to test that this is valid?
```cpp
@@ -137,8 +182,15 @@ std::vector<std::vector<double>> run_all(Stream<T> *stream, T& sum)
    timings[3].push_back(std::chrono::duration_cast<std::chrono::duration<double> >(t2 - t1).count());

    // Execute Dot
#if USE_MPI
    // Synchronize ranks before computing dot-product
    MPI_Barrier(MPI_COMM_WORLD);
```
This overhead is not timed, and waits for 4 kernels to finish before starting the timer for dot. Can you explain the motivation for not including synchronisation time for each kernel?
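For illustration, moving the barrier inside the timed region would charge the synchronisation cost to the kernel being measured instead of hiding it before the timer starts. A minimal sketch, reusing the `t1`/`t2`/`timings` conventions from the hunk above (the `timings[4]` index for dot is an assumption):

```cpp
t1 = std::chrono::high_resolution_clock::now();
sum = stream->dot();
#if USE_MPI
// Wait for every rank's dot kernel, so rank skew is included in the timing
MPI_Barrier(MPI_COMM_WORLD);
#endif
t2 = std::chrono::high_resolution_clock::now();
timings[4].push_back(
    std::chrono::duration_cast<std::chrono::duration<double> >(t2 - t1).count());
```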
```cpp
  {
    // MiB = 2^20
    std::cout << std::setprecision(1) << std::fixed
#if USE_MPI
```
There must be a better way to do this, as this is hard to read. Maybe create a string variable with the label, initialised depending on the MPI case. E.g.,

```cpp
#ifdef USE_MPI
const char *array_size_str = "Array size (per rank): ";
#else
const char *array_size_str = "Array size: ";
#endif
```
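The `#if` then disappears from the I/O statement itself. A hedged usage sketch, with `ARRAY_SIZE` and the element type `T` assumed from the existing driver:

```cpp
std::cout << std::setprecision(1) << std::fixed
          << array_size_str << ARRAY_SIZE * sizeof(T) * 1.0E-6 << " MB"  // per-rank size when MPI is on
          << std::endl;
```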
```cpp
#if USE_MPI
  MPI_Datatype MPI_DTYPE = use_float ? MPI_FLOAT : MPI_DOUBLE;

  // Collect global min/max timings
```
It's not clear that the output will be for a single device, even though this is an MPI code. Is that what we expect, or do we want to (additionally?) report aggregate bandwidth?
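For example, both views could be produced on rank 0 with two reductions: the slowest rank's best time gives the bandwidth the job actually achieved, and summing per-device bandwidths gives the aggregate figure. A sketch only (requires `<algorithm>`; the `timings[i]` and per-kernel byte count `sizes[i]` names are assumed from the driver):

```cpp
// Per-rank best time for kernel i (skipping the first, warm-up iteration)
double local_min = *std::min_element(timings[i].begin() + 1, timings[i].end());

// Slowest rank's best time: the bandwidth the whole job experienced
double worst_min = 0.0;
MPI_Reduce(&local_min, &worst_min, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

// Sum of per-device bandwidths: the aggregate figure
double local_bw = sizes[i] / local_min;
double aggregate_bw = 0.0;
MPI_Reduce(&local_bw, &aggregate_bw, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
```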
```cpp
      << std::left << std::setw(12) << std::setprecision(5) << average
      << std::endl;
      << "--------------------------------"
      << std::endl << std::fixed
```
Why do you need to change this? Especially adding the line?
We've got a large general refactor of the main driver coming in #186.
This PR modifies the current driver `main.cpp` and adds MPI support for launching the benchmark across multiple devices. The main takeaway here: per-rank results such as the dot-product sum are combined across all ranks (`MPI_Allreduce`).

The only major question I have is how MPI should be treated by CMake. I am open to suggestions and happy to comply with whatever you all prefer. A conventional pattern would be an opt-in CMake option plus the imported target from the standard `FindMPI` module, as in the sketch below.
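A sketch only, not necessarily matching this repo's CMakeLists.txt; the `babelstream` target name is assumed:

```cmake
# Opt-in MPI support; OFF keeps the single-device build unchanged
option(USE_MPI "Enable the MPI benchmark driver" OFF)

if (USE_MPI)
  find_package(MPI REQUIRED)                                 # standard FindMPI module
  target_link_libraries(babelstream PRIVATE MPI::MPI_CXX)    # imported target, CMake >= 3.9
  target_compile_definitions(babelstream PRIVATE USE_MPI=1)  # drives the #if USE_MPI guards
endif ()
```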