Merge pull request #169 from gvallee/doc_file_info
Add documentation about the file generated by the profiler
gvallee authored Mar 19, 2021
2 parents 87a3045 + 202d7b6 commit 6b2802e
Showing 3 changed files with 137 additions and 1 deletion.
3 changes: 3 additions & 0 deletions CONTRIB.md
@@ -0,0 +1,3 @@
To submit contributions, developers are required to ensure that two sets of tests pass:
1. the CI tests on GitHub, which run once a Pull Request (PR) is created; these tests are executed automatically and developers only need to ensure they pass, or the PR will not be merged.
2. the validation tests, run with the `make validate` command from the top directory of the source code. These tests require MPI to be installed and set up (`PATH` and `LD_LIBRARY_PATH` must point to the correct MPI implementation); a minimal setup sketch is shown below.
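
A sketch of this local setup, assuming a hypothetical MPI installation under `/opt/openmpi` (adjust the paths to your environment):
```
# Hypothetical MPI install prefix; point these at your actual MPI installation.
export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH

# Run the validation tests from the top directory of the source code.
make validate
```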
10 changes: 10 additions & 0 deletions INSTALL.md
@@ -0,0 +1,10 @@
# Recommended configuration

While there are no known restrictions regarding which compilers can be used, we recommend the following configuration:
- gcc, any version compatible with C90
- openmpi 4.0.0 or newer
- go-1.13 or newer for the post-mortem analysis tool

# Installation

Please refer to the `README.md` file for installation instructions.
125 changes: 124 additions & 1 deletion README.md
@@ -99,7 +99,7 @@ mpirun -np $NPROC -x LD_PRELOAD=/global/home/users/geoffroy/projects/alltoall_pr
When using a job scheduler, users are required to correctly set the LD_PRELOAD details
in their scripts or command line.

### Example
#### Example with Slurm

Assuming Slurm is used to execute jobs on the target platform, the following is an example of
a Slurm batch script that runs the OSU microbenchmarks and gathers all the profiling traces
@@ -137,6 +137,74 @@ mpirun -np 1024 -map-by ppr:32:node -bind-to core $MPIFLAGS -x LD_PRELOAD="$A2AT
mpirun -np 1024 -map-by ppr:32:node -bind-to core $MPIFLAGS -x LD_PRELOAD="$LATETIMINGFLAGS" /path/to/osu/install/osu-5.6.3/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoallv -f
```

### Generated data

When using the default shared libraries, the following files are generated; they are detailed further below:
- files with names starting with `send-counters` and `recv-counters`, which provide the alltoallv counts as defined by the MPI standard,
- files prefixed with `alltoallv_locations`, which store data about the location of the ranks involved in alltoallv operations,
- files prefixed with `alltoallv_late_arrival`, which store timing data about the ranks' arrival into the alltoallv operations,
- files prefixed with `alltoallv_execution_times`, which store the time each rank spent in the alltoallv operations,
- files prefixed with `alltoallv_backtrace`, which store information about the context in which the application invokes alltoallv.

In order to compress data and control the size of the generated dataset, the tool is able to use a compact notation to avoid duplication in lists. This notation is mainly applied to lists of ranks. The format is a comma-separated list where consecutive numbers are saved as a range. For example, ranks `1, 3` means ranks 1 and 3; ranks `2-5` means ranks 2, 3, 4, and 5; and ranks `1, 3-5` means ranks 1, 3, 4, and 5.

#### Send and receive count files

A `send-counters` file and a `recv-counters` file are generated per communicator used to perform alltoallv operations. In other words, if alltoallv operations are executed on a single communicator, only two files are generated: `send-counters.job<JOBID>.rank<LEADRANK>.txt` and `recv-counters.job<JOBID>.rank<LEADRANK>.txt`, where `JOBID` is the job number when a job manager such as Slurm is used (equal to 0 when no job manager is used) and `LEADRANK` is the rank on `MPI_COMM_WORLD` that is rank 0 on the communicator used. `LEADRANK` is therefore used to differentiate data from different sub-communicators.

The content of the count files is predictable and organized as follows:
- `# Raw counters` indicates a new set of counts and is always followed by an empty line.
- `Number of ranks:` indicates how many ranks were involved in the alltoallv operations.
- `Datatype size:` indicates the size of the datatype used during the operation. Note that at the moment, the size is saved only in the context of the lead rank (as previously defined); alltoallv communications involving different datatype sizes are currently not supported.
- `Alltoallv calls:` indicates how many alltoallv calls *in total* (not specifically for the current set of counts) are captured in the file.
- `Count:` indicates how many alltoallv calls have the counts reported below. This line gives the number of such calls as well as the list of those calls using our compact notation.
- And finally the raw counts, which are delimited by `BEGINNING DATA` and `END DATA`. Each line of the raw counts gives the counts for one or more ranks. Please refer to the MPI standard to fully understand the semantics of counts. `Rank(s) 0, 2: 1 2 3 4` means that ranks 0 and 2 have the following counts: 1 for rank 0, 2 for rank 1, 3 for rank 2, and 4 for rank 3 (see the illustrative excerpt below).
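
This hypothetical excerpt is assembled from the markers described above; the values, the number of ranks, and the exact layout after each marker are illustrative and may differ from real output:
```
# Raw counters

Number of ranks: 4
Datatype size: 4
Alltoallv calls: 484
Count: 2 calls - 0, 2

BEGINNING DATA
Rank(s) 0, 2: 1 2 3 4
Rank(s) 1, 3: 4 3 2 1
END DATA
```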

#### Time files: alltoallv_late_arrival* and alltoallv_execution_times*

The first line is the version of the data format. This is used for internal purposes to ensure that the post-mortem analysis tool supports that format.

Then the file contains a series of timing data per call. Each call's data starts with `# Call` followed by the number of the call, and then the ordered list of timing data per rank.

All timings are in seconds.
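
As an illustration, a hypothetical excerpt for a 4-rank run is shown below; the exact wording of the version line and the formatting of the per-rank values are placeholders:
```
FORMAT_VERSION: 1

# Call 0
0.000125
0.000087
0.000102
0.000098
```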

#### Location files

The first line is the version of the data format. This is used for internal purposes to ensure that the post-mortem analysis tool supports that format.

Then the file has a series of entries, one per unique location, where a location is defined by the rank on the communicator and the host name. An example of such a location is:
```
Hostnames:
Rank 0: node1
Rank 1: node2
Rank 2: node3
```
In order to control the size of the dataset, the metadata for each unique location includes: the communicator identifier (`Communicator ID:`), the list of calls having that unique location (`Calls:`), the ranks on MPI_COMM_WORLD (`COMM_WORLD_ rank:`), and the PIDs (`PIDs`).
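
Putting this together, a single location entry could look like the following hypothetical excerpt (the ordering and exact layout of the fields are illustrative):
```
Communicator ID: 0
Calls: 0-483
COMM_WORLD_ rank: 0-2
PIDs: 12345, 12346, 12347
Hostnames:
Rank 0: node1
Rank 1: node2
Rank 2: node3
```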

#### Trace files

The first line is the version of the data format. This is used for internal purposes to ensure that the post-mortem analysis tool supports that format.

After the format version, a line prefixed with `stack trace for` indicates the binary associated with the trace. In most cases, only one binary will be reported.

Then the file has a series of entries, one per unique backtrace, where a backtrace is the data returned by the `backtrace` function. An example of such a backtrace is:
```
/home/user1/collective_profiler/src/alltoallv/liballtoallv_backtrace.so(_mpi_alltoallv+0xd4) [0x147c58511fa8]
/home/user1/collective_profiler/src/alltoallv/liballtoallv_backtrace.so(MPI_Alltoallv+0x7d) [0x147c5851240c]
./wrf.exe_i202h270vx2() [0x32fec53]
./wrf.exe_i202h270vx2() [0x866604]
./wrf.exe_i202h270vx2() [0x1a5fd30]
./wrf.exe_i202h270vx2() [0x148ad35]
./wrf.exe_i202h270vx2() [0x5776ba]
./wrf.exe_i202h270vx2() [0x41b031]
./wrf.exe_i202h270vx2() [0x41afe6]
./wrf.exe_i202h270vx2() [0x41af62]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x147c552d57b3]
./wrf.exe_i202h270vx2() [0x41ae69]
```

Finally, each unique trace is associated with one or more contexts (`# Context` followed by the context number, i.e., the order in which it has been detected). A context is composed of a communicator (`Communicator`), the rank on the communicator (`Communicator rank`), which in most cases is `0` because it is the lead rank of the communicator, the rank on MPI_COMM_WORLD (`COMM_WORLD rank`), and finally the list of alltoallv calls having the backtrace, using the compact notation previously presented.
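
For example, a context entry could look like the following hypothetical excerpt (the labels follow the description above; the values and the label used for the list of calls are illustrative):
```
# Context 0
Communicator: 0
Communicator rank: 0
COMM_WORLD rank: 0
Calls: 0-483
```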

# Post-mortem analysis

We provide a set of tools that parses and analyses the data compiled when executing
@@ -172,6 +240,61 @@ The set of post-mortem analysis can use data generated by the profiler, intermed
data generated by other tools, creating a chain of dependencies. [A graphic shows the internal
dependencies](doc/tool_dependencies.png) for the main tools provided by this project.

### Generated data

The execution of the `profile` command to run the post-mortem analysis generates the following files:
- a single file prefixed with `profile_alltoallv` which gives an overview of some of the results from the post-mortem analysis,
- files prefixed with `patterns-` that present all the patterns that were detected (see the patterns sub-section for details), as well as a summary (file prefixed with `pattern-summary`),
- files prefixed with `stats` which provide statistics based on the send and receive count files,
- files prefixed with `ranks_map` which provide a map of the ranks on each node,
- a file named `rankfile.txt` which gives the location of each rank,
- files prefixed with `alltoallv_heat_map` that contain heat maps for individual alltoallv calls, i.e., the amount of data that is sent or received on a per-rank basis. The send heat map file name is suffixed with `send.md`, while the receive heat map file name is suffixed with `recv.md`.
- files prefixed with `alltoallv_hosts_heat_map` that have a similar heat map but based on hosts rather than ranks.

#### Post-mortem overview files

Every post-mortem analysis generates a single `profile_alltoallv*` file. The file is in Markdown and its format is as follows:
- A summary that gives the size of MPI_COMM_WORLD and the total number of alltoallv calls that the profiler tracked.
- A series of datasets, where a dataset is a group of alltoallv calls having the same characteristics, including:
    - the communicator size used for the alltoallv operation,
    - the number of alltoallv calls,
    - how many send and receive counts were equal to zero (data used to assess the sparsity of alltoallv calls).
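
As a purely illustrative sketch of the kind of information this file contains (the actual Markdown headings, labels, and layout may differ):
```
Total number of alltoallv calls: 484
Size of MPI_COMM_WORLD: 1024

## Dataset #0
Communicator size: 1024
Number of alltoallv calls: 61
Send counts equal to zero: 12345
Receive counts equal to zero: 12345
```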

#### Patterns

Two types of pattern files are generated.

Files with the `patterns-job<JOBID>-rank<LEADRANK>.md` naming scheme, where `JOBID` and `LEADRANK` follow the definitions previously presented. A pattern captures how many ranks are actually in communication with other ranks during a given alltoallv call. This is valuable information when send and receive counts include counts equal to zero.
The file is organized as follows:
- first the pattern ID with the number of alltoallv calls that have the pattern. For instance, `## Pattern #0 (61/484 alltoallv calls)` means that 61 alltoallv calls out of 484 have pattern 0.
- the list of calls having that pattern using the compact notation previously presented,
- the pattern itself is a succession of entries that are either `X ranks send to Y other ranks` or `X ranks recv'ed from Y other ranks`.
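
A hypothetical excerpt using the markers described above (the call numbers, the counts, and the label for the list of calls are made up):
```
## Pattern #0 (61/484 alltoallv calls)

Alltoallv calls: 0-60

1022 ranks send to 16 other ranks
2 ranks send to 18 other ranks
16 ranks recv'ed from 1024 other ranks
```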

Files with the `patterns-summary-job<JOBID>-rank<LEADRANK>.md` naming scheme, where `JOBID` and `LEADRANK` follow the definitions previously presented. These files capture the patterns that have predefined characteristics, such as 1->n patterns (a few ranks send or receive to/from many ranks). These patterns are useful for detecting alltoallv operations that do not involve all the ranks and may therefore create performance bottlenecks.

#### Statistics files

These files are based on the following format:
- the total number of alltoallv calls,
- the description of the datatypes that have been used during the alltoallv calls; for example, `484/484 calls use a datatype of size 4 while sending data` means that 484 out of a total of 484 alltoallv calls (so all the calls) used a send datatype of size 4,
- the communicator size,
- the message sizes, which are calculated using the counts and the datatype size; the sizes are grouped based on a threshold (which can be customized) into small messages (below the threshold), large messages, and small but non-zero messages,
- minimum and maximum count values.
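
As a small example of how these files can be consumed, a minimal shell sketch (assuming the `stats` prefix and the sentence format documented above; actual file names may differ) that summarizes the datatype sizes reported across all statistics files:
```
# Count how often each "datatype of size" line appears across all stats files.
grep -h "datatype of size" stats* | sort | uniq -c
```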

#### Heat maps

A heat map is defined as the amount of data exchanged between ranks. Two different heat maps are generated: a rank-based heat map and a host-based heat map.

The rank-based heat map (e.g., from the file `alltoallv_heat-map.rank0-send.md`) is organized as follows:
- the first line is the version of the data format. This is used for internal purposes to ensure that the post-mortem analysis tool supports that format,
- then a list of all numbered calls with the amount of data each rank sends or receives.

The host heat map is very similar, but the amount of data is presented on a per-host basis.

#### Rank maps

These files present the amount of data exchanged between ranks on a specific node. The files are named as follows: `ranks_map_<hostname>.txt`. Each line represents the amount of data a rank is sending, the line number being the rank on the communicator used to perform the alltoallv operation (line 0 for rank 0; line _n_ for rank _n_).
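
A minimal sketch, assuming the 0-based line numbering described above, to print the data recorded for a given rank on a given host (the host name `node1` and rank `3` are hypothetical):
```
# Rank r of the communicator is stored on physical line r+1 of the file,
# since the documentation numbers lines starting from 0.
rank=3
sed -n "$((rank + 1))p" ranks_map_node1.txt
```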

# Visualization using the WebUI

The project provides a WebUI to visualize the data. The interface assumes that postmortem