[WIP] ENH: DAOS and DFS modules #1014

Open · wants to merge 60 commits into main
Conversation

@shanedsnyder (Contributor) commented on Oct 31, 2024

This PR adds new instrumentation of DAOS storage APIs and corresponding updates to our analysis tools to integrate this DAOS data. Specifically, 2 new Darshan modules are defined: DARSHAN_DFS_MOD for instrumenting usage of the DAOS file system (DFS) API and DARSHAN_DAOS_MOD for instrumenting native DAOS object APIs. More details on each module below.

DFS module:

  • For each DFS file, Darshan captures a fixed set of integer/FP counters (see full list in dfs-log-format.h) and the corresponding DAOS pool/container UUIDs.
  • DFS file record names are based on the full path in the DFS directory tree, similar to our other file-based modules.
  • DFS file record IDs are based on the underlying DAOS OID, not the file name.
    • This approach was used because not all DFS file open routines take a filename as input (e.g., dfs_obj_global2local()), meaning not all processes will have the filename available to generate a consistent record ID -- using the object OID allows all processes to agree on a consistent record ID value.
    • One side effect worth mentioning is that, since Darshan records are based on underlying OIDs and not file names, deleting and recreating a file results in multiple Darshan records corresponding to the same file -- this behavior can easily be observed in benchmarks like IOR that delete/recreate the output file on each iteration. It will ultimately be the responsibility of analysis tools to aggregate file records in this case.
  • The pool_uuid:cont_uuid combo is used in place of the mount pt in tools like darshan-parser.
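As a hedged illustration of why OID-based record IDs stay consistent across processes: any rank holding the object handle can derive the same ID from the OID alone, with or without the filename. The hash choice below is purely illustrative -- Darshan's actual record-ID hashing is different and lives in the C runtime.

```python
import hashlib
import struct

def record_id_from_oid(oid_hi: int, oid_lo: int) -> int:
    """Derive a 64-bit record ID from a DAOS OID (illustrative hash,
    not Darshan's actual scheme). Every process that opened the
    object -- even via dfs_obj_global2local(), with no pathname in
    hand -- computes the same value."""
    digest = hashlib.sha256(struct.pack("<QQ", oid_hi, oid_lo)).digest()
    return int.from_bytes(digest[:8], "little")

# Two "ranks" holding the same OID agree on the record ID:
rid_a = record_id_from_oid(937047793718163273, 416)
rid_b = record_id_from_oid(937047793718163273, 416)
print(rid_a == rid_b)  # True
```

A name-based hash could not offer this guarantee, since ranks that obtained the object handle without a path have no name to hash.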

Example darshan-parser output line:

#<module>       <rank>  <record id>     <counter>       <value> <file name>     <mount pt>      <fs type>
DFS     -1      13156018442998895329    DFS_OPENS       2       /testFile       f4996f65-9c9a-41c6-ac18-88059a11aeb1:b445df4d-0f29-462a-9c70-a80bf5a5a0f9       N/A
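To sketch the aggregation burden this design places on analysis tools, here is a hedged example of merging multiple OID-based records that share a file name (as happens when IOR deletes and recreates its output file). The record layout and sum-all-counters merge rule are simplified assumptions, not actual darshan-util behavior.

```python
from collections import defaultdict

def aggregate_by_name(records):
    """Merge DFS records that share a file name by summing their
    integer counters. Hypothetical record layout: each record is a
    dict with a 'name' and a 'counters' dict."""
    merged = defaultdict(lambda: defaultdict(int))
    for rec in records:
        for counter, value in rec["counters"].items():
            merged[rec["name"]][counter] += value
    return {name: dict(counters) for name, counters in merged.items()}

# Two records for the same /testFile: each recreation of the file
# yields a new OID, hence a new Darshan record.
records = [
    {"name": "/testFile", "counters": {"DFS_OPENS": 2, "DFS_WRITES": 8}},
    {"name": "/testFile", "counters": {"DFS_OPENS": 1, "DFS_WRITES": 4}},
]
print(aggregate_by_name(records))
```

In practice a real tool would also need rules for min/max and timestamp counters, which do not merge by simple summation.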

DAOS module:

  • For each DAOS object, Darshan captures a fixed set of integer/FP counters (see full list in daos-log-format.h), the corresponding DAOS pool/container UUIDs, and the full DAOS OID.
    • There are actually 3 distinct DAOS object APIs tracked in the Darshan DAOS module: object (DAOS_OBJ), array (DAOS_ARRAY), and KV (DAOS_KV).
  • DAOS object records have no name -- when printing these records in darshan-util programs, we just print the OID in string format (i.e., oid_hi.oid_lo, the same approach as DAOS's own utilities).
    • Small changes were made to the darshan-runtime and darshan-util libraries to allow for records that have no associated name.
  • DAOS object record IDs are based on the underlying DAOS OID.
    • This makes it trivial to identify which DAOS object records correspond to which DFS file records, as they will have the same Darshan record identifier.
  • The pool_uuid:cont_uuid combo is used in place of the mount pt in tools like darshan-parser.

Example darshan-parser output line:

#<module>       <rank>  <record id>     <counter>       <value> <file name>     <mount pt>      <fs type>
DAOS    -1      13156018442998895329    DAOS_OBJ_OPENS  1       937047793718163273.416  f4996f65-9c9a-41c6-ac18-88059a11aeb1:b445df4d-0f29-462a-9c70-a80bf5a5a0f9       N/A
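The nameless-record printing convention is simple to restate: join the two 64-bit OID components with a dot, as DAOS's own utilities do. A minimal sketch (the real printing code in darshan-util is C, not Python):

```python
def oid_to_str(oid_hi: int, oid_lo: int) -> str:
    """Render a DAOS OID as 'oid_hi.oid_lo', the same string format
    DAOS's own utilities use for object IDs."""
    return f"{oid_hi}.{oid_lo}"

# The OID shown in the darshan-parser example line:
print(oid_to_str(937047793718163273, 416))  # "937047793718163273.416"
```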

Both DFS and DAOS modules integrate with the Darshan heatmap module to generate histograms of I/O activity on each process. Both DFS and DAOS modules have also fully implemented darshan-util and PyDarshan functionality, including support for generating PyDarshan summary reports detailing DFS/DAOS access patterns. PyDarshan tests have been updated to ensure expected behavior when parsing logs containing DFS/DAOS data.
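To illustrate the kind of data the heatmap integration produces, here is a hedged sketch that bins per-rank I/O events into fixed-width time slots. The event layout, bin count, and bytes-per-bin semantics are illustrative assumptions, not the heatmap module's actual on-disk format.

```python
def bin_io_activity(events, runtime, nbins):
    """Bin (rank, timestamp, nbytes) I/O events into a per-rank
    histogram of bytes moved per time slot."""
    hist = {}
    width = runtime / nbins
    for rank, t, nbytes in events:
        b = min(int(t / width), nbins - 1)  # clamp t == runtime into last bin
        hist.setdefault(rank, [0] * nbins)[b] += nbytes
    return hist

# Three I/O events across two ranks over a 1-second run, 4 bins:
events = [(0, 0.1, 1024), (0, 0.9, 2048), (1, 0.5, 512)]
print(bin_io_activity(events, runtime=1.0, nbins=4))
```

Each rank's row of bins is what a heatmap plot renders as one horizontal strip of I/O intensity over time.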

There are a few outstanding items that are not addressed in this PR:

  • There is no DXT support for the DAOS modules yet. It seems like the right call to limit the scope of changes here and weigh that capability against other development priorities going forward.
  • DAOS data is integrated into most of the relevant sections in PyDarshan summary reports, but not in the "data access by category" plots. I created an issue to track this: ENH: add new DAOS module data to PyDarshan "data access by category" plots #1015

Replaces #739

Shane Snyder and others added 30 commits October 22, 2024 19:01
* add CFFI shims needed to access DFS
record data at the Python level

* adjust `test_main_all_logs_repo_files()` to handle
the new `ior` `DFS` log file from Shane--it has a single
runtime heatmap for `STDIO`

* `test_module_table()` has been updated with a regression
case for Shane's new DFS log file

* add `test_dfs_daos_posix_match()` to ensure counter
equivalence between similar `ior..` runs with DAOS vs.
POSIX (NOTE: these actually don't look that similar yet--xfailed
for now..)
* adjust `test_dfs_daos_posix_match()` to handle
the two new POSIX/DAOS "mirror files" from Shane;
the `xfail` has been removed and it now passes

* there seems to be some reasonable agreement
between the logs, which is good; see the test
proper for data columns that do not match or
required special handling for DFS-POSIX equivalence
testing

* a few other test suite shims after Shane changed
the POSIX/DAOS mirror files
* add DFS support to I/O cost graph
in summary reports, with some light
unit testing
* add a DFS per-module stats section to the Python
summary report, and some initial tests
* simplify the "time" counter handling in
`test_dfs_daos_posix_match()` based on reviewer
feedback

* `DFS_SLOWEST_RANK` is ignored in the comparisons
in `test_dfs_daos_posix_match()` based on reviewer
feedback

* the comment about `STAT` counter differences in
`test_dfs_daos_posix_match` was removed, based on
reviewer feedback
The OID backing a DFS file can change if the file is deleted and
recreated.
We don't currently have a way to generate Darshan record IDs
given only a pathname -- they are based on OIDs.
Shane Snyder and others added 10 commits October 22, 2024 19:01
* requires interception of `daos_cont_open` routines to allow
  mapping of container handles to pool/cont UUIDs
* DAOS module record ID now based on OID, cont UUID, and pool UUID
* add logic to allow name records with zero-length names to be
  updated with names in later register_record calls
  - this is useful because DAOS/DFS generate the same record IDs
    for "file objects", but the DAOS module does not register a
    name with the record and registers the record before DFS module
when reading name records from the log file, allow for updating
an existing zero-length name record
@shanedsnyder shanedsnyder added this to the 3.4.7 milestone Oct 31, 2024
@shanedsnyder shanedsnyder reopened this Nov 8, 2024
@shanedsnyder shanedsnyder changed the title WIP: DAOS and DFS modules ENH: DAOS and DFS modules Nov 12, 2024
@shanedsnyder shanedsnyder changed the title ENH: DAOS and DFS modules [WIP] ENH: DAOS and DFS modules Nov 12, 2024