WIP, ENH: memory efficient DXT segs #784

Open: tylerjereddy wants to merge 2 commits into main.

Conversation

tylerjereddy (Collaborator)

Fixes #779

  • at the moment on main, DXT record data is effectively
    stored as a list of dictionaries of lists of dictionaries
    that look like this:
DXT_list -> [rec0, rec1, ..., recN]
recN -> {"id":, ...,
          "rank":, ...,
          "write_segments": ...,
          ...}
recN["write_segments"] -> [seg0, seg1, ..., segN]
segN -> {"offset": int,
         "length": int,
         "start_time": float,
         "end_time": float}
  • the list of segments is extremely memory inefficient, with
    the smallest file in the matching issue exceeding 20 GB of
    physical memory in mod_read_all_dxt_records (a sketch for
    reproducing this kind of profile follows this list):
Line #    Mem usage    Increment  Occurrences   Line Contents
   852                                                 # fetch records
   853   92.484 MiB   18.820 MiB           1           rec = backend.log_get_dxt_record(self.log, mod, dtype=dtype)
   854 20295.188 MiB    0.773 MiB        1025           while rec != None:
   855 20295.188 MiB    0.000 MiB        1024               self.records[mod].append(rec)
   856 20295.188 MiB    0.000 MiB        1024               self.data['modules'][mod]['num_records'] += 1
   857
   858                                                     # fetch next
   859 20295.188 MiB 20201.930 MiB        1024               rec = backend.log_get_dxt_record(self.log, mod, reads=reads, writes=writes, dtype=dtype)
  • if we switch to NumPy arrays the memory footprint drops
    substantially (see below), and performance informally seems
    similar (36 seconds vs. 33 seconds on main to produce a
    report object with the smallest file in the matching issue):
Line #    Mem usage    Increment  Occurrences   Line Contents
   859 3222.547 MiB 3146.344 MiB        1024               rec = backend.log_get_dxt_record(self.log, mod, reads=reads, writes=writes, dtype=dtype)
  • this branch currently uses NumPy record arrays, because I
    thought they'd be a better fit for a data structure with 2 int
    columns and 2 float columns; however, there is a big performance
    hit compared to regular NumPy arrays (almost 6 minutes vs. 33
    seconds for the smallest file in the matching issue); so, if we
    could live without the extra dtype structuring of a recarray,
    maybe that would be best (we could also try a pandas dataframe,
    which is another natural fit for dtype columns)
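
For context, the line-by-line numbers quoted above are in the format emitted by memory_profiler. A minimal sketch of how a similar profile could be collected (assuming memory_profiler is installed; the log path is a placeholder, and the original numbers came from decorating mod_read_all_dxt_records inside pydarshan rather than a wrapper like this one):

```
# sketch only: line-by-line memory profile of report creation,
# using memory_profiler's @profile decorator
# run with: python -m memory_profiler this_script.py
from memory_profiler import profile

import darshan


@profile
def build_report():
    # placeholder log path; read_all=True pulls in the DXT records,
    # which is where the per-segment memory cost shows up
    return darshan.DarshanReport("some_dxt_log.darshan", read_all=True)


if __name__ == "__main__":
    build_report()
```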

arr_write[i, ...] = (segments[i].offset,
                     segments[i].length,
                     segments[i].start_time,
                     segments[i].end_time)
tylerjereddy (Collaborator, Author)

@carns @shanedsnyder @jakobluettgau I feel like there should be something much cleaner/faster here. Since we can iterate over segments in Python I feel like we should be able to stride over the segment_info buffer in some kind of organized way. Each segment appears to be fixed size, with two ints and two floats, right?

I played a bit with np.frombuffer() earlier today, but didn't quite get it to work. I'm not sure if this is pushing up against gh-497 already, but I don't think so, since I'm only talking about the segments in a given record, rather than all the records in a log file.

If you know off the top of your head how the C memory layout works for the segments from struct segment_info *, that would be helpful to know. I'm sure it can be dug up from the source, of course.

tylerjereddy (Collaborator, Author)

This all assumes that we can live with changing the DXT segment storage format, but I think it is pretty clear that the memory footprint of thousands of pure Python objects isn't going to work.

Contributor

We discussed this live the other day, but belatedly for posterity, here is the definition of the binary format for the dxt data in the logs:

https://github.com/darshan-hpc/darshan/blob/main/include/darshan-dxt-log-format.h#L19

Each segment is indeed a fixed size (all 64 bit values; there shouldn't be any padding in the struct) and the segments are laid out linearly in the log.
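
Given that layout, one option (a sketch, not code from this PR) would be to map the raw segment buffer straight into a structured NumPy array with np.frombuffer, assuming the CFFI pointer for the struct segment_info array can be exposed as a buffer and that the field order matches darshan-dxt-log-format.h:

```
import numpy as np

# dtype mirroring struct segment_info: two 64-bit ints followed by two
# 64-bit floats, assuming no padding and little-endian byte order
seg_dtype = np.dtype([("offset", "<i8"),
                      ("length", "<i8"),
                      ("start_time", "<f8"),
                      ("end_time", "<f8")])


def segments_from_buffer(raw_bytes, count):
    # raw_bytes is assumed to be the bytes backing the C array of
    # `count` segments (e.g. obtained via ffi.buffer(ptr, count * 32))
    arr = np.frombuffer(raw_bytes, dtype=seg_dtype, count=count)
    # frombuffer returns a read-only view; .copy() detaches it from
    # the underlying log buffer if it needs to outlive that memory
    return arr.copy()
```

With that dtype, both segs["offset"][n] and segs[n]["offset"] style lookups work on the resulting structured array.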

arr_read = np.recarray(rcnt, dtype=[("offset", int),
                                    ("length", int),
                                    ("start_time", float),
                                    ("end_time", float)])
tylerjereddy (Collaborator, Author)

I really liked the idea of using recarray because it retains the original metadata in the structured dtype, but it seems to carry a huge performance overhead compared to conventional NumPy float64 arrays.

I believe the test suite didn't actually complain much about storing all the data as, say, float64, though that is something we should consider (is the int->float casting safe in general, etc.).
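
For comparison (a sketch, not code from this branch): a plain structured ndarray keeps the named int/float columns without going through np.recarray, a recarray-style view can be layered on top only where attribute access is wanted, and the int->float question mostly comes down to float64 representing integers exactly only up to 2**53:

```
import numpy as np

seg_dtype = np.dtype([("offset", np.int64),
                      ("length", np.int64),
                      ("start_time", np.float64),
                      ("end_time", np.float64)])

# plain structured array: named, typed columns without recarray machinery
arr = np.zeros(3, dtype=seg_dtype)
arr["offset"] = [0, 3072, 8448]   # example values only
arr["length"] = [8, 5376, 260]

# attribute-style access as a cheap view where it is convenient
rec_view = arr.view(np.recarray)
assert rec_view.offset[1] == arr["offset"][1]

# storing offsets/lengths as float64 instead stays exact only up to
# 2**53, which is the caveat behind the int->float casting question
print(float(2**53) == 2**53)          # True
print(float(2**53 + 1) == 2**53 + 1)  # False
```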

tylerjereddy (Collaborator, Author) commented Jul 27, 2022

We may also want to try a pandas DataFrame. There's a bit of a design issue in that we have both to_numpy() and to_df() methods, one of which may become redundant depending on what we decide here, and to_dict() is likely a bad idea based on the memory profiling.
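
If the dataframe route were explored, pandas can build its columns directly from a structured array, so a to_df()-style conversion could stay thin; a minimal sketch under that assumption:

```
import numpy as np
import pandas as pd

seg_dtype = np.dtype([("offset", "<i8"), ("length", "<i8"),
                      ("start_time", "<f8"), ("end_time", "<f8")])
segments = np.zeros(4, dtype=seg_dtype)

# the field names become the offset/length/start_time/end_time columns
df = pd.DataFrame(segments)
print(df.dtypes)
```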

tylerjereddy (Collaborator, Author)

After my revisions this afternoon, I think this may now be far better than main for both memory and performance--see below. Note that generating the HTML summaries is a separate matter; this is just to improve the memory footprint and performance of report generation with DXT data, per the matching issue.

Table 1: time to generate the report object for logs provided by the NERSC team in gh-779

| Branch           | dbwy_....darshan | run_...457...darshan | run_...458...darshan |
|------------------|------------------|----------------------|----------------------|
| main             | 33 s             | 1 minute, 8 s        | 1 minute, 8 s        |
| treddy_issue_779 | 4 s              | 7.5 s                | 8.4 s                |

I should note that I ran the tests on this branch on a slower machine with less memory as well (my laptop), which probably makes the numbers even more dramatic, though folks should confirm of course.

Our benchmark suite is neutral on this feature branch, which probably means we need to expand coverage there a bit, but that can perhaps be dealt with separately. It could also be a matter of basically needing very large files to detect the original issue.
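
For reference, the Table 1 numbers are informal wall-clock timings; a minimal sketch of that kind of measurement, assuming one of the gh-779 logs is available locally (the path below is a placeholder):

```
import time

import darshan

start = time.perf_counter()
# read_all=True parses the DXT records, which is the expensive step
# being compared between main and this branch
report = darshan.DarshanReport("some_gh779_log.darshan", read_all=True)
print(f"report generation took {time.perf_counter() - start:.1f} s")
```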

carns (Contributor) commented Aug 4, 2022

Wow, thanks @tylerjereddy , that's a huge difference!

shanedsnyder (Contributor)

Just to make sure I understand current state here, do your most recent revisions give you both the "good" performance and the structured layout we want (e.g., you can still do rec['write_segments'][n]['offset'], and expect the type to be an int)?

If so, then for end users it seems that even though we changed the segments from a list of dictionaries to a numpy record array, the interface is basically the same, just with better memory usage? That would be a slam dunk.

Commits pushed to the branch:

* more efficient `log_get_dxt_record()` by reading directly from the
  C-contiguous segment buffer into NumPy recarrays
* simplify the changes to `test_dxt_records()`

tylerjereddy (Collaborator, Author)

@shanedsnyder as far as I can tell, that call signature was never possible, but if I adjust to, e.g., rec['read_segments']['offset'][n] (i.e., flip the column name and row index order), then I get the same result on main and on this feature branch per the sample code below. I think the "diff" you'll see in the results below comes from the fact that column names get added on this branch even when a dataframe is otherwise empty, but not on main. That's probably an improvement as well, though largely irrelevant, I suspect.

import darshan

def main():
    # spot check a few DXT_POSIX records
    recs_to_check = [1, 17, 289]
    report = darshan.DarshanReport("e3sm_io_heatmap_and_dxt.darshan",
                                   read_all=True)
    recs = report.records["DXT_POSIX"].to_df()
    print(type(recs))
    print(len(recs))
    print(type(recs[0]))
    print(recs[0])
    # Shane's query re: consistency:
    for rec_index in recs_to_check:
        rec = recs[rec_index]
        print(rec['read_segments']['offset'][2])


if __name__ == "__main__":
    main()

Then run the script on both branches and diff the text output: git diff --no-index result_main.txt result_branch.txt

diff --git a/result_main.txt b/result_branch.txt
index 5f7df81..a83fb7d 100644
--- a/result_main.txt
+++ b/result_branch.txt
@@ -2,7 +2,7 @@
 624
 <class 'dict'>
 {'id': 3748348667804457392, 'rank': 0, 'hostname': 'nid03824', 'write_count': 0, 'read_count': 17, 'write_segments': Empty DataFrame
-Columns: []
+Columns: [offset, length, start_time, end_time]
 Index: [], 'read_segments':      offset  length  start_time  end_time
 0         0       8    0.183510  0.320738
 1         0  262144    0.364255  0.374655

So, I think that is reasonably safe-looking, but you may want to add a more formal regression test of this nature if you're not convinced the current test suite is sufficient (a sketch follows the printout below). Importantly though, this is only for the to_df() method (see notes below).

And, to be clear, sample printout on this branch:

<class 'list'>
624
<class 'dict'>
{'id': 3748348667804457392, 'rank': 0, 'hostname': 'nid03824', 'write_count': 0, 'read_count': 17, 'write_segments': Empty DataFrame
Columns: [offset, length, start_time, end_time]
Index: [], 'read_segments':      offset  length  start_time  end_time
0         0       8    0.183510  0.320738
1         0  262144    0.364255  0.374655
2      3072    5376    0.393613  0.393686
3      8448     260    0.402161  0.402183
4    120924     260    0.422415  0.422432
5    233400    5376    0.422755  0.422809
6    238776    3900    0.423005  0.423020
7   1925916    3900    0.494480  0.496708
8   3613056    5376    0.566494  0.569992
9   3618432    2600    0.571324  0.571992
10  4743192    2600    0.625726  0.627753
11  5867952    5376    0.688010  0.690604
12  5873328    2340    0.691985  0.692764
13  6885612    2340    0.770699  0.771444
14  7897896    5376    0.808871  0.811797
15  7903272    4420    0.813007  0.813835
16  9815364    4420    0.875015  0.876950}
176160768
159383936
179548
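
A minimal sketch of the kind of regression test suggested earlier, assuming the e3sm_io_heatmap_and_dxt.darshan log from the spot check is available to the test suite and using the three offsets printed above as the expected values:

```
import darshan
import pytest


@pytest.mark.parametrize("rec_index, expected_offset", [
    # expected values taken from the spot-check printout above
    (1, 176160768),
    (17, 159383936),
    (289, 179548),
])
def test_dxt_read_segment_offsets(rec_index, expected_offset):
    report = darshan.DarshanReport("e3sm_io_heatmap_and_dxt.darshan",
                                   read_all=True)
    rec = report.records["DXT_POSIX"].to_df()[rec_index]
    assert rec["read_segments"]["offset"][2] == expected_offset
```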

I also rebased this PR while I was at it.

I've never even wanted to use the to_dict() method, but let's say someone was using to_numpy() instead. In that case, following the same procedure as above does lead to a larger diff, and in fact you can only run the same workflow on this branch, while on main you don't have the structured NumPy array to support the indexing scheme without a dataframe conversion.

Traceback (most recent call last):
  File "/home/tyler/LANL/rough_work/darshan/issue_779/shane_comment.py", line 20, in <module>
    main()
  File "/home/tyler/LANL/rough_work/darshan/issue_779/shane_comment.py", line 16, in main
    print(rec['read_segments']['offset'][2])
TypeError: list indices must be integers or slices, not str

If I remove those prints and just focus on the data structure diff itself:

git diff --no-index result_main_numpy.txt result_branch_numpy.txt

diff --git a/result_main_numpy.txt b/result_branch_numpy.txt
index b99cb2a..9cc6779 100644
--- a/result_main_numpy.txt
+++ b/result_branch_numpy.txt
@@ -1,4 +1,22 @@
 <class 'list'>
 624
 <class 'dict'>
-{'id': 3748348667804457392, 'rank': 0, 'hostname': 'nid03824', 'write_count': 0, 'read_count': 17, 'write_segments': [], 'read_segments': [{'offset': 0, 'length': 8, 'start_time': 0.18351030349731445, 'end_time': 0.3207380771636963}, {'offset': 0, 'length': 262144, 'start_time': 0.3642547130584717, 'end_time': 0.37465500831604004}, {'offset': 3072, 'length': 5376, 'start_time': 0.3936131000518799, 'end_time': 0.39368557929992676}, {'offset': 8448, 'length': 260, 'start_time': 0.40216064453125, 'end_time': 0.40218329429626465}, {'offset': 120924, 'length': 260, 'start_time': 0.42241454124450684, 'end_time': 0.42243242263793945}, {'offset': 233400, 'length': 5376, 'start_time': 0.42275452613830566, 'end_time': 0.4228091239929199}, {'offset': 238776, 'length': 3900, 'start_time': 0.4230048656463623, 'end_time': 0.4230201244354248}, {'offset': 1925916, 'length': 3900, 'start_time': 0.4944796562194824, 'end_time': 0.4967076778411865}, {'offset': 3613056, 'length': 5376, 'start_time': 0.5664935111999512, 'end_time': 0.5699923038482666}, {'offset': 3618432, 'length': 2600, 'start_time': 0.5713236331939697, 'end_time': 0.5719923973083496}, {'offset': 4743192, 'length': 2600, 'start_time': 0.6257259845733643, 'end_time': 0.6277532577514648}, {'offset': 5867952, 'length': 5376, 'start_time': 0.6880097389221191, 'end_time': 0.6906044483184814}, {'offset': 5873328, 'length': 2340, 'start_time': 0.6919851303100586, 'end_time': 0.6927640438079834}, {'offset': 6885612, 'length': 2340, 'start_time': 0.7706985473632812, 'end_time': 0.7714436054229736}, {'offset': 7897896, 'length': 5376, 'start_time': 0.8088712692260742, 'end_time': 0.8117971420288086}, {'offset': 7903272, 'length': 4420, 'start_time': 0.8130068778991699, 'end_time': 0.8138353824615479}, {'offset': 9815364, 'length': 4420, 'start_time': 0.8750150203704834, 'end_time': 0.8769500255584717}]}
+{'id': 3748348667804457392, 'rank': 0, 'hostname': 'nid03824', 'write_count': 0, 'read_count': 17, 'write_segments': array([],
+      dtype=[('offset', '<i8'), ('length', '<i8'), ('start_time', '<f8'), ('end_time', '<f8')]), 'read_segments': array([(      0,      8, 0.1835103 , 0.32073808),
+       (      0, 262144, 0.36425471, 0.37465501),
+       (   3072,   5376, 0.3936131 , 0.39368558),
+       (   8448,    260, 0.40216064, 0.40218329),
+       ( 120924,    260, 0.42241454, 0.42243242),
+       ( 233400,   5376, 0.42275453, 0.42280912),
+       ( 238776,   3900, 0.42300487, 0.42302012),
+       (1925916,   3900, 0.49447966, 0.49670768),
+       (3613056,   5376, 0.56649351, 0.5699923 ),
+       (3618432,   2600, 0.57132363, 0.5719924 ),
+       (4743192,   2600, 0.62572598, 0.62775326),
+       (5867952,   5376, 0.68800974, 0.69060445),
+       (5873328,   2340, 0.69198513, 0.69276404),
+       (6885612,   2340, 0.77069855, 0.77144361),
+       (7897896,   5376, 0.80887127, 0.81179714),
+       (7903272,   4420, 0.81300688, 0.81383538),
+       (9815364,   4420, 0.87501502, 0.87695003)],
+      dtype=[('offset', '<i8'), ('length', '<i8'), ('start_time', '<f8'), ('end_time', '<f8')])}

There the diff in data structure is more substantial. The change still looks like it is in a positive direction, I think, and even allows you to use the same "indexing scheme" as to_df() because of the new recarray data structure, but it certainly is a change in this case.

So, while the changes are positive IMO, they are changes nonetheless--I'm not sure how, e.g., the NERSC folks interact with this data.

hammad45

Great work @tylerjereddy. I have tested the changes on the log files mentioned and I can see a considerable performance improvement in generating the Darshan Report. All the reports are now being generated on my laptop without getting killed.

However, the structuring issue with the to_numpy() and to_df() functions is still there. The output of the to_df() function is the same as before; however, the output of the to_numpy() function has changed. Currently, I get a dictionary within a list for each rank when I call report.records['DXT_POSIX'].to_numpy(). For the write and read segment keys, the value is an array containing the offset, length, start_time, and end_time. This structure makes it awkward to convert the output into a dataframe, because we have an array within a list; we still have to perform an aggregation operation to get a dataframe out of it.

tylerjereddy (Collaborator, Author)

@hammad45 Can you help me understand why you're using to_numpy() instead of to_df() if you want a dataframe as your final data structure? My impression is that there aren't many good reasons to prefer to_numpy() from a performance perspective.

Perhaps if you explain what final data structure you'd like, we can get a better sense of the issue. For example: "I want a dataframe where each row is a different DXT segment and the columns are A, B, C, D, and I can't use to_df() because..."

hammad45

So basically I was using the to_df() function before. Its output is a list with one dictionary per rank, where each dictionary has the following keys:

[{'id', 'rank', 'hostname', 'write_count', 'read_count',
  'write_segments': df[offset, length, start_time, end_time],
  'read_segments':  df[offset, length, start_time, end_time]}]

Notice that the write_segments and the read_segments keys point to a dataframe which stores the offset, length, start_time, end_time for each segment. So basically the to_df() function is returning a dataframe within a dictionary within a list. This structure makes it complicated to use the output as a dataframe.

However, if the output only contained a dictionary within the list, we could easily use it as a dataframe. Such an output would have a dictionary with the following keys:

[{'id' , 'rank', 'hostname', 'write_count', 'read_count', 'operation', 'offset' , 'length' , 'start_time' , 'end_time'}]

Notice that the read and write segments are replaced with the columns of the dataframes they pointed to, and id, rank, hostname, write_count, read_count are repeated for each segment. The operation would be either read or write.

tylerjereddy (Collaborator, Author)

It sounds like you're basically asking for a single rectangular data structure where each row is a DXT segment + associated extra data. I know I've discussed this with Phil, Shane, and Jakob previously. I believe using a native dictionary might be quite memory inefficient compared to a numpy recarray/pandas dataframe.

One problem with trying to produce a single dataframe where each row is effectively a DXT "record" + associated data is that we still have to accumulate and aggregate the data under the hood: we only retrieve one log record at a time from the log file, and each record may contain N DXT segments, so we end up needing to unpack and unroll each record and then stack into the rectangular data structure as we go. We also don't know how many DXT segments we'll have up front, so we can't pre-allocate a nice C buffer to rapidly fill with DXT segments as they are read in and unrolled.
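
A rough sketch of the unroll-and-stack flattening described above, assuming each record's read_segments/write_segments are pandas DataFrames as returned by to_df() (illustrative only, not code from this branch):

```
import pandas as pd


def flatten_dxt_records(recs):
    # recs: list of dicts as returned by report.records["DXT_POSIX"].to_df()
    pieces = []
    for rec in recs:
        for op in ("read_segments", "write_segments"):
            seg_df = rec[op]
            if len(seg_df) == 0:
                continue
            seg_df = seg_df.copy()
            # repeat the per-record metadata on every segment row
            seg_df["id"] = rec["id"]
            seg_df["rank"] = rec["rank"]
            seg_df["hostname"] = rec["hostname"]
            seg_df["operation"] = "read" if op == "read_segments" else "write"
            pieces.append(seg_df)
    if not pieces:
        return pd.DataFrame()
    # one rectangular frame: each row is a single DXT segment
    return pd.concat(pieces, ignore_index=True)
```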

A few ideas I've heard of to improve the situation:

  • ability to read multiple records from a darshan log file: logutils API function to retrieve multiple records in one call #497
  • possibly recording the number of DXT segments stored in a log file somehow (the ability to pre-allocate memory for the rectangular data structure would be quite useful; sketched after this list)
  • Phil has discussed the possibility of a different log format that could be converted to for special handling of DXT segments, so that we don't run out of memory by automatically building a single huge rectangular data structure
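
And, for the second bullet, a hypothetical sketch of what pre-allocation could look like if a total segment count were ever recorded in the log (total_segments and record_iter are made-up names, not existing API):

```
import numpy as np

seg_dtype = np.dtype([("offset", "<i8"), ("length", "<i8"),
                      ("start_time", "<f8"), ("end_time", "<f8")])


def gather_segments(total_segments, record_iter):
    # total_segments: hypothetical up-front count from the log
    # record_iter: yields one structured array of segments per record
    out = np.empty(total_segments, dtype=seg_dtype)
    cursor = 0
    for segs in record_iter:
        out[cursor:cursor + len(segs)] = segs
        cursor += len(segs)
    return out[:cursor]
```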

Linked issue that this PR may close: PyDarshan: Killed when creating Darshan report for very large darshan logs (gh-779)