WIP, ENH: memory efficient DXT segs #784

Open: tylerjereddy wants to merge 2 commits into main.

Conversation

tylerjereddy (Collaborator)

Fixes #779

  • at the moment on main, DXT record data is effectively
    stored as a list of dictionaries of lists of dictionaries
    that look like this:
DXT_list -> [rec0, rec1, ..., recN]
recN -> {"id":, ...,
          "rank":, ...,
          "write_segments": ...,
          ...}
recN["write_segments"] -> [seg0, seg1, ..., segN]
segN -> {"offset": int,
         "length": int,
         "start_time": float,
         "end_time": float}
  • the list of segments is extremely memory inefficient, with
    the smallest file in the matching issue exceeding 20 GB of
    physical memory in mod_read_all_dxt_records (a sketch for
    reproducing this kind of profile follows this list):
Line #    Mem usage    Increment  Occurrences   Line Contents
   852                                                 # fetch records
   853   92.484 MiB   18.820 MiB           1           rec = backend.log_get_dxt_record(self.log, mod, dtype=dtype)
   854 20295.188 MiB    0.773 MiB        1025           while rec != None:
   855 20295.188 MiB    0.000 MiB        1024               self.records[mod].append(rec)
   856 20295.188 MiB    0.000 MiB        1024               self.data['modules'][mod]['num_records'] += 1
   857
   858                                                     # fetch next
   859 20295.188 MiB 20201.930 MiB        1024               rec = backend.log_get_dxt_record(self.log, mod, reads=reads, writes=writes, dtype=dtype)
  • if we switch to NumPy arrays the memory footprint drops
    substantially (see below), and performance informally seems
    similar (36 seconds vs. 33 seconds on main to produce a
    report object with the smallest file in the matching issue):
Line #    Mem usage    Increment  Occurrences   Line Contents
   859 3222.547 MiB 3146.344 MiB        1024               rec = backend.log_get_dxt_record(self.log, mod, reads=reads, writes=writes, dtype=dtype)
  • this branch currently uses NumPy record arrays, because I
    thought they'd be a better fit for a data structure with 2 int
    columns and 2 float columns; however, there is a big performance
    hit compared to regular NumPy arrays (almost 6 minutes vs. 33
    seconds for the smallest file in the matching issue); so, if we
    could live without the extra dtype structuring of a recarray,
    maybe that would be best (we could also try a pandas dataframe,
    which is another natural fit for dtype columns)
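
For context, the line-by-line numbers quoted above are in the format emitted by memory_profiler. A minimal sketch of how a similar profile could be collected (assuming memory_profiler is installed; the log path is a placeholder, and the original numbers came from decorating mod_read_all_dxt_records inside pydarshan rather than a wrapper like this one):

```
# sketch only: line-by-line memory profile of report creation,
# using memory_profiler's @profile decorator
# run with: python -m memory_profiler this_script.py
from memory_profiler import profile

import darshan


@profile
def build_report():
    # placeholder log path; read_all=True pulls in the DXT records,
    # which is where the per-segment memory cost shows up
    return darshan.DarshanReport("some_dxt_log.darshan", read_all=True)


if __name__ == "__main__":
    build_report()
```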

arr_write[i, ...] = (segments[i].offset,
                     segments[i].length,
                     segments[i].start_time,
                     segments[i].end_time)
tylerjereddy (Collaborator, Author)

@carns @shanedsnyder @jakobluettgau I feel like there should be something much cleaner/faster here. Since we can iterate over segments in Python I feel like we should be able to stride over the segment_info buffer in some kind of organized way. Each segment appears to be fixed size, with two ints and two floats, right?

I played a bit with np.frombuffer() earlier today, but didn't quite get it to work. I'm not sure if this is pushing up against gh-497 already, but I don't think so, since I'm only talking about the segments in a given record, rather than all the records in a log file.

If you know off the top of your head how the C memory layout works for the segments from struct segment_info *, that would be helpful to know. I'm sure it can be dug up from the source, of course.

tylerjereddy (Collaborator, Author)

This all assumes that we can live with changing the DXT segment storage format, but I think it is pretty clear that the memory footprint of thousands of pure Python objects isn't going to work.

Contributor

We discussed this live the other day, but belatedly for posterity, here is the definition of the binary format for the dxt data in the logs:

https://github.com/darshan-hpc/darshan/blob/main/include/darshan-dxt-log-format.h#L19

Each segment is indeed a fixed size (all 64 bit values; there shouldn't be any padding in the struct) and the segments are laid out linearly in the log.
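
Given that layout, one option (a sketch, not code from this PR) would be to map the raw segment buffer straight into a structured NumPy array with np.frombuffer, assuming the CFFI pointer for the struct segment_info array can be exposed as a buffer and that the field order matches darshan-dxt-log-format.h:

```
import numpy as np

# dtype mirroring struct segment_info: two 64-bit ints followed by two
# 64-bit floats, assuming no padding and little-endian byte order
seg_dtype = np.dtype([("offset", "<i8"),
                      ("length", "<i8"),
                      ("start_time", "<f8"),
                      ("end_time", "<f8")])


def segments_from_buffer(raw_bytes, count):
    # raw_bytes is assumed to be the bytes backing the C array of
    # `count` segments (e.g. obtained via ffi.buffer(ptr, count * 32))
    arr = np.frombuffer(raw_bytes, dtype=seg_dtype, count=count)
    # frombuffer returns a read-only view; .copy() detaches it from
    # the underlying log buffer if it needs to outlive that memory
    return arr.copy()
```

With that dtype, both segs["offset"][n] and segs[n]["offset"] style lookups work on the resulting structured array.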

arr_read = np.recarray(rcnt, dtype=[("offset", int),
                                    ("length", int),
                                    ("start_time", float),
                                    ("end_time", float)])
tylerjereddy (Collaborator, Author)

I really liked the idea of using recarray because it retains the original metadata in the structured dtype, but it seems to carry a huge performance overhead compared to conventional NumPy float64 arrays.

I believe the test suite didn't actually complain much about storing all the data as, say, float64, though that is something we should consider (is the int->float casting safe in general, etc.).
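
For comparison (a sketch, not code from this branch): a plain structured ndarray keeps the named int/float columns without going through np.recarray, a recarray-style view can be layered on top only where attribute access is wanted, and the int->float question mostly comes down to float64 representing integers exactly only up to 2**53:

```
import numpy as np

seg_dtype = np.dtype([("offset", np.int64),
                      ("length", np.int64),
                      ("start_time", np.float64),
                      ("end_time", np.float64)])

# plain structured array: named, typed columns without recarray machinery
arr = np.zeros(3, dtype=seg_dtype)
arr["offset"] = [0, 3072, 8448]   # example values only
arr["length"] = [8, 5376, 260]

# attribute-style access as a cheap view where it is convenient
rec_view = arr.view(np.recarray)
assert rec_view.offset[1] == arr["offset"][1]

# storing offsets/lengths as float64 instead stays exact only up to
# 2**53, which is the caveat behind the int->float casting question
print(float(2**53) == 2**53)          # True
print(float(2**53 + 1) == 2**53 + 1)  # False
```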

tylerjereddy (Collaborator, Author) commented Jul 27, 2022

We may also want to try a pandas DataFrame. There's a bit of a design issue in that we have both to_numpy() and to_df() methods, one of which may become redundant depending on what we decide here, and to_dict() is likely a bad idea based on the memory profiling.
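
If the dataframe route were explored, pandas can build its columns directly from a structured array, so a to_df()-style conversion could stay thin; a minimal sketch under that assumption:

```
import numpy as np
import pandas as pd

seg_dtype = np.dtype([("offset", "<i8"), ("length", "<i8"),
                      ("start_time", "<f8"), ("end_time", "<f8")])
segments = np.zeros(4, dtype=seg_dtype)

# the field names become the offset/length/start_time/end_time columns
df = pd.DataFrame(segments)
print(df.dtypes)
```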

tylerjereddy (Collaborator, Author)

After my revisions this afternoon, I think this may now be far better than main for both memory and performance--see below. Note that generating the HTML summaries is a separate matter; this is just to improve the memory footprint and performance of report generation with DXT data, per the matching issue.

Table 1: time to generate the report object for logs provided by the NERSC team in gh-779

| Branch           | dbwy_....darshan | run_...457...darshan | run_...458...darshan |
|------------------|------------------|----------------------|----------------------|
| main             | 33 s             | 1 minute, 8 s        | 1 minute, 8 s        |
| treddy_issue_779 | 4 s              | 7.5 s                | 8.4 s                |

I should note that I ran the tests on this branch on a slower machine with less memory as well (my laptop), which probably makes the numbers even more dramatic, though folks should confirm of course.

Our benchmark suite is neutral on this feature branch, which probably means we need to expand coverage there a bit, but that can perhaps be dealt with separately. It could also be a matter of basically needing very large files to detect the original issue.
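
For reference, the Table 1 numbers are informal wall-clock timings; a minimal sketch of that kind of measurement, assuming one of the gh-779 logs is available locally (the path below is a placeholder):

```
import time

import darshan

start = time.perf_counter()
# read_all=True parses the DXT records, which is the expensive step
# being compared between main and this branch
report = darshan.DarshanReport("some_gh779_log.darshan", read_all=True)
print(f"report generation took {time.perf_counter() - start:.1f} s")
```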

carns (Contributor) commented Aug 4, 2022

Wow, thanks @tylerjereddy , that's a huge difference!

shanedsnyder (Contributor)

Just to make sure I understand current state here, do your most recent revisions give you both the "good" performance and the structured layout we want (e.g., you can still do rec['write_segments'][n]['offset'], and expect the type to be an int)?

If so, then for end users it seems that even though we changed the segments from a list of dictionaries to a numpy record array, the interface is basically the same, just with better memory usage? That would be a slam dunk.

Commits pushed to the branch:

* more efficient `log_get_dxt_record()` by reading directly from the
  C-contiguous segment buffer into NumPy recarrays
* simplify the changes to `test_dxt_records()`

tylerjereddy (Collaborator, Author)

@shanedsnyder as far as I can tell, that call signature was never possible, but if I adjust to, e.g., rec['read_segments']['offset'][n] (i.e., flip the column name and row index order), then I get the same result on main and on this feature branch per the sample code below. I think the "diff" you'll see in the results below comes from the fact that column names get added on this branch even when a dataframe is otherwise empty, but not on main. That's probably an improvement as well, though largely irrelevant, I suspect.

import darshan

def main():
    # spot check a few DXT_POSIX records
    recs_to_check = [1, 17, 289]
    report = darshan.DarshanReport("e3sm_io_heatmap_and_dxt.darshan",
                                   read_all=True)
    recs = report.records["DXT_POSIX"].to_df()
    print(type(recs))
    print(len(recs))
    print(type(recs[0]))
    print(recs[0])
    # Shane's query re: consistency:
    for rec_index in recs_to_check:
        rec = recs[rec_index]
        print(rec['read_segments']['offset'][2])


if __name__ == "__main__":
    main()

Then run the script on both branches and diff the text output: git diff --no-index result_main.txt result_branch.txt

diff --git a/result_main.txt b/result_branch.txt
index 5f7df81..a83fb7d 100644
--- a/result_main.txt
+++ b/result_branch.txt
@@ -2,7 +2,7 @@
 624
 <class 'dict'>
 {'id': 3748348667804457392, 'rank': 0, 'hostname': 'nid03824', 'write_count': 0, 'read_count': 17, 'write_segments': Empty DataFrame
-Columns: []
+Columns: [offset, length, start_time, end_time]
 Index: [], 'read_segments':      offset  length  start_time  end_time
 0         0       8    0.183510  0.320738
 1         0  262144    0.364255  0.374655

So, I think that is reasonably safe-looking, but you may want to add a more formal regression test of this nature if you're not convinced the current test suite is sufficient (a sketch follows the printout below). Importantly though, this is only for the to_df() method (see notes below).

And, to be clear, sample printout on this branch:

<class 'list'>
624
<class 'dict'>
{'id': 3748348667804457392, 'rank': 0, 'hostname': 'nid03824', 'write_count': 0, 'read_count': 17, 'write_segments': Empty DataFrame
Columns: [offset, length, start_time, end_time]
Index: [], 'read_segments':      offset  length  start_time  end_time
0         0       8    0.183510  0.320738
1         0  262144    0.364255  0.374655
2      3072    5376    0.393613  0.393686
3      8448     260    0.402161  0.402183
4    120924     260    0.422415  0.422432
5    233400    5376    0.422755  0.422809
6    238776    3900    0.423005  0.423020
7   1925916    3900    0.494480  0.496708
8   3613056    5376    0.566494  0.569992
9   3618432    2600    0.571324  0.571992
10  4743192    2600    0.625726  0.627753
11  5867952    5376    0.688010  0.690604
12  5873328    2340    0.691985  0.692764
13  6885612    2340    0.770699  0.771444
14  7897896    5376    0.808871  0.811797
15  7903272    4420    0.813007  0.813835
16  9815364    4420    0.875015  0.876950}
176160768
159383936
179548
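
A minimal sketch of the kind of regression test suggested earlier, assuming the e3sm_io_heatmap_and_dxt.darshan log from the spot check is available to the test suite and using the three offsets printed above as the expected values:

```
import darshan
import pytest


@pytest.mark.parametrize("rec_index, expected_offset", [
    # expected values taken from the spot-check printout above
    (1, 176160768),
    (17, 159383936),
    (289, 179548),
])
def test_dxt_read_segment_offsets(rec_index, expected_offset):
    report = darshan.DarshanReport("e3sm_io_heatmap_and_dxt.darshan",
                                   read_all=True)
    rec = report.records["DXT_POSIX"].to_df()[rec_index]
    assert rec["read_segments"]["offset"][2] == expected_offset
```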

I also rebased this PR while I was at it.

I've never even wanted to use the to_dict() method, but let's say someone was using to_numpy() instead. In that case, following the same procedure as above does lead to a larger diff, and in fact you can only run the same workflow on this branch, while on main you don't have the structured NumPy array to support the indexing scheme without a dataframe conversion.

Traceback (most recent call last):
  File "/home/tyler/LANL/rough_work/darshan/issue_779/shane_comment.py", line 20, in <module>
    main()
  File "/home/tyler/LANL/rough_work/darshan/issue_779/shane_comment.py", line 16, in main
    print(rec['read_segments']['offset'][2])
TypeError: list indices must be integers or slices, not str

If I remove those prints and just focus on the data structure diff itself:

git diff --no-index result_main_numpy.txt result_branch_numpy.txt

diff --git a/result_main_numpy.txt b/result_branch_numpy.txt
index b99cb2a..9cc6779 100644
--- a/result_main_numpy.txt
+++ b/result_branch_numpy.txt
@@ -1,4 +1,22 @@
 <class 'list'>
 624
 <class 'dict'>
-{'id': 3748348667804457392, 'rank': 0, 'hostname': 'nid03824', 'write_count': 0, 'read_count': 17, 'write_segments': [], 'read_segments': [{'offset': 0, 'length': 8, 'start_time': 0.18351030349731445, 'end_time': 0.3207380771636963}, {'offset': 0, 'length': 262144, 'start_time': 0.3642547130584717, 'end_time': 0.37465500831604004}, {'offset': 3072, 'length': 5376, 'start_time': 0.3936131000518799, 'end_time': 0.39368557929992676}, {'offset': 8448, 'length': 260, 'start_time': 0.40216064453125, 'end_time': 0.40218329429626465}, {'offset': 120924, 'length': 260, 'start_time': 0.42241454124450684, 'end_time': 0.42243242263793945}, {'offset': 233400, 'length': 5376, 'start_time': 0.42275452613830566, 'end_time': 0.4228091239929199}, {'offset': 238776, 'length': 3900, 'start_time': 0.4230048656463623, 'end_time': 0.4230201244354248}, {'offset': 1925916, 'length': 3900, 'start_time': 0.4944796562194824, 'end_time': 0.4967076778411865}, {'offset': 3613056, 'length': 5376, 'start_time': 0.5664935111999512, 'end_time': 0.5699923038482666}, {'offset': 3618432, 'length': 2600, 'start_time': 0.5713236331939697, 'end_time': 0.5719923973083496}, {'offset': 4743192, 'length': 2600, 'start_time': 0.6257259845733643, 'end_time': 0.6277532577514648}, {'offset': 5867952, 'length': 5376, 'start_time': 0.6880097389221191, 'end_time': 0.6906044483184814}, {'offset': 5873328, 'length': 2340, 'start_time': 0.6919851303100586, 'end_time': 0.6927640438079834}, {'offset': 6885612, 'length': 2340, 'start_time': 0.7706985473632812, 'end_time': 0.7714436054229736}, {'offset': 7897896, 'length': 5376, 'start_time': 0.8088712692260742, 'end_time': 0.8117971420288086}, {'offset': 7903272, 'length': 4420, 'start_time': 0.8130068778991699, 'end_time': 0.8138353824615479}, {'offset': 9815364, 'length': 4420, 'start_time': 0.8750150203704834, 'end_time': 0.8769500255584717}]}
+{'id': 3748348667804457392, 'rank': 0, 'hostname': 'nid03824', 'write_count': 0, 'read_count': 17, 'write_segments': array([],
+      dtype=[('offset', '<i8'), ('length', '<i8'), ('start_time', '<f8'), ('end_time', '<f8')]), 'read_segments': array([(      0,      8, 0.1835103 , 0.32073808),
+       (      0, 262144, 0.36425471, 0.37465501),
+       (   3072,   5376, 0.3936131 , 0.39368558),
+       (   8448,    260, 0.40216064, 0.40218329),
+       ( 120924,    260, 0.42241454, 0.42243242),
+       ( 233400,   5376, 0.42275453, 0.42280912),
+       ( 238776,   3900, 0.42300487, 0.42302012),
+       (1925916,   3900, 0.49447966, 0.49670768),
+       (3613056,   5376, 0.56649351, 0.5699923 ),
+       (3618432,   2600, 0.57132363, 0.5719924 ),
+       (4743192,   2600, 0.62572598, 0.62775326),
+       (5867952,   5376, 0.68800974, 0.69060445),
+       (5873328,   2340, 0.69198513, 0.69276404),
+       (6885612,   2340, 0.77069855, 0.77144361),
+       (7897896,   5376, 0.80887127, 0.81179714),
+       (7903272,   4420, 0.81300688, 0.81383538),
+       (9815364,   4420, 0.87501502, 0.87695003)],
+      dtype=[('offset', '<i8'), ('length', '<i8'), ('start_time', '<f8'), ('end_time', '<f8')])}

There the diff in data structure is more substantial. The change still looks like it is in a positive direction, I think, and even allows you to use the same "indexing scheme" as to_df() because of the new recarray data structure, but it certainly is a change in this case.

So, while the changes are positive IMO, they are changes nonetheless--I'm not sure how, e.g., the NERSC folks interact with this data.

hammad45

Great work @tylerjereddy. I have tested the changes on the log files mentioned and I can see a considerable performance improvement in generating the Darshan Report. All the reports are now being generated on my laptop without getting killed.

However, the structuring issue with the to_numpy() and to_df() functions is still there. The output of the to_df() function is the same as before; however, the output of the to_numpy() function has changed. Currently, I get a dictionary within a list for each rank when I call report.records['DXT_POSIX'].to_numpy(). For the write and read segment keys, the value is an array containing the offset, length, start_time, and end_time. This structure makes it awkward to convert the output into a dataframe, because we have an array within a list; we still have to perform an aggregation operation to get a dataframe out of it.

tylerjereddy (Collaborator, Author)

@hammad45 Can you help me understand why you're using to_numpy() instead of to_df() if you want a dataframe as your final data structure? My impression is that there aren't many good reasons to prefer to_numpy() from a performance perspective.

Perhaps if you explain what final data structure you'd like, we can get a better sense of the issue. For example: "I want a dataframe where each row is a different DXT segment and the columns are A, B, C, D, and I can't use to_df() because..."

hammad45

So basically I was using the to_df() function before. Its output is a list with one dictionary per rank, where each dictionary has the following keys:

[{'id', 'rank', 'hostname', 'write_count', 'read_count',
  'write_segments': df[offset, length, start_time, end_time],
  'read_segments':  df[offset, length, start_time, end_time]}]

Notice that the write_segments and the read_segments keys point to a dataframe which stores the offset, length, start_time, end_time for each segment. So basically the to_df() function is returning a dataframe within a dictionary within a list. This structure makes it complicated to use the output as a dataframe.

However, if the output only contained a dictionary within the list, we could easily use it as a dataframe. Such an output would have a dictionary with the following keys:

[{'id' , 'rank', 'hostname', 'write_count', 'read_count', 'operation', 'offset' , 'length' , 'start_time' , 'end_time'}]

Notice that the read and write segments are replaced with the columns of the dataframes they pointed to, and id, rank, hostname, write_count, read_count are repeated for each segment. The operation would be either read or write.

tylerjereddy (Collaborator, Author)

It sounds like you're basically asking for a single rectangular data structure where each row is a DXT segment + associated extra data. I know I've discussed this with Phil, Shane, and Jakob previously. I believe using a native dictionary might be quite memory inefficient compared to a numpy recarray/pandas dataframe.

One problem with trying to produce a single dataframe where each row is effectively a DXT "record" + associated data is that we still have to accumulate and aggregate the data under the hood: we only retrieve one log record at a time from the log file, and each record may contain N DXT segments, so we end up needing to unpack and unroll each record and then stack into the rectangular data structure as we go. We also don't know how many DXT segments we'll have up front, so we can't pre-allocate a nice C buffer to rapidly fill with DXT segments as they are read in and unrolled.
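
A rough sketch of the unroll-and-stack flattening described above, assuming each record's read_segments/write_segments are pandas DataFrames as returned by to_df() (illustrative only, not code from this branch):

```
import pandas as pd


def flatten_dxt_records(recs):
    # recs: list of dicts as returned by report.records["DXT_POSIX"].to_df()
    pieces = []
    for rec in recs:
        for op in ("read_segments", "write_segments"):
            seg_df = rec[op]
            if len(seg_df) == 0:
                continue
            seg_df = seg_df.copy()
            # repeat the per-record metadata on every segment row
            seg_df["id"] = rec["id"]
            seg_df["rank"] = rec["rank"]
            seg_df["hostname"] = rec["hostname"]
            seg_df["operation"] = "read" if op == "read_segments" else "write"
            pieces.append(seg_df)
    if not pieces:
        return pd.DataFrame()
    # one rectangular frame: each row is a single DXT segment
    return pd.concat(pieces, ignore_index=True)
```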

A few ideas I've heard of to improve the situation:

  • ability to read multiple records from a darshan log file: logutils API function to retrieve multiple records in one call #497
  • possibly recording the number of DXT segments stored in a log file somehow (the ability to pre-allocate memory for the rectangular data structure would be quite useful; sketched after this list)
  • Phil has discussed the possibility of a different log format that could be converted to for special handling of DXT segments, so that we don't run out of memory by automatically building a single huge rectangular data structure
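
And, for the second bullet, a hypothetical sketch of what pre-allocation could look like if a total segment count were ever recorded in the log (total_segments and record_iter are made-up names, not existing API):

```
import numpy as np

seg_dtype = np.dtype([("offset", "<i8"), ("length", "<i8"),
                      ("start_time", "<f8"), ("end_time", "<f8")])


def gather_segments(total_segments, record_iter):
    # total_segments: hypothetical up-front count from the log
    # record_iter: yields one structured array of segments per record
    out = np.empty(total_segments, dtype=seg_dtype)
    cursor = 0
    for segs in record_iter:
        out[cursor:cursor + len(segs)] = segs
        cursor += len(segs)
    return out[:cursor]
```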

Linked issue that this PR may close: PyDarshan: Killed when creating Darshan report for very large darshan logs (gh-779)