WIP, ENH: memory efficient DXT segs #784
base: main
Conversation
arr_write[i, ...] = (segments[i].offset,
                     segments[i].length,
                     segments[i].start_time,
                     segments[i].end_time)
@carns @shanedsnyder @jakobluettgau I feel like there should be something much cleaner/faster here. Since we can iterate over `segments` in Python, we should be able to stride over the `segment_info` buffer in some kind of organized way. Each segment appears to be fixed size, with two ints and two floats, right?

I played a bit with `np.frombuffer()` earlier today, but didn't quite get it to work. I'm not sure if this is pushing up against gh-497 already, but I don't think so, since I'm only talking about the segments in a given record, rather than all the records in a log file.

If you know off the top of your head how the C memory layout works for the segments behind `struct segment_info *`, that would be helpful to know. I'm sure it can be dug up from the source, of course.
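For what it's worth, here is a minimal sketch of the `np.frombuffer()` idea (my own illustration, not code from this branch). It assumes the layout really is two little-endian 64-bit integers followed by two 64-bit floats with no padding; `raw`, `count`, and `segments_from_bytes` are hypothetical names:

```python
import numpy as np

# assumed layout: two little-endian 64-bit ints followed by two doubles
seg_dtype = np.dtype([("offset", "<i8"), ("length", "<i8"),
                      ("start_time", "<f8"), ("end_time", "<f8")])
assert seg_dtype.itemsize == 32  # sanity check: no padding expected


def segments_from_bytes(raw, count):
    # zero-copy view over the packed segment buffer; no per-segment
    # Python objects are created
    return np.frombuffer(raw, dtype=seg_dtype, count=count)
```

A quick round-trip check is `segments_from_bytes(np.zeros(3, dtype=seg_dtype).tobytes(), 3)`.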
This all assumes that we can live with changing the DXT segment storage format, but I think it is pretty clear that the memory overhead of thousands of pure Python objects isn't going to work.
We discussed this live the other day, but belatedly for posterity, here is the definition of the binary format for the dxt data in the logs:
https://github.com/darshan-hpc/darshan/blob/main/include/darshan-dxt-log-format.h#L19
Each segment is indeed a fixed size (all 64 bit values; there shouldn't be any padding in the struct) and the segments are laid out linearly in the log.
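As a follow-on sketch (again my own, not code from the branch): given the fixed 32-byte layout described above, a cffi pointer to the first segment can be viewed directly as a structured NumPy array. Here `seg_ptr` and `seg_count` are hypothetical names for whatever the backend hands back:

```python
import numpy as np
from cffi import FFI

ffi = FFI()

# mirrors the linked struct: all fields are 64-bit, laid out linearly
seg_dtype = np.dtype([("offset", "<i8"), ("length", "<i8"),
                      ("start_time", "<f8"), ("end_time", "<f8")])


def segments_from_cdata(seg_ptr, seg_count):
    # seg_ptr is assumed to be a cffi cdata pointer to the first segment;
    # ffi.buffer exposes the raw C memory of seg_count packed structs and
    # np.frombuffer views it without copying
    nbytes = seg_count * seg_dtype.itemsize
    arr = np.frombuffer(ffi.buffer(seg_ptr, nbytes), dtype=seg_dtype)
    # the view is only valid while the C memory stays alive, so copy()
    # if the record buffer gets freed after the call
    return arr.copy()
```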
arr_read = np.recarray(rcnt, dtype=[("offset", int),
                                    ("length", int),
                                    ("start_time", float),
                                    ("end_time", float)])
I really liked the idea of using `recarray` because it kind of retains the original metadata in the structured dtype, but it seems that this carries a huge performance overhead compared to conventional NumPy `float64` arrays.

I believe the test suite actually didn't complain too much about storing all the data as, say, `float64`, though this would be something we should consider (is the int->float casting safe in general, etc.).
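As a small aside on the casting question (my own illustration, not from the PR): `int64` values above 2**53 cannot be represented exactly as `float64`, so a blanket cast of offsets/lengths is only lossless for smaller values:

```python
import numpy as np

big_offset = np.int64(2**53 + 1)    # representable exactly as int64
as_float = np.float64(big_offset)   # rounds to the nearest float64
print(int(as_float) == big_offset)  # False: the cast silently loses precision
```

Offsets that large (8 PiB) are unlikely in practice, but it seems worth deciding explicitly.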
We may also want to try a pandas `DataFrame`. There's a bit of a design issue where we have both `to_numpy()` and `to_df()` methods, and one of those may become redundant depending on what we decide here; `to_dict()` is likely a bad idea based on the memory profiling.
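For reference, a structured array like the one built in this PR converts cleanly to a pandas `DataFrame`, with column names and dtypes taken from the structured dtype, so the `to_df()` path could be layered on top of the same arrays (a sketch with placeholder data, not code from the branch):

```python
import numpy as np
import pandas as pd

seg_dtype = np.dtype([("offset", "<i8"), ("length", "<i8"),
                      ("start_time", "<f8"), ("end_time", "<f8")])
segments = np.zeros(3, dtype=seg_dtype)  # placeholder segment data

# pandas builds the columns directly from the structured dtype fields
df = pd.DataFrame(segments)
print(df.columns.tolist())  # ['offset', 'length', 'start_time', 'end_time']
print(df.dtypes.tolist())   # two int64 columns, two float64 columns
```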
After my revisions this afternoon, I think this may be far better than Table 1 ("time to generate").
I should note that I ran the tests on this branch on a slower machine with less memory as well (my laptop), which probably makes the numbers even more dramatic, though folks should confirm of course. Our benchmark suite is neutral on this feature branch, which probably means we need to expand coverage there a bit, but that can perhaps be dealt with separately. It could also be a matter of basically needing very large files to detect the original issue.
Wow, thanks @tylerjereddy, that's a huge difference!
Just to make sure I understand the current state here: do your most recent revisions give you both the "good" performance and the structured layout we want (e.g., you can still index the segments by field name)? If so, then for end users it seems that even though we changed the segments from a list of dictionaries to a NumPy record array, the interface is basically the same, just with better memory usage? That would be a slam dunk.
Fixes darshan-hpc#779

* at the moment on `main`, DXT record data is effectively stored as a list of dictionaries of lists of dictionaries that look like this:

```
DXT_list -> [rec0, rec1, ..., recN]
recN -> {"id":, ..., "rank":, ..., "write_segments": ..., ...}
recN["write_segments"] -> [seg0, seg1, ..., segN]
segN -> {"offset": int, "length": int, "start_time": float, "end_time": float}
```

- the list of segments is extremely memory inefficient, with the smallest file in the matching issue exceeding 20 GB of physical memory in `mod_read_all_dxt_records`:

```
Line #    Mem usage      Increment   Occurrences   Line Contents
   852                                             # fetch records
   853    92.484 MiB     18.820 MiB            1   rec = backend.log_get_dxt_record(self.log, mod, dtype=dtype)
   854  20295.188 MiB     0.773 MiB         1025   while rec != None:
   855  20295.188 MiB     0.000 MiB         1024       self.records[mod].append(rec)
   856  20295.188 MiB     0.000 MiB         1024       self.data['modules'][mod]['num_records'] += 1
   857
   858                                             # fetch next
   859  20295.188 MiB  20201.930 MiB         1024   rec = backend.log_get_dxt_record(self.log, mod, reads=reads, writes=writes, dtype=dtype)
```

- if we switch to NumPy arrays the memory footprint drops a lot (see below), and the performance informally seems similar (36 seconds vs. 33 seconds on `main` to produce a `report` object with the smallest file in the matching issue):

```
Line #    Mem usage      Increment   Occurrences   Line Contents
   859   3222.547 MiB   3146.344 MiB        1024   rec = backend.log_get_dxt_record(self.log, mod, reads=reads, writes=writes, dtype=dtype)
```

- this branch currently uses NumPy record arrays, because I thought they'd be a better fit for a data structure with 2 int columns and 2 float columns; however, there is a big performance hit over regular NumPy arrays (almost 6 minutes vs. 33 seconds for the smallest file in the matching issue); so, if we could live without the extra dtype structuring of a recarray, maybe that would be best (we could also try to use a pandas dataframe, which is another natural fit for dtype columns..)
* more efficient `log_get_dxt_record()` by reading directly from the C-contiguous segment buffer into NumPy recarrays
* simplify the changes to `test_dxt_records()`
Force-pushed from 0aabca0 to 3d3af72.
@shanedsnyder as far as I can tell, that call signature was never possible, but if I adjust to the call used in the script below:

```python
import darshan


def main():
    # spot check a few DXT_POSIX records
    recs_to_check = [1, 17, 289]
    report = darshan.DarshanReport("e3sm_io_heatmap_and_dxt.darshan",
                                   read_all=True)
    recs = report.records["DXT_POSIX"].to_df()
    print(type(recs))
    print(len(recs))
    print(type(recs[0]))
    print(recs[0])
    # Shane's query re: consistency:
    for rec_index in recs_to_check:
        rec = recs[rec_index]
        print(rec['read_segments']['offset'][2])


if __name__ == "__main__":
    main()
```

and then run the script on both branches and diff the text output:

```diff
diff --git a/result_main.txt b/result_branch.txt
index 5f7df81..a83fb7d 100644
--- a/result_main.txt
+++ b/result_branch.txt
@@ -2,7 +2,7 @@
 624
 <class 'dict'>
 {'id': 3748348667804457392, 'rank': 0, 'hostname': 'nid03824', 'write_count': 0, 'read_count': 17, 'write_segments': Empty DataFrame
-Columns: []
+Columns: [offset, length, start_time, end_time]
 Index: [], 'read_segments':     offset  length  start_time  end_time
 0        0       8    0.183510  0.320738
 1        0  262144    0.364255  0.374655
```
So, I think that is reasonably safe-looking, but you may want to add a more formal regression test of this nature if you're not convinced the current test suite is sufficient. Importantly though, this is only for the `to_df()` path. And, to be clear, the `+` lines in the diff above show the printout on this branch.
I also rebased this PR while I was at it. I've never even wanted to use the […].
If I remove those prints and just focus on the data structure diff itself:
```diff
diff --git a/result_main_numpy.txt b/result_branch_numpy.txt
index b99cb2a..9cc6779 100644
--- a/result_main_numpy.txt
+++ b/result_branch_numpy.txt
@@ -1,4 +1,22 @@
 <class 'list'>
 624
 <class 'dict'>
-{'id': 3748348667804457392, 'rank': 0, 'hostname': 'nid03824', 'write_count': 0, 'read_count': 17, 'write_segments': [], 'read_segments': [{'offset': 0, 'length': 8, 'start_time': 0.18351030349731445, 'end_time': 0.3207380771636963}, {'offset': 0, 'length': 262144, 'start_time': 0.3642547130584717, 'end_time': 0.37465500831604004}, {'offset': 3072, 'length': 5376, 'start_time': 0.3936131000518799, 'end_time': 0.39368557929992676}, {'offset': 8448, 'length': 260, 'start_time': 0.40216064453125, 'end_time': 0.40218329429626465}, {'offset': 120924, 'length': 260, 'start_time': 0.42241454124450684, 'end_time': 0.42243242263793945}, {'offset': 233400, 'length': 5376, 'start_time': 0.42275452613830566, 'end_time': 0.4228091239929199}, {'offset': 238776, 'length': 3900, 'start_time': 0.4230048656463623, 'end_time': 0.4230201244354248}, {'offset': 1925916, 'length': 3900, 'start_time': 0.4944796562194824, 'end_time': 0.4967076778411865}, {'offset': 3613056, 'length': 5376, 'start_time': 0.5664935111999512, 'end_time': 0.5699923038482666}, {'offset': 3618432, 'length': 2600, 'start_time': 0.5713236331939697, 'end_time': 0.5719923973083496}, {'offset': 4743192, 'length': 2600, 'start_time': 0.6257259845733643, 'end_time': 0.6277532577514648}, {'offset': 5867952, 'length': 5376, 'start_time': 0.6880097389221191, 'end_time': 0.6906044483184814}, {'offset': 5873328, 'length': 2340, 'start_time': 0.6919851303100586, 'end_time': 0.6927640438079834}, {'offset': 6885612, 'length': 2340, 'start_time': 0.7706985473632812, 'end_time': 0.7714436054229736}, {'offset': 7897896, 'length': 5376, 'start_time': 0.8088712692260742, 'end_time': 0.8117971420288086}, {'offset': 7903272, 'length': 4420, 'start_time': 0.8130068778991699, 'end_time': 0.8138353824615479}, {'offset': 9815364, 'length': 4420, 'start_time': 0.8750150203704834, 'end_time': 0.8769500255584717}]}
+{'id': 3748348667804457392, 'rank': 0, 'hostname': 'nid03824', 'write_count': 0, 'read_count': 17, 'write_segments': array([],
+ dtype=[('offset', '<i8'), ('length', '<i8'), ('start_time', '<f8'), ('end_time', '<f8')]), 'read_segments': array([( 0, 8, 0.1835103 , 0.32073808),
+ ( 0, 262144, 0.36425471, 0.37465501),
+ ( 3072, 5376, 0.3936131 , 0.39368558),
+ ( 8448, 260, 0.40216064, 0.40218329),
+ ( 120924, 260, 0.42241454, 0.42243242),
+ ( 233400, 5376, 0.42275453, 0.42280912),
+ ( 238776, 3900, 0.42300487, 0.42302012),
+ (1925916, 3900, 0.49447966, 0.49670768),
+ (3613056, 5376, 0.56649351, 0.5699923 ),
+ (3618432, 2600, 0.57132363, 0.5719924 ),
+ (4743192, 2600, 0.62572598, 0.62775326),
+ (5867952, 5376, 0.68800974, 0.69060445),
+ (5873328, 2340, 0.69198513, 0.69276404),
+ (6885612, 2340, 0.77069855, 0.77144361),
+ (7897896, 5376, 0.80887127, 0.81179714),
+ (7903272, 4420, 0.81300688, 0.81383538),
+ (9815364, 4420, 0.87501502, 0.87695003)],
+ dtype=[('offset', '<i8'), ('length', '<i8'), ('start_time', '<f8'), ('end_time', '<f8')])}
```
There the diff in data structure is more substantial. The change still looks like it is in a positive direction I think, and it even allows you to use the same "indexing scheme" as […].

So, while the changes are positive IMO, they are changes nonetheless--I'm not sure how, e.g., the NERSC folks interact with this data.
Great work @tylerjereddy. I have tested the changes on the log files mentioned and I can see a considerable performance improvement in generating the `report` object. However, the structuring issue with the `to_df()` output remains.
@hammad45 Can you help me understand why you're using […]? Perhaps if you explain what final data structure you'd like, we can get a better sense for the issue. For example, "I want a dataframe that has each row a different DXT segment, and the columns are A, B, C, D, and I can't use […] for this."
So basically I was using the `to_df()` function before. The output of the `to_df()` function is a list with an entry for each rank, where each entry is a dictionary with the following keys: `id`, `rank`, `hostname`, `write_count`, `read_count`, `write_segments`, `read_segments`.

Notice that the `write_segments` and the `read_segments` keys point to a dataframe which stores the `offset`, `length`, `start_time`, and `end_time` for each segment. So basically the `to_df()` function is returning a dataframe within a dictionary within a list. This structure makes it complicated to use the output as a dataframe. However, if we had an output where we only had a dictionary within the list, we could easily use it as a dataframe. Such an output would have a dictionary with the following keys: `id`, `rank`, `hostname`, `write_count`, `read_count`, `offset`, `length`, `start_time`, `end_time`.

Notice that the read and write segments are replaced with the columns of the dataframe they pointed to.
It sounds like you're basically asking for a single rectangular data structure where each row is a DXT segment + associated extra data. I know I've discussed this with Phil, Shane, and Jakob previously. I believe using a native dictionary might be quite memory inefficient compared to a numpy recarray/pandas dataframe.

One problem with trying to produce a single dataframe where each row is effectively a DXT "record" + associated data is that we still have to accumulate the data and aggregate it under the hood--we only retrieve 1 log record at a time from the log file, and each such record may contain N DXT segments, so we kind of end up needing to unpack and unroll, then stack into the rectangular data structure as we go. We also don't know how many DXT segments we'll have up front, so we can't pre-allocate a nice C buffer to rapidly fill up with DXT segments as they are read in and unrolled.

A few ideas I've heard of to improve the situation: […]
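For illustration only, here is one way a user could build the flat per-segment table described above from the current `to_df()` output. `flatten_dxt_read_segments` is a hypothetical user-side helper, not part of pydarshan, and it pays exactly the accumulate-then-concatenate cost discussed in the previous comment:

```python
import darshan
import pandas as pd


def flatten_dxt_read_segments(report):
    # one dict per DXT_POSIX record; 'read_segments' holds a DataFrame
    frames = []
    for rec in report.records["DXT_POSIX"].to_df():
        seg_df = rec["read_segments"].copy()
        # replicate the per-record metadata onto every segment row
        seg_df["id"] = rec["id"]
        seg_df["rank"] = rec["rank"]
        seg_df["hostname"] = rec["hostname"]
        frames.append(seg_df)
    # the total segment count isn't known up front, so concatenate at the end
    return pd.concat(frames, ignore_index=True)


report = darshan.DarshanReport("e3sm_io_heatmap_and_dxt.darshan", read_all=True)
flat_df = flatten_dxt_read_segments(report)
print(flat_df.columns.tolist())
```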