implement custom load function for collect_results #421

NuclearPowerNerd · 2024-08-12T18:17:15Z

Here is my first pass at implementing #420

This adds an optional wrapper around the loading of data when being processed by collect_results. This is useful in case the data being loaded from file needs to be modified before being processed by collect_results. The wrapper can be passed to collect_results using the new kwarg load_function_wrapper.

For example, consider you have a simulation result that is saved to disc as a struct. When loaded from disc it will be as a one-element dict containing that struct. The struct may contain many fields that should form the columns of the DataFrame. In that case you'd want to convert the struct to a dictionary after being loaded but before being processed by collect_results. The load_function_wrapper provides such a possibility.

By default load_function_wrapper=nothing and the default methods will be used to load data (so wload and jld2.jldopen).

Here is an example usage.

struct Dummy
    a::Float64
    b::Int64
    c::Matrix{Float64}
end

dummy = Dummy(1.0, 1, rand(3,3))
wsave("dummy.jld2", "dummy", dummy)

tmp = wload("dummy.jld2") # one-element Dict with key "dummy" that contains the struct Dummy(1.0, 1, rand(3,3))

tmp["dummy"] # now the struct itself instead of a dictionary
struct2dict(tmp["dummy"]) # Now a dict where the fields of the structs are now the keys of the dict

# To get desired behavior when calling collect_results we do the following
tabulated_results = collect_results("./", load_function_wrapper = (d) -> struct2dict(d["dummy"]))

All 587 tests passed.

I did try to implement something along the lines of the following.

function collect_results!(filename, folder;
    valid_filetypes=[".bson", "jld", ".jld2"],
    subfolders=false,
    rpath=nothing,
    verbose=true,
    update=false,
    newfile=false, # keyword only for defining collect_results without !
    rinclude=[r""],
    rexclude=[r"^\b$"],
    load_function_wrapper=(args...; kwargs...) -> wload(args...; kwargs...),
    kwargs...)

But I had difficulties getting this to work with to_data_row for the specialized JLD2 case. In that case it doesn't expect wload (as far as I could tell, I'm not an expert on JLD2 though) so that as a default doesn't work. But I am happy to refactor this PR if there is a better or more idiomatic way to achieve this.

This adds an optional wrapper around the loading of data when being processed by collect_results. This is useful in case the data being loaded from file needs to be modified before being processed by collect_results. The wrapper can be passed to collect_results using the new kwarg `load_function_wrapper`. For example, consider you have a simulation result that is saved to disc as a struct. When loaded from disc it will be as a one-element dict containing that struct. The struct may contain many fields that should form the columns of the DataFrame. In that case you'd want to convert the struct to a dictionary after being loaded but before being processed by collect_results. The `load_function_wrapper` provides such a possibility. By default `load_function_wrapper=nothing` and the default methods will be used to load data (so `wload` and `jld2.jldopen`).

codecov · 2024-08-12T18:23:29Z

Codecov Report

Attention: Patch coverage is 94.73684% with 1 line in your changes missing coverage. Please review.

Project coverage is 89.59%. Comparing base (59118d2) to head (71dd90b).
Report is 8 commits behind head on main.

Files	Patch %	Lines
src/result_collection.jl	94.73%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #421      +/-   ##
==========================================
- Coverage   90.47%   89.59%   -0.88%     
==========================================
  Files           8        8              
  Lines         787      788       +1     
==========================================
- Hits          712      706       -6     
- Misses         75       82       +7

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Datseris

All good, besides the minor changes for the source code, you should also add a test in the test suite using exactly the situation you describe in the docstring as well.

Datseris · 2024-08-14T12:09:25Z

src/result_collection.jl

+        if isnothing(load_function_wrapper)
+            data = wload(filename)
+        else
+            data = load_function_wrapper(filename)


Why are we using nothing here? Why don't we have the keyword itself be the function? Make load_function = wload and then use load_function throughout the source code.

If you take a look at the pull request I mention why underneath the code snippet. Probably should have included it farther up in the pull request.

But I had difficulties getting this to work with to_data_row for the specialized JLD2 case. In that case it doesn't expect wload (as far as I could tell, I'm not an expert on JLD2 though) so that as a default doesn't work. But I am happy to refactor this PR if there is a better or more idiomatic way to achieve this.

Sorry, but I don't understand the problem. Please explain it again. The way I see the code right now, if load_function is nothing, it becomes wload. Otherwise, it becomes a user-provided function. So... load_function can be by default wload. That's what it becomes anyways. So I am overall confused because in the current PR wload is the default but you say it doesn't work as the default.

The quote you quoted: what does "this" mean in its first sentence?

Sorry, but I don't understand the problem. Please explain it again. The way I see the code right now, if load_function is nothing, it becomes wload. Otherwise, it becomes a user-provided function. So... load_function can be by default wload. That's what it becomes anyways. So I am overall confused because in the current PR wload is the default but you say it doesn't work as the default.

The quote you quoted: what does "this" mean in its first sentence?

Yeah sorry for the confusion, I probably made this more complicated than it needed to be. I've pushed some new changes based on your advice. Thanks for your patience. Please take a look and let me know what you think.

Datseris · 2024-08-14T12:11:13Z

src/result_collection.jl

@@ -50,6 +50,7 @@ See also [`collect_results`](@ref).
 * `black_list = [:gitcommit, :gitpatch, :script]`: List of keys not to include from result-file.
 * `special_list = []`: List of additional (derived) key-value pairs
  to put in `df` as explained below.
+* `load_function_wrapper = nothing`: Wrapper around calls to loading the data. This is useful in case you need to do some conversion to your data before `collect_results` processes it. For example, if you are loading a struct, it will be stored as a one-element dictionary on disc. When processing it with `collect_results` you might want to convert the struct to a dict so that the fields of the struct become the elements of the dictionary. One way to do this would be to pass `load_function_wrapper = (d) -> struct2dict(d["key_name"])`. Note, this assumes the session you are loading the struct into is aware of the struct type and that the struct definition has not changed. 


I think load_function_wrapper is unecessarily complicated. There is no reason to talk about wrappers here. We can keep it simple and say the truth: this is just the loading function. We can call it load_function.

NuclearPowerNerd · 2024-08-16T02:42:32Z

All good, besides the minor changes for the source code, you should also add a test in the test suite using exactly the situation you describe in the docstring as well.

Thanks - I've updated the pull request. Also just realized my autoformat in vscode made tons of changes to update_results_tests.jl. That probably makes it harder to see my actual changes, which begin on line 73 of that file.

Edit: I just realized some of the tests are not passing. I did run the tests on my PC before pushing this and I got 589 of 589 passing (see below) so I'll have to look into whats breaking.

Datseris

Sorry for late. Don't worry about windows failures we have a problem that we don't know how to solve and CI fails on Windows.

There is a small change necessary: there is no reason to use an anonumous wrapper.

There is also a larger change necessary: please revert all formatting changes. Probably simpler to start a new branch/pr or reset this one to its original state and then do the couple of lines changes necessary.

Datseris · 2024-08-20T12:05:11Z

src/result_collection.jl

@@ -50,6 +50,7 @@ See also [`collect_results`](@ref).
 * `black_list = [:gitcommit, :gitpatch, :script]`: List of keys not to include from result-file.
 * `special_list = []`: List of additional (derived) key-value pairs
  to put in `df` as explained below.
+* `load_function = (filename) -> wload(filename)`: function for loading data from file. This is useful in the event you have data saved as a struct. When loaded from file it will be as a one-element `Dict` which is not what you want passed to the dataframe. Instead, you'd rather have the fields of the struct to be available as columns of the dataframe. In that case you can use this function to ensure the struct is converted to a dict before being processed by `collect_results!`. For example, `load_function = (filename) -> struct2dict(wload(filename)["my_key"])`.


Suggested change

* `load_function = (filename) -> wload(filename)`: function for loading data from file. This is useful in the event you have data saved as a struct. When loaded from file it will be as a one-element `Dict` which is not what you want passed to the dataframe. Instead, you'd rather have the fields of the struct to be available as columns of the dataframe. In that case you can use this function to ensure the struct is converted to a dict before being processed by `collect_results!`. For example, `load_function = (filename) -> struct2dict(wload(filename)["my_key"])`.

* `load_function = wload`: function for loading data from file. This is useful in the event you have data saved as a struct. When loaded from file it will be as a one-element `Dict` which is not what you want passed to the dataframe. Instead, you'd rather have the fields of the struct to be available as columns of the dataframe. In that case you can use this function to ensure the struct is converted to a dict before being processed by `collect_results!`. For example, `load_function = (filename) -> struct2dict(wload(filename)["my_key"])`.

wload is already a function. Wrapping it in an anonymous function has no benefit.

Datseris · 2024-08-20T12:06:05Z

src/result_collection.jl

+    newfile=false, # keyword only for defining collect_results without !
+    rinclude=[r""],
+    rexclude=[r"^\b$"],
+    load_function=(filename) -> wload(filename),


Suggested change

load_function=(filename) -> wload(filename),

load_function = wload,

The purpose of this commit is to incorpoate comments from pull request - removes the anonymous function wrapper, except from `to_data_row` which requires the default be an anonymous function so that the "r" parameter can be fixed in the call. - reverts all the autoformatting As before, a test was added to `update_results_tests.jl`. All tests passed, 589 of 589.

NuclearPowerNerd · 2024-08-21T03:12:50Z

Sorry for late. Don't worry about windows failures we have a problem that we don't know how to solve and CI fails on Windows.

There is a small change necessary: there is no reason to use an anonumous wrapper.

There is also a larger change necessary: please revert all formatting changes. Probably simpler to start a new branch/pr or reset this one to its original state and then do the couple of lines changes necessary.

Thanks - I've updated the pull request per your comments.

Edit: So I just realized the formatting did not revert as expected. Here is what I attempted and thought would work.

Create new branch on fork. Make sure "editor.formatOnSave": false is set for vscode.
Make changes in branch, incorporating your feedback on not using anonymous wrapper.
Squash commits from the "clean" branch and then cherry pick that commit onto the pull request branch by doing git cherry-pick SHA256 --strategy-option theirs. My thinking was that since the "clean" branch (created in step 1) did not have the formatting changes, then those would be interpreted as changes in whitespace and would revert the autoformatting changes on the pull request branch.

It seems the changes in whitespace were ignored though (as the current pull request still has the formatting changes) so I have more to look into here. I will try to get this sorted in the new few days.

Edit 2: I think I have found my answer here. Quoting from the docs

If their version only introduces whitespace changes to a line, our version is used;

NuclearPowerNerd · 2024-08-22T03:36:53Z

So I've prepared a new branch without the whitespace changes. Would it just be easier to close this PR and open a new one with the cleaned up branch?

Datseris · 2024-08-22T14:12:19Z

sorry for the late reply!!! Yes, a new PR is easiest!

The changes in this branch are a follow up from a previous pull request based on commit 6e6ff07 in PR JuliaDynamics#421. In that PR there were issues with whitespace changes inadvertantly coming from the autoformatter in vscode. Reverting the whitespace only changes proved to be more difficult than anticicpated. So to resolve this, this branch was created and a new PR will be created from it. The whitespace issues are gone but all the feedback and changes from the original PR are retained. The commit makes the following changes. - add the `load_function` kwarg to `collect_results`. This allows customizing how data is loaded from file before being processed into a dataframe by `collect_results`. - add a test to `update_result_tests.jl` - update docstring of `collect_results` - increase package version to 2.16.0 - update `CHANGELOG.md` All tests passed, 589 of 589.

The changes in this branch are a follow up from a previous pull request based on commit 6e6ff07 in PR #421. In that PR there were issues with whitespace changes inadvertantly coming from the autoformatter in vscode. Reverting the whitespace only changes proved to be more difficult than anticicpated. So to resolve this, this branch was created and a new PR will be created from it. The whitespace issues are gone but all the feedback and changes from the original PR are retained. The commit makes the following changes. - add the `load_function` kwarg to `collect_results`. This allows customizing how data is loaded from file before being processed into a dataframe by `collect_results`. - add a test to `update_result_tests.jl` - update docstring of `collect_results` - increase package version to 2.16.0 - update `CHANGELOG.md` All tests passed, 589 of 589.

Datseris requested changes Aug 14, 2024

View reviewed changes

refactor to use load_function and add test to update_results_tests.jl

71dd90b

Datseris requested changes Aug 20, 2024

View reviewed changes

NuclearPowerNerd closed this Aug 23, 2024

NuclearPowerNerd mentioned this pull request Aug 23, 2024

implement load_function kwarg for collect_results! #424

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implement custom load function for collect_results #421

implement custom load function for collect_results #421

NuclearPowerNerd commented Aug 12, 2024

codecov bot commented Aug 12, 2024 •

edited

Loading

Datseris left a comment

Datseris Aug 14, 2024

NuclearPowerNerd Aug 14, 2024

Datseris Aug 15, 2024

NuclearPowerNerd Aug 16, 2024

Datseris Aug 14, 2024

NuclearPowerNerd commented Aug 16, 2024 •

edited

Loading

Datseris left a comment

Datseris Aug 20, 2024

Datseris Aug 20, 2024

Datseris Aug 20, 2024

NuclearPowerNerd commented Aug 21, 2024 •

edited

Loading

NuclearPowerNerd commented Aug 22, 2024

Datseris commented Aug 22, 2024

	load_function=(filename) -> wload(filename),
	load_function = wload,

implement custom load function for collect_results #421

implement custom load function for collect_results #421

Conversation

NuclearPowerNerd commented Aug 12, 2024

codecov bot commented Aug 12, 2024 • edited Loading

Codecov Report

Datseris left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

NuclearPowerNerd commented Aug 16, 2024 • edited Loading

Datseris left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

NuclearPowerNerd commented Aug 21, 2024 • edited Loading

NuclearPowerNerd commented Aug 22, 2024

Datseris commented Aug 22, 2024

codecov bot commented Aug 12, 2024 •

edited

Loading

NuclearPowerNerd commented Aug 16, 2024 •

edited

Loading

NuclearPowerNerd commented Aug 21, 2024 •

edited

Loading