Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement custom load function for collect_results #421

Conversation

NuclearPowerNerd
Copy link
Contributor

Here is my first pass at implementing #420

This adds an optional wrapper around the loading of data when being processed by collect_results. This is useful in case the data being loaded from file needs to be modified before being processed by collect_results. The wrapper can be passed to collect_results using the new kwarg load_function_wrapper.

For example, consider you have a simulation result that is saved to disc as a struct. When loaded from disc it will be as a one-element dict containing that struct. The struct may contain many fields that should form the columns of the DataFrame. In that case you'd want to convert the struct to a dictionary after being loaded but before being processed by collect_results. The load_function_wrapper provides such a possibility.

By default load_function_wrapper=nothing and the default methods will be used to load data (so wload and jld2.jldopen).

Here is an example usage.

struct Dummy
    a::Float64
    b::Int64
    c::Matrix{Float64}
end

dummy = Dummy(1.0, 1, rand(3,3))
wsave("dummy.jld2", "dummy", dummy)

tmp = wload("dummy.jld2") # one-element Dict with key "dummy" that contains the struct Dummy(1.0, 1, rand(3,3))

tmp["dummy"] # now the struct itself instead of a dictionary
struct2dict(tmp["dummy"]) # Now a dict where the fields of the structs are now the keys of the dict

# To get desired behavior when calling collect_results we do the following
tabulated_results = collect_results("./", load_function_wrapper = (d) -> struct2dict(d["dummy"]))

All 587 tests passed.

I did try to implement something along the lines of the following.

function collect_results!(filename, folder;
    valid_filetypes=[".bson", "jld", ".jld2"],
    subfolders=false,
    rpath=nothing,
    verbose=true,
    update=false,
    newfile=false, # keyword only for defining collect_results without !
    rinclude=[r""],
    rexclude=[r"^\b$"],
    load_function_wrapper=(args...; kwargs...) -> wload(args...; kwargs...),
    kwargs...)

But I had difficulties getting this to work with to_data_row for the specialized JLD2 case. In that case it doesn't expect wload (as far as I could tell, I'm not an expert on JLD2 though) so that as a default doesn't work. But I am happy to refactor this PR if there is a better or more idiomatic way to achieve this.

This adds an optional wrapper around the loading of data when being
processed by collect_results. This is useful in case the data being
loaded from file needs to be modified before being processed by
collect_results. The wrapper can be passed to collect_results using the
new kwarg `load_function_wrapper`.

For example, consider you have a simulation result that is saved to disc
as a struct. When loaded from disc it will be as a one-element dict
containing that struct. The struct may contain many fields that should
form the columns of the DataFrame. In that case you'd want to convert
the struct to a dictionary after being loaded but before being processed
by collect_results. The `load_function_wrapper` provides such a
possibility.

By default `load_function_wrapper=nothing` and the default methods will
be used to load data (so `wload` and `jld2.jldopen`).
Copy link

codecov bot commented Aug 12, 2024

Codecov Report

Attention: Patch coverage is 94.73684% with 1 line in your changes missing coverage. Please review.

Project coverage is 89.59%. Comparing base (59118d2) to head (71dd90b).
Report is 8 commits behind head on main.

Files Patch % Lines
src/result_collection.jl 94.73% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #421      +/-   ##
==========================================
- Coverage   90.47%   89.59%   -0.88%     
==========================================
  Files           8        8              
  Lines         787      788       +1     
==========================================
- Hits          712      706       -6     
- Misses         75       82       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@Datseris Datseris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good, besides the minor changes for the source code, you should also add a test in the test suite using exactly the situation you describe in the docstring as well.

Comment on lines 106 to 109
if isnothing(load_function_wrapper)
data = wload(filename)
else
data = load_function_wrapper(filename)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we using nothing here? Why don't we have the keyword itself be the function? Make load_function = wload and then use load_function throughout the source code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you take a look at the pull request I mention why underneath the code snippet. Probably should have included it farther up in the pull request.

But I had difficulties getting this to work with to_data_row for the specialized JLD2 case. In that case it doesn't expect wload (as far as I could tell, I'm not an expert on JLD2 though) so that as a default doesn't work. But I am happy to refactor this PR if there is a better or more idiomatic way to achieve this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I don't understand the problem. Please explain it again. The way I see the code right now, if load_function is nothing, it becomes wload. Otherwise, it becomes a user-provided function. So... load_function can be by default wload. That's what it becomes anyways. So I am overall confused because in the current PR wload is the default but you say it doesn't work as the default.

The quote you quoted: what does "this" mean in its first sentence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I don't understand the problem. Please explain it again. The way I see the code right now, if load_function is nothing, it becomes wload. Otherwise, it becomes a user-provided function. So... load_function can be by default wload. That's what it becomes anyways. So I am overall confused because in the current PR wload is the default but you say it doesn't work as the default.

The quote you quoted: what does "this" mean in its first sentence?

Yeah sorry for the confusion, I probably made this more complicated than it needed to be. I've pushed some new changes based on your advice. Thanks for your patience. Please take a look and let me know what you think.

@@ -50,6 +50,7 @@ See also [`collect_results`](@ref).
* `black_list = [:gitcommit, :gitpatch, :script]`: List of keys not to include from result-file.
* `special_list = []`: List of additional (derived) key-value pairs
to put in `df` as explained below.
* `load_function_wrapper = nothing`: Wrapper around calls to loading the data. This is useful in case you need to do some conversion to your data before `collect_results` processes it. For example, if you are loading a struct, it will be stored as a one-element dictionary on disc. When processing it with `collect_results` you might want to convert the struct to a dict so that the fields of the struct become the elements of the dictionary. One way to do this would be to pass `load_function_wrapper = (d) -> struct2dict(d["key_name"])`. Note, this assumes the session you are loading the struct into is aware of the struct type and that the struct definition has not changed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think load_function_wrapper is unecessarily complicated. There is no reason to talk about wrappers here. We can keep it simple and say the truth: this is just the loading function. We can call it load_function.

@NuclearPowerNerd
Copy link
Contributor Author

NuclearPowerNerd commented Aug 16, 2024

All good, besides the minor changes for the source code, you should also add a test in the test suite using exactly the situation you describe in the docstring as well.

Thanks - I've updated the pull request. Also just realized my autoformat in vscode made tons of changes to update_results_tests.jl. That probably makes it harder to see my actual changes, which begin on line 73 of that file.

Edit: I just realized some of the tests are not passing. I did run the tests on my PC before pushing this and I got 589 of 589 passing (see below) so I'll have to look into whats breaking.

image

Copy link
Member

@Datseris Datseris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for late. Don't worry about windows failures we have a problem that we don't know how to solve and CI fails on Windows.

There is a small change necessary: there is no reason to use an anonumous wrapper.

There is also a larger change necessary: please revert all formatting changes. Probably simpler to start a new branch/pr or reset this one to its original state and then do the couple of lines changes necessary.

@@ -50,6 +50,7 @@ See also [`collect_results`](@ref).
* `black_list = [:gitcommit, :gitpatch, :script]`: List of keys not to include from result-file.
* `special_list = []`: List of additional (derived) key-value pairs
to put in `df` as explained below.
* `load_function = (filename) -> wload(filename)`: function for loading data from file. This is useful in the event you have data saved as a struct. When loaded from file it will be as a one-element `Dict` which is not what you want passed to the dataframe. Instead, you'd rather have the fields of the struct to be available as columns of the dataframe. In that case you can use this function to ensure the struct is converted to a dict before being processed by `collect_results!`. For example, `load_function = (filename) -> struct2dict(wload(filename)["my_key"])`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `load_function = (filename) -> wload(filename)`: function for loading data from file. This is useful in the event you have data saved as a struct. When loaded from file it will be as a one-element `Dict` which is not what you want passed to the dataframe. Instead, you'd rather have the fields of the struct to be available as columns of the dataframe. In that case you can use this function to ensure the struct is converted to a dict before being processed by `collect_results!`. For example, `load_function = (filename) -> struct2dict(wload(filename)["my_key"])`.
* `load_function = wload`: function for loading data from file. This is useful in the event you have data saved as a struct. When loaded from file it will be as a one-element `Dict` which is not what you want passed to the dataframe. Instead, you'd rather have the fields of the struct to be available as columns of the dataframe. In that case you can use this function to ensure the struct is converted to a dict before being processed by `collect_results!`. For example, `load_function = (filename) -> struct2dict(wload(filename)["my_key"])`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wload is already a function. Wrapping it in an anonymous function has no benefit.

newfile=false, # keyword only for defining collect_results without !
rinclude=[r""],
rexclude=[r"^\b$"],
load_function=(filename) -> wload(filename),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
load_function=(filename) -> wload(filename),
load_function = wload,

The purpose of this commit is to incorpoate comments from pull request
 - removes the anonymous function wrapper, except from `to_data_row`
which requires the default be an anonymous function so that the "r"
parameter can be fixed in the call.
 - reverts all the autoformatting

As before, a test was added to `update_results_tests.jl`. All tests
passed, 589 of 589.
@NuclearPowerNerd
Copy link
Contributor Author

NuclearPowerNerd commented Aug 21, 2024

Sorry for late. Don't worry about windows failures we have a problem that we don't know how to solve and CI fails on Windows.

There is a small change necessary: there is no reason to use an anonumous wrapper.

There is also a larger change necessary: please revert all formatting changes. Probably simpler to start a new branch/pr or reset this one to its original state and then do the couple of lines changes necessary.

Thanks - I've updated the pull request per your comments.

Edit: So I just realized the formatting did not revert as expected. Here is what I attempted and thought would work.

  1. Create new branch on fork. Make sure "editor.formatOnSave": false is set for vscode.
  2. Make changes in branch, incorporating your feedback on not using anonymous wrapper.
  3. Squash commits from the "clean" branch and then cherry pick that commit onto the pull request branch by doing git cherry-pick SHA256 --strategy-option theirs. My thinking was that since the "clean" branch (created in step 1) did not have the formatting changes, then those would be interpreted as changes in whitespace and would revert the autoformatting changes on the pull request branch.

It seems the changes in whitespace were ignored though (as the current pull request still has the formatting changes) so I have more to look into here. I will try to get this sorted in the new few days.

Edit 2: I think I have found my answer here. Quoting from the docs

If their version only introduces whitespace changes to a line, our version is used;

@NuclearPowerNerd
Copy link
Contributor Author

So I've prepared a new branch without the whitespace changes. Would it just be easier to close this PR and open a new one with the cleaned up branch?

@Datseris
Copy link
Member

sorry for the late reply!!! Yes, a new PR is easiest!

NuclearPowerNerd added a commit to NuclearPowerNerd/DrWatson.jl that referenced this pull request Aug 23, 2024
The changes in this branch are a follow up from a previous pull request
based on commit 6e6ff07 in PR JuliaDynamics#421. In that PR there were issues
with whitespace changes inadvertantly coming from the autoformatter
in vscode. Reverting the whitespace only changes proved to be more
difficult than anticicpated.

So to resolve this, this branch was created and a new PR will be created
from it. The whitespace issues are gone but all the feedback and changes
from the original PR are retained.

The commit makes the following changes.
 - add the `load_function` kwarg to `collect_results`. This allows
customizing how data is loaded from file before being processed into a
dataframe by `collect_results`.
 - add a test to `update_result_tests.jl`
 - update docstring of `collect_results`
 - increase package version to 2.16.0
 - update `CHANGELOG.md`

All tests passed, 589 of 589.
Datseris pushed a commit that referenced this pull request Aug 23, 2024
The changes in this branch are a follow up from a previous pull request
based on commit 6e6ff07 in PR #421. In that PR there were issues
with whitespace changes inadvertantly coming from the autoformatter
in vscode. Reverting the whitespace only changes proved to be more
difficult than anticicpated.

So to resolve this, this branch was created and a new PR will be created
from it. The whitespace issues are gone but all the feedback and changes
from the original PR are retained.

The commit makes the following changes.
 - add the `load_function` kwarg to `collect_results`. This allows
customizing how data is loaded from file before being processed into a
dataframe by `collect_results`.
 - add a test to `update_result_tests.jl`
 - update docstring of `collect_results`
 - increase package version to 2.16.0
 - update `CHANGELOG.md`

All tests passed, 589 of 589.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants