Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DataValidator.apply() #368

Merged
merged 19 commits into from
Aug 20, 2024
Merged

Conversation

danielhuppmann
Copy link
Member

@danielhuppmann danielhuppmann commented Aug 19, 2024

This is the implementation for the apply method of the DataValidator.

The current implementation writes each validation-item to the log separately, together with the failing data rows.

2024-08-19 10:10:33 ERROR    Failed data validation (file validate_data/validate_data_fails.yaml):
  Criteria: variable: ['Primary Energy'], upper_bound: 5.0
         model scenario region        variable   unit  year  value
    0  model_a   scen_a  World  Primary Energy  EJ/yr  2010    6.0
    1  model_a   scen_b  World  Primary Energy  EJ/yr  2010    7.0

  Criteria: variable: ['Primary Energy|Coal'], lower_bound: 2.0
         model scenario region             variable   unit  year  value
    0  model_a   scen_a  World  Primary Energy|Coal  EJ/yr  2005    0.5

  Criteria: variable: ['Primary Energy'], year: [2005], upper_bound: 1.9, lower_bound: 1.1
         model scenario region        variable   unit  year  value
    0  model_a   scen_a  World  Primary Energy  EJ/yr  2005    1.0
    1  model_a   scen_b  World  Primary Energy  EJ/yr  2005    2.0

I initially had all failing validations (per yaml file) concatenated to one DataFrame, but I then thought that it would not be helpful if users can't see which criteria they specifically do not pass.

If you agree with this approach in principle, I can make the error-message a bit nicer (indentation, more concise criteria representation).

@danielhuppmann danielhuppmann marked this pull request as ready for review August 19, 2024 09:18
Copy link
Contributor

@phackstock phackstock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, some smaller comments in line and one file to delete (I think).
After that, good to merge.

tests/data/validation/Untitled.ipynb Outdated Show resolved Hide resolved
nomenclature/processor/data_validator.py Outdated Show resolved Hide resolved
error_list = []

with adjust_log_level():
for item in self.criteria_items:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could turn this into a single list comprehension and use the walrus operator:

if error_list := [
                "  Criteria: "
                + ", ".join([f"{key}: {value}" for key, value in item.criteria.items()])
                + "\n"
                + textwrap.indent(str(df.validate(**item.criteria)), prefix="    ")
                + "\n"
                for item in self.criteria_items
                if df.validate(**item.criteria) is not None
            ]:
                logger.error(
                    "Failed data validation (file %s):\n%s",
                    get_relative_path(self.file),
                    "\n".join(error_list),
                )

not sure if that's more readable though.
Feel free to keep whatever is most readable to you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems less readable than the current implementation, so suggest to get it unless we make a utility-function that moves the string-concat somewhere else - maybe together with a refactor of the RequiredDataValidator?

@danielhuppmann danielhuppmann merged commit 18c0b12 into main Aug 20, 2024
12 checks passed
@danielhuppmann danielhuppmann deleted the feature/validate-data-apply branch August 20, 2024 11:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants