
Proposal for data validation syntax #104

Draft - wants to merge 4 commits into base: main
Conversation

@danielhuppmann (Member) commented Jul 10, 2024

This PR proposes a syntax for data validation as part of the scenario-processing infrastructure.

This PR is intended as a minimum viable product for scenario data validation. This feature is not yet supported by the nomenclature package, but will be added as a new class DataValidator once we reach agreement about the syntax.
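Purely as a hypothetical sketch of what such a class could look like (this is not the actual nomenclature API; the class interface, method names, and behavior below are assumptions built on pyam):

import yaml
import pyam

class DataValidator:
    """Hypothetical sketch of the planned validator (not the nomenclature API)."""

    FILTER_ARGS = ("model", "scenario", "region", "variable", "unit", "year")

    def __init__(self, items):
        self.items = items  # a list of dicts of filter and bound arguments

    @classmethod
    def from_file(cls, path):
        # read the list of validation items from a yaml file
        with open(path) as f:
            return cls(yaml.safe_load(f))

    def apply(self, df: pyam.IamDataFrame):
        # every datapoint matching the filters of an item must satisfy its bounds
        for item in self.items:
            filters = {k: v for k, v in item.items() if k in self.FILTER_ARGS}
            bounds = {k: v for k, v in item.items() if k not in self.FILTER_ARGS}
            # assumes validate() supports upper_bound/lower_bound keywords and
            # returns the failing datapoints (or None if all pass)
            failed = df.filter(**filters).validate(**bounds)
            if failed is not None:
                raise ValueError(f"Validation failed for item: {item}")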

The proposed syntax tries to strike a balance between readability and flexibility, using a nested yaml-style syntax to define

  • filters: any of model, scenario, region, variable, unit, year
  • bounds: upper_bound, lower_bound [value, rtol to be supported]

Any datapoint in an IAMC-style timeseries format matching the given filters must satisfy the bounds, otherwise an error is raised. The structure directly matches the signature of the method IamDataFrame.validate() so that the implementation can build on the existing functionality. For simplicity, alternative kwargs (value, rtol) will be added to the validate() method for more direct configuration.
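As a concrete illustration (input file name and numbers invented here; the exact validate() keywords depend on the pyam version):

import pyam

df = pyam.IamDataFrame("scenario_data.xlsx")  # hypothetical input file

# the filter dimensions narrow the data, then the bounds are checked
failed = df.filter(variable="Emissions|CO2", region="World", year=2030).validate(
    upper_bound=45000, lower_bound=0
)
if failed is not None:  # validate() returns the failing datapoints, or None
    raise ValueError(f"{len(failed)} datapoints failed validation")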

The syntax works as follows:

  • yaml and csv files are placed in a folder validation (or its subfolders) in a workflow repository
  • a list of yaml dictionaries with (some of) the arguments specified above (see the final-energy prototype)
  • (optional) a nested structure where arguments in the upper level (variable in the emissions prototype) are combined with all lower-level dictionaries (years and regions)
  • a file attribute in the yaml dictionary to import validation attributes from a csv file (with # as comment character); see the example further below
# simple validation item
- filter-dimension: filter-value-A
  validation-argument: validation-value-A

# named validation item
- <description of validation item B>:
    filter-dimension: filter-value-B
    validation-argument: validation-value-B

# named nested validation items
- <description of validation item C>:
    filter-dimension: filter-value-C
    validation-argument: validation-value-C
    <description of nested validation D>:
      filter-dimension: filter-value-D
      validation-argument: validation-value-D
    <description of nested validation E>:
      filter-dimension: filter-value-E
      validation-argument: validation-value-E

This structure will yield four validation items:

  • A (name: None)
  • B (name: description of validation item B)
  • C & D (name: description of validation item C - description of nested validation D)
  • C & E (name: description of validation item C - description of nested validation E)

The name could be used when reporting failed validation of a scenario.
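For illustration, a hypothetical example of the file attribute (file name and csv layout invented here; the emissions prototype in this PR defines the actual format):

# validation item importing bounds from a csv file
- Historical CO2 emissions:
    variable: Emissions|CO2
    file: validation/data_emissions.csv

# content of validation/data_emissions.csv (lines starting with # are comments)
# region, year, upper_bound, lower_bound
World, 2020, 40000, 30000
Asia (R5), 2020, 22000, 15000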

Going forward, we can also implement more features:

  • a keyword argument required
  • direct import of a csv file with all relevant attributes (risk of duplication and …)

@danielhuppmann self-assigned this Jul 10, 2024
@danielhuppmann (Member, Author)

@phackstock @gunnar-pik @Renato-Rodrigues @orichters @robertpietzcker - please let me know if this is a useful step towards automated validation of scenario submissions...

@orichters (Contributor) commented Jul 10, 2024

Maybe for your inspiration, @pweigmann has worked on a similar approach with a config file that looks like this.

I like the following features of our approach:

  • matching of variables: Price|** means all sub-variables have to satisfy a condition such as having min = 0
  • Price|* with just one * means only one hierarchy level, so it matches Price|Final Energy but not Price|Final Energy|Industry
  • scenario-specific variables (such as net-zero by 2050, or Temperature|Global Mean < 1.5 in 2100 for a Below 1.5°C scenario)
  • comparison between scenarios (all variables in all scenarios must be equal for period <= 2020, for example)
  • comparison to reference periods (in 2030, not more than 20% reduction compared to the year 2020, for example)
  • The yaml format seems very nice.

@danielhuppmann (Member, Author) commented Jul 10, 2024

Thanks @orichters, yes, I've seen your format before and we want to develop in this direction too (and I hope that the yaml file is less heavy and more reliable for forward/backward compatibility).

  • matching of variables: Price|** means all sub-variables have to satisfy a condition such as having min = 0
  • Price|* with just one * means only one hierarchy level, so matches Price|Final Energy but not Price|Final Energy|Industry

This is already implemented where * is interpreted as a wildcard and you can pass a level argument to specify how "deep" the filter works on the hierarchy, see here.
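For example (a sketch assuming pyam's relative-level semantics; the input file name is invented):

import pyam

df = pyam.IamDataFrame("scenario_data.xlsx")  # hypothetical input file

# "*" is a wildcard; a level argument relative to the pattern restricts the
# hierarchy depth, keeping Price|Final Energy but not Price|Final Energy|Industry
prices = df.filter(variable="Price|*", level=0)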

  • scenario-specific variables (such as net-zero by 2050, or Temperature|Global Mean < 1.5 in 2100 for a Below 1.5°C scenario)

I hadn't considered this yet, but you can pass a "model" or "scenario" filter argument.

  • comparison between scenarios (all variables in all scenarios must be equal for period <= 2020, for example)
  • comparison to reference periods (in 2030, not more than 20% reduction compared to the year 2020, for example)

Very useful suggestions, to be implemented in the future.

@gunnar-pik

Thanks @danielhuppmann, very useful! Yes, great to loop in @pweigmann, who started this for COMMITTED and will also be involved in the SCI project, and also @PhilippVerpoort, who will join SCI as well.
I would prefer editing the upper/lower threshold levels in a classical spreadsheet format like csv over working with yaml files. Since we will end up with a large number of entries, a table format would make it easier to keep an overview. So it is good to have functionality to read in csv files.

@pweigmann

Hello @danielhuppmann, it is always fascinating to see different people come up with a similar solution to the same problem; it does inspire confidence that this type of tool can be useful! On the other hand, it also means a lot of parallel work in different languages, I suppose.

You can follow the current development efforts of our validation tool here: https://github.com/pik-piam/piamValidation

Don't hesitate to reach out in case you would like to exchange ideas or learn more about what we have done so far, I could see this being a great area for collaboration.

@danielhuppmann (Member, Author)

Based on further discussions with @phackstock, I have modified the PR and the description (see at the top) to include a way to import a csv file but minimize duplication of columns/rows.

I also switched from upper_bound/lower_bound to value/rtol (still to be implemented in pyam) for better readability.

@phackstock (Contributor)

Looks very good to me, would be happy to implement it like this.

If we wanted (which I'm not sure we do) we could try to make the syntax of the validation file more compact.
In the below proposal I've changed two things:

  1. Moved the variable to be a top-level value
  2. Put the individual validations as list items, so that they don't require a keyword anymore
- Emissions|CO2|Energy and Industrial Processes:
    - region: World
      rtol: 5%
      file: data_emissions_global.csv
    - region: Asia (R5)
      year: 2020
      rtol: 10%
      value: 20520

This would save 3 lines compared to the current proposal. If it makes readability worse, we should stick to the current format, though.

@danielhuppmann (Member, Author)

Thanks @phackstock - I'm hesitant to define any dimension implicitly: first, I think it's better for readability to always write "variable: ...", and second, we may run into a use case where the variable is not the primary sorting dimension, which would then make life difficult...

@phackstock (Contributor)

@danielhuppmann fair point about the variable. Regarding your point on having a use case where the variable is not the main dimension, I'm not sure if we'd want to put everything into the same file anyway. If we're trying to make one format that fits every possible use case, I'm afraid we'd end up with something pretty unwieldy.

What do you think about my second point of moving the constraints into a list rather than having to give them names?
So doing this:

- Historical fossil CO2 emissions data:
  variable: Emissions|CO2|Energy and Industrial Processes
  constraints:
      - region: World
        rtol: 5%
        file: data_emissions_global.csv
      - region: Asia (R5)
        year: 2020
        rtol: 10%
        value: 20520 

instead of:

- Historical fossil CO2 emissions data:
  variable: Emissions|CO2|Energy and Industrial Processes
  World:
    region: World
    rtol: 5%
    file: data_emissions_global.csv
  Asia (R5):
    region: Asia (R5)
    year: 2020
    rtol: 10%
    value: 20520

to me, using constraints (or any other keyword that might fit better) looks a bit cleaner, and if there are a lot of constraints, you'd save a lot of lines and, I think, improve readability.

@gunnar-pik

Short note: I think it is important that we can use multiple threshold levels, especially as we move to the vetting of near-term projections - upper and lower bounds, and also soft constraints (yellow traffic light) and hard constraints (red traffic light). So would this be added as lim_lower_yellow, lim_upper_red or similar?
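For instance, a hypothetical sketch of such tiered bounds (keyword names invented here, not part of this PR):

- variable: Emissions|CO2
  region: World
  year: 2030
  upper_bound: 45000           # hard constraint ("red"): validation fails
  upper_bound_warning: 42000   # soft constraint ("yellow"): flagged only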
