Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dimensionless Units #55

Draft
wants to merge 23 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from 18 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
668c052
root: start commit of dimensionless units
pgierz Nov 8, 2024
c05ba93
wip: adds example dimensionless_mappings
pgierz Nov 8, 2024
b1ef7f2
fix(dimensionless_mappings): Inner key should be dict, not list
pgierz Nov 8, 2024
7900103
feat: allows rules to have access to dimensionless unit mapping table
pgierz Nov 8, 2024
9da4cf8
fix: removes custom cache function from pipeline, as it is still wip
pgierz Nov 8, 2024
367a1fc
fix: Update data/dimensionless_mappings.yaml
pgierz Nov 8, 2024
7d9d910
fixes the issue #54
siligam Nov 8, 2024
0d7f331
Update data/dimensionless_mappings.yaml
siligam Nov 9, 2024
9b0b05e
Update src/pymorize/units.py
siligam Nov 9, 2024
1c5b69b
doc(cmorizer.py): adds documentation for init method
pgierz Nov 11, 2024
e88b52a
added frequency check to validate method
siligam Nov 12, 2024
cc4f159
fix(cmorizer): better constructor for parallelization
pgierz Nov 11, 2024
78f82a6
fix(cmorizer): validate should not be called during object creation
pgierz Nov 12, 2024
c6c8a7f
test: add unit test for unitless PSU conversion
pgierz Nov 12, 2024
6fbc69f
Merge branch 'main' into feat/dimensionless-units
pgierz Nov 12, 2024
aa51c32
added units check to validate method in CMORizer
siligam Nov 12, 2024
aaa1f19
redefined psu
siligam Nov 12, 2024
0ba5fac
added missing import statement
siligam Nov 12, 2024
6bb99ff
re-write unit checker
siligam Nov 12, 2024
d066700
fixed typo
siligam Nov 12, 2024
64d94ae
updated unit_mapping and tests for unit psu
siligam Nov 12, 2024
4eb1478
reverted the dimsionless unit test case
siligam Nov 13, 2024
36fcefd
fix in cmorizer, rework on dimensionless mapping tests, and setting o…
mandresm Nov 15, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions data/dimensionless_mappings.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
# In general:
# model_variable_name:
# cmor_unit_string: pint_friendly_SI_units
so:
"0.001": g/kg
siligam marked this conversation as resolved.
Show resolved Hide resolved
1 change: 1 addition & 0 deletions examples/sample.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ pymorize:
fixed_jobs: 12
# minimum_jobs: 8
# maximum_jobs: 30
dimensionless_mapping_table: ../data/dimensionless_mappings.yaml
rules:
- name: paul_example_rule
description: "You can put some text here"
Expand Down
62 changes: 60 additions & 2 deletions src/pymorize/cmorizer.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@

import dask
import pandas as pd
import xarray as xr
import questionary
import yaml
from dask.distributed import Client
Expand All @@ -12,9 +13,14 @@
from prefect_dask import DaskTaskRunner
from rich.progress import track

from .data_request import (DataRequest, DataRequestTable, DataRequestVariable,
IgnoreTableFiles)
from .data_request import (
DataRequest,
DataRequestTable,
DataRequestVariable,
IgnoreTableFiles,
)
from .filecache import fc
from .units import handle_unit_conversion
from .logging import logger
from .pipeline import Pipeline
from .rule import Rule
Expand Down Expand Up @@ -52,6 +58,7 @@ def __init__(
self._post_init_create_data_request()
self._post_init_populate_rules_with_tables()
self._post_init_data_request_variables()
self._post_init_read_dimensionless_unit_mappings()

def _post_init_configure_dask(self):
"""
Expand Down Expand Up @@ -148,6 +155,36 @@ def _post_init_data_request_variables(self):
self._rules_expand_drvs()
self._rules_depluralize_drvs()

def _post_init_read_dimensionless_unit_mappings(self):
mandresm marked this conversation as resolved.
Show resolved Hide resolved
"""
Reads the dimensionless unit mappings from a configuration file and
updates the rules with these mappings.

This method reads the dimensionless unit mappings from a file specified
in the configuration. If the file is not specified or does not exist,
an empty dictionary is used. The mappings are then added to each rule
in the `rules` attribute.

Parameters
----------
None

Returns
-------
None
"""
pymorize_cfg = self._pymorize_cfg
unit_map_file = pymorize_cfg.get("dimensionless_mapping_table", None)
if unit_map_file is None:
logger.warning("No dimensionless unit mappings file specified!")
dimensionless_unit_mappings = {}
else:
with open(unit_map_file, "r") as f:
dimensionless_unit_mappings = yaml.safe_load(f)
# Add to rules:
for rule in self.rules:
rule.dimensionless_unit_mappings = dimensionless_unit_mappings

def find_matching_rule(
self, data_request_variable: DataRequestVariable
) -> Rule or None:
Expand Down Expand Up @@ -225,6 +262,7 @@ def validate(self):
# self._check_rules_for_table()
# self._check_rules_for_output_dir()
self._check_is_subperiod()
self._check_units()

def _check_is_subperiod(self):
logger.info("checking frequency in netcdf file and in table...")
Expand Down Expand Up @@ -257,6 +295,25 @@ def _check_is_subperiod(self):
if errors:
for err in errors:
logger.error(err)
raise errors[0]

def _check_units(self):
errors = []
for rule in self.rules:
for input_collection in rule.inputs:
try:
filename = input_collection.files[0]
except IndexError:
break
da = xr.open_dataarray(filename, use_cftime=True)
mandresm marked this conversation as resolved.
Show resolved Hide resolved
try:
handle_unit_conversion(da, rule)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a check. You just call the actual conversion function. That's supposed to happen in process. I think you instead should see if the variable has supplied dimensionless units, and if those are defined in a way that pint will be able to handle it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay. will rewrite it.

Copy link
Contributor

@siligam siligam Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: how to know the unit in the netcdf file without accessing it. I can check the cmor_variable (table unit) alone from the rule object if it dimensionless and if the unit mapping is found in dimensionless_units but how about the source units. If model_variable is defined then I can check it as well but that means the user wants to use model_variable instead of the units found in the netcdf file.

consider the following table

var.name modelunit tableunit converted_representation
sidmasstranx kg s-1 m-1 kg s-1
sisnconc 1 % 100.0 percent
sitimefrac 1.0 1 1.0 dimensionless
so psu 0.001 0.001 gram / gram

If I just check the tableunit, then I can not catch sidmasstranx failing and for sitimefrac the check will fail as it is dimensionless and there is no entry in the mapping for it yet but if the user puts an mapping entry for this variable then it still fails as source units (from netcdf file or user defined modelunits) are still dimensionless and the unit conversion function does not consider using mapping for source units.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The re-written code only deals with dimensionless variables (i.e., ignoring sidmasstranx case) and addresses the rest of the cases.

except ValueError as e:
msg = f"Error while converting units for {filename}: {e}"
logger.error(msg)
errors.append(e)
if errors:
raise errors[0]

@classmethod
def from_dict(cls, data):
Expand Down Expand Up @@ -285,6 +342,7 @@ def from_dict(cls, data):
instance._post_init_populate_rules_with_tables()
instance._post_init_create_data_request()
instance._post_init_data_request_variables()
instance._post_init_read_dimensionless_unit_mappings()
return instance

def add_rule(self, rule):
Expand Down
41 changes: 22 additions & 19 deletions src/pymorize/units.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,15 +5,10 @@
In case the units in model files differ from CMIP Tables, this module attempts to
convert them automatically.

In case of missing units in either model files or CMIP Tables, this module can
not convert from a dimentionless base to something with dimension. Dealing with
such thing have to done with `action` section in the Rules module on a per
variable basis.

Additionally, the cmip frequencies are mapped here. The CMIP6 frequency
names and corresponding number of days are available as a dictionary in the
``CMIP_FREQUENCIES`` variable. Assignment of these frequencies to the unit registry
can be done with the ``assign_frequency_to_unit_registry`` function.
Conversion to-or-from a dimensionless quantity is ambiguous. In this case,
provide a mapping of what this dimensionless quantity represents and that
is used for the conversion. `data/dimensionless_mappings.yaml` contains some
examples on how the mapping is written.
"""

import re
Expand All @@ -29,17 +24,12 @@
import xarray as xr
from chemicals import periodic_table

from .frequency import CMIP_FREQUENCIES
from .logging import logger
from .rule import Rule

ureg = pint_xarray.unit_registry


def assign_frequency_to_unit_registry():
"""Assign the CMIP6 frequencies to the unit registry."""
for freq_name, days in CMIP_FREQUENCIES.items():
ureg.define(f"{freq_name} = {days} * d")
ureg.define("practical_salinity_unit = g/kg = psu = PSU")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ureg.define("practical_salinity_unit = g/kg = psu = PSU")

Talking to @pgierz we decided to not support psu in the end because in models different than FESOM there might be different species of salt and the g/kg equivalence will only be approximate. In that case, the users will need to create a rule themselves to integrate their species of salt and provide and adequate conversion.

Apologies for changing the mind about it @siligam, I have not considered these details when I suggested to support psu.



def handle_chemicals(
Expand Down Expand Up @@ -112,15 +102,28 @@ def handle_unit_conversion(da: xr.DataArray, rule: Rule) -> xr.DataArray:
model_unit = rule.get("model_unit")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the function handle_unit_conversion(da: xr.DataArray, rule: Rule) would it make sense to have a 3rd argument dryrun=False so that this function can be used before computation in a loop over the rules, to report wrong units in the model side, missing mappings for the tables, etc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do this as long as it has a default argument of dryrun=False, otherwise we would break the pipeline API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, I would set the parameter to be always False, unless specified otherwise. I suggested dryrun as name but if you rather have something like check that's also a good name for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of the option dryrun=False which breaks the API, the same thing can be achieved through a command line option with a meaningful name like "check" or "validate". At the moment, the check for frequency of source data and the frequency that the data needs to be converted to (according to frequency mention in table) is implemented in CMORizer._post_init_checks. The unit conversion check can also be added there. The command line option can skip dask cluster creating thing, construct the CMORizer object by populating the rules so that _post_init_checks has all the information it needs to verify and validate the user inputs for errors.

Copy link
Contributor

@mandresm mandresm Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the following break the pipeline API? Why?

def  handle_unit_conversion(da: xr.DataArray, rule: Rule, dryrun=False):

The command line option can skip dask cluster creating thing, construct the CMORizer object by populating the rules so that _post_init_checks has all the information it needs to verify and validate the user inputs for errors.

Isn't _post_init_checks run before the dask cluster initialization?

The unit conversion check can also be added there

I agree with this.

Copy link
Member Author

@pgierz pgierz Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the following break the pipeline API? Why?

At the moment, any step in a pipeline must have the signature:

def my_step(data, rule):
   ...
   return data

I think it would be cleanest to turn _post_init_checks into a main CMORizer object method (e.g. validate), which you can run independently of process. In my view, constructing the CMORizer object should still be possible, even if you fill it will rubbish rules (this makes testing easier). So, something like this:

cmorizer = CMORizer.from_dict(cfg)
cmorizer.validate()
cmorizer.process()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pgierz, for clarifying that.

I think it would be cleanest to turn _post_init_checks into a main CMORizer object method (e.g. validate), which you can run independently of process.

+1

from_unit = da.attrs.get("units")
if model_unit is not None:
logger.debug(
f"using user defined unit ({model_unit}) instead of ({from_unit}) from DataArray "
logger.info(
f"using user defined unit ({model_unit}) instead of ({from_unit}) from the original file"
)
from_unit = model_unit
handle_chemicals(from_unit)
handle_chemicals(to_unit)
new_da = da.pint.quantify(from_unit)
logger.debug(f"Converting units: {from_unit} -> {to_unit}")
new_da = new_da.pint.to(to_unit).pint.dequantify()
dimless = rule.get("dimensionless_unit_mappings", {})
cmor_variable = rule.get("cmor_variable")
if cmor_variable in dimless:
_to_unit = dimless[cmor_variable][to_unit]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_unit might not exists in the dimensionless_unit_mappings because we might have made a mistake while writing the yaml. Can you add an error handling here? Something like:

if to_unit not int dimless[cmor_variable]:
    raise KeyError(f"Unit {to_unit} not found for variable {da.name} in the {pymorize_cfg.get("dimensionless_mapping_table", None)}")

I guess the pymorize_cfg get dimensionless_mapping_table won't work since it it out of scope, but it could be included in the rule, so that we can report in the error exactly in which file the unit is missing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. will write the error handling thing.

else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere here, there should be a dimensionless, or "weird-unit" checker, so that it raises and error if a variable won't be able to be transformed by pint and suggesting the user/developer to include a mapping of units in the dimensionless_unit_mappings

_to_unit = to_unit
if _to_unit == to_unit:
logger.info(
f"Converting units: ({da.name} -> {cmor_variable}) {from_unit} -> {to_unit}"
)
else:
logger.info(
f"Converting units: ({da.name} -> {cmor_variable}) {from_unit} -> {_to_unit} ({to_unit})"
)
new_da = new_da.pint.to(_to_unit).pint.dequantify()
if new_da.attrs.get("units") != to_unit:
logger.debug(
"Pint auto-unit attribute setter different from requested unit string, setting manually."
Expand Down
33 changes: 33 additions & 0 deletions tests/unit/test_dimensionless_units.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
import numpy as np
import xarray as xr

from pymorize.data_request import DataRequestVariable
from pymorize.rule import Rule
from pymorize.units import handle_unit_conversion


def test_units_with_psu():
da = xr.DataArray(10, name="sos", attrs={"units": "psu"})
rule = Rule(
cmor_variable="sos",
dimensionless_unit_mappings={"sos": {"0.001": "g/kg"}},
)
drv = DataRequestVariable(
variable_id="sos",
unit="0.001",
description="Salinity of the ocean",
time_method="MEAN",
table="Omon",
frequency="Monthly",
realms="Ocean",
standard_name="salinity_ocean",
cell_methods="area: mean where sea",
cell_measures="area: areacello",
)
rule.data_request_variable = drv
new_da = handle_unit_conversion(da, rule)
assert new_da.attrs.get("units") == "0.001"
# Check the magnitude of the data
# 1 g/kg is approximately 1 psu, so the values should be 10
# after conversion:
assert np.allclose(new_da.values, 10)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something with the conversion still isn't working correctly, this new unit test fails and is an order 1000 too large (which is exactly the quantity of g per kg).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be because of the unit defined in cf_xarray https://github.com/xarray-contrib/cf-xarray/blob/main/cf_xarray/units.py#L111
Let me try to redefine it to g/kg and see if that test pass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if 1 g/kg is approx. is 1 psu then psu must be defined as "practical_salinity_unit = g/kg = psu = PSU" instead of "practical_salinity_unit = [] = psu = PSU" (as defined in cf_xarray) and then it works.

Copy link
Contributor

@siligam siligam Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something with the conversion still isn't working correctly, this new unit test fails and is an order 1000 too large (which is exactly the quantity of g per kg).

Consider the following code snippet

>>> ureg.define('partical_salinity_unit = g/kg = psu = PSU')
>>> ureg('psu')
<Quantity(1, 'partical_salinity_unit')>
>>> ureg('psu').to('g/kg')
<Quantity(1.0, 'gram / kilogram')>
>>>
>>> ureg('psu').to('mg/g')
<Quantity(1.0, 'milligram / gram')>
>>>
>>> ureg('psu').to('kg/kg')
<Quantity(0.001, 'dimensionless')>
>>> ureg('psu').to('g/g')
<Quantity(0.001, 'dimensionless')>

As long as the salinity is expressed in gram/kilogram or milligram/gram the magnitude is 1. The magnitude is 0.001 when it is expressed in kg/kg or g/g which means in dimensionless_units_mapping 0.001 should be either kg/kg or g/g.

This means, in model output data is expressed in psu (g/kg) but in CMOR tables, the data is expected to be expressed as kg/kg or g/g.

1 psu => 1 g/kg => 1 g/(1000 g) => 0.001 g/g => 0.001 dimensionless
1 psu => 1 g/kg => 1 (0.001 kg/kg) => 0.001 kg/kg

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christian-stepanek, @chrisdane, could you please explain here? I'm 100% confident that we should not be changing the magnitude of the data in this case.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, g/kg is in this case the same as the CMOR unit, just called by a different name. Unit conversions may involve factors or adding offsets, but not in this case.