Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dimensionless Units #55

Draft
wants to merge 23 commits into
base: main
Choose a base branch
from
Draft

Dimensionless Units #55

wants to merge 23 commits into from

Conversation

pgierz
Copy link
Member

@pgierz pgierz commented Nov 8, 2024

Work for dimensionless units. Will close #54

@pgierz pgierz linked an issue Nov 8, 2024 that may be closed by this pull request
@pgierz pgierz added this to the Beta Release milestone Nov 8, 2024
@pgierz pgierz added documentation Improvements or additions to documentation enhancement New feature or request interface backend labels Nov 8, 2024
@pgierz
Copy link
Member Author

pgierz commented Nov 8, 2024

For developers: the unitless mapping table is read when the CMORizer object gets initialised (either directly, this is uncommon) or via the classmethod from_dict (this is the common case), and attached to all rules that the CMORizer knows about. You can access them on the rule object as attribute dimensionless_unit_mappings.

@siligam
Copy link
Contributor

siligam commented Nov 8, 2024

@pgierz , nice to see the dimensionless_unit_mappings set on rule. I guess unit conversion function should looked out for this attribute and act accordingly. If so, I will make necessary changes to that function. One follow up question: Should the units in the final output point to units in the CMIP table or the one used in this unit_mapping? For instance, "0.001" is the unit in the table. Its unit_mapping says "g/kg". The later is used by the unit conversion function. I guess the unit representation in the final output should still be "0.001" as suggested in the table. Right?

@pgierz
Copy link
Member Author

pgierz commented Nov 8, 2024

I guess the unit representation in the final output should still be "0.001" as suggested in the table. Right?

Correct. At the end of the CMOR process, we need to have metadata attributes as defined in the CMOR Tables.

@mandresm
Copy link
Contributor

mandresm commented Nov 8, 2024

Hi guys, just so I make sure I am following, this are the steps and tooling needed correct?

  • read the dimensionless_unit_mappings. @pgierz this is also what you are saying here as well right?

the unitless mapping table is read when the CMORizer object gets initialised (either directly, this is uncommon) or via the classmethod from_dict (this is the common case), and attached to all rules that the CMORizer knows about

  • Prepare checks raise errors if:

    • a. The user has not specified a "pint-understandable" unit and the original unit in the file doesn't make any sense to pint
    • b. One of the rules is trying to convert to a unit in the CMOR table that has no mapping to a pint-meaningfull physical variable
  • Modify the unit conversion to get the attribute from CMOR table where we stored the physically meaningfull unit (g/kg) and use that to do the conversion.

Any step missing?

I'd suggest we start writing the dimensionless_unit_mappings yaml with just one variable to test, for example one of the salinities that @siligam have shown from his tables.

@mandresm mandresm mentioned this pull request Nov 8, 2024
@pgierz
Copy link
Member Author

pgierz commented Nov 8, 2024

  1. read the dimensionless_unit_mappings and assign them to their corresponding CMOR variables. @pgierz this is also what you are saying here as well right?

the unitless mapping table is read when the CMORizer object gets initialised (either directly, this is uncommon) or via the classmethod from_dict (this is the common case), and attached to all rules that the CMORizer knows about

Partially. The mappings are read and all are assigned to each Rule: no filtering is done. If you want each Rule to only know about one specific mapping, that is easy enough to do, I can add that.

@pgierz
Copy link
Member Author

pgierz commented Nov 8, 2024

I'd suggest we start writing the dimensionless_unit_mappings yaml with just one variable to test, for example one of the salinities that @siligam have shown from his tables.

Probably it would be a good idea to turn that into a unit test so it is run automatically.

@siligam
Copy link
Contributor

siligam commented Nov 8, 2024

I will implement the tastcase for salinity using the dimensionless_unit_mappings feature.

Just for reference, model-table unit listing https://notes.desy.de/CXe0axlgSbStWgOLjaXnlQ?view

@mandresm
Copy link
Contributor

mandresm commented Nov 8, 2024

  1. read the dimensionless_unit_mappings and assign them to their corresponding CMOR variables. @pgierz this is also what you are saying here as well right?

the unitless mapping table is read when the CMORizer object gets initialised (either directly, this is uncommon) or via the classmethod from_dict (this is the common case), and attached to all rules that the CMORizer knows about

Partially. The mappings are read and all are assigned to each Rule: no filtering is done. If you want each Rule to only know about one specific mapping, that is easy enough to do, I can add that.

I'm not understanding something: don't the rules belong to the model side of things?

Why would we need to append it to the rule and not to the CMOR object? The mapping is for the CMOR unit, not the model unit. Please guys, let me know if I need to reread the documentation or the code, I might be saying bullshit here

@pgierz
Copy link
Member Author

pgierz commented Nov 8, 2024

Please guys, let me know if I need to reread the documentation or the code, I might be saying bullshit here

This is badly documented, so read the code ;-) I am talking about the CMORizer object here (the main controller) which has information about everything It's the part that controls the parallelisation and knows which rules exist, what the user configuration looks like, how to match up model files to cmor table entries etc etc.

At the moment, it looks like this:

>>> user_config_file = pathlib.Path("/some/file.yaml")
>>> f = yaml.safe_load(user_config_file)
>>> cmorizer = CMORizer.from_dict(f)
>>> type(cmorizer.rules)
list
>>> first_rule = cmorizer.rules[0]
>>> first_rule.model_variable
"sst"
>>> first_rule.cmor_variable
"sos"
>>> first_rule.dimensionless_unit_mappings
{"sst": {"0.001": "g/kg"}, "other_key": {"1": "µmol mol^-1"}, ...}

Of course this example is nonsense, sst isn't a concentration, but just for illustrative purposes of what sorts of things are where in the objects.

@christian-stepanek
Copy link

I guess the unit representation in the final output should still be "0.001" as suggested in the table. Right?

Correct. At the end of the CMOR process, we need to have metadata attributes as defined in the CMOR Tables.

Yes, the units in the tables are the target, they are the convention.

pgierz and others added 3 commits November 8, 2024 13:59
Better example in the example. SST isn't meaningful for this use case, we instead will test with salinity of the ocean (variable `so`)

Co-authored-by: Miguel <[email protected]>
@siligam
Copy link
Contributor

siligam commented Nov 8, 2024

So, I have integrated dimensionless_mappings logic into unit conversion function and it appears work as expected. I have run examples/sample.yaml which produced the following:

The slurm log shows the expected unit conversion details:

> grep Converting slurm-13789839.out
Converting units: (thetao -> thetao) degC -> degC
Converting units: (CO2f -> fgco2) mmolC/m2/d -> kg m-2 s-1
Converting units: (so -> so) psu -> g/kg (0.001)

The units in processed netcdf file shows the correct unit as per the Table.

> ncdump -h so_Omon_AWI-ocean_piControl_r1i1p1f1_gn_278401-278412.nc | grep so:units
		so:units = "0.001" ;

Copy link
Contributor

@mandresm mandresm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @siligam and @pgierz, fantastic work guys! This is looking very promising :) I have some suggestions.

data/dimensionless_mappings.yaml Outdated Show resolved Hide resolved
src/pymorize/cmorizer.py Show resolved Hide resolved
src/pymorize/units.py Outdated Show resolved Hide resolved
dimless = rule.get("dimensionless_unit_mappings", {})
cmor_variable = rule.get("cmor_variable")
if cmor_variable in dimless:
_to_unit = dimless[cmor_variable][to_unit]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_unit might not exists in the dimensionless_unit_mappings because we might have made a mistake while writing the yaml. Can you add an error handling here? Something like:

if to_unit not int dimless[cmor_variable]:
    raise KeyError(f"Unit {to_unit} not found for variable {da.name} in the {pymorize_cfg.get("dimensionless_mapping_table", None)}")

I guess the pymorize_cfg get dimensionless_mapping_table won't work since it it out of scope, but it could be included in the rule, so that we can report in the error exactly in which file the unit is missing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. will write the error handling thing.

cmor_variable = rule.get("cmor_variable")
if cmor_variable in dimless:
_to_unit = dimless[cmor_variable][to_unit]
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere here, there should be a dimensionless, or "weird-unit" checker, so that it raises and error if a variable won't be able to be transformed by pint and suggesting the user/developer to include a mapping of units in the dimensionless_unit_mappings

src/pymorize/units.py Outdated Show resolved Hide resolved
@@ -110,15 +98,24 @@ def handle_unit_conversion(da: xr.DataArray, rule: Rule) -> xr.DataArray:
model_unit = rule.get("model_unit")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the function handle_unit_conversion(da: xr.DataArray, rule: Rule) would it make sense to have a 3rd argument dryrun=False so that this function can be used before computation in a loop over the rules, to report wrong units in the model side, missing mappings for the tables, etc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do this as long as it has a default argument of dryrun=False, otherwise we would break the pipeline API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, I would set the parameter to be always False, unless specified otherwise. I suggested dryrun as name but if you rather have something like check that's also a good name for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of the option dryrun=False which breaks the API, the same thing can be achieved through a command line option with a meaningful name like "check" or "validate". At the moment, the check for frequency of source data and the frequency that the data needs to be converted to (according to frequency mention in table) is implemented in CMORizer._post_init_checks. The unit conversion check can also be added there. The command line option can skip dask cluster creating thing, construct the CMORizer object by populating the rules so that _post_init_checks has all the information it needs to verify and validate the user inputs for errors.

Copy link
Contributor

@mandresm mandresm Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the following break the pipeline API? Why?

def  handle_unit_conversion(da: xr.DataArray, rule: Rule, dryrun=False):

The command line option can skip dask cluster creating thing, construct the CMORizer object by populating the rules so that _post_init_checks has all the information it needs to verify and validate the user inputs for errors.

Isn't _post_init_checks run before the dask cluster initialization?

The unit conversion check can also be added there

I agree with this.

Copy link
Member Author

@pgierz pgierz Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the following break the pipeline API? Why?

At the moment, any step in a pipeline must have the signature:

def my_step(data, rule):
   ...
   return data

I think it would be cleanest to turn _post_init_checks into a main CMORizer object method (e.g. validate), which you can run independently of process. In my view, constructing the CMORizer object should still be possible, even if you fill it will rubbish rules (this makes testing easier). So, something like this:

cmorizer = CMORizer.from_dict(cfg)
cmorizer.validate()
cmorizer.process()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pgierz, for clarifying that.

I think it would be cleanest to turn _post_init_checks into a main CMORizer object method (e.g. validate), which you can run independently of process.

+1

src/pymorize/cmorizer.py Outdated Show resolved Hide resolved
@pgierz
Copy link
Member Author

pgierz commented Nov 12, 2024

What I am still missing here is a unit test for the dimensionless unit conversion.

@pgierz
Copy link
Member Author

pgierz commented Nov 12, 2024

I'm going to have to do a force push on this branch at some point, there are a few features mixed together in this one:

  1. disentangling of the parallel initialisation
  2. validate vs. post_init
  3. Actual unitless conversion.

@mandresm: is it worth it, just to keep things "clean"? Or more confusing?

Update: No force push needed, changes in this PR now exclusively deal with the relevant conversion logic.

Comment on lines 9 to 33
def test_units_with_psu():
da = xr.DataArray(10, name="sos", attrs={"units": "psu"})
rule = Rule(
cmor_variable="sos",
dimensionless_unit_mappings={"sos": {"0.001": "g/kg"}},
)
drv = DataRequestVariable(
variable_id="sos",
unit="0.001",
description="Salinity of the ocean",
time_method="MEAN",
table="Omon",
frequency="Monthly",
realms="Ocean",
standard_name="salinity_ocean",
cell_methods="area: mean where sea",
cell_measures="area: areacello",
)
rule.data_request_variable = drv
new_da = handle_unit_conversion(da, rule)
assert new_da.attrs.get("units") == "0.001"
# Check the magnitude of the data
# 1 g/kg is approximately 1 psu, so the values should be 10
# after conversion:
assert np.allclose(new_da.values, 10)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something with the conversion still isn't working correctly, this new unit test fails and is an order 1000 too large (which is exactly the quantity of g per kg).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be because of the unit defined in cf_xarray https://github.com/xarray-contrib/cf-xarray/blob/main/cf_xarray/units.py#L111
Let me try to redefine it to g/kg and see if that test pass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if 1 g/kg is approx. is 1 psu then psu must be defined as "practical_salinity_unit = g/kg = psu = PSU" instead of "practical_salinity_unit = [] = psu = PSU" (as defined in cf_xarray) and then it works.

Copy link
Contributor

@siligam siligam Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something with the conversion still isn't working correctly, this new unit test fails and is an order 1000 too large (which is exactly the quantity of g per kg).

Consider the following code snippet

>>> ureg.define('partical_salinity_unit = g/kg = psu = PSU')
>>> ureg('psu')
<Quantity(1, 'partical_salinity_unit')>
>>> ureg('psu').to('g/kg')
<Quantity(1.0, 'gram / kilogram')>
>>>
>>> ureg('psu').to('mg/g')
<Quantity(1.0, 'milligram / gram')>
>>>
>>> ureg('psu').to('kg/kg')
<Quantity(0.001, 'dimensionless')>
>>> ureg('psu').to('g/g')
<Quantity(0.001, 'dimensionless')>

As long as the salinity is expressed in gram/kilogram or milligram/gram the magnitude is 1. The magnitude is 0.001 when it is expressed in kg/kg or g/g which means in dimensionless_units_mapping 0.001 should be either kg/kg or g/g.

This means, in model output data is expressed in psu (g/kg) but in CMOR tables, the data is expected to be expressed as kg/kg or g/g.

1 psu => 1 g/kg => 1 g/(1000 g) => 0.001 g/g => 0.001 dimensionless
1 psu => 1 g/kg => 1 (0.001 kg/kg) => 0.001 kg/kg

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christian-stepanek, @chrisdane, could you please explain here? I'm 100% confident that we should not be changing the magnitude of the data in this case.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, g/kg is in this case the same as the CMOR unit, just called by a different name. Unit conversions may involve factors or adding offsets, but not in this case.

src/pymorize/cmorizer.py Outdated Show resolved Hide resolved
break
da = xr.open_dataarray(filename, use_cftime=True)
try:
handle_unit_conversion(da, rule)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a check. You just call the actual conversion function. That's supposed to happen in process. I think you instead should see if the variable has supplied dimensionless units, and if those are defined in a way that pint will be able to handle it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay. will rewrite it.

Copy link
Contributor

@siligam siligam Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: how to know the unit in the netcdf file without accessing it. I can check the cmor_variable (table unit) alone from the rule object if it dimensionless and if the unit mapping is found in dimensionless_units but how about the source units. If model_variable is defined then I can check it as well but that means the user wants to use model_variable instead of the units found in the netcdf file.

consider the following table

var.name modelunit tableunit converted_representation
sidmasstranx kg s-1 m-1 kg s-1
sisnconc 1 % 100.0 percent
sitimefrac 1.0 1 1.0 dimensionless
so psu 0.001 0.001 gram / gram

If I just check the tableunit, then I can not catch sidmasstranx failing and for sitimefrac the check will fail as it is dimensionless and there is no entry in the mapping for it yet but if the user puts an mapping entry for this variable then it still fails as source units (from netcdf file or user defined modelunits) are still dimensionless and the unit conversion function does not consider using mapping for source units.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The re-written code only deals with dimensionless variables (i.e., ignoring sidmasstranx case) and addresses the rest of the cases.

@pgierz pgierz mentioned this pull request Nov 12, 2024
@pgierz pgierz linked an issue Nov 12, 2024 that may be closed by this pull request
"""Assign the CMIP6 frequencies to the unit registry."""
for freq_name, days in CMIP_FREQUENCIES.items():
ureg.define(f"{freq_name} = {days} * d")
ureg.define("practical_salinity_unit = g/kg = psu = PSU")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ureg.define("practical_salinity_unit = g/kg = psu = PSU")

Talking to @pgierz we decided to not support psu in the end because in models different than FESOM there might be different species of salt and the g/kg equivalence will only be approximate. In that case, the users will need to create a rule themselves to integrate their species of salt and provide and adequate conversion.

Apologies for changing the mind about it @siligam, I have not considered these details when I suggested to support psu.

…f the data/dimensionless_mappings.yaml as the default fault to load if dimensionless_mapping_table is not given by the user
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend documentation Improvements or additions to documentation enhancement New feature or request interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dimensionless and other strange unit conversion PSU Conversion
5 participants