-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dimensionless Units #55
base: main
Are you sure you want to change the base?
Conversation
For developers: the unitless mapping table is read when the CMORizer object gets initialised (either directly, this is uncommon) or via the |
@pgierz , nice to see the |
Correct. At the end of the CMOR process, we need to have metadata attributes as defined in the CMOR Tables. |
Hi guys, just so I make sure I am following, this are the steps and tooling needed correct?
Any step missing? I'd suggest we start writing the dimensionless_unit_mappings yaml with just one variable to test, for example one of the salinities that @siligam have shown from his tables. |
Partially. The mappings are read and all are assigned to each |
Probably it would be a good idea to turn that into a unit test so it is run automatically. |
I will implement the tastcase for salinity using the dimensionless_unit_mappings feature. Just for reference, model-table unit listing https://notes.desy.de/CXe0axlgSbStWgOLjaXnlQ?view |
I'm not understanding something: don't the rules belong to the model side of things? Why would we need to append it to the rule and not to the CMOR object? The mapping is for the CMOR unit, not the model unit. Please guys, let me know if I need to reread the documentation or the code, I might be saying bullshit here |
This is badly documented, so read the code ;-) I am talking about the At the moment, it looks like this: >>> user_config_file = pathlib.Path("/some/file.yaml")
>>> f = yaml.safe_load(user_config_file)
>>> cmorizer = CMORizer.from_dict(f)
>>> type(cmorizer.rules)
list
>>> first_rule = cmorizer.rules[0]
>>> first_rule.model_variable
"sst"
>>> first_rule.cmor_variable
"sos"
>>> first_rule.dimensionless_unit_mappings
{"sst": {"0.001": "g/kg"}, "other_key": {"1": "µmol mol^-1"}, ...} Of course this example is nonsense, sst isn't a concentration, but just for illustrative purposes of what sorts of things are where in the objects. |
Yes, the units in the tables are the target, they are the convention. |
Better example in the example. SST isn't meaningful for this use case, we instead will test with salinity of the ocean (variable `so`) Co-authored-by: Miguel <[email protected]>
So, I have integrated dimensionless_mappings logic into unit conversion function and it appears work as expected. I have run The slurm log shows the expected unit conversion details: > grep Converting slurm-13789839.out
Converting units: (thetao -> thetao) degC -> degC
Converting units: (CO2f -> fgco2) mmolC/m2/d -> kg m-2 s-1
Converting units: (so -> so) psu -> g/kg (0.001) The units in processed netcdf file shows the correct unit as per the Table. > ncdump -h so_Omon_AWI-ocean_piControl_r1i1p1f1_gn_278401-278412.nc | grep so:units
so:units = "0.001" ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dimless = rule.get("dimensionless_unit_mappings", {}) | ||
cmor_variable = rule.get("cmor_variable") | ||
if cmor_variable in dimless: | ||
_to_unit = dimless[cmor_variable][to_unit] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to_unit
might not exists in the dimensionless_unit_mappings
because we might have made a mistake while writing the yaml. Can you add an error handling here? Something like:
if to_unit not int dimless[cmor_variable]:
raise KeyError(f"Unit {to_unit} not found for variable {da.name} in the {pymorize_cfg.get("dimensionless_mapping_table", None)}")
I guess the pymorize_cfg get dimensionless_mapping_table won't work since it it out of scope, but it could be included in the rule, so that we can report in the error exactly in which file the unit is missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. will write the error handling thing.
cmor_variable = rule.get("cmor_variable") | ||
if cmor_variable in dimless: | ||
_to_unit = dimless[cmor_variable][to_unit] | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhere here, there should be a dimensionless, or "weird-unit" checker, so that it raises and error if a variable won't be able to be transformed by pint
and suggesting the user/developer to include a mapping of units in the dimensionless_unit_mappings
@@ -110,15 +98,24 @@ def handle_unit_conversion(da: xr.DataArray, rule: Rule) -> xr.DataArray: | |||
model_unit = rule.get("model_unit") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the function handle_unit_conversion(da: xr.DataArray, rule: Rule)
would it make sense to have a 3rd argument dryrun=False
so that this function can be used before computation in a loop over the rules, to report wrong units in the model side, missing mappings for the tables, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do this as long as it has a default argument of dryrun=False
, otherwise we would break the pipeline API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, I would set the parameter to be always False
, unless specified otherwise. I suggested dryrun
as name but if you rather have something like check
that's also a good name for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of the option dryrun=False
which breaks the API, the same thing can be achieved through a command line option with a meaningful name like "check" or "validate". At the moment, the check for frequency of source data and the frequency that the data needs to be converted to (according to frequency mention in table) is implemented in CMORizer._post_init_checks
. The unit conversion check can also be added there. The command line option can skip dask cluster creating thing, construct the CMORizer object by populating the rules so that _post_init_checks
has all the information it needs to verify and validate the user inputs for errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will the following break the pipeline API? Why?
def handle_unit_conversion(da: xr.DataArray, rule: Rule, dryrun=False):
The command line option can skip dask cluster creating thing, construct the CMORizer object by populating the rules so that _post_init_checks has all the information it needs to verify and validate the user inputs for errors.
Isn't _post_init_checks
run before the dask cluster initialization?
The unit conversion check can also be added there
I agree with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will the following break the pipeline API? Why?
At the moment, any step in a pipeline must have the signature:
def my_step(data, rule):
...
return data
I think it would be cleanest to turn _post_init_checks
into a main CMORizer object method (e.g. validate
), which you can run independently of process
. In my view, constructing the CMORizer object should still be possible, even if you fill it will rubbish rules (this makes testing easier). So, something like this:
cmorizer = CMORizer.from_dict(cfg)
cmorizer.validate()
cmorizer.process()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pgierz, for clarifying that.
I think it would be cleanest to turn _post_init_checks into a main CMORizer object method (e.g. validate), which you can run independently of process.
+1
Co-authored-by: Miguel <[email protected]>
Co-authored-by: Miguel <[email protected]>
What I am still missing here is a unit test for the dimensionless unit conversion. |
I'm going to have to do a force push on this branch at some point, there are a few features mixed together in this one:
Update: No force push needed, changes in this PR now exclusively deal with the relevant conversion logic. |
def test_units_with_psu(): | ||
da = xr.DataArray(10, name="sos", attrs={"units": "psu"}) | ||
rule = Rule( | ||
cmor_variable="sos", | ||
dimensionless_unit_mappings={"sos": {"0.001": "g/kg"}}, | ||
) | ||
drv = DataRequestVariable( | ||
variable_id="sos", | ||
unit="0.001", | ||
description="Salinity of the ocean", | ||
time_method="MEAN", | ||
table="Omon", | ||
frequency="Monthly", | ||
realms="Ocean", | ||
standard_name="salinity_ocean", | ||
cell_methods="area: mean where sea", | ||
cell_measures="area: areacello", | ||
) | ||
rule.data_request_variable = drv | ||
new_da = handle_unit_conversion(da, rule) | ||
assert new_da.attrs.get("units") == "0.001" | ||
# Check the magnitude of the data | ||
# 1 g/kg is approximately 1 psu, so the values should be 10 | ||
# after conversion: | ||
assert np.allclose(new_da.values, 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something with the conversion still isn't working correctly, this new unit test fails and is an order 1000 too large (which is exactly the quantity of g per kg).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could be because of the unit defined in cf_xarray https://github.com/xarray-contrib/cf-xarray/blob/main/cf_xarray/units.py#L111
Let me try to redefine it to g/kg
and see if that test pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if 1 g/kg
is approx. is 1 psu
then psu
must be defined as "practical_salinity_unit = g/kg = psu = PSU"
instead of "practical_salinity_unit = [] = psu = PSU"
(as defined in cf_xarray) and then it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something with the conversion still isn't working correctly, this new unit test fails and is an order 1000 too large (which is exactly the quantity of g per kg).
Consider the following code snippet
>>> ureg.define('partical_salinity_unit = g/kg = psu = PSU')
>>> ureg('psu')
<Quantity(1, 'partical_salinity_unit')>
>>> ureg('psu').to('g/kg')
<Quantity(1.0, 'gram / kilogram')>
>>>
>>> ureg('psu').to('mg/g')
<Quantity(1.0, 'milligram / gram')>
>>>
>>> ureg('psu').to('kg/kg')
<Quantity(0.001, 'dimensionless')>
>>> ureg('psu').to('g/g')
<Quantity(0.001, 'dimensionless')>
As long as the salinity is expressed in gram/kilogram
or milligram/gram
the magnitude is 1
. The magnitude is 0.001
when it is expressed in kg/kg
or g/g
which means in dimensionless_units_mapping 0.001
should be either kg/kg
or g/g
.
This means, in model output data is expressed in psu (g/kg)
but in CMOR tables, the data is expected to be expressed as kg/kg
or g/g
.
1 psu => 1 g/kg => 1 g/(1000 g) => 0.001 g/g => 0.001 dimensionless
1 psu => 1 g/kg => 1 (0.001 kg/kg) => 0.001 kg/kg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@christian-stepanek, @chrisdane, could you please explain here? I'm 100% confident that we should not be changing the magnitude of the data in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, g/kg is in this case the same as the CMOR unit, just called by a different name. Unit conversions may involve factors or adding offsets, but not in this case.
src/pymorize/cmorizer.py
Outdated
break | ||
da = xr.open_dataarray(filename, use_cftime=True) | ||
try: | ||
handle_unit_conversion(da, rule) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a check. You just call the actual conversion function. That's supposed to happen in process
. I think you instead should see if the variable has supplied dimensionless units, and if those are defined in a way that pint will be able to handle it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay. will rewrite it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: how to know the unit in the netcdf file without accessing it. I can check the cmor_variable
(table unit) alone from the rule object if it dimensionless and if the unit mapping is found in dimensionless_units but how about the source units. If model_variable
is defined then I can check it as well but that means the user wants to use model_variable
instead of the units found in the netcdf file.
consider the following table
var.name | modelunit | tableunit | converted_representation |
---|---|---|---|
sidmasstranx | kg s-1 m-1 | kg s-1 | |
sisnconc | 1 | % | 100.0 percent |
sitimefrac | 1.0 | 1 | 1.0 dimensionless |
so | psu | 0.001 | 0.001 gram / gram |
If I just check the tableunit
, then I can not catch sidmasstranx
failing and for sitimefrac
the check will fail as it is dimensionless and there is no entry in the mapping for it yet but if the user puts an mapping entry for this variable then it still fails as source units (from netcdf file or user defined modelunits) are still dimensionless and the unit conversion function does not consider using mapping for source units.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The re-written code only deals with dimensionless variables (i.e., ignoring sidmasstranx
case) and addresses the rest of the cases.
"""Assign the CMIP6 frequencies to the unit registry.""" | ||
for freq_name, days in CMIP_FREQUENCIES.items(): | ||
ureg.define(f"{freq_name} = {days} * d") | ||
ureg.define("practical_salinity_unit = g/kg = psu = PSU") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ureg.define("practical_salinity_unit = g/kg = psu = PSU") |
Talking to @pgierz we decided to not support psu
in the end because in models different than FESOM there might be different species of salt and the g/kg equivalence will only be approximate. In that case, the users will need to create a rule themselves to integrate their species of salt and provide and adequate conversion.
Apologies for changing the mind about it @siligam, I have not considered these details when I suggested to support psu.
…f the data/dimensionless_mappings.yaml as the default fault to load if dimensionless_mapping_table is not given by the user
Work for dimensionless units. Will close #54