
Add mid/high level interface for HDF5 Dimension Scale #1124

Open · 11 tasks

fergu opened this issue Oct 21, 2023 · 1 comment

@fergu
Contributor

fergu commented Oct 21, 2023

Opening this as a stub issue for adding a mid/high level interface to the Dimension Scale functions. I meant to do this a while back when the low level library calls were added, but I never got around to it. See also #720. For anyone who is unfamiliar, I will also add a short blurb in a reply to this issue on what HDF5 dimension scales are useful for.

Here are some current ideas for things to implement. This is closer to a train of thought than anything set in stone, so I'd welcome feedback on how it could/should be changed:

High-level interface:

  • Add a subtype of the HDF5.Dataset type called HDF5.Scale for cases where a Dataset is a dimension scale. This seems like the easiest way to add this functionality to the existing interface without breaking changes, but I'm open to suggestions.
  • Add a check for existing dimension scales when reading an HDF5.Dataset, returning them alongside the dataset. Alternatively, add a function to check whether a dataset has scales (HDF5.has_scales or HDF5.dimensions_with_scales?) plus a function to read the scale given a dataset, a la HDF5.read_dimension_scale(ds::HDF5.Dataset, dim::Integer). (A usage sketch follows this list.)
  • Add a method to write dimension scales using the Dictionary interface as well as the existing write_dataset calls.
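
To make the above concrete, here is a rough sketch of how the high-level read path could look. Everything named here (HDF5.has_scales, HDF5.read_dimension_scale, the file and dataset names) is a proposal or placeholder from the list above, not existing API:

```julia
using HDF5

h5open("data.h5", "r") do file
    dset = file["temperature"]            # a plain HDF5.Dataset
    # Proposed: does any dimension of `dset` have a scale attached?
    if HDF5.has_scales(dset)
        # Proposed: return the scale on dimension 1 as an HDF5.Scale
        x = HDF5.read_dimension_scale(dset, 1)
        coords = read(x)                  # coordinate values for axis 1
    end
end
```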

Mid-level implementations of the low-level library calls (a sketch of a couple of these follows the list):

  • Add HDF5.is_scale(), which should return true if a supplied HDF5.Dataset (or HDF5.Scale) is a dimension scale
  • Add HDF5.attach_scale() and HDF5.detach_scale() to attach/detach a dimension scale from a supplied HDF5.Dataset
  • Add HDF5.set_label() and HDF5.get_label() to assign a label to a dimension scale. This might be ambiguous without the HDF5.Scale datatype, though, so it might be best to exclude this if HDF5.Scale also isn't a good idea.
  • Add HDF5.set_scale() to convert an existing HDF5.Dataset to a dimension scale. Interestingly, there does not appear to be a corresponding unset_scale library call, though it seems like the only thing that makes a dataset a scale is the attributes, so HDF5.unset_scale() might just need to delete those attributes.
  • Add HDF5.get_num_scales() to return the number of scales attached to a given HDF5.Dataset
  • Add HDF5.dimensions_with_scales() to return a Vector{Integer} of dimensions with attached scales
  • Add HDF5.get_scale_name() to return the name (label) attached to a scale, if there is one
  • Add HDF5.is_attached() to return true if a supplied HDF5.Dataset (or HDF5.Scale) is attached to a dataset as a scale.
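
As a sketch of what a couple of these wrappers might look like, assuming the low-level bindings added for #720 are exposed under HDF5.API with names mirroring the C API (h5ds_is_scale, h5ds_attach_scale, ...). The index translation shown is just one possible design choice, not settled:

```julia
# Hypothetical mid-level wrappers (names from the list above), written as if
# inside the HDF5 module:

"""
    is_scale(ds::Dataset) -> Bool

Return `true` if `ds` is a dimension scale.
"""
is_scale(ds::Dataset) = API.h5ds_is_scale(ds)

"""
    attach_scale(ds::Dataset, scale::Dataset, dim::Integer)

Attach `scale` to dimension `dim` of `ds`.
"""
function attach_scale(ds::Dataset, scale::Dataset, dim::Integer)
    # The C API counts dimensions from 0 in C (row-major) order; whether and
    # how to translate Julia's 1-based, column-major axes is a design question
    # for this interface. Subtracting 1 is just a placeholder here.
    API.h5ds_attach_scale(ds, scale, dim - 1)
end
```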
@fergu
Contributor Author

fergu commented Oct 21, 2023

For a bit of context to help the discussion, here is what dimension scales do, for anyone unfamiliar. (This blurb is written by me, not taken from the HDF5 docs or anything, so take it with a grain of salt.)

HDF5 Dimension Scales are basically just a way to attach coordinate information to a given axis of a dataset. If I am handed an HDF5 file that someone else made and want to know the coordinates associated with an axis of a dataset, I can just query the dimension scale for that axis, and it will give me a dataset with the corresponding coordinate information. This is much smoother than trying to infer which other dataset in the file is meant to be the coordinate data for that axis based on names, context, or an email from the creator of the file. It also allows multiple datasets to share a single dimension scale, all pointing to a single piece of data in the file (as opposed to copies of identical data scattered around the file). In other words, dimension scales are another tool for making HDF5 files "self-describing".

Practically, dimension scales are just regular HDF5 Datasets with some extra attributes added to track where they are being used. You can write an HDF5 dataset to file and then use the HDF5 library function h5ds_set_scale() to specify that the dataset is a scale. This adds a few attributes to the new scale to indicate things like the "name" of the scale (which is different from the path of the scale) and a list of the datasets that a given scale is attached to. The handy thing about that last point is that attaching a scale to a dataset (using h5ds_attach_scale()) also adds a link to the scale as an attribute of the target dataset. You can read that attribute/link, and it will return an HDF5.Dataset (well, currently an hid_t, but that's what this task aims to fix) directly, without having to find a name or parse a path. A minimal end-to-end example using the existing low-level wrappers is below.
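
For a concrete picture, here is a minimal end-to-end sketch using the existing low-level wrappers, assuming they are exposed under HDF5.API with names mirroring the C functions (the file and dataset names are made up):

```julia
using HDF5

h5open("example.h5", "w") do file
    file["data"] = rand(10)             # an ordinary 1-D dataset
    file["x"]    = collect(1.0:10.0)    # coordinates for that axis
    dset  = file["data"]
    scale = file["x"]

    # Mark "x" as a dimension scale; this just adds the bookkeeping
    # attributes described above (the scale's "name" is separate from its path).
    HDF5.API.h5ds_set_scale(scale, "x coordinate")

    # Attach the scale to dimension index 0 of "data". The C API uses 0-based
    # indices in C (row-major) dimension order, one of the wrinkles a
    # mid-level interface would need to smooth over for Julia users.
    HDF5.API.h5ds_attach_scale(dset, scale, 0)

    HDF5.API.h5ds_is_scale(scale)       # should now report true
end
```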
