
DV's fuse netcdf proposal

Declan Valters edited this page May 30, 2018 · 1 revision

Proposed solution

The proposed solution will make inspecting and editing netCDF files more user-friendly and intuitive than is possible with the commonly used netCDF utilities, such as "ncview", "ncdump", CDO (Climate Data Operators) and NCO (netCDF Operators). This will be achieved by making netCDF files appear as mountable objects in the filesystem: a Python script will take a specified netCDF file and mount it at a mount point of the user's choice. The netCDF file will then appear as a set of directories at that mount point, with a hierarchical structure of subdirectories representing netCDF variables. Each variable directory will contain text file(s) with the variable's metadata and either an image file or a text-based representation of each data array. Data with more than two dimensions will automatically be populated into a series of subdirectories for each dimension, allowing, for example, easy viewing of model data that has multiple levels associated with it.
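As a rough illustration of the directory layout described above, the sketch below maps variable shapes to virtual paths. The function name `virtual_paths`, the file names `metadata.txt` and `data.txt`, and the choice of one subdirectory level per leading dimension are all assumptions made for the example, not part of the proposal.

```python
from itertools import product


def virtual_paths(variables):
    """Return the sorted virtual file paths for a {name: shape} mapping.

    Assumed layout: arrays with at most two dimensions get a single data
    file; arrays with more dimensions get one subdirectory level per
    leading dimension, with a 2-D slice file at the bottom.
    """
    paths = []
    for name, shape in variables.items():
        paths.append(f"/{name}/metadata.txt")
        leading = shape[:-2] if len(shape) > 2 else ()
        # product() over no ranges yields one empty index, i.e. no subdirs.
        for index in product(*(range(n) for n in leading)):
            subdirs = "".join(f"/{i}" for i in index)
            paths.append(f"/{name}{subdirs}/data.txt")
    return sorted(paths)
```

For a variable `temperature` with shape `(2, 4, 5)`, this yields `/temperature/0/data.txt` and `/temperature/1/data.txt`, one file per level.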

The solution will implement a FUSE (Filesystem in Userspace) driver using the "fusepy" library. This will avoid any need for root or administrative permissions to use the tool. The FUSE approach also has other advantages: if the FUSE driver crashes, the operating system kernel is unaffected, so a bug in the driver cannot cause a kernel panic. Developing the FUSE driver in Python with fusepy will also be rapid compared to using the traditional C-language FUSE library directly. This will aid the successful development of the driver over the short four-month development period.
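A minimal sketch of what a read-only fusepy driver might look like is given below. The class `DictFS` and its in-memory `{path: bytes}` backing store are hypothetical stand-ins for the netCDF-backed driver; only `Operations`, `FuseOSError` and the `getattr`/`readdir`/`read` method signatures come from fusepy. The fallback classes exist purely so the sketch runs on machines without fusepy or libfuse.

```python
import errno
import stat

try:
    from fuse import FuseOSError, Operations  # provided by fusepy
except (ImportError, OSError):
    # Fallbacks so the sketch still runs where fusepy or libfuse is missing.
    class Operations:
        pass

    class FuseOSError(OSError):
        def __init__(self, code):
            super().__init__(code, "FUSE error")


class DictFS(Operations):
    """Read-only filesystem backed by a {path: bytes} dict (hypothetical)."""

    def __init__(self, files):
        self.files = files
        # Every ancestor of a file path is treated as a directory.
        self.dirs = {"/"}
        for path in files:
            parts = path.strip("/").split("/")[:-1]
            for i in range(1, len(parts) + 1):
                self.dirs.add("/" + "/".join(parts[:i]))

    def getattr(self, path, fh=None):
        if path in self.dirs:
            return {"st_mode": stat.S_IFDIR | 0o555, "st_nlink": 2}
        if path in self.files:
            return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                    "st_size": len(self.files[path])}
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        prefix = path.rstrip("/") + "/"
        children = {p[len(prefix):].split("/")[0]
                    for p in list(self.files) + list(self.dirs)
                    if p.startswith(prefix) and p != prefix}
        return [".", ".."] + sorted(children)

    def read(self, path, size, offset, fh):
        return self.files[path][offset:offset + size]
```

With fusepy installed, such a class would be mounted with something like `FUSE(DictFS(files), mountpoint, foreground=True, ro=True)`, where `FUSE` is imported from the `fuse` module.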

The solution will use the netCDF4-python library to decode netCDF data from the user-specified location. However, due to the speed limitations of netCDF4-python, initial work will also explore using the "pynco" module, a Python wrapper around the NCO (netCDF Operators) tools, which may offer significant speed advantages over netCDF4-python. Work done in the pilot stage of the project will assess whether speed benefits can come from using pynco in place of (or possibly in addition to) the commonly used netCDF4-python library.
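The pilot-stage comparison could be run with a small timing harness along the following lines. This is only a sketch: `time_readers` is a hypothetical helper, and the reader callables passed to it (one wrapping each candidate library) are assumed to take a file path and read the data of interest.

```python
import timeit


def time_readers(readers, path, repeat=3):
    """Return {name: best wall-clock seconds} for each reader callable.

    `readers` maps a label to a callable taking a file path; each callable
    is run `repeat` times and the fastest run is kept.
    """
    results = {}
    for name, reader in readers.items():
        timings = timeit.repeat(lambda: reader(path), number=1, repeat=repeat)
        results[name] = min(timings)
    return results
```

Taking the minimum over repeats reduces the influence of disk caching and other background noise on the comparison.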

The software implementation will then take the decoded netCDF data and present it in the appropriate format at the mount point set up by the fusepy module. The conversion of data will be aided by Python packages such as PIL (the Python Imaging Library), where appropriate for the data type. The initial pilot stage of the project will scope out which libraries are most appropriate, seeking to balance the functionality of the newly developed software against dependency bloat.
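To illustrate the conversion step without any third-party dependency, the sketch below renders a 2-D array as comma-separated text and as a plain-text PGM greyscale image. The PGM output is a stand-in chosen for the example; the proposal itself mentions PIL for image output. The function names `to_text` and `to_pgm` are hypothetical.

```python
def to_text(rows):
    """Render a 2-D list of numbers as comma-separated lines."""
    return "".join(",".join(f"{v:g}" for v in row) + "\n" for row in rows)


def to_pgm(rows):
    """Render a 2-D list of numbers as plain PGM (P2) bytes, scaled to 0-255."""
    flat = [v for row in rows for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid dividing by zero for constant fields
    lines = ["P2", f"{len(rows[0])} {len(rows)}", "255"]
    for row in rows:
        lines.append(" ".join(str(round(255 * (v - lo) / span)) for v in row))
    return ("\n".join(lines) + "\n").encode("ascii")
```

Either function could supply the bytes returned when a file in the mounted filesystem is read.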

Modifications made to the data in the mounted filesystem will be passed back to the original netCDF file, so deleting a variable in the virtual filesystem will remove it from the original netCDF file. However, we will investigate implementing safety features for the user, such as writing backup files, or providing options to mount the netCDF file in 'read-only' mode or 'dummy-mode' for users who want to explore the virtual netCDF filesystem without making permanent changes to the original netCDF file.
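One way the mount modes might be enforced is sketched below. The class `WriteGuard`, the mode names, and the `delete` callback are assumptions for illustration; in particular, recording changes in a `pending` list is just one possible reading of 'dummy-mode'.

```python
import errno


class WriteGuard:
    """Decide what a destructive operation does under each mount mode."""

    MODES = {"read-write", "read-only", "dummy"}

    def __init__(self, mode="read-write"):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        self.mode = mode
        self.pending = []  # changes recorded but not applied in dummy mode

    def delete_variable(self, variable, delete):
        """Apply, record, or refuse the deletion of `variable`.

        `delete` is a hypothetical callable that removes the variable
        from the underlying netCDF file.
        """
        if self.mode == "read-only":
            raise OSError(errno.EROFS, "read-only mount")
        if self.mode == "dummy":
            self.pending.append(("delete", variable))
            return
        delete(variable)
```

Raising `EROFS` mirrors what a conventional read-only filesystem reports, so standard tools such as `rm` would show a familiar error message.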

Optimisation:

To avoid excessive computational demand when opening a large, multidimensional netCDF file, a "lazy-loading" approach will be used for large files: the contents of subdirectories (such as image files) will be generated only when they are requested.
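A minimal sketch of lazy loading, assuming a hypothetical `render` callable that produces the bytes for a given virtual path: nothing is generated until a path is first read, and a bounded cache from the standard library avoids regenerating recently requested files.

```python
from functools import lru_cache


def make_lazy_reader(render, max_entries=128):
    """Wrap `render` (path -> bytes) so each path is generated on first use.

    At most `max_entries` results are kept; the least recently used
    entry is discarded when the cache is full.
    """
    @lru_cache(maxsize=max_entries)
    def read(path):
        return render(path)

    return read
```

Bounding the cache keeps memory use predictable for files with many slices, at the cost of occasionally regenerating an evicted entry.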

Software Development Life-cycle:

After a period of gathering the initial requirements for the software and finalising the initial specification, a test-driven development approach will be used to develop the netcdf-fusepy software in Python. Software functionality will be built up iteratively using Python's built-in unit testing framework, "unittest". As the software matures and features are refined, a series of integration and regression tests will be added to check that new functionality introduced at a later date does not break existing features. Ideally these tests will run in a continuous-integration environment, but this will depend on the infrastructure available on the ECMWF computing systems. Documentation will be developed at the API level through Python docstrings, alongside a user guide giving a general overview of how to use the software and the functionality available.
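In a test-driven workflow with "unittest", a small helper and its tests might look like the sketch below. The helper `split_virtual_path` and the path convention it parses are hypothetical, chosen only to show the shape of such tests.

```python
import unittest


def split_virtual_path(path):
    """Split '/temperature/0/data.txt' into (variable, indices, filename)."""
    parts = path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"not a file inside a variable directory: {path!r}")
    variable, *middle, filename = parts
    return variable, tuple(int(p) for p in middle), filename


class SplitVirtualPathTest(unittest.TestCase):
    def test_two_dimensional_variable(self):
        self.assertEqual(split_virtual_path("/lat/data.txt"),
                         ("lat", (), "data.txt"))

    def test_levels_become_indices(self):
        self.assertEqual(split_virtual_path("/temperature/0/3/data.txt"),
                         ("temperature", (0, 3), "data.txt"))

    def test_rejects_bare_directory(self):
        with self.assertRaises(ValueError):
            split_virtual_path("/lat")


def run_suite():
    """Run the tests above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SplitVirtualPathTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The same test classes could later be collected by a continuous-integration job using `python -m unittest`.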
