Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_sync and missing data #26

Open
jklymak opened this issue May 27, 2024 · 8 comments
Open

get_sync and missing data #26

jklymak opened this issue May 27, 2024 · 8 comments
Assignees

Comments

@jklymak
Copy link
Contributor

jklymak commented May 27, 2024

We sometimes have to turn failing sensors off, so many *.dbd files may be missing sensor values.

Currently if that happens, then dbd.get_sync fails with

DbdError: The requested parameter(s) was(were) not found. Parameter sci_oxy4_oxygen is an unknown glider sensor name. (delayed_raw/00890000.ebd)

However, some of the files have sci_oxy4_oxygen. It would be nice if we could just return NaN here, and maybe warn, rather than error out. I'm happy to submit a patch, but cannot for a couple of weeks. Thought I'd post here in case it is easy.

@smerckel smerckel self-assigned this May 30, 2024
@smerckel
Copy link
Owner

What exactly is done when "failing sensors are turned off"? Is the sample interval of the particular sensor just set to -1 (off), or is in addition the [st]bd_list.dat file also modified, so that the parameter is not included in the files to be transmitted? Would it be possible to provide me with some files for which the problem occurs?

It would also be helpful to know whether it is DBD.get_sync() method, or the MultiDBD.get_sync() method. For the MultiDBD.get_sync() method it can be complicated to implement something that is robust and always correct. For example if MultiDBD opens a number of files, and a few of them don't include a specific sensor, should it interpolate or return NaNs? Both options are problematic.

@jklymak
Copy link
Contributor Author

jklymak commented May 30, 2024

@smerckel Thanks - we have done both setting the sample interval off and taking the sensor out of service. In this case the error arrises in MultiDBD.get_sync and the variable does not exist in some of the files. In that situation, rather than erroring, I'd propose get_sync returns NaN for the variable for that file, or maybe make that optional behaviour.

@jklymak
Copy link
Contributor Author

jklymak commented Jul 3, 2024

@smerckel

There is a small example of data at https://cproof.uvic.ca/_temp/example_missing.tgz

If you run just_get.py then it fails as above with "sci_oxy4_oxygen" missing

If I comment the block at

if invalid_parameters:
out, then I get the behaviour I want where all the files are read, and the missing values filled with NaN.

It seems you had the desired behaviour (abaea54), but then subsequently made things more strict. 6d84d97

I'm not sure for the reason to make things more strict, so I'm not sure whether the newer behaviour was desired, or a bug. I could easily make a PR to "fix" this, or to add another kwarg to allow it to be flexible.

@smerckel
Copy link
Owner

@jklymak

It took a while to find the time for this, but if you would like to try 6ad4858 to see if this works for you that would be great.

This is now the intended behaviour:

DBD.get() for a parameter that is not listed in the header -> DbdError
DBD.get(..., check_for_invalid_parameters=False) for a parameter that is not listed in the header -> empty arrays
DBD.get() for a parameter that is listed in the header but has no data -> empty arrays + warning

MultiDBD.get() for a parameter that is not listed in any header -> DbdError
MultiDBD.get() for a parameter that is not listed in any header, except for at least one header -> only availble data

MultiDBD.get_sync() for a parameter that is not listed in any header -> DbdError
MultiDBD.get_sync() for a parameter that is not listed in any header, except for at least one header -> interpolates.
the value array is padded with leading and trailing nans, where needed.

Current behaviour is also this:
suppose MultiDBD reads in 5 files, of which the 2nd and 4th have data for parameter y, the 1st, 3rd and 5th, not. Then
MultiDBD.get_sync(x,y) would return a tuple of 3: (time, values_of_x, values_of_y). The variable values_of_y would have leading and trailing nans, but for the timestamps for file 3, values_of_y would contain the linear interpolated values from the last value of y in file 2 to the first value of y in file 4. Is that acceptable?

Also, as of this version you can put at the beginning of your script

import dbdreader
dbdreader.DBDCache('path_to_non_default_cache_directory')

to set the cache directory as a default where all instances of DBD and MultiDBD will look for caceh files by default.

Let me know how this works for you.

@jklymak
Copy link
Contributor Author

jklymak commented Jul 16, 2024

@smerckel Thanks so much for working on this.

I'm not sure what I think of

suppose MultiDBD reads in 5 files, of which the 2nd and 4th have data for parameter y, the 1st, 3rd and 5th, not. Then
MultiDBD.get_sync(x,y) would return a tuple of 3: (time, values_of_x, values_of_y). The variable values_of_y would have leading and trailing nans, but for the timestamps for file 3, values_of_y would contain the linear interpolated values from the last value of y in file 2 to the first value of y in file 4. Is that acceptable?

That is maybe OK, and consistent with what happens now I think.

However, I wonder if files with no data at all from a sensor should be NaN? eg imagine if you only sample optics on down casts to save power, then interpolating across the upcast will give pretty misleading results. I guess my understanding of how you do get_sync was file by file, so that shouldn't be too hard?

@smerckel
Copy link
Owner

@jklymak

The get_sync() method of multiDBD relies on the get() method, which does indeed work on a file per file basis. If get() detects that one or more variables don't have any data at all, then to me that looks like an error. If there is even only data point in some parameter, it will accept that. get() will return then only that data point. If this parameter is one of the parameters that is supposed to be interpolated, when calling get_sync() and friends, then this will introduce nans for the times before the first data point and after the last, essentially the behaviour of numpy.interp without boundary checking and nan as fill value.

This may produce undesirable results in these three scenarios:

  1. there is data in only down or up casts, to save energy
  2. there is data in only nth every yo
  3. there is or are some files without data (because the sensor is switched of) in between files with data.

For scenarios 1) and 2) I consider that bad luck, and it is not dbdreader's responsibility to deal with that. I suppose pyglider is more appropriate to code for that. dbdreader does not care about profiles, or splitting data into up and down casts. To do so, you would need to specify a gap length between data points that you would allow to be interpolated over. That can differ vastly between users.

My point of view is that dbdreader provides basically the get() method for both the DBD and the MultiDBD classes. The get_sync and derivates are kind of convenience methods, because they do something that I need to do quite often. I would like to keep these methods general. So, if I try to interpolate a variable that has no data, then I don't see the point in this, and I would like to be told that it does not make sense. For data processing, this may be different, and you are happy to feed lots of nan's into an array. Perhaps subclassing may be a better strategy to implement user specific needs?

This also applies to point 3. I can see why you would not want to have some variable interpolated over a whole dive segment (or more) when the sensors is switched off. This could be solved as follows. When going through all files separately, and a file returns no data for a specific sensor, then MultiDBD.get() inserts an empty array. The result is that a vector is returned that has the values as recorded and no nans. If this result is than passed on to get_sync, we get the behaviour as described. Instead of inserting empty arrays, we could also insert arrays for time, containing the minimum and maximum values of time encountered in this file, and for the value vector two nans. Then the interpolation would be fine for case 3. This would not change anything for case 1 and 2 though.

One thing that could be done to solve all cases, is to allow the user to specify an interpolation function as a keyword to MultiDBD.get_sync. Then, depending on the situation the user has, a case specific interpolation function can be written and passed to get_sync. I have implemented such a thing in the branch 8fc7075.

You can supply the keyword interpolating_function_factory to get_sync. Example code:

import dbdreader
dbd = dbdreader.MultiDBD('*.[de]bd')
sensors=['sci_water_temp', 'm_heading', 'sci_oxy4_oxygen']
t,T,heading,oxygen= dbd.get_sync(*sensors, interpolating_function_factory={"m_heading":dbdreader.heading_interpolating_function_factory})

This code snipped interpolates temperature and oxygen as usual, but uses the heading_interpolating_factor as a template for creating an interpolating function that properly interpolates heading. See the dbdreader.py module on how this is implemented. You could then write your own to deal with gaps in the data as you please.

Let me know if you find this useful, and I integrate this in the main branch. Also I would make the methods get_sync_CTD and get_xy to accept user interpolating functions factories.

@smerckel
Copy link
Owner

@jklymak

The other interpolation methods of MultiDBD have been upgraded as well and updated the documentation. I kind of like this solution. If you have no strong objections and you think your problem has been solved (or can be solved) with this approach, then let me know, and I merge it into master.

@jklymak
Copy link
Contributor Author

jklymak commented Jul 19, 2024

Hi @smerckel This sounds pretty cool and flexible. I'd have to play with it a bit, but I also agree with your major point that get_sync is a convenience, and folks who want a different behaviour could be expected to customize. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants