
Read after write dev #77

Closed

Conversation

@grg2rsr (Collaborator) commented Sep 21, 2024

Not everything is implemented yet and documentation is missing, but I am putting it up now to keep things moving. Works in conjunction with convert_brainwide_map_processed_only_local_testing.py

@CodyCBakerPhD CodyCBakerPhD linked an issue Sep 23, 2024 that may be closed by this pull request
@@ -56,18 +56,20 @@
# sessions_to_run = list(set(brain_wide_sessions) - set(already_written_processed_sessions))

nwbfile_path = base_path / "nwbfiles" / session / f"{session}.nwb"
nwbfile_path.parent.mkdir(exist_ok=True)
os.makedirs(nwbfile_path.parent, exist_ok=True)
Member:
Suggested change:
-os.makedirs(nwbfile_path.parent, exist_ok=True)
+nwbfile_path.parent.mkdir(exist_ok=True)

Why change this? Pathlib is generally more convenient

Member:

Can add parents=True if that is necessary
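To illustrate the point about `parents=True`: a bare `mkdir(exist_ok=True)` fails with `FileNotFoundError` if any intermediate directory is missing, whereas `parents=True` makes it behave like `os.makedirs(..., exist_ok=True)`. A minimal sketch (the paths here are hypothetical stand-ins for the script's `base_path` layout):

```python
import tempfile
from pathlib import Path

# Hypothetical base directory standing in for base_path in the conversion script
base_path = Path(tempfile.mkdtemp())
nwbfile_path = base_path / "nwbfiles" / "session_01" / "session_01.nwb"

# mkdir(exist_ok=True) alone would raise FileNotFoundError here because
# "nwbfiles" does not exist yet; parents=True creates the whole chain,
# matching the behavior of os.makedirs(..., exist_ok=True)
nwbfile_path.parent.mkdir(parents=True, exist_ok=True)

print(nwbfile_path.parent.is_dir())  # True
```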

# Download behavior and spike sorted data for this session
session_path = base_path / "ibl_conversion" / session
cache_folder = base_path / "ibl_conversion" / session / "cache"
os.makedirs(session_path, exist_ok=True)
Member:

This should exist at this point from the previous calls

from typing import Optional

import numpy as np
from numpy import testing

def check_arrays(array_a: np.ndarray, array_b: np.ndarray, full_check: bool = False):
Member:

This style of redirection is not recommended (a helper function just to decide between uses of numpy.testing)

TBH any data array in the processed files will be small enough such that a full (non-subset) usage of numpy.testing.assert_allclose should always be fast enough to just do that

The only question for submitting would be if you want to do something like this on the raw data, but that's probably important enough for it to be a separate check altogether

Also, the NWB write step should not affect floating point precision (and if it did we would actually want to be informed about that) so I recommend using numpy.testing.assert_array_equal instead. Or in general, recommend numpy.testing.assert_array_almost_equal since it should check other aspects of the arrays beyond the data values

Collaborator Author:

> The only question for submitting would be if you want to do something like this on the raw data, but that's probably important enough for it to be a separate check altogether

I had this in mind when I was writing the function, and small speed improvements will probably add up to large total time savings. That said, I haven't done any benchmarking yet (besides the np.searchsorted way of accessing the spike times).
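For context on the np.searchsorted approach mentioned above: since spike times are sorted, a time window can be extracted with two binary searches instead of a full boolean mask. A minimal sketch with synthetic data (the array and window bounds are hypothetical):

```python
import numpy as np

# Hypothetical sorted spike-time vector (seconds)
spike_times = np.sort(np.random.default_rng(0).uniform(0, 100, size=1_000_000))

t_start, t_stop = 10.0, 20.0

# Two O(log n) binary searches locate the slice boundaries,
# instead of scanning the whole array with a boolean mask
i0, i1 = np.searchsorted(spike_times, [t_start, t_stop])
subset = spike_times[i0:i1]

# Same result as the O(n) mask-based approach
mask_subset = spike_times[(spike_times >= t_start) & (spike_times < t_stop)]
print(np.array_equal(subset, mask_subset))  # True
```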

> Also, the NWB write step should not affect floating point precision (and if it did we would actually want to be informed about that) so I recommend using numpy.testing.assert_array_equal instead. Or in general, recommend numpy.testing.assert_array_almost_equal since it should check other aspects of the arrays beyond the data values

I wasn't entirely sure whether this holds across all architectures, or whether we need to account for some other form of rounding imprecision, so I thought using allclose would allow adjusting for that if necessary; if true equality is desired, rtol=0 should do that.

Member:

One way to think of the NWB conversion process (for bulk data anyway) is that 'bytes in' are still just 'bytes out', so if you decoded the original SpikeGLX binaries into their raw bytes (things that look like \xff\x01r), those sequences should be exactly the same as when you read them from the data array after it's written to the NWB file

Floating point errors would mostly come up if you performed operations like scaling (such as by the conversion factor to get values into scientific units)

Or if you used lossless compression to save more space in the file; though that would often exceed standard 'floating point' error
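A small sketch of the two checks under discussion: for a bytes-in/bytes-out round-trip, exact equality is the right assertion, and assert_allclose with both tolerances set to zero degenerates to the same value check. The arrays here are illustrative only:

```python
import numpy as np
from numpy import testing

a = np.array([0.1, 0.2, 0.3])
b = a.copy()

# A bitwise round-trip preserves values exactly, so an exact check is appropriate
testing.assert_array_equal(a, b)

# assert_allclose with rtol=0 and atol=0 behaves as an exact check on the values
testing.assert_allclose(a, b, rtol=0, atol=0)

# Any genuine difference, however small, is caught by the exact check
try:
    testing.assert_array_equal(a, b + 1e-12)
    raised = False
except AssertionError:
    raised = True
print(raised)  # True
```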




def check_series(series_a: pd.Series, series_b: pd.Series, full_check: bool = False):
Member:

Again, not really recommended to define a one-line redirection function. Just making more work than necessary for ourselves here

Collaborator Author:

This had somewhat historical reasons, and I left it in as a stub. When I started coding I wasn't even aware of the existence of numpy.testing, so the initial data flow was a chain of calls that went from a complex datatype like a pd.DataFrame down to successively simpler ones, with an assert() call at the end handling the comparison of the selected numbers.

Things have changed, but I left the general structure up for discussion in case it could still be useful for a case I haven't thought about.

check_arrays(series_a.values, series_b.values, full_check=full_check)


def check_tables(table_a: pd.DataFrame, table_b: pd.DataFrame, naming_map: dict = None, full_check: bool = False):
Member:

If this is for checking that two pandas data frames are equivalent, why not use pandas.testing.assert_frame_equal? (And resolve any column-name mapping by creating a copy of the NWB or ONE frame and renaming the columns on the copy passed to the assertion.)

Collaborator Author:

:) thanks for pointing me towards pandas.testing
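The suggested approach can be sketched as follows. The frames and column names below are hypothetical stand-ins for the ONE and NWB trial tables; `rename` returns a new frame, so the original is untouched:

```python
import pandas as pd
from pandas import testing

# Hypothetical frames: same data, different column conventions (e.g. ONE vs NWB)
one_df = pd.DataFrame({"intervals_0": [0.0, 1.0], "intervals_1": [0.5, 1.5]})
nwb_df = pd.DataFrame({"start_time": [0.0, 1.0], "stop_time": [0.5, 1.5]})

# Illustrative naming map from ONE-style to NWB-style column names
naming_map = {"intervals_0": "start_time", "intervals_1": "stop_time"}

# rename() produces a copy with translated columns; the assertion then
# checks dtypes, index, column names, and values in one call
testing.assert_frame_equal(one_df.rename(columns=naming_map), nwb_df)
print("frames match")
```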

table_a (pd.DataFrame): _description_
table_b (pd.DataFrame): _description_
naming_map (dict, optional): if naming map is given, it is used to map the names of columns of table_a to those of table_b. Defaults to None, checks if columns are identical.
full_check (bool, optional): _description_. Defaults to False.
Member:

An index subset approach should especially not be needed for tables since any data here is already loaded into RAM (making any non-subset assertion super fast)

from pynwb import NWBHDF5IO, NWBFile
from one.api import ONE
import logging
import read_after_write as raw
Member:

Highly recommend not aliasing to real words that also have other contextual meaning.

Also don't recommend importing as an improper local module

Suggested change:
-import read_after_write as raw
+from ibl_to_nwb.brainwide_map import read_after_write

Collaborator Author:

good point

io = NWBHDF5IO(file=h5py_file, load_namespaces=True)
nwbfile = io.read()

raw.test_IblSortingInterface(nwbfile, one, eid)
Member:

Suggested change:
-raw.test_IblSortingInterface(nwbfile, one, eid)
+read_after_write.test_IblSortingInterface(nwbfile, one, eid)

Comment on lines +34 to +42
nwbfile = io.read()

raw.test_IblSortingInterface(nwbfile, one, eid)
raw.test_WheelInterface(nwbfile, one, eid)
raw.test_RoiMotionEnergyInterface(nwbfile, one, eid)
raw.test_BrainwideMapTrialsInterface(nwbfile, one, eid)
raw.test_IblPoseEstimationInterface(nwbfile, one, eid)
raw.test_LickInterface(nwbfile, one, eid)
raw.test_PupilTrackingInterface(nwbfile, one, eid)
Member:

The step of reading the NWB file and running the tests on it could be its own function, and we could even bake it into the run_conversion procedure so that it's checked every time and errors out if there's ever an issue

Collaborator Author:

I imagined that the final format would be something like what you suggested in #88; I had it in here just to verify quickly and easily that it all runs, to be integrated into a more proper structure later on.
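The wrapper idea discussed above could look roughly like the sketch below. Everything here is hypothetical: the function name, the zero-argument `read_nwbfile` callable (standing in for opening the file with NWBHDF5IO), and the dummy checks; the real checks would be the `test_*` functions from read_after_write, which share the `(nwbfile, one, eid)` signature:

```python
def run_read_after_write_checks(read_nwbfile, one, eid, checks):
    """Re-read a freshly written NWB file and run every consistency check.

    read_nwbfile: zero-argument callable returning the in-memory nwbfile
                  (in practice, NWBHDF5IO(...).read())
    checks:       iterable of callables with the (nwbfile, one, eid) signature
    Collects all failures so one bad interface doesn't mask the others,
    then raises, which would abort a run_conversion pipeline.
    """
    nwbfile = read_nwbfile()
    failures = []
    for check in checks:
        try:
            check(nwbfile, one, eid)
        except AssertionError as err:
            failures.append((check.__name__, err))
    if failures:
        names = ", ".join(name for name, _ in failures)
        raise AssertionError(f"read-after-write checks failed: {names}")

# Dummy stand-ins to illustrate the flow (not the real interface tests)
def check_passes(nwbfile, one, eid):
    assert True

run_read_after_write_checks(lambda: object(), None, "some-eid", [check_passes])
print("all checks passed")
```

Collecting failures before raising is a design choice; failing fast on the first bad check would also satisfy the "errors out if there's ever an issue" requirement.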

@CodyCBakerPhD CodyCBakerPhD marked this pull request as draft September 24, 2024 05:15
@grg2rsr (Collaborator Author) commented Sep 24, 2024

Thank you for the code review. Really helpful for me!

@CodyCBakerPhD (Member):

replaced by #88

Successfully merging this pull request may close these issues.

IBL TODO: Add roundtrip consistency tests