Update README #43

Merged 8 commits on Apr 19, 2024
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,4 +1,4 @@
*.zarr.json
*.lindi.json
*.nwb

.coverage
89 changes: 70 additions & 19 deletions README.md
@@ -6,16 +6,17 @@

:warning: Please note, LINDI is currently under development and should not yet be used in practice.

For a more up-to-date introduction to LINDI, see the [README on the dev branch](https://github.com/NeurodataWithoutBorders/lindi/tree/dev).
**HDF5 as Zarr as JSON for NWB**

LINDI is a Python library that facilitates handling NWB (Neurodata Without Borders) files in an efficient, flexible manner, especially when dealing with large datasets on remote servers. The goal is to enable composition of NWB files by integrating data from multiple sources without the need to copy or move large datasets.
LINDI provides a JSON representation of NWB (Neurodata Without Borders) data where the large data chunks are stored separately from the main metadata. This enables efficient storage, composition, and sharing of NWB files on cloud systems such as [DANDI](https://www.dandiarchive.org/) without duplicating the large data blobs.

LINDI features include:
LINDI provides:

- A specification for representing arbitrary HDF5 files as Zarr stores. This handles scalar datasets, references, soft links, and compound data types for datasets.
- A Zarr wrapper for remote or local HDF5 files (LindiH5ZarrStore). This involves pointers to remote files for remote data chunks.
- A function for generating a reference file system .zarr.json file from a Zarr store. This is inspired by [kerchunk](https://github.com/fsspec/kerchunk).
- An h5py-like interface for accessing these Zarr stores that can be used with [pynwb](https://pynwb.readthedocs.io/en/stable/). Both read and write operations are supported.
- A Zarr wrapper for remote or local HDF5 files (LindiH5ZarrStore).
- A mechanism for creating .lindi.json (or .nwb.lindi.json) files that reference data chunks in external files, inspired by [kerchunk](https://github.com/fsspec/kerchunk).
- An h5py-like interface for reading from and writing to these data sources that can be used with [pynwb](https://pynwb.readthedocs.io/en/stable/).
- A mechanism for uploading and downloading these data sources to and from cloud storage, including DANDI.

This project was inspired by [kerchunk](https://github.com/fsspec/kerchunk) and [hdmf-zarr](https://hdmf-zarr.readthedocs.io/en/latest/index.html) and depends on [zarr](https://zarr.readthedocs.io/en/stable/), [h5py](https://www.h5py.org/), [remfile](https://github.com/magland/remfile) and [numcodecs](https://numcodecs.readthedocs.io/en/stable/).

@@ -25,23 +26,29 @@ This project was inspired by [kerchunk](https://github.com/fsspec/kerchunk) and
pip install lindi
```

Or install from source
Or from source

```bash
cd lindi
pip install -e .
```

## Example usage
## Use cases

```python
# examples/example1.py
* Represent a remote NWB/HDF5 file as a .nwb.lindi.json file.
* Read a local or remote .nwb.lindi.json file using pynwb or other tools.
* Edit a .nwb.lindi.json file using pynwb or other tools.
* Add datasets to a .nwb.lindi.json file using a local staging area.
* Upload a .nwb.lindi.json file to a cloud storage service such as DANDI.

### Represent a remote NWB/HDF5 file as a .nwb.lindi.json file

```python
import json
import pynwb
import lindi

# Define the URL for a remote NWB file
# URL of the remote NWB file
h5_url = "https://api.dandiarchive.org/api/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/download/"

# Create a read-only Zarr store as a wrapper for the h5 file
@@ -51,7 +58,8 @@ store = lindi.LindiH5ZarrStore.from_file(h5_url)
rfs = store.to_reference_file_system()

# Save it to a file for later use
with open("example.zarr.json", "w") as f:
with open("example.lindi.json", "w") as f:
    json.dump(rfs, f, indent=2)

# Create an h5py-like client from the reference file system
@@ -63,18 +71,16 @@ with pynwb.NWBHDF5IO(file=client, mode="r") as io:
    print(nwbfile)
```

Or if you already have a .zarr.json file prepared (loading is much faster)
### Read a local or remote .nwb.lindi.json file using pynwb or other tools

```python
# examples/example2.py

import pynwb
import lindi

# Define the URL for a remote .zarr.json file
# URL of the remote .nwb.lindi.json file
url = 'https://kerchunk.neurosift.org/dandi/dandisets/000939/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/zarr.json'
Review comment (Contributor): Minor, but could you change the filename of this file to use .lindi.json?

# Load the h5py-like client from the reference file system
# Load the h5py-like client for the reference file system
client = lindi.LindiH5pyFile.from_reference_file_system(url)

# Open using pynwb
@@ -83,9 +89,54 @@ with pynwb.NWBHDF5IO(file=client, mode="r") as io:
    print(nwbfile)
```

## Mixing and matching data from multiple sources
### Edit a .nwb.lindi.json file using pynwb or other tools

```python
import json
import lindi

# URL of the remote .nwb.lindi.json file
url = 'https://lindi.neurosift.org/dandi/dandisets/000939/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/zarr.json'

# Load the h5py-like client for the reference file system
# in read-write mode
client = lindi.LindiH5pyFile.from_reference_file_system(url, mode="r+")

# Edit an attribute
client.attrs['new_attribute'] = 'new_value'

# Save the changes to a new .nwb.lindi.json file
rfs_new = client.to_reference_file_system()
with open('new.nwb.lindi.json', 'w') as f:
    f.write(json.dumps(rfs_new, indent=2, sort_keys=True))
```

### Add datasets to a .nwb.lindi.json file using a local staging area

```python
import lindi

# URL of the remote .nwb.lindi.json file
url = 'https://lindi.neurosift.org/dandi/dandisets/000939/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/zarr.json'

# Load the h5py-like client for the reference file system
# in read-write mode with a staging area
with lindi.StagingArea.create(base_dir='lindi_staging') as staging_area:
    client = lindi.LindiH5pyFile.from_reference_file_system(
        url,
        mode="r+",
        staging_area=staging_area
    )
    # add datasets to client using pynwb or other tools
    # upload the changes to the remote .nwb.lindi.json file
```

### Upload a .nwb.lindi.json file to a cloud storage service such as DANDI

Once we have NWB files represented by relatively small reference file systems (e.g., .zarr.json files), we can begin to mix and match data from multiple sources. More on this to come.
See [this example](https://github.com/magland/lindi-dandi/blob/main/devel/lindi_test_2.py).

## For developers

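As background for the README changes above: the `.lindi.json` format is described there as kerchunk-inspired, so a reference file system can be pictured as a JSON object whose `refs` map Zarr keys either to inline content or to a byte range in an external file. The sketch below is not LINDI's implementation. It assumes the kerchunk version 1 conventions (inline strings, optionally prefixed with `base64:`, and `[location, offset, length]` entries), uses a local file in place of a remote URL, and `read_ref` is a hypothetical helper name introduced only for illustration.

```python
import base64
import json
import os
import tempfile


def read_ref(rfs: dict, key: str) -> bytes:
    """Resolve one key of a kerchunk-style reference file system to bytes."""
    ref = rfs["refs"][key]
    if isinstance(ref, str):
        # Inline content: plain text, or binary data behind a "base64:" prefix
        if ref.startswith("base64:"):
            return base64.b64decode(ref[len("base64:"):])
        return ref.encode("utf-8")
    # External content: [location] (whole file) or [location, offset, length]
    location = ref[0]
    with open(location, "rb") as f:
        if len(ref) == 1:
            return f.read()
        offset, length = ref[1], ref[2]
        f.seek(offset)
        return f.read(length)


# Stand-in for a large data file: a 6-byte header followed by two 7-byte chunks
blob_path = os.path.join(tempfile.mkdtemp(), "blob.bin")
with open(blob_path, "wb") as f:
    f.write(b"HEADER" + b"chunk-0" + b"chunk-1")

# Hand-written example (an assumption for illustration, not real LINDI output)
rfs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "data/0": [blob_path, 6, 7],
        "data/1": [blob_path, 13, 7],
    },
}

print(read_ref(rfs, "data/0"))
print(read_ref(rfs, "data/1"))
```

Only the small JSON object carries the metadata, so the large blob never has to be copied. With a remote location, the same lookup would presumably become an HTTP range request rather than a local `seek`.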
2 changes: 1 addition & 1 deletion examples/example1.py
@@ -12,7 +12,7 @@
rfs = store.to_reference_file_system()

# Save it to a file for later use
with open("example.zarr.json", "w") as f:
with open("example.nwb.lindi.json", "w") as f:
    json.dump(rfs, f, indent=2)

# Create an h5py-like client from the reference file system
2 changes: 1 addition & 1 deletion examples/example2.py
@@ -1,7 +1,7 @@
import pynwb
import lindi

# Define the URL for a remote .zarr.json file
# Define the URL for a remote .nwb.lindi.json file
url = 'https://kerchunk.neurosift.org/dandi/dandisets/000939/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/zarr.json'

# Load the h5py-like client from the reference file system
4 changes: 2 additions & 2 deletions examples/example_edit_nwb.py
@@ -3,7 +3,7 @@
import pynwb


# Define the URL for a remote .zarr.json file
# Define the URL for a remote .nwb.lindi.json file
url = 'https://kerchunk.neurosift.org/dandi/dandisets/000939/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/zarr.json'

# Load the h5py-like client from the reference file system
@@ -20,7 +20,7 @@

# Optionally write to a file
# import json
# with open('new.zarr.json', 'w') as f:
# with open('new.nwb.lindi.json', 'w') as f:
#     json.dump(rfs_new, f)

# Load a new h5py-like client from the new reference file system
4 changes: 2 additions & 2 deletions lindi/LindiH5pyFile/LindiH5pyFile.py
@@ -39,7 +39,7 @@ def from_reference_file_system(rfs: Union[dict, str], mode: Literal["r", "r+"] =
        ----------
        rfs : Union[dict, str]
            The reference file system. This can be a dictionary or a URL or path
            to a .zarr.json file.
            to a .lindi.json file.
        mode : Literal["r", "r+"], optional
            The mode to open the file object in, by default "r". If the mode is
            "r", the file object will be read-only. If the mode is "r+", the
@@ -56,7 +56,7 @@ def from_reference_file_system(rfs: Union[dict, str], mode: Literal["r", "r+"] =
        if isinstance(rfs, str):
            if rfs.startswith("http") or rfs.startswith("https"):
                with tempfile.TemporaryDirectory() as tmpdir:
                    filename = f"{tmpdir}/temp.zarr.json"
                    filename = f"{tmpdir}/temp.lindi.json"
                    _download_file(rfs, filename)
                    with open(filename, "r") as f:
                        data = json.load(f)
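The hunk above accepts a dictionary, a URL, or a local path for `rfs`, and downloads a URL into a temporary directory before parsing it. The same dispatch can be sketched with the standard library alone. This is an illustrative sketch, not the code in `LindiH5pyFile.py`: `load_reference_file_system` is a hypothetical name, and it reads the HTTP response directly instead of calling the project's `_download_file` helper. Checking the parsed URL scheme also avoids the overlap in the original condition, where any string starting with "https" already starts with "http".

```python
import json
import os
import tempfile
import urllib.request
from typing import Union
from urllib.parse import urlparse


def load_reference_file_system(rfs: Union[dict, str]) -> dict:
    """Return the reference file system as a dict (hypothetical helper)."""
    if isinstance(rfs, dict):
        # Already loaded: nothing to do
        return rfs
    if urlparse(rfs).scheme in ("http", "https"):
        # Remote .lindi.json file: fetch and parse the response body
        with urllib.request.urlopen(rfs) as response:
            return json.load(response)
    # Anything else is treated as a local path
    with open(rfs, "r") as f:
        return json.load(f)


# Local round trip, so the example runs without network access
example = {"refs": {".zgroup": json.dumps({"zarr_format": 2})}}
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.lindi.json")
    with open(path, "w") as f:
        json.dump(example, f)
    loaded = load_reference_file_system(path)

print(sorted(loaded["refs"]))
```

Passing a dict straight through keeps the helper usable on an already-loaded reference file system, mirroring the `Union[dict, str]` signature documented in the hunk.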
4 changes: 2 additions & 2 deletions tests/test_remote_data.py
@@ -18,7 +18,7 @@ def test_remote_data_1():
    rfs = store.to_reference_file_system()

    # Save it to a file for later use
    with open("example.zarr.json", "w") as f:
    with open("example.nwb.lindi.json", "w") as f:
        json.dump(rfs, f, indent=2)

    # Create an h5py-like client from the reference file system
@@ -34,7 +34,7 @@ def test_remote_data_1():
def test_remote_data_2():
    import pynwb

    # Define the URL for a remote .zarr.json file
    # Define the URL for a remote .nwb.lindi.json file
    url = 'https://lindi.neurosift.org/dandi/dandisets/000939/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/zarr.json'

    # Load the h5py-like client from the reference file system