
Support .lindi.d directory representation (in addition to .lindi.tar) #92

Merged
43 commits merged on Sep 17, 2024
Changes from all commits (43 commits)
3bd8903
Split large unchunked contiguous HDF5 datasets into smaller chunks
magland Jul 9, 2024
9f9a2d4
relax timing test
magland Jul 9, 2024
a6ebaac
Add auto chunk padding in LindiH5ZarrStore and LindiReferenceFileSyst…
magland Jul 9, 2024
ac7acd2
Fix split chunk size calculations
magland Jul 13, 2024
75a95d4
Remove debug print statements from SplitDatasetH5Item
magland Jul 16, 2024
ffcd45a
Bump version to 0.3.13 in pyproject.toml
magland Jul 16, 2024
af32db8
Add method to retrieve keys in LindiH5pyFile class
magland Jul 17, 2024
24c5371
Implement items() for LindiH5pyFile
magland Jul 18, 2024
1db7e81
Merge branch 'impl-keys' into split-contiguous-datasets
magland Jul 18, 2024
b7fb391
Bump version to 0.3.14 in pyproject.toml
magland Jul 18, 2024
18e3ecc
Improve code comments
magland Jul 20, 2024
3b9f2e9
_get_padded_size()
magland Jul 20, 2024
34a21f9
lindi tar format
magland Aug 2, 2024
8c4a310
CI fix
magland Aug 2, 2024
ab9abc6
Add lindi tar nwb example
magland Aug 2, 2024
f63ac99
Update example
magland Aug 2, 2024
4e983ef
Update LindiTarFile to handle remote file operations
magland Aug 2, 2024
11174b5
Write local tar from remote h5
magland Aug 2, 2024
dec2e6e
do not rely on tarfile
magland Aug 3, 2024
46f7d5f
Custom tarfile writer
magland Aug 5, 2024
52a3672
Update readme
magland Aug 5, 2024
075419b
Update README.md: fix typo "ammending" to "amending"
magland Aug 5, 2024
dcf15ae
Update README.md
magland Aug 5, 2024
facf095
Update README.md
magland Aug 5, 2024
85d9bee
Batch set items
magland Aug 5, 2024
d591c1b
Merge branch 'lindi-tar' of https://github.com/neurodatawithoutborder…
magland Aug 5, 2024
7e3f3a7
Revise README
magland Aug 6, 2024
bbbf24e
Update README
magland Aug 6, 2024
effbb1e
Improve lindi tar functionality
magland Aug 7, 2024
06f7c8c
Add DANDI example
magland Aug 7, 2024
673a343
Fix typo in assertion message for ROS3 support check
magland Aug 7, 2024
cf4ddb1
Update index handling in tar operations
magland Aug 7, 2024
e8b8278
Update source URL or path in LindiH5pyFile classes
magland Aug 8, 2024
c1c5d02
Update internal references to remote tar file
magland Aug 8, 2024
aeac788
Update reference check to include total size in calculation
magland Aug 8, 2024
62873ea
Require .tar extension on tar lindi files
magland Aug 9, 2024
2aa0bde
Update file format description and allowed filename extension
magland Aug 9, 2024
7e5e8c1
Default chunking
magland Aug 13, 2024
32ed84b
Automatic retries for loading data from URL
magland Aug 13, 2024
1c495e5
Add exception handling for failed data loading
magland Aug 13, 2024
a585d12
Support .lindi.d directory representation
magland Aug 28, 2024
f58f883
Make pypi pre-release 0.4.0.a1
magland Aug 31, 2024
7be5fb3
Merge pull request #93 from NeurodataWithoutBorders/prerelease
rly Sep 3, 2024
2 changes: 1 addition & 1 deletion .github/workflows/linter_checks.yml
@@ -19,4 +19,4 @@ jobs:
- name: Run flake8
run: cd lindi && flake8 --config ../.flake8
- name: Run pyright
run: cd lindi && pyright
run: cd lindi && pyright .
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
benchmark.*
*.lindi
*.lindi.json*
*.nwb

2 changes: 1 addition & 1 deletion .vscode/tasks/test.sh
@@ -5,7 +5,7 @@ set -ex

cd lindi
flake8 .
pyright
pyright .
cd ..

pytest --cov=lindi --cov-report=xml --cov-report=term tests/
175 changes: 90 additions & 85 deletions README.md
@@ -6,19 +6,39 @@

:warning: Please note, LINDI is currently under development and should not yet be used in practice.

**HDF5 as Zarr as JSON for NWB**
LINDI is a cloud-friendly file format and Python library designed for managing scientific data, especially Neurodata Without Borders (NWB) datasets. It offers an alternative to [HDF5](https://docs.hdfgroup.org/hdf5/v1_14/_intro_h_d_f5.html) and [Zarr](https://zarr.dev/), maintaining compatibility with both, while providing features tailored for linking to remote datasets stored in the cloud, such as those on the [DANDI Archive](https://www.dandiarchive.org/). LINDI's unique structure and capabilities make it particularly well-suited for efficient data access and management in cloud environments.

LINDI provides a JSON representation of NWB (Neurodata Without Borders) data where the large data chunks are stored separately from the main metadata. This enables efficient storage, composition, and sharing of NWB files on cloud systems such as [DANDI](https://www.dandiarchive.org/) without duplicating the large data blobs.
**What is a LINDI file?**

LINDI provides:
A LINDI file is a cloud-friendly format for storing scientific data, designed to be compatible with HDF5 and Zarr while offering unique advantages. It comes in two forms: a JSON/text format (.lindi.json) and a binary format (.lindi.tar).

- A specification for representing arbitrary HDF5 files as Zarr stores. This handles scalar datasets, references, soft links, and compound data types for datasets.
- A Zarr wrapper for remote or local HDF5 files (LindiH5ZarrStore).
- A mechanism for creating .lindi.json (or .nwb.lindi.json) files that reference data chunks in external files, inspired by [kerchunk](https://github.com/fsspec/kerchunk).
- An h5py-like interface for reading from and writing to these data sources that can be used with [pynwb](https://pynwb.readthedocs.io/en/stable/).
- A mechanism for uploading and downloading these data sources to and from cloud storage, including DANDI.
In the JSON format, the hierarchical group structure, attributes, and small datasets are stored in a single JSON structure, with references to larger data chunks stored in external files (inspired by [kerchunk](https://github.com/fsspec/kerchunk)). This format is human-readable and easy to inspect and edit. The binary format is a .tar file that contains the same JSON file along with optional internal data chunks, which the JSON can reference alongside chunks in external files. This format allows for efficient cloud storage and random access.
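To make the JSON layout concrete, here is a minimal sketch of a kerchunk-style reference structure of the kind a .lindi.json file builds on. The content is purely illustrative (it was not generated by LINDI): small metadata is inlined, and a large chunk is stored as a reference into a hypothetical remote file.

```python
import json

# Illustrative kerchunk-style reference file system. The keys mirror
# the internal Zarr layout (.zgroup / .zattrs / .zarray); a large chunk
# is a [url, byte_offset, byte_length] triple pointing into a remote file.
refs = {
    ".zgroup": {"zarr_format": 2},
    ".zattrs": {"attr1": "value1"},
    "dataset1/.zarray": {
        "shape": [1000, 1000],
        "chunks": [500, 500],
        "dtype": "<f4",
        "compressor": None,
        "fill_value": 0,
        "order": "C",
        "zarr_format": 2,
    },
    # A chunk stored by reference rather than inline:
    "dataset1/0.0": ["https://example.org/remote.h5", 2048, 1000000],
}
text = json.dumps({"version": 1, "refs": refs}, indent=2)
print(text[:60])
```

The whole hierarchy travels as one JSON document, so a reader can fetch the complete group structure in a single request and only then pull the chunks it actually needs.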

This project was inspired by [kerchunk](https://github.com/fsspec/kerchunk) and [hdmf-zarr](https://hdmf-zarr.readthedocs.io/en/latest/index.html) and depends on [zarr](https://zarr.readthedocs.io/en/stable/), [h5py](https://www.h5py.org/) and [numcodecs](https://numcodecs.readthedocs.io/en/stable/).
The main advantage of the JSON LINDI format is its readability and ease of modification, while the binary LINDI format offers the ability to include internal data chunks, providing flexibility in data storage and retrieval. Both formats are optimized for cloud use, enabling efficient downloading and access from cloud storage.

**What are the main use cases?**

LINDI files are particularly useful in the following scenarios:

**Efficient NWB File Representation on DANDI**: A LINDI JSON file can represent an NWB file stored on the DANDI Archive (or other remote system). By downloading a condensed JSON file, the entire group structure can be retrieved in a single request, facilitating efficient loading of NWB files. For instance, [Neurosift](https://github.com/flatironinstitute/neurosift) utilizes pre-generated LINDI JSON files to streamline the loading process of NWB files from DANDI.

**Creating Amended NWB Files**: LINDI allows for the creation of amended NWB files that add new data objects to existing NWB files without duplicating the entire file. This is achieved by generating a binary LINDI file that references the original NWB file and includes additional data objects stored as internal data chunks. This approach saves storage space and reduces redundancy.

**Why not use Zarr?**

While Zarr is a cloud-friendly alternative to HDF5, it has notable limitations. Zarr archives often consist of thousands of individual files, making them cumbersome to manage. In contrast, LINDI files adopt a single-file approach similar to HDF5, enhancing manageability while retaining cloud-friendliness. Another limitation is that Zarr lacks the mechanism LINDI provides for referencing data chunks in external datasets. Additionally, Zarr does not support certain features utilized by PyNWB, such as compound data types and references, which are supported by both HDF5 and LINDI.

**Why not use HDF5?**

HDF5 is not well-suited for cloud environments because accessing a remote HDF5 file often requires a large number of small requests to retrieve metadata before larger data chunks can be downloaded. LINDI addresses this by storing the entire group structure in a single JSON file, which can be downloaded in one request. Additionally, HDF5 lacks a built-in mechanism for referencing data chunks in external datasets. Furthermore, HDF5 does not support custom Python codecs, a feature available in both Zarr and LINDI. These advantages make LINDI a more efficient and versatile option for cloud-based data storage and access.

**Does LINDI use Zarr?**

Yes, LINDI leverages the Zarr format to store data, including attributes and group hierarchies. However, instead of using directories and files like Zarr, LINDI stores all data within a single JSON structure. This structure includes references to large data chunks, which can reside in remote files (e.g., an HDF5 NWB file on DANDI) or within internal data chunks in the binary LINDI file. Although NWB relies on certain HDF5 features not supported by Zarr, LINDI provides mechanisms to represent these features in Zarr, ensuring compatibility and extending functionality.

**Is the tar format really cloud-friendly?**

With LINDI, yes. See [docs/tar.md](docs/tar.md) for details.
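The underlying principle can be sketched with the standard library's `tarfile` module (note this is only an illustration; per the commit history above, LINDI uses its own custom tar writer rather than `tarfile`): every member of a tar archive sits at a known byte offset, so a reader that knows the offsets can fetch individual members with HTTP range requests instead of downloading the whole archive.

```python
import io
import tarfile

# Build a small tar in memory with two members.
buf = io.BytesIO()
payloads = {"meta.json": b'{"hello": "world"}', "chunk0": b"\x00" * 64}
with tarfile.open(fileobj=buf, mode="w") as tf:
    for name, data in payloads.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

raw = buf.getvalue()

# Locate one member's payload by byte offset. (offset_data is an
# attribute of TarInfo in CPython's implementation.)
with tarfile.open(fileobj=io.BytesIO(raw), mode="r") as tf:
    member = tf.getmember("meta.json")
    start = member.offset_data
    end = start + member.size

# Reading raw[start:end] is the local equivalent of an HTTP
# "Range: bytes=start-(end-1)" request against a remote tar.
chunk = raw[start:end]
print(chunk)
```

The sketch only shows the principle; the details of how LINDI keeps track of member locations for remote access are in docs/tar.md.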

## Installation

@@ -33,116 +53,101 @@ cd lindi
pip install -e .
```

## Use cases
## Usage

* Lazy-load a remote NWB/HDF5 file for efficient access to metadata and data.
* Represent a remote NWB/HDF5 file as a .nwb.lindi.json file.
* Read a local or remote .nwb.lindi.json file using pynwb or other tools.
* Edit a .nwb.lindi.json file using pynwb or other tools.
* Add datasets to a .nwb.lindi.json file using a local staging area.
* Upload a .nwb.lindi.json file with staged datasets to a cloud storage service such as DANDI.
**Creating and reading a LINDI file**

### Lazy-load a remote NWB/HDF5 file for efficient access to metadata and data
The simplest way to get started is to use LINDI much as you would use HDF5.

```python
import pynwb
import lindi

# URL of the remote NWB file
h5_url = "https://api.dandiarchive.org/api/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/download/"

# Set up a local cache
local_cache = lindi.LocalCache(cache_dir='lindi_cache')

# Create the h5py-like client
client = lindi.LindiH5pyFile.from_hdf5_file(h5_url, local_cache=local_cache)

# Open using pynwb
with pynwb.NWBHDF5IO(file=client, mode="r") as io:
    nwbfile = io.read()
    print(nwbfile)

# The downloaded data will be cached locally, so subsequent reads will be faster
# Create a new lindi.json file
with lindi.LindiH5pyFile.from_lindi_file('example.lindi.json', mode='w') as f:
    f.attrs['attr1'] = 'value1'
    f.attrs['attr2'] = 7
    ds = f.create_dataset('dataset1', shape=(10,), dtype='f')
    ds[...] = 12

# Later read the file
with lindi.LindiH5pyFile.from_lindi_file('example.lindi.json', mode='r') as f:
    print(f.attrs['attr1'])
    print(f.attrs['attr2'])
    print(f['dataset1'][...])
```

### Represent a remote NWB/HDF5 file as a .nwb.lindi.json file
You can inspect the example.lindi.json file to get an idea of how the data are stored. If you are familiar with the internal Zarr format, you will recognize the .zgroup and .zarray keys and the layout of the chunks.

Because the above dataset is very small, it fits comfortably inside the JSON file. For larger arrays (the usual case) it is better to use the binary format: just leave off the .json extension.

```python
import json
import numpy as np
import lindi

# URL of the remote NWB file
h5_url = "https://api.dandiarchive.org/api/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/download/"

# Create the h5py-like client
client = lindi.LindiH5pyFile.from_hdf5_file(h5_url)

client.write_lindi_file('example.lindi.json')

# See the next example for how to read this file
# Create a new lindi binary file
with lindi.LindiH5pyFile.from_lindi_file('example.lindi.tar', mode='w') as f:
    f.attrs['attr1'] = 'value1'
    f.attrs['attr2'] = 7
    ds = f.create_dataset('dataset1', shape=(1000, 1000), dtype='f')
    ds[...] = np.random.rand(1000, 1000)

# Later read the file
with lindi.LindiH5pyFile.from_lindi_file('example.lindi.tar', mode='r') as f:
    print(f.attrs['attr1'])
    print(f.attrs['attr2'])
    print(f['dataset1'][...])
```

### Read a local or remote .nwb.lindi.json file using pynwb or other tools
**Loading a remote NWB file from DANDI**

```python
import json
import pynwb
import lindi

# URL of the remote .nwb.lindi.json file
url = 'https://lindi.neurosift.org/dandi/dandisets/000939/assets/56d875d6-a705-48d3-944c-53394a389c85/nwb.lindi.json'

# Load the h5py-like client
client = lindi.LindiH5pyFile.from_lindi_file(url)
# Define the URL for a remote NWB file
h5_url = "https://api.dandiarchive.org/api/assets/11f512ba-5bcf-4230-a8cb-dc8d36db38cb/download/"

# Open using pynwb
with pynwb.NWBHDF5IO(file=client, mode="r") as io:
# Load as LINDI and view using pynwb
f = lindi.LindiH5pyFile.from_hdf5_file(h5_url)
with pynwb.NWBHDF5IO(file=f, mode="r") as io:
    nwbfile = io.read()
    print('NWB via LINDI')
    print(nwbfile)
```

### Edit a .nwb.lindi.json file using pynwb or other tools
    print('Electrode group at shank0:')
    print(nwbfile.electrode_groups["shank0"])  # type: ignore

```python
import json
import lindi
    print('Electrode group at index 0:')
    print(nwbfile.electrodes.group[0])  # type: ignore

# URL of the remote .nwb.lindi.json file
url = 'https://lindi.neurosift.org/dandi/dandisets/000939/assets/56d875d6-a705-48d3-944c-53394a389c85/nwb.lindi.json'
# Save as LINDI JSON
f.write_lindi_file('example.nwb.lindi.json')

# Load the h5py-like client for the reference file system
# in read-write mode
client = lindi.LindiH5pyFile.from_lindi_file(url, mode="r+")
# Later, read directly from the LINDI JSON file
g = lindi.LindiH5pyFile.from_lindi_file('example.nwb.lindi.json')
with pynwb.NWBHDF5IO(file=g, mode="r") as io:
    nwbfile = io.read()
    print('')
    print('NWB from LINDI JSON:')
    print(nwbfile)

# Edit an attribute
client.attrs['new_attribute'] = 'new_value'
    print('Electrode group at shank0:')
    print(nwbfile.electrode_groups["shank0"])  # type: ignore

# Save the changes to a new .nwb.lindi.json file
client.write_lindi_file('new.nwb.lindi.json')
    print('Electrode group at index 0:')
    print(nwbfile.electrodes.group[0])  # type: ignore
```

### Add datasets to a .nwb.lindi.json file using a local staging area
## Amending an NWB file

```python
import lindi
The idea is to save the remote NWB file as a local binary LINDI file and then add new data objects to it.

# URL of the remote .nwb.lindi.json file
url = 'https://lindi.neurosift.org/dandi/dandisets/000939/assets/56d875d6-a705-48d3-944c-53394a389c85/nwb.lindi.json'

# Load the h5py-like client for the reference file system
# in read-write mode with a staging area
with lindi.StagingArea.create(base_dir='lindi_staging') as staging_area:
    client = lindi.LindiH5pyFile.from_lindi_file(
        url,
        mode="r+",
        staging_area=staging_area
    )
    # add datasets to client using pynwb or other tools
    # upload the changes to the remote .nwb.lindi.json file
```
TODO: finish this section

### Upload a .nwb.lindi.json file with staged datasets to a cloud storage service such as DANDI
## Notes

See [this example](https://github.com/magland/lindi-dandi/blob/main/devel/lindi_test_2.py).
This project was inspired by [kerchunk](https://github.com/fsspec/kerchunk) and [hdmf-zarr](https://hdmf-zarr.readthedocs.io/en/latest/index.html) and depends on [zarr](https://zarr.readthedocs.io/en/stable/), [h5py](https://www.h5py.org/) and [numcodecs](https://numcodecs.readthedocs.io/en/stable/).

## For developers
