Skip to content

Commit

Permalink
Sketch support for writing, reading sliced AwkwardArrays. (#549)
Browse files Browse the repository at this point in the history
* Sketch support for writing, reading sliced AwkwardArrays.

* Use py38-compatibility type hints.

* Add awkward to some dependency sets.

* Vendor zipfile from Python 3.9.17.

* Extract form keys with attributes from form.

* Pack before upload.

* Remove redundant default setting; param may soon be deprecated

Co-authored-by: Angus Hollands <[email protected]>

* Use public API added in awkward v2.4.0.

Co-authored-by: Angus Hollands <[email protected]>

* Handle UnionForm

For now, it will only work if len(step2) < 2. Work on the awkward side, tracked in scikit-hep/awkward#2666, is needed to enable the rest.

Co-authored-by: Jim Pivarski <[email protected]>

* Raise clear error for unsupported UnionForms.

* Add minimum version pin for awkward.

* Refactor to use Form.expected_form_buffers.

* Move buffers to dedicated route.

* Support /awkward/full

* Add support for JSON, Feather, Parquet.

* Test use of returned client and client obtained via lookup.

* Include 'buffers' link.

* Remove slicing options from export, not yet supported.

* Add file ext aliases for parquet, feather, arrow.

* Remove outdated comment.

Co-authored-by: Angus Hollands <[email protected]>

* Fix URLs.

* Add Awkward to structure lists in docs.

* Refactor AwkwardBuffersAdapter into AwkwardAdapter.

* Add awkward example.

* Document all writing methods, including new awkward ones.

* Rename AwkwardArrayClient -> AwkwardClient.

* Remove outdated plans.

* Fix bugs that failed tests

* Use allow_noncanonical_form.

---------

Co-authored-by: Angus Hollands <[email protected]>
Co-authored-by: Jim Pivarski <[email protected]>
  • Loading branch information
3 people authored Sep 28, 2023
1 parent ad8bccd commit 952d66e
Show file tree
Hide file tree
Showing 27 changed files with 3,517 additions and 19 deletions.
7 changes: 4 additions & 3 deletions .flake8
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,8 @@ exclude =
dist,
versioneer.py,
tiled/_version.py,
docs/source/conf.py
share
web-frontend
tiled/serialization/_zipfile_py39.py,
docs/source/conf.py,
share,
web-frontend,
max-line-length = 115
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ Tiled puts an emphasis on **structures** rather than formats, including:
* N-dimensional strided arrays (i.e. numpy-like arrays)
* Sparse arrays
* Tabular data (e.g. pandas-like "dataframes")
* Nested, variable-sized data (as implemented by [AwkwardArray](https://awkward-array.org/))
* Hierarchical structures thereof (e.g. xarrays, HDF5-compatible structures like NeXus)

Tiled implements extensible **access control enforcement** based on web security
Expand Down
78 changes: 70 additions & 8 deletions docs/source/explanations/structures.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,13 +10,11 @@ potentially any language.
The structure families are:

* array --- a strided array, like a [numpy](https://numpy.org) array
* awkward --- nested, variable-sized data (as implemented by [AwkwardArray](https://awkward-array.org/))
* container --- a of other structures, akin to a dictionary or a directory
* sparse --- a sparse array (i.e. an array which is mostly zeros)
* table --- tabular data, as in [Apache Arrow](https://arrow.apache.org) or
[pandas](https://pandas.pydata.org/)
* container --- a of other structures, akin to a dictionary or a directory

Support for sparse arrays and [Awkward Array](https://awkward-array.org/) are
planned.

## How structure is encoded

Expand Down Expand Up @@ -57,7 +55,7 @@ a string label for each dimension.
This `(10, 10)`-shaped array fits in a single `(10, 10)`-shaped chunk.

```
$ http :8000/metadata/small_image | jq .data.attributes.structure
$ http :8000/api/v1/metadata/small_image | jq .data.attributes.structure
```

```json
Expand Down Expand Up @@ -91,7 +89,7 @@ This `(10000, 10000)`-shaped array is subdivided into 4 × 4 = 16 chunks,
which is why the size of each chunk is given explicitly.

```
$ http :8000/metadata/big_image | jq .data.attributes.structure
$ http :8000/api/v1/metadata/big_image | jq .data.attributes.structure
```

```json
Expand Down Expand Up @@ -130,7 +128,7 @@ This is a 1D array where each item has internal structure,
as in numpy's [strucuted data types](https://numpy.org/doc/stable/user/basics.rec.html)

```
$ http :8000/metadata/structured_data/pets | jq .data.attributes.structure
$ http :8000/api/v1/metadata/structured_data/pets | jq .data.attributes.structure
```

```json
Expand Down Expand Up @@ -180,6 +178,70 @@ $ http :8000/metadata/structured_data/pets | jq .data.attributes.structure
}
```

### Awkward

[AwkwardArrays](https://awkward-array.org/) express nested, variable-sized
data, including arbitrary-length lists, records, mixed types, and missing data.
This often comes up in the context of event-based data, such as is used in
high-energy physics, neutron experiments, quantum computing, and very high-rate
detectors.

AwkwardArrays are specified by:

* An outer `length` (always an integer)
* A JSON `form` (specified by AwkwardArray, giving the internal layout)
* Named buffers of bytes, whose names match information in the `form`

The first two are included in the structure.

```
$ http :8000/api/v1/metadata/awkward_array | jq .data.attributes.structure
```

```json
{
"length": 3,
"form": {
"class": "ListOffsetArray",
"offsets": "i64",
"content": {
"class": "RecordArray",
"fields": [
"x",
"y"
],
"contents": [
{
"class": "NumpyArray",
"primitive": "float64",
"inner_shape": [],
"parameters": {},
"form_key": "node2"
},
{
"class": "ListOffsetArray",
"offsets": "i64",
"content": {
"class": "NumpyArray",
"primitive": "int64",
"inner_shape": [],
"parameters": {},
"form_key": "node4"
},
"parameters": {},
"form_key": "node3"
}
],
"parameters": {},
"form_key": "node1"
},
"parameters": {},
"form_key": "node0"
}
}

```

### Sparse Array

There are a variety of ways to represent
Expand Down Expand Up @@ -227,7 +289,7 @@ order, but we cannot make requests like "rows 100-200". (Dask has the same
limitation, for the same reason.)

```
$ http :8000/metadata/long_table | jq .data.attributes.structure
$ http :8000/api/v1/metadata/long_table | jq .data.attributes.structure
```

```json
Expand Down
58 changes: 57 additions & 1 deletion docs/source/reference/python-client.md
Original file line number Diff line number Diff line change
Expand Up @@ -89,6 +89,30 @@ structure_families, and specs of its children along with their counts.
tiled.client.container.Container.distinct
```

And, finally, there are convenience methods for writing:


```{eval-rst}
.. autosummary::
:toctree: generated
tiled.client.container.Container.create_container
tiled.client.container.Container.write_array
tiled.client.container.Container.write_awkward
tiled.client.container.Container.write_dataframe
tiled.client.container.Container.write_sparse
```

and a low-level method for creating a new node to write into:


```{eval-rst}
.. autosummary::
:toctree: generated
tiled.client.container.Container.new
```

## Structure Clients

For each *structure family* ("array", "table", etc.) there is a client
Expand Down Expand Up @@ -133,6 +157,32 @@ Tiled currently includes two clients for each structure family:
tiled.client.array.DaskArrayClient.read_block
tiled.client.array.DaskArrayClient.read
tiled.client.array.DaskArrayClient.export
tiled.client.array.DaskArrayClient.write
tiled.client.array.DaskArrayClient.write_block
```

```{eval-rst}
.. autosummary::
:toctree: generated
tiled.client.array.ArrayClient
tiled.client.array.ArrayClient.read_block
tiled.client.array.ArrayClient.read
tiled.client.array.ArrayClient.export
tiled.client.array.ArrayClient.write
tiled.client.array.ArrayClient.write_block
```

### Awkward

```{eval-rst}
.. autosummary::
:toctree: generated
tiled.client.awkward.AwkwardClient
tiled.client.awkward.AwkwardClient.read
tiled.client.awkward.AwkwardClient.write
tiled.client.awkward.AwkwardClient.export
```

```{eval-rst}
Expand All @@ -154,6 +204,8 @@ Tiled currently includes two clients for each structure family:
tiled.client.sparse.SparseClient
tiled.client.sparse.SparseClient.read
tiled.client.sparse.SparseClient.export
tiled.client.sparse.SparseClient.write
tiled.client.sparse.SparseClient.write_block
```

### DataFrame
Expand All @@ -166,6 +218,8 @@ Tiled currently includes two clients for each structure family:
tiled.client.dataframe.DaskDataFrameClient.read_partition
tiled.client.dataframe.DaskDataFrameClient.read
tiled.client.dataframe.DaskDataFrameClient.export
tiled.client.dataframe.DaskDataFrameClient.write
tiled.client.dataframe.DaskDataFrameClient.write_partition
```

```{eval-rst}
Expand All @@ -175,7 +229,9 @@ Tiled currently includes two clients for each structure family:
tiled.client.dataframe.DataFrameClient
tiled.client.dataframe.DataFrameClient.read_partition
tiled.client.dataframe.DataFrameClient.read
tiled.client.dataframe.DaskDataFrameClient.export
tiled.client.dataframe.DataFrameClient.export
tiled.client.dataframe.DataFrameClient.write
tiled.client.dataframe.DataFrameClient.write_partition
```

### Xarray Dataset
Expand Down
6 changes: 5 additions & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,7 @@ all = [
"anyio",
"asyncpg",
"appdirs",
"awkward >=2.4.3",
"blosc",
"cachetools",
"cachey",
Expand Down Expand Up @@ -102,6 +103,7 @@ array = [
# This is the "kichen sink" fully-featured client dependency set.
client = [
"appdirs",
"awkward >=2.4.3",
"blosc",
"click !=8.1.0",
"dask[array]",
Expand Down Expand Up @@ -215,8 +217,9 @@ server = [
"aiosqlite",
"alembic",
"anyio",
"asyncpg",
"appdirs",
"asyncpg",
"awkward >=2.4.3",
"blosc",
"cachetools",
"cachey",
Expand Down Expand Up @@ -310,6 +313,7 @@ exclude = '''
)/
| versioneer.py
| tiled/_version.py
| tiled/serialization/_zipfile_py39.py
| docs/source/conf.py
| share
| web-frontend
Expand Down
118 changes: 118 additions & 0 deletions tiled/_tests/test_awkward.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,118 @@
import io

import awkward
import pyarrow.feather
import pyarrow.parquet

from ..catalog import in_memory
from ..client import Context, from_context, record_history
from ..server.app import build_app
from ..utils import APACHE_ARROW_FILE_MIME_TYPE


def test_slicing(tmpdir):
catalog = in_memory(writable_storage=tmpdir)
app = build_app(catalog)
with Context.from_app(app) as context:
client = from_context(context)

# Write data into catalog. It will be stored as directory of buffers
# named like 'node0-offsets' and 'node2-data'.
array = awkward.Array(
[
[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
[],
[{"x": 3.3, "y": [1, 2, 3]}],
]
)
returned = client.write_awkward(array, key="test")
# Test with client returned, and with client from lookup.
for aac in [returned, client["test"]]:
# Read the data back out from the AwkwardArrrayClient, progressively sliced.
assert awkward.almost_equal(aac.read(), array)
assert awkward.almost_equal(aac[:], array)
assert awkward.almost_equal(aac[0], array[0])
assert awkward.almost_equal(aac[0, "y"], array[0, "y"])
assert awkward.almost_equal(aac[0, "y", :1], array[0, "y", :1])

# When sliced, the serer sends less data.
with record_history() as h:
aac[:]
assert len(h.responses) == 1 # sanity check
full_response_size = len(h.responses[0].content)
with record_history() as h:
aac[0, "y"]
assert len(h.responses) == 1 # sanity check
sliced_response_size = len(h.responses[0].content)
assert sliced_response_size < full_response_size


def test_export_json(tmpdir):
catalog = in_memory(writable_storage=tmpdir)
app = build_app(catalog)
with Context.from_app(app) as context:
client = from_context(context)

# Write data into catalog. It will be stored as directory of buffers
# named like 'node0-offsets' and 'node2-data'.
array = awkward.Array(
[
[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
[],
[{"x": 3.3, "y": [1, 2, 3]}],
]
)
aac = client.write_awkward(array, key="test")

file = io.BytesIO()
aac.export(file, format="application/json")
actual = bytes(file.getbuffer()).decode()
assert actual == awkward.to_json(array)


def test_export_arrow(tmpdir):
catalog = in_memory(writable_storage=tmpdir)
app = build_app(catalog)
with Context.from_app(app) as context:
client = from_context(context)

# Write data into catalog. It will be stored as directory of buffers
# named like 'node0-offsets' and 'node2-data'.
array = awkward.Array(
[
[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
[],
[{"x": 3.3, "y": [1, 2, 3]}],
]
)
aac = client.write_awkward(array, key="test")

filepath = tmpdir / "actual.arrow"
aac.export(str(filepath), format=APACHE_ARROW_FILE_MIME_TYPE)
actual = pyarrow.feather.read_table(filepath)
expected = awkward.to_arrow_table(array)
assert actual == expected


def test_export_parquet(tmpdir):
catalog = in_memory(writable_storage=tmpdir)
app = build_app(catalog)
with Context.from_app(app) as context:
client = from_context(context)

# Write data into catalog. It will be stored as directory of buffers
# named like 'node0-offsets' and 'node2-data'.
array = awkward.Array(
[
[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}],
[],
[{"x": 3.3, "y": [1, 2, 3]}],
]
)
aac = client.write_awkward(array, key="test")

filepath = tmpdir / "actual.parquet"
aac.export(str(filepath), format="application/x-parquet")
actual = pyarrow.parquet.read_table(filepath)
expected = awkward.to_arrow_table(array)
assert actual == expected
Loading

0 comments on commit 952d66e

Please sign in to comment.