Functionality questions for ome-zarr formatted data #181

Open
Christianfoley opened this issue Oct 14, 2022 · 19 comments
@Christianfoley

Christianfoley commented Oct 14, 2022

Greetings from the Mehta Lab and apologies in advance for the long post!
I am attempting to use gunpowder as a dataloader for (float32) data in the OME-Zarr format. I have run into a few issues trying to get it to work with some of my data; I have enumerated my questions below.

Support for multiple zarr stores in the OME-HCS Zarr format
If I have data stored in OME-Zarr format as a series of hierarchical groups (row > col > position > data_arrays), datasets inside a source node have to be specified with the full hierarchy path to the dataset:

raw = gp.ArrayKey('RAW')
source = gp.ZarrSource(
    filename=zarr_dir,
    datasets={raw: 'Row_0/Col_1/Pos_1/arr_0'},
    array_specs={raw: gp.ArraySpec(interpolatable=True)}
)

Because of this format, arrays in different rows, which are all part of one 'dataset', end up stored in different zarr stores. Is it possible to create a single source that can access multiple zarr stores?

Inconsistent behavior of BatchRequest objects
When applying some augmentations (for example the SimpleAugment node), reusing a BatchRequest without redefining the request or pipelines sometimes results in data returned with the wrong indices:

For example, I define a dataset and two pipelines, one with and one without a SimpleAugment node:

raw = gp.ArrayKey('RAW')

source = gp.ZarrSource(
    zarr_dir,  # the zarr container
    {raw: 'Row_0/Col_1/Pos_1/arr_0'},  # arr_0 is 3 channels of 3D image stacks, dims: (1, 3, 41, 2048, 2048) 
    {raw: gp.ArraySpec(interpolatable=True)} 
)

simple_augment = gp.SimpleAugment(transpose_only=(-1,-2))

pipelines = [source, source + simple_augment]

Then I define a batch request:

request = gp.BatchRequest()
request[raw] = gp.Roi((0,0,0,0,0), (1,3,1,768,768))

Then I use that request to generate two batches from each pipeline in sequence:

#First loop is fine, second loop has 2nd dimension flipped/transposed
for n in range(2):
  batches = []
  for pipeline in pipelines: #for both augmented and plain pipeline
    with gp.build(pipeline):
      batch = pipeline.request_batch(request) #get batch
      batches.append(batch)

  # visualize the content of the batches
  fig, ax = plt.subplots(len(pipelines), 3, figsize = (14,10))
  for i in range(len(pipelines)):
    for j in range(3):
      ax[i][j].imshow(batches[i][raw].data[0,j,0])
  ax[0][1].set_title('source')
  ax[1][1].set_title('source + aug')    
  plt.show()

The result is the following:
[Screenshots: visualization of the batch from loop 1 (left) and from loop 2 (right); in the second loop the channels come back reversed.]

I am confused as to why the behavior changes when the data, pipeline, and batch request haven't changed. Is there a reason that the second augmentation batch returns with reversed channels?


Thanks!!

@mattersoflight

Thanks for summarizing your attempts @Christianfoley.
👋🏼 @pattonw @sheridana - any tips on how to handle our data?

Some notes for context:

  • We are upgrading our image translation/virtual staining pipeline from TF 1.x to PyTorch (Draft PR for code review: Pytorch implementation mehta-lab/microDL#176). We also are eager to switch over to a better data augmentation library in the process!
  • I am pasting an example hierarchy of the data (input and output channels). We split the train/validation/test sets by field of view (Pos*). The hierarchy is based on the OME-HCS format.
[shalin.mehta@login01 training_2022_04_20_Phase1e-3_Denconv_Nuc8e-4_Mem8e-4_pad15_bg50.zarr]$ tree -L 4
.
└── Row_0
    ├── Col_0
    │   └── Pos0
    │       └── arr_0
    ├── Col_1
    │   └── Pos1
    │       └── arr_0
    ├── Col_10
    │   └── Pos10
    │       └── arr_0

@pattonw
Collaborator

pattonw commented Oct 14, 2022

Regarding the simple augment and the reversed channels, the SimpleAugment applies both Transpose and Mirror operations randomly. You limit the transpose operation to the last 2 dimensions, but the SimpleAugment is free to randomly mirror any axis, resulting in the channels sometimes being flipped.
Gunpowder has implicit support for "spatial axes" and "non-spatial axes". If you have an array of shape [batch, channels, z, y, x] and you define in the associated ArraySpec that voxel_size=(1,1,1), gunpowder will only consider the last 3 dimensions (z, y, x) to be spatial and will avoid applying things like mirror/transpose/elastic augmentations to non-spatial axes. We do assume that spatial axes always come last, so using [z, y, x, channels, batch] would not work.
When you define your source, I would recommend including voxel size metadata in the ArraySpec, or adding it to the array metadata:

source = gp.ZarrSource(
    zarr_dir,  # the zarr container
    {raw: 'Row_0/Col_1/Pos_1/arr_0'},  # arr_0 is 3 channels of 3D image stacks, dims: (1, 3, 41, 2048, 2048)
    {raw: gp.ArraySpec(interpolatable=True, voxel_size=gp.Coordinate((4, 4, 4)))}
)

@pattonw
Collaborator

pattonw commented Oct 14, 2022

We don't have a source for reading from OME Zarr. Our ZarrSource is pretty basic and avoids making any assumptions about the layout of the zarr. The only assumptions that are made are checking for "resolution" and "offset" metadata. I think the best approach would be to write a custom OMEZarrSource that makes more assumptions about the layout of your zarr datasets.

@funkey
Member

funkey commented Oct 14, 2022

Is it possible to create a single source that can access multiple zarr stores?

It is! You can create as many zarr sources as you like, pointing to different zarr containers and/or datasets. You can then combine them using RandomProvider, which will randomly pick any of the sources to satisfy a request. The ZarrSource is a lightweight object; it will not hurt to have hundreds or thousands of them in your pipeline.

@funkey
Member

funkey commented Oct 14, 2022

We don't have a source for reading from OME Zarr. Our ZarrSource is pretty basic and avoids making any assumptions about the layout of the zarr. The only assumptions that are made are checking for "resolution" and "offset" metadata. I think the best approach would be to write a custom OMEZarrSource that makes more assumptions about the layout of your zarr datasets.

That would be a cleaner solution, agreed. For that, it would be helpful to know how you plan to use the OME zarr datasets during training. Is the idea to select a random row/column for each sample in a batch? Or will it be necessary to request specific rows/columns?

@pattonw
Collaborator

pattonw commented Oct 14, 2022

You can then combine them using RandomProvider, which will randomly pick any of the sources to satisfy a request.

Ah, I may have misunderstood. I thought the Row/Col/Pos hierarchy related the arrays spatially and that you essentially wanted to tile them together into one larger array, which would be difficult to do with our existing nodes. Pairing many raw datasets with their associated gt and then randomly sampling is well supported.

@Christianfoley
Author

Hi Jan and William! Thank you so much for your quick and informative responses.

When you define your source I would recommend including voxel size metadata, or adding it to the array metadata
I was having trouble understanding the voxel size metadata! Thank you, this makes much more sense.

For that, it would be helpful to know how you plan to use the OME zarr datasets during training. Is the idea to select a random row/column for each sample in a batch? Or will it be necessary to request specific rows/columns?

We should be able to consider all arrays in all rows/columns for random selection. That said, we only want to load tiles (256x256) whose field of view meets certain qualifications; much of our actual image data is simply background that lacks significant features, and we've shown that selecting only regions with some percentage of foreground significantly improves network performance. Ideally we would like to do this without making another copy of the data.

We could do this on the fly by loading in a region and computing whether it meets the qualifications, but I think it may be more efficient to precompute these regions, and have some metadata specifying which regions contain foreground features. This way regions don't get loaded and then dumped when they aren't up to par.
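The precomputation idea can be prototyped outside gunpowder with a summed-area (integral) table over a foreground mask; the function name, tile size, and 30% threshold below are illustrative, not part of any existing API:

```python
import numpy as np

def foreground_tiles(mask: np.ndarray, tile: int = 256, min_fraction: float = 0.3):
    """Return (row, col) origins of tiles whose foreground fraction >= min_fraction."""
    # summed-area table with a zero border, so each tile sum is a 4-corner lookup
    integral = np.zeros((mask.shape[0] + 1, mask.shape[1] + 1), dtype=np.int64)
    integral[1:, 1:] = np.cumsum(np.cumsum(mask.astype(np.int64), axis=0), axis=1)

    origins = []
    for r in range(0, mask.shape[0] - tile + 1, tile):
        for c in range(0, mask.shape[1] - tile + 1, tile):
            total = (
                integral[r + tile, c + tile]
                - integral[r, c + tile]
                - integral[r + tile, c]
                + integral[r, c]
            )
            if total / (tile * tile) >= min_fraction:
                origins.append((r, c))
    return origins

# toy mask: foreground only in the top-left quadrant
mask = np.zeros((512, 512), dtype=bool)
mask[:256, :256] = True
print(foreground_tiles(mask))  # [(0, 0)]
```

The qualifying origins could then be stored as metadata alongside the zarr arrays and sampled from directly, so unqualified regions are never read.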

On the topic of loading smaller regions, I noticed that it took significantly more time to load a 256x256 ROI from a larger 2048x2048 image than it did to load the same region (256x256) from a pre-saved tile of that region.
Is this because gunpowder is reading the full array into memory and then selecting that cropped ROI? If so, how can we go about requesting smaller regions from larger arrays most efficiently? Is there a way to do this in a time somewhat comparable to loading the entirety of a smaller array?

@pattonw
Collaborator

pattonw commented Oct 15, 2022

We could do this on the fly by loading in a region and computing whether it meets the qualifications, but I think it may be more efficient to precompute these regions, and have some metadata specifying which regions contain foreground features. This way regions don't get loaded and then dumped when they aren't up to par.

Yes, we have also found that avoiding excessive sampling from background regions to be helpful. Gunpowder supports both lazy and precomputed methods for doing this.

  1. Lazy: There is a Reject node that can reject batches based on the percentage of masked-in data.
  2. Precomputed: The RandomLocation node also supports taking a mask as an optional input. Given a mask, it will precompute an integral array of the mask for quick and easy sampling of locations with at least some percentage of masked-in voxels, at the cost of storing a full copy of the mask in memory.

On the topic of loading smaller regions, I noticed that it took significantly more time to load a 256x256 ROI from a larger 2048x2048 image than it did to load the same region (256x256) from a pre-saved tile of that region.
Is this because gunpowder is reading the full array into memory and then selecting that cropped ROI? If so, how can we go about requesting smaller regions from larger arrays most efficiently? Is there a way to do this in a time somewhat comparable to loading the entirety of a smaller array?

Hmm, without more details on your specific files I'm not sure I can determine the cause of the slowdown. If you're loading from a chunked storage format such as zarr, it is unlikely that you will be reading exactly along block boundaries, so you will have to read and discard some extra data. Reading from a zarr will therefore always come with some small penalty, but I don't know of any format that avoids this penalty while still allowing you to read random crops from a large volume with reasonable performance. If performance is a problem, I recommend looking at the PreCache node, which starts multiple CPU workers to prepare batches for you; in my experience this has usually been sufficient for keeping up with the GPU. There is also a PrintProfilingStats node that tracks timings and helps you identify which parts of your pipeline are most time consuming.

@Christianfoley
Author

You can then combine them using RandomProvider, which will randomly pick any of the sources to satisfy a request.

Thanks! I was able to use this node to get the random sampling working, but only after providing my sources as a tuple. Ex:

pipeline = (source_1 + source_2 + source_3) + gp.RandomProvider() # does not work, only returns from last provider
pipeline = (source_1, source_2, source_3) + gp.RandomProvider() # does work, chooses randomly from providers

In the API reference, the example usage is: (a + b + c) + RandomProvider(), which had me confused for a bit.

@pattonw
Collaborator

pattonw commented Oct 19, 2022

Ah yes. That was recently fixed, but we have not yet made a new release to propagate the changes.

@Christianfoley
Author

Christianfoley commented Nov 8, 2022

Hi @funkey @pattonw, thanks again for your help so far. I have been running into an issue with building pipelines and it would be very helpful if you could help me understand a few things:

Currently I am creating multiple pipelines, one each for training, validation, and test data. I generally draw samples from the validation set every x epochs to generate a loss value for the learning rate scheduler. However, when I do this, I get the error: pipeline.setup() called more than once (build() inside build()?)

  • Is there a recommended way to draw from multiple pipelines during training?
    Example:
    scheduler = lr_scheduler.some_scheduler()
    
    with gp.build(pipeline1):
        for i in range(epochs):
            for sample in dataloader_from_pipeline1:
                #training loop ...
            with gp.build(pipeline2): #nesting issue caused here
                for sample in dataloader_from_pipeline2:
                    #validation/testing loop ...
            scheduler.step(loss_from_validation_loop)

Additionally, I don't quite understand the functionality of build(). When I wrap the batch request inside a with gp.build(some_pipeline): statement, my pipelines build fine. However, when I just call gp.build(pipeline) and then make batch requests sequentially, I have issues. Is there a reason for this?

  • In essence, why can I not simply call build when I initialize the pipeline, like so:

    class Dataset():
        def __init__():
            #init some pipeline...
            gp.build(some_pipeline) #build pipeline on initiation
            self.pipeline = some_pipeline
            self.request = some_request
            self.key = some_key
        def __getitem__():
            sample = self.pipeline.request_batch(request=self.request)
            sample_data = sample[self.key].data
            return sample_data
    
    train_dataset = Dataset()
    dataloader = Dataloader(train_dataset)
    def run_one_epoch():
        for sample in dataloader():
            #training loop ...

    The above results in an error deep inside the gunpowder call stack:

    TypeError: argument of type 'NoneType' is not iterable
    

@pattonw
Collaborator

pattonw commented Nov 8, 2022

Is there a recommended way to draw from multiple pipelines during training?

This should be possible. Here's a quick working example:

import zarr
import gunpowder as gp
import numpy as np

container = zarr.open("test.zarr")
x, y = np.meshgrid(range(10),range(10))
container["test"] = x + y

test_key = gp.ArrayKey("TEST")
pipeline1 = gp.ZarrSource("test.zarr", {test_key:"test"})
pipeline2 = gp.ZarrSource("test.zarr", {test_key:"test"})

request = gp.BatchRequest()
request.add(test_key, (5,5))
with gp.build(pipeline1):
    for _ in range(10):
        batch1 = pipeline1.request_batch(request)
        assert test_key in batch1

        with gp.build(pipeline2):
            batch2 = pipeline2.request_batch(request)
            assert test_key in batch2

You should only run into problems if you attempt to build the same pipeline multiple times, or maybe if you share nodes between both pipelines.

In essence, why can I not simply call build when I initialize the pipeline, like so:

The problem is that gp.build(some_pipeline) would just call the __init__ function of the build class, not the __enter__ function which runs the setup methods on each node in your pipeline. You need to use the context manager to ensure the pipeline gets built and torn down properly. This can be a bit frustrating to work around if you don't want the context manager, but one way around it is to turn your pipeline into a generator. Note that this will give you much less control over when the pipeline teardown runs. See the following example:

import zarr
import gunpowder as gp
import numpy as np

container = zarr.open("test.zarr")
x, y = np.meshgrid(range(10),range(10))
container["test"] = x + y

test_key = gp.ArrayKey("TEST")
pipeline1 = gp.ZarrSource("test.zarr", {test_key:"test"})
pipeline2 = gp.ZarrSource("test.zarr", {test_key:"test"})

request = gp.BatchRequest()
request.add(test_key, (5,5))

def generate_batches(pipeline: gp.Pipeline, request: gp.BatchRequest) -> gp.Batch:
    with gp.build(pipeline):
        while True:
            yield pipeline.request_batch(request)

batch_generator1 = generate_batches(pipeline1, request)
batch_generator2 = generate_batches(pipeline2, request)
for _ in range(10):
    batch1 = next(batch_generator1)
    assert test_key in batch1

    batch2 = next(batch_generator2)
    assert test_key in batch2

In this form the pipeline should not be too difficult to incorporate into a data loader.

@Christianfoley
Author

Christianfoley commented Nov 8, 2022

Thanks so much for the quick reply! I will try these approaches.

Note that this will give you much less control over when the pipeline teardown runs.

Should I be worried about this impacting data-loading performance? For example, would this create a lot of overhead for each setup and teardown if I include a PreCache node with many workers?

@pattonw
Collaborator

pattonw commented Nov 8, 2022

Should not impact performance at all.
I think the likelihood of running into any issues is very low and probably not worth worrying about. If you had multiple pipelines that included a train or predict node (i.e. allocated gpu memory), then you might run into memory issues when trying to build them all.

@Christianfoley
Author

This can be a bit frustrating to work around if you don't want the context manager, but one way around it is to turn your pipeline into a generator.

The generator method seems to work well with our dataloaders, thanks!

We have previously dealt with bottlenecks in our data IO slowing down large training sessions. Because of the way our data is formatted (multi-channel, high depth zarr arrays, chunked by 2d slice), only a small subsection of the data returned by current batch-requests is necessary to us at one time. Example:

path_to_zarr = os.path.join("/data/VeroCell_40X_11NA/VeroPhaseFL_40X_pos4.zarr")

spec = gp.ArraySpec(interpolatable=True, voxel_size=gp.Coordinate((1, 1, 1)))
raw = gp.ArrayKey('RAW')  # array with 8 channels: [1, 8, 41, 2048, 2048]
mask = gp.ArrayKey('MASK')

source = gp.ZarrSource(
    filename=path_to_zarr,
    datasets={raw: "arr_0", mask: "arr_0_mask"},
    array_specs={raw: spec, mask: spec}
)
random_location = gp.RandomLocation(min_masked=0.3, mask=mask)  # reject background
pipeline = source + random_location

new_request = gp.BatchRequest()
new_request[raw] = gp.Roi((0, 0, 0), (5, 256, 256))

with gp.build(pipeline):
    sample = pipeline.request_batch(request=new_request)
    data = sample[raw].data

print(data.shape)    # (1, 8, 5, 256, 256)

input_we_need = np.expand_dims(data[0,3], (0,1))
target_we_need = np.expand_dims(data[0,7, 3], (0,1,2))

print(input_we_need.shape) # (1, 1, 5, 256, 256)
print(target_we_need.shape) # (1, 1, 1, 256, 256)

It seems to me that in this process we need to read (num_channels)x(maximum z-depth)x(width)x(height) elements, when we are using only a fraction of that data.

  • Is it possible to specify reading only a certain subset of indices in a non-spatial dimension? (in the above example these would be 3 and 7 in non-spatial dimension 1)
  • As in the above example, we need to read many z-slices (dimension 2) for the input channel(s), but only one for the target channel(s). Is there a 'best way' to optimize this process? I have considered building separate pipelines for input and target, but we need to standardize the random augmentations between them (the target's x-y rotation needs to match the input's, etc.), and this would only make sense if we get specific channel selection working.

@mattersoflight

Hello @funkey , @pattonw!
We have been using gunpowder for augmentation during virtual staining with data stored in ome-zarr's HCS format. Here is the key method.

We are at a stage where we need to improve the training speed. One of the bottlenecks is that with the current implementation of the data loading pipeline, we read more data than we need to.

More specifically, Christian's question above has become timely again:

It seems to me that in this process we need to read (num_channels)x(maximum z-depth)x(width)x(height) elements when we are using only a fraction of that data.

We are currently working with CZYX arrays, where we need to draw patches/slices along the C and Z dimensions. In a month or so, we will start working with TCZYX, where we need to draw patches/slices along T,C,Z dimensions.

Does gunpowder's structure allow users to draw ROIs and slices along the time, channel, and Z dimensions? Can you point us to some examples?

@Christianfoley
Author

Christianfoley commented Mar 26, 2023

Thanks @mattersoflight ! The updated key method might be helpful for contextualizing our structure.

@pattonw
Collaborator

pattonw commented Mar 27, 2023

Hi @mattersoflight and @Christianfoley, there is minimal built-in support for processing channel dimensions, since most gunpowder operations are focused on spatial queries and augmentations. There are a couple of approaches to handling this sort of thing.

  1. Save a copy of your data including only the channels you actually want (e.g. a 2-channel version with just channels 3 and 7). This would probably be the easiest but least reusable solution.
  2. Write a custom source node that is aware of your requirements and only reads channels 3/7 into memory. I would recommend this path if you need to do something like this more than once.
  3. Assign a voxel size to your channel dimension and include the channel dimension in your request (probably not the best approach, since RandomLocation will start providing random channels, unless this is what you want).

If you want a RandomLocation to work in T and Z,Y,X then the custom node would probably be best since you could also transpose the first two dimensions to get CTZYX allowing gunpowder to handle TZYX as your spatial dimensions for sampling and augmenting purposes. You would have to assign a 4D voxel size.

Here is an example of a CTZYX dataset where you read all the channels, or add a custom node to filter out a subset of the channels. You may need a slightly more complicated custom zarr source node if you want to transpose the first two dimensions to let gunpowder handle time as spatial:

import zarr
import gunpowder as gp
import numpy as np

class TakeChannels(gp.BatchFilter):
    def __init__(self, array_key: gp.ArrayKey, channels: list[int]):
        self.array_key = array_key
        self.channels = channels

    def process(self, batch, request):
        outputs = gp.Batch()
        array = batch[self.array_key]
        array.data = array.data[tuple(self.channels), ...]
        outputs[self.array_key] = array
        return outputs

container = zarr.open("test.zarr")
c, t, z, y, x = np.meshgrid(range(10),range(10),range(10),range(10),range(10))
container["test"] = c + t + z + y + x

test_key = gp.ArrayKey("TEST")
pipeline1 = gp.ZarrSource("test.zarr", {test_key:"test"}, {test_key:gp.ArraySpec(voxel_size=gp.Coordinate(1,1,1,1))})

request = gp.BatchRequest()
request.add(test_key, (5,5,5,5))

with gp.build(pipeline1):
    batch = pipeline1.request_batch(request)
    assert batch[test_key].data.shape == (10,5,5,5,5)

pipeline2 = gp.ZarrSource("test.zarr", {test_key:"test"}, {test_key:gp.ArraySpec(voxel_size=gp.Coordinate(1,1,1,1))}) + TakeChannels(test_key, [2,3])

with gp.build(pipeline2):
    batch = pipeline2.request_batch(request)
    assert batch[test_key].data.shape == (2,5,5,5,5)

@bentaculum
Contributor

Hi everyone, I also need an OMEZarrSource; has anyone already written one?
