Functionality questions for ome-zarr formatted data #181

Open
Christianfoley opened this issue Oct 14, 2022 · 19 comments
@Christianfoley

Christianfoley commented Oct 14, 2022

Greetings from the Mehta Lab and apologies in advance for the long post!
I am attempting to use gunpowder as a dataloader for (float32) data in the OME-Zarr format. I have run into a few issues trying to get it to work with some of my data; I have enumerated my questions below.

Support for multiple zarr stores in the OME-HCS Zarr format
If I have data stored in OME-Zarr format as a series of hierarchical groups (row > col > position > data_arrays), datasets inside a source node have to be specified with the full hierarchy path to the dataset:

raw = gp.ArrayKey('RAW')
source = gp.ZarrSource(
    filename=zarr_dir,
    datasets={raw: 'Row_0/Col_1/Pos_1/arr_0'},
    array_specs={raw: gp.ArraySpec(interpolatable=True)}
)

Because of this format, arrays in different rows, which are all part of one 'dataset', end up stored in different zarr stores. Is it possible to create a single source that can access multiple zarr stores?

Inconsistent behavior of BatchRequest objects
When applying some augmentations (for example the SimpleAugment node), reusing a BatchRequest without redefining the request or pipelines sometimes results in data returned with the wrong indices:

For example, I define a dataset and two pipelines, one with and one without a SimpleAugment node:

raw = gp.ArrayKey('RAW')

source = gp.ZarrSource(
    zarr_dir,  # the zarr container
    {raw: 'Row_0/Col_1/Pos_1/arr_0'},  # arr_0 is 3 channels of 3D image stacks, dims: (1, 3, 41, 2048, 2048) 
    {raw: gp.ArraySpec(interpolatable=True)} 
)

simple_augment = gp.SimpleAugment(transpose_only=(-1,-2))

pipelines = [source, source + simple_augment]

Then I define a batch request:

request = gp.BatchRequest()
request[raw] = gp.Roi((0,0,0,0,0), (1,3,1,768,768))

Then I use that request to generate two batches from each pipeline in sequence:

#First loop is fine, second loop has 2nd dimension flipped/transposed
for n in range(2):
  batches = []
  for pipeline in pipelines: #for both augmented and plain pipeline
    with gp.build(pipeline):
      batch = pipeline.request_batch(request) #get batch
      batches.append(batch)

  # visualize the content of the batches
  fig, ax = plt.subplots(len(pipelines), 3, figsize = (14,10))
  for i in range(len(pipelines)):
    for j in range(3):
      ax[i][j].imshow(batches[i][raw].data[0,j,0])
  ax[0][1].set_title('source')
  ax[1][1].set_title('source + aug')    
  plt.show()

The result is the following:
[Screenshots: visualization of the batch from loop 1 (left) and from loop 2 (right); in the second loop the channels come back reversed.]

I am confused as to why the behavior changes when the data, pipeline, and batch request haven't changed. Is there a reason that the second augmentation batch returns with reversed channels?


Thanks!!

@mattersoflight

Thanks for summarizing your attempts @Christianfoley.
👋🏼 @pattonw @sheridana - any tips on how to handle our data?

Some notes for context:

  • We are upgrading our image translation/virtual staining pipeline from TF 1.x to PyTorch (Draft PR for code review: Pytorch implementation mehta-lab/microDL#176). We also are eager to switch over to a better data augmentation library in the process!
  • I am pasting an example hierarchy of the data (input and output channels). We split the train/validation/test sets by field of view (Pos*). The hierarchy is based on the OME-HCS format.
[shalin.mehta@login01 training_2022_04_20_Phase1e-3_Denconv_Nuc8e-4_Mem8e-4_pad15_bg50.zarr]$ tree -L 4
.
└── Row_0
    ├── Col_0
    │   └── Pos0
    │       └── arr_0
    ├── Col_1
    │   └── Pos1
    │       └── arr_0
    ├── Col_10
    │   └── Pos10
    │       └── arr_0

@pattonw
Collaborator

pattonw commented Oct 14, 2022

Regarding the simple augment and the reversed channels, the SimpleAugment applies both Transpose and Mirror operations randomly. You limit the transpose operation to the last 2 dimensions, but the SimpleAugment is free to randomly mirror any axis, resulting in the channels sometimes being flipped.
Gunpowder has implicit support for "spatial axes" and "non-spatial axes". If you have an array of shape [batch, channels, z, y, x] and you define in the associated ArraySpec that voxel_size=(1,1,1), gunpowder will only consider the last 3 dimensions (z, y, x) to be spatial and will avoid applying things like mirror/transpose/elastic augmentations to non-spatial axes. We do assume that spatial axes always come last, so using [z, y, x, channels, batch] would not work.
When you define your source, I would recommend including voxel size metadata in the ArraySpec, or adding it to the array metadata:

source = gp.ZarrSource(
    zarr_dir,  # the zarr container
    {raw: 'Row_0/Col_1/Pos_1/arr_0'},  # arr_0 is 3 channels of 3D image stacks, dims: (1, 3, 41, 2048, 2048)
    {raw: gp.ArraySpec(interpolatable=True, voxel_size=gp.Coordinate((4, 4, 4)))}
)

@pattonw
Collaborator

pattonw commented Oct 14, 2022

We don't have a source for reading from OME Zarr. Our ZarrSource is pretty basic and avoids making any assumptions about the layout of the zarr. The only assumptions that are made are checking for "resolution" and "offset" metadata. I think the best approach would be to write a custom OMEZarrSource that makes more assumptions about the layout of your zarr datasets.

@funkey
Member

funkey commented Oct 14, 2022

Is it possible to create a single source that can access multiple zarr stores?

It is! You can create as many zarr sources as you like, pointing to different zarr containers and/or datasets. You can then combine them using RandomProvider, which will randomly pick any of the sources to satisfy a request. The ZarrSource is a lightweight object; it will not hurt to have hundreds or thousands of them in your pipeline.

@funkey
Member

funkey commented Oct 14, 2022

We don't have a source for reading from OME Zarr. Our ZarrSource is pretty basic and avoids making any assumptions about the layout of the zarr. The only assumptions that are made are checking for "resolution" and "offset" metadata. I think the best approach would be to write a custom OMEZarrSource that makes more assumptions about the layout of your zarr datasets.

That would be a cleaner solution, agreed. For that, it would be helpful to know how you plan to use the OME zarr datasets during training. Is the idea to select a random row/column for each sample in a batch? Or will it be necessary to request specific rows/columns?

@pattonw
Collaborator

pattonw commented Oct 14, 2022

You can then combine them using RandomProvider, which will randomly pick any of the sources to satisfy a request.

Ah, I may have misunderstood. I thought the Row/Col/Pos hierarchy related the arrays spatially and that you essentially wanted to tile them together into one larger array, which would be difficult to do with our existing nodes. Pairing many raw datasets with their associated gt and then randomly sampling is well supported.

@Christianfoley
Author

Hi Jan and William! Thank you so much for your quick and informative responses.

When you define your source I would recommend including voxel size metadata, or adding it to the array metadata
I was having trouble understanding the voxel size metadata! Thank you, this makes much more sense.

For that, it would be helpful to know how you plan to use the OME zarr datasets during training. Is the idea to select a random row/column for each sample in a batch? Or will it be necessary to request specific rows/columns?

We should be able to consider all arrays in all rows/columns for random selection. That said, we only want to load tiles (256x256) whose field of view meets certain qualifications; much of our actual image data is simply background that lacks significant features, and we've shown that selecting only regions with some percentage of foreground significantly improves network performance. Ideally we would like to do this without making another copy of the data.

We could do this on the fly by loading in a region and computing whether it meets the qualifications, but I think it may be more efficient to precompute these regions, and have some metadata specifying which regions contain foreground features. This way regions don't get loaded and then dumped when they aren't up to par.
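The precomputation idea can be prototyped outside gunpowder with a summed-area (integral) table over a foreground mask; the function name, tile size, and 30% threshold below are illustrative, not part of any existing API:

```python
import numpy as np

def foreground_tiles(mask: np.ndarray, tile: int = 256, min_fraction: float = 0.3):
    """Return (row, col) origins of tiles whose foreground fraction >= min_fraction."""
    # summed-area table with a zero border, so each tile sum is a 4-corner lookup
    integral = np.zeros((mask.shape[0] + 1, mask.shape[1] + 1), dtype=np.int64)
    integral[1:, 1:] = np.cumsum(np.cumsum(mask.astype(np.int64), axis=0), axis=1)

    origins = []
    for r in range(0, mask.shape[0] - tile + 1, tile):
        for c in range(0, mask.shape[1] - tile + 1, tile):
            total = (
                integral[r + tile, c + tile]
                - integral[r, c + tile]
                - integral[r + tile, c]
                + integral[r, c]
            )
            if total / (tile * tile) >= min_fraction:
                origins.append((r, c))
    return origins

# toy mask: foreground only in the top-left quadrant
mask = np.zeros((512, 512), dtype=bool)
mask[:256, :256] = True
print(foreground_tiles(mask))  # [(0, 0)]
```

The qualifying origins could then be stored as metadata alongside the zarr arrays and sampled from directly, so unqualified regions are never read.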

On the topic of loading smaller regions, I noticed that it took significantly more time to load a 256x256 ROI from a larger 2048x2048 image than it did to load the same region (256x256) from a pre-saved tile of that region.
Is this because gunpowder is reading the full array into memory and then selecting that cropped ROI? If so, how can we go about requesting smaller regions from larger arrays most efficiently? Is there a way to do this in a time somewhat comparable to loading the entirety of a smaller array?

@pattonw
Collaborator

pattonw commented Oct 15, 2022

We could do this on the fly by loading in a region and computing whether it meets the qualifications, but I think it may be more efficient to precompute these regions, and have some metadata specifying which regions contain foreground features. This way regions don't get loaded and then dumped when they aren't up to par.

Yes, we have also found that avoiding excessive sampling from background regions to be helpful. Gunpowder supports both lazy and precomputed methods for doing this.

  1. Lazy: There is a Reject node that can reject batches based on the percentage of masked-in data.
  2. Precomputed: The RandomLocation node also supports taking a mask as an optional input. Given a mask, it will precompute an integral array of the mask for quick and easy sampling of locations with at least some percentage of masked-in voxels, at the cost of storing a full copy of the mask in memory.

On the topic of loading smaller regions, I noticed that it took significantly more time to load a 256x256 ROI from a larger 2048x2048 image than it did to load the same region (256x256) from a pre-saved tile of that region.
Is this because gunpowder is reading the full array into memory and then selecting that cropped ROI? If so, how can we go about requesting smaller regions from larger arrays most efficiently? Is there a way to do this in a time somewhat comparable to loading the entirety of a smaller array?

Hmm, without more details on your specific files I'm not sure I can determine the cause of the slowdown. If you're loading from a chunked storage format such as zarr, it is unlikely that you will be reading exactly along block boundaries, so you will have to read and discard some extra data. Reading from a zarr will therefore always come with some small penalty, but I don't know of any format that avoids this penalty while still allowing you to read random crops from a large volume with reasonable performance. If performance is a problem, I recommend looking at the PreCache node, which starts multiple CPU workers to prepare batches for you; in my experience this has usually been sufficient for keeping up with the GPU. There is also a PrintProfilingStats node that tracks timings and helps you identify which parts of your pipeline are most time consuming.

@Christianfoley
Author

You can then combine them using RandomProvider, which will randomly pick any of the sources to satisfy a request.

Thanks! I was able to use this node to get the random sampling working, but only after providing my sources as a tuple. Ex:

pipeline = (source_1 + source_2 + source_3) + gp.RandomProvider() # does not work, only returns from last provider
pipeline = (source_1, source_2, source_3) + gp.RandomProvider() # does work, chooses randomly from providers

In the API reference, the example usage is: (a + b + c) + RandomProvider(), which had me confused for a bit.

@pattonw
Collaborator

pattonw commented Oct 19, 2022

Ah yes. That was recently fixed, but we have not yet made a new release to propagate the changes.

@Christianfoley
Author

Christianfoley commented Nov 8, 2022

Hi @funkey @pattonw, thanks again for your help so far. I have been running into an issue with building pipelines and it would be very helpful if you could help me understand a few things:

Currently I am creating multiple pipelines, one each for training, validation, and test data. I generally draw samples from the validation set every x epochs to generate a loss value for the learning rate scheduler. However, when I do this, I get the error: pipeline.setup() called more than once (build() inside build()?)

  • Is there a recommended way to draw from multiple pipelines during training?
    Example:
    scheduler = lr_scheduler.some_scheduler()
    
    with gp.build(pipeline1):
        for i in range(epochs):
            for sample in dataloader_from_pipeline1:
                #training loop ...
            with gp.build(pipeline2): #nesting issue caused here
                for sample in dataloader_from_pipeline2:
                    #validation/testing loop ...
            scheduler.step(loss_from_validation_loop)

Additionally, I don't quite understand the functionality of build(). When I wrap the batch request inside a with gp.build(some_pipeline): statement, my pipelines build fine. However, when I just call gp.build(pipeline) and then make batch requests sequentially, I have issues. Is there a reason for this?

  • In essence, why can I not simply call build when I initialize the pipeline, like so:

    class Dataset():
        def __init__():
            #init some pipeline...
            gp.build(some_pipeline) #build pipeline on initiation
            self.pipeline = some_pipeline
            self.request = some_request
            self.key = some_key
        def __getitem__():
            sample = self.pipeline.request_batch(request=self.request)
            sample_data = sample[self.key].data
            return sample_data
    
    train_dataset = Dataset()
    dataloader = Dataloader(train_dataset)
    def run_one_epoch():
        for sample in dataloader():
            #training loop ...

    The above results in an error deep inside the gunpowder call stack:

    TypeError: argument of type 'NoneType' is not iterable
    

@pattonw
Collaborator

pattonw commented Nov 8, 2022

Is there a recommended way to draw from multiple pipelines during training?

This should be possible. Here's a quick working example:

import zarr
import gunpowder as gp
import numpy as np

container = zarr.open("test.zarr")
x, y = np.meshgrid(range(10),range(10))
container["test"] = x + y

test_key = gp.ArrayKey("TEST")
pipeline1 = gp.ZarrSource("test.zarr", {test_key:"test"})
pipeline2 = gp.ZarrSource("test.zarr", {test_key:"test"})

request = gp.BatchRequest()
request.add(test_key, (5,5))
with gp.build(pipeline1):
    for _ in range(10):
        batch1 = pipeline1.request_batch(request)
        assert test_key in batch1

        with gp.build(pipeline2):
            batch2 = pipeline2.request_batch(request)
            assert test_key in batch2

You should only run into problems if you attempt to build the same pipeline multiple times, or maybe if you share nodes between both pipelines.

In essence, why can I not simply call build when I initialize the pipeline, like so:

The problem is that gp.build(some_pipeline) would just call the __init__ function of the build class, not the __enter__ function which runs the setup methods on each node in your pipeline. You need to use the context manager to ensure the pipeline gets built and torn down properly. This can be a bit frustrating to work around if you don't want the context manager, but one way around it is to turn your pipeline into a generator. Note that this will give you much less control over when the pipeline teardown runs. See the following example:

import zarr
import gunpowder as gp
import numpy as np

container = zarr.open("test.zarr")
x, y = np.meshgrid(range(10),range(10))
container["test"] = x + y

test_key = gp.ArrayKey("TEST")
pipeline1 = gp.ZarrSource("test.zarr", {test_key:"test"})
pipeline2 = gp.ZarrSource("test.zarr", {test_key:"test"})

request = gp.BatchRequest()
request.add(test_key, (5,5))

def generate_batches(pipeline: gp.Pipeline, request: gp.BatchRequest) -> gp.Batch:
    with gp.build(pipeline):
        while True:
            yield pipeline.request_batch(request)

batch_generator1 = generate_batches(pipeline1, request)
batch_generator2 = generate_batches(pipeline2, request)
for _ in range(10):
    batch1 = next(batch_generator1)
    assert test_key in batch1

    batch2 = next(batch_generator2)
    assert test_key in batch2

In this form the pipeline should not be too difficult to incorporate into a data loader.

@Christianfoley
Author

Christianfoley commented Nov 8, 2022

Thanks so much for the quick reply! I will try these approaches.

Note that this will give you much less control over when the pipeline teardown runs.

Should I be worried about this impacting data-loading performance? For example, would this create a lot of overhead for each setup and teardown if I include a PreCache node with many workers?

@pattonw
Collaborator

pattonw commented Nov 8, 2022

Should not impact performance at all.
I think the likelihood of running into any issues is very low and probably not worth worrying about. If you had multiple pipelines that included a train or predict node (i.e. allocated gpu memory), then you might run into memory issues when trying to build them all.

@Christianfoley
Author

This can be a bit frustrating to work around if you don't want the context manager, but one way around it is to turn your pipeline into a generator.

The generator method seems to work well with our dataloaders, thanks!

We have previously dealt with bottlenecks in our data IO slowing down large training sessions. Because of the way our data is formatted (multi-channel, high depth zarr arrays, chunked by 2d slice), only a small subsection of the data returned by current batch-requests is necessary to us at one time. Example:

path_to_zarr = os.path.join("/data/VeroCell_40X_11NA/VeroPhaseFL_40X_pos4.zarr")

spec = gp.ArraySpec(interpolatable=True, voxel_size=gp.Coordinate((1, 1, 1)))
raw = gp.ArrayKey('RAW')  # array with 8 channels: [1, 8, 41, 2048, 2048]
mask = gp.ArrayKey('MASK')

source = gp.ZarrSource(
    filename=path_to_zarr,
    datasets={raw: "arr_0", mask: "arr_0_mask"},
    array_specs={raw: spec, mask: spec}
)
random_location = gp.RandomLocation(min_masked=0.3, mask=mask)  # reject background
pipeline = source + random_location

new_request = gp.BatchRequest()
new_request[raw] = gp.Roi((0, 0, 0), (5, 256, 256))

with gp.build(pipeline):
    sample = pipeline.request_batch(request=new_request)
    data = sample[raw].data

print(data.shape)    # (1, 8, 5, 256, 256)

input_we_need = np.expand_dims(data[0,3], (0,1))
target_we_need = np.expand_dims(data[0,7, 3], (0,1,2))

print(input_we_need.shape) # (1, 1, 5, 256, 256)
print(target_we_need.shape) # (1, 1, 1, 256, 256)

It seems to me that in this process we need to read (num_channels)x(maximum z-depth)x(width)x(height) elements, when we are using only a fraction of that data.

  • Is it possible to specify reading only a certain subset of indices in a non-spatial dimension? (in the above example these would be 3 and 7 in non-spatial dimension 1)
  • As in the above example, we need to read many z-slices (dimension 2) for the input channel(s), but only one for the target channel(s). Is there a 'best way' to optimize this process? I have considered building separate pipelines for input and target, but we need to standardize the random augmentations between them (the target's x-y rotation needs to match the input's, etc.), and this would only make sense if we get specific channel selection working.

@mattersoflight

Hello @funkey , @pattonw!
We have been using gunpowder for augmentation during virtual staining with data stored in ome-zarr's HCS format. Here is the key method.

We are at a stage where we need to improve the training speed. One of the bottlenecks is that with the current implementation of the data loading pipeline, we read more data than we need to.

More specifically, Christian's question above has become timely again:

It seems to me that in this process we need to read (num_channels)x(maximum z-depth)x(width)x(height) elements when we are using only a fraction of that data.

We are currently working with CZYX arrays, where we need to draw patches/slices along the C and Z dimensions. In a month or so, we will start working with TCZYX, where we need to draw patches/slices along T,C,Z dimensions.

Does gunpowder's structure allow users to draw ROIs and slices along the time, channel, and Z dimensions? Can you point us to some examples?

@Christianfoley
Author

Christianfoley commented Mar 26, 2023

Thanks @mattersoflight ! The updated key method might be helpful for contextualizing our structure.

@pattonw
Collaborator

pattonw commented Mar 27, 2023

Hi @mattersoflight and @Christianfoley, there is minimal built-in support for processing channel dimensions, since most gunpowder operations are focused on spatial queries and augmentations. There are a couple of approaches to handling this sort of thing.

  1. Save a copy of your data including only the channels you actually want (e.g. a 2-channel version with just channels 3 and 7). This would probably be the easiest but least reusable solution.
  2. Write a custom source node that is aware of your requirements and only reads channels 3/7 into memory. I would recommend this path if you need to do something like this more than once.
  3. Assign a voxel size to your channel dimension and include the channel dimension in your request (probably not the best approach, since RandomLocation will start providing random channels, unless this is what you want).

If you want a RandomLocation to work in T and Z,Y,X then the custom node would probably be best since you could also transpose the first two dimensions to get CTZYX allowing gunpowder to handle TZYX as your spatial dimensions for sampling and augmenting purposes. You would have to assign a 4D voxel size.

Here is an example of a CTZYX dataset where you read all the channels, or add a custom node to filter out a subset of the channels. You may need a slightly more complicated custom zarr source node if you want to transpose the first two dimensions to let gunpowder handle time as spatial:

import zarr
import gunpowder as gp
import numpy as np

class TakeChannels(gp.BatchFilter):
    def __init__(self, array_key: gp.ArrayKey, channels: list[int]):
        self.array_key = array_key
        self.channels = channels

    def process(self, batch, request):
        outputs = gp.Batch()
        array = batch[self.array_key]
        array.data = array.data[tuple(self.channels), ...]
        outputs[self.array_key] = array
        return outputs

container = zarr.open("test.zarr")
c, t, z, y, x = np.meshgrid(range(10),range(10),range(10),range(10),range(10))
container["test"] = c + t + z + y + x

test_key = gp.ArrayKey("TEST")
pipeline1 = gp.ZarrSource("test.zarr", {test_key:"test"}, {test_key:gp.ArraySpec(voxel_size=gp.Coordinate(1,1,1,1))})

request = gp.BatchRequest()
request.add(test_key, (5,5,5,5))

with gp.build(pipeline1):
    batch = pipeline1.request_batch(request)
    assert batch[test_key].data.shape == (10,5,5,5,5)

pipeline2 = gp.ZarrSource("test.zarr", {test_key:"test"}, {test_key:gp.ArraySpec(voxel_size=gp.Coordinate(1,1,1,1))}) + TakeChannels(test_key, [2,3])

with gp.build(pipeline2):
    batch = pipeline2.request_batch(request)
    assert batch[test_key].data.shape == (2,5,5,5,5)

@bentaculum
Contributor

Hi everyone, I also need an OMEZarrSource; has anyone already written one?
