
Read from Storage in file source #3869

Merged: rdettai merged 9 commits into main from object-source on Oct 5, 2023
Conversation

@rdettai (Contributor) commented Sep 22, 2023

Description

Closes #3868

How was this PR tested?

Running full test suite:

Code coverage

@rdettai rdettai self-assigned this Sep 22, 2023
@rdettai rdettai added the enhancement New feature or request label Sep 22, 2023
@rdettai (Contributor, Author) commented Sep 22, 2023

@fmassot (cc @trinity-1686a) commit 9a74f5a contains the scaffolding for the streaming storage. The central point is that StorageResolver now has a resolve_streamable() method. I left the existing resolve() method in place (reasons explained in the docstring). Before proceeding to the implementation for each and every store, I would like your opinion on the general setup (rough sketch after this list):

  • are you okay with the overhead it adds? Should we reconsider adding the method to the ~~StorageResolver~~ Storage directly?
  • I'm not sure what the debouncer is used for. Is it okay to bypass it as I did for the streaming mode?
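For context, a rough sketch of what this scaffolding could look like (hypothetical signatures; the actual code is in commit 9a74f5a):

```rust
use std::path::Path;
use std::sync::Arc;

use async_trait::async_trait;
use tokio::io::AsyncRead;

#[async_trait]
trait Storage: Send + Sync {
    async fn get_all(&self, path: &Path) -> std::io::Result<Vec<u8>>;
}

// Supertrait: only backends that can truly stream implement it.
#[async_trait]
trait StreamableStorage: Storage {
    async fn get_stream(
        &self,
        path: &Path,
    ) -> std::io::Result<Box<dyn AsyncRead + Send + Unpin>>;
}

struct StorageResolver;

impl StorageResolver {
    /// Existing entry point: resolves any storage backend.
    fn resolve(&self, _uri: &str) -> std::io::Result<Arc<dyn Storage>> {
        unimplemented!()
    }

    /// New entry point: fails for backends without streaming support.
    fn resolve_streamable(&self, _uri: &str) -> std::io::Result<Arc<dyn StreamableStorage>> {
        unimplemented!()
    }
}
```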

@guilload (Member) commented:

I read the comment for the resolve_streamable method but don't understand it. Why can't the Storage trait have a get_stream method directly?

@rdettai (Contributor, Author) commented Sep 23, 2023

> I read the comment for the resolve_streamable method but don't understand it.

Noted, I'll rephrase it if we move forward in this direction.

> Why can't the Storage trait have a get_stream method directly?

That's one of the questions (I actually made a typo in my previous comment, sorry!).

Pros: the idea behind composing traits instead of having one big Storage trait that does everything is to avoid leaving methods unimplemented. I believe the Storage trait should already be split into a ReadableStorage and a WritableStorage, thus avoiding unimplemented methods in BundledStorage and StorageWithCache. I didn't want to add yet another method that doesn't make sense in all cases. You will usually want to avoid get_stream when the actual data is already backed by a byte array (like the cached storage), as it implies an extra unnecessary copy (I know that's what the RamStorage does, but that's mostly for tests). Its implementation might also be a bit tricky on the BundledStorage, which is a shame because it will actually never be used 😅.

Cons: the machinery of the supertrait is a bit obscure (especially the upcasting, but also some generics introduced by it), and most Storage implementations are streamable anyway.
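To illustrate the split mentioned in the pros (purely hypothetical traits, not the actual Quickwit API):

```rust
use std::path::Path;

use async_trait::async_trait;

// A read-only backend like BundledStorage would implement only this half,
// so no write method ever needs to be stubbed out with unimplemented!().
#[async_trait]
trait ReadableStorage: Send + Sync {
    async fn get_all(&self, path: &Path) -> std::io::Result<Vec<u8>>;
}

#[async_trait]
trait WritableStorage: Send + Sync {
    async fn put(&self, path: &Path, payload: Vec<u8>) -> std::io::Result<()>;
}
```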

@guilload (Member) commented Sep 25, 2023

The trade-off is (1) adding a method that will remain unimplemented for some storage implementations vs. (2) adding two new traits, a new method to StorageResolver, and some "obscure machinery" for the supertrait/upcasting. I vote for (1).

@fulmicoton, you want to weigh in?

@fulmicoton (Contributor) commented Sep 26, 2023

We can add a get_stream method that has a default impl calling get_all, and avoid any unimplemented!().

> Pros: the idea behind composing traits instead of having one big Storage trait that does everything is to avoid leaving methods unimplemented.

We can have a "best-effort" default impl that calls get_all and returns a stream that does not really stream.
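A minimal sketch of that default impl (simplified signatures; the real Quickwit trait differs):

```rust
use std::io::Cursor;
use std::path::Path;

use async_trait::async_trait;
use tokio::io::AsyncRead;

#[async_trait]
trait Storage: Send + Sync {
    async fn get_all(&self, path: &Path) -> std::io::Result<Vec<u8>>;

    /// Best-effort default: buffers the whole object in memory, then hands
    /// it back behind the streaming interface. Backends that can really
    /// stream (local files, S3, Azure) override this.
    async fn get_stream(
        &self,
        path: &Path,
    ) -> std::io::Result<Box<dyn AsyncRead + Send + Unpin>> {
        let payload = self.get_all(path).await?;
        // tokio implements AsyncRead for std::io::Cursor<Vec<u8>>.
        Ok(Box::new(Cursor::new(payload)))
    }
}
```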

> I believe the Storage trait should already be split into a ReadableStorage and a WritableStorage, thus avoiding unimplemented methods in BundledStorage and StorageWithCache. I didn't want to add yet another method that doesn't make sense in all cases. You will usually want to avoid get_stream when the actual data is already backed by a byte array (like the cached storage), as it implies an extra unnecessary copy (I know that's what the RamStorage does, but that's mostly for tests). Its implementation might also be a bit tricky on the BundledStorage, which is a shame because it will actually never be used 😅.

ReadableStorage/WritableStorage is an interesting idea. The same idea exists on the Directory side ({Read/Write}Directory).
I would wield the word "should" with more care, however. I personally consider this change mildly on the overengineering side.

@rdettai (Contributor, Author) commented Sep 27, 2023

Ok, thanks for your insight! I'll go with a get_stream method directly on the Storage abstraction then.

@rdettai (Contributor, Author) commented Sep 28, 2023

This is taking shape. Before moving this PR out of draft, I have a few questions:

  1. Cache implementation: I added a naive implementation that uses get_slice and wraps it into a cursor. This implementation is actually never used. Should we rather leave it unimplemented and let the person who first needs it come up with the implementation they want? (If we keep this dummy implementation, I'll obviously add a unit test for it.)
  2. Retry on Azure: the Azure SDK opens a fully lazy stream, so nothing happens until the stream is first polled. This means that we cannot use the usual retry mechanism here. One workaround to enable it would be to peek into the stream and wrap that into a retry. Do you think it is worth it?
  3. Buffered reader: the S3 SDK documentation seems to suggest that the AsyncRead should be wrapped into a BufReader. The existing file source implementation was also performing this wrapping around the tokio::fs::File. Currently, it is still the file source that performs the wrapping, but my feeling is that the Storage implementation should take care of this when it is relevant. Any opinion on this?
  4. Debouncer: I'm not sure I fully understand the ins and outs of the debouncer. I added a dummy implementation of get_slice_stream that bypasses it. Does that seem sound?

I also still need to look into the tests with external systems (Azure and S3).

@rdettai force-pushed the object-source branch 2 times, most recently from ee47ce8 to ae417ff on September 28, 2023 at 19:50
@fulmicoton (Contributor) commented:

> Cache implementation: I added a naive implementation that uses get_slice and wraps it into a cursor. This implementation is actually never used. Should we rather leave it unimplemented and let the person who first needs it come up with the implementation they want?

Yes, unimplemented!() is ok here.

> Retry on Azure: the Azure SDK opens a fully lazy stream, so nothing happens until the stream is first polled. This means that we cannot use the usual retry mechanism here. One workaround to enable it would be to peek into the stream and wrap that into a retry. Do you think it is worth it?

I'm not entirely sure what is not working in this case. If you want to make sure we poll the stream within the body of the get_stream method, a simple workaround could be to wrap the async reader in tokio's async BufReader, which has a fill_buf method. Make sure you comment why you do that, for future generations.
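A sketch of this workaround (hypothetical helper name): fill_buf from tokio's AsyncBufReadExt drives the lazy stream until the first bytes, EOF, or an error arrive, without consuming any data.

```rust
use tokio::io::{AsyncBufReadExt, AsyncRead, BufReader};

// Forces one poll of the underlying stream inside get_stream, so
// connection errors surface here instead of at the first read downstream.
async fn open_eagerly<R>(reader: R) -> std::io::Result<BufReader<R>>
where
    R: AsyncRead + Unpin,
{
    let mut buf_reader = BufReader::new(reader);
    buf_reader.fill_buf().await?;
    Ok(buf_reader)
}
```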

> Buffered reader: the S3 SDK documentation seems to suggest that the AsyncRead should be wrapped into a BufReader. The existing file source implementation was also performing this wrapping around the tokio::fs::File. Currently, it is still the file source that performs the wrapping, but my feeling is that the Storage implementation should take care of this when it is relevant. Any opinion on this?

Which documentation?
I wonder if this is relevant here. Under the hood, the S3 client wraps a stream of large Bytes objects, whose size is independent from the way we use the AsyncRead. Does it prevent polling of the next block or something?

> Debouncer: I'm not sure I fully understand the ins and outs of the debouncer. I added a dummy implementation of get_slice_stream that bypasses it. Does that seem sound?

Yes, that's ok. The debouncer detects that several queries currently in flight are actually fetching the same piece of data and makes sure we only emit one GET request.
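In other words, the debouncer implements a single-flight pattern. A hypothetical sketch of the behavior (the real debouncer also evicts completed entries, omitted here):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

use futures::future::{BoxFuture, FutureExt, Shared};

struct Debouncer {
    // Maps a request key to the future of the GET already in flight.
    in_flight: Mutex<HashMap<String, Shared<BoxFuture<'static, Vec<u8>>>>>,
}

impl Debouncer {
    /// Concurrent callers with the same key await one shared future, so
    /// only a single GET request is actually emitted.
    async fn get_or_fetch<F>(&self, key: String, fetch: F) -> Vec<u8>
    where
        F: FnOnce() -> BoxFuture<'static, Vec<u8>>,
    {
        let shared_fut = self
            .in_flight
            .lock()
            .unwrap()
            .entry(key)
            .or_insert_with(|| fetch().shared())
            .clone();
        shared_fut.await
    }
}
```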

@rdettai (Contributor, Author) commented Sep 29, 2023

Cache implementation

ok, switching to unimplemented

Buffered reader

> Which documentation?

Very sorry for this, I had the link ready but forgot to paste it in.

But now that I think about it, it's probably just for the convenience of having .lines(), which comes from [AsyncBufReadExt](https://docs.rs/tokio/latest/tokio/io/trait.AsyncBufReadExt.html).

I took a deeper look at the code, and the BufReader is actually required in the FileSource for the same reason as in the docs mentioned above: to go through the file line by line (here). So my proposed approach for now is to leave things as they are, i.e. return the finer-grained AsyncRead from the Storage (exactly as it is returned by the SDKs/tokio::fs::File) and wrap it with BufReader in the file source, if not for performance reasons then at least to get the utility of the line-by-line iterator.
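A sketch of that layering (hypothetical helper, not the actual FileSource code): the storage hands back a plain AsyncRead, and the file source wraps it exactly once to iterate line by line.

```rust
use tokio::io::{AsyncBufReadExt, AsyncRead, BufReader};

async fn process_lines<R>(storage_stream: R) -> std::io::Result<()>
where
    R: AsyncRead + Unpin,
{
    // .lines() comes from AsyncBufReadExt and requires buffered reading.
    let mut lines = BufReader::new(storage_stream).lines();
    while let Some(line) = lines.next_line().await? {
        // Each line is one record for the indexing pipeline.
        println!("{line}");
    }
    Ok(())
}
```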

Retry on Azure

> If you want to make sure we poll the stream within the body of the get_stream method, a simple workaround could be to wrap the async reader in tokio's async BufReader. It has a fill_buf method.

I feel it would be relevant to ensure that the stream is polled at least once in the body, as a sort of equivalent to "opening the file". This should rarely be useful in practice, as connectivity checks seem to usually be performed beforehand, but it would make the behavior more consistent with the other Storage implementations. To be consistent with the approach I proposed above regarding buffered readers, and to avoid wrapping the stream in multiple layers of BufReader, I'd rather not solve this with a BufReader here. If that's fine, I'll stick to my "peek into the stream" idea, as it's pretty simple and zero cost.
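A hypothetical sketch of this "peek" idea: pull the first chunk while still inside the retryable section, then glue it back onto the front of the stream, so no extra buffering layer is introduced.

```rust
use bytes::Bytes;
use futures::stream::{self, Stream, StreamExt};

async fn open_peeked<S, E>(
    mut chunks: S,
) -> Result<impl Stream<Item = Result<Bytes, E>>, E>
where
    S: Stream<Item = Result<Bytes, E>> + Unpin,
{
    // Polling once here surfaces connection errors eagerly, so a retry
    // wrapper around this call can catch them.
    let first_chunk = chunks.next().await.transpose()?;
    Ok(stream::iter(first_chunk.map(Ok)).chain(chunks))
}
```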

Debouncer

👍

@rdettai rdettai marked this pull request as ready for review September 29, 2023 19:35
@fulmicoton (Contributor) left a review comment

I think there is a bug in the local file storage. See comment inline. The other comments are nitpicks.

@codecov-commenter commented:

Codecov Report

Attention: 35 lines in your changes are missing coverage. Please review.

Comparison is base (213b3bf) 81.83% compared to head (a2bc35d) 81.83%.
Report is 21 commits behind head on main.


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3869      +/-   ##
==========================================
- Coverage   81.83%   81.83%   -0.01%     
==========================================
  Files         401      401              
  Lines      100563   100736     +173     
==========================================
+ Hits        82300    82433     +133     
- Misses      18263    18303      +40     
Files Coverage Δ
quickwit/quickwit-config/src/source_config/mod.rs 96.86% <100.00%> (ø)
quickwit/quickwit-index-management/src/index.rs 81.51% <100.00%> (ø)
.../quickwit-indexing/src/actors/indexing_pipeline.rs 92.23% <100.00%> (+0.05%) ⬆️
...t/quickwit-indexing/src/actors/indexing_service.rs 89.40% <100.00%> (+<0.01%) ⬆️
...uickwit/quickwit-indexing/src/source/ingest/mod.rs 87.83% <100.00%> (+0.06%) ⬆️
quickwit/quickwit-indexing/src/test_utils.rs 98.94% <100.00%> (ø)
quickwit/quickwit-storage/src/debouncer.rs 96.72% <100.00%> (+0.09%) ⬆️
quickwit/quickwit-storage/src/lib.rs 100.00% <100.00%> (ø)
...torage/src/object_storage/s3_compatible_storage.rs 91.23% <100.00%> (+0.20%) ⬆️
quickwit/quickwit-storage/src/ram_storage.rs 92.66% <100.00%> (+0.41%) ⬆️
... and 10 more

... and 14 files with indirect coverage changes


@rdettai rdettai enabled auto-merge (squash) October 5, 2023 07:50
@rdettai rdettai merged commit 4fc31e5 into main Oct 5, 2023
4 checks passed
@rdettai rdettai deleted the object-source branch October 5, 2023 13:11
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Support Object Storage in the file source
4 participants