refactor: Store DataFile in FileScanTask instead #607

Xuanwo · 2024-09-06T16:24:01Z

This PR will store DataFile in FileScanTask directly so users can use the information from DataFile more easily like format, type, size and record_counts.

Signed-off-by: Xuanwo <[email protected]>

Xuanwo · 2024-09-06T16:26:24Z

cc @liurenjie1024 and @ZENOTME, I'm unsure why FileScanTask requires Serialize and Deserialize. Is it safe to eliminate this requirement?

I have seen #377 (comment). I encountered a situation where I need to estimate the data foramt, data size and data count, which requires most of the content of the data file. It doesn't seem ideal to add the same fields in FileScanTask and re-expose them. Any thoughts?

Xuanwo · 2024-09-06T16:31:09Z

Also inviting @sdd to join the discussion since you recently worked on the file scan.

ZENOTME · 2024-09-06T18:24:49Z

I'm unsure why FileScanTask requires Serialize and Deserialize. Is it safe to eliminate this requirement?

As in #377 (comment), I think FileScanTask need to be able to Serialize and Deserialize.

I have seen #377 (comment). I encountered a situation where I need to estimate the data foramt, data size and data count, which requires most of the content of the data file. It doesn't seem ideal to add the same fields in FileScanTask and re-expose them. Any thoughts?

The reason why we only contain data file paths is that I find that the reader only needs the file path in ManifestEntry. Lots of information in DataFile is about statistics, like partition, lower_bounds, and upper_bounds, and this information in mainly used in the planning phase to prune the file.
Can you elaborate on the information needed? If what we need is: data format, data size, and data count, I think it's ok to store these three extra fields in FileScanTask now.
But if there is more information needed, I think we can consider containing DataFile in FileScanTask and make it Serializable and Deserializable.

sdd · 2024-09-06T18:27:20Z

Thanks for the mention, @Xuanwo 👍🏼

I have a couple of points:

Firstly, FileScanTask definitely needs to be Serialize and Deserialize. I'm already using this, and I don't think I'l be the only one. In cases such as mine where iceberg-rust forms the core of an Apache Arrow Flight service, it is crucial to be able to distribute tasks from the planner / orchestrator (where I run scan.plan_files()) to multiple executors on other machines that can together stream results (using reader.read()) back to the API consumer.

Because of this sharing of FileScanTask over the wire to other machines, I'm also quite sensitive to the size of this. In situations where the file plan could contain over a thousand tasks, I don't want to have the size of these tasks massively bloated. DataFile can be pretty big, and it's size is unbounded as it grows unavoidably with both the number of columns in the table, as well as the number of row groups in the file.

Thirdly, even though we don't use it yet, DataFile contains key_metadata, which is certainly not something we'd want to stream back to consumers.

You could argue that if we did switch to having DataFile in FileScanTask then I could simply extract out just the file path, projected field ids, predicate and schema and send them in a custom serializable struct instead. The problem with that is that reader::read() expects a stream of FileScanTasks and I'd be unable to reconstruct them exactly on the executor nodes.

So, on the whole, I'm against having DataFile in the FileScanTask.

sdd · 2024-09-06T18:34:40Z

One option would be to have an alternative method to scan.plan_files that returns a stream of more detailed objects that themselves embed FileScanTask. This new method could be aimed towards use cases where the consumer is in the same process or at least on the same machine, and so bandwidth and privacy are less of a concern.

If we have both FileScanTask and, for want of a better name, RichFileScanTask, implement a ScanTask trait, then perhaps reader.read() could take a stream of impl ScanTask so that it would work with both streams?

sdd · 2024-09-06T18:39:04Z

Or we can have a scan.plan_files_rich method that returns richer objects, with a scan.to_file_scan_task_stream method that takes a stream of fat objects and transforms it into a stream of FileScanTasks. reader.read() can then stay unchanged, and scan.plan_files would be refactored into a thin wrapper that calls scan.plan_files_rich and pipes it through scan.to_file_scan_task_stream. That way the changes would be non-breaking for existing users.

Xuanwo · 2024-09-07T04:21:02Z

Thank you @ZENOTME and @sdd for the prompt response! I now understand the context and background. I realize that FileScanTask must be Serializable and Deserializable.

Given this context, we have two options:

Add fields such as data_file_content, data_file_path, data_file_format, data_file_size_in_bytes, and data_file_record_count to FileScanTask.
Propose a new API that can return something similar (perhaps FileScanPlan?) to FileScanTask with more information but not serializable.

What do you think?

ZENOTME · 2024-09-07T06:56:38Z

Add fields such as data_file_content, data_file_path, data_file_format, data_file_size_in_bytes, and data_file_record_count to FileScanTask.

I prefer this if this solution works in your scene.

Propose a new API that can return something similar (perhaps FileScanPlan?) to FileScanTask with more information but not serializable.

I think we should propose this until there is a real requirement.

Xuanwo · 2024-09-07T11:36:53Z

Great, let's me follow the option 1 first. We can discuss about option 2 while needed.

amitgilad3 · 2024-10-04T12:40:45Z

Hi @sdd @Xuanwo & @ZENOTME - For the past few days i have been looking into compaction and i saw this discussion , i have been looking into the java implementation of the FileScanTask class and in the java implementation it has a DataFile object which is used when working with the rewrite in order to retrieve information about partition and file.specId now i know @sdd raised a legit reason why not to have it in the FileScanTask but i wondering what would be the preferred approach so i can try to add the necessary data to allow with implementing compaction:

add the fields to FileScanTask like partition & specId?
Propose a new API that can return something similar (perhaps FileScanPlan?) to FileScanTask with more information but not serializable?
any other solution ?

also i know that there are many other features that are missing to implement this but i would like to assist were possible

Xuanwo · 2024-10-05T03:37:32Z

Hi, @amitgilad3 would you like to start a new issue for this discussion? It's possible that comments on closed PR being missed.

Thanks in advance.

refactor: Store DataFile in FileScanTask instead

f4a79d1

Signed-off-by: Xuanwo <[email protected]>

Xuanwo requested a review from liurenjie1024 September 6, 2024 16:24

Xuanwo mentioned this pull request Sep 6, 2024

Feature: Integrate Databend PushDown with Iceberg Rust's Scan Planning databendlabs/databend#16415

Open

4 tasks

Xuanwo closed this Sep 7, 2024

Xuanwo deleted the file_scan_task branch September 7, 2024 11:36

Xuanwo mentioned this pull request Sep 7, 2024

feat: Add more fields in FileScanTask #609

Merged

amitgilad3 mentioned this pull request Oct 5, 2024

Exploring Enhanced Compaction Support in Rust #657

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: Store DataFile in FileScanTask instead #607

refactor: Store DataFile in FileScanTask instead #607

Xuanwo commented Sep 6, 2024

Xuanwo commented Sep 6, 2024 •

edited

Loading

Xuanwo commented Sep 6, 2024

ZENOTME commented Sep 6, 2024

sdd commented Sep 6, 2024

sdd commented Sep 6, 2024

sdd commented Sep 6, 2024 •

edited

Loading

Xuanwo commented Sep 7, 2024

ZENOTME commented Sep 7, 2024

Xuanwo commented Sep 7, 2024

amitgilad3 commented Oct 4, 2024 •

edited

Loading

Xuanwo commented Oct 5, 2024

refactor: Store DataFile in FileScanTask instead #607

refactor: Store DataFile in FileScanTask instead #607

Conversation

Xuanwo commented Sep 6, 2024

Xuanwo commented Sep 6, 2024 • edited Loading

Xuanwo commented Sep 6, 2024

ZENOTME commented Sep 6, 2024

sdd commented Sep 6, 2024

sdd commented Sep 6, 2024

sdd commented Sep 6, 2024 • edited Loading

Xuanwo commented Sep 7, 2024

ZENOTME commented Sep 7, 2024

Xuanwo commented Sep 7, 2024

amitgilad3 commented Oct 4, 2024 • edited Loading

Xuanwo commented Oct 5, 2024

Xuanwo commented Sep 6, 2024 •

edited

Loading

sdd commented Sep 6, 2024 •

edited

Loading

amitgilad3 commented Oct 4, 2024 •

edited

Loading