Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate #588

Merged
merged 14 commits into from
Sep 23, 2024

Conversation

a-agmon
Copy link
Contributor

@a-agmon a-agmon commented Aug 28, 2024

This PR closes #585

  • Adds datafusion filters to the IcebergTableScan struct
  • Converts datafusion filters (or Exp) to Iceberg Predicate and apply them to the TableScan On data fusion execute() thereby enabling data files pruning by data fusion to boost scan.
  • Currently supports binary expressions (e.g., (X > 1 AND Y = 'something') OR Z > 100), and any combination of them.

Update:

  • added support in schema to convert Date32 to correct arrow type
  • refactored scan to use new predicate converter as visitor and separated it to a new mod
  • added support for simple predicates

///
/// * `Some(Predicate)` if the expression could be successfully converted.
/// * `None` if the expression couldn't be converted to an Iceberg predicate.
fn expr_to_predicate(expr: &Expr) -> Option<Predicate> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could use visitor pattern to make code more cleaner to avoid code bomb if supporting more expressions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I'm not a DataFusion expert but I think this makes sense being implemented as a visitor of Expr, probably by implementing a https://docs.rs/datafusion-common/41.0.0/datafusion_common/tree_node/trait.TreeNodeVisitor.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sdd, @FANNG1 , I appriciate it
I will try to refactor this, though I am wondering whether Visitor will indeed be the most suitable. Let me know what you think, but I think visitor shines when we have to run different logic on different kinds of elements (chained in some way) while we want to keep the logic in one place - i.e., the Visitor. whereas here we have one kind of element - Expr - which is an enum that can be deconstructed in different ways, for example -

Expr::Column(col), op, Expr::Literal(lit)
OR 
Expr::BinaryExpr(left_expr), Operator::Or, Expr::BinaryExpr(right_expr)

so what Im trying to say is that using visitor will simply move the matching complexity to another place - to the visitor.
Does this make sense?
I will continue to try and refactor this but please let me know what you think

Alon Agmon added 2 commits September 1, 2024 12:16
- refactored scan to use new predicate converter as visitor and seperated it to a new mod
- added support for simple predicates with column cast expressions
- added testing, mostly around date functions
@a-agmon
Copy link
Contributor Author

a-agmon commented Sep 1, 2024

Hey @sdd, @FANNG1 and @liurenjie1024
Please see updated PR. I have refactored predicate_converter into a separate module that follows a more visitor-ish approach.
The scan mod is now clean and convert predicates using:

fn convert_filters_to_predicate(filters: &[Expr]) -> Option<Predicate> {
    PredicateConverter.visit_many(filters)
}

PredicateConverter is now more modular and allows adding more expressions and patterns. Though I think that it currently covers quite a lot of the cases that are commonly supported in pushdown strategies. Most of the complexity is just in identifying and matching the particular patterns of the Expr we need to deal with in different ways.
Please let me know what do you think about this

also note that I have added a fix to the schema mod that correctly converts Date type to its Arrow type.

Copy link
Contributor

@sdd sdd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work you've put into this! I think there are some structural changes that could be made, by aligning more closely to the visitor style that we already have, but I also think that perfect is the enemy of done, and since this datafusion integration is still quite experimental, and that Rust makes refactorings quite easy, I'm happy to proceed here with something working that we can evolve further as we go forwards.

My main concern is that it seems like you have taken a "soft-fail" approach, where, for unsupported operators or nodes, you return None and proceed by effectively ignoring as-yet-unimplemented operations? I think the whole thing should return a Result rather than an Option, and when an unimplemented aspect of Expr is encountered, we should fail, returning a Err with ErrorKind::Unimplemented. This prevents confusion and unexpected behaviour by users whose datafusion queries succeed but give them unexpected results due to partial filters.

@a-agmon
Copy link
Contributor Author

a-agmon commented Sep 1, 2024

My main concern is that it seems like you have taken a "soft-fail" approach, where, for unsupported operators or nodes, you return None and proceed by effectively ignoring as-yet-unimplemented operations? I think the whole thing should return a Result rather than an Option, and when an unimplemented aspect of Expr is encountered, we should fail, returning a Err with ErrorKind::Unimplemented. This prevents confusion and unexpected behaviour by users whose datafusion queries succeed but give them unexpected results due to partial filters.

Thanks for the review and comment, @sdd !
I Will be happy to update this accordingly (will also learn more carefully the visitor approach in the code base to perhaps refactor this further in the future).
I do have one question though. The idea I have followed here is that the conversion process operates in an eager-like mode. It works through the expressions, some of which are not implemented yet, but some of which will not be implemented by design (for example: Expr::ScalarFunction(ScalarFunction)), and basically returns a predicate that might not reflect exactly your filters but will filter as much as possible to speed up the query.

for example, suppose that we have an expression like:
(A > 1 AND B = func(X)) OR (A < 0)
and we know that B = func(X) is not an expression we support or can evaluate. So the idea here of soft-fail, is to not fail the entire conversion but to return (A > 1) OR (A < 0), which will still contain unnecessary data but will filter out many files that are not relevant and let you enjoy the metadata-based file pruning.

@a-agmon
Copy link
Contributor Author

a-agmon commented Sep 1, 2024

you can see what I mean in this test case:

    #[test]
    fn test_predicate_conversion_with_unsupported_condition_or() {
        let sql = "(foo > 1 and bar in ('test', 'test2')) or foo < 0 ";
        let df_schema = create_test_schema();
        let expr = SessionContext::new()
            .parse_sql_expr(sql, &df_schema)
            .unwrap();
        let predicate = convert_filters_to_predicate(&[expr]).unwrap();
        let expected_predicate = Predicate::or(
            Reference::new("foo").greater_than(Datum::long(1)),
            Reference::new("foo").less_than(Datum::long(0)),
        );
        assert_eq!(predicate, expected_predicate);
    }

although we don't support bar in ('test', 'test2') there is no reason to fail the entire conversion and not let you enjoy filtering files according to what we can support.
Do you see my point here? or perhaps I misunderstood your comment

@sdd
Copy link
Contributor

sdd commented Sep 1, 2024

Aah I see! That makes complete sense. So, what you are saying is, we do the best we can to filter via metadata within iceberg, but whatever operations we can't handle in iceberg will get applied at row-level anyway once the data is passed back to DataFusion? If so, then I withdraw my objection as this seems totally sensible 👍

@a-agmon
Copy link
Contributor Author

a-agmon commented Sep 1, 2024

Aah I see! That makes complete sense. So, what you are saying is, we do the best we can to filter via metadata within iceberg, but whatever operations we can't handle in iceberg will get applied at row-level anyway once the data is passed back to DataFusion? If so, then I withdraw my objection as this seems totally sensible 👍

Precisely!
DataFusion is pretty fast in scanning and processing parquet files and record batches, but I think that the major performance boost that Iceberg can bring in is by using its metadata to filter out and prune data files. So we make best effort to prune using predicate and let DF handle the rest (some of my tests on huge tables show this to gain a great performance boost)

@a-agmon
Copy link
Contributor Author

a-agmon commented Sep 2, 2024

Hey @liurenjie1024 ,
I think this one is ready to go :)

Copy link
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @a-agmon for this pr! This helps to improve datafusio integration, but there are still some missing part.

@a-agmon
Copy link
Contributor Author

a-agmon commented Sep 2, 2024

Thanks much for the review and comments @liurenjie1024
I refactored the code to be based on TreeNodeVisitor now, and all unit tests moved there.
I initially thought that TreeNodeVisitor will add an unnecessary layer of complexity but it actually seems to simplify it in some sense. Thanks

cc @sdd @FANNG1

impl<'n> TreeNodeVisitor<'n> for ExprToPredicateVisitor {
type Node = Expr;

fn f_down(&mut self, _node: &'n Self::Node) -> Result<TreeNodeRecursion, DataFusionError> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn f_down(&mut self, _node: &'n Self::Node) -> Result<TreeNodeRecursion, DataFusionError> {
fn f_down(&mut self, _node: &Expr) -> Result<TreeNodeRecursion, DataFusionError> {

I would suggest to use concrete type to make it more readable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Ok(TreeNodeRecursion::Continue)
}

fn f_up(&mut self, node: &'n Self::Node) -> Result<TreeNodeRecursion, DataFusionError> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if let Expr::BinaryExpr(binary) = node {
match (&*binary.left, &binary.op, &*binary.right) {
// process simple expressions (involving a column, operator and literal)
(Expr::Column(col), op, Expr::Literal(lit)) => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this visitor visit leaf nodes such as Expr::Column? This pattern doesn't seem flexible to me, for example, I believe we could also support things like a > 6, 6 < a.

Copy link
Contributor Author

@a-agmon a-agmon Sep 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment @liurenjie1024
I have added support, and test, for both "directions" of binary statement: X > 1 and 1 < X.
But I think there is something more to be clarified:
Both x>1 and 1<x are binary statements.
When the binary statement foo > 1 is processed then there are 3 main leafs to handle:

1 - Column(Column { relation: Some(Bare { table: "my_table" }), name: "foo" })

This ^ will be ignored as its not a binary expr (its a column expr)

2 - Literal(Int64(1))

This ^ will also be ignored as its not a binary expr (its a literal expr)

3 - BinaryExpr(BinaryExpr { left: Column(Column { relation: Some(Bare { table: "my_table" }), name: "foo" }), op: Gt, right: Literal(Int64(1)) })

This ^ binary expr will be captured, deconstructed, and processed by this match arm

  match (&*binary.left, &binary.op, &*binary.right) {
                (Expr::Column(col), op, Expr::Literal(lit)) => {
                  ...
                }

Therefore, in order to support a>6 or 6<a we dont need to capture Expr:Column, only the binary statement that includes it.
On the other hand, I dont think we want to be too flexible because there are expr we want to avoid. For example, Expr::SimilarTo (regexp comparison) is something I think we should leave to DataFusion to handle. TBH, I think there are actually very few Exp we want to handle, rather than leaving them to DataFusion, which are mostly where Iceberg can improve the datafusion scan by using predicates at the metadata level and pruning data files to be scanned.

fn convert_compound_expr(&self, valid_preds: &[Predicate], op: &Operator) -> Option<Predicate> {
let valid_preds_count = valid_preds.len();
match (op, valid_preds_count) {
(Operator::And, 1) => valid_preds.first().cloned(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confusing about this case, when will this be valid?

Copy link
Contributor Author

@a-agmon a-agmon Sep 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I basically designed the conversion of filters to be greedy, which means we need to try and convert as many filters as we can in order to scan less files.
suppose we have a case like this:

SELECT * FROM T WHERE A > 10 AND REGEXP_LIKE(name, '[a-z]A[a-zA-Z]{2}');

This will translate into an expr that looks something like this

BinaryExpr(BinaryExpr AND RegExpExpr) 

After we processed both leafs: ( BinaryExpr and RegExpExpr), we have on the stack just one valid expression because the RegExpExpr is not supported.
Then we get to the parent: BinaryExpr(A AND B), we try to get the 2 child expressions from the stack, but in this case we get only one (A > 10) because the second one is not supported (REGEXP_LIKE).
In this case, because its a logical AND expr, we proceed with just one valid predicate. Thats the case you are highlighinting - cases where we process logical AND expr where just one side is valid. in this case - we are processing with just one predicate.
If it was an OR expr that was wrapping both then we would have failed the whole predicate.

let compound_pred = self.convert_compound_expr(&children, op);
self.stack.push_back(compound_pred);
}
_ => {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there sth not supported, I think the correct behavior should be TreeNodeRecursion::Stop and clear the stack?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

#[cfg(test)]
mod tests {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all these tests, really appreciate it!

@a-agmon
Copy link
Contributor Author

a-agmon commented Sep 8, 2024

Thank you very much for the review and comments, @liurenjie1024
I think I have addressed all your comments and questions and would be happy if you can take a another look.

I just want to make a comment which might explain some of my design choices.
I think that the main value of converting data fusion expr to iceberg predicates, is that by pushing down some predicate to an iceberg scan we can prune data files and have data fusion scan less data. So, if there is some Expr that we can convert to Iceberg Predicate and then scan less files we should do it. Therefore I tried to implement this as greedy as possible - trying to convert as much relevant expr as possible, even when its just part of all the filters, and basically leave to data fusion to process the rest (which can sometimes be done better there)

@a-agmon
Copy link
Contributor Author

a-agmon commented Sep 13, 2024

Hi @liurenjie1024 ,
just fixed a few conflicts due to #594 (@FANNG1)
Should be good now

Copy link
Contributor

@sdd sdd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for being so patient on this and explaining the rationale, @a-agmon. I think that there has been a lot of confusion on this because the approach of ignoring unsupported predicates (because they will be handled downstream by DataFusion) was not quite what we're used to seeing, but I think your approach makes sense. And, as I said before, perfect is the enemy of done and I'm happy to move forward with what we have here as there is nothing preventing us from further iterating on this.

Thanks again!

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @a-agmon for working this and thanks @liurenjie1024, @sdd, @FANNG1's review. Let's move!

@Xuanwo Xuanwo merged commit 1533c43 into apache:main Sep 23, 2024
16 checks passed
c-thiel added a commit to hansetag/iceberg-rust that referenced this pull request Oct 2, 2024
commit cda4a0c
Author: congyi wang <[email protected]>
Date:   Mon Sep 30 11:07:36 2024 +0800

    chore: fix typo in FileIO Schemes  (apache#653)

    * fix typo

    * fix typo

commit af9609d
Author: Scott Donnelly <[email protected]>
Date:   Mon Sep 30 04:06:14 2024 +0100

    fix: page index evaluator min/max args inverted (apache#648)

    * fix: page index evaluator min/max args inverted

    * style: fix clippy lint in test

commit a6a3fd7
Author: Alon Agmon <[email protected]>
Date:   Sat Sep 28 10:10:08 2024 +0300

    test (datafusion): add test for table provider creation (apache#651)

    * add test for table provider creation

    * fix formatting

    * fixing yet another formatting issue

    * testing schema using data fusion

    ---------

    Co-authored-by: Alon Agmon <[email protected]>

commit 87483b4
Author: Alon Agmon <[email protected]>
Date:   Fri Sep 27 04:40:08 2024 +0300

    making table provider pub (apache#650)

    Co-authored-by: Alon Agmon <[email protected]>

commit 984c91e
Author: ZENOTME <[email protected]>
Date:   Thu Sep 26 17:56:02 2024 +0800

    avoid to create memory schema operator every time (apache#635)

    Co-authored-by: ZENOTME <[email protected]>

commit 4171275
Author: Matheus Alcantara <[email protected]>
Date:   Wed Sep 25 08:28:42 2024 -0300

    scan: change ErrorKind when table dont have spanshots (apache#608)

commit ab51355
Author: xxchan <[email protected]>
Date:   Tue Sep 24 21:25:45 2024 +0800

    fix: compile error due to merge stale PR (apache#646)

    Signed-off-by: xxchan <[email protected]>

commit 420b4e2
Author: Scott Donnelly <[email protected]>
Date:   Tue Sep 24 08:20:23 2024 +0100

    Table Scan: Add Row Selection Filtering (apache#565)

    * feat(scan): add row selection capability via PageIndexEvaluator

    * test(row-selection): add first few row selection tests

    * feat(scan): add more tests, fix bug where min/max args swapped

    * fix: ad test and fix for logic bug in PageIndexEvaluator in-clause handler

    * feat: changes suggested from PR review

commit b3709ba
Author: Christian <[email protected]>
Date:   Tue Sep 24 04:47:04 2024 +0200

    feat: Add NamespaceIdent.parent() (apache#641)

    * Add NamespaceIdent.parent()

    * Use split_last

commit 1533c43
Author: Alon Agmon <[email protected]>
Date:   Mon Sep 23 13:39:46 2024 +0300

    feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate (apache#588)

    * adding main function and tests

    * adding tests, removing integration test for now

    * fixing typos and lints

    * fixing typing issue

    * - added support in schmema to convert Date32 to correct arrow type
    - refactored scan to use new predicate converter as visitor and seperated it to a new mod
    - added support for simple predicates with column cast expressions
    - added testing, mostly around date functions

    * fixing format and lic

    * reducing number of tests (17 -> 7)

    * fix formats

    * fix naming

    * refactoring to use TreeNodeVisitor

    * fixing fmt

    * small refactor

    * adding swapped op and fixing CR comments

    ---------

    Co-authored-by: Alon Agmon <[email protected]>

commit e967deb
Author: xxchan <[email protected]>
Date:   Mon Sep 23 18:34:59 2024 +0800

    feat: expose remove_all in FileIO (apache#643)

    Signed-off-by: xxchan <[email protected]>

commit d03c4f8
Author: Scott Donnelly <[email protected]>
Date:   Mon Sep 23 08:28:52 2024 +0100

    Migrate to arrow-* v53 (apache#626)

    * chore: migrate to arrow-* v53

    * chore: update datafusion to 42

    * test: fix incorrect test assertion

    * chore: update python bindings to arrow 53

commit 88e5e4a
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Sep 23 15:26:18 2024 +0800

    chore(deps): Bump crate-ci/typos from 1.24.5 to 1.24.6 (apache#640)

    Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.5 to 1.24.6.
    - [Release notes](https://github.com/crate-ci/typos/releases)
    - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
    - [Commits](crate-ci/typos@v1.24.5...v1.24.6)

    ---
    updated-dependencies:
    - dependency-name: crate-ci/typos
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit c354983
Author: xxchan <[email protected]>
Date:   Mon Sep 23 14:50:18 2024 +0800

    doc: improve FileIO doc (apache#642)

    Signed-off-by: xxchan <[email protected]>

commit 12e12e2
Author: xxchan <[email protected]>
Date:   Fri Sep 20 19:59:55 2024 +0800

    feat: expose arrow type <-> iceberg type (apache#637)

    * feat: expose arrow type <-> iceberg type

    Previously we only exposed the schema conversion.

    Signed-off-by: xxchan <[email protected]>

    * add tests

    Signed-off-by: xxchan <[email protected]>

    ---------

    Signed-off-by: xxchan <[email protected]>

commit 3b27c9e
Author: xxchan <[email protected]>
Date:   Fri Sep 20 18:32:31 2024 +0800

    feat: add Sync to TransformFunction (apache#638)

    Signed-off-by: xxchan <[email protected]>

commit 34cb81c
Author: Xuanwo <[email protected]>
Date:   Wed Sep 18 20:18:40 2024 +0800

    chore: Bump opendal to 0.50 (apache#634)

commit cde35ab
Author: FANNG <[email protected]>
Date:   Fri Sep 13 10:01:16 2024 +0800

    feat: support projection pushdown for datafusion iceberg (apache#594)

    * support projection pushdown for datafusion iceberg

    * support projection pushdown for datafusion iceberg

    * fix ci

    * fix field id

    * remove depencences

    * remove depencences

commit eae9464
Author: Xuanwo <[email protected]>
Date:   Thu Sep 12 02:06:31 2024 +0800

    refactor(python): Expose transform as a submodule for pyiceberg_core (apache#628)

commit 8a3de4e
Author: Christian <[email protected]>
Date:   Mon Sep 9 14:45:16 2024 +0200

    Feat: Normalize TableMetadata (apache#611)

    * Normalize Table Metadata

    * Improve readability & comments

commit e08c0e5
Author: Renjie Liu <[email protected]>
Date:   Mon Sep 9 11:57:22 2024 +0800

    fix: Correctly calculate highest_field_id in schema (apache#590)

commit f78c59b
Author: Jack <[email protected]>
Date:   Mon Sep 9 03:35:16 2024 +0100

    feat: add `client.region` (apache#623)

commit a5aba9a
Author: Christian <[email protected]>
Date:   Sun Sep 8 18:36:05 2024 +0200

    feat: SortOrder methods should take schema ref if possible (apache#613)

    * SortOrder methods should take schema ref if possible

    * Fix test type

    * with_order_id should not take reference

commit 5812399
Author: Christian <[email protected]>
Date:   Sun Sep 8 18:18:41 2024 +0200

    feat: partition compatibility (apache#612)

    * Partition compatability

    * Partition compatability

    * Rename compatible_with -> is_compatible_with

commit ede4720
Author: Christian <[email protected]>
Date:   Sun Sep 8 16:49:39 2024 +0200

    fix: Less Panics for Snapshot timestamps (apache#614)

commit ced661f
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Sep 8 22:43:38 2024 +0800

    chore(deps): Bump crate-ci/typos from 1.24.3 to 1.24.5 (apache#616)

    Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.3 to 1.24.5.
    - [Release notes](https://github.com/crate-ci/typos/releases)
    - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
    - [Commits](crate-ci/typos@v1.24.3...v1.24.5)

    ---
    updated-dependencies:
    - dependency-name: crate-ci/typos
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit cbbd086
Author: Xuanwo <[email protected]>
Date:   Sun Sep 8 10:29:31 2024 +0800

    feat: Add more fields in FileScanTask (apache#609)

    Signed-off-by: Xuanwo <[email protected]>

commit 620d58e
Author: Callum Ryan <[email protected]>
Date:   Thu Sep 5 03:44:55 2024 +0100

    feat: SQL Catalog - namespaces (apache#534)

    * feat: SQL Catalog - namespaces

    Signed-off-by: callum-ryan <[email protected]>

    * feat: use transaction for updates and creates

    Signed-off-by: callum-ryan <[email protected]>

    * fix: pull out query param builder to fn

    Signed-off-by: callum-ryan <[email protected]>

    * feat: add drop and tests

    Signed-off-by: callum-ryan <[email protected]>

    * fix: String to str, remove pub and optimise query builder

    Signed-off-by: callum-ryan <[email protected]>

    * fix: nested match, remove ok()

    Signed-off-by: callum-ryan <[email protected]>

    * fix: remove pub, add set, add comments

    Signed-off-by: callum-ryan <[email protected]>

    * fix: refactor list_namespaces slightly

    Signed-off-by: callum-ryan <[email protected]>

    * fix: add default properties to all new namespaces

    Signed-off-by: callum-ryan <[email protected]>

    * fix: remove check for nested namespace

    Signed-off-by: callum-ryan <[email protected]>

    * chore: add more comments to the CatalogConfig to explain bind styles

    Signed-off-by: callum-ryan <[email protected]>

    * fix: edit test for nested namespaces

    Signed-off-by: callum-ryan <[email protected]>

    ---------

    Signed-off-by: callum-ryan <[email protected]>

commit ae75f96
Author: Søren Dalby Larsen <[email protected]>
Date:   Tue Sep 3 13:46:48 2024 +0200

    chore: bump crate-ci/typos to 1.24.3 (apache#598)

commit 7aa8bdd
Author: Scott Donnelly <[email protected]>
Date:   Thu Aug 29 04:37:48 2024 +0100

    Table Scan: Add Row Group Skipping (apache#558)

    * feat(scan): add row group and page index row selection filtering

    * fix(row selection): off-by-one error

    * feat: remove row selection to defer to a second PR

    * feat: better min/max val conversion in RowGroupMetricsEvaluator

    * test(row_group_filtering): first three tests

    * test(row_group_filtering): next few tests

    * test: add more tests for RowGroupMetricsEvaluator

    * chore: refactor test assertions to silence clippy lints

    * refactor: consolidate parquet stat min/max parsing in one place

commit da08e8d
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Aug 28 14:35:55 2024 +0800

    chore(deps): Bump crate-ci/typos from 1.23.6 to 1.24.1 (apache#583)

commit ecbb4c3
Author: Sung Yun <[email protected]>
Date:   Mon Aug 26 23:57:01 2024 -0400

    Expose Transforms to Python Binding (apache#556)

    * bucket transform rust binding

    * format

    * poetry x maturin

    * ignore poetry.lock in license check

    * update bindings_python_ci to use makefile

    * newline

    * python-poetry/poetry#9135

    * use hatch instead of poetry

    * refactor

    * revert licenserc change

    * adopt review feedback

    * comments

    * unused dependency

    * adopt review comment

    * newline

    * I like this approach a lot better

    * more tests

commit 905ebd2
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Aug 26 20:49:07 2024 +0800

    chore(deps): Update typed-builder requirement from 0.19 to 0.20 (apache#582)

    ---
    updated-dependencies:
    - dependency-name: typed-builder
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit f9c92b7
Author: FANNG <[email protected]>
Date:   Sun Aug 25 22:31:36 2024 +0800

    fix: Update sqlx from 0.8.0 to 0.8.1 (apache#584)

commit ba66665
Author: FANNG <[email protected]>
Date:   Sat Aug 24 12:35:36 2024 +0800

    fix: correct partition-id to field-id in UnboundPartitionField (apache#576)

    * correct partition-id to field id in PartitionSpec

    * correct partition-id to field id in PartitionSpec

    * correct partition-id to field id in PartitionSpec

    * xx
c-thiel added a commit to hansetag/iceberg-rust that referenced this pull request Oct 2, 2024
commit 2c9d7e4
Author: Christian Thiel <[email protected]>
Date:   Wed Oct 2 07:51:50 2024 +0200

    New builder

commit fea1817
Author: Christian Thiel <[email protected]>
Date:   Tue Oct 1 12:56:10 2024 +0200

    Merge branch 'feat/safe-partition-spec'

commit 1f49c95
Author: Christian Thiel <[email protected]>
Date:   Tue Oct 1 12:55:02 2024 +0200

    Merge branch 'feat/schema-reassign-field-ids'

commit cda4a0c
Author: congyi wang <[email protected]>
Date:   Mon Sep 30 11:07:36 2024 +0800

    chore: fix typo in FileIO Schemes  (apache#653)

    * fix typo

    * fix typo

commit af9609d
Author: Scott Donnelly <[email protected]>
Date:   Mon Sep 30 04:06:14 2024 +0100

    fix: page index evaluator min/max args inverted (apache#648)

    * fix: page index evaluator min/max args inverted

    * style: fix clippy lint in test

commit a6a3fd7
Author: Alon Agmon <[email protected]>
Date:   Sat Sep 28 10:10:08 2024 +0300

    test (datafusion): add test for table provider creation (apache#651)

    * add test for table provider creation

    * fix formatting

    * fixing yet another formatting issue

    * testing schema using data fusion

    ---------

    Co-authored-by: Alon Agmon <[email protected]>

commit 87483b4
Author: Alon Agmon <[email protected]>
Date:   Fri Sep 27 04:40:08 2024 +0300

    making table provider pub (apache#650)

    Co-authored-by: Alon Agmon <[email protected]>

commit 984c91e
Author: ZENOTME <[email protected]>
Date:   Thu Sep 26 17:56:02 2024 +0800

    avoid to create memory schema operator every time (apache#635)

    Co-authored-by: ZENOTME <[email protected]>

commit 4171275
Author: Matheus Alcantara <[email protected]>
Date:   Wed Sep 25 08:28:42 2024 -0300

    scan: change ErrorKind when table dont have spanshots (apache#608)

commit ab51355
Author: xxchan <[email protected]>
Date:   Tue Sep 24 21:25:45 2024 +0800

    fix: compile error due to merge stale PR (apache#646)

    Signed-off-by: xxchan <[email protected]>

commit 420b4e2
Author: Scott Donnelly <[email protected]>
Date:   Tue Sep 24 08:20:23 2024 +0100

    Table Scan: Add Row Selection Filtering (apache#565)

    * feat(scan): add row selection capability via PageIndexEvaluator

    * test(row-selection): add first few row selection tests

    * feat(scan): add more tests, fix bug where min/max args swapped

    * fix: ad test and fix for logic bug in PageIndexEvaluator in-clause handler

    * feat: changes suggested from PR review

commit b3709ba
Author: Christian <[email protected]>
Date:   Tue Sep 24 04:47:04 2024 +0200

    feat: Add NamespaceIdent.parent() (apache#641)

    * Add NamespaceIdent.parent()

    * Use split_last

commit 1533c43
Author: Alon Agmon <[email protected]>
Date:   Mon Sep 23 13:39:46 2024 +0300

    feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate (apache#588)

    * adding main function and tests

    * adding tests, removing integration test for now

    * fixing typos and lints

    * fixing typing issue

    * - added support in schmema to convert Date32 to correct arrow type
    - refactored scan to use new predicate converter as visitor and seperated it to a new mod
    - added support for simple predicates with column cast expressions
    - added testing, mostly around date functions

    * fixing format and lic

    * reducing number of tests (17 -> 7)

    * fix formats

    * fix naming

    * refactoring to use TreeNodeVisitor

    * fixing fmt

    * small refactor

    * adding swapped op and fixing CR comments

    ---------

    Co-authored-by: Alon Agmon <[email protected]>

commit e967deb
Author: xxchan <[email protected]>
Date:   Mon Sep 23 18:34:59 2024 +0800

    feat: expose remove_all in FileIO (apache#643)

    Signed-off-by: xxchan <[email protected]>

commit d03c4f8
Author: Scott Donnelly <[email protected]>
Date:   Mon Sep 23 08:28:52 2024 +0100

    Migrate to arrow-* v53 (apache#626)

    * chore: migrate to arrow-* v53

    * chore: update datafusion to 42

    * test: fix incorrect test assertion

    * chore: update python bindings to arrow 53

commit 88e5e4a
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Sep 23 15:26:18 2024 +0800

    chore(deps): Bump crate-ci/typos from 1.24.5 to 1.24.6 (apache#640)

    Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.5 to 1.24.6.
    - [Release notes](https://github.com/crate-ci/typos/releases)
    - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
    - [Commits](crate-ci/typos@v1.24.5...v1.24.6)

    ---
    updated-dependencies:
    - dependency-name: crate-ci/typos
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit c354983
Author: xxchan <[email protected]>
Date:   Mon Sep 23 14:50:18 2024 +0800

    doc: improve FileIO doc (apache#642)

    Signed-off-by: xxchan <[email protected]>

commit 12e12e2
Author: xxchan <[email protected]>
Date:   Fri Sep 20 19:59:55 2024 +0800

    feat: expose arrow type <-> iceberg type (apache#637)

    * feat: expose arrow type <-> iceberg type

    Previously we only exposed the schema conversion.

    Signed-off-by: xxchan <[email protected]>

    * add tests

    Signed-off-by: xxchan <[email protected]>

    ---------

    Signed-off-by: xxchan <[email protected]>

commit 3b27c9e
Author: xxchan <[email protected]>
Date:   Fri Sep 20 18:32:31 2024 +0800

    feat: add Sync to TransformFunction (apache#638)

    Signed-off-by: xxchan <[email protected]>

commit 34cb81c
Author: Xuanwo <[email protected]>
Date:   Wed Sep 18 20:18:40 2024 +0800

    chore: Bump opendal to 0.50 (apache#634)

commit cde35ab
Author: FANNG <[email protected]>
Date:   Fri Sep 13 10:01:16 2024 +0800

    feat: support projection pushdown for datafusion iceberg (apache#594)

    * support projection pushdown for datafusion iceberg

    * support projection pushdown for datafusion iceberg

    * fix ci

    * fix field id

    * remove depencences

    * remove depencences

commit eae9464
Author: Xuanwo <[email protected]>
Date:   Thu Sep 12 02:06:31 2024 +0800

    refactor(python): Expose transform as a submodule for pyiceberg_core (apache#628)

commit 8a3de4e
Author: Christian <[email protected]>
Date:   Mon Sep 9 14:45:16 2024 +0200

    Feat: Normalize TableMetadata (apache#611)

    * Normalize Table Metadata

    * Improve readability & comments

commit e08c0e5
Author: Renjie Liu <[email protected]>
Date:   Mon Sep 9 11:57:22 2024 +0800

    fix: Correctly calculate highest_field_id in schema (apache#590)

commit f78c59b
Author: Jack <[email protected]>
Date:   Mon Sep 9 03:35:16 2024 +0100

    feat: add `client.region` (apache#623)

commit a5aba9a
Author: Christian <[email protected]>
Date:   Sun Sep 8 18:36:05 2024 +0200

    feat: SortOrder methods should take schema ref if possible (apache#613)

    * SortOrder methods should take schema ref if possible

    * Fix test type

    * with_order_id should not take reference

commit 5812399
Author: Christian <[email protected]>
Date:   Sun Sep 8 18:18:41 2024 +0200

    feat: partition compatibility (apache#612)

    * Partition compatability

    * Partition compatability

    * Rename compatible_with -> is_compatible_with

commit ede4720
Author: Christian <[email protected]>
Date:   Sun Sep 8 16:49:39 2024 +0200

    fix: Less Panics for Snapshot timestamps (apache#614)

commit ced661f
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Sep 8 22:43:38 2024 +0800

    chore(deps): Bump crate-ci/typos from 1.24.3 to 1.24.5 (apache#616)

    Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.3 to 1.24.5.
    - [Release notes](https://github.com/crate-ci/typos/releases)
    - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
    - [Commits](crate-ci/typos@v1.24.3...v1.24.5)

    ---
    updated-dependencies:
    - dependency-name: crate-ci/typos
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit cbbd086
Author: Xuanwo <[email protected]>
Date:   Sun Sep 8 10:29:31 2024 +0800

    feat: Add more fields in FileScanTask (apache#609)

    Signed-off-by: Xuanwo <[email protected]>

commit 620d58e
Author: Callum Ryan <[email protected]>
Date:   Thu Sep 5 03:44:55 2024 +0100

    feat: SQL Catalog - namespaces (apache#534)

    * feat: SQL Catalog - namespaces

    Signed-off-by: callum-ryan <[email protected]>

    * feat: use transaction for updates and creates

    Signed-off-by: callum-ryan <[email protected]>

    * fix: pull out query param builder to fn

    Signed-off-by: callum-ryan <[email protected]>

    * feat: add drop and tests

    Signed-off-by: callum-ryan <[email protected]>

    * fix: String to str, remove pub and optimise query builder

    Signed-off-by: callum-ryan <[email protected]>

    * fix: nested match, remove ok()

    Signed-off-by: callum-ryan <[email protected]>

    * fix: remove pub, add set, add comments

    Signed-off-by: callum-ryan <[email protected]>

    * fix: refactor list_namespaces slightly

    Signed-off-by: callum-ryan <[email protected]>

    * fix: add default properties to all new namespaces

    Signed-off-by: callum-ryan <[email protected]>

    * fix: remove check for nested namespace

    Signed-off-by: callum-ryan <[email protected]>

    * chore: add more comments to the CatalogConfig to explain bind styles

    Signed-off-by: callum-ryan <[email protected]>

    * fix: edit test for nested namespaces

    Signed-off-by: callum-ryan <[email protected]>

    ---------

    Signed-off-by: callum-ryan <[email protected]>

commit ae75f96
Author: Søren Dalby Larsen <[email protected]>
Date:   Tue Sep 3 13:46:48 2024 +0200

    chore: bump crate-ci/typos to 1.24.3 (apache#598)

commit 7aa8bdd
Author: Scott Donnelly <[email protected]>
Date:   Thu Aug 29 04:37:48 2024 +0100

    Table Scan: Add Row Group Skipping (apache#558)

    * feat(scan): add row group and page index row selection filtering

    * fix(row selection): off-by-one error

    * feat: remove row selection to defer to a second PR

    * feat: better min/max val conversion in RowGroupMetricsEvaluator

    * test(row_group_filtering): first three tests

    * test(row_group_filtering): next few tests

    * test: add more tests for RowGroupMetricsEvaluator

    * chore: refactor test assertions to silence clippy lints

    * refactor: consolidate parquet stat min/max parsing in one place

commit da08e8d
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Aug 28 14:35:55 2024 +0800

    chore(deps): Bump crate-ci/typos from 1.23.6 to 1.24.1 (apache#583)

commit ecbb4c3
Author: Sung Yun <[email protected]>
Date:   Mon Aug 26 23:57:01 2024 -0400

    Expose Transforms to Python Binding (apache#556)

    * bucket transform rust binding

    * format

    * poetry x maturin

    * ignore poetry.lock in license check

    * update bindings_python_ci to use makefile

    * newline

    * python-poetry/poetry#9135

    * use hatch instead of poetry

    * refactor

    * revert licenserc change

    * adopt review feedback

    * comments

    * unused dependency

    * adopt review comment

    * newline

    * I like this approach a lot better

    * more tests

commit 905ebd2
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Aug 26 20:49:07 2024 +0800

    chore(deps): Update typed-builder requirement from 0.19 to 0.20 (apache#582)

    ---
    updated-dependencies:
    - dependency-name: typed-builder
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit f9c92b7
Author: FANNG <[email protected]>
Date:   Sun Aug 25 22:31:36 2024 +0800

    fix: Update sqlx from 0.8.0 to 0.8.1 (apache#584)

commit ba66665
Author: FANNG <[email protected]>
Date:   Sat Aug 24 12:35:36 2024 +0800

    fix: correct partition-id to field-id in UnboundPartitionField (apache#576)

    * correct partition-id to field id in PartitionSpec

    * correct partition-id to field id in PartitionSpec

    * correct partition-id to field id in PartitionSpec

    * xx
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert datafusion table scan filter into iceberg table scan' filter.
5 participants