feat(rust,python,cli): add SQL support for UNION [ALL] BY NAME, add "diagonal_relaxed" strategy for pl.concat #11597

Merged

alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented Oct 8, 2023

Issues closed

"Three for the price of one..."
Additional ref: #8917 (comment).

Closes #9370.
Closes #9891.
Closes #9412.

TLDR

  • New pl.concat strategy "diagonal_relaxed".
  • New SQL support, eg: SELECT * FROM df1 UNION ALL BY NAME SELECT * FROM df2.
  • The "to_supertypes" parameter is set to True by default for UNION ops iff they come from the SQL interface, as this is the expected behaviour in that environment.
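
As a rough mental model of what "by name" means here, the sketch below aligns rows on column names rather than on position and fills any column missing from one input with nulls. It is plain Python and not the Polars implementation; `union_all_by_name` is a hypothetical helper name, and the supertype casting is deliberately left out.

```python
def union_all_by_name(*tables):
    """Toy model of UNION ALL BY NAME; each table is a list of dict rows."""
    # Collect column names in first-seen order across all inputs.
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Re-emit every row with the full column set; missing values become None.
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]


t1 = [{"A": 1, "B": 5}]
t2 = [{"B": 4, "A": 2, "C": 123}]  # different column order, extra column
for row in union_all_by_name(t1, t2):
    print(row)
```

The point of the model is that column order in the second input stops mattering once rows are matched by name.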

Miscellaneous

While adding diagonal support for the "to_supertypes" param I also standardised the Rust-side function names for consistency/clarity, and updated the diagonal variant signature to match the standard concat. Both now take exactly the same UnionArgs struct, instead of breaking out the args as standalone params as was previously done.

  • hor_concat_df → concat_df_horizontal
  • diag_concat_df → concat_df_diagonal
  • diag_concat_lf → concat_lf_diagonal

Examples

  • Setup sample frame data:

    import polars as pl
    
    df1 = pl.DataFrame({
        "A": [1, 2, 2],
        "B": [5, 4, 4],
    })
    
    df2 = df1.select(
        pl.col("B").cast(pl.UInt16),
        pl.col("A").cast(pl.Int32),
        pl.lit(123).alias("C"),
    )
  • Demonstrate new pl.concat strategy:

    pl.concat([df1, df2], how="diagonal")
    # ShapeError:
    #  unable to vstack, dtypes for column "A" don't match: `i64` and `i32`
    
    pl.concat([df1, df2], how="diagonal_relaxed")
    # ┌─────┬─────┬──────┐
    # │ A   ┆ B   ┆ C    │
    # │ --- ┆ --- ┆ ---  │
    # │ i64 ┆ i64 ┆ i32  │
    # ╞═════╪═════╪══════╡
    # │ 1   ┆ 5   ┆ null │
    # │ 2   ┆ 4   ┆ null │
    # │ 2   ┆ 4   ┆ null │
    # │ 1   ┆ 5   ┆ 123  │
    # │ 2   ┆ 4   ┆ 123  │
    # │ 2   ┆ 4   ┆ 123  │
    # └─────┴─────┴──────┘
  • Demonstrate equivalent/new SQL support:

    with pl.SQLContext(
        register_globals=True,
        eager_execution=True,
    ) as ctx:
    
        ctx.execute(
            "SELECT * FROM df1 UNION ALL BY NAME SELECT * FROM df2"
        )
        # ┌─────┬─────┬──────┐
        # │ A   ┆ B   ┆ C    │
        # │ --- ┆ --- ┆ ---  │
        # │ i64 ┆ i64 ┆ i32  │
        # ╞═════╪═════╪══════╡
        # │ 1   ┆ 5   ┆ null │
        # │ 2   ┆ 4   ┆ null │
        # │ 2   ┆ 4   ┆ null │
        # │ 1   ┆ 5   ┆ 123  │
        # │ 2   ┆ 4   ┆ 123  │
        # │ 2   ┆ 4   ┆ 123  │
        # └─────┴─────┴──────┘
    
        ctx.execute(
            "SELECT * FROM df1 UNION BY NAME SELECT * FROM df2"
        )
        # ┌─────┬─────┬──────┐
        # │ A   ┆ B   ┆ C    │
        # │ --- ┆ --- ┆ ---  │
        # │ i64 ┆ i64 ┆ i32  │
        # ╞═════╪═════╪══════╡
        # │ 2   ┆ 4   ┆ null │
        # │ 1   ┆ 5   ┆ null │
        # │ 2   ┆ 4   ┆ 123  │
        # │ 1   ┆ 5   ┆ 123  │
        # └─────┴─────┴──────┘

Bonus: while down this rabbit hole I also made an upstream PR with sqlparser-rs to support UNION DISTINCT BY NAME; we don't need to wait on that as the DISTINCT keyword is (functionally) optional.

Member

@ritchie46 ritchie46 left a comment

Nice! Small remark that saves a heap allocation. Other than that, good to go. 👍

I think many people like a relaxed diagonal. :)

crates/polars-lazy/src/dsl/functions.rs (outdated, resolved)
@ritchie46 ritchie46 merged commit 5eb499c into pola-rs:main Oct 9, 2023
28 checks passed
@alexander-beedie alexander-beedie deleted the union-by-name-concat-diagonal branch October 9, 2023 14:26
@cmdlineluser
Contributor

Thanks for this @alexander-beedie - a very neat addition \o/

Just a small note, I didn't see diagonal_relaxed in the "how" section of the docs. (It is fully documented in the list underneath though.)

how : {‘vertical’, ‘vertical_relaxed’, ‘diagonal’, ‘horizontal’, ‘align’}

I can make a PR if needed, or perhaps it's simpler for you to add it as it is a 1 word change.

@alexander-beedie
Collaborator Author

I can make a PR if needed, or perhaps it's simpler for you to add it as it is a 1 word change.

Sounds like a plan; I'll be adding/fixing batched read_database mode for Databricks SQL tomorrow (now I have access to a test account), so I can throw this in at the same time :)

Labels
A-sql Area: Polars SQL functionality enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Development

Successfully merging this pull request may close these issues.

Auto-sort column order when concat vertical and vertical_relaxed
UNION too sensitive with column ordering
3 participants