
with_row_count shows non-deterministic behavior in lazy execution #12944

Closed
2 tasks done
rcliu623 opened this issue Dec 7, 2023 · 2 comments
Labels: bug (Something isn't working), invalid (A bug report that is not actually a bug), python (Related to Python Polars)

Comments


rcliu623 commented Dec 7, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

    import polars as pl

    val_df = pl.DataFrame(
        {
            "key1": ["a", "a", "a", "b", "b", "b"],
            "id": [1, 2, 2, 3, 3, 5],
            "val": [1, 1, 1, 1, 1, 1],
        }
    )
    val_ldf = val_df.lazy()
    post_net_df = val_ldf.group_by("key1", "id").agg(pl.col("val").sum())
    mv_df = post_net_df.group_by("key1").agg(pl.col("val").sum()).with_row_count("_group_id")
    post_net_df = post_net_df.join(mv_df, on="key1", how="left")
    print(pl.collect_all([post_net_df, mv_df]))

Log output

DATAFRAME < 1000 rows: running default HASH AGGREGATION
join parallel: false
DATAFRAME < 1000 rows: running default HASH AGGREGATION
DATAFRAME < 1000 rows: running default HASH AGGREGATION
CACHE SET: cache id: 952e33479f159579
CACHE HIT: cache id: 952e33479f159579
DATAFRAME < 1000 rows: running default HASH AGGREGATION
LEFT join dataframes finished

[shape: (4, 5)
┌──────┬─────┬─────┬───────────┬───────────┐
│ key1 ┆ id  ┆ val ┆ _group_id ┆ val_right │
│ ---  ┆ --- ┆ --- ┆ ---       ┆ ---       │
│ str  ┆ i64 ┆ i64 ┆ u32       ┆ i64       │
╞══════╪═════╪═════╪═══════════╪═══════════╡
│ a    ┆ 2   ┆ 2   ┆ 1         ┆ 3         │
│ b    ┆ 3   ┆ 2   ┆ 0         ┆ 3         │
│ b    ┆ 5   ┆ 1   ┆ 0         ┆ 3         │
│ a    ┆ 1   ┆ 1   ┆ 1         ┆ 3         │
└──────┴─────┴─────┴───────────┴───────────┘, shape: (2, 3)
┌───────────┬──────┬─────┐
│ _group_id ┆ key1 ┆ val │
│ ---       ┆ ---  ┆ --- │
│ u32       ┆ str  ┆ i64 │
╞═══════════╪══════╪═════╡
│ 0         ┆ a    ┆ 3   │
│ 1         ┆ b    ┆ 3   │
└───────────┴──────┴─────┘]

Issue description

In lazy execution, I assign row indices to mv_df to establish a unique mapping from each key1 value to a _group_id, then join that back into post_net_df. However, the _group_id values in post_net_df do not always match those in mv_df. Eager execution does not have this problem.

Expected behavior

The _group_id column should align between the two dataframes. In this case, key1=a should map to 0 and key1=b to 1.

[shape: (4, 5)
┌──────┬─────┬─────┬───────────┬───────────┐
│ key1 ┆ id  ┆ val ┆ _group_id ┆ val_right │
│ ---  ┆ --- ┆ --- ┆ ---       ┆ ---       │
│ str  ┆ i64 ┆ i64 ┆ u32       ┆ i64       │
╞══════╪═════╪═════╪═══════════╪═══════════╡
│ a    ┆ 2   ┆ 2   ┆ 0         ┆ 3         │
│ b    ┆ 3   ┆ 2   ┆ 1         ┆ 3         │
│ b    ┆ 5   ┆ 1   ┆ 1         ┆ 3         │
│ a    ┆ 1   ┆ 1   ┆ 0         ┆ 3         │
└──────┴─────┴─────┴───────────┴───────────┘, shape: (2, 3)
┌───────────┬──────┬─────┐
│ _group_id ┆ key1 ┆ val │
│ ---       ┆ ---  ┆ --- │
│ u32       ┆ str  ┆ i64 │
╞═══════════╪══════╪═════╡
│ 0         ┆ a    ┆ 3   │
│ 1         ┆ b    ┆ 3   │
└───────────┴──────┴─────┘]

Installed versions

--------Version info---------
Polars:               0.19.19
Index type:           UInt32
Platform:             Linux-4.18.0-372.32.1.el8_6.jump1.x86_64-x86_64-with-glibc2.28
Python:               3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.10.0
gevent:               <not installed>
matplotlib:           <not installed>
numpy:                1.25.0
openpyxl:             <not installed>
pandas:               2.1.2
pyarrow:              12.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.45
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@rcliu623 rcliu623 added bug Something isn't working python Related to Python Polars labels Dec 7, 2023
cmdlineluser (Contributor) commented:

I think this is #9786

(you need maintain_order=True for each .group_by to guarantee the order)


rcliu623 commented Dec 7, 2023

Yes indeed, thanks for the reference. Closing this as an exact duplicate.

@rcliu623 rcliu623 closed this as completed Dec 7, 2023
@alexander-beedie alexander-beedie added the invalid A bug report that is not actually a bug label Dec 8, 2023