
with_row_count shows non-deterministic behavior in lazy execution #12944

Closed
2 tasks done
rcliu623 opened this issue Dec 7, 2023 · 2 comments
Labels: bug (Something isn't working), invalid (A bug report that is not actually a bug), python (Related to Python Polars)

Comments


rcliu623 commented Dec 7, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

    import polars as pl

    val_df = pl.DataFrame(
        {
            "key1": ["a", "a", "a", "b", "b", "b"],
            "id": [1, 2, 2, 3, 3, 5],
            "val": [1, 1, 1, 1, 1, 1],
        }
    )
    val_ldf = val_df.lazy()
    post_net_df = val_ldf.group_by("key1", "id").agg(pl.col("val").sum())
    mv_df = post_net_df.group_by("key1").agg(pl.col("val").sum()).with_row_count("_group_id")
    post_net_df = post_net_df.join(mv_df, on="key1", how="left")
    print(pl.collect_all([post_net_df, mv_df]))

Log output

DATAFRAME < 1000 rows: running default HASH AGGREGATION
join parallel: false
DATAFRAME < 1000 rows: running default HASH AGGREGATION
DATAFRAME < 1000 rows: running default HASH AGGREGATION
CACHE SET: cache id: 952e33479f159579
CACHE HIT: cache id: 952e33479f159579
DATAFRAME < 1000 rows: running default HASH AGGREGATION
LEFT join dataframes finished

[shape: (4, 5)
┌──────┬─────┬─────┬───────────┬───────────┐
│ key1 ┆ id  ┆ val ┆ _group_id ┆ val_right │
│ ---  ┆ --- ┆ --- ┆ ---       ┆ ---       │
│ str  ┆ i64 ┆ i64 ┆ u32       ┆ i64       │
╞══════╪═════╪═════╪═══════════╪═══════════╡
│ a    ┆ 2   ┆ 2   ┆ 1         ┆ 3         │
│ b    ┆ 3   ┆ 2   ┆ 0         ┆ 3         │
│ b    ┆ 5   ┆ 1   ┆ 0         ┆ 3         │
│ a    ┆ 1   ┆ 1   ┆ 1         ┆ 3         │
└──────┴─────┴─────┴───────────┴───────────┘, shape: (2, 3)
┌───────────┬──────┬─────┐
│ _group_id ┆ key1 ┆ val │
│ ---       ┆ ---  ┆ --- │
│ u32       ┆ str  ┆ i64 │
╞═══════════╪══════╪═════╡
│ 0         ┆ a    ┆ 3   │
│ 1         ┆ b    ┆ 3   │
└───────────┴──────┴─────┘]

Issue description

In lazy execution, I assign row indices to mv_df to establish a unique mapping from each key1 value to a _group_id, then join that back into post_net_df. However, the _group_id values in post_net_df do not always match those in mv_df. Eager execution does not have this problem.

Expected behavior

The _group_id column should align between the two dataframes. In this case, key1=a should map to 0 and key1=b to 1.

[shape: (4, 5)
┌──────┬─────┬─────┬───────────┬───────────┐
│ key1 ┆ id  ┆ val ┆ _group_id ┆ val_right │
│ ---  ┆ --- ┆ --- ┆ ---       ┆ ---       │
│ str  ┆ i64 ┆ i64 ┆ u32       ┆ i64       │
╞══════╪═════╪═════╪═══════════╪═══════════╡
│ a    ┆ 2   ┆ 2   ┆ 0         ┆ 3         │
│ b    ┆ 3   ┆ 2   ┆ 1         ┆ 3         │
│ b    ┆ 5   ┆ 1   ┆ 1         ┆ 3         │
│ a    ┆ 1   ┆ 1   ┆ 0         ┆ 3         │
└──────┴─────┴─────┴───────────┴───────────┘, shape: (2, 3)
┌───────────┬──────┬─────┐
│ _group_id ┆ key1 ┆ val │
│ ---       ┆ ---  ┆ --- │
│ u32       ┆ str  ┆ i64 │
╞═══════════╪══════╪═════╡
│ 0         ┆ a    ┆ 3   │
│ 1         ┆ b    ┆ 3   │
└───────────┴──────┴─────┘]

Installed versions

--------Version info---------
Polars:               0.19.19
Index type:           UInt32
Platform:             Linux-4.18.0-372.32.1.el8_6.jump1.x86_64-x86_64-with-glibc2.28
Python:               3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.10.0
gevent:               <not installed>
matplotlib:           <not installed>
numpy:                1.25.0
openpyxl:             <not installed>
pandas:               2.1.2
pyarrow:              12.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.45
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@rcliu623 rcliu623 added bug Something isn't working python Related to Python Polars labels Dec 7, 2023
cmdlineluser (Contributor) commented:

I think this is #9786

(you need maintain_order=True for each .group_by to guarantee the order)


rcliu623 commented Dec 7, 2023

Yes indeed, thanks for the reference. Closing this as an exact duplicate.

@rcliu623 rcliu623 closed this as completed Dec 7, 2023
@alexander-beedie alexander-beedie added the invalid A bug report that is not actually a bug label Dec 8, 2023