
Column gets randomized after groupby join in lazy mode #9786

Closed
2 tasks done
arnabanimesh opened this issue Jul 8, 2023 · 19 comments
Labels
bug Something isn't working python Related to Python Polars

Comments

@arnabanimesh
Contributor

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Run the sample code. I suspect the issue is related to the special character (the forward slash, in this case).

Reproducible example

import polars as pl

df = pl.LazyFrame({"col": ["Gr1/Gr1", "Gr2/Gr2", "Gr3/Gr3", "Gr4/Gr4"]})
df_grp = df.groupby("col").all().with_row_count("colidx")
df = df.join(df_grp, on="col", how="left")
print(df_grp.collect())
print(df.collect())

Expected behavior

The colidx column in both the printed dataframes should correspond with the same group names in the col column.

Installed versions

--------Version info---------
Polars:              0.18.6
Index type:          UInt32
Platform:            Windows-10-10.0.22621-SP0
Python:              3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
matplotlib:          3.7.1
numpy:               1.24.3
pandas:              2.0.2
pyarrow:             <not installed>
pydantic:            <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            0.8.1
xlsxwriter:          3.1.2
@arnabanimesh arnabanimesh added bug Something isn't working python Related to Python Polars labels Jul 8, 2023
@avimallu
Contributor

avimallu commented Jul 8, 2023

This is expected, AFAIK. To maintain order, you need to set the maintain_order parameter to True.

import polars as pl

df = pl.LazyFrame({"col": ["Gr1/Gr1", "Gr2/Gr2", "Gr3/Gr3", "Gr4/Gr4"]})
df_grp = df.groupby("col", maintain_order=True).all().with_row_count("colidx")
df = df.join(df_grp, on="col", how="left")
print(df_grp.collect())
print(df.collect())

The assignment of colidx does not correspond with the names because at the end of:

df = pl.LazyFrame({"col":["Gr1/Gr1","Gr2/Gr2","Gr3/Gr3","Gr4/Gr4"]})
df.groupby("col", maintain_order=False).all().collect()

the values of col itself get shuffled (or rather, are not guaranteed to be in the same order).

@arnabanimesh
Contributor Author

arnabanimesh commented Jul 8, 2023

The issue is not about shuffling of rows while collecting. The issue is a different colidx number mapping to a group altogether.

For example, if colidx 1 maps to Gr1/Gr1 in the grouped dataframe (df_grp), then after joining, colidx 1 should map to Gr1/Gr1 in the parent dataframe (df) as well, even though the row may appear at any position due to the lack of a maintain-order attribute. Here we are joining the entire grouped dataframe (df_grp), and the generated colidx column is part of that grouped dataframe.

As mentioned in the title, a particular column is getting shuffled.

@avimallu
Contributor

avimallu commented Jul 8, 2023

Like in the grouped dataframe (df_grp) if colidx 1 mapped to Gr1/Gr1. After joining, colidx 1 should be mapped to Gr1/Gr1 in the parent dataframe (df) too even though it might be present at any random row due to lack of maintain order attribute.

I don't think this is guaranteed. You have two separate query plans: one in df_grp and one in the second df statement (where you join to df_grp). In the first collect, i.e. print(df_grp.collect()), the df_grp generated does not have to match the df_grp generated in print(df.collect()).

In other words, it's not necessary that if the colidx value 1 is mapped to Gr1/Gr1, it will be mapped to the same group in the df_grp created with print(df.collect()). I hope the code below helps:

>>> import polars as pl
>>> 
>>> df = pl.LazyFrame({"col":["Gr1/Gr1","Gr2/Gr2","Gr3/Gr3","Gr4/Gr4"]})
>>> df_grp = df.groupby("col").all().with_row_count("colidx")
>>> df = df.join(df_grp.inspect(),on="col",how="left")
>>> print(df_grp.collect())
shape: (4, 2)
┌────────┬─────────┐
│ colidx ┆ col     │
│ ---    ┆ ---     │
│ u32    ┆ str     │
╞════════╪═════════╡
│ 0      ┆ Gr4/Gr4 │
│ 1      ┆ Gr1/Gr1 │
│ 2      ┆ Gr3/Gr3 │
│ 3      ┆ Gr2/Gr2 │
└────────┴─────────┘
>>> df.collect()
shape: (4, 2)
┌────────┬─────────┐
│ colidx ┆ col     │
│ ---    ┆ ---     │
│ u32    ┆ str     │
╞════════╪═════════╡
│ 0      ┆ Gr1/Gr1 │
│ 1      ┆ Gr4/Gr4 │
│ 2      ┆ Gr2/Gr2 │
│ 3      ┆ Gr3/Gr3 │
└────────┴─────────┘

I've not copied the final output of df.collect(), to illustrate my point: both printed tables are outputs of df_grp's collect() (the second one indirectly, via inspect()). Let me know if I've misunderstood your comment.

@arnabanimesh
Contributor Author

I may use the grouped dataframe multiple times, and it will result in randomisation during every join. This behaviour is not intuitive to the user. In my opinion it should be addressed, since lazy mode is the intended mode of usage for casual end users. It is like the Rust philosophy, where restrictions are placed so that developers doing something wrong are stopped in their tracks.

In a lazy query plan, a dataframe should be created only once and then referred to. In this case, when the grouped dataframe is referred to, it should be frozen, as I may use it multiple times. It doesn't make sense to create it multiple times as part of the lazy query plan, as that carries a performance penalty.

Or, at least, if this is not implementable or has a significant performance impact for reasons I'm not aware of, it should be mandated that grouped dataframes be frozen or collected before being reused, via an error or a warning in the LSP. At the very least, a note should be present in the documentation.

For now I'm appending .collect().lazy() after creating df_grp to stop the query plan from creating df_grp twice.

@avimallu
Contributor

avimallu commented Jul 9, 2023

it will result in randomisation during every join.

That's why you have the maintain_order argument if it's important to you :)

This issue is not intuitive to the user.

If you're coming from a background such as SQL or Pandas, then this is probably unusual. Which is why Polars has an option to get the groups in the order of input, and an explanation for why it is not the default (it is faster).

In lazy query plan, a dataframe should be created only once and should be referred to.

One day, we might have it this way. The current optimization for collect_all does have common_subplan_elimination, but unfortunately it isn't yet shared across query plans.

Or atleast if it is not implementable or has significant performance impact due to some other thing which I'm not aware of

You're effectively asking for caching the DataFrame, once created, in memory without an assignment to a Python variable. For large DataFrame objects this will obviously be suboptimal. It might eventually be possible one day with the collect_all option, but even then it would mean that you tell Polars where to cache the DataFrame (albeit indirectly).

Atleast a note should be present in the documentation.

Perhaps a note in the user-guide might be helpful.

@arnabanimesh
Contributor Author

Closing the issue. But ideally this thing should be added in notes of the documentation.

@cmdlineluser
Contributor

Perhaps this should be left open so that it can at least be resolved by adding something to the documentation?

Otherwise it may get lost/discarded.

This behaviour was surprising to me and I wasn't aware it worked like this.

@arnabanimesh arnabanimesh reopened this Jul 9, 2023
@arnabanimesh
Contributor Author

@cmdlineluser reopened this issue. Will close it after the disclaimer gets added to the documentation.

@ritchie46
Member

If you're coming from a background such as SQL or Pandas

SQL also doesn't guarantee order; you need to explicitly sort or specify ORDER BY. I don't think this needs a mention in the docs, other than that the order is not stable, which I believe we already mention in the maintain_order parameter's docstring.

If you need stable ordering you must specify that either via maintain_order or an explicit sort.

@cmdlineluser
Contributor

Sorry for any confusion @ritchie46 - groupby and maintain_order is already clear and documented.

It just wasn't entirely obvious to me at first that the groupby runs each time .collect() is called.

df_a = pl.LazyFrame({"col": ["A", "B", "C", "D"]})
df_b = pl.LazyFrame({"col": ["A", "B", "C", "D"]})

df_b = df_b.groupby("col").all().with_row_count()

df_a.join(df_b, on="col", how="left").collect()

# shape: (4, 2)
# ┌─────┬────────┐
# │ col ┆ row_nr │
# │ --- ┆ ---    │
# │ str ┆ u32    │
# ╞═════╪════════╡
# │ A   ┆ 0      │
# │ B   ┆ 2      │
# │ C   ┆ 3      │
# │ D   ┆ 1      │
# └─────┴────────┘

df_a.join(df_b, on="col", how="left").collect()

# shape: (4, 2)
# ┌─────┬────────┐
# │ col ┆ row_nr │
# │ --- ┆ ---    │
# │ str ┆ u32    │
# ╞═════╪════════╡
# │ A   ┆ 1      │
# │ B   ┆ 3      │
# │ C   ┆ 0      │
# │ D   ┆ 2      │
# └─────┴────────┘

@arnabanimesh
Contributor Author

I agree with @cmdlineluser. The way the lazy query plan works should be explained in the getting-started section for lazy mode, so that new users don't fall into this trap.

@avimallu
Contributor

avimallu commented Jul 9, 2023

It just wasn't entirely obvious to me at first that the groupby runs each time .collect() is called.

I'd probably put it this way: it's a "gotcha", i.e. it works exactly as documented (order is not preserved unless explicitly requested), but the behavior is a little counter-intuitive (the thought would be: "I defined this LazyFrame just once, why are the results different?").

That's why I suggested that it's better as a user-guide "welcome to Polars Lazy evaluation; here's a quick note" material, because the documentation is pretty clear as is and the existing behavior naturally follows from it.

@arnabanimesh
Contributor Author

Until the quick note is added, I'm keeping this issue open.

@mcrumiller
Contributor

mcrumiller commented Jul 9, 2023

One day, we might have it this way. The current optimization in for collect_all does have a common_subplan_elimination, but unfortunately, it isn't yet shared across query plans.

@avimallu is this the reason for #8369, which I opened a while ago? In a nutshell, it looks like if you have multiple dataframes that are created from a scanned (lazy) CSV, then calling collect_all() re-scans the CSV for each dataframe, resulting in much lower lazy performance than eager.

@avimallu
Contributor

avimallu commented Jul 9, 2023

@mcrumiller, that does seem to be the most likely explanation. I base my answer on the user-guide section for "Common subplan elimination":

Cache subtrees/file scans that are used by multiple subtrees in the query plan.

This just says "plan", not "all the query plans". In addition, I don't think Polars' read_csv uses a memory-map-based process (correct me if I'm wrong here), so the system doesn't cache the first read to make subsequent ones faster. It does do that for read_parquet, and since the memory_map option isn't present for scan_parquet, it seems reasonable to assume that lazy reading of a CSV likely doesn't have it either.

@mcrumiller
Contributor

@ritchie46 can you corroborate this? If so, I can close the other issue, and I think this one can be closed as well.

@ritchie46
Member

A LazyFrame is a promise on a computation. It doesn't do any implicit caching between runs.

So if you run a LazyFrame (e.g. a computation DAG) twice, you will run that computation twice. Operations that don't guarantee ordering, (for instance because they use hash-maps or work-scheduling of threads) will have results in a different order.

Feel free to open a PR on the user guide and I am happy to review.

is using a memory map based process (correct me if I'm wrong here)

It is for some file types, but that will still have randomness in hash tables and work scheduling.

@avimallu
Contributor

@ritchie46, please review pola-rs/polars-book#370 and let me know if you need any changes so that we can close this issue.

@ritchie46
Member

Documented in the user guide.
