-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Column gets randomized after groupby join in lazy mode #9786
Comments
This is expected AFAIK. To import polars as pl
df = pl.LazyFrame({"col":["Gr1/Gr1","Gr2/Gr2","Gr3/Gr3","Gr4/Gr4"]})
df_grp = df.groupby("col", maintain_order=True).all().with_row_count("colidx")
df = df.join(df_grp,on="col",how="left")
print(df_grp.collect())
print(df.collect()) The assignment of df = pl.LazyFrame({"col":["Gr1/Gr1","Gr2/Gr2","Gr3/Gr3","Gr4/Gr4"]})
df.groupby("col", maintain_order=False).all().collect() the values of |
The issue is not about shuffling of rows while collecting. The issue is a different colidx number mapping to a group altogether. Like in the grouped dataframe ( As mentioned in the title a particular column is getting shuffled. |
I don't think this is guaranteed. You have two separate query plans in In other words, it's not necessary that if the >>> import polars as pl
>>>
>>> df = pl.LazyFrame({"col":["Gr1/Gr1","Gr2/Gr2","Gr3/Gr3","Gr4/Gr4"]})
>>> df_grp = df.groupby("col").all().with_row_count("colidx")
>>> df = df.join(df_grp.inspect(),on="col",how="left")
>>> print(df_grp.collect())
shape: (4, 2)
┌────────┬─────────┐
│ colidx ┆ col │
│ --- ┆ --- │
│ u32 ┆ str │
╞════════╪═════════╡
│ 0 ┆ Gr4/Gr4 │
│ 1 ┆ Gr1/Gr1 │
│ 2 ┆ Gr3/Gr3 │
│ 3 ┆ Gr2/Gr2 │
└────────┴─────────┘
>>> df.collect()
shape: (4, 2)
┌────────┬─────────┐
│ colidx ┆ col │
│ --- ┆ --- │
│ u32 ┆ str │
╞════════╪═════════╡
│ 0 ┆ Gr1/Gr1 │
│ 1 ┆ Gr4/Gr4 │
│ 2 ┆ Gr2/Gr2 │
│ 3 ┆ Gr3/Gr3 │
└────────┴─────────┘ I've not copied the final output of |
I may use the grouped dataframe multiple times and it will result in randomisation during every join. This issue is not intuitive to the user. In my opinion it should be addressed since polars lazy is the intended mode of usage by the casual end users. It is like adopting Rust philosophy where restrictions are placed so that when developer does some wrong stuff they are stopped in the tracks. In lazy query plan, a dataframe should be created only once and should be referred to. Like in this case, when grouped dataframe is being referred to it should be frozen as I may use it multiple times. It doesn't make sense to create it multiple times as part of the lazy query plan as it will have performance penalty. Or atleast if it is not implementable or has significant performance impact due to some other thing which I'm not aware of, it has to be mandated that grouped dataframes should be frozen or collected before reusing it multiple times later on using some error or warning in the lsp. Atleast a note should be present in the documentation. For now I'm appending |
That's why you have the
If you're coming from a background such as SQL or Pandas, then this is probably unusual. Which is why Polars has an option to get the groups in the order of input, and an explanation for why it is not the default (it is faster).
One day, we might have it this way. The current optimization in for
You're effectively asking for caching the DataFrame once created, in memory without an assignment to a Python variable. For large DataFrame objects, this will obviously be suboptimal. It might eventually be possible one day with the
Perhaps a note in the user-guide might be helpful. |
Closing the issue. But ideally this thing should be added in notes of the documentation. |
Perhaps this should be left open so that it can at least be resolved by adding something to the documentation? Otherwise it may get lost/discarded. This behaviour was surprising to me and I wasn't aware it worked like this. |
@cmdlineluser reopened this issue. Will close it after the disclaimer gets added to the documentation. |
Sql also doesn't guarantee order. You need to explicitly sort or specify 'order by'. I don't think this needs a mention in the docs, other than that the order is not stable, which I believe we already mention in the maintain_order parameters docstring. If you need stable ordering you must specify that either via |
Sorry for any confusion @ritchie46 - groupby and It just wasn't entirely obvious to me at first that the groupby runs each time df_a = pl.LazyFrame({"col": ["A", "B", "C", "D"]})
df_b = pl.LazyFrame({"col": ["A", "B", "C", "D"]})
df_b = df_b.groupby("col").all().with_row_count()
df_a.join(df_b, on="col", how="left").collect()
# shape: (4, 2)
# ┌─────┬────────┐
# │ col ┆ row_nr │
# │ --- ┆ --- │
# │ str ┆ u32 │
# ╞═════╪════════╡
# │ A ┆ 0 │
# │ B ┆ 2 │
# │ C ┆ 3 │
# │ D ┆ 1 │
# └─────┴────────┘
df_a.join(df_b, on="col", how="left").collect()
# shape: (4, 2)
# ┌─────┬────────┐
# │ col ┆ row_nr │
# │ --- ┆ --- │
# │ str ┆ u32 │
# ╞═════╪════════╡
# │ A ┆ 1 │
# │ B ┆ 3 │
# │ C ┆ 0 │
# │ D ┆ 2 │
# └─────┴────────┘ |
I agree with @cmdlineluser . The way lazy query plan works should be present in the getting started of the lazy mode so that any new guy don't fall into this trap. |
I'd probably put it this way: it's a "gotcha" i.e. it works exactly as documented (i.e. order is not preserved unless explicitly requested), but behavior is a little counter-intuitive (the thought would be: "I defined this LazyFrame just once, why are the results different?"). That's why I suggested that it's better as a user-guide "welcome to Polars Lazy evaluation; here's a quick note" material, because the documentation is pretty clear as is and the existing behavior naturally follows from it. |
Until the quick note is added, I'm keeping this issue open. |
@avimallu is this the reason for #8369, which I opened a while ago? In a nutshell, it looks like if you have multiple dataframes that are created from a scanned (lazy) CSV, then calling |
@mcrumiller, that does seem to be the most likely explanation. I base my answer from the user-guide section for "Common subplan elimination":
This just says "plan", and not "all the query plans". In addition, I don't think Polars' |
@ritchie46 can you corroborate this? If so I can close the other issue and, I think, this one and be closed as well. |
A So if you run a Fee free to open a PR on the user guide and I am happen to review.
It is for some file types, but that will still have randomness in hash tables and work scheduling. |
@ritchie46, please review pola-rs/polars-book#370 and let me know if you need any changes so that we can close this issue. |
Documented in the user guide. |
Polars version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Issue description
Run the sample code. I feel the issue is with special character (forward slash in this case).
Reproducible example
Expected behavior
The
colidx
column in both the printed dataframes should correspond with the same group names in thecol
column.Installed versions
The text was updated successfully, but these errors were encountered: