
Column gets randomized after groupby join in lazy mode #9786

Closed
2 tasks done
arnabanimesh opened this issue Jul 8, 2023 · 19 comments
Labels
bug Something isn't working python Related to Python Polars

Comments

@arnabanimesh
Contributor

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Run the sample code. I suspect the issue is related to the special character (the forward slash, in this case).

Reproducible example

import polars as pl

df = pl.LazyFrame({"col": ["Gr1/Gr1", "Gr2/Gr2", "Gr3/Gr3", "Gr4/Gr4"]})
df_grp = df.groupby("col").all().with_row_count("colidx")
df = df.join(df_grp, on="col", how="left")
print(df_grp.collect())
print(df.collect())

Expected behavior

The colidx column in both the printed dataframes should correspond with the same group names in the col column.

Installed versions

--------Version info---------
Polars:              0.18.6
Index type:          UInt32
Platform:            Windows-10-10.0.22621-SP0
Python:              3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
matplotlib:          3.7.1
numpy:               1.24.3
pandas:              2.0.2
pyarrow:             <not installed>
pydantic:            <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            0.8.1
xlsxwriter:          3.1.2
@arnabanimesh arnabanimesh added bug Something isn't working python Related to Python Polars labels Jul 8, 2023
@avimallu
Contributor

avimallu commented Jul 8, 2023

This is expected, AFAIK. To maintain order, you need to set the maintain_order parameter to True.

import polars as pl

df = pl.LazyFrame({"col": ["Gr1/Gr1", "Gr2/Gr2", "Gr3/Gr3", "Gr4/Gr4"]})
df_grp = df.groupby("col", maintain_order=True).all().with_row_count("colidx")
df = df.join(df_grp, on="col", how="left")
print(df_grp.collect())
print(df.collect())

The assignment of colidx does not correspond with the names because at the end of:

df = pl.LazyFrame({"col":["Gr1/Gr1","Gr2/Gr2","Gr3/Gr3","Gr4/Gr4"]})
df.groupby("col", maintain_order=False).all().collect()

the values of col itself get shuffled (or rather, are not guaranteed to be in the same order).

@arnabanimesh
Contributor Author

arnabanimesh commented Jul 8, 2023

The issue is not about shuffling of rows while collecting. The issue is a different colidx number mapping to a group altogether.

For example, if colidx 1 maps to Gr1/Gr1 in the grouped dataframe (df_grp), then after joining, colidx 1 should map to Gr1/Gr1 in the parent dataframe (df) as well, even though the row may appear at any position due to the lack of a maintain-order attribute. Here we are joining the entire grouped dataframe (df_grp), and the generated colidx column is part of that grouped dataframe.

As mentioned in the title, a particular column is getting shuffled.

@avimallu
Contributor

avimallu commented Jul 8, 2023

Like in the grouped dataframe (df_grp) if colidx 1 mapped to Gr1/Gr1. After joining, colidx 1 should be mapped to Gr1/Gr1 in the parent dataframe (df) too even though it might be present at any random row due to lack of maintain order attribute.

I don't think this is guaranteed. You have two separate query plans: one in df_grp and one in the second df statement (where you join to df_grp). In the first collect, i.e. print(df_grp.collect()), the df_grp generated does not have to match the df_grp generated in print(df.collect()).

In other words, it's not necessary that if the colidx value 1 is mapped to Gr1/Gr1, it will be mapped to the same group in the df_grp created with print(df.collect()). I hope the code below helps:

>>> import polars as pl
>>> 
>>> df = pl.LazyFrame({"col":["Gr1/Gr1","Gr2/Gr2","Gr3/Gr3","Gr4/Gr4"]})
>>> df_grp = df.groupby("col").all().with_row_count("colidx")
>>> df = df.join(df_grp.inspect(),on="col",how="left")
>>> print(df_grp.collect())
shape: (4, 2)
┌────────┬─────────┐
│ colidx ┆ col     │
│ ---    ┆ ---     │
│ u32    ┆ str     │
╞════════╪═════════╡
│ 0      ┆ Gr4/Gr4 │
│ 1      ┆ Gr1/Gr1 │
│ 2      ┆ Gr3/Gr3 │
│ 3      ┆ Gr2/Gr2 │
└────────┴─────────┘
>>> df.collect()
shape: (4, 2)
┌────────┬─────────┐
│ colidx ┆ col     │
│ ---    ┆ ---     │
│ u32    ┆ str     │
╞════════╪═════════╡
│ 0      ┆ Gr1/Gr1 │
│ 1      ┆ Gr4/Gr4 │
│ 2      ┆ Gr2/Gr2 │
│ 3      ┆ Gr3/Gr3 │
└────────┴─────────┘

I've not copied the final output of df.collect(), to illustrate my point: both printed tables are outputs of df_grp's collect() (the second one indirectly, via inspect()). Let me know if I've misunderstood your comment.

@arnabanimesh
Contributor Author

I may use the grouped dataframe multiple times, and it will result in randomisation during every join. This behaviour is not intuitive to the user. In my opinion it should be addressed, since lazy mode is the intended mode of usage for casual end users. It is like the Rust philosophy, where restrictions are placed so that developers doing something wrong are stopped in their tracks.

In a lazy query plan, a dataframe should be created only once and then referred to. In this case, when the grouped dataframe is referred to, it should be frozen, as I may use it multiple times. It doesn't make sense to create it multiple times as part of the lazy query plan, as that carries a performance penalty.

Or, at least, if this is not implementable or has a significant performance impact for reasons I'm not aware of, it should be mandated that grouped dataframes be frozen or collected before being reused, via an error or a warning in the LSP. At the very least, a note should be present in the documentation.

For now I'm appending .collect().lazy() after creating df_grp to stop the query plan from creating df_grp twice.

@avimallu
Contributor

avimallu commented Jul 9, 2023

it will result in randomisation during every join.

That's why you have the maintain_order argument if it's important to you :)

This issue is not intuitive to the user.

If you're coming from a background such as SQL or Pandas, then this is probably unusual. Which is why Polars has an option to get the groups in the order of input, and an explanation for why it is not the default (it is faster).

In lazy query plan, a dataframe should be created only once and should be referred to.

One day, we might have it this way. The current optimization for collect_all does have common_subplan_elimination, but unfortunately it isn't yet shared across query plans.

Or atleast if it is not implementable or has significant performance impact due to some other thing which I'm not aware of

You're effectively asking for caching the DataFrame, once created, in memory without an assignment to a Python variable. For large DataFrame objects this will obviously be suboptimal. It might eventually be possible one day with the collect_all option, but even then it would mean that you tell Polars where to cache the DataFrame (albeit indirectly).

Atleast a note should be present in the documentation.

Perhaps a note in the user-guide might be helpful.

@arnabanimesh
Contributor Author

Closing the issue. But ideally this thing should be added in notes of the documentation.

@cmdlineluser
Contributor

Perhaps this should be left open so that it can at least be resolved by adding something to the documentation?

Otherwise it may get lost/discarded.

This behaviour was surprising to me and I wasn't aware it worked like this.

@arnabanimesh arnabanimesh reopened this Jul 9, 2023
@arnabanimesh
Contributor Author

@cmdlineluser reopened this issue. Will close it after the disclaimer gets added to the documentation.

@ritchie46
Member

If you're coming from a background such as SQL or Pandas

SQL also doesn't guarantee order; you need to explicitly sort or specify ORDER BY. I don't think this needs a mention in the docs, other than that the order is not stable, which I believe we already mention in the maintain_order parameter's docstring.

If you need stable ordering you must specify that either via maintain_order or an explicit sort.

@cmdlineluser
Contributor

Sorry for any confusion @ritchie46 - groupby and maintain_order is already clear and documented.

It just wasn't entirely obvious to me at first that the groupby runs each time .collect() is called.

df_a = pl.LazyFrame({"col": ["A", "B", "C", "D"]})
df_b = pl.LazyFrame({"col": ["A", "B", "C", "D"]})

df_b = df_b.groupby("col").all().with_row_count()

df_a.join(df_b, on="col", how="left").collect()

# shape: (4, 2)
# ┌─────┬────────┐
# │ col ┆ row_nr │
# │ --- ┆ ---    │
# │ str ┆ u32    │
# ╞═════╪════════╡
# │ A   ┆ 0      │
# │ B   ┆ 2      │
# │ C   ┆ 3      │
# │ D   ┆ 1      │
# └─────┴────────┘

df_a.join(df_b, on="col", how="left").collect()

# shape: (4, 2)
# ┌─────┬────────┐
# │ col ┆ row_nr │
# │ --- ┆ ---    │
# │ str ┆ u32    │
# ╞═════╪════════╡
# │ A   ┆ 1      │
# │ B   ┆ 3      │
# │ C   ┆ 0      │
# │ D   ┆ 2      │
# └─────┴────────┘

@arnabanimesh
Contributor Author

I agree with @cmdlineluser. The way the lazy query plan works should be explained in the getting-started section for lazy mode, so that new users don't fall into this trap.

@avimallu
Contributor

avimallu commented Jul 9, 2023

It just wasn't entirely obvious to me at first that the groupby runs each time .collect() is called.

I'd probably put it this way: it's a "gotcha", i.e. it works exactly as documented (order is not preserved unless explicitly requested), but the behavior is a little counter-intuitive (the thought would be: "I defined this LazyFrame just once, why are the results different?").

That's why I suggested that it's better as a user-guide "welcome to Polars Lazy evaluation; here's a quick note" material, because the documentation is pretty clear as is and the existing behavior naturally follows from it.

@arnabanimesh
Contributor Author

Until the quick note is added, I'm keeping this issue open.

@mcrumiller
Contributor

mcrumiller commented Jul 9, 2023

One day, we might have it this way. The current optimization in for collect_all does have a common_subplan_elimination, but unfortunately, it isn't yet shared across query plans.

@avimallu is this the reason for #8369, which I opened a while ago? In a nutshell, it looks like if you have multiple dataframes that are created from a scanned (lazy) CSV, then calling collect_all() re-scans the CSV for each dataframe, resulting in much lower lazy performance than eager.

@avimallu
Contributor

avimallu commented Jul 9, 2023

@mcrumiller, that does seem to be the most likely explanation. I base my answer on the user-guide section for "Common subplan elimination":

Cache subtrees/file scans that are used by multiple subtrees in the query plan.

This just says "plan", not "all the query plans". In addition, I don't think Polars' read_csv uses a memory-map-based process (correct me if I'm wrong here), so the system doesn't cache the first read to make subsequent ones faster. It does do that for read_parquet, and since the memory_map option isn't present for scan_parquet, it seems reasonable to assume that lazy reading of a CSV likely doesn't have it either.

@mcrumiller
Contributor

@ritchie46 can you corroborate this? If so, I can close the other issue, and I think this one can be closed as well.

@ritchie46
Member

A LazyFrame is a promise on a computation. It doesn't do any implicit caching between runs.

So if you run a LazyFrame (e.g. a computation DAG) twice, you will run that computation twice. Operations that don't guarantee ordering, (for instance because they use hash-maps or work-scheduling of threads) will have results in a different order.

Feel free to open a PR on the user guide and I am happy to review.

is using a memory map based process (correct me if I'm wrong here)

It is for some file types, but that will still have randomness in hash tables and work scheduling.

@avimallu
Contributor

@ritchie46, please review pola-rs/polars-book#370 and let me know if you need any changes so that we can close this issue.

@ritchie46
Member

Documented in the user guide.
