Inconsistent dtypes after .implode in group_by depending on the emptyness of the frame #19164

Closed
2 tasks done
NicolasMuellerQC opened this issue Oct 9, 2024 · 2 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@NicolasMuellerQC

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame(data=dict(x=["a", "b"], g=[1, 1]))

df.group_by("g").agg(pl.col("x").implode().list.first())
# Yields
# shape: (1, 2)
#┌─────┬───────────┐
#│ g   ┆ x         │
#│ --- ┆ ---       │
#│ i64 ┆ list[str] │
#╞═════╪═══════════╡
#│ 1   ┆ ["a"]     │
#└─────┴───────────┘
# as expected


df.head(0).group_by("g").agg(pl.col("x").implode().list.first())
# Yields
# shape: (0, 2)
#┌─────┬─────┐
#│ g   ┆ x   │
#│ --- ┆ --- │
#│ i64 ┆ str │
#╞═════╪═════╡
#└─────┴─────┘
# Note the different dtype!



## Example where this is relevant:
df.group_by("g").agg(pl.col("x").filter(pl.col("x") == pl.col("x").filter(pl.col("x").is_last_distinct()).implode().list.get(-1)))
# Yields
# shape: (1, 2)
#┌─────┬───────────┐
#│ g   ┆ x         │
#│ --- ┆ ---       │
#│ i64 ┆ list[str] │
#╞═════╪═══════════╡
#│ 1   ┆ ["b"]     │
#└─────┴───────────┘
# while
df.head(0).group_by("g").agg(pl.col("x").filter(pl.col("x") == pl.col("x").filter(pl.col("x").is_last_distinct()).implode().list.get(-1)))
# raises
# pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("invalid series dtype: expected `List`, got `str`"))

Log output

No response

Issue description

implode().list.first() and other methods of the .list namespace return inconsistent dtypes in a group_by context depending on whether the frame is empty: list[str] for a non-empty frame, but str for an empty one.

Expected behavior

I expect the dtypes in the empty case to be the same as in the non-empty case. In particular,

df.head(0).group_by("g").agg(pl.col("x").filter(pl.col("x") == pl.col("x").filter(pl.col("x").is_last_distinct()).implode().list.get(-1)))

should not raise, but should return a frame with 0 rows and schema g: i64, x: list[str].

Installed versions

--------Version info---------
Polars:              1.9.0
Index type:          UInt32
Platform:            macOS-14.6.1-arm64-arm-64bit
Python:              3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 15:57:01) [Clang 17.0.6 ]
----Optional dependencies----
adbc_driver_manager  <not installed>
altair               4.2.2
cloudpickle          3.0.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.9.0
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.5
pandas               2.2.3
pyarrow              17.0.0
pydantic             2.9.2
pyiceberg            <not installed>
sqlalchemy           2.0.35
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           3.2.0
@NicolasMuellerQC NicolasMuellerQC added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Oct 9, 2024
@NicolasMuellerQC NicolasMuellerQC changed the title Inconsistent typing for nested list in group_by. Inconsistent typing for nested list in group_by depending on the emptyness of the frame. Oct 9, 2024
@NicolasMuellerQC NicolasMuellerQC changed the title Inconsistent typing for nested list in group_by depending on the emptyness of the frame. Inconsistent typing for nested list in group_by depending on the emptyness of the frame Oct 9, 2024
@NicolasMuellerQC NicolasMuellerQC changed the title Inconsistent typing for nested list in group_by depending on the emptyness of the frame Inconsistent dtypes after .implode in group_by depending on the emptyness of the frame Oct 9, 2024
@cmdlineluser
Contributor

This is also fixed by:

(df.head(0)
   .group_by("g")
   .agg(
       pl.col("x").filter(pl.col("x") == pl.col("x").filter(pl.col("x").is_last_distinct()).implode().list.get(-1))
    )
)

# shape: (0, 2)
# ┌─────┬───────────┐
# │ g   ┆ x         │
# │ --- ┆ ---       │
# │ i64 ┆ list[str] │
# ╞═════╪═══════════╡
# └─────┴───────────┘

@ritchie46
Member

Nice!
