LazyFrame - Unnested columns are missing in Lazy Frame #16460
Comments
I'm not sure if it is the underlying cause, but whilst trying to reduce your example, I found that sink_parquet panics:
import polars as pl
lf = pl.LazyFrame({
    "index": [0, 1, 2],
    "payload": [1, 1, 1],
    "category": ["a", "b", "c"]
})
lf_temp = (
    lf.with_columns(
        pl.struct("index", "payload", "category").map_elements(lambda row:
            [
                {"a": row["index"], "b": row["payload"], "c": row["category"]},
                {"a": row["index"] + 1, "b": row["payload"] + 1, "c": row["category"]}
            ],
            return_dtype=pl.List(pl.Struct)
        )
        .alias("struct_column")
    )
)
lf_result = lf_temp.explode("struct_column").unnest("struct_column")
lf_result.sink_parquet("lazy_frame.parquet")
# pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value:
# ComputeError(ErrString("a StructArray must contain at least one field"))
Collecting without streaming works and gives the expected columns:
shape: (6, 6)
┌───────┬─────────┬──────────┬─────┬─────┬─────┐
│ index ┆ payload ┆ category ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ i64 ┆ i64 ┆ str │
╞═══════╪═════════╪══════════╪═════╪═════╪═════╡
│ 0 ┆ 1 ┆ a ┆ 0 ┆ 1 ┆ a │
│ 0 ┆ 1 ┆ a ┆ 1 ┆ 2 ┆ a │
│ 1 ┆ 1 ┆ b ┆ 1 ┆ 1 ┆ b │
│ 1 ┆ 1 ┆ b ┆ 2 ┆ 2 ┆ b │
│ 2 ┆ 1 ┆ c ┆ 2 ┆ 1 ┆ c │
│ 2 ┆ 1 ┆ c ┆ 3 ┆ 2 ┆ c │
└───────┴─────────┴──────────┴─────┴─────┴─────┘
Interestingly, from your above example this fails:
lf_result.sink_parquet("lazy_frame.parquet")
But this succeeds:
lf_result.collect().lazy().sink_parquet("lazy_frame.parquet")
Yeah, the collect also panics with streaming enabled:
lf_result.collect(streaming=True)
# PanicException: called `Result::unwrap()` on an `Err` value:
# ComputeError(ErrString("a StructArray must contain at least one field"))
I just tried this again on the latest main and it still fails. Specifying the full schema allows it to run as expected:
return_dtype=pl.List(pl.Struct({'a': pl.Int64, 'b': pl.Int64, 'c': pl.String}))
It seems allowing an empty
@cmdlineluser Thanks for your feedback. When I specify the full schema, streaming works for me, both for the example above and for my complete use case.
Sorry, but this cannot be the solution. What if you have an unknown number of fields that Polars infers in the unnest function?
@cmdlineluser Thank you for reopening 🙏🏻 or linking
Checks
Reproducible example
Log output
Issue description
We are processing large amounts of data, where we need to use the streaming feature because the data will not fit into RAM. Also we are running on a Kubernetes cluster where it is desirable to have a more constant RAM consumption.
We have a custom function that returns a List of Structs. This column needs to be exploded and unnested. Unfortunately, the unnested columns are then not part of the result: when we sink the results to an output parquet file, the columns are missing.
When we collect the LazyFrame into a DataFrame, the columns are present. Unfortunately, we cannot collect the results due to resource limitations.
This only happens when we use the "map_elements" function. When the List of Structs is already present in the source data, explode and unnest work as expected. For example, this works as expected:
Expected behavior
The schema of the resulting LazyFrame should be the same as that of the collected DataFrame.
Installed versions