scan_parquet() + collect() errors if the parquet file was created with sink_parquet() #365

Closed
etiennebacher opened this issue Aug 16, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@etiennebacher
Collaborator

I think there is a bug in sink_parquet(): running pl$scan_parquet()$collect() errors if the file was created with sink_parquet(), but not if it was created with arrow::write_parquet().

library(polars)

iris_lazy <- pl$LazyFrame(iris)
dest <- tempfile(fileext = ".parquet")
dest2 <- tempfile(fileext = ".parquet")

# this works
arrow::write_parquet(iris, dest)
pl$scan_parquet(dest)$collect()
#> shape: (150, 5)
#> ┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
#> │ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
#> │ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
#> │ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat       │
#> ╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
#> │ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
#> │ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
#> │ …            ┆ …           ┆ …            ┆ …           ┆ …         │
#> │ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
#> │ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
#> │ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
#> │ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
#> └──────────────┴─────────────┴──────────────┴─────────────┴───────────┘

# but not this
iris_lazy$sink_parquet(dest2)
pl$scan_parquet(dest2)$collect()
#> polars: closing concurrent R handler
#> Error: Execution halted with the following contexts
#>    0: In R: in $collect():
#>    1: A polars sub-thread panicked. See panic msg, which is likely more informative than this error: Any { .. }
@Sicheng-Pan
Collaborator

I got the following error messages:

> pl$scan_parquet(dest2)$collect()
thread '<unnamed>' panicked at 'should not fail: ComputeError(ErrString("cannot concat categoricals coming from a different source; consider setting a global StringCache"))', <redacted>/polars/polars-core/src/frame/mod.rs:923:36
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
polars: closing concurrent R handler
Error: Execution halted with the following contexts
   0: In R: in $collect():
   0: During function call [pl$scan_parquet(dest2)$collect()]
   1: A polars sub-thread panicked. See panic msg, which is likely more informative than this error: Any { .. }

Maybe it's related to the string cache? I'm not sure.
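
To illustrate what the panic message above is complaining about, here is a toy model in plain Python. It is not how polars is implemented: the `ToyCategorical` class and the `concat` helper are hypothetical names made up for this sketch. It only assumes that a categorical column stores integer codes plus a string table, so two columns built independently can assign the same code to different strings unless they share one table (the role a global string cache plays).

```python
# Toy model only: hypothetical names, NOT the polars implementation.
class ToyCategorical:
    """A column stored as integer codes plus a string-to-code table."""

    def __init__(self, values, cache=None):
        # Without a shared cache, each column builds its own private table.
        self.table = cache if cache is not None else {}
        # Each new string gets the next free code in the table.
        self.codes = [self.table.setdefault(v, len(self.table)) for v in values]

    def decode(self):
        """Map the integer codes back to their strings."""
        reverse = {code: s for s, code in self.table.items()}
        return [reverse[c] for c in self.codes]


def concat(left, right):
    """Concatenate codes; only meaningful when both columns share one table."""
    if left.table is not right.table:
        raise ValueError("categoricals come from different sources")
    out = ToyCategorical([], cache=left.table)
    out.codes = left.codes + right.codes
    return out


# Built independently: identical codes, different meanings.
a = ToyCategorical(["a", "b"])  # codes [0, 1]
b = ToyCategorical(["b", "c"])  # codes [0, 1], but 0 now means "b"

# Built against one shared table: codes stay consistent across columns.
shared = {}
c = ToyCategorical(["a", "b"], cache=shared)
d = ToyCategorical(["b", "c"], cache=shared)
merged = concat(c, d).decode()  # ['a', 'b', 'b', 'c']
```

In this sketch, `concat(a, b)` raises because the two tables differ, while `concat(c, d)` succeeds. Whether this is actually what happens when the parquet file is read back here is a guess; the thread does not confirm the root cause.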

@Sicheng-Pan
Collaborator

Also, I personally think this is unlikely to be a bug in r-polars' sink_parquet, since it simply assembles the relevant parameters and calls the sink_parquet method of polars::prelude::LazyFrame.

@etiennebacher
Collaborator Author

I can't reproduce it with py-polars:

import polars as pl

df = pl.LazyFrame(
    {"values": [1, 2, 3], "values2": ["a", "b", "c"]},
    schema={"values": pl.Float64, "values2": pl.Categorical},
)

df.sink_parquet("foo.parquet")
pl.scan_parquet("foo.parquet").collect()

Maybe it was fixed in a recent version.

@etiennebacher
Collaborator Author

It fails when the categorical column has more than 3 values. Simpler reprex:

library(polars)

lf1 <- pl$LazyFrame(values = factor(letters[1:3]))
lf2 <- pl$LazyFrame(values = factor(letters[1:4]))
dest <- tempfile(fileext = ".parquet")

lf1$sink_parquet(dest)
pl$scan_parquet(dest)$collect()
#> shape: (3, 1)
#> ┌────────┐
#> │ values │
#> │ ---    │
#> │ cat    │
#> ╞════════╡
#> │ a      │
#> │ b      │
#> │ c      │
#> └────────┘


lf2$sink_parquet(dest)
pl$scan_parquet(dest)$collect()
#> polars: closing concurrent R handler
#> Error: Execution halted with the following contexts
#>    0: In R: in $collect():
#>    1: A polars sub-thread panicked. See panic msg, which is likely more informative than this error: Any { .. }

@etiennebacher added the bug label Aug 22, 2023
@sorhawell
Collaborator

I'm working on bumping rust-polars to 0.32.0 and am quite far along.

@etiennebacher
Collaborator Author

Looks like #334 solved this:

library(polars)

lf1 <- pl$LazyFrame(values = factor(letters[1:3]))
lf2 <- pl$LazyFrame(values = factor(letters[1:4]))
dest <- tempfile(fileext = ".parquet")

lf1$sink_parquet(dest)
pl$scan_parquet(dest)$collect()
#> shape: (3, 1)
#> ┌────────┐
#> │ values │
#> │ ---    │
#> │ cat    │
#> ╞════════╡
#> │ a      │
#> │ b      │
#> │ c      │
#> └────────┘

lf2$sink_parquet(dest)
pl$scan_parquet(dest)$collect()
#> shape: (4, 1)
#> ┌────────┐
#> │ values │
#> │ ---    │
#> │ cat    │
#> ╞════════╡
#> │ a      │
#> │ b      │
#> │ c      │
#> │ d      │
#> └────────┘


3 participants