Out of bounds error for certain string values while writing/reading IPC file #18636

Closed
2 tasks done
StijnKas opened this issue Sep 9, 2024 · 2 comments · Fixed by #18980
Labels: accepted (Ready for implementation), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

StijnKas commented Sep 9, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

If I create a dummy dataframe with

  1. At least one categorical value
  2. At least one Utf8/String value
  3. Some specific string values

I get an OutOfBoundsError (or, with memory_map=True, a ComputeError). The failing and succeeding variations I tried are below:

import os
os.environ['POLARS_VERBOSE']='1'
import polars as pl

filename='testfile.ipc'
print("Writing, disabling memory map:")
df = pl.DataFrame(
    {
        "Test": pl.Series(["Value"], dtype=pl.Categorical),
        "Test2": pl.Series(["Value Two 205"], dtype=pl.Utf8),
        "Test3": pl.Series(["Value3"], dtype=pl.Utf8),
    }
)
df.write_ipc(filename)
written = pl.scan_ipc(filename, memory_map=False).collect()
print("This fails")
os.remove(filename)
print("Writing with memory map")
df = pl.DataFrame(
    {
        "Test": pl.Series(["Value"], dtype=pl.Categorical),
        "Test2": pl.Series(["Value Two 205"], dtype=pl.Utf8),
        "Test3": pl.Series(["Value3"], dtype=pl.Utf8),
    }
)
df.write_ipc(filename)
written = pl.scan_ipc(filename, memory_map=True).collect()
print("This fails")
os.remove(filename)
print("Writing while commenting the categorical value in the dataframe")
df = pl.DataFrame(
    {
        # "Test": pl.Series(["Value"], dtype=pl.Categorical),
        "Test2": pl.Series(["Value Two 205"], dtype=pl.Utf8),
        "Test3": pl.Series(["Value3"], dtype=pl.Utf8),
    }
)
df.write_ipc(filename)
written = pl.scan_ipc(filename, memory_map=True).collect()
print("This works")
os.remove(filename)
print('Writing with a smaller number value')
df = pl.DataFrame(
    {
        "Test": pl.Series(["Value"], dtype=pl.Categorical),
        "Test2": pl.Series(["Value Two 20"], dtype=pl.Utf8),
        "Test3": pl.Series(["Value3"], dtype=pl.Utf8),
    }
)
df.write_ipc(filename)
written = pl.scan_ipc(filename, memory_map=True).collect()
print("This works")

Log output

Writing, disabling memory map:
executing ipc read sync with row_index = None, n_rows = None, predicate = false for paths ["testfile.ipc"]
---------------------------------------------------------------------------
OutOfBoundsError                          Traceback (most recent call last)
Cell In[2], line 15
      7 df = pl.DataFrame(
      8     {
      9         "Test": pl.Series(["Value"], dtype=pl.Categorical),
   (...)
     12     }
     13 )
     14 df.write_ipc(filename)
---> 15 written = pl.scan_ipc(filename, memory_map=False).collect()
     16 print("This fails")

File ~/Documents/Code/polars_bug/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2034, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2032 # Only for testing purposes
   2033 callback = _kwargs.get("post_opt_callback", callback)
-> 2034 return wrap_df(ldf.collect(callback))

OutOfBoundsError: view index out of bounds

Got: 0 buffers and index: 0



Writing with memory map
executing ipc read sync with row_index = None, n_rows = None, predicate = false for paths ["testfile.ipc"]
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[3], line 11
      3 df = pl.DataFrame(
      4     {
      5         "Test": pl.Series(["Value"], dtype=pl.Categorical),
   (...)
      8     }
      9 )
     10 df.write_ipc(filename)
---> 11 written = pl.scan_ipc(filename, memory_map=True).collect()
     12 print("This fails")

File ~/Documents/Code/polars_bug/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2034, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2032 # Only for testing purposes
   2033 callback = _kwargs.get("post_opt_callback", callback)
-> 2034 return wrap_df(ldf.collect(callback))

ComputeError: buffer's length is too small in mmap


Writing while commenting the categorical value in the dataframe
This works
executing ipc read sync with row_index = None, n_rows = None, predicate = false for paths ["testfile.ipc"]


Writing with a smaller number value
This works
executing ipc read sync with row_index = None, n_rows = None, predicate = false for paths ["testfile.ipc"]

Issue description

I haven't been able to narrow the issue down completely, but it appears to be related to specific string values interacting with categoricals. It fails in the same way when I change the Categorical column to an Enum.
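
One detail that may help narrow this down, offered as an observation rather than a confirmed diagnosis: the failing value "Value Two 205" is 13 bytes long, while the working "Value Two 20" is 12 bytes. In the Arrow string-view layout, which recent Polars versions use for String columns, values of up to 12 bytes are stored inline in the view itself and longer values are stored in a separate data buffer. Whether that threshold is what triggers the error here is an assumption; nothing in this thread confirms it. The sketch below only checks byte lengths with the standard library; `INLINE_LIMIT` and `stored_inline` are hypothetical names for illustration, not Polars APIs.

```python
# Assumption: 12 bytes is the inline capacity of an Arrow string view.
# This is taken from the Arrow columnar format, not from this thread.
INLINE_LIMIT = 12


def stored_inline(value: str) -> bool:
    """Return True if the UTF-8 encoding of `value` fits within INLINE_LIMIT bytes."""
    return len(value.encode("utf-8")) <= INLINE_LIMIT


# The string values used in the reproducible example above.
for value in ["Value", "Value Two 205", "Value Two 20", "Value3"]:
    size = len(value.encode("utf-8"))
    print(f"{value!r}: {size} bytes, inline={stored_inline(value)}")
```

Of the four values, only "Value Two 205" exceeds the limit, which matches the one variation that fails when the Categorical column is present.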

This seems to be a regression: I tested the same code on version 0.20 and it works fine there. The failures above were produced in a clean virtual environment on the latest Polars version.

Expected behavior

This should not result in an out of bounds error.

Installed versions

--------Version info---------
Polars:              1.6.0
Index type:          UInt32
Platform:            macOS-14.6.1-arm64-arm-64bit
Python:              3.12.4 (main, Jun  6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         1.6.0
numpy                <not installed>
openpyxl             <not installed>
pandas               <not installed>
pyarrow              <not installed>
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
StijnKas added the bug, needs triage, and python labels Sep 9, 2024
coastalwhite (Collaborator) commented

A bisect shows that this was caused by #17084.

@ritchie46 could you have a look at this?

ritchie46 (Member) commented

Yes
