Alternative method 10x faster than dt.offset_by() #16722

Chuck321123 · 2024-06-04T13:59:57Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import pandas as pd
import polars as pl

num_rows = 1_000_000
# One timestamp per millisecond, starting at 2023-01-01
utc_time = pd.date_range(start='2023-01-01', periods=num_rows, freq='ms')

# Create the DataFrame in pandas, then convert it to Polars
df = pd.DataFrame({
    'UTC_Time': utc_time
})

df['UTC_Time'] = df['UTC_Time'].sort_values()

df = pl.DataFrame(df)

# Method 1: the built-in offset
df = df.with_columns(pl.col("UTC_Time").dt.offset_by("5m").alias("Method1"))

# Method 2: add 5 minutes (in nanoseconds) to the epoch value, then convert back
df = df.with_columns(pl.from_epoch((pl.col("UTC_Time").dt.epoch(time_unit="ns")
                                    + 5 * 60 * 1_000_000_000), time_unit="ns").alias("Method2"))

# Timings (%timeit is an IPython/Jupyter magic)
%timeit df.with_columns(pl.col("UTC_Time").dt.offset_by("5m"))
%timeit df.with_columns(pl.from_epoch((pl.col("UTC_Time").dt.epoch(time_unit="ns") + 5 * 60 * 1_000_000_000), time_unit="ns"))

Log output

5.31 ms ± 337 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
534 µs ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
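
As a quick sanity check on the "~10x" figure, the ratio of the two mean timings above can be computed directly. This only uses the means from the log output and ignores the reported standard deviations; the variable names are just for illustration.

```python
# Mean timings copied from the log output above, converted to seconds
offset_by_mean_s = 5.31e-3   # dt.offset_by("5m"): 5.31 ms
epoch_add_mean_s = 534e-6    # epoch + integer addition: 534 µs

# Ratio of the means (standard deviations are ignored here)
speedup = offset_by_mean_s / epoch_add_mean_s
print(f"{speedup:.1f}x")
```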

Issue description

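I was wondering why dt.offset_by() was so slow, so I experimented with an alternative: convert the column to epoch nanoseconds, add the offset as an integer, and convert back with pl.from_epoch(). On this example it turned out to be ~10x faster than dt.offset_by(). It would be nice if dt.offset_by() were just as fast.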
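
To spell out why the alternative in the reproducible example gives the same result for a fixed offset like "5m": a constant duration is just a fixed number of nanoseconds, so shifting a timestamp reduces to one integer addition on its epoch value. Below is a minimal standard-library sketch of that idea. The helper names (to_epoch_ns, from_epoch_ns, offset_five_minutes) are made up for illustration, and this says nothing about how Polars implements dt.offset_by() internally.

```python
from datetime import datetime, timedelta, timezone

NS_PER_SECOND = 1_000_000_000
FIVE_MINUTES_NS = 5 * 60 * NS_PER_SECOND  # same constant as in the example above

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ns(ts: datetime) -> int:
    """Convert a timezone-aware datetime to integer nanoseconds since the Unix epoch."""
    delta = ts - EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * NS_PER_SECOND + delta.microseconds * 1_000


def from_epoch_ns(ns: int) -> datetime:
    """Convert epoch nanoseconds back to a datetime (microsecond precision only)."""
    return EPOCH + timedelta(microseconds=ns // 1_000)


def offset_five_minutes(ts: datetime) -> datetime:
    """Shift a timestamp by five minutes using plain integer addition."""
    return from_epoch_ns(to_epoch_ns(ts) + FIVE_MINUTES_NS)


start = datetime(2023, 1, 1, tzinfo=timezone.utc)
shifted = offset_five_minutes(start)

# The integer route agrees with ordinary datetime arithmetic
print(shifted == start + timedelta(minutes=5))
```

Note that this shortcut only applies to fixed-length units. Calendar units such as months do not correspond to a constant number of nanoseconds, so they cannot be handled by a single addition.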

Expected behavior

dt.offset_by() should be at least as fast as the alternative method.

Installed versions

INSTALLED VERSIONS
------------------
commit                : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python                : 3.12.2.final.0
python-bits           : 64
OS                    : Windows
OS-release            : 11
Version               : 10.0.22631
machine               : AMD64
processor             : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder             : little
LC_ALL                : None
LANG                  : en
LOCALE                : Norwegian Bokmål_Norway.1252

pandas                : 2.2.1
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : 68.2.2
pip                   : 23.3.1
Cython                : 3.0.10
pytest                : None
hypothesis            : None
sphinx                : 7.2.6
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : 5.2.2
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.3
IPython               : 8.22.2
pandas_datareader     : 0.10.0
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : 3.8.3
numba                 : 0.59.1
numexpr               : None
odfpy                 : None
openpyxl              : 3.1.2
pandas_gbq            : None
pyarrow               : 15.0.2
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.12.0
sqlalchemy            : None
tables                : None
tabulate              : 0.9.0
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : 2.4.1
pyqt5                 : None

pl.show_versions()
--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Windows-11-10.0.22631-SP0
Python:               3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 17:28:07) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.3
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>


MarcoGorelli · 2024-06-04T14:13:01Z

thanks @Chuck321123 for the report

agree - I think something similar to #16615 and #16666 could be done

will try to work on a pr this week

Chuck321123 added the bug, needs triage, and python labels on Jun 4, 2024

MarcoGorelli added the performance label and removed the needs triage label on Jun 4, 2024

MarcoGorelli mentioned this issue Jun 4, 2024

perf: Speed up dt.offset_by 2x for constant durations #16728

Merged

ritchie46 closed this as completed in #16728 Jun 6, 2024
