Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python,rust): add date pattern dd.mm.YYYY #16045

Conversation

Julian-J-S
Copy link
Contributor

close #14990

  • add dd.mm.YYYY date pattern which is widely used.
  • enables automatic date inference for to_date as well as in read_csv
  • also added tests for all date patterns
pl.DataFrame({"D.M.Y": ["01.02.2020", "03.04.2020", "31.12.2020"]}).with_columns(
    pl.all().str.to_date()
)

shape: (3, 1)
┌────────────┐
│ D.M.Y      │
│ ---        │
│ date       │
╞════════════╡
│ 2020-02-01 │
│ 2020-04-03 │
│ 2020-12-31 │
└────────────┘


pl.read_csv(
    source="dates\n01.02.2020\n03.04.2020\n31.12.2020".encode(),
    try_parse_dates=True,
)

shape: (3, 1)
┌────────────┐
│ dates      │
│ ---        │
│ date       │
╞════════════╡
│ 2020-02-01 │
│ 2020-04-03 │
│ 2020-12-31 │
└────────────┘

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels May 4, 2024
Copy link

codspeed-hq bot commented May 4, 2024

CodSpeed Performance Report

Merging #16045 will improve performances by 27.87%

Comparing JulianCologne:feat-add-D-M-Y-dot-sepatared-date-pattern-inference (09a776b) with main (6730a72)

Summary

⚡ 1 improvements
✅ 34 untouched benchmarks

Benchmarks breakdown

Benchmark main JulianCologne:feat-add-D-M-Y-dot-sepatared-date-pattern-inference Change
test_filter2 2.8 ms 2.2 ms +27.87%

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented May 4, 2024

I think the patterns are assessed in priority order, and I doubt this is the most common form (I'd expect that to be the one separated by "-") 🤔 Can you check? (I'm on mobile at the moment, so can't confirm).

Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your PR

if we go ahead with this, it should also be added to DATETIME_D_M_Y too (DATE_D_M_Y should be a subset of it)

@Julian-J-S
Copy link
Contributor Author

I think the patterns are assessed in priority order, and I doubt this is the most common form (I'd expect that to be the one separated by "-") 🤔 Can you check? (I'm on mobile at the moment, so can't confirm).

@alexander-beedie
Afaik the order does NOT matter as is checks for all formats all the time.
Was thinking about the order anyway.
However, I did check the occurrences (#15949 (comment)) and the order by countries actually is:

  • dd/mm/yyyy: 36x
  • dd.mm.yyyy: 26x
  • dd-mm-yyyy: 7x (least used)

if we go ahead with this, it should also be added to DATETIME_D_M_Y too (DATE_D_M_Y should be a subset of it)

@MarcoGorelli
Was thinking about this, too.
However, it looks like while dd.mm.YYYY is very common as date in many countries there does not seem to be an equivalent for the datetime. This is also my experience, as I have never seen a datetime using this date format. 🤔

@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented May 4, 2024

Afaik the order does NOT matter as is checks for all formats all the time.

IIRC it goes in order until it finds a match, then it keeps using the last successful one (until it may need to look at the next one in the group). maybe let's put them in order of popularity then?

However, it looks like while dd.mm.YYYY is very common as date in many countries there does not seem to be an equivalent for the datetime. This is also my experience, as I have never seen a datetime using this date format.

ok sure

@Julian-J-S
Copy link
Contributor Author

maybe let's put them in order of popularity then?

sure! But how do we measure this?
I would suggest

  • dd-mm-yyyy: ISO 8601 international standard; should be No 1
  • dd/mm/yyyy: more countries than below according to wiki
  • dd.mm.yyyy:

No sure what the performance impact really is but if we do this "micro optimizations" should we also adjust the DATE_Y_M_D, DATETIME_D_M_Y and DATETIME_Y_M_D patterns to put the ISO 8601 standard as the first format as these are used by ALL widespread data engineering/analytics/etl/... tools??

Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @JulianCologne - probably not worth micro-optimising the order, and as you noted, %d.%m.%Y is very common

@MarcoGorelli MarcoGorelli merged commit 9bda525 into pola-rs:main May 5, 2024
25 of 26 checks passed
@Julian-J-S Julian-J-S deleted the feat-add-D-M-Y-dot-sepatared-date-pattern-inference branch May 12, 2024 05:19
Wouittone pushed a commit to Wouittone/polars that referenced this pull request Jun 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

str.to_datetime - support date format "%d.%m.%Y" , // 31.12.2021
3 participants