Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add aliases for Bytes.indexOf and Text.indexOf #165

Closed
pchiusano opened this issue Jun 8, 2023 · 12 comments
Closed

Add aliases for Bytes.indexOf and Text.indexOf #165

pchiusano opened this issue Jun 8, 2023 · 12 comments
Assignees

Comments

@pchiusano
Copy link
Member

pchiusano commented Jun 8, 2023

There's now much more efficient implementations of these as builtins (see this PR)

Bytes.indexOf : Bytes -> Bytes -> Optional Nat
Text.indexOf : Text -> Text -> Optional Nat

I notice Text.indexOf exists but has a slow implementation. So maybe as simple as...

replace Text.indexOf ##Text.indexOf
alias.term ##Bytes.indexOf Bytes.indexOf

Plus docs / tests. @runarorama if you want to assign this to @stew go ahead.

@pchiusano pchiusano changed the title Add aliases for Bytes.indexOf and Text.indexOf Add aliases for `Bytes.indexOf` and `Text.indexOf` Jun 8, 2023
@pchiusano pchiusano changed the title Add aliases for `Bytes.indexOf` and `Text.indexOf` Add aliases for Bytes.indexOf and Text.indexOf Jun 8, 2023
@runarorama
Copy link
Contributor

runarorama commented Jun 9, 2023

Text.indexOf looks like it has a bug for multi-byte characters:

    1 | > ##Text.indexOf "foo👊🏿" "bar👊🏿foo👊🏿baz👊🏿"
          ⧩
          Some 7

    3 | > Text.size "foo👊🏿"
          ⧩
          5

The 👊🏿 character is two codepoints, so the answer should be Some 5

@pchiusano
Copy link
Member Author

Is this a bug in the Text package?? @stew is just calling through to that... this function: https://hackage.haskell.org/package/text-2.0.2/docs/Data-Text-Internal-Lazy-Search.html

Or maybe we're just using it wrong?

@runarorama
Copy link
Contributor

If it's working as intended, I don't understand what it's doing. It's interpreting "👊🏿" to have length 4, but it's two codepoints and 8 bytes. But 4 what?

@pchiusano
Copy link
Member Author

What if you just call it in ghci? Does it still behave incorrectly?

@runarorama
Copy link
Contributor

Yeah

λ> indices (Text.pack "foo") (Text.pack "👊🏿foo")
[4]

Is this just due to the fact that Haskell strings are UTF-16?

@pchiusano
Copy link
Member Author

pchiusano commented Jun 9, 2023

https://hackage.haskell.org/package/text-2.0.2/docs/src/Data.Text.Lazy.html#breakOn - it looks like the indices are byte offsets.

@stew maybe implement in terms of Text.breakOn, it's the size of the first element of the pair.

@runarorama
Copy link
Contributor

runarorama commented Jun 9, 2023

Text.breakOn, Text.breakOnEnd, and Text.breakOnAll would be great builtins to have. We could implement a fast Text.indexOf etc. in terms of those.

@pchiusano
Copy link
Member Author

I like breakOnAll and indexOfEnd as new builtins. That could be for later though.

indexOf seems better than breakOn - you can implement breakOn using indexOf, and indexOf can have a direct implementation if we want to make it more efficient. In the cases where you're just Text.drop-ing up to that index, you avoid needlessly allocating a prefix you're just discarding.

@runarorama
Copy link
Contributor

Yeah, if we can get indexOf to work correctly for text, that's ideal.

@pchiusano
Copy link
Member Author

@runarorama fyi, don't know if you saw, but the bug has been fixed, so you can add / replace the existing functions.

@runarorama
Copy link
Contributor

runarorama commented Jun 15, 2023

Now there's a different bug:

> ##Text.indexOf "" "foo"

⬇️

Encountered exception:
Data.Text.Lazy.breakOn: empty input
CallStack ( from HasCallStack ):
    error

I can do a pure Unison check for the empty search string, but it really feels like the builtin should be doing this. The correct index of the empty string is 0.

@runarorama
Copy link
Contributor

Fixed in unisonweb/unison#4101

Replaced the Unison definitions with the builtins and pushed to main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants