Refactor char/string and byte search #54667

jakobnissen · 2024-06-04T13:28:26Z

This is a refactoring of base/strings/search.jl. It is purely internal and comes with no changes in behaviour. It's based on #54593 and #54579, so those need to be merged first; this PR will then be rebased onto master.

Included changes are:

The char/string search functions now pass the last byte to memchr, not the first byte. Because the last bytes are more varied, this is much faster on small non-ASCII alphabets (like searching Greek or Cyrillic text) and somewhat faster on large non-ASCII ones (like Japanese). Speed on ASCII alphabets (like English) is unchanged.
Several unused or redundant methods have been removed.
Boundschecks have been moved from the inner _search and _rsearch functions to the outer top-level functions that call them. This is because the former may be called in a loop where repeated boundschecking is needless. This should speed up search a bit.
Much of this code used 0 as a sentinel value, possibly a leftover from before union-splitting was a thing. This has been replaced with nothing.
Char/string search functions are now implemented in terms of an internal lazy iterator. This allows findall and findnext to share an implementation, and will also make it trivially easy to implement a lazy findall in the future (see #43737, "Implement lazy findall (Iterators.findall, perhaps?)").
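
To illustrate the last two points, here is a minimal sketch of how a lazy iterator can back both a findnext-like and a findall-like function, signalling "no match" with nothing instead of 0. The names ByteMatches, sketch_findnext and sketch_findall are hypothetical and are not the ones used in this PR; the real code operates on strings and uses memchr rather than a plain loop.

```julia
# Hypothetical sketch; these names do not come from base/strings/search.jl.
struct ByteMatches
    needle::UInt8
    haystack::Vector{UInt8}
end

Base.IteratorSize(::Type{ByteMatches}) = Base.SizeUnknown()
Base.eltype(::Type{ByteMatches}) = Int

# Yield the index of the next match at or after `i`, or `nothing` when the
# haystack is exhausted. No 0 sentinel is needed.
function Base.iterate(it::ByteMatches, i::Int=1)
    while i <= length(it.haystack)
        it.haystack[i] == it.needle && return (i, i + 1)
        i += 1
    end
    return nothing
end

# Both eager functions are thin wrappers around the same iterator.
function sketch_findnext(needle::UInt8, haystack::Vector{UInt8}, start::Int)
    result = iterate(ByteMatches(needle, haystack), start)
    return result === nothing ? nothing : first(result)
end

sketch_findall(needle::UInt8, haystack::Vector{UInt8}) =
    collect(ByteMatches(needle, haystack))

haystack = collect(codeunits("banana"))
all_a = sketch_findall(UInt8('a'), haystack)
next_a = sketch_findnext(UInt8('a'), haystack, 3)
no_z = sketch_findnext(UInt8('z'), haystack, 1)
```

With this shape, a lazy findall would amount to returning the ByteMatches object itself instead of collecting it.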

IMO there is still more work to be done on this file, but this requires a decision to be made on #43737, #54581 or #54584

Benchmarks

using BenchmarkTools
using Random

rng = Xoshiro(55)

greek = join(rand(rng, 'Α':'ψ', 100000)) * 'ω'
@btime findfirst('ω', greek)

@btime findfirst(==('\xce'), greek)

english = join(rand(rng, 'A':'y', 100000)) * 'z'
@btime findfirst('z', english)

@btime findall('A', english)
@btime findall('\xff', english)
nothing

1.11.0-beta2:

  100.049 μs (1 allocation: 16 bytes)
  474.084 μs (0 allocations: 0 bytes)
  689.110 ns (1 allocation: 16 bytes)
  93.536 μs (9 allocations: 21.84 KiB)
  72.316 μs (1 allocation: 32 bytes)

This PR:

  1.319 μs (1 allocation: 16 bytes)
  398.011 μs (0 allocations: 0 bytes)
  681.550 ns (1 allocation: 16 bytes)
  8.867 μs (8 allocations: 21.81 KiB)
  683.962 ns (1 allocation: 32 bytes)

In text, the first UTF-8 byte of a character is typically more repetitive than the last byte. For example, most Greek characters start with 0xce or 0xcf. By searching for the more distinctive last byte, more time is spent in the memchr fast path. This gives a significant speedup.
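
The byte distribution described above can be checked directly. The snippet below is only an illustration, not code from this PR: it takes the Greek range 'Α':'ω' (the alphabet used in the benchmark plus the needle) and counts how many distinct first and last UTF-8 bytes occur.

```julia
# Illustration only: compare how varied the first and last UTF-8 bytes are
# across the Greek characters 'Α' (U+0391) through 'ω' (U+03C9).
alphabet = 'Α':'ω'

first_bytes = [first(codeunits(string(c))) for c in alphabet]
last_bytes = [last(codeunits(string(c))) for c in alphabet]

n_chars = length(alphabet)
n_first = length(unique(first_bytes))  # distinct leading bytes
n_last = length(unique(last_bytes))    # distinct trailing bytes
```

Here every character has a distinct last byte, while all of them share one of two first bytes, so a memchr call on the first byte stops at nearly every character and a call on the last byte only stops at real candidates.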

The search functions are a basic building block of the other functions, and may e.g. be called in a loop. It's wasteful to check bounds in these, as they are often called when we know for sure we are inbounds. Move the boundscheck closer to the top-level calls. This should slightly improve efficiency.
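
As a rough sketch of this pattern (with hypothetical names, not the actual _search and _rsearch methods), the inner function trusts its caller, the top-level function validates the start index once, and a loop can then call the inner function repeatedly without re-checking:

```julia
# Hypothetical sketch; not the functions in base/strings/search.jl.

# Inner building block. Assumes the caller guarantees 1 <= i <= length(v) + 1.
function unchecked_search(needle::UInt8, v::Vector{UInt8}, i::Int)
    while i <= length(v)
        x = @inbounds v[i]
        x == needle && return i
        i += 1
    end
    return nothing
end

# Top-level entry point: the bounds check happens once, here.
function checked_search(needle::UInt8, v::Vector{UInt8}, start::Int)
    1 <= start <= length(v) + 1 || throw(BoundsError(v, start))
    return unchecked_search(needle, v, start)
end

# A loop that already knows its indices are valid can skip the check.
function count_matches(needle::UInt8, v::Vector{UInt8})
    n = 0
    i = unchecked_search(needle, v, 1)
    while i !== nothing
        n += 1
        i = unchecked_search(needle, v, i + 1)
    end
    return n
end

bytes = collect(codeunits("banana"))
n_a = count_matches(UInt8('a'), bytes)
next_n = checked_search(UInt8('n'), bytes, 4)
```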

jakobnissen · 2024-09-12T06:14:55Z

This is good to go now. Test failures are unrelated.

base/strings/search.jl

jakobnissen added the strings, search & find, and performance labels Jun 4, 2024

jakobnissen added 7 commits September 11, 2024 08:43

Various fixes to searching (squashed JuliaLang#54579)

cd1890b

Remove nothing_sentinel

5e4d98b

It's more Julian to return nothing directly from the search function.

Remove unused functions

a31db09

Many of these are identical to the generic fallback

Impl byte/string search as lazy iterator

6061a8d

This has two advantages: First, it consolidates the implementation of findnext and findall. Second, it allows a hypothetical lazy findall iterator to be trivially implemented later.

Make findall slightly faster

ba4e410

Take fast path not in every iteration, but just once, outside the loop.

jakobnissen force-pushed the find_refactor branch from df9c1d8 to ba4e410 September 11, 2024 06:46

jakobnissen added 2 commits September 11, 2024 08:49

Fix typos

c791285

Fixup

aa0305c

jakobnissen marked this pull request as ready for review September 12, 2024 06:13

jakobnissen added the awaiting review label Sep 12, 2024

KristofferC reviewed Sep 12, 2024

base/strings/search.jl (outdated, resolved)

Switch internal docstrings to comments

0b08a60

jakobnissen changed the title from "WIP: Refactor char/string and byte search" to "Refactor char/string and byte search" Sep 12, 2024
