why does this crate and PCRE2 differ with respect to \w+|[^\w\s]+ when searching haystacks with Unicode data? #1019

Answered by BurntSushi
DamonsJ asked this question in Q&A
It looks like you asked this question on StackOverflow, and the answer you got there was basically correct, with a few missing details added in the comments.

The result you're getting is correct. The key thing you're likely missing is that this crate defaults to treating \w as Unicode-aware, whereas PCRE2 defaults to treating \w as ASCII-only. You can make PCRE2 treat \w as Unicode-aware (by enabling the PCRE2_UCP option), and similarly, you can make this crate treat \w as ASCII-only. For example, (?-u:\w) and [\w&&\p{ascii}] are precisely equivalent.

Making [^\w\s] ASCII-only is a little trickier though, since (?-u:[^\w\s]) will match any individual byte that isn't in \w or \s. That in…

Answer selected by DamonsJ
This discussion was converted from issue #1018 on June 26, 2023 11:28.