why does this crate and PCRE2 differ with respect to `\w+|[^\w\s]+` when searching haystacks with Unicode data? #1019

DamonsJ · 2023-06-26T07:51:13Z

What version of regex are you using?

v1.8.4

Describe the bug at a high level.

Matching the string "戦場のヴァルキュリア3" against the pattern `r"\w+|[^\w\s]+"` gives 1 match,
but https://regexr.com/ with PCRE gives two matches: one is "戦場のヴァルキュリア" and the other is "3".

What are the steps to reproduce the behavior?

Here is the Rust code I used:
main.rs.txt

What is the actual behavior?

The Rust code gives 1 match, but 2 matches looks like the right result.

What is the expected behavior?

I expect 2 matches.

Answered by BurntSushi

BurntSushi · 2023-06-26T11:35:48Z
Maintainer

It looks like you asked this question on StackOverflow, and the answer you got was basically correct, with a few missing details added via the comments there.

The result you're getting is correct. The key thing you're likely missing is that this crate defaults to treating `\w` as Unicode-aware, whereas PCRE2 defaults to treating `\w` as ASCII-only. You can make PCRE2 treat `\w` as Unicode-aware (by enabling the `PCRE2_UCP` option), and similarly, you can make this crate treat `\w` as ASCII-only. For example, `(?-u:\w)` and `[\w&&\p{ascii}]` are precisely equivalent.

Making `[^\w\s]` ASCII-only is a little trickier though, since `(?-u:[^\w\s])` will match any individual byte that isn't in `\w` or `\s`. That in turn permits it to match invalid UTF-8, which is not allowed when using `regex::Regex`. So for that, you'll have to drop down to `regex::bytes::Regex` and match on `&[u8]`. However, you can use `[[^\w\s]&&\p{ascii}]` instead, which will match any ASCII codepoint that is not in `\w` or `\s`. That can be used with `regex::Regex` since it cannot match invalid UTF-8 by virtue of being limited to ASCII.
