Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace current regex engine with PCRE2 #4033

Closed
wants to merge 1 commit into from
Closed

Conversation

Rot127
Copy link
Member

@Rot127 Rot127 commented Dec 13, 2023

Your checklist for this pull request

  • I've read the guidelines for contributing to this repository
  • I made sure to follow the project's coding style
  • I've documented or updated the documentation of every function and struct this PR changes. If not so I've explained why.
  • I've added tests that prove my fix is effective or that my feature works (if possible)
  • I've updated the rizin book with the relevant information (if needed)

Detailed description

Replace the current regex engine with PCRE2.

Test plan

All green.

Closing issues

closes #3730

Partially addresses #4055

Todo

  • Add a rz_vector_pop_ptr() function which doesn't use memcpy() and use it in match_all_flat(). Or add a rz_vector_concat() function with the same functionality.
  • Terminology for groups and matches is weird.
  • docs
  • Open bug report for tcc and pcre2 compilation and __asm
  • Test rz_regex_get_match_name

@Rot127 Rot127 marked this pull request as draft December 13, 2023 23:04
@Rot127 Rot127 changed the title Add PCRE2 as dependency. Replace current regex engine with PCRE2 Dec 13, 2023
@Rot127 Rot127 added the performance A performance problem/enhancement label Dec 13, 2023
Copy link
Member

@XVilka XVilka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please send it from dist- branch from rizinorg account, to check that it builds and runs fine on all supported platforms and configurations
  2. Once it's done, I think it is worth removing the old implementation as well.

@wargio
Copy link
Member

wargio commented Dec 14, 2023

would be nice to also allow PCRE (v1) for older distro

@Rot127
Copy link
Member Author

Rot127 commented Dec 14, 2023

would be nice to also allow PCRE (v1) for older distro

Currently for every distro which has no pcre2 it is compiled as subproject. Shouldn't this be enough?

@XVilka
Copy link
Member

XVilka commented Dec 14, 2023

would be nice to also allow PCRE (v1) for older distro

Currently for every distro which has no pcre2 it is compiled as subproject. Shouldn't this be enough?

Yes, I think it should be enough. We have the same strategy for Capstone dependency as well.

@Rot127
Copy link
Member Author

Rot127 commented Dec 14, 2023

The PCRE2 defines a huge amount of options and flags. Is it fine if we do not add an RZ_REGEX_... variant for each of it? Simply to avoid duplicates?
Or is it always preferable to have our API consistently named?

@wargio
Copy link
Member

wargio commented Dec 15, 2023

you should only use PCRE2_CASELESS | PCRE2_MULTILINE
where caseless is applied only if the grep configuration is auto or case insensitive.

@XVilka XVilka added the refactor Refactoring requests label Dec 16, 2023
meson.build Outdated Show resolved Hide resolved
librz/util/regex.c Outdated Show resolved Hide resolved
@Rot127 Rot127 force-pushed the pcre2 branch 2 times, most recently from 558652e to 4168046 Compare January 3, 2024 17:28
@Rot127
Copy link
Member Author

Rot127 commented Jan 4, 2024

Tests/build still fail, but I'd like to know your opinion on the API first. Because now there are some changes in actual code now.

@Rot127 Rot127 marked this pull request as ready for review February 2, 2024 08:06
Copy link
Member

@XVilka XVilka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also fix test\db\archos\windows-x64\dbg_dts - remove \n or just don't include it in the capture.
macOS (Darwin) regexes should be fixed too, in particular with whitespace and newlines handling: https://ci.rizin.re/repos/27/pipeline/4083/7#L616

OpenBSD error message is puzzling:

ERROR: Regex compilation failed at 0: no more memory

Same happens on NetBSD too.

librz/cons/pager.c Show resolved Hide resolved
@XVilka

This comment was marked as resolved.

@XVilka

This comment was marked as resolved.

@Rot127
Copy link
Member Author

Rot127 commented Feb 3, 2024

Yep, aware of it. Just wait and bundle it with other fixes, so the CI is not triggered again and again.

@Rot127
Copy link
Member Author

Rot127 commented Feb 3, 2024

BSD problems seem to be a bug in PCRE2 (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276252). Hopefully only affecting the JIT compiler?

Going make a minimal example and open an issue later. If it is indeed a BSD only problem, I would just exclude BSD from JIT/

@XVilka
Copy link
Member

XVilka commented Feb 3, 2024

@Rot127 yes, though FreeBSD works just fine. The problem occurs only in OpenBSD and NetBSD, please disable JIT only for those, but not the FreeBSD.

@XVilka
Copy link
Member

XVilka commented Feb 3, 2024

Once this PR is green after the rebase, please send one from the dist-fuzz- branch, to check all configurations. Also, do you want to squash all those commits into one?

@Rot127
Copy link
Member Author

Rot127 commented Feb 3, 2024

Also, do you want to squash all those commits into one?

Absolutely. The commit history is a mess

@Rot127 Rot127 force-pushed the pcre2 branch 4 times, most recently from 5ff6680 to 5a5d5af Compare February 3, 2024 08:41
@thestr4ng3r

This comment was marked as resolved.

@Rot127
Copy link
Member Author

Rot127 commented Feb 3, 2024

Should be fixed now.

meson.build Outdated Show resolved Hide resolved
librz/magic/softmagic.c Outdated Show resolved Hide resolved
PCRE2 has way better performance than the OpenBSD
library (something around 20 times faster).

The following flags are enabled for every pattern:

- PCRE2_UTF
- PCRE2_MATCH_INVALID_UTF
- PCRE2_NO_UTF_CHECK

All the others are optional.

Changes made:

- Adds PCRE2 as subproject.
- Changes the API away from POSIX to PCRE2.
- Edits many regex patterns because:
 - ' ' is skipped in patterns, if the EXTENDED flag is set for matching. '\s' must be set now.
 - '.' doesn't match newlines by default.
- Changes the API so matches and their groups are bundled into PVectors.
- Moves the regex component to rz_util.
@XVilka
Copy link
Member

XVilka commented Feb 3, 2024

@Rot127 please send a new PR from inside the rizinorg account with the dist-fuzz- branch to trigger everything.

@Rot127 Rot127 mentioned this pull request Feb 4, 2024
10 tasks
@Rot127 Rot127 closed this Feb 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regex is slow
5 participants