implement phone number analyzer (cherry-pick to 2.x) #16187
Merged: reta merged 2 commits into opensearch-project:2.x from rursprung:implement-phone-number-analyzer-2.x on Oct 4, 2024
inspiration taken from [this SO answer][SO].

note that the stream is intentionally not parallelised: the method is intended to be called primarily with shorter strings, where setting up a parallel stream would take longer than the actual check.

[SO]: https://stackoverflow.com/a/35150400

Signed-off-by: Ralph Ursprung <[email protected]>
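The commit message above does not show the method itself. As a minimal sketch, assuming the check is an "all characters are digits" test on a short string, a sequential stream version could look like the following. The class name `DigitCheckSketch`, the method name `isAllDigits`, and the handling of `null` and empty input are assumptions made for illustration, not the code from the commit.

```java
// Hypothetical illustration only; names and null/empty handling are assumptions.
final class DigitCheckSketch {
    private DigitCheckSketch() {}

    // Returns true only for a non-empty string whose characters are all digits.
    static boolean isAllDigits(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        // Sequential stream on purpose: for short inputs, the setup cost of
        // .parallel() would outweigh the cost of the check itself.
        return s.chars().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("0041581234567")); // true
        System.out.println(isAllDigits("+41581234567"));  // false, '+' is not a digit
    }
}
```

`allMatch` short-circuits on the first non-digit, so the sequential version already does very little work on typical phone-number-sized input.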
this is largely based on [elasticsearch-phone] and internally uses [libphonenumber]. this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now; `phone-email` can be added if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods].

this allows defining the region to be used when analysing a phone number. so far only the generic "unknown" region (`ZZ`) had been used, which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`).

example configuration for an index:

```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```

this creates four analyzers:

- `phone` and `phone-search`, which do not explicitly specify a region and thus fall back to `ZZ` (unknown region; the regional version of the international dialing prefix, e.g. `00` instead of `+` in most of europe, will not be recognised)
- `phone-ch` and `phone-search-ch`, which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix)

note that the analyzer is (currently) not meant to find phone numbers in large text documents. instead, it should be used on fields which contain just the phone number (though extra text will be ignored). it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values.
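To illustrate why the region matters, here is a deliberately simplified sketch that uses only the JDK. It is not how the plugin or `libphonenumber` works internally; the class `RegionPrefixSketch`, the `DialingRules` type, the hand-written rule table, and the sample number are all hypothetical. It only encodes the Swiss conventions mentioned above (country code `41`, international prefix `00`, national prefix `0`) to show how `+41...`, `0041...` and `0...` can only be reconciled once a region is known.

```java
import java.util.Map;

// Hypothetical illustration only; real parsing should be left to libphonenumber.
final class RegionPrefixSketch {
    private RegionPrefixSketch() {}

    // Minimal, hand-written dialing rules for a single region (an assumption).
    static final class DialingRules {
        final String countryCode;
        final String internationalPrefix;
        final String nationalPrefix;

        DialingRules(String countryCode, String internationalPrefix, String nationalPrefix) {
            this.countryCode = countryCode;
            this.internationalPrefix = internationalPrefix;
            this.nationalPrefix = nationalPrefix;
        }
    }

    private static final Map<String, DialingRules> RULES =
        Map.of("CH", new DialingRules("41", "00", "0"));

    // Tries to bring a number into "+<country code><subscriber number>" form.
    static String normalize(String input, String region) {
        String digits = input.replaceAll("[^+0-9]", ""); // drop spaces and punctuation
        if (digits.startsWith("+")) {
            return digits; // already international, no region needed
        }
        DialingRules rules = RULES.get(region);
        if (rules == null) {
            return digits; // unknown region such as "ZZ": prefixes cannot be interpreted
        }
        // Check the international prefix first, since "00" also starts with "0".
        if (digits.startsWith(rules.internationalPrefix)) {
            return "+" + digits.substring(rules.internationalPrefix.length());
        }
        if (digits.startsWith(rules.nationalPrefix)) {
            return "+" + rules.countryCode + digits.substring(rules.nationalPrefix.length());
        }
        return digits;
    }

    public static void main(String[] args) {
        String[] samples = {"+41 58 123 45 67", "0041 58 123 45 67", "058 123 45 67"};
        for (String raw : samples) {
            System.out.println(normalize(raw, "ZZ") + " | " + normalize(raw, "CH"));
        }
    }
}
```

With region `CH`, all three made-up inputs end up as the same string, while with `ZZ` only the `+`-prefixed input is in international form. This mirrors the mismatch described in the commit message between a stored `+4158...` and an entered `004158...` or `058...`.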
this has been implemented in a new plugin which is, however, part of the central opensearch repository: it was deemed too big an overhead to have it in a separate repository, but not important enough to bundle it directly in `analysis-common` (see the discussion on the issue and the PR for further details).

note that the new plugin has been added to the exclude list of the javadoc check, as this check is overzealous and also complains in many cases where it shouldn't (e.g. on overridden methods - which it should theoretically not do - or constructors which don't even exist). the check first needs to be improved before this exclusion can be removed.

closes opensearch-project#11326

[elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone
[libphonenumber]: https://github.com/google/libphonenumber
[falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>
rursprung requested review from anasalkouz, andrross, ashking94, Bukhtawar, CEHENKLE, dblock, dbwiddis, gbbafna, jainankitk, kotwanikunal, linuxpi, mch2, msfroh, nknize, owaiskazi19, reta, Rishikesh1159, sachinpkale, saratvemulapalli, shwetathareja, sohami and VachaShah as code owners on October 4, 2024 06:43
github-actions bot added the enhancement, Search:Relevance, v2.18.0 and v3.0.0 labels on Oct 4, 2024
github-actions bot added the v2.18.0 and v3.0.0 labels on Oct 4, 2024
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##                2.x   #16187      +/-   ##
============================================
- Coverage     71.69%   71.67%   -0.03%
- Complexity    64780    64782       +2
============================================
  Files          5279     5284       +5
  Lines        302966   303042      +76
  Branches      44073    44082       +9
============================================
- Hits         217226   217195      -31
- Misses        67594    67646      +52
- Partials      18146    18201      +55
View full report in Codecov by Sentry.
reta approved these changes on Oct 4, 2024
Description
cherry-pick of #15915
see commit messages (or other PR) for details.
Related Issues
Resolves #11326
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.