Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement phone number analyzer (cherry-pick to 2.x) #16187

Merged

Conversation

rursprung
Copy link
Contributor

Description

cherry-pick of #15915

see commit messages (or other PR) for details.

Related Issues

Resolves #11326

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

inspiration taken from [this SO answer][SO].

note that the stream is not parallelised to avoid the overhead of this
as the method is intended to be called primarily with shorter strings
where the time to set up would take longer than the actual check.

[SO]: https://stackoverflow.com/a/35150400

Signed-off-by: Ralph Ursprung <[email protected]>
this is largely based on [elasticsearch-phone] and internally uses
[libphonenumber].
this intentionally only ports a subset of the features: only `phone` and
`phone-search` are supported right now, `phone-email` can be added
if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list [falsehoods programmers believe about phone
numbers][falsehoods].

this allows defining the region to be used when analysing a phone
number. so far only the generic "unkown" region (`ZZ`) had been used
which worked as long as international numbers were prefixed with `+` but
did not work when using local numbers (e.g. a number stored as
`+4158...` was not matched against a number entered as `004158...` or
`058...`).

example configuration for an index:
```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```
this creates four analyzers: `phone` and `phone-search` which do not
explicitly specify a region and thus fall back to `ZZ` (unknown region,
regional version of international dialing prefix (e.g. `00` instead of
`+` in most of europe) will not be recognised) and `phone-ch` and
`phone-search-ch` which will try to parse the phone number as a swiss
phone number (thus e.g. `00` as a prefix is recognised as the
international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a `String` in memory,
making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in `analysis-common` (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes opensearch-project#11326

[elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone
[libphonenumber]: https://github.com/google/libphonenumber
[falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>
@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Search:Relevance v2.18.0 Issues and PRs related to version 2.18.0 v3.0.0 Issues and PRs related to version 3.0.0 labels Oct 4, 2024
@github-actions github-actions bot added v2.18.0 Issues and PRs related to version 2.18.0 v3.0.0 Issues and PRs related to version 3.0.0 labels Oct 4, 2024
@rursprung rursprung mentioned this pull request Oct 4, 2024
3 tasks
Copy link
Contributor

github-actions bot commented Oct 4, 2024

✅ Gradle check result for 2d72f2a: SUCCESS

Copy link

codecov bot commented Oct 4, 2024

Codecov Report

Attention: Patch coverage is 97.36842% with 2 lines in your changes missing coverage. Please review.

Project coverage is 71.67%. Comparing base (66852cf) to head (2d72f2a).
Report is 1 commits behind head on 2.x.

Files with missing lines Patch % Lines
...earch/analysis/phone/PhoneNumberTermTokenizer.java 97.87% 0 Missing and 1 partial ⚠️
...nalysis/phone/PhoneNumberTermTokenizerFactory.java 80.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##                2.x   #16187      +/-   ##
============================================
- Coverage     71.69%   71.67%   -0.03%     
- Complexity    64780    64782       +2     
============================================
  Files          5279     5284       +5     
  Lines        302966   303042      +76     
  Branches      44073    44082       +9     
============================================
- Hits         217226   217195      -31     
- Misses        67594    67646      +52     
- Partials      18146    18201      +55     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@reta reta merged commit 56f1498 into opensearch-project:2.x Oct 4, 2024
56 of 57 checks passed
@rursprung rursprung deleted the implement-phone-number-analyzer-2.x branch October 4, 2024 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Relevance v2.18.0 Issues and PRs related to version 2.18.0 v3.0.0 Issues and PRs related to version 3.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants