Implement phone number analyzer #15915

rursprung · 2024-09-12T16:55:17Z

Description

this is largely based on elasticsearch-phone and internally uses
libphonenumber.
this intentionally only ports a subset of the features: only phone and
phone-search are supported right now, phone-email can be added
if/when there's a clear need for it.

using libphonenumber is required since parsing phone numbers is a
non-trivial task (even though it might seem trivial at first glance!),
as can be seen in the list of falsehoods programmers believe about phone
numbers.

this allows defining the region to be used when analysing a phone
number. so far only the generic "unknown" region (ZZ) had been used,
which worked as long as international numbers were prefixed with + but
did not work when using local numbers (e.g. a number stored as
+4158... was not matched against a number entered as 004158... or
058...).

example configuration for an index:

{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}

this creates four analyzers: phone and phone-search, which do not
explicitly specify a region and thus fall back to ZZ (unknown region;
the regional version of the international dialing prefix, e.g. 00
instead of + in most of europe, will not be recognised), and phone-ch
and phone-search-ch, which will try to parse the phone number as a swiss
phone number (thus e.g. 00 as a prefix is recognised as the
international dialing prefix).
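
to illustrate why the region matters, here is a deliberately naive sketch. it is not how the plugin or libphonenumber works internally (libphonenumber uses per-region metadata); the class `NaiveRegionNormalizer`, the hard-coded swiss rules (`00` as international prefix, `0` as national prefix, country code `41`) and the sample number are assumptions made only for this illustration.

```java
// Hypothetical, simplified stand-in for region-aware parsing.
// Real parsing should be left to libphonenumber.
final class NaiveRegionNormalizer {
    private NaiveRegionNormalizer() {}

    static String normalize(final String raw, final String region) {
        // Keep only digits and '+', dropping spaces and other separators.
        final String digits = raw.replaceAll("[^0-9+]", "");
        if (digits.startsWith("+")) {
            // Already in international format, no region needed.
            return digits;
        }
        if ("CH".equals(region)) {
            if (digits.startsWith("00")) {
                // Regional international dialing prefix.
                return "+" + digits.substring(2);
            }
            if (digits.startsWith("0")) {
                // National prefix: replace it with the country code.
                return "+41" + digits.substring(1);
            }
        }
        // Unknown region (e.g. ZZ): the prefix cannot be interpreted.
        return digits;
    }
}
```

with region `CH`, both `0041 58 123 45 67` and `058 123 45 67` end up as `+41581234567` in this sketch, whereas with `ZZ` the `00...` input is left untouched and therefore would not match the stored `+41...` form.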

note that the analyzer is (currently) not meant to find phone numbers in
large text documents - instead it should be used on fields which contain
just the phone number (though extra text will be ignored) and it
collects the whole content of the field into a String in memory,
making it unsuitable for large field values.
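
as a rough idea of what "collecting the whole field into a String" means, the following sketch drains a `Reader` into memory. this is an assumption about the general shape of such a step, not the plugin's actual code; the class name `FieldReader` and the buffer size are made up for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper: reads the complete field content into one String.
final class FieldReader {
    private FieldReader() {}

    static String readAll(final Reader input) {
        final StringBuilder content = new StringBuilder();
        final char[] buffer = new char[1024];
        try {
            int read;
            // read() returns -1 once the end of the stream is reached.
            while ((read = input.read(buffer)) != -1) {
                content.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }
}
```

memory use grows linearly with the field size here, which is why this approach only makes sense for short values such as a single phone number.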

this has been implemented in a new plugin which is however part of the
central opensearch repository as it was deemed too big an overhead to
have it in a separate repository but not important enough to bundle it
directly in analysis-common (see the discussion on the issue and the
PR for further details).

note that the new plugin has been added to the exclude list of the
javadoc check as this check is overzealous and also complains in many
cases where it shouldn't (e.g. on overridden methods - which it should
theoretically not do - or constructors which don't even exist). the
check first needs to be improved before this exclusion could be removed.

closes #11326

Signed-off-by: Ralph Ursprung [email protected]

Related Issues

Resolves #11326

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable. => [DOC] new phone number analyzer plugin documentation-website#8389

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2024-09-12T17:00:59Z

❌ Gradle check result for 74429fe: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-12T17:09:44Z

❌ Gradle check result for d844ea9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-13T13:14:04Z

❌ Gradle check result for 24e60a5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-09-13T14:25:24Z

❕ Gradle check result for f7669e2: UNSTABLE

TEST FAILURES:

      1 org.opensearch.gateway.RecoveryFromGatewayIT.testShardStoreFetchMultiNodeMultiIndexesUsingBatchAction

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

codecov · 2024-09-13T14:26:39Z

Codecov Report

Attention: Patch coverage is 97.36842% with 2 lines in your changes missing coverage. Please review.

Project coverage is 71.94%. Comparing base (6020c58) to head (a3ac6dc).
Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
...earch/analysis/phone/PhoneNumberTermTokenizer.java	97.87%	0 Missing and 1 partial ⚠️
...nalysis/phone/PhoneNumberTermTokenizerFactory.java	80.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #15915      +/-   ##
============================================
+ Coverage     71.88%   71.94%   +0.05%     
- Complexity    64496    64535      +39     
============================================
  Files          5291     5296       +5     
  Lines        301668   301744      +76     
  Branches      43576    43585       +9     
============================================
+ Hits         216863   217094     +231     
+ Misses        67031    66764     -267     
- Partials      17774    17886     +112

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

rursprung · 2024-09-13T14:33:45Z

> ❕ Gradle check result for f7669e2: UNSTABLE
>
> TEST FAILURES:
>
>       1 org.opensearch.gateway.RecoveryFromGatewayIT.testShardStoreFetchMultiNodeMultiIndexesUsingBatchAction
>
> Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

this is a flaky test: #14304

and the failure of the "mend security check" also seems to be random (but i don't have the rights to re-trigger it)

opensearch-trigger-bot · 2024-10-03T22:40:03Z

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-15915-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 d1fd47c652b4c6a2c0ec5d0ee574a0ff0d263177
# Push it to GitHub
git push --set-upstream origin backport/backport-15915-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-15915-to-2.x.

reta · 2024-10-03T22:40:31Z

@rursprung apologies, mind please sending a manual backport to the 2.x branch? thank you

rursprung · 2024-10-04T07:00:09Z

> @rursprung apologies, mind please sending a manual backport to the 2.x branch? thank you

no worries, done: #16187

i'm a big fan of having a changelog, but it's causing a lot of merge conflicts here 🙁
i've seen another repo (don't remember which, but i have a vague feeling that it was even in the opensearch org?) which had a subfolder where you created one file per PR for the changelog and some automated tooling then collected all of that together and merged it into the main changelog file for the release (and deleted the other files). maybe that might be an idea here as well to avoid the merge conflicts? might be worth discussing in a separate issue (which should probably come from someone regularly contributing to this repo as you'll be much more affected by this issue than me)?

on another note: squash-merging destroys my nice atomic commits 🙁
i get that you do that for PRs where people just add a ton of "fix review finding" commits, but for proper (linux kernel style ;)) PRs, where you force-push to keep a nice linear commit history with atomic commits, each with a nice commit message, i think this is a big loss (and it makes it harder to find the actual culprit with tools like git-bisect and git-revert)

reta · 2024-10-04T11:43:22Z

> might be worth discussing in a separate issue (which should probably come from someone regularly contributing to this repo as you'll be much more affected by this issue than me)?

Please feel free to open an issue or kick off discussion!

> on another note: squash-merging destroys my nice atomic commits

The clean repo history is useful, but this is a tradeoff for sure

this is part of opensearch-project/OpenSearch#11326. the actual implementation was done in opensearch-project/OpenSearch#15915. see the commit message on the PR for further details.

resolves opensearch-project#8389

Co-authored-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Ralph Ursprung <[email protected]>

this is part of opensearch-project/OpenSearch#11326. the actual implementation was done in opensearch-project/OpenSearch#15915. see the commit message on the PR for further details.

the new test group `analysis` has been added so that it can later be extended with all other optional language analyzers (which are currently also not covered).

note that the CI currently needs to fetch the image from `opensearchstaging` as 2.18.0 hasn't been released yet. the `hub` and `ref` config can be removed once 2.18.0 has been released.

Signed-off-by: Ralph Ursprung <[email protected]>

* add `Strings#isDigits` API

inspiration taken from [this SO answer][SO].

note that the stream is not parallelised to avoid the overhead of this, as the method is intended to be called primarily with shorter strings where the time to set up would take longer than the actual check.

[SO]: https://stackoverflow.com/a/35150400

Signed-off-by: Ralph Ursprung <[email protected]>

* add `phone` & `phone-search` analyzer + tokenizer

this is largely based on [elasticsearch-phone] and internally uses [libphonenumber].
this intentionally only ports a subset of the features: only `phone` and `phone-search` are supported right now, `phone-email` can be added if/when there's a clear need for it.

using `libphonenumber` is required since parsing phone numbers is a non-trivial task (even though it might seem trivial at first glance!), as can be seen in the list [falsehoods programmers believe about phone numbers][falsehoods].

this allows defining the region to be used when analysing a phone number. so far only the generic "unknown" region (`ZZ`) had been used, which worked as long as international numbers were prefixed with `+` but did not work when using local numbers (e.g. a number stored as `+4158...` was not matched against a number entered as `004158...` or `058...`).

example configuration for an index:

```json
{
  "index": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "phone"
        },
        "phone-search": {
          "type": "phone-search"
        },
        "phone-ch": {
          "type": "phone",
          "phone-region": "CH"
        },
        "phone-search-ch": {
          "type": "phone-search",
          "phone-region": "CH"
        }
      }
    }
  }
}
```

this creates four analyzers: `phone` and `phone-search`, which do not explicitly specify a region and thus fall back to `ZZ` (unknown region; the regional version of the international dialing prefix, e.g. `00` instead of `+` in most of europe, will not be recognised), and `phone-ch` and `phone-search-ch`, which will try to parse the phone number as a swiss phone number (thus e.g. `00` as a prefix is recognised as the international dialing prefix).

note that the analyzer is (currently) not meant to find phone numbers in large text documents - instead it should be used on fields which contain just the phone number (though extra text will be ignored) and it collects the whole content of the field into a `String` in memory, making it unsuitable for large field values.

this has been implemented in a new plugin which is however part of the central opensearch repository as it was deemed too big an overhead to have it in a separate repository but not important enough to bundle it directly in `analysis-common` (see the discussion on the issue and the PR for further details).

note that the new plugin has been added to the exclude list of the javadoc check as this check is overzealous and also complains in many cases where it shouldn't (e.g. on overridden methods - which it should theoretically not do - or constructors which don't even exist). the check first needs to be improved before this exclusion could be removed.

closes opensearch-project#11326

[elasticsearch-phone]: https://github.com/purecloudlabs/elasticsearch-phone
[libphonenumber]: https://github.com/google/libphonenumber
[falsehoods]: https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md

Signed-off-by: Ralph Ursprung <[email protected]>

---------

Signed-off-by: Ralph Ursprung <[email protected]>
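
the `Strings#isDigits` helper mentioned in the commit message above could look roughly like the following sequential-stream sketch. the class name `DigitCheck` and the handling of `null` and empty input are assumptions for this example and may differ from the merged API.

```java
// Hypothetical stand-alone version of a digits-only check.
final class DigitCheck {
    private DigitCheck() {}

    static boolean isDigits(final String s) {
        // Assumption: null and empty strings are treated as "not digits".
        // The stream is sequential; for short strings, setting up a
        // parallel stream would cost more than the check itself.
        return s != null && !s.isEmpty() && s.chars().allMatch(Character::isDigit);
    }
}
```

note that `Character.isDigit` also accepts non-ASCII unicode digits, so a stricter check would be needed if only `0`-`9` should pass.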

* document the new `analysis-phonenumber` plugin

this is part of opensearch-project/OpenSearch#11326. the actual implementation was done in opensearch-project/OpenSearch#15915. see the commit message on the PR for further details.

resolves #8389

Co-authored-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Ralph Ursprung <[email protected]>

* Minor rewrites

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

* Update _analyzers/supported-analyzers/phone-analyzers.md

Signed-off-by: kolchfa-aws <[email protected]>

* Update _analyzers/supported-analyzers/phone-analyzers.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

* Apply suggestions from code review

Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Ralph Ursprung <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>


github-actions bot added enhancement Enhancement or improvement to existing feature or request Search:Relevance labels Sep 12, 2024

rursprung force-pushed the implement-phone-number-analyzer branch from 74429fe to d844ea9 Compare September 12, 2024 16:56

rursprung mentioned this pull request Sep 12, 2024

analysis-common: make UniqueTokenFilter public #14179

Closed

3 tasks

rursprung force-pushed the implement-phone-number-analyzer branch from d844ea9 to 24e60a5 Compare September 13, 2024 13:06

rursprung force-pushed the implement-phone-number-analyzer branch from 24e60a5 to f7669e2 Compare September 13, 2024 13:30

rursprung marked this pull request as ready for review September 13, 2024 14:35

rursprung requested review from anasalkouz, andrross, ashking94, Bukhtawar, CEHENKLE, dblock, dbwiddis, gbbafna, jainankitk, kotwanikunal, linuxpi, mch2, msfroh, nknize, owaiskazi19, reta and Rishikesh1159 as code owners September 13, 2024 14:35

reta merged commit d1fd47c into opensearch-project:main Oct 3, 2024
33 of 35 checks passed

opensearch-trigger-bot bot added the backport-failed label Oct 3, 2024

rursprung deleted the implement-phone-number-analyzer branch October 4, 2024 06:19

rursprung mentioned this pull request Oct 4, 2024

implement phone number analyzer (cherry-pick to 2.x) #16187

Merged

3 tasks

rursprung mentioned this pull request Oct 4, 2024

[Feature Request] Backports fail often due to CHANGELOG conflicts #15149

Open

rursprung mentioned this pull request Oct 4, 2024

document the new analysis-phonenumber plugin opensearch-project/documentation-website#8469

Merged

1 task

rursprung mentioned this pull request Oct 11, 2024

add phone number analysis plugin opensearch-project/opensearch-api-specification#609

Open

BrewTestBot mentioned this pull request Nov 6, 2024

opensearch 2.18.0 Homebrew/homebrew-core#196785

Merged
