discussing changes to LanguageAnalyzer #788

Open · wants to merge 10 commits into base: main
Conversation

@brentam brentam commented Dec 31, 2023

Description

This is a temporary/draft PR to discuss the viability of changes to the LanguageAnalyzer needed to correct some issues.
The original PR is here #779

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@@ -68,7 +69,7 @@ private LanguageAnalyzer(Builder builder) {

this.version = builder.version;
this.language = ApiTypeHelper.requireNonNull(builder.language, this, "language");
this.stemExclusion = ApiTypeHelper.unmodifiableRequired(builder.stemExclusion, this, "stemExclusion");
this.stemExclusion = builder.stemExclusion; // ApiTypeHelper.unmodifiableRequired(builder.stemExclusion, this, "stemExclusion");
@brentam (Contributor Author) commented Dec 31, 2023:
stemExclusion should not be mandatory
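To illustrate the point, here is a minimal sketch of what making the field optional means in a builder-style class. This is a simplified stand-in, not the generated opensearch-java `LanguageAnalyzer`: the `requireNonNull`-style check is dropped for `stemExclusion`, and the getter falls back to an empty list when the field was never set.

```java
import java.util.Collections;
import java.util.List;

// Simplified stand-in for the generated LanguageAnalyzer; illustrates
// treating stemExclusion as optional instead of unmodifiableRequired(...).
final class SimpleLanguageAnalyzer {
    private final String language;
    private final List<String> stemExclusion; // may be empty, never null

    private SimpleLanguageAnalyzer(String language, List<String> stemExclusion) {
        if (language == null) {
            throw new IllegalArgumentException("language is required");
        }
        this.language = language;
        // Optional field: fall back to an empty, unmodifiable list when unset.
        this.stemExclusion = stemExclusion == null
            ? Collections.emptyList()
            : Collections.unmodifiableList(stemExclusion);
    }

    static final class Builder {
        private String language;
        private List<String> stemExclusion;

        Builder language(String language) { this.language = language; return this; }
        Builder stemExclusion(List<String> stemExclusion) { this.stemExclusion = stemExclusion; return this; }
        SimpleLanguageAnalyzer build() { return new SimpleLanguageAnalyzer(language, stemExclusion); }
    }

    String language() { return language; }
    List<String> stemExclusion() { return stemExclusion; }
}
```

With this shape, `new SimpleLanguageAnalyzer.Builder().language("german").build()` succeeds without a stem-exclusion list, while `language` remains mandatory.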

Collaborator:

It is supported by most analyzers, but not all.

Collaborator:

But it is reasonable to make it optional for those that support this property, I agree.

@@ -589,8 +592,11 @@ public Analyzer build() {

protected static void setupAnalyzerDeserializer(ObjectDeserializer<Builder> op) {

for (Language value : Language.values()) {
op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, value.jsonValue().toLowerCase());
//TODO should we lowercase in the Language Enum?
Contributor Author:

The Language jsonValue has to match the Kind jsonValue, so we set the deserializers for all the languages here.

Arabic("Arabic"),

Armenian("Armenian"),
Arabic("arabic"),
Contributor Author:

The JSON value has to be lowercase to match the values OpenSearch expects.
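A minimal sketch of this idea, using a hypothetical mini enum rather than the generated `Language`: the Java constant name stays capitalized, while the JSON value is the lowercase string OpenSearch expects on the wire, so deserializer keys can be registered straight from `jsonValue()` without any `toLowerCase()` call.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mirror of the Language enum: Java constant name capitalized,
// JSON value lowercase to match what OpenSearch expects on the wire.
enum MiniLanguage {
    Arabic("arabic"),
    Armenian("armenian"),
    German("german");

    private final String jsonValue;

    MiniLanguage(String jsonValue) { this.jsonValue = jsonValue; }

    String jsonValue() { return jsonValue; }
}

class DeserializerKeys {
    // Mimics the setupAnalyzerDeserializer loop: one key per language,
    // taken straight from jsonValue() (already lowercase by construction).
    static Map<String, MiniLanguage> build() {
        Map<String, MiniLanguage> keys = new HashMap<>();
        for (MiniLanguage value : MiniLanguage.values()) {
            keys.put(value.jsonValue(), value);
        }
        return keys;
    }
}
```

With lowercase values baked into the enum, the registration loop in the deserializer stays a one-liner and the `//TODO should we lowercase in the Language Enum?` question is resolved at the source.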

@@ -83,7 +84,7 @@ public static LanguageAnalyzer of(Function<Builder, ObjectBuilder<LanguageAnalyz
*/
@Override
public Analyzer.Kind _analyzerKind() {
return Analyzer.Kind.Language;
return Analyzer.Kind.valueOf(this.language.name());
}
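The runtime mapping in the diff above can be sketched with two hypothetical mirrored enums (not the real `Analyzer.Kind` and `Language`). `valueOf` only works because every language constant has an identically named kind constant; if the names ever drift apart, `valueOf` throws `IllegalArgumentException` at runtime rather than failing at compile time.

```java
// Hypothetical mirrors of Analyzer.Kind and Language; the constant
// names must match exactly for valueOf-based resolution to work.
enum MiniKind { Cjk, Arabic, German }

enum Lang {
    Arabic, German;

    // Runtime resolution of the analyzer kind, analogous to
    // Analyzer.Kind.valueOf(this.language.name()) in the diff.
    MiniKind toKind() {
        return MiniKind.valueOf(this.name());
    }
}
```

This is the fragility the comment below alludes to: the link between the two enums is by string name only, with no compiler enforcement.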
@brentam (Contributor Author) commented Dec 31, 2023:

We have to determine the Analyzer.Kind at runtime. Note that the Language values have to match the Kind values.

Collaborator:

I believe Cjk was added because it does not support stem_exclusion; not every language analyzer supports it, actually (per documentation).

assertEquals(analyzer.cjk().stopwords(), analyzer2.cjk().stopwords());
assertEquals(analyzer.cjk().stopwordsPath(), analyzer2.cjk().stopwordsPath());
}
// temporarily commenting out this test until we decide if we keep the CjkAnalyzer
Collaborator:

I think we should keep CjkAnalyser since it does not support stem_exclusion

@brentam (Contributor Author) commented Jan 1, 2024:

@reta
Sorry if the comment was not clear. What I meant is that if we change the LanguageAnalyzer, the CjkAnalyzer will be deleted, same as the DutchAnalyzer, since all language analyzers would be constructed using LanguageAnalyzer + Language.

@brentam (Contributor Author) commented Jan 1, 2024:

I am unsure what you mean.
The whole point of the discussion was that maybe we should use the LanguageAnalyzer pattern to build the language analyzers. If we have one exception for Cjk, we will keep supporting two patterns. In that case I would rather keep all 33 analyzers created in the other PR, since the LanguageAnalyzer at the moment is not correct and the analyzer-per-language pattern is working fine.

@reta (Collaborator) commented Jan 2, 2024:

@brentam

The whole point of the discussion was that maybe we should use the LanguageAnalyzer pattern to build the language analyzers.

This is correct; we do, though, need to acknowledge that not all LanguageAnalyzers support the same customization capabilities: CJK does not support stem_exclusion (there may be others as well, but this one for sure). Making stem_exclusion optional won't solve the problem: the server side will (or at least should) reject the CJK analyzer with stem_exclusion.

Looking into how we could accommodate that.

@reta (Collaborator) commented Jan 2, 2024:

Ok, @brentam I think I have a solution for that:

  • we keep CjkAnalyzer but
  • we remove Cjk from Language enumeration

This is not ideal, but the upside is that it prevents users from referring to unsupported features at compile time (vs. removing CjkAnalyzer but failing at runtime).

Contributor Author:

I found other big issues with the analyzers API while testing and looking at the code. I was thinking about creating other tickets because this discussion is becoming quite long, but I will add them in this thread so you have an idea of the minimum I think needs to be done. These are based on the use case we need supported at work.

Contributor Author:

Here it is. There seem to be a lot of problems with the IndexSettingsAnalysis.Builder.
(This can be the subject of another PR.)

Check this example

        Map<String, Analyzer> map = new HashMap<>();
        map.put("some_analyzer",
                LanguageAnalyzer.of(l -> l.language(Language.German)
                                .stemExclusion(new ArrayList<>()))
                        ._toAnalyzer());
        IndexSettingsAnalysis settings = IndexSettingsAnalysis.of(
                anl -> anl
                        .analyzer(map)
                        .charFilter("some_char_filter", CharFilter.of(c -> c.name("html_strip")))
        );

        System.out.println(toJson(settings));

This will produce this JSON:
{"analyzer":{"some_analyzer":{"type":"language","language":"German","stem_exclusion":[]}},"char_filter":{"some_char_filter":"html_strip"}}.

Now, we should be able to specify the char_filter under "some_analyzer" too, not only under the "analyzer" like it is here.

These two JSON documents are perfectly fine and we are actually using this in production:

          "english_analyzer": {
            "type": "english",
            "char_filter": [
              "html_strip"
            ],
            "tokenizer": "standard"
          }
          "german_analyzer" : {
            "filter" : [ "lowercase", "german_normalization", "german_stop", "german_synonyms", "german_decompounder", "german_synonyms", "possessive_stemmer", "german_stemmer" ],
            "char_filter" : [ "html_strip" ],
            "tokenizer" : "standard"
          },

So some redesign is needed; the easiest would be to modify the language deserializer to accept "char_filter" and some other clauses.

Here is the deserialization function for IndexSettingsAnalysis:

    protected static void setupIndexSettingsAnalysisDeserializer(ObjectDeserializer<IndexSettingsAnalysis.Builder> op) {

        op.add(Builder::analyzer, JsonpDeserializer.stringMapDeserializer(Analyzer._DESERIALIZER), "analyzer");
        op.add(Builder::charFilter, JsonpDeserializer.stringMapDeserializer(CharFilter._DESERIALIZER), "char_filter");
        op.add(Builder::filter, JsonpDeserializer.stringMapDeserializer(TokenFilter._DESERIALIZER), "filter");
        op.add(Builder::normalizer, JsonpDeserializer.stringMapDeserializer(Normalizer._DESERIALIZER), "normalizer");
        op.add(Builder::tokenizer, JsonpDeserializer.stringMapDeserializer(Tokenizer._DESERIALIZER), "tokenizer");

    }

I believe what we need to do is to modify the Analyzer._DESERIALIZER and the LanguageAnalyzer.Builder to allow for
"char_filter", "filter", "normalizer" and "tokenizer".
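To illustrate what that proposal would mean mechanically, here is a toy key-registration model (not the real `ObjectDeserializer` API, and the builder fields are illustrative only): allowing the extra clauses comes down to registering more property handlers on the language analyzer's deserializer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Toy stand-in for ObjectDeserializer<Builder>: maps JSON keys to
// setter callbacks invoked on a builder object during deserialization.
class ToyDeserializer<B> {
    final Map<String, BiConsumer<B, Object>> handlers = new HashMap<>();

    void add(BiConsumer<B, Object> setter, String key) {
        handlers.put(key, setter);
    }

    boolean accepts(String key) { return handlers.containsKey(key); }
}

class ToyLanguageAnalyzerBuilder {
    Object charFilter, filter, normalizer, tokenizer;

    static ToyDeserializer<ToyLanguageAnalyzerBuilder> setup() {
        ToyDeserializer<ToyLanguageAnalyzerBuilder> op = new ToyDeserializer<>();
        // The extra clauses the comment proposes to allow on language analyzers:
        op.add((b, v) -> b.charFilter = v, "char_filter");
        op.add((b, v) -> b.filter = v, "filter");
        op.add((b, v) -> b.normalizer = v, "normalizer");
        op.add((b, v) -> b.tokenizer = v, "tokenizer");
        return op;
    }
}
```

Whether these keys belong on language analyzers at all (rather than only on custom analyzers) is exactly what is debated later in this thread.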

The CharFilter.Builder seems to be incomplete too.
There is no way to construct an array of names, like what is expected here:

          "english_analyzer": {
            "type": "english",
            "char_filter": [
              "html_strip"
            ],
            "tokenizer": "standard"
          }

It only allows a map
.charFilter("some_char_filter", CharFilter.of(c -> c.name("html_strip")))
that will result in
{"some_char_filter":"html_strip"}

Now, I was looking at the APIs related to the CharFilter builders. There are a lot of classes, and it is hard to understand the right ones to use, so refactoring that will be a pain...

I suggest the simple fix would be to have a "references" method in the builder.
We would set it like this:
CharFilter.of(c -> c.references(new String[]{"html_strip","some_custom_charset_defined_elsewhere"}))
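A rough sketch of that hypothetical `references(...)` idea; the method does not exist in the current API, and the class and method names here are illustrative only. The point is simply that a list of existing filter names would serialize to a JSON array of strings instead of a single string or map entry.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical char-filter variant that holds a list of names of
// filters defined elsewhere and serializes them as a JSON array.
class ReferencesCharFilter {
    private final List<String> references;

    ReferencesCharFilter(String... names) {
        this.references = Arrays.asList(names);
    }

    // Produces the array form the built-in analyzers use,
    // e.g. ["html_strip","some_other_filter"].
    String toJson() {
        return references.stream()
            .map(n -> "\"" + n + "\"")
            .collect(Collectors.joining(",", "[", "]"));
    }
}
```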

@reta (Collaborator) commented Jan 3, 2024:

Thanks @brentam, I will try to answer based on the documentation.

Aren't we being too pedantic here?
Is the way to construct the language analyzer using the LanguageAnalyzer builder really that important?

This model is generated from the API specification, so it is supposed to work, or the API is completely wrong. The fact that stem_exclusion is not enforced is probably not good; I suspect that, when specified, the expectation is that it should be applied during analysis (I will look into it shortly).

That is another reason I am in favour of keeping the 33 new analyzers I have created in the other PR and not touching the LanguageAnalyzer at all.

As you discovered, the LanguageAnalyzer does not work at all at the moment. I believe that with the changes from this pull request it should work the way it is supposed to, right?

I believe what we need to do is to modify the Analyzer._DESERIALIZER and the LanguageAnalyzer.Builder to allow for
"char_filter", "filter", "normalizer" and "tokenizer".

No, I think we should not - this belongs to custom analyzers [1]

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html#analysis-custom-analyzer

@reta (Collaborator) commented Jan 3, 2024:

@brentam quick update (my apologies, learning some things along with you); this is my understanding of all the pieces so far:

  1. Language analyzers: OpenSearch uses some predefined language analyzers that come with Apache Lucene; those can be used by specifying "type": "<language>" when defining analyzers.

  2. All of those predefined language analyzers support stop words, but not all support stem_exclusion. The way it works, though, is very unexpected: there is no validation, and the unsupported properties (like stem_exclusion in the case of Cjk) are silently ignored. More than that, to my understanding, the same applies to char_filter & co.; they are not taken into account, since predefined language analyzers do not support this customization (that is why custom has to be used). In practice it means that what you think you configure is not what you get:

     "english_analyzer": {
       "type": "english",
       "char_filter": [
         "html_strip"
       ],
       "tokenizer": "standard"
     }

    is equivalent to

    "english_analyzer": {
      "type": "english"
    }

    and, sadly, is equivalent to (since char_filter and tokenizer are ignored completely):

    "english_analyzer": {
      "type": "english",
      "char_filter": [
        "garbage"
      ],
      "tokenizer": "garbage"
    }

All in all, I think we are making a very good change (thanks to you): first of all, fixing the client libraries, and secondly, uncovering the gaps we currently have on the server side. Lastly, the documentation would need a large chunk of work as well.

@brentam (Contributor Author) commented Jan 20, 2024:

omg, that is sad...lol. I will have to review our analyzers definition at work now..lol

@@ -92,6 +90,74 @@ public enum Kind implements JsonEnum {

Cjk("cjk"),

Arabic("arabic"),
Collaborator:

Curious: why do we need to mirror the Language enumeration here?

@@ -83,7 +84,7 @@ public static LanguageAnalyzer of(Function<Builder, ObjectBuilder<LanguageAnalyz
*/
@Override
public Analyzer.Kind _analyzerKind() {
return Analyzer.Kind.Language;
return Analyzer.Kind.valueOf(this.language.name());
Collaborator:

I think we don't need to change the analyzer kind here; we need to change serialization (type -> language) and deserialization (the Analyzer class):

        for (final Language language: Language.values()) {
            op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, language.jsonValue());
        }


generator.write("type", "language");
generator.write("type", this._analyzerKind().jsonValue());
Collaborator:

👍

@@ -589,8 +592,11 @@ public Analyzer build() {

protected static void setupAnalyzerDeserializer(ObjectDeserializer<Builder> op) {

for (Language value : Language.values()) {
op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, value.jsonValue().toLowerCase());
Collaborator:

You already did that, right?

Suggested change
op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, value.jsonValue().toLowerCase());
op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, value.jsonValue());

// assertTrue(analyzer2.isCjk());
// assertEquals(analyzer.cjk().stopwords(), analyzer2.cjk().stopwords());
// assertEquals(analyzer.cjk().stopwordsPath(), analyzer2.cjk().stopwordsPath());
// }

Collaborator:

@Test
public void testLanguage_Analyzer() {
    for (final Language language: Language.values()) {
        final Analyzer analyzer = new Analyzer.Builder()
            .language(b -> b.language(language)
                .stopwords(Arrays.asList("a", "b", "c"))
                .stopwordsPath("path"))
            .build();

        String str = toJson(analyzer);
        assertEquals("{\"type\":\"" + language.jsonValue() + "\",\"stopwords\":[\"a\",\"b\",\"c\"],\"stopwords_path\":\"path\"}", str);

        Analyzer analyzer2 = fromJson(str, Analyzer._DESERIALIZER);
        assertEquals(analyzer.language().stopwords(), analyzer2.language().stopwords());
        assertEquals(analyzer.language().stopwordsPath(), analyzer2.language().stopwordsPath());
        assertEquals(analyzer.language().language(), language);
    }
}

}

@dblock (Member) commented Jan 3, 2024:

Let's tag @msfroh here, who works on search, for some 👀. I don't understand how much of this is supposed to work, and I am looking forward to @brentam becoming an expert here and helping us sort this mess out!

@msfroh commented Jan 4, 2024:

Let's tag @msfroh here, who works on search, for some 👀. I don't understand how much of this is supposed to work, and I am looking forward to @brentam becoming an expert here and helping us sort this mess out!

Thanks! I'll need to go through this a bit.

I'm not clear on why the client needs to be aware of what analyzers exist server-side (especially since that can vary based on what plugins are installed on the server). I'm going to read through until I understand what the client is doing.

@brentam (Contributor Author) commented Jan 20, 2024:

@reta @msfroh
Sorry, I have been busy and could not visit this thread. I can revisit it in the near future. At the moment, our team is more concerned with another issue: some of the builders are not able to deserialize our query properly due to some errors in the implementation.
What we need is a way to pass raw JSON to the search client (without having to deserialize to the API Java classes).
I will probably create a story for this, and will be glad to work on it if you agree it is something we should have.

@reta (Collaborator) commented Jan 20, 2024:

What we need is a way to pass a raw json to the seach client. (without having to deserialize to the api java classes). I will probably create an story for this, and will be glad to work on it if you guys agree is something we should have.

@brentam sure, please check whether we already have an issue for that; e.g. #377 may be the one you are looking for.

4 participants