discussing changes to LanguageAnalyzer #788

Open · wants to merge 10 commits into base: main
Conversation

@brentam brentam commented Dec 31, 2023

Description

This is a temporary/draft PR to discuss the viability of changes to the LanguageAnalyzer needed to correct some issues.
The original PR is here #779

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@@ -68,7 +69,7 @@ private LanguageAnalyzer(Builder builder) {

this.version = builder.version;
this.language = ApiTypeHelper.requireNonNull(builder.language, this, "language");
this.stemExclusion = ApiTypeHelper.unmodifiableRequired(builder.stemExclusion, this, "stemExclusion");
this.stemExclusion = builder.stemExclusion; // ApiTypeHelper.unmodifiableRequired(builder.stemExclusion, this, "stemExclusion");
@brentam (Contributor Author) commented Dec 31, 2023:
stemExclusion should not be mandatory
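To illustrate the point, here is a minimal sketch of what making the field optional means in a builder-style class. This is a simplified stand-in, not the generated opensearch-java `LanguageAnalyzer`: the `requireNonNull`-style check is dropped for `stemExclusion`, and the getter falls back to an empty list when the field was never set.

```java
import java.util.Collections;
import java.util.List;

// Simplified stand-in for the generated LanguageAnalyzer; illustrates
// treating stemExclusion as optional instead of unmodifiableRequired(...).
final class SimpleLanguageAnalyzer {
    private final String language;
    private final List<String> stemExclusion; // may be empty, never null

    private SimpleLanguageAnalyzer(String language, List<String> stemExclusion) {
        if (language == null) {
            throw new IllegalArgumentException("language is required");
        }
        this.language = language;
        // Optional field: fall back to an empty, unmodifiable list when unset.
        this.stemExclusion = stemExclusion == null
            ? Collections.emptyList()
            : Collections.unmodifiableList(stemExclusion);
    }

    static final class Builder {
        private String language;
        private List<String> stemExclusion;

        Builder language(String language) { this.language = language; return this; }
        Builder stemExclusion(List<String> stemExclusion) { this.stemExclusion = stemExclusion; return this; }
        SimpleLanguageAnalyzer build() { return new SimpleLanguageAnalyzer(language, stemExclusion); }
    }

    String language() { return language; }
    List<String> stemExclusion() { return stemExclusion; }
}
```

With this shape, `new SimpleLanguageAnalyzer.Builder().language("german").build()` succeeds without a stem-exclusion list, while `language` remains mandatory.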

Collaborator:

It is supported by most analyzers, but not all.

Collaborator:

But it is reasonable to make it optional for those that support this property, I agree.

@@ -589,8 +592,11 @@ public Analyzer build() {

protected static void setupAnalyzerDeserializer(ObjectDeserializer<Builder> op) {

for (Language value : Language.values()) {
op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, value.jsonValue().toLowerCase());
//TODO should we lowercase in the Language Enum?
Contributor Author:

The Language jsonValue has to match the Kind jsonValue, so we set the deserializers for all the languages here.

Arabic("Arabic"),

Armenian("Armenian"),
Arabic("arabic"),
Contributor Author:

The JSON value has to be lowercase to match the values OpenSearch expects.
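A minimal sketch of this idea, using a hypothetical mini enum rather than the generated `Language`: the Java constant name stays capitalized, while the JSON value is the lowercase string OpenSearch expects on the wire, so deserializer keys can be registered straight from `jsonValue()` without any `toLowerCase()` call.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mirror of the Language enum: Java constant name capitalized,
// JSON value lowercase to match what OpenSearch expects on the wire.
enum MiniLanguage {
    Arabic("arabic"),
    Armenian("armenian"),
    German("german");

    private final String jsonValue;

    MiniLanguage(String jsonValue) { this.jsonValue = jsonValue; }

    String jsonValue() { return jsonValue; }
}

class DeserializerKeys {
    // Mimics the setupAnalyzerDeserializer loop: one key per language,
    // taken straight from jsonValue() (already lowercase by construction).
    static Map<String, MiniLanguage> build() {
        Map<String, MiniLanguage> keys = new HashMap<>();
        for (MiniLanguage value : MiniLanguage.values()) {
            keys.put(value.jsonValue(), value);
        }
        return keys;
    }
}
```

With lowercase values baked into the enum, the registration loop in the deserializer stays a one-liner and the `//TODO should we lowercase in the Language Enum?` question is resolved at the source.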

@@ -83,7 +84,7 @@ public static LanguageAnalyzer of(Function<Builder, ObjectBuilder<LanguageAnalyz
*/
@Override
public Analyzer.Kind _analyzerKind() {
return Analyzer.Kind.Language;
return Analyzer.Kind.valueOf(this.language.name());
}
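The runtime mapping in the diff above can be sketched with two hypothetical mirrored enums (not the real `Analyzer.Kind` and `Language`). `valueOf` only works because every language constant has an identically named kind constant; if the names ever drift apart, `valueOf` throws `IllegalArgumentException` at runtime rather than failing at compile time.

```java
// Hypothetical mirrors of Analyzer.Kind and Language; the constant
// names must match exactly for valueOf-based resolution to work.
enum MiniKind { Cjk, Arabic, German }

enum Lang {
    Arabic, German;

    // Runtime resolution of the analyzer kind, analogous to
    // Analyzer.Kind.valueOf(this.language.name()) in the diff.
    MiniKind toKind() {
        return MiniKind.valueOf(this.name());
    }
}
```

This is the fragility the comment below alludes to: the link between the two enums is by string name only, with no compiler enforcement.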
@brentam (Contributor Author) commented Dec 31, 2023:

We have to determine the Analyzer.Kind at runtime. Note that the Language values have to match the Kind values.

Collaborator:

I believe Cjk was added because it does not support stem_exclusion; not every language analyzer supports it, actually (per documentation).

assertEquals(analyzer.cjk().stopwords(), analyzer2.cjk().stopwords());
assertEquals(analyzer.cjk().stopwordsPath(), analyzer2.cjk().stopwordsPath());
}
// temporarily commenting out this test until we decide if we keep the CjkAnalyzer
Collaborator:

I think we should keep CjkAnalyser since it does not support stem_exclusion

@brentam (Contributor Author) commented Jan 1, 2024:

@reta
Sorry if the comment was not clear. What I meant is that if we change the LanguageAnalyzer, the CjkAnalyzer will be deleted, same as the DutchAnalyzer, since all language analyzers would be constructed using LanguageAnalyzer + Language.

@brentam (Contributor Author) commented Jan 1, 2024:

I am unsure what you mean.
The whole point of the discussion was that maybe we should use the LanguageAnalyzer pattern to build the language analyzers. If we have one exception for Cjk, we will keep supporting two patterns. In that case I would rather keep all 33 analyzers created in the other PR, since the LanguageAnalyzer at the moment is not correct and the analyzer-per-language pattern is working fine.

@reta (Collaborator) commented Jan 2, 2024:

@brentam

The whole point of the discussion was that maybe we should use the LanguageAnalyzer pattern to build the language analyzers.

This is correct; we do, though, need to acknowledge that not all LanguageAnalyzers support the same customization capabilities: CJK does not support stem_exclusion (there may be others as well, but this one for sure). Making stem_exclusion optional won't solve the problem: the server side will (or at least should) reject the CJK analyzer with stem_exclusion.

Looking into how we could accommodate that.

@reta (Collaborator) commented Jan 2, 2024:

Ok, @brentam I think I have a solution for that:

  • we keep CjkAnalyzer but
  • we remove Cjk from Language enumeration

This is not ideal, but the upside is that it prevents users from referring to unsupported features at compile time (vs. removing CjkAnalyzer but failing at runtime).

Contributor Author:

I found other big issues with the analyzers API while testing and looking at the code. I was thinking about creating other tickets because this discussion is becoming quite long, but I will add them in this thread so you have an idea of the minimum I think needs to be done. These are based on the use case we need supported at work.

Contributor Author:

Here it is. There seem to be a lot of problems with the IndexSettingsAnalysis.Builder.
(This can be the subject of another PR.)

Check this example

        Map<String, Analyzer> map = new HashMap<>();
        map.put("some_analyzer",
                LanguageAnalyzer.of(l -> l.language(Language.German)
                                .stemExclusion(new ArrayList<>()))
                        ._toAnalyzer());
        IndexSettingsAnalysis settings = IndexSettingsAnalysis.of(
                anl -> anl
                        .analyzer(map)
                        .charFilter("some_char_filter", CharFilter.of(c -> c.name("html_strip")))
        );

        System.out.println(toJson(settings));

This will produce this JSON:
{"analyzer":{"some_analyzer":{"type":"language","language":"German","stem_exclusion":[]}},"char_filter":{"some_char_filter":"html_strip"}}.

Now, we should be able to specify the char_filter under "some_analyzer" too, not only under the "analyzer" like it is here.

These two JSON documents are perfectly fine and we are actually using this in production:

          "english_analyzer": {
            "type": "english",
            "char_filter": [
              "html_strip"
            ],
            "tokenizer": "standard"
          }
          "german_analyzer" : {
            "filter" : [ "lowercase", "german_normalization", "german_stop", "german_synonyms", "german_decompounder", "german_synonyms", "possessive_stemmer", "german_stemmer" ],
            "char_filter" : [ "html_strip" ],
            "tokenizer" : "standard"
          },

So some redesign is needed; the easiest would be to modify the language deserializer to accept "char_filter" and some other clauses.

Here is the deserialization function for IndexSettingsAnalysis:

    protected static void setupIndexSettingsAnalysisDeserializer(ObjectDeserializer<IndexSettingsAnalysis.Builder> op) {

        op.add(Builder::analyzer, JsonpDeserializer.stringMapDeserializer(Analyzer._DESERIALIZER), "analyzer");
        op.add(Builder::charFilter, JsonpDeserializer.stringMapDeserializer(CharFilter._DESERIALIZER), "char_filter");
        op.add(Builder::filter, JsonpDeserializer.stringMapDeserializer(TokenFilter._DESERIALIZER), "filter");
        op.add(Builder::normalizer, JsonpDeserializer.stringMapDeserializer(Normalizer._DESERIALIZER), "normalizer");
        op.add(Builder::tokenizer, JsonpDeserializer.stringMapDeserializer(Tokenizer._DESERIALIZER), "tokenizer");

    }

I believe what we need to do is to modify the Analyzer._DESERIALIZER and the LanguageAnalyzer.Builder to allow for
"char_filter", "filter", "normalizer" and "tokenizer".
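To illustrate what that proposal would mean mechanically, here is a toy key-registration model (not the real `ObjectDeserializer` API, and the builder fields are illustrative only): allowing the extra clauses comes down to registering more property handlers on the language analyzer's deserializer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Toy stand-in for ObjectDeserializer<Builder>: maps JSON keys to
// setter callbacks invoked on a builder object during deserialization.
class ToyDeserializer<B> {
    final Map<String, BiConsumer<B, Object>> handlers = new HashMap<>();

    void add(BiConsumer<B, Object> setter, String key) {
        handlers.put(key, setter);
    }

    boolean accepts(String key) { return handlers.containsKey(key); }
}

class ToyLanguageAnalyzerBuilder {
    Object charFilter, filter, normalizer, tokenizer;

    static ToyDeserializer<ToyLanguageAnalyzerBuilder> setup() {
        ToyDeserializer<ToyLanguageAnalyzerBuilder> op = new ToyDeserializer<>();
        // The extra clauses the comment proposes to allow on language analyzers:
        op.add((b, v) -> b.charFilter = v, "char_filter");
        op.add((b, v) -> b.filter = v, "filter");
        op.add((b, v) -> b.normalizer = v, "normalizer");
        op.add((b, v) -> b.tokenizer = v, "tokenizer");
        return op;
    }
}
```

Whether these keys belong on language analyzers at all (rather than only on custom analyzers) is exactly what is debated later in this thread.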

The CharFilter.Builder seems to be incomplete too.
There is no way to construct an array of names, like what is expected here:

          "english_analyzer": {
            "type": "english",
            "char_filter": [
              "html_strip"
            ],
            "tokenizer": "standard"
          }

It only allows a map
.charFilter("some_char_filter", CharFilter.of(c -> c.name("html_strip")))
that will result in
{"some_char_filter":"html_strip"}

Now, I was looking at the APIs related to the CharFilter builders. There are a lot of classes, and it is hard to understand the right ones to use, so refactoring that will be a pain...

I suggest the simple fix would be to have a "references" method in the builder.
We would set it like this:
CharFilter.of(c -> c.references(new String[]{"html_strip","some_custom_charset_defined_elsewhere"}))
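A rough sketch of that hypothetical `references(...)` idea; the method does not exist in the current API, and the class and method names here are illustrative only. The point is simply that a list of existing filter names would serialize to a JSON array of strings instead of a single string or map entry.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical char-filter variant that holds a list of names of
// filters defined elsewhere and serializes them as a JSON array.
class ReferencesCharFilter {
    private final List<String> references;

    ReferencesCharFilter(String... names) {
        this.references = Arrays.asList(names);
    }

    // Produces the array form the built-in analyzers use,
    // e.g. ["html_strip","some_other_filter"].
    String toJson() {
        return references.stream()
            .map(n -> "\"" + n + "\"")
            .collect(Collectors.joining(",", "[", "]"));
    }
}
```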

@reta (Collaborator) commented Jan 3, 2024:

Thanks @brentam, I will try to answer based on the documentation.

Aren't we being too pedantic here?
Is the way to construct the language analyzer using the LanguageAnalyzer builder really that important?

This model is generated from the API specification, so it is supposed to work, or the API is completely wrong. The fact that stem_exclusion is not enforced is probably not good; I suspect that, when specified, the expectation is that it should be applied during analysis (I will look into it shortly).

That is another reason I am in favour of keeping the 33 new analyzers I have created in the other PR and not touching the LanguageAnalyzer at all.

As you discovered, the LanguageAnalyzer does not work at all at the moment. I believe that with the changes from this pull request it should work the way it is supposed to, right?

I believe what we need to do is to modify the Analyzer._DESERIALIZER and the LanguageAnalyzer.Builder to allow for
"char_filter", "filter", "normalizer" and "tokenizer".

No, I think we should not - this belongs to custom analyzers [1]

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html#analysis-custom-analyzer

@reta (Collaborator) commented Jan 3, 2024:

@brentam quick update (my apologies, learning some things along with you); this is my understanding of all the pieces so far:

  1. Language analyzers: OpenSearch uses some predefined language analyzers that come with Apache Lucene; those can be used by specifying "type": "<language>" when defining analyzers.

  2. All of those predefined language analyzers support stop words, but not all support stem_exclusion. The way it works, though, is very unexpected: there is no validation, and the unsupported properties (like stem_exclusion in the case of Cjk) are silently ignored. More than that, to my understanding, the same applies to char_filter & co.; they are not taken into account, since predefined language analyzers do not support this customization (that is why custom has to be used). In practice it means that what you think you configure is not what you get:

     "english_analyzer": {
       "type": "english",
       "char_filter": [
         "html_strip"
       ],
       "tokenizer": "standard"
     }

    is equivalent to

    "english_analyzer": {
      "type": "english"
    }

    and, sadly, is equivalent to (since char_filter and tokenizer are ignored completely):

    "english_analyzer": {
      "type": "english",
      "char_filter": [
        "garbage"
      ],
      "tokenizer": "garbage"
    }

All in all, I think we are making a very good change (thanks to you): first of all, fixing the client libraries, and secondly, uncovering the gaps we currently have on the server side. Lastly, the documentation would need a large chunk of work as well.

@brentam (Contributor Author) commented Jan 20, 2024:

omg, that is sad...lol. I will have to review our analyzers definition at work now..lol

@@ -92,6 +90,74 @@ public enum Kind implements JsonEnum {

Cjk("cjk"),

Arabic("arabic"),
Collaborator:

Curious: why do we need to mirror the Language enumeration here?

@@ -83,7 +84,7 @@ public static LanguageAnalyzer of(Function<Builder, ObjectBuilder<LanguageAnalyz
*/
@Override
public Analyzer.Kind _analyzerKind() {
return Analyzer.Kind.Language;
return Analyzer.Kind.valueOf(this.language.name());
Collaborator:

I think we don't need to change the analyzer kind here; we need to change serialization (type -> language) and deserialization (the Analyzer class):

        for (final Language language: Language.values()) {
            op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, language.jsonValue());
        }


generator.write("type", "language");
generator.write("type", this._analyzerKind().jsonValue());
Collaborator:

👍

@@ -589,8 +592,11 @@ public Analyzer build() {

protected static void setupAnalyzerDeserializer(ObjectDeserializer<Builder> op) {

for (Language value : Language.values()) {
op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, value.jsonValue().toLowerCase());
Collaborator:

You already did that, right?

Suggested change
op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, value.jsonValue().toLowerCase());
op.add(Builder::language, LanguageAnalyzer._DESERIALIZER, value.jsonValue());

// assertTrue(analyzer2.isCjk());
// assertEquals(analyzer.cjk().stopwords(), analyzer2.cjk().stopwords());
// assertEquals(analyzer.cjk().stopwordsPath(), analyzer2.cjk().stopwordsPath());
// }

Collaborator:

@Test
public void testLanguage_Analyzer() {
    for (final Language language: Language.values()) {
        final Analyzer analyzer = new Analyzer.Builder()
            .language(b -> b.language(language)
                .stopwords(Arrays.asList("a", "b", "c"))
                .stopwordsPath("path"))
            .build();

        String str = toJson(analyzer);
        assertEquals("{\"type\":\"" + language.jsonValue() + "\",\"stopwords\":[\"a\",\"b\",\"c\"],\"stopwords_path\":\"path\"}", str);

        Analyzer analyzer2 = fromJson(str, Analyzer._DESERIALIZER);
        assertEquals(analyzer.language().stopwords(), analyzer2.language().stopwords());
        assertEquals(analyzer.language().stopwordsPath(), analyzer2.language().stopwordsPath());
        assertEquals(analyzer.language().language(), language);
    }
}

}

@dblock (Member) commented Jan 3, 2024:

Let's tag @msfroh here, who works on search, for some 👀. I don't understand how much of this is supposed to work, and I am looking forward to @brentam becoming an expert here and helping us sort this mess out!

@msfroh commented Jan 4, 2024:

Let's tag @msfroh here, who works on search, for some 👀. I don't understand how much of this is supposed to work, and I am looking forward to @brentam becoming an expert here and helping us sort this mess out!

Thanks! I'll need to go through this a bit.

I'm not clear on why the client needs to be aware of what analyzers exist server-side (especially since that can vary based on what plugins are installed on the server). I'm going to read through until I understand what the client is doing.

@brentam (Contributor Author) commented Jan 20, 2024:

@reta @msfroh
Sorry, I have been busy and could not visit this thread. I can revisit it in the near future. At the moment, our team is more concerned with another issue: some of the builders are not able to deserialize our query properly due to some errors in the implementation.
What we need is a way to pass raw JSON to the search client (without having to deserialize to the API Java classes).
I will probably create a story for this, and will be glad to work on it if you agree it is something we should have.

@reta (Collaborator) commented Jan 20, 2024:

What we need is a way to pass a raw json to the seach client. (without having to deserialize to the api java classes). I will probably create an story for this, and will be glad to work on it if you guys agree is something we should have.

@brentam sure, please check whether we already have an issue for that; e.g. #377 may be the one you are looking for.

4 participants