Improving Swedish stemming of -ös #152

Closed
znakeeye opened this issue Sep 2, 2021 · 15 comments
Comments

@znakeeye

znakeeye commented Sep 2, 2021

I have identified a group of words that are incorrectly stemmed. Please assist me. I don't know how to patch swedish.sbl.

I realize it's kind of hard to improve the stemmer without over-stemming. However, there are around 100 Swedish adjectives ending with ös (English equivalent ous) that do not conflict with any öst words.

Suffix t not identified
These word endings are correctly stemmed with a suffix of a, but not t (English equivalent ly). E.g.:

"monstruösa" → "monstruös"   ✓
"monstruöst" → "monstruöst"  ✗
"nervösa"    → "nervös"      ✓
"nervöst"    → "nervöst"     ✗

Now, this is almost handled in the existing stemmer. Recall the handling of "lös" at line 58.

Proposed improvement
Not sure if regex/ranges are supported, or what the syntax would be. But, this is a 100% verified improvement (no over-stemming):

Change 'l{o"}st' (<-'l{o"}s') to something where l is changed to [ilnuv]. Somehow...

Otherwise, we could simply add these four lines:

'i{o"}st'        (<-'i{o"}s')
'n{o"}st'        (<-'n{o"}s')
'u{o"}st'        (<-'u{o"}s')
'v{o"}st'        (<-'v{o"}s')
@ojwb
Member

ojwb commented Sep 3, 2021

I've not tried this out myself yet, but it sounds sensible from your description.

The four lines you suggest aren't a bad way to implement it (each among gets turned into a data structure which is searched by binary chop, so they can handle a lot of entries pretty efficiently).

But perhaps a better option here is to drop the l from that among entry's string and then in its action add a test for one of the five characters as the next character (bearing in mind Snowball is working in backwardmode here so "next" is actually previous in string order) - the natural Snowball way to do that is to define a grouping for them, so something like:

    groupings ( v s_ending ost_prefix )

    // ...

    define ost_prefix 'ilnuv'

    define other_suffix as setlimit tomark p1 for (
        [substring] among(
            'lig' 'ig' 'els' (delete)
            '{o"}st'        (ost_prefix <-'{o"}s')
            'fullt'          (<-'full')
        )
    )

(Maybe there's a better name for the new grouping.)

If you want to submit a patch, please do - otherwise I'm happy to look at this, but not until after we've landed the patch for #47.

@znakeeye
Author

znakeeye commented Sep 3, 2021

That's beautiful. Using this pattern I think we can probably add some more prefixes. Or maybe a negated prefix?

My understanding so far is that very few words end with öst. For those words, removing the t will not cause any conflict in most cases. The only conflicts I found are:

röst   → rös
öst    → ös

höst   → hös ... which will not match....
hösten → höst

sydöst → sydös (no conflict, but incorrect stem)

I believe a perfect rule for öst is:

  • Prefix is not any of dhr or "empty".

Can you come up with such a pattern for the .sbl?

@ojwb
Member

ojwb commented Sep 3, 2021

You can use non to invert the sense of a grouping check - e.g. non ost_prefix or non-ost_prefix.

Note that only the last of your examples will actually be considered here anyway, since this removal is restricted to suffixes which lie entirely in region R1. Here R1 is the region after the first non-vowel following a vowel, adjusted so that it starts at least 3 characters from the start of the word if the first non-vowel following a vowel comes before that.

So | marks the start of R1 below:

  • rös|t
  • öst| (by the "at least 3 characters" rule)
  • hös|t
  • syd|öst (only in this case does the suffix lie entirely in R1)

(This also means that if the öst suffix check passed there must be at least 3 characters in the prefix, so there's no need to worry about it being empty.)
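
The R1 rule described above can be sketched in Python. This is only an illustration of the description in this comment, not the Snowball-generated code; the function names `r1_start` and `show_r1` are made up here, and the handling of words shorter than 3 characters (empty R1) is an assumption.

```python
# Swedish vowels as used by the stemmer's "v" grouping: a e i o u y ä å ö
VOWELS = set("aeiouyäåö")


def r1_start(word: str) -> int:
    """Return the index at which R1 starts in `word`."""
    if len(word) < 3:
        # Assumption: words too short for the 3-character rule have an empty R1.
        return len(word)
    p1 = len(word)  # default: R1 is empty
    seen_vowel = False
    for i, ch in enumerate(word):
        if ch in VOWELS:
            seen_vowel = True
        elif seen_vowel:
            # First non-vowel following a vowel: R1 starts just after it.
            p1 = i + 1
            break
    # R1 may not start less than 3 characters from the start of the word.
    return max(p1, 3)


def show_r1(word: str) -> str:
    """Mark the start of R1 with '|', as in the list above."""
    p1 = r1_start(word)
    return word[:p1] + "|" + word[p1:]


for w in ["röst", "öst", "höst", "sydöst"]:
    print(show_r1(w))
# prints: rös|t, öst|, hös|t, syd|öst (one per line)
```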

@znakeeye
Copy link
Author

znakeeye commented Sep 5, 2021

The only exceptions left to consider are these three words then:

öst - ös, ösa, öst (homonym of "scooped" and "east")
föst - "swept"
löst - "solved" or "loose"

These are the only words to consider outside of R1. Especially "löst" is important, since it is a fairly common word. "öst" is a homonym, so not sure how to think here.

@ojwb
Member

ojwb commented Sep 6, 2021

I think öst is best left alone here:

  • the "east" meaning will be a fairly common word (probably at least as common as the verb meaning)
  • conflating it with other forms of the verb creates a lot of false matches around the "east" meaning which outweigh the benefits of finding currently missed matches for the verb meaning
  • nouns and adjectives are more important for search than verbs (at least I've seen that stated in several academic papers about stemming, but not sure I've actually seen a study that demonstrated this as such now I come to think of it!)

@ojwb
Member

ojwb commented Sep 28, 2023

> If you want to submit a patch, please do - otherwise I'm happy to look at this, but not until after we've landed the patch for #47.

That seems to be more complicated than expected, so I think we should actually try to merge this one first as it looks like that's more easily achievable.

@ojwb
Member

ojwb commented Sep 28, 2023

> But perhaps a better option here is to drop the l from that among entry's string and then in its action add a test for one of the five characters as the next character

That does have slightly different semantics as it doesn't require the l (or other character) to also be in R1, whereas the current code does. However you say "-ös" is like English "-ous" so if anything that seems more logical, and in practical terms just changing where the "l" gets tested for doesn't actually change the stem for anything in swedish/voc.txt.

@ojwb
Member

ojwb commented Sep 29, 2023

Reviewing the discussion we have a few suggested and inferred options which in perl-like regexp terms are:

  • s/([ilnuv]ös)t$/$1/
  • s/([^dhr]ös)t$/$1/
  • s/([^d]ös)t$/$1/ (because the "röst" and "höst" problem cases are actually excluded by the R1 check)
  • s/(.ös)t$/$1/ (because d was motivated by "(no conflict, but incorrect stem)" but we aren't seeking to implement lemmatisation and the returned stem is really just an arbitrary string)

(Possibly plus some exceptions to also apply this rule to cases that are too short by the R1 rule, but I'll worry about those separately.)
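
To make the differences between the four candidates concrete, here is a small Python sketch that applies each regexp to a few words mentioned in this thread. It deliberately ignores the R1 restriction and the earlier stemming steps, so it only shows what each pattern would do in isolation; `CANDIDATES` and `apply_rule` are names invented for this illustration.

```python
import re

# The four candidate rules from the list above, keyed by their character class.
CANDIDATES = {
    "[ilnuv]": re.compile(r"([ilnuv]ös)t$"),
    "[^dhr]": re.compile(r"([^dhr]ös)t$"),
    "[^d]": re.compile(r"([^d]ös)t$"),
    ".": re.compile(r"(.ös)t$"),
}


def apply_rule(name: str, word: str) -> str:
    """Apply one candidate rule; the word is returned unchanged if it doesn't match."""
    return CANDIDATES[name].sub(r"\1", word)


for word in ["nervöst", "religiöst", "sydöst", "basröst"]:
    results = {name: apply_rule(name, word) for name in CANDIDATES}
    print(word, results)
```

With these inputs, "nervöst" and "religiöst" lose the "t" under all four rules, "sydöst" only changes under the `.` rule, and "basröst" changes under `[^d]` and `.` but not under the two more restrictive rules.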

I'd probably lean slightly towards an inclusive list, as perhaps less likely to have unintended effects on proper nouns, foreign words, obscure words, etc, but only weakly - e.g. if we had about half the alphabet to include or exclude I'd pick inclusive.

This rule is applied as the final step after other suffixes have potentially been removed, so e.g. "senbösten" -> "senböst" by main_suffix and would then have this rule applied, so we need to consider that (reading between the lines a little, but I think you were just looking at actual Swedish words ending "ös" and "öst").

I looked at this for the current vocab by commenting out the call to other_suffix and then doing:

./stemwords -l swedish -p2 < ../snowball-data/swedish/voc.txt |grep '[^l]öst\?$'|sort -k2,2|less -p '.*t$'

This only shows words where the stem being formed ("proto-stem") ends "ös" or "öst" at the point in the process where this rule would be applied (and I've excluded those ending "lös" or "löst" as those are already handled, or aren't handled because of the R1 rule). The -p option to less means the ones where there's a "t" we could remove are highlighted.
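
For anyone who prefers to do the same filtering in a script, a rough Python equivalent of the grep and sort stages might look like the sketch below. It assumes each input line holds a word and its proto-stem separated by whitespace, which is an assumption about the pretty-printed output rather than something checked here; the sample lines are made up for illustration, and `candidate_rows` is a hypothetical helper name.

```python
import re

# Proto-stems ending "ös" or "öst" that are not preceded by "l".
# A negative lookbehind is used so a bare "öst" stem is also kept, which
# mirrors grep's [^l] matching the whitespace before the stem column.
PROTO_STEM = re.compile(r"(?<!l)öst?$")


def candidate_rows(lines):
    """Return (word, proto_stem) pairs of interest, sorted by proto-stem."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        word, stem = fields
        if PROTO_STEM.search(stem):
            rows.append((word, stem))
    return sorted(rows, key=lambda row: row[1])


# Illustrative input only, not real stemwords output.
sample = ["nervöst   nervöst", "lösa   lös", "rösten   röst", "huset   hus"]
for word, stem in candidate_rows(sample):
    flag = "*" if stem.endswith("t") else " "  # "*" marks a removable "t"
    print(flag, word, stem)
```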

Looking over this list, there are 22 different proto-stems ending "öst" at that point. The only cases where there's another word in the vocab we could conflate one of these with by removing the final "t" are:

  • "nervöst" could conflate with "nervös" and "nervösa" (a case you mention above, and handled by all the candidate rules)
  • "religiöst" could conflate with "religiös" and "religiösa" (also handled by all the candidate rules above)

There's also these which are too short by the R1 rule:

  • "pöser"/"pöste" (seem to both be forms of "pösa" from https://en.wiktionary.org/wiki/p%C3%B6sa#Swedish and neither seems to have another meaning, so these could be usefully conflated, but probably not worth a special case)
  • "rös"/"röst"/"rösts"/"rösten"/"röster"/"röstens"/"rösterna" - currently all but the first stem to "röst" and from the above I gather we don't want to conflate with "rös" so that's good.

I'm currently crunching Swedish wikipedia data to get a larger wordlist and will try this on that too.

@znakeeye
Author

s/([ilnuv]ös)t$/$1/ will take us quite far, I think.

You might want to add the very rare t, k to the inclusion list too:

utöst (utösa) - utös reasonable stem? Like ösa...?
pastöst - pastös
komatöst - komatös
ödematöst - ödematös

visköst - viskös

@ojwb
Member

ojwb commented Sep 30, 2023

Thanks for the feedback. My wikipedia data is still crunching (I think the Python module the script uses must have got significantly slower, but this Swedish wikipedia dump is also the largest I've run it on so far).

@ojwb
Member

ojwb commented Oct 3, 2023

Looking at the 98888 words which occur at least 36 times in Swedish wikipedia, and checking what gets presented to the final step of the stemmer (like I did above) I also found "generöst" (neuter singular of "generös" - like English "generous"), "poröst" (ditto for "porös" - English "porous") and "rigoröst" (ditto "rigorös" - English "rigorous"), which are all "r" cases. We don't want to conflate "rös"/"röst" but the R1 check means we wouldn't anyway. So "r" is a candidate if there aren't other problematic cases (but there aren't a huge number of "r" words, or indeed really of anything except the "l" we already handle).

The only other words ending "-öst" don't have matching "-ös" words in the list, and I don't see any words ending "-töst" or "-köst" (or "-uöst").

Here are the counts for all (but not taking into account whether the R1 requirement excludes them):

      2 döst
      5 höst
      4 iöst
      1 jöst
     55 löst
      2 nöst
     46 röst
      1 vöst

(I checked what the "j" word is and it's "sjöst", so it isn't in R1.)


Reducing the frequency threshold to 20 (150940 words) finds me "incestuös"/"incestuöst" and "luxuös"/"luxuöst", but otherwise doesn't find any more words with both "-ös" and "-öst" forms. The only other letter we gain is "f" from "föste"/"föstes" which reach this rule as "föst", but are excluded from it by the R1 check anyway.


Reducing the frequency threshold to 5 (429604 words) finds more "r" words: "fibrös"/"fibröst", "glamorös"/"glamoröst", but also "överöser" (which reaches this stage as "överös") vs "överöst"/"överöste"/"överöstes" which seem to not be forms of the same word as best I can make out, but they're already conflated by the current algorithm anyway.

There's also a "p" word: "pompös"/"pompöst".

And at last a "k" word: "viskös"/"visköst" (same example you gave); still no "t" but there are your examples and no counterexamples.

The grouping test does a max/min check and then a bitmap test so there's no runtime overhead there from adding extra characters within the range we'd be testing anyway, so I think we might as well include "k", "p" and "t".
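
The range check followed by a bitmap test can be illustrated in Python as below. This is a sketch of the general technique only, not the code the Snowball compiler generates; `make_grouping` and `in_grouping` are names chosen for this example.

```python
def make_grouping(chars: str):
    """Build (lowest code point, highest code point, bitmap) for a set of characters."""
    codes = [ord(c) for c in chars]
    lo, hi = min(codes), max(codes)
    bitmap = 0
    for code in codes:
        bitmap |= 1 << (code - lo)  # one bit per code point in [lo, hi]
    return lo, hi, bitmap


def in_grouping(ch: str, grouping) -> bool:
    """Range check first, then a single bit test."""
    lo, hi, bitmap = grouping
    code = ord(ch)
    if code < lo or code > hi:
        return False
    return (bitmap >> (code - lo)) & 1 == 1


ost_ending = make_grouping("iklnptuv")
print(in_grouping("k", ost_ending))  # True
print(in_grouping("r", ost_ending))  # False: inside the i..v range, but its bit is not set
print(in_grouping("ö", ost_ending))  # False: outside the range entirely
```

Since "k", "p" and "t" all fall between "i" and "v", adding them only sets extra bits in the bitmap; the range check and the bit test themselves are unchanged.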

Perhaps "r" too - I haven't spotted a problematic case for "-röst" but there are other words it affects so there's more to worry about, but also more cases and more common cases than "k", "p" or "t".

@ojwb
Member

ojwb commented Oct 3, 2023

> That does have slightly different semantics as it doesn't require the l (or other character) to also be in R1, whereas the current code does. However you say "-ös" is like English "-ous" so if anything that seems more logical, and in practical terms just changing where the "l" gets tested for doesn't actually change the stem for anything in swedish/voc.txt.

I was wrong about that - the setlimit tomark p1 for has a parenthesised list after it which includes the among as well as the substring, so the limit restriction applies when we test for the character in the action of the among too.

I tested changing it to work how I thought it was now working and it's better for a few cases and worse for none, so let's do that too - it is more logical for "-ös" and doesn't affect the other suffixes handled by this step.

@ojwb
Member

ojwb commented Oct 3, 2023

Here's my proposed change without "r" (with the with-"r" variant commented out):

--- a/algorithms/swedish.sbl
+++ b/algorithms/swedish.sbl
@@ -9,7 +9,7 @@ externals ( stem )
 
 integers ( p1 x )
 
-groupings ( v s_ending )
+groupings ( v s_ending ost_ending )
 
 stringescapes {}
 
@@ -23,6 +23,9 @@ define v 'aeiouy{a"}{ao}{o"}'
 
 define s_ending  'bcdfghjklmnoprtvy'
 
+define ost_ending 'iklnptuv'
+//define ost_ending 'iklnprtuv'
+
 define mark_regions as (
 
     $p1 = limit
@@ -52,10 +55,10 @@ backwardmode (
         and ([next] delete)
     )
 
-    define other_suffix as setlimit tomark p1 for (
-        [substring] among(
+    define other_suffix as ( setlimit tomark p1 for (
+        [substring] ) among(
             'lig' 'ig' 'els' (delete)
-            'l{o"}st'        (<-'l{o"}s')
+            '{o"}st'         (ost_ending <-'{o"}s')
             'fullt'          (<-'full')
         )
     )
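
As a rough Python model of what the "-öst" entry in this diff does: the "öst" suffix must lie entirely in R1, while the preceding character is tested against the grouping without any R1 restriction. This sketch covers only that one entry of the final step and assumes it is given the proto-stem left by the earlier steps; `r1_start` and `ost_rule` are hypothetical names, and the short-word handling in `r1_start` is an assumption.

```python
VOWELS = set("aeiouyäåö")
OST_ENDING = set("iklnptuv")             # grouping from the diff
OST_ENDING_WITH_R = OST_ENDING | {"r"}   # the commented-out variant


def r1_start(word: str) -> int:
    """Index where R1 starts: after the first non-vowel following a vowel, at least 3."""
    if len(word) < 3:
        return len(word)  # assumption: R1 is empty for very short words
    p1 = len(word)
    seen_vowel = False
    for i, ch in enumerate(word):
        if ch in VOWELS:
            seen_vowel = True
        elif seen_vowel:
            p1 = i + 1
            break
    return max(p1, 3)


def ost_rule(proto_stem: str, grouping=OST_ENDING) -> str:
    """Replace a final "öst" with "ös" when the conditions in the diff hold."""
    if not proto_stem.endswith("öst"):
        return proto_stem
    start = len(proto_stem) - 3
    if start < r1_start(proto_stem):
        return proto_stem  # suffix is not entirely within R1
    if start == 0 or proto_stem[start - 1] not in grouping:
        return proto_stem  # preceding character is not in the grouping
    return proto_stem[:-1]


for w in ["nervöst", "visköst", "röst", "sydöst", "generöst"]:
    print(w, "->", ost_rule(w), "/", ost_rule(w, OST_ENDING_WITH_R))
```

Here "nervöst" and "visköst" lose the "t" under both groupings, "röst" and "sydöst" are left alone by both, and "generöst" only changes with the with-"r" grouping.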

I've been working on a script to analyse stemmer output before and after a change. I tried the no-r and r variants on a huge wordlist of the 988946 words which occur twice or more in Swedish wikipedia and the analysis reported is:

  • Merging stem groups:
    { amoröst }
    { amorös, amorösa, amoröse }
    -> { amorös, amorösa, amoröse, amoröst }
  • Merging stem groups:
    { fibröst }
    { fibrös, fibrösa }
    -> { fibrös, fibrösa, fibröst }
  • Merging stem groups:
    { generöst }
    { generös, generösa, generösare, generöse }
    -> { generös, generösa, generösare, generöse, generöst }
  • Merging stem groups:
    { glamoröst }
    { glamorös, glamorösa, glamorösare }
    -> { glamorös, glamorösa, glamorösare, glamoröst }
  • Merging stem groups:
    { glamouröst }
    { glamourös, glamourösa }
    -> { glamourös, glamourösa, glamouröst }
  • Merging stem groups:
    { högporöst }
    { högporösa }
    -> { högporösa, högporöst }
  • Merging stem groups:
    { oneröst }
    { onerös, onerösa }
    -> { onerös, onerösa, oneröst }
  • Merging stem groups:
    { polyamoröst }
    { polyamorösa }
    -> { polyamorösa, polyamoröst }
  • Merging stem groups:
    { poröst }
    { porös, porösa, porösare }
    -> { porös, porösa, porösare, poröst }
  • Merging stem groups:
    { rigoröst }
    { rigorös, rigorösa }
    -> { rigorös, rigorösa, rigoröst }
  • Merging stem groups:
    { seröst }
    { serös, serösa }
    -> { serös, serösa, seröst }
  • Merging stem groups:
    { överöst, överösta, överöste, överöstes, överösts }
    { överösa, överösas, överöser, överöses }
    -> { överösa, överösas, överöser, överöses, överöst, överösta, överöste, överöstes, överösts }

284 words stemmed differently
12 groups of stems merged
0 words changed stem group
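
The kind of before/after comparison reported above could be sketched in Python roughly as follows. This is not the actual analysis script, just an illustration of the idea: group words by stem under each version of the stemmer, then report groups in the new output that are exactly the union of several old groups. All function names are hypothetical, and the quadratic search is only suitable for small inputs.

```python
from collections import defaultdict


def stem_groups(stems: dict) -> set:
    """Map {word: stem} to a set of frozensets of words sharing a stem."""
    groups = defaultdict(set)
    for word, stem in stems.items():
        groups[stem].add(word)
    return {frozenset(g) for g in groups.values()}


def merged_groups(before: dict, after: dict) -> list:
    """List (old groups, new group) for each new group formed by merging old ones."""
    old, new = stem_groups(before), stem_groups(after)
    merges = []
    for group in new - old:
        parts = [g for g in old if g <= group]
        if len(parts) > 1 and frozenset().union(*parts) == group:
            merges.append((sorted(sorted(p) for p in parts), sorted(group)))
    return sorted(merges, key=lambda m: m[1])


def changed_words(before: dict, after: dict) -> list:
    """Words whose stem differs between the two runs."""
    return sorted(w for w in before if after.get(w) != before[w])


# Tiny illustrative input based on one of the merges listed above.
before = {"porös": "porös", "porösa": "porös", "poröst": "poröst"}
after = {"porös": "porös", "porösa": "porös", "poröst": "porös"}
for parts, group in merged_groups(before, after):
    print("Merging stem groups:", parts, "->", group)
print(len(changed_words(before, after)), "words stemmed differently")
```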

So including "r" here looks good to me, but I'd appreciate feedback from someone fluent in Swedish.

@znakeeye
Author

znakeeye commented Oct 4, 2023

Looks good! But not entirely sure about "r".

Words ending with "röst" could either be a "-rös" word or "röst" (voice). E.g.:

basröst (deep voice)
gumröst (old lady's voice)
talröst (speaking voice)
...
There are a few

I guess you could say that the stem is e.g. "basrös" and "talrös". I would expect it to be unique, as it cannot collide with a "-rös" word.

It's probably up to you to decide what you consider a stem. I.e. Would we be fine with "voice" being stemmed as "voi"? If yes, then "röst" may be stemmed as "rös". Except for the word "röst" itself!

@ojwb
Member

ojwb commented Oct 4, 2023

> It's probably up to you to decide what you consider a stem. I.e. Would we be fine with "voice" being stemmed as "voi"?

While the stems almost all look a lot like the word they're a stem of, and often actually are what you might think of as the linguistic root, we aren't actually aiming to produce the linguistic root, so "voice" to "voi" would be OK and in fact the English stemmer currently stems "voice" -> "voic" which kind of illustrates this point.

(The underlying reason for this is that a final "e" is elided in English when adding some suffixes and when removing these we can for example easily reduce "voicing" -> "voic", but it's then hard to come up with a rule to know whether to append an "e" to the stem after removing "-ing", and much easier to write a rule to remove the "-e" when stemming "voice".)

So the cases you highlight are fine as long as these stems don't collide with the stems of unrelated words, and even my overlarge wordlist didn't uncover any cases where they did. The only caveat there is that while the vocabulary of wikipedia should be fairly broad, it may be lacking words which are only used in particular regions or dialects, have fallen out of usage (but could be encountered while indexing a data set including older documents), etc.

> If yes, then "röst" may be stemmed as "rös". Except for the word "röst" itself!

Yes, and "röst" is excluded by the R1 requirement.

Thanks for the feedback - I'll get this change merged.

ojwb added a commit to snowballstem/snowball-data that referenced this issue Oct 4, 2023
snowballstem/snowball#152 improves the
handling of -öst endings.  This change expands the test data to
fully cover the changes, and updated the expected output.
@ojwb ojwb closed this as completed in 6bfccb8 Oct 4, 2023
ojwb added a commit to snowballstem/snowball-website that referenced this issue Oct 4, 2023
This now reflects the -öst suffix changes from
snowballstem/snowball#152

2 participants