Plan for multilingual stemming #296

Open
martindholmes opened this issue Apr 18, 2024 · 0 comments
Labels
enhancement New feature or request
@martindholmes
Collaborator

@joeytakeda and I discussed the problem of stemming multilingual texts today, and came up with what looks like a workable plan. This is a blue-sky enhancement: we don't intend it to be in 2.0, but it could come along afterwards.

  1. End-users may add @lang attributes to sections of text which are not in the default language of pages. (This would be good practice anyway.)
  2. Content in those contexts could be pre-stemmed, using span elements around words, each with a @data-ss-stem attribute holding the stem. This could be done by end users as part of their own build process or, if staticSearch has a stemmer for the language, by a staticSearch process that runs before the main language stemming/tokenizing.
  3. When the main stemming process encounters one of these spans, it simply ignores the content.
  4. At build time, a JSON file is created which is basically a lookup table from full forms to stems, for all the terms which have @data-ss-stem.
  5. The StaticSearch object looks for this file, and loads it if it exists. If it has loaded, then when running any text search, the JS first checks the lookup table for matches and uses any stems it finds. It also applies the default stemming to the term; this is necessary because it's not possible to know which language a particular search term is supposed to be in, and there are of course cases where words are the same across languages.
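
To make steps 4 and 5 a bit more concrete, here is a minimal sketch in JS. Everything in it is an assumption for illustration only: the function names (`buildStemLookup`, `stemsForQueryTerm`, `toyStemmer`), the JSON shape (each lowercased full form mapped to an array of stems), and the toy stemmer standing in for the real default stemmer. None of this exists in staticSearch, and the real build step would probably not be written in JS at all.

```javascript
// Hypothetical sketch only; these names are not part of staticSearch.

// Step 4 (build time): turn the pre-stemmed spans into a lookup table.
// Each entry stands in for <span data-ss-stem="STEM">FORM</span> in a page.
function buildStemLookup(preStemmedSpans) {
  // A null-prototype object avoids clashes with keys like "constructor".
  const lookup = Object.create(null);
  for (const { form, stem } of preStemmedSpans) {
    const key = form.toLowerCase();
    if (!(key in lookup)) {
      lookup[key] = [];
    }
    // One full form may carry different stems in different languages.
    if (!lookup[key].includes(stem)) {
      lookup[key].push(stem);
    }
  }
  return lookup; // JSON.stringify(lookup) would give the file contents
}

// Step 5 (search time): stems from the lookup table first, then the
// default stem, since we cannot know which language the user meant.
function stemsForQueryTerm(term, lookup, defaultStemmer) {
  const key = term.toLowerCase();
  const fromLookup = lookup && Object.hasOwn(lookup, key) ? lookup[key] : [];
  const stems = new Set(fromLookup);
  stems.add(defaultStemmer(term));
  return [...stems]; // the Set removes duplicates, keeping insertion order
}

// Toy stand-in for the site's default stemmer: lowercase, strip a final "s".
const toyStemmer = (word) => word.toLowerCase().replace(/s$/, "");

const exampleLookup = buildStemLookup([
  { form: "Chevaux", stem: "cheval" },
  { form: "chevaux", stem: "cheval" },
]);

console.log(stemsForQueryTerm("chevaux", exampleLookup, toyStemmer));
// → [ 'cheval', 'chevaux' ]
```

If the JSON file is missing, the same call with an empty or absent lookup just falls back to the default stem, which matches the "loads it if it exists" behaviour described in step 5.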

This is a good solution to the problem of sites with one dominant language and potentially many other languages that appear in quotes and so on. Truly multilingual sites should of course have multiple searches.

@martindholmes added the enhancement label on Apr 18, 2024
@martindholmes added this to the Blue sky milestone on Apr 18, 2024