Allow specification by XPath of elements where the stopword list should be ignored when indexing #273

Open
martindholmes opened this issue Aug 22, 2023 · 1 comment
Labels
enhancement New feature or request
@martindholmes
Collaborator

Working on a couple of indigenous language dictionaries, we've encountered an intriguing problem. It's perfectly legitimate for a user/learner of the language to want to search for the equivalent in that language of a common English word that happens to be in the stopword list. If you're learning prepositions of location, for example, you would obviously want to search for "at", "in", "on", etc.

However, if we just nuke these items from the stopword list, we'll end up with a massive index, and most of the hits will not be relevant to the search.

I think the solution here is to have a config file component which allows you to specify, through XPath, elements where the stopword list will be ignored when indexing. For example, a <gloss> element inside a dictionary entry can be assumed to contain the English gloss for a term, so it could be indexed without the stopword list being invoked, generating an index entry for "in" if it contains that word. Instances of the stopwords would still be ignored in all other contexts as normal.
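To make the proposal concrete, here is a minimal Python sketch of the indexing side. It is not how the project's indexer works; `STOPWORDS`, `EXEMPT_XPATHS`, and `index_tokens` are hypothetical names invented for illustration, and the paths use the limited XPath subset supported by Python's `xml.etree.ElementTree` rather than full XPath.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical values for illustration; not taken from any real project config.
STOPWORDS = {"a", "at", "in", "on", "the"}
EXEMPT_XPATHS = [".//gloss"]  # contexts where the stopword list is ignored


def index_tokens(xml_text, stopwords=STOPWORDS, exempt_xpaths=EXEMPT_XPATHS):
    """Return the set of tokens that would receive index entries."""
    root = ET.fromstring(xml_text)
    # Collect every element matched by one of the configured paths.
    exempt_roots = {el for xp in exempt_xpaths for el in root.findall(xp)}
    tokens = set()

    def add_text(text, exempt):
        for tok in re.findall(r"\w+", (text or "").lower()):
            # Stopwords are dropped unless we are inside an exempt context.
            if exempt or tok not in stopwords:
                tokens.add(tok)

    def walk(el, exempt):
        exempt = exempt or el in exempt_roots
        add_text(el.text, exempt)
        for child in el:
            walk(child, exempt)
            # Tail text follows the child but belongs to the parent's context.
            add_text(child.tail, exempt)

    walk(root, False)
    return tokens
```

Given `<entry><def>found in the house</def><gloss>in</gloss></entry>`, the "in" and "the" inside `<def>` are dropped as usual, while the "in" inside `<gloss>` is indexed.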

This doesn't seem like it would be too difficult. The only bit I haven't figured out is how to carry over this functionality to the JavaScript. Maybe all we need to do for a case like this is not use the stopword list at all in the search, on the assumption that there's no penalty when a common word is searched for: if there's a stem file for it, then good -- it will have been constructed only from the specially-defined contexts, and shouldn't be too large -- and if there isn't, then the search just fails.
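The search-side idea can be sketched in the same way. This is Python standing in for the JavaScript, under loud assumptions: one JSON file per stem in a directory, an `"instances"` key holding the hits, and lowercasing as a placeholder for real stemming. `search_term` and the file layout are hypothetical.

```python
import json
from pathlib import Path


def search_term(term, stem_dir):
    """Look up a term with no stopword check at all.

    A hit is returned only if a stem file exists; otherwise the search
    simply fails with an empty result.
    """
    stem = term.lower()  # placeholder for a real stemmer (assumption)
    stem_file = Path(stem_dir) / f"{stem}.json"  # hypothetical file layout
    if not stem_file.is_file():
        return []
    with stem_file.open(encoding="utf-8") as fh:
        return json.load(fh)["instances"]  # hypothetical key name
```

With this approach a common word costs one failed file lookup when no exempt context ever produced a stem file for it, so the search code needs no copy of the stopword list.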

@joeytakeda Any thoughts?

@martindholmes martindholmes added the enhancement New feature or request label Aug 22, 2023
@martindholmes
Collaborator Author

After discussion, we will wait until we actually have a project that can't solve this problem simply by using an empty stopword list. If we do implement it, we should do it through contexts.
