do we need default files for a dictionary and stop words list #271

Closed
peterrobinson opened this issue Jul 25, 2023 · 5 comments

peterrobinson commented Jul 25, 2023

My adventures with static search continue: you can read more about them at my Scholarly Digital Editions blog, especially the sequence beginning at the Endings Project and the Canterbury Tales project. Here I will post particular issues that the grand gurus of static search may want to consider.

In setting up static search for my project, I found that it would not run unless I had set values for <stopwordsFile> and <dictionaryFile> in the config file. Further, these had to point at actual files present in the root (or related) directory in my project folder, thus:

    `<stopwordsFile>test_stopwords.txt</stopwordsFile>`
    `<dictionaryFile>english_words.txt</dictionaryFile>`

Then, I had to physically include the files test_stopwords.txt and english_words.txt in my project folder for static search to run successfully.

This seems to me to be a candidate for default values, to avoid being forced to choose something (anything!) to make the build work.
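Until defaults exist, the requirement described above can be checked before starting a build. The following is a minimal sketch, not part of staticSearch: `check_word_files` is a hypothetical helper, it matches the two element names wherever they appear in the config, and it assumes the paths are resolved relative to the config file's own directory.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# The two config elements discussed in this issue.
REQUIRED = ("stopwordsFile", "dictionaryFile")


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def check_word_files(config_path):
    """Return a list of problems; an empty list means both files were found.

    Assumption: file paths are relative to the directory holding the config.
    """
    config_path = Path(config_path)
    root = ET.parse(config_path).getroot()

    # Collect the non-empty text of each required element.
    found = {}
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # skip anything that is not an ordinary element
        name = local_name(el.tag)
        if name in REQUIRED and el.text and el.text.strip():
            found[name] = el.text.strip()

    problems = []
    for name in REQUIRED:
        if name not in found:
            problems.append(f"<{name}> is missing or empty")
        elif not (config_path.parent / found[name]).is_file():
            problems.append(f"<{name}> points at a missing file: {found[name]}")
    return problems
```

Calling `check_word_files("config.xml")` (the config file name is a placeholder) and printing any returned messages gives a clearer error than a failed build.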

@martindholmes
Collaborator

It's a bug that the build fails if nothing is supplied for these files. It would be quite rare not to want a stoplist or a wordlist; the only context I can imagine for not wanting a stoplist is a dictionary, where words like "in", "at", or "here" might well be searched for. Issue #273 proposes a different solution to that, though implementing it will depend on a complete reworking of the tokenizing process, which @joeytakeda is thinking about now.

So I think the solution here is:

  1. Add these two config items to the list of mandatory items.
  2. Document how to create and use empty files if you do want to avoid a stoplist.
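For step 2, creating the empty files could look like the sketch below. The file names `empty_stopwords.txt` and `empty_words.txt` and the helper `create_empty_word_files` are hypothetical, and the sketch assumes that zero-byte files are acceptable to the build, which this thread implies but does not confirm.

```python
from pathlib import Path


def create_empty_word_files(project_dir,
                            names=("empty_stopwords.txt", "empty_words.txt")):
    """Create zero-byte placeholder files in project_dir and return their paths.

    The default names are illustrative only; use whatever the config points at.
    """
    project_dir = Path(project_dir)
    created = []
    for name in names:
        path = project_dir / name
        # touch() creates an empty file; if the file already exists,
        # its content is left unchanged.
        path.touch(exist_ok=True)
        created.append(path)
    return created
```

The config would then reference these files in the same way as any other list, for example `<stopwordsFile>empty_stopwords.txt</stopwordsFile>` and `<dictionaryFile>empty_words.txt</dictionaryFile>`.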

@martindholmes martindholmes self-assigned this Sep 21, 2023
@martindholmes
Collaborator

This should be a patch to the 1.4 release branch and also implemented in dev.

martindholmes added a commit that referenced this issue Sep 25, 2023
@joeytakeda
Contributor

I think part of what @peterrobinson is asking for (and apologies if I'm misrepresenting here) is that the config file should be as minimal as possible in order to get staticSearch up and running—if you don't specify a stopwords element, then you just get whatever staticSearch thinks you should use (i.e. xsl/english_words.txt).

We've gone back and forth on the problem of default values (see #195 ) and I think this is a good case for stating that stopwords shouldn't be mandatory at all.

In terms of the dictionaryFile — I personally think we should just get rid of it entirely; it's only used when generating the report, but does create additional files and overhead unnecessarily, imo.

joeytakeda added a commit that referenced this issue Sep 27, 2023
Fixes for issue #271, and a sequence wrapper for a content model in t…
@martindholmes martindholmes added this to the Release 1.4.x milestone Oct 4, 2023
@martindholmes
Collaborator

Just adding a reminder that the documentation will need substantial changes arising out of the decisions made here. The 1.4.5 documentation is updated for the mandatory status of those elements, but it seems likely that the dictionaryFile may be unnecessary and the stopwordsFile optional in 2.0. However, it's worth remembering that our documentation suggests that you might create or modify your stopwords file based on the output of the report generator, and that depends on the dictionary file at the moment.

@martindholmes
Collaborator

I'm going to close this issue because there are substantial changes affecting this in version 2.0; meanwhile, we do not plan anything other than bugfixes for the release-1.4 branch.
