Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconfigure search parameters in config file #195

Open
martindholmes opened this issue Jan 26, 2022 · 13 comments
Open

Reconfigure search parameters in config file #195

martindholmes opened this issue Jan 26, 2022 · 13 comments
Labels
question Further information is requested release-blocker The dev branch must not be merged into master while a release-blocker bug is open. validation and diagnostics Work to be done on the schema, schematron, or the diagnostics
Milestone

Comments

@martindholmes
Copy link
Collaborator

martindholmes commented Jan 26, 2022

If, as argued in #194 and originally arising out of #115, schema-based default values for configuration items are not practical or desirable, another option for making our use of de facto defaults in config processing more explicit would be to make more -- or all -- configuration elements mandatory. I'm actually in favour of this. It would mean that we could provide a single straightforward template for a config file (at least, for the initial option collection -- and we could also provide commented-out examples for all the other components). We could then validate the config file and simply fail the build if anything is missing, forcing users to see all the config they're using. We could have an initial deprecation period in which we would simply warn about missing elements, but then in the subsequent release start failing builds. The interdependence of various config options could also then be tested with Schematron, so incompatibilities would be explicit and easily understood by the user.

For cases where one option makes another redundant, the latter could have an explicit value of "redundant because X".

This would also mean that there would be less pressure on the documentation to explain any mysterious defaulting that might take place in the background to handle a missing element; there could be no missing element, so regular documentation of the values would be sufficient.

@joeytakeda
Copy link
Contributor

I think I'm convinced that we should make these mandatory. My only thought is whether some of these config options are necessary at all—that said, I don't want to spin this ticket out into a broader discussion of particular elements. So, for now, I'll just ask: if we are to cull config options, should we deprecated them at the same time that we make all options mandatory or should we do it after? I'd say we deprecated them at the same time so that whenever the rule comes into effect, the mandatory elements are ones we actually want.

@martindholmes
Copy link
Collaborator Author

I agree: deprecate unnecessary elements and make the others mandatory at the same time.

Thinking about interdependent config items: could we unify them into single elements with a more sophisticated and explicit range of options?

@joeytakeda
Copy link
Contributor

joeytakeda commented Jan 26, 2022

For our reference, here's a table of the child elements of <params> noting whether that option depends on anything and whether it's already mandatory.

Element Depends Currently Mandatory
<searchFile>
<versionFile>
<stemmerFolder>
<recurse>
<minWordLength>
<createContexts>
<linkToFragmentId> <createContexts> = true
<scrollToTextFragment> <createContexts> = true
<phrasalSearch> <createContexts> = true
<wildcardSearch> <createContexts> = true (and possibly <phrasalSearch> = true)
<maxKwicsToShow> <createContexts> = true
<totalKwicLength> <createContexts> = true
<kwicTruncateString> <createContexts> = true
<maxKwicsToHarvest> <createContexts> = true , <phrasalSearch> = false, <wildcardSearch> = false
<scoringAlgorithm>
<verbose>
<stopwordsFile>
<dictionaryFile>
<indentJSON>
<outputFolder>
<resultsPerPage>
<resultsLimit>

@joeytakeda
Copy link
Contributor

In terms of interdependence, it all comes back to <createContexts>. My feeling here is that not creating contexts is going to be fairly rare—I can't imagine many cases where resources are so limited that you wouldn't want to store any contexts at all. That said, I also don't want there to be a case where someone has a completely valid config, but we have schematron warning them about some option they've been forced to set.

Brainstorming a few possible solutions:

  1. We keep the structure the same, but use schematron to dictate the content model. That's simplest, but the least helpful when it comes to editing in oXygen, of course.

  2. We could transform createContexts into an attribute on <params>, which would then dictate the content model. If @createContexts is true, then linkToFragmentId, scrollToTextFragment, phrasalSearch, wildcardSearch, maxKwicsToShow, totalKwicLength, and kwicTruncateString are all mandatory; otherwise, they're forbidden. We still have the issue where maxKwicsToHarvest is superfluous unless phrasalSearch and wildcardSearch are both set to false, but it's mostly just annoying.

  3. The elements that depend on <createContexts> could be child elements on <createContexts>: we'd have to put some sort of boolean attribute on <createContexts> so that it could dictate the content model. if <createContexts value="false"/> then it must be empty; otherwise, you must put in all of the other elements. maxKwicsToHarvest could then just be an attribute that would have no effect unless createContexts/phrasalSearch and createContexts/wildcardSearch are both set to false (with schematron warnings etc that say that)

Option 3 is the most disruptive and introduces some nesting, but it is also the most clear in my opinion: those options all have to do with the creation of contexts, so it makes sense to nest them. Option 2 is less disruptive, but introduces an attribute and still has the superfluous option. Option 1 is the least disruptive, but not particularly helpful for those wanting content completion.

@martindholmes
Copy link
Collaborator Author

Thanks for this detailed breakdown. I like option 3. I also think that a) we should deprecate scrollToTextFragment and get rid of the related code, because it doesn't show signs of taking hold; b) we could get rid of linkToFragmentId because we should just always do that if createContexts is true.

@joeytakeda
Copy link
Contributor

I agree re: deprecating scrollToTextFragment and linkToFragmentId. If we are deprecating options, I think indentJSON could be deprecated (mea culpa since I know I advocated for it) and verbose could be deprecated and superseded by a parameter sent to ant since it's annoying to have to change the verbose parameter in the config to get verbose logging when trying to debug something in a single place.

@joeytakeda joeytakeda added the validation and diagnostics Work to be done on the schema, schematron, or the diagnostics label Feb 5, 2022
@joeytakeda
Copy link
Contributor

joeytakeda commented Feb 9, 2022

Attached to this ticket is a zip archive that contains an ODD, a produced RNG, and some sample configuration files that model the changes proposed based on the above discussion. For my own edification, I did this via ODD chaining (using the GitHub repo ODD as the base and modifying) to show exactly what we're changing, which worked precisely as advertised—of course when we actually do this, we'll do this in the real ODD, but it was easier to show the changes up for discussion.

staticSearchSchemaFor195.zip

Here's a sample of the ODD (with schematron removed for brevity):

      <elementSpec ident="params" ns="http://hcmc.uvic.ca/ns/staticSearch" module="ss" mode="change">
          <content>
            <!--Order isn't preserved, but
              listed here in logical groups anyway-->
            <sequence preserveOrder="false">
              <!--Input stuff-->
              <elementRef key="searchFile"/>
              <elementRef key="recurse"/>
              <elementRef key="versionFile"/>
              <elementRef key="stemmerFolder"/>
              <elementRef key="stopwordsFile"/>
              <elementRef key="dictionaryFile"/>
              <!--Tokenizing/Parsing-->
              <elementRef key="scoringAlgorithm"/>
              <elementRef key="createContexts"/>
              <!--Output stuff-->
              <elementRef key="outputFolder"/>
              <elementRef key="resultsPerPage"/>
              <elementRef key="resultsLimit"/>
            </sequence>
          </content>
        </elementSpec>
        
        <elementSpec ident="createContexts" ns="http://hcmc.uvic.ca/ns/staticSearch" module="ss" mode="change">
          <content>
            <!--Could also be <alternate> with <empty/>, I think-->
            <sequence minOccurs="0" maxOccurs="1">
              <elementRef key="phrasalSearch"/>
              <elementRef key="wildcardSearch"/>
              <elementRef key="maxKwicsToHarvest"
                minOccurs="0" maxOccurs="1"/>
              <elementRef key="maxKwicsToShow"/>
              <elementRef key="totalKwicLength"/>
              <elementRef key="kwicTruncateString"/>
              <elementRef key="linkToFragmentId"/>
            </sequence>
          </content>
       <!--snip: bunch of constraintSpecs-->
          <attList>
            <attDef ident="value" usage="req" mode="add">
              <datatype>
                <dataRef name="boolean"/>
              </datatype>
            </attDef>
          </attList>
        </elementSpec>
        
        <!--DEPRECATIONS-->
        <elementSpec ident="verbose" module="ss" ns="http://hcmc.uvic.ca/ns/staticSearch" mode="delete"/>
        <elementSpec ident="indentJSON" module="ss" ns="http://hcmc.uvic.ca/ns/staticSearch" mode="delete"/>
        <elementSpec ident="scrollToTextFragment" module="ss" ns="http://hcmc.uvic.ca/ns/staticSearch" mode="delete"/>
        <elementSpec ident="linkToFragmentId" module="ss" ns="http://hcmc.uvic.ca/ns/staticSearch" mode="delete"/>
        

@martindholmes
Copy link
Collaborator Author

Thanks for this; I'm slightly surprised that the ODD-chaining worked, but that's good. :-)

I suggest we set everything to version 2 right now, and first do the deletions (deleting rather than deprecating; we can handle the mechanics of alerting users by having Ant do a quick check of the file version, and if it's not 2, then run a special XSLT to help users upgrade, then exit. So I don't think we actually need a deprecation period for these four specific things. As they're deleted, we can a) remove any handling for them from the XSLT, and b) add the info to the what's new page.

It's refreshing to be able to delete stuff from a project, rather than having all the old cruft just pile up with the new floating on top of it.

@joeytakeda
Copy link
Contributor

I'm going to do the deletions now in a separate branch--I'll open a ticket for that to keep things clean.

@joeytakeda
Copy link
Contributor

Arising from #270—we should make sure the documentation is clear about the mandatory elements and resolve the ambiguous language that params "requires two elements"

@peterrobinson
Copy link

I've added in #271 another case where the "default" behaviour of Static Search seems problematic. My two cents worth: version 2 (I think that is what we are talking about) should enact what the current version states: just two <params> should need to be explicitly declared in each project's config file, with default values implemented in the build. Any serious user is going to need to delve into various <params> and <rules> soon enough, and will find her or his own way to declaring these as needed. This approach would also spare people like me the initial puzzlement: why isn't this working?

@martindholmes
Copy link
Collaborator Author

This is the key deliverable for version 2.0 at this point. Once this is sorted, we can release, assuming no other breaking or blocking things emerge from testing with real sites.

@joeytakeda joeytakeda added the release-blocker The dev branch must not be merged into master while a release-blocker bug is open. label Sep 25, 2024
@joeytakeda joeytakeda changed the title Should we make more config items mandatory? Reconfigure search parameters in config file Oct 18, 2024
@joeytakeda
Copy link
Contributor

joeytakeda commented Oct 18, 2024

Following a very productive meeting today (Oct 18), @martindholmes and I have rethought this quite significantly so that each configuration element describes its purpose with properties specified as attributes — this means far fewer elements and more straightforward values with better ways of constraining values (and enforcing co-occurrence constraints). Here's what we're thinking:

    <params>
        <searchPage file="test/search.html"/>
        <index recurse="yes"/>
        <stopwords
            file="test/test_stopwords.txt"/>
        <dictionary
            file="xsl/english_words.txt"/>
        <tokenizer
            minWordLength="2"/>
        <scoringAlgorithm name="tf-idf"/>
        <stemmer
            dir="stemmers/en/"/>
        <contexts
            create="true"
            phrasalSearch="true" 
            wildcardSearch="true"
            maxKwicsToHarvest="5"
            maxKwicLength="15"
            kwicTruncateString="..."
            linkToFragmentId="true"
        />
        <results 
            resultsPerPage="5"
            maxKwicsToShow="#"/>
        <version file="test/VERSION"/>
        <outputDir name="ssTest"/>
    </params>
  • @file can be an attribute class
  • Some elements (like index), which are more generic, will allow for better future-proofing
  • outFolder => outDir
  • createContexts => contexts
  • totalKwicLength => maxKwicLength

This will be a major change, so we'll certainly need to make sure that we have clear guidance and mapping for upgrading from 1.x to 2.0 (and an automated process for doing so).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested release-blocker The dev branch must not be merged into master while a release-blocker bug is open. validation and diagnostics Work to be done on the schema, schematron, or the diagnostics
Projects
None yet
Development

No branches or pull requests

3 participants