add the ~ operators from vanilla Poliqarp #119

bansp · 2022-09-07T11:57:46Z

TL;DR: consider introducing the ~ family of operators, which search among non-disambiguated morphological values as opposed to those disambiguated in the morphosyntactic context. A potential training ground for that is the new NKJP-SGJP dataset, which is being converted from TEI to KorAP XML.

Below, I paste a longer passage from Eureco meeting materials:

“A unique feature of Poliqarp is that it may be used for searching corpora containing, in addition to disambiguated interpretations, information about all possible morphosyntactic interpretations given by the morphological analyser. For example, the query [case~acc] finds all segments with an accusative interpretation (even if this is not the interpretation selected in a given context), while [case=acc] finds segments which were disambiguated to accusative in a given context.

Moreover, Poliqarp does not make the assumption that only one interpretation must be correct for any given segment; some examples of sentences containing an ambiguous segment which cannot be uniquely disambiguated even given unlimited context and all the linguistic and encyclopaedic knowledge are cited in (Przepiórkowski et al., 2004). In such cases, the = operator has the existential meaning, i.e., [case=acc] finds segments with at least one accusative interpretation marked as correct in the context (“disambiguated”). On the other hand, the operator == is universal, i.e., [case==acc] finds segments whose all disambiguated interpretations are accusative: segments which were truly uniquely disambiguated to one (accusative) interpretation, or segments which have many interpretations correct in the context, but all of them are accusative. For completeness, the operator ~~ is added, which universally applies to all morphosyntactic interpretations, i.e., [case~~acc] finds segments whose all interpretations as given by a morphological analyser (before disambiguation) are accusative.”

Source of the quote: https://dl.acm.org/doi/pdf/10.5555/1557769.1557795
“Poliqarp: an open source corpus indexer and search engine with syntactic extensions”, by Janus and Przepiórkowski
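
The four operators described in the quote can be illustrated with a minimal sketch. It assumes a hypothetical in-memory model in which a segment is a list of interpretations, each carrying a `case` value and a `disamb` flag; the names `Interp`, `op_eq`, `op_eq_all`, `op_sim` and `op_sim_all` are invented for illustration and are not Poliqarp or KorAP API. The treatment of segments without any interpretation (they never match the universal operators) is also an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interp:
    """One morphosyntactic interpretation of a segment (hypothetical model)."""
    case: str
    disamb: bool  # True if marked as correct in the given context


def op_eq(segment: List[Interp], value: str) -> bool:
    """[case=value]: at least one disambiguated interpretation has this case."""
    return any(i.case == value for i in segment if i.disamb)


def op_eq_all(segment: List[Interp], value: str) -> bool:
    """[case==value]: all disambiguated interpretations have this case."""
    chosen = [i for i in segment if i.disamb]
    # Assumption: a segment with no disambiguated interpretation does not match.
    return bool(chosen) and all(i.case == value for i in chosen)


def op_sim(segment: List[Interp], value: str) -> bool:
    """[case~value]: at least one interpretation, chosen or not, has this case."""
    return any(i.case == value for i in segment)


def op_sim_all(segment: List[Interp], value: str) -> bool:
    """[case~~value]: all interpretations before disambiguation have this case."""
    return bool(segment) and all(i.case == value for i in segment)


# A segment with nom/acc/voc readings, disambiguated to nominative.
segment = [
    Interp("nom", True),
    Interp("acc", False),
    Interp("voc", False),
]

print(op_eq(segment, "acc"), op_sim(segment, "acc"))
print(op_eq_all(segment, "nom"), op_sim_all(segment, "nom"))
```

The point of the sketch is only the quantifier pattern: `=` and `~` are existential, `==` and `~~` are universal, and the `~` pair ranges over all interpretations rather than only the disambiguated ones.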

So at this point, the morphological info has two basic parts. One is the “traditional” part (`<f name="lex">`) with the added “translit” container, renamed to “orig”:

            <f name="lex"><!-- _zdarza-->  <!-- this is the “orth”, just for testing, can be suppressed -->
               <fs>
                  <f name="orig">zdarza</f>  <!-- the original spelling, maybe with typos -->
                  <f name="lemma">zdarzać</f>  <!-- the “base” in Poliqarpish -->
                  <f name="pos">fin</f>
                  <f name="msd">sg:ter:imperf</f>  <!-- morphosyntactic info, may be missing -->
               </fs>  <!-- recall that “orth” is recovered from the offsets -->
            </f>

The new part is `<f name="interps">`; the name of the feature was simply taken over from the original and stands for “interpretations”. It contains one or more alternatives encoded in `<fs type="alt">`. These are all the potential values of the given token before disambiguation.

            <f name="interps">
               <fs type="alt" n="choice">
                  <f name="lemma">doświadczenie</f>
                  <f name="pos">subst</f>
                  <f name="msd">
                     <vAlt>
                        <symbol value="sg:nom:n:ncol" n="choice"/>
                        <symbol value="sg:acc:n:ncol"/>
                        <symbol value="sg:voc:n:ncol"/>
                     </vAlt>
                  </f>
               </fs>
               <fs type="alt">
                  <f name="lemma">doświadczyć</f>
                  <f name="pos">ger</f>
                  <f name="msd">
                     <vAlt>
                        <symbol value="sg:nom:n:perf:aff"/>
                        <symbol value="sg:acc:n:perf:aff"/>
                     </vAlt>
                  </f>
               </fs>
            </f>

Notice that there are two sets of alternatives: one is at the lexical level (`<fs type="alt">`), and the other, within a single lexical hypothesis, involves a set of alternative morphosyntactic descriptions, contained inside `<vAlt>` (which is TEI-speak for “alternative values”).
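
To make the two levels of alternatives concrete, here is a minimal sketch that flattens the `interps` snippet above into one row per (lexical hypothesis, morphosyntactic description) pair, using Python's standard `xml.etree.ElementTree`. The function name `iter_interpretations` is invented for illustration, and the sketch assumes the fragment carries no XML namespace, as in the snippet; a real TEI file would probably need namespace handling.

```python
import xml.etree.ElementTree as ET

# The <f name="interps"> fragment quoted above, without namespaces.
INTERPS_XML = """
<f name="interps">
   <fs type="alt" n="choice">
      <f name="lemma">doświadczenie</f>
      <f name="pos">subst</f>
      <f name="msd">
         <vAlt>
            <symbol value="sg:nom:n:ncol" n="choice"/>
            <symbol value="sg:acc:n:ncol"/>
            <symbol value="sg:voc:n:ncol"/>
         </vAlt>
      </f>
   </fs>
   <fs type="alt">
      <f name="lemma">doświadczyć</f>
      <f name="pos">ger</f>
      <f name="msd">
         <vAlt>
            <symbol value="sg:nom:n:perf:aff"/>
            <symbol value="sg:acc:n:perf:aff"/>
         </vAlt>
      </f>
   </fs>
</f>
"""


def iter_interpretations(interps):
    """Yield (lemma, pos, msd, chosen) for every lexical x msd alternative."""
    for fs in interps.findall("fs[@type='alt']"):
        feats = {f.get("name"): f for f in fs.findall("f")}
        lemma = feats["lemma"].text
        pos = feats["pos"].text
        fs_chosen = fs.get("n") == "choice"
        msd = feats.get("msd")
        symbols = list(msd.iter("symbol")) if msd is not None else []
        if not symbols:
            # No morphosyntactic description: one row per lexical hypothesis.
            yield (lemma, pos, None, fs_chosen)
        for sym in symbols:
            # A row counts as chosen only if both levels are marked n="choice".
            chosen = fs_chosen and sym.get("n") == "choice"
            yield (lemma, pos, sym.get("value"), chosen)


root = ET.fromstring(INTERPS_XML.strip())
rows = list(iter_interpretations(root))
for lemma, pos, msd, chosen in rows:
    print(lemma, pos, msd, "(chosen)" if chosen else "")
```

Flattening both levels into a single list is one possible design; it loses the grouping by lemma but gives exactly the set of interpretations that existential and universal operators would range over.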

So, `<f name="lex">` is post-disambiguation, and in the last case at hand, it looks as follows:
```xml
            <f name="lex">
               <fs>
                  <f name="orig">doświadczenie</f>
                  <f name="lemma">doświadczenie</f>
                  <f name="pos">subst</f>
                  <f name="msd">sg:nom:n:ncol</f>
               </fs>
            </f>
```

and `<f name="interps">` is pre-disambiguation, as described in the above quote from the article by Janus and Przepiórkowski. Redundantly, in the pre-disambiguation part, I have marked the eventual choices with the extra attribute `n="choice"`, which points at the same info as what `<f name="lex">` contains.

Note:

- the NKJP2KorAP dataset will surely be linked from here when ready
- the feature values are going to get modified into attribute:value pairs (at this time, attribute names are only implied by the NKJP schema, not provided here)
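
As a rough illustration of what such a conversion could look like, the sketch below expands a colon-separated `msd` string into attribute:value pairs by looking up each value in a table. The table `ATTRIBUTE_OF` and the function `msd_to_pairs` are hypothetical: the attribute names are guesses based on the values appearing in the snippets above, not names taken from the NKJP schema or from the planned conversion, and the label `collectivity` in particular is an assumption.

```python
# Hypothetical value-to-attribute table covering only the values used in this
# issue; the attribute names are illustrative assumptions, not schema names.
ATTRIBUTE_OF = {
    "sg": "number", "pl": "number",
    "nom": "case", "acc": "case", "voc": "case",
    "n": "gender",
    "ter": "person",
    "imperf": "aspect", "perf": "aspect",
    "aff": "negation",
    "ncol": "collectivity",
}


def msd_to_pairs(msd: str) -> list:
    """Turn 'sg:nom:n:ncol' into ['number:sg', 'case:nom', ...]."""
    pairs = []
    for value in msd.split(":"):
        # Values missing from the table are kept under an explicit placeholder.
        attribute = ATTRIBUTE_OF.get(value, "unknown")
        pairs.append(f"{attribute}:{value}")
    return pairs


print(msd_to_pairs("sg:nom:n:ncol"))
print(msd_to_pairs("sg:ter:imperf"))
```

A value-driven lookup works here because the values in these snippets are unambiguous across attributes; a positional mapping per part of speech would be the alternative if that did not hold for the full tagset.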


Akron · 2023-02-06T16:17:14Z

I am thinking about how to implement this in a universally useful way.

For the moment, I would guess we have to add further layers for interpretations, like "pv" for POS variants and "mv" for morphosyntactic variants.
Then

[case=acc] -> [nkjp/m=case:acc]
[case~acc] -> [nkjp/m=case:acc | nkjp/mv=case:acc]
[case==acc] -> exclude([nkjp/m=case:acc],[nkjp/mv=case:.*])
[case~~acc] -> exclude([nkjp/m=case:acc],[nkjp/mv=case:.*])|[nkjp/m=case:acc & nkjp/mv=case:acc]

where exclude() is the negative match operator, matching at the span of the first operand whenever no second operand has the same span.

Is that correct?
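
The `exclude()` semantics described above can be sketched as follows. This is only an illustration under the assumption that matches are represented as `(start, end)` token spans; the representation and the function below are hypothetical and do not reflect KorAP's internal implementation.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, a hypothetical representation


def exclude(first: Iterable[Span], second: Iterable[Span]) -> List[Span]:
    """Keep every span of `first` that no match in `second` shares exactly."""
    blocked = set(second)
    return [span for span in first if span not in blocked]


# Spans matched by a first operand such as [nkjp/m=case:acc] ...
m_acc = [(0, 1), (3, 4), (7, 8)]
# ... and spans matched by a second operand such as [nkjp/mv=case:.*].
mv_any = [(3, 4), (5, 6)]

print(exclude(m_acc, mv_any))
```

Only spans that are identical in both start and end are removed; a span of the second operand that merely overlaps a span of the first would not block it under this reading.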

bansp added the enhancement label Sep 7, 2022

Akron transferred this issue from KorAP/Kalamar Sep 7, 2022
