add the ~ operators from vanilla Poliqarp #119

Open
bansp opened this issue Sep 7, 2022 · 1 comment

bansp commented Sep 7, 2022

TL;DR: consider introducing the ~ family of operators that search among non-disambiguated morphological values as opposed to those disambiguated in the morphosyntactic context. The potential training ground for that is the new NKJP-SGJP dataset (being) converted from TEI to KorAP XML.


Below, I paste a longer passage from Eureco meeting materials:

“A unique feature of Poliqarp is that it may be used for searching corpora containing, in addition to disambiguated interpretations, information about all possible morphosyntactic interpretations given by the morphological analyser. For example, the query [case~acc] finds all segments with an accusative interpretation (even if this is not the interpretation selected in a given context), while [case=acc] finds segments which were disambiguated to accusative in a given context.

Moreover, Poliqarp does not make the assumption that only one interpretation must be correct for any given segment; some examples of sentences containing an ambiguous segment which cannot be uniquely disambiguated even given unlimited context and all the linguistic and encyclopaedic knowledge are cited in (Przepiórkowski et al., 2004). In such cases, the = operator has the existential meaning, i.e., [case=acc] finds segments with at least one accusative interpretation marked as correct in the context (“disambiguated”). On the other hand, the operator == is universal, i.e., [case==acc] finds segments whose all disambiguated interpretations are accusative: segments which were truly uniquely disambiguated to one (accusative) interpretation, or segments which have many interpretations correct in the context, but all of them are accusative. For completeness, the operator ~~ is added, which universally applies to all morphosyntactic interpretations, i.e., [case~~acc] finds segments whose all interpretations as given by a morphological analyser (before disambiguation) are accusative.”

Source of the quote: https://dl.acm.org/doi/pdf/10.5555/1557769.1557795
“Poliqarp: an open source corpus indexer and search engine with syntactic extensions”, by Janus and Przepiórkowski
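
To make the four operators concrete, here is a minimal Python sketch of their semantics as described in the quote. The `Interp` class and the function names are hypothetical, made up for illustration; this is not Poliqarp or KorAP code. Treating a segment with no interpretations as a non-match for the universal operators is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interp:
    """One morphosyntactic interpretation of a segment (hypothetical model)."""
    case: str
    chosen: bool = False  # True if marked as correct in context ("disambiguated")


def eq(interps, value):
    """[case=acc]: at least one disambiguated interpretation has this case."""
    return any(i.case == value for i in interps if i.chosen)


def eq_all(interps, value):
    """[case==acc]: all disambiguated interpretations have this case."""
    chosen = [i for i in interps if i.chosen]
    # Assumption: a segment without any chosen interpretation does not match.
    return bool(chosen) and all(i.case == value for i in chosen)


def tilde(interps, value):
    """[case~acc]: at least one interpretation (chosen or not) has this case."""
    return any(i.case == value for i in interps)


def tilde_all(interps, value):
    """[case~~acc]: all interpretations (before disambiguation) have this case."""
    # Assumption: a segment without any interpretation does not match.
    return bool(interps) and all(i.case == value for i in interps)


# A segment analysed as nom/acc/voc and disambiguated to nominative.
segment = [Interp("nom", chosen=True), Interp("acc"), Interp("voc")]

print(eq(segment, "acc"))         # False: acc was not selected in context
print(tilde(segment, "acc"))      # True: acc is among the analyser's readings
print(eq_all(segment, "nom"))     # True: the only chosen reading is nom
print(tilde_all(segment, "nom"))  # False: acc and voc readings exist as well
```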

So at this point, the morphological info has two basic parts. One is the “traditional” part (`<f name="lex">`), with the added “translit” container renamed to “orig”:

            <f name="lex"><!-- _zdarza: this is the "orth", just for testing, can be suppressed -->
               <fs>
                  <f name="orig">zdarza</f>        <!-- the original spelling, maybe with typos -->
                  <f name="lemma">zdarzać</f>      <!-- the "base" in Poliqarpish -->
                  <f name="pos">fin</f>
                  <f name="msd">sg:ter:imperf</f>  <!-- morphosyntactic info, may be missing -->
               </fs>                               <!-- recall that "orth" is recovered from the offsets -->
            </f>

The new part is `<f name="interps">`. The name of the feature was simply taken over from the original; it stands for “interpretations”, of course. It contains one or more alternatives encoded in `<fs type="alt">`. These are all the potential values of the given token before disambiguation.

            <f name="interps">
               <fs type="alt" n="choice">
                  <f name="lemma">doświadczenie</f>
                  <f name="pos">subst</f>
                  <f name="msd">
                     <vAlt>
                        <symbol value="sg:nom:n:ncol" n="choice"/>
                        <symbol value="sg:acc:n:ncol"/>
                        <symbol value="sg:voc:n:ncol"/>
                     </vAlt>
                  </f>
               </fs>
               <fs type="alt">
                  <f name="lemma">doświadczyć</f>
                  <f name="pos">ger</f>
                  <f name="msd">
                     <vAlt>
                        <symbol value="sg:nom:n:perf:aff"/>
                        <symbol value="sg:acc:n:perf:aff"/>
                     </vAlt>
                  </f>
               </fs>
            </f>

Notice that there are two sets of alternatives: one is at the lexical level (`<fs type="alt">`), and the other, within a single lexical hypothesis, involves a set of alternative morphosyntactic descriptions, contained inside `<vAlt>` (which is TEI-speak for “alternative values”).
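
The two levels of alternatives can be flattened into one list of readings. The following is a sketch using Python's standard `xml.etree.ElementTree`; the function name `iter_interpretations` is hypothetical, and the code assumes that every `<f name="msd">` in `interps` holds its values as `<symbol>` elements, as in the sample above.

```python
import xml.etree.ElementTree as ET

# The sample from this issue, reproduced so that the sketch is self-contained.
SAMPLE = """
<f name="interps">
   <fs type="alt" n="choice">
      <f name="lemma">doświadczenie</f>
      <f name="pos">subst</f>
      <f name="msd">
         <vAlt>
            <symbol value="sg:nom:n:ncol" n="choice"/>
            <symbol value="sg:acc:n:ncol"/>
            <symbol value="sg:voc:n:ncol"/>
         </vAlt>
      </f>
   </fs>
   <fs type="alt">
      <f name="lemma">doświadczyć</f>
      <f name="pos">ger</f>
      <f name="msd">
         <vAlt>
            <symbol value="sg:nom:n:perf:aff"/>
            <symbol value="sg:acc:n:perf:aff"/>
         </vAlt>
      </f>
   </fs>
</f>
"""


def iter_interpretations(interps_elem):
    """Yield (lemma, pos, msd, chosen) for each lexical x msd alternative."""
    for fs in interps_elem.findall("fs[@type='alt']"):
        lemma = fs.findtext("f[@name='lemma']")
        pos = fs.findtext("f[@name='pos']")
        lexical_choice = fs.get("n") == "choice"
        msd = fs.find("f[@name='msd']")
        symbols = msd.iter("symbol") if msd is not None else []
        for symbol in symbols:
            # A reading counts as chosen only if both levels carry n="choice".
            chosen = lexical_choice and symbol.get("n") == "choice"
            yield lemma, pos, symbol.get("value"), chosen


rows = list(iter_interpretations(ET.fromstring(SAMPLE)))
for row in rows:
    print(row)
```

For the sample, this yields five readings (three for `subst`, two for `ger`), of which only `sg:nom:n:ncol` under the lemma doświadczenie is flagged as chosen.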

So, `<f name="lex">` is post-disambiguation, and in the last case at hand, it looks as follows:
```xml
            <f name="lex">
               <fs>
                  <f name="orig">doświadczenie</f>
                  <f name="lemma">doświadczenie</f>
                  <f name="pos">subst</f>
                  <f name="msd">sg:nom:n:ncol</f>
               </fs>
            </f>
```

and `<f name="interps">` is pre-disambiguation, as described in the above quote from the article by Janus and Przepiórkowski. Redundantly, in the pre-disambiguation part, I have marked the eventual choices with the extra attribute `n="choice"`, which points at the same info as what `<f name="lex">` contains.
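
Because the `n="choice"` marks are redundant with `<f name="lex">`, a converter could verify that the two agree. Below is a sketch of such a check; the enclosing `<fs>` wrapper and the helper names are assumptions made only to obtain one parseable document, since the issue shows the two parts separately.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: "lex" and "interps" placed side by side under one <fs>.
TOKEN = """
<fs>
  <f name="lex">
    <fs>
      <f name="orig">doświadczenie</f>
      <f name="lemma">doświadczenie</f>
      <f name="pos">subst</f>
      <f name="msd">sg:nom:n:ncol</f>
    </fs>
  </f>
  <f name="interps">
    <fs type="alt" n="choice">
      <f name="lemma">doświadczenie</f>
      <f name="pos">subst</f>
      <f name="msd">
        <vAlt>
          <symbol value="sg:nom:n:ncol" n="choice"/>
          <symbol value="sg:acc:n:ncol"/>
        </vAlt>
      </f>
    </fs>
  </f>
</fs>
"""


def lex_triple(token):
    """Return (lemma, pos, msd) from the post-disambiguation part."""
    fs = token.find("f[@name='lex']/fs")
    return tuple(fs.findtext(f"f[@name='{key}']") for key in ("lemma", "pos", "msd"))


def chosen_triples(token):
    """Return every (lemma, pos, msd) marked n="choice" at both levels."""
    triples = []
    for fs in token.findall("f[@name='interps']/fs[@n='choice']"):
        lemma = fs.findtext("f[@name='lemma']")
        pos = fs.findtext("f[@name='pos']")
        for symbol in fs.iter("symbol"):
            if symbol.get("n") == "choice":
                triples.append((lemma, pos, symbol.get("value")))
    return triples


token = ET.fromstring(TOKEN)
consistent = chosen_triples(token) == [lex_triple(token)]
print(consistent)  # True for this sample
```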


Note:

  • the NKJP2KorAP dataset will surely be linked from here when ready
  • the feature values are going to be converted into attribute:value pairs (at this time, attribute names are only implied by the NKJP schema, not provided here)
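
As an illustration of the second point, the sketch below turns a positional value string such as `sg:acc:n:ncol` into attribute:value pairs. The lookup table and every attribute name in it are assumptions made for this example; the actual names would have to come from the NKJP schema. The approach relies on each value belonging to exactly one attribute, which holds for the values listed here.

```python
# Hypothetical value-to-attribute table; attribute names are guesses and
# should be replaced by the ones defined in the NKJP schema.
ATTRIBUTE_OF = {
    "sg": "number", "pl": "number",
    "nom": "case", "gen": "case", "dat": "case", "acc": "case",
    "inst": "case", "loc": "case", "voc": "case",
    "m1": "gender", "m2": "gender", "m3": "gender", "f": "gender", "n": "gender",
    "pri": "person", "sec": "person", "ter": "person",
    "imperf": "aspect", "perf": "aspect",
    "aff": "negation", "neg": "negation",
    "col": "collectivity", "ncol": "collectivity",
}


def msd_to_pairs(msd):
    """Convert 'sg:acc:n:ncol' into ['number:sg', 'case:acc', ...]."""
    if not msd:  # the msd feature may be missing altogether
        return []
    pairs = []
    for value in msd.split(":"):
        # Values missing from the table are kept under an "unknown" attribute.
        attribute = ATTRIBUTE_OF.get(value, "unknown")
        pairs.append(f"{attribute}:{value}")
    return pairs


print(msd_to_pairs("sg:acc:n:ncol"))
# ['number:sg', 'case:acc', 'gender:n', 'collectivity:ncol']
print(msd_to_pairs("sg:ter:imperf"))
# ['number:sg', 'person:ter', 'aspect:imperf']
```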

Akron transferred this issue from KorAP/Kalamar Sep 7, 2022

Akron commented Feb 6, 2023

I am thinking about how to implement this so that it is universally useful.

For the moment, I would guess we have to add further layers for interpretations, like "pv" for POS variants and "mv" for morphosyntactic variants.
Then:

  • [case=acc] -> [nkjp/m=case:acc]
  • [case~acc] -> [nkjp/m=case:acc | nkjp/mv=case:acc]
  • [case==acc] -> exclude([nkjp/m=case:acc],[nkjp/mv=case:.*])
  • [case~~acc] -> exclude([nkjp/m=case:acc],[nkjp/mv=case:.*])|[nkjp/m=case:acc & nkjp/mv=case:acc]

where exclude() is the negative match operator, matching at the span of the first operand whenever no second operand has the same span.

Is that correct?
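
For reference, the behaviour of exclude() as described above can be sketched in Python over toy data. Spans are modelled as (start, end) tuples, and the two match lists are invented for illustration; this is not KorAP or Krill code, and it only shows the operator itself, not whether the proposed mappings are right.

```python
def exclude(first, second):
    """Keep each span of `first` for which no span of `second` is identical."""
    blocked = set(second)
    return [span for span in first if span not in blocked]


# Toy data: token spans matched by two hypothetical queries.
m_acc = [(0, 1), (3, 4), (7, 8)]   # e.g. matches of [nkjp/m=case:acc]
mv_any = [(3, 4), (5, 6)]          # e.g. matches of [nkjp/mv=case:.*]

print(exclude(m_acc, mv_any))  # [(0, 1), (7, 8)]
```

Only spans with exactly the same start and end are removed; a span in the second operand that merely overlaps a span of the first would not block it under this reading of "the same span".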
