TL;DR: consider introducing the ~ family of operators that search among non-disambiguated morphological values as opposed to those disambiguated in the morphosyntactic context. The potential training ground for that is the new NKJP-SGJP dataset (being) converted from TEI to KorAP XML.
Below, I paste a longer passage from Eureco meeting materials:
“A unique feature of Poliqarp is that it may be used for searching corpora containing, in addition to disambiguated interpretations, information about all possible morphosyntactic interpretations given by the morphological analyser. For example, the query [case~acc] finds all segments with an accusative interpretation (even if this is not the interpretation selected in a given context), while [case=acc] finds segments which were disambiguated to accusative in a given context.
Moreover, Poliqarp does not make the assumption that only one interpretation must be correct for any given segment; some examples of sentences containing an ambiguous segment which cannot be uniquely disambiguated even given unlimited context and all the linguistic and encyclopaedic knowledge are cited in (Przepiórkowski et al., 2004). In such cases, the = operator has the existential meaning, i.e., [case=acc] finds segments with at least one accusative interpretation marked as correct in the context (“disambiguated”). On the other hand, the operator == is universal, i.e., [case==acc] finds segments whose all disambiguated interpretations are accusative: segments which were truly uniquely disambiguated to one (accusative) interpretation, or segments which have many interpretations correct in the context, but all of them are accusative. For completeness, the operator ~~ is added, which universally applies to all morphosyntactic interpretations, i.e., [case~~acc] finds segments whose all interpretations as given by a morphological analyser (before disambiguation) are accusative.”
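To make the four operators concrete, here is a minimal Python sketch of their semantics as described in the quote. It is not Poliqarp's or KorAP's implementation: the `Interp` class, the function names, and the decision that the universal operators fail on an empty set of interpretations are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interp:
    """One morphosyntactic interpretation of a segment (hypothetical model)."""
    case: str
    chosen: bool  # True if marked as correct in context ("disambiguated")


def op_eq(interps, value):
    """[case=value]: at least one disambiguated interpretation has this case."""
    return any(i.case == value for i in interps if i.chosen)


def op_eq_all(interps, value):
    """[case==value]: all disambiguated interpretations have this case.
    Assumption: a segment with no disambiguated interpretation does not match."""
    chosen = [i for i in interps if i.chosen]
    return bool(chosen) and all(i.case == value for i in chosen)


def op_tilde(interps, value):
    """[case~value]: at least one interpretation, disambiguated or not, has this case."""
    return any(i.case == value for i in interps)


def op_tilde_all(interps, value):
    """[case~~value]: all interpretations given by the analyser have this case.
    Assumption: a segment with no interpretations does not match."""
    return bool(interps) and all(i.case == value for i in interps)


# A segment that is nom/acc/voc ambiguous and was disambiguated to nominative.
segment = [Interp("nom", True), Interp("acc", False), Interp("voc", False)]

print(op_eq(segment, "acc"))         # False: accusative was not selected in context
print(op_tilde(segment, "acc"))      # True: accusative is among the analyser's readings
print(op_eq_all(segment, "nom"))     # True: the only selected reading is nominative
print(op_tilde_all(segment, "nom"))  # False: acc and voc readings also exist
```

The `=`/`==` pair only ever looks at the selected readings, while the `~`/`~~` pair looks at everything the analyser produced, which is why the same segment can match `[case~acc]` and fail `[case=acc]`.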
So at this point, the morphological info has two basic parts. One is the “traditional” part (`<f name="lex">`) with the added “translit” container, renamed to “orig”:
```xml
<f name="lex"> <!-- _zdarza: this is the “orth”, just for testing, can be suppressed -->
  <fs>
    <f name="orig">zdarza</f> <!-- the original spelling, maybe with typos -->
    <f name="lemma">zdarzać</f> <!-- the “base” in Poliqarpish -->
    <f name="pos">fin</f>
    <f name="msd">sg:ter:imperf</f> <!-- morphosyntactic info, may be missing -->
  </fs> <!-- recall that “orth” is recovered from the offsets -->
</f>
```
The new part is `<f name="interps">`. The name of the feature was simply taken over from the original; it stands for “interpretations”, of course. It contains one or more alternatives encoded in `<fs type="alt">`. These are all potential values of the given token before disambiguation.
```xml
<f name="interps">
  <fs type="alt" n="choice">
    <f name="lemma">doświadczenie</f>
    <f name="pos">subst</f>
    <f name="msd">
      <vAlt>
        <symbol value="sg:nom:n:ncol" n="choice"/>
        <symbol value="sg:acc:n:ncol"/>
        <symbol value="sg:voc:n:ncol"/>
      </vAlt>
    </f>
  </fs>
  <fs type="alt">
    <f name="lemma">doświadczyć</f>
    <f name="pos">ger</f>
    <f name="msd">
      <vAlt>
        <symbol value="sg:nom:n:perf:aff"/>
        <symbol value="sg:acc:n:perf:aff"/>
      </vAlt>
    </f>
  </fs>
</f>
```
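As a rough illustration of how this structure could be read, the following Python sketch flattens the `interps` feature into one row per (lemma, pos, msd) reading and flags the reading marked with `n="choice"` at both levels. It uses only the standard library; the function names `read_interps` and `has_tag` are hypothetical, the fragment is assumed to carry no XML namespace (real TEI/KorAP files may), and treating `msd` as a colon-separated list of tags is an assumption based on the values shown above.

```python
import xml.etree.ElementTree as ET

# The example fragment from this issue, embedded for a self-contained run.
INTERPS_XML = """
<f name="interps">
  <fs type="alt" n="choice">
    <f name="lemma">doświadczenie</f>
    <f name="pos">subst</f>
    <f name="msd">
      <vAlt>
        <symbol value="sg:nom:n:ncol" n="choice"/>
        <symbol value="sg:acc:n:ncol"/>
        <symbol value="sg:voc:n:ncol"/>
      </vAlt>
    </f>
  </fs>
  <fs type="alt">
    <f name="lemma">doświadczyć</f>
    <f name="pos">ger</f>
    <f name="msd">
      <vAlt>
        <symbol value="sg:nom:n:perf:aff"/>
        <symbol value="sg:acc:n:perf:aff"/>
      </vAlt>
    </f>
  </fs>
</f>
"""


def read_interps(xml_text):
    """Return a list of (lemma, pos, msd, chosen) tuples, one per reading."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for fs in root.findall("fs[@type='alt']"):
        lemma = fs.findtext("f[@name='lemma']")
        pos = fs.findtext("f[@name='pos']")
        lexical_choice = fs.get("n") == "choice"
        for symbol in fs.findall("f[@name='msd']/vAlt/symbol"):
            # A reading counts as chosen only if both levels are marked.
            chosen = lexical_choice and symbol.get("n") == "choice"
            rows.append((lemma, pos, symbol.get("value"), chosen))
    return rows


def has_tag(msd, tag):
    """Assumption: msd is a colon-separated list of tags such as 'sg:acc:n:ncol'."""
    return tag in msd.split(":")


rows = read_interps(INTERPS_XML)
for row in rows:
    print(row)

# '~'-style lookup: any reading at all carries 'acc'.
print(any(has_tag(msd, "acc") for _, _, msd, _ in rows))
# '='-style lookup: a chosen reading carries 'acc'.
print(any(has_tag(msd, "acc") for _, _, msd, chosen in rows if chosen))
```

For this token the first lookup succeeds and the second does not, which is the contrast between `[case~acc]` and `[case=acc]` described in the quote.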
Notice that there are two sets of alternatives: one is at the lexical level (`<fs type="alt">`), and the other, within a single lexical hypothesis, involves a set of alternative morphosyntactic descriptions, contained inside `<vAlt>` (which is TEI-speak for “alternative values”).
So, `<f name="lex">` is post-disambiguation, and in the last case at hand, it looks as follows:
```xml
<f name="lex">
  <fs>
    <f name="orig">doświadczenie</f>
    <f name="lemma">doświadczenie</f>
    <f name="pos">subst</f>
    <f name="msd">sg:nom:n:ncol</f>
  </fs>
</f>
```
and `<f name="interps">` is pre-disambiguation, as described in the above quote from the article by Przepiórkowski and Janus. Redundantly, in the pre-disambiguation part, I have marked the eventual choices with the extra attribute `n="choice"`, which points at the same info as what `<f name="lex">` contains.
Note:
- the NKJP2KorAP dataset will surely be linked from here when ready
- the feature values are going to be modified into attribute:value datasets (at this time, attribute names are only implied by the NKJP schema, not provided here)
Source of the quote: https://dl.acm.org/doi/pdf/10.5555/1557769.1557795
“Poliqarp: an open source corpus indexer and search engine with syntactic extensions”, by Janus and Przepiórkowski