ALTO export #2

Open
cneud opened this issue Apr 6, 2017 · 2 comments

Comments

@cneud
Member

cneud commented Apr 6, 2017

The way NER tags are currently exported to ALTO creates considerable overhead in file size and parsing effort, because every occurrence of a named entity produces its own NamedEntityTag element and its own TAGREFS reference.

Consider, for example, the following text in ALTO:

<TextLine HEIGHT="71" HPOS="309" VPOS="1004" WIDTH="779">
    <String CONTENT="Berlin" [...] />
    <String CONTENT="is" [...] />
    <String CONTENT="a" [...] />
    <String CONTENT="city" [...] />
    <String CONTENT="but" [...] />
    <String CONTENT="Berlin" [...] />
    <String CONTENT="is" [...] />
    <String CONTENT="also" [...] />
    <String CONTENT="a" [...] />
    <String CONTENT="state" [...] />
    <String CONTENT="of" [...] />
    <String CONTENT="Germany" [...] />
    <String CONTENT="." [...] />
</TextLine>

Tagging this line adds two TAGREFS values and two NamedEntityTag elements for the single named entity "Berlin" to the ALTO, like this:

<TextLine HEIGHT="71" HPOS="309" VPOS="1004" WIDTH="779">
    <String CONTENT="Berlin" [...] TAGREFS="Tag15"/>
    <String CONTENT="is" [...] />
    <String CONTENT="a" [...] />
    <String CONTENT="city" [...] />
    <String CONTENT="but" [...] />
    <String CONTENT="Berlin" [...] TAGREFS="Tag16"/>
    <String CONTENT="is" [...] />
    <String CONTENT="also" [...] />
    <String CONTENT="a" [...] />
    <String CONTENT="state" [...] />
    <String CONTENT="of" [...] />
    <String CONTENT="Germany" [...] TAGREFS="Tag17"/>
    <String CONTENT="." [...] />
</TextLine>
[...]
<NamedEntityTag DESCRIPTION="Berlin" ID="Tag15" LABEL="LOC"/>    
<NamedEntityTag DESCRIPTION="Berlin" ID="Tag16" LABEL="LOC"/>
<NamedEntityTag DESCRIPTION="Germany" ID="Tag17" LABEL="LOC"/>

It would be preferable not to repeat NamedEntityTag for identical entities and instead to reuse a single tag for every occurrence, like this:

<TextLine HEIGHT="71" HPOS="309" VPOS="1004" WIDTH="779">
    <String CONTENT="Berlin" [...] TAGREFS="Tag15"/>
    <String CONTENT="is" [...] />
    <String CONTENT="a" [...] />
    <String CONTENT="city" [...] />
    <String CONTENT="but" [...] />
    <String CONTENT="Berlin" [...] TAGREFS="Tag15"/>
    <String CONTENT="is" [...] />
    <String CONTENT="also" [...] />
    <String CONTENT="a" [...] />
    <String CONTENT="state" [...] />
    <String CONTENT="of" [...] />
    <String CONTENT="Germany" [...] TAGREFS="Tag16"/>
    <String CONTENT="." [...] />
</TextLine>
[...]
<NamedEntityTag DESCRIPTION="Berlin" ID="Tag15" LABEL="LOC"/>
<NamedEntityTag DESCRIPTION="Germany" ID="Tag16" LABEL="LOC"/>
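
As a rough illustration of the requested behaviour, the sketch below hands out one tag ID per distinct (LABEL, DESCRIPTION) pair and reuses it for every later occurrence. It is not taken from the project's exporter: the `TagRegistry` class, its method names, and the `first_id` parameter are hypothetical, and it assumes that two entities count as identical when both label and text match.

```python
import xml.etree.ElementTree as ET


class TagRegistry:
    """Hypothetical helper: one tag ID per distinct (label, text) pair."""

    def __init__(self, first_id=1):
        self._next = first_id
        self._ids = {}  # (label, text) -> tag ID, kept in first-seen order

    def tag_id(self, label, text):
        """Return the existing ID for this entity, or allocate a new one."""
        key = (label, text)
        if key not in self._ids:
            self._ids[key] = f"Tag{self._next}"
            self._next += 1
        return self._ids[key]

    def elements(self):
        """Build one NamedEntityTag element per distinct entity."""
        return [
            ET.Element(
                "NamedEntityTag",
                {"DESCRIPTION": text, "ID": tag_id, "LABEL": label},
            )
            for (label, text), tag_id in self._ids.items()
        ]


# The three entity occurrences from the example line, in reading order.
registry = TagRegistry(first_id=15)
entities = [("LOC", "Berlin"), ("LOC", "Berlin"), ("LOC", "Germany")]
tagrefs = [registry.tag_id(label, text) for label, text in entities]

print(tagrefs)  # ['Tag15', 'Tag15', 'Tag16']
for element in registry.elements():
    print(ET.tostring(element, encoding="unicode"))
```

Both occurrences of "Berlin" receive `Tag15`, and only two NamedEntityTag elements are emitted instead of three. Keying on the label as well as the text keeps, say, a person and a location with the same surface form apart; whether that is the right notion of "identical" for this exporter is an open design choice.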
@TuulaP

TuulaP commented Dec 21, 2017

Is there a plan to implement this at some point? It would be handy in our case as well.

@cneud
Member Author

cneud commented Dec 21, 2017

This is definitely on our to-do list for the first half of 2018, though I cannot yet promise exactly when it will be implemented.
PR welcome ;-)
