Sent marker #47

Open
miguelwon opened this issue Jan 14, 2020 · 1 comment

Comments

@miguelwon

Is there any marker that separates the sentences? Usually, in BIO files, each sentence is separated by an empty line.

@cneud
Member

cneud commented Jan 14, 2020

Hi @miguelwon, I am afraid there is currently no marker for sentence boundaries in the files. I am working on enriching the German and subsequently the French data with sentence boundaries (empty lines) as part of the overall reworking of the data here: https://github.com/EuropeanaNewspapers/ner-corpora/tree/0.2. It is tedious, though, and will require more time. If you need sentence boundaries now, I recommend using a tokenizer.
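
As a stopgap until the files carry real sentence boundaries, something like the following rough sketch could approximate them. It assumes each non-empty line holds a token followed by its BIO tag, separated by whitespace, which may not match the actual files exactly. The function name `add_sentence_breaks` and the `SENTENCE_END` set are made up for illustration and are not part of the corpus tooling.

```python
from typing import Iterable, Iterator

# Assumption: these tokens end a sentence. Abbreviations and OCR noise
# will cause wrong splits, so treat the output as an approximation.
SENTENCE_END = {".", "!", "?"}


def add_sentence_breaks(lines: Iterable[str]) -> Iterator[str]:
    """Yield BIO lines, adding an empty line after each sentence-final token."""
    last_blank = True  # suppresses a leading blank line and doubled blanks
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Keep blank lines already in the data, but never two in a row.
            if not last_blank:
                yield ""
                last_blank = True
            continue
        yield line
        last_blank = False
        # Assumed layout: the token is the first whitespace-separated field.
        if line.split()[0] in SENTENCE_END:
            yield ""
            last_blank = True


sample = ["Das O", "ist O", "Berlin B-LOC", ". O", "Er O", "kam O", ". O"]
result = list(add_sentence_breaks(sample))
```

This only splits on punctuation tokens, so a period after an abbreviation will be treated as a sentence end. A trained sentence splitter should do better on running text, at the cost of having to map its output back onto the token lines.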
