Valency: better sentence splitting #958

Open
myrix opened this issue Mar 30, 2023 · 0 comments
Labels
backend (bug is related to backend), enhancement (this label means that resolving the issue would improve some part of the system)

Comments

Contributor

myrix commented Mar 30, 2023

Sentence splitting in valency data extraction is currently rather ad hoc: it uses a very simple algorithm by Pavel Grashchenkov that relies on a list of possible sentence-ending punctuation tokens, {'.', '!', '?', '...', '?!', '...»'}; see https://github.com/ispras/lingvodoc/blob/2c121263ffe26773bcc34aca1ed6e12c68939060/lingvodoc/scripts/valency.py#L17.
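For illustration, a punctuation-list splitter of this kind might look roughly like the sketch below. This is not the code from `valency.py`: the function name `split_sentences` and the assumption that the input is an already tokenized flat list are hypothetical, and only the set of ending tokens is taken from the description above.

```python
# Sentence-ending tokens as listed in the issue text.
SENTENCE_END = {'.', '!', '?', '...', '?!', '...»'}


def split_sentences(tokens):
    """Group a flat token list into sentences (hypothetical helper).

    A sentence is closed right after any token found in SENTENCE_END.
    """
    sentences = []
    current = []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    # Keep trailing tokens that are not followed by an ending token.
    if current:
        sentences.append(current)
    return sentences


# An abbreviation period is treated as a sentence end, which is the
# sort of error a more capable splitter is meant to avoid.
print(split_sentences(['See', 'p', '.', '5', '.']))
# → [['See', 'p', '.'], ['5', '.']]
```

Under these assumptions, any token in the list closes a sentence regardless of context, so abbreviations, initials and similar cases get split incorrectly.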

We should consider upgrading to a sentence splitter closer to the state of the art, e.g. NLTK's, to improve the overall quality of sentence splitting.

However, if the sentence structure could change as a result, we would need to carefully and accurately enhance the valency data updating procedures (see #775), so that is a point to keep in mind.
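One possible way to gauge how much a splitter change would affect existing data is to diff the old and new sentence lists for the same text. The sketch below is only an assumption about how such a check could be done, using the standard library's `difflib`; the helper name `changed_spans` is hypothetical and it does not reflect the actual updating procedures discussed in #775.

```python
from difflib import SequenceMatcher


def changed_spans(old_sentences, new_sentences):
    """Return (tag, old_slice, new_slice) for every region where
    the two sentence splittings of the same text differ."""
    matcher = SequenceMatcher(a=old_sentences, b=new_sentences, autojunk=False)
    return [
        (tag, old_sentences[i1:i2], new_sentences[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != 'equal'
    ]


old = ['See p.', '5.', 'It is fine.']   # splitting by the current approach
new = ['See p. 5.', 'It is fine.']      # splitting by a hypothetical new splitter
print(changed_spans(old, new))
# → [('replace', ['See p.', '5.'], ['See p. 5.'])]
```

Only the sentences inside the reported spans would need their valency data revisited; unchanged sentences could in principle be left as they are.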

myrix added the enhancement and backend labels on Mar 30, 2023