Support for Cyrillic Slavic languages? #37

quinnanya · 2021-07-12T15:56:30Z

Would it be possible to adjust the tokenization to accommodate Cyrillic Slavic languages (e.g. Russian, Bulgarian, Serbian) that are supported by NRC? I understand the challenges with handling tokenization more broadly, and can't vouch for how it would work with some of the other non-Latin alphabets, but I think adding in the Cyrillic Unicode ranges to the current whitespace-oriented code should work okay. Thank you!
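A minimal sketch of the range-extension idea, written in Python for illustration rather than in the package's own code. The pattern and the `tokenize_words` name are hypothetical; the only facts relied on are that the Cyrillic block spans U+0400 to U+04FF and the Cyrillic Supplement block spans U+0500 to U+052F.

```python
import re

# Hypothetical pattern (not the package's actual code): Basic Latin letters
# plus the contiguous Cyrillic (U+0400-U+04FF) and Cyrillic Supplement
# (U+0500-U+052F) Unicode blocks.
WORD_PATTERN = re.compile(r"[A-Za-z\u0400-\u052F]+")


def tokenize_words(text):
    """Lowercase the text and return runs of characters in the ranges above."""
    return WORD_PATTERN.findall(text.lower())


print(tokenize_words("Война и мир, War and Peace!"))
# → ['война', 'и', 'мир', 'war', 'and', 'peace']
```

A range-based pattern like this treats everything outside the listed blocks as a separator, so it would still need to be checked against apostrophes, hyphenated words, and combining stress marks before being relied on.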

mjockers · 2021-07-17T22:30:31Z

This is a heavier lift (especially since I have no experience with tokenizing Cyrillic Slavic languages), so probably not something I will be able to get to anytime soon. Syuzhet already could use better English sentence and word tokenizers, so it would make sense to rebuild the get_sentences and get_tokens functions to take a "language" argument that would then call an appropriate tokenizer. (If this is in your wheelhouse, please feel free to fork and contribute :-)
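One way the "language" argument could be wired up, sketched in Python under stated assumptions: the registry, the two tokenizer helpers, and the `get_tokens` signature here are all hypothetical and do not reflect the package's existing functions.

```python
import re


# Hypothetical tokenizers for illustration; neither mirrors existing code.
def _tokenize_english(text):
    # ASCII letters and apostrophes, so contractions stay in one token.
    return re.findall(r"[a-z']+", text.lower())


def _tokenize_russian(text):
    # [^\W\d_] matches any Unicode letter (word characters minus digits
    # and underscore), which covers Cyrillic without listing code points.
    return re.findall(r"[^\W\d_]+", text.lower())


# Registry mapping a language name to its tokenizer function.
TOKENIZERS = {
    "english": _tokenize_english,
    "russian": _tokenize_russian,
}


def get_tokens(text, language="english"):
    """Dispatch to the tokenizer registered for `language`."""
    try:
        tokenizer = TOKENIZERS[language]
    except KeyError:
        raise ValueError(f"no tokenizer registered for language {language!r}") from None
    return tokenizer(text)


print(get_tokens("Don't stop"))                        # → ["don't", "stop"]
print(get_tokens("Привет, мир!", language="russian"))  # → ['привет', 'мир']
```

Keeping the tokenizers in a lookup table means a contributor could add Bulgarian or Serbian by registering one more function, while the default keeps current English behavior unchanged.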
