Would it be possible to adjust the tokenization to accommodate Cyrillic Slavic languages (e.g. Russian, Bulgarian, Serbian) that are supported by NRC? I understand the challenges with handling tokenization more broadly, and can't vouch for how it would work with some of the other non-Latin alphabets, but I think adding in the Cyrillic Unicode ranges to the current whitespace-oriented code should work okay. Thank you!
This is a heavier lift (especially since I have no experience with tokenizing Cyrillic Slavic languages), so it is probably not something I will be able to get to anytime soon. Syuzhet could already use better English sentence and word tokenizers, so it would make sense to rebuild the get_sentences and get_tokens functions to take a "language" argument that would then call an appropriate tokenizer. (If this is in your wheelhouse, please feel free to fork and contribute :-)
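As a rough illustration of the idea discussed above (a tokenizer that takes a "language" argument and covers the Cyrillic Unicode range), here is a minimal Python sketch. The function name and dispatch table are hypothetical, not part of Syuzhet's actual R API; the basic Cyrillic block U+0400–U+04FF covers the alphabets used by Russian, Bulgarian, and Serbian.

```python
import re

# Hypothetical dispatch table (illustrative only, not the Syuzhet API):
# each language maps to a regex describing what counts as a word character.
WORD_PATTERNS = {
    "english": re.compile(r"[A-Za-z']+"),
    # U+0400-U+04FF is the basic Cyrillic Unicode block, which covers
    # Russian, Bulgarian, and Serbian letters.
    "cyrillic": re.compile(r"[\u0400-\u04FF']+"),
}

def get_tokens(text, language="english"):
    """Return lowercase word tokens for the requested language."""
    pattern = WORD_PATTERNS.get(language, WORD_PATTERNS["english"])
    return [token.lower() for token in pattern.findall(text)]
```

For example, `get_tokens("Я люблю тебя", language="cyrillic")` yields `["я", "люблю", "тебя"]`, while the default English pattern would drop those characters entirely, which is essentially the gap the original request describes.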