Load HuggingFace tokenizer with SentencePiece files #2249
-
Hi all, I am deploying the tokenizer of the Marian (Opus-MT) model with DJL on an Android device. The original Python Hugging Face code uses AutoTokenizer, which DJL supports. However, DJL only supports tokenizers that ship a "tokenizer.json", and Opus-MT uses a SentencePiece tokenizer whose files are "source.spm", "target.spm" and "vocab.json". Is there any way for DJL to support this, or to convert these files to "tokenizer.json"?
Replies: 2 comments 2 replies
-
DJL supports fast tokenizers. Most Hugging Face models can be converted to use a fast tokenizer.
-
That means there is no fast tokenizer implementation for this model. You have to port the Python code to Java.
You might want to take a look at our SentencePiece extension, and see if you can use it.
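For anyone attempting the port, here is a minimal sketch of the part that SentencePiece itself does not cover. In the Python Marian tokenizer, "source.spm" only splits the text into pieces. The model input ids are then looked up in "vocab.json", with a fallback to the unknown token, and the end-of-sentence id is appended. The class `MarianIdMapper`, the method `toInputIds`, and the toy vocabulary are hypothetical names and values made up for illustration. They are not part of DJL. The sketch assumes the pieces were already produced by a SentencePiece tokenizer (for example through the SentencePiece extension) and that "vocab.json" was already parsed into a `Map` with a JSON library of your choice.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a DJL class): maps SentencePiece pieces to Marian input ids.
public class MarianIdMapper {

    // pieces: output of a SentencePiece tokenizer loaded from source.spm (assumed given)
    // vocab:  contents of vocab.json, already parsed into a map (assumed given)
    public static int[] toInputIds(List<String> pieces, Map<String, Integer> vocab,
                                   String unkToken, String eosToken) {
        Integer unkId = vocab.get(unkToken);
        Integer eosId = vocab.get(eosToken);
        if (unkId == null || eosId == null) {
            throw new IllegalArgumentException("vocab must contain the unk and eos tokens");
        }
        int[] ids = new int[pieces.size() + 1];
        for (int i = 0; i < pieces.size(); i++) {
            // Pieces missing from vocab.json fall back to the unknown-token id.
            ids[i] = vocab.getOrDefault(pieces.get(i), unkId);
        }
        // Append the end-of-sentence id as the last input id.
        ids[pieces.size()] = eosId;
        return ids;
    }

    public static void main(String[] args) {
        // Toy vocabulary with made-up ids; a real vocab.json is much larger.
        Map<String, Integer> vocab = Map.of(
                "</s>", 0, "<unk>", 1, "\u2581Hello", 2, "\u2581world", 3);
        List<String> pieces = List.of("\u2581Hello", "\u2581world", "\u2581xyz");
        System.out.println(Arrays.toString(toInputIds(pieces, vocab, "<unk>", "</s>")));
    }
}
```

Decoding would go the other way: map the output ids back to pieces through the inverted vocabulary, then let a tokenizer loaded from "target.spm" join the pieces into text. Check the special-token names against the tokenizer config of your checkpoint before relying on them.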