Load HuggingFace tokenizer with SentencePiece files #2249

xudongguan202 · 2022-12-20T08:40:43Z

Hi all,

I am deploying the tokenizer of a Marian (Opus) MT model with DJL on an Android device. The original Python Hugging Face code uses AutoTokenizer, which DJL supports. However, DJL only supports tokenizers that ship a "tokenizer.json", whereas the Opus MT model uses a SentencePiece tokenizer with the files "source.spm", "target.spm" and "vocab.json". Is there any way for DJL to support this, or to convert the files to "tokenizer.json"?

Answered by frankfliu

Dec 21, 2022


frankfliu · 2022-12-20T15:24:55Z

DJL supports fast tokenizers. Most Hugging Face models can be converted to use a fast tokenizer.
What you need to do is save the tokenizer in Python:

        tokenizer.save_pretrained(output_dir)
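
One way to tell whether the save produced a fast tokenizer is to check the output directory for "tokenizer.json". The following is a minimal sketch, assuming the `transformers` package is installed; the helper names `has_fast_tokenizer_file` and `save_and_check` are made up for illustration and are not part of DJL or Hugging Face.

```python
from pathlib import Path


def has_fast_tokenizer_file(output_dir):
    """Return True if output_dir contains a tokenizer.json file."""
    return (Path(output_dir) / "tokenizer.json").is_file()


def save_and_check(model_id, output_dir):
    """Save the tokenizer for model_id and report whether tokenizer.json exists.

    Assumes the transformers package is installed; it is imported here so the
    helper above can be used without it.
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.save_pretrained(output_dir)
    # is_fast is False when only a slow (Python) tokenizer is available.
    return tokenizer.is_fast, has_fast_tokenizer_file(output_dir)
```

If both returned values are `False`, the model has no fast tokenizer and the saved directory will not contain a "tokenizer.json" for DJL to load.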

1 reply

xudongguan202 Dec 21, 2022
Author

Thanks for replying! I have tried both methods from the official documentation, saving from the pipeline and saving from the tokenizer, and both give me files like "source.spm", "target.spm" and "vocab.json".

frankfliu · 2022-12-21T04:57:06Z

That means there is no fast tokenizer implementation for this model. You would have to port the Python tokenizer code to Java.
You might want to take a look at our SentencePiece extension and see whether you can use it.
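
To give a rough idea of what such a port involves, here is a minimal Python sketch of the ID-mapping step. It assumes the Marian tokenizer splits text into pieces with the SentencePiece model ("source.spm"), looks each piece up in "vocab.json", and appends an end-of-sequence token. The function name `pieces_to_ids`, the toy vocabulary, and the `<unk>` and `</s>` token strings are illustrative assumptions; check them against the model's actual files before porting.

```python
def pieces_to_ids(pieces, vocab, unk_token="<unk>", eos_token="</s>"):
    """Map SentencePiece pieces to vocabulary IDs and append the EOS ID.

    pieces: list of piece strings, as a SentencePiece model would produce.
    vocab: dict from piece string to integer ID (the content of vocab.json).
    """
    unk_id = vocab[unk_token]
    # Pieces missing from the vocabulary fall back to the unknown-token ID.
    ids = [vocab.get(piece, unk_id) for piece in pieces]
    ids.append(vocab[eos_token])
    return ids


# Toy vocabulary for illustration only; a real vocab.json is much larger.
toy_vocab = {"</s>": 0, "<unk>": 1, "▁Hello": 2, "▁world": 3}
print(pieces_to_ids(["▁Hello", "▁world", "▁xyz"], toy_vocab))  # prints [2, 3, 1, 0]
```

The same logic should carry over to Java, with the piece splitting handled by a SentencePiece tokenizer and the lookup by a map loaded from "vocab.json".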

1 reply

xudongguan202 Dec 22, 2022
Author

Thanks
