This repository has been archived by the owner on May 17, 2023. It is now read-only.
Hugging face tokenizers for R using extendr

License

Unknown (LICENSE), MIT (LICENSE.md)
mlverse/hftokenizers

Note: This project is archived. Please refer to tok, which is a reimplementation.

HuggingFace tokenizers from R

This is an experimental project that binds the HuggingFace tokenizers Rust library to R using the extendr project. Do not use it for anything meaningful yet.

Installation

This repository uses the helloextendr template.

Before you can install this package, you need to install a working Rust toolchain. We recommend using rustup.

On Windows, you’ll also have to add the i686-pc-windows-gnu and x86_64-pc-windows-gnu targets:

rustup target add x86_64-pc-windows-gnu
rustup target add i686-pc-windows-gnu
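
Before installing, it can help to confirm that the Rust toolchain is actually visible on your PATH. The following is a minimal sketch, assuming a POSIX shell; `check_tool` is a hypothetical helper written for this illustration and is not part of rustup or of this package.

```shell
# check_tool is a hypothetical helper: it reports whether a command is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# rustc (the compiler) and cargo (the build tool) are both installed by rustup.
for tool in rustc cargo; do
  check_tool "$tool" || echo "install a Rust toolchain (e.g. via rustup) before continuing" >&2
done
```

On Windows, running `rustc --version` and `cargo --version` in a terminal should serve the same purpose.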

Once Rust is working, you can install this package via:

remotes::install_github("mlverse/hftokenizers")

Small example

Here’s a quick demo of what you can do with hftokenizers:

library(hftokenizers)

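# Fetch a small text corpus to train on
download.file(
  "https://raw.githubusercontent.com/mlverse/hftokenizers/main/tests/testthat/assets/small.txt",
  "small.txt"
)

# Create a BPE tokenizer, train it on the corpus,
# then encode a string and read back the token ids
tokenizer$
  new(models_bpe$new())$
  train(normalizePath("small.txt"))$
  encode(c("hello world"))$
  ids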
#> [1]  57 427  93 275  61  53
