hfds-clj is a Clojure library for accessing HuggingFace (HF) datasets. It provides seamless access to a dataset by:
- downloading the dataset from HF,
- caching the downloaded dataset locally, and
- serving it from the cache on subsequent requests.
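The download, cache, and serve steps above can be sketched roughly as follows. This is only an illustration of the pattern, not hfds-clj's actual implementation: the function `cached-load`, the EDN file layout, and the `fetch-fn` argument are all hypothetical names introduced for this example.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.edn :as edn]
         '[clojure.string :as str])

(defn cached-load
  "Returns the records for `dataset`, calling `fetch-fn` only on a cache miss.
  Hypothetical sketch: `cache-dir`, `fetch-fn`, and the one-EDN-file-per-dataset
  layout are assumptions, not part of hfds-clj."
  [cache-dir dataset fetch-fn]
  (let [file-name (str (str/replace dataset "/" "__") ".edn")
        cache-file (io/file cache-dir file-name)]
    (if (.exists cache-file)
      ;; Cache hit: read the records back from disk.
      (edn/read-string (slurp cache-file))
      ;; Cache miss: fetch, write to disk, then return the records.
      (let [records (vec (fetch-fn dataset))]
        (io/make-parents cache-file)
        (spit cache-file (pr-str records))
        records))))
```

With a stub `fetch-fn`, calling `cached-load` twice for the same dataset name invokes the fetch function once; the second call is served from the file on disk.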
It does not aim to replicate the full range of functionality found in the HuggingFace datasets library, although support for Dataset Features would be a natural next extension.
Datasets can be downloaded from the command line:
clojure -X:download :dataset "allenai/prosocial-dialog"
See the next section for a description of the parameters.
(require '[hfds-clj.core :refer [load-dataset]])
Download an HF dataset with this one-liner, whose single parameter is the dataset name as shown on the HF dataset page.
(load-dataset "Anthropic/hh-rlhf")
A second call with the same "Anthropic/hh-rlhf" parameter loads the dataset from the cache and returns a lazy sequence of all the dataset records.
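Because the result is a lazy sequence, a consumer can take the first few records without realizing the rest. The sketch below illustrates this with a stand-in sequence; `fake-records` and the `:id`/`:text` keys are hypothetical and are not part of hfds-clj or of any particular dataset's schema.

```clojure
(defn fake-records
  "Hypothetical stand-in for a dataset: an unbounded lazy seq of record maps.
  It is not produced by hfds-clj; it only mimics the shape of a lazy result."
  ([] (fake-records 0))
  ([i]
   (lazy-seq
     (cons {:id i :text (str "row " i)}
           (fake-records (inc i))))))

;; Only the records that are actually consumed get realized.
(def first-three (doall (take 3 (fake-records))))

;; Ordinary sequence functions compose with the lazy result.
(def first-texts (mapv :text first-three))
```

The same `take`, `map`, and `filter` style should apply to any lazy sequence of records, assuming each record is a Clojure map.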
A more fine-grained dataset request is supported via a parameterized call:
(load-dataset {:dataset "allenai/prosocial-dialog"
               :split   "train"
               :config  "default"
               :offset  0
               :length  100}
              {:hfds/download-mode :reuse-dataset-if-exists
               :hfds/cache-dir     "/data"
               :hfds/limit         4000})