Guide to Datasets
Nina Gial edited this page Mar 1, 2024 · 1 revision
| File | Explanation |
|---|---|
| dimodis.rda | Scraped data from Δημώδης Ελληνική Γραμματεία |
| gutenberg.rda | Scraped data from the Greek-language section of Project Gutenberg |
| glc.rda | Greek Legal Code from Hugging Face |
| alpaca.rda | Alpaca instruction fine-tuning dataset from Hugging Face |
| result_sentences.pkl | Random sample of about 350K sentences from Bible, Europarl, HNC, GlobalVoices |
Each RDA file usually contains an R environment. Load the file, then inspect the environment:
load("data/dimodis.rda")
ls(dimodis)
str(dimodis$works$ergoes) # reach the actual text data
We will fix this interface soon. Suggest your preferred formats in the issues.
PKL files can be loaded in Python with pickle.load():
import pickle
with open("result_sentences.pkl", "rb") as f:
    sentences = pickle.load(f)
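This page does not describe the structure of the object stored in result_sentences.pkl, so it can help to inspect it before use. Below is a minimal sketch. The helper names `load_pickle` and `describe` are hypothetical and not part of this repository, and the code deliberately makes no assumption about the stored type beyond checking whether it is a list or tuple. Only unpickle files from a source you trust, since unpickling can execute arbitrary code. If the file stores objects from a third-party library, that library must be installed for loading to succeed.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a pickle file and return the stored object (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def describe(obj, n=3):
    """Summarize an object: type name, length if it has one,
    and the first n items if it is a list or tuple (hypothetical helper)."""
    summary = {"type": type(obj).__name__}
    if hasattr(obj, "__len__"):
        summary["length"] = len(obj)
    if isinstance(obj, (list, tuple)):
        summary["head"] = list(obj[:n])
    return summary


if __name__ == "__main__":
    # Path is an assumption: adjust it to where the file lives in your checkout.
    path = Path("result_sentences.pkl")
    if path.exists():
        print(describe(load_pickle(path)))
```

Printing the summary first avoids dumping hundreds of thousands of sentences to the console.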
See the scripts section for further ways to use the files.