Guide to Datasets

File	Explanation
dimodis.rda	Scraped data from Δημώδης Ελληνική Γραμματεία (Vernacular Greek Literature)
gutenberg.rda	Scraped data from the Greek-language section of Project Gutenberg
glc.rda	Greek Legal Code dataset from Hugging Face
alpaca.rda	Alpaca instruction fine-tuning dataset from Hugging Face
result_sentences.pkl	Random sample of roughly 350K sentences from the Bible, Europarl, HNC, and GlobalVoices

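Each RDA file usually contains a single R environment. After load(), that environment is available under the file's base name (for example, dimodis):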

load("data/dimodis.rda")     # creates the `dimodis` environment in the workspace
ls(dimodis)                  # list the objects stored in the environment
str(dimodis$works$ergoes)    # reach the actual text data

We plan to simplify this interface soon. Please suggest your preferred formats in the issues.

PKL files are Python pickle files and can be loaded with pickle.load():

import pickle

with open("result_sentences.pkl", "rb") as f:
    sentences = pickle.load(f)
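
The structure of the object stored in result_sentences.pkl is not documented here, so it may help to inspect it before using it. The following is a minimal sketch, not part of the repository: describe_pickle is a hypothetical helper, and the demo file and its sentences are stand-in data written to a temporary directory so the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path


def describe_pickle(path):
    """Load a pickle file and return (object, summary) without assuming its type."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    summary = {"type": type(obj).__name__}
    # Most containers (list, tuple, dict, set) support len(); other objects may not.
    if hasattr(obj, "__len__"):
        summary["length"] = len(obj)
    # Preview the first few items only when the object is a list or tuple.
    if isinstance(obj, (list, tuple)):
        summary["preview"] = list(obj[:3])
    return obj, summary


# Stand-in data (an assumption for illustration, NOT the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_sentences.pkl"
    with open(demo_path, "wb") as f:
        pickle.dump(["First sentence.", "Second sentence.", "Third.", "Fourth."], f)
    sentences, info = describe_pickle(demo_path)
    print(info)
```

To try this on the real file, pass its path to describe_pickle instead of the demo path. If the object turns out not to be a list or tuple, the summary simply omits the preview.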

See the scripts section for further examples of how to use the files.