
Guide to Datasets

Nina Gial edited this page Mar 1, 2024 · 1 revision


| File | Explanation |
| --- | --- |
| dimodis.rda | Scraped data from Δημώδης Ελληνική Γραμματεία (Vernacular Greek Literature) |
| gutenberg.rda | Scraped data from the Greek-language section of Project Gutenberg |
| glc.rda | Greek Legal Code from Hugging Face |
| alpaca.rda | Alpaca instruction fine-tuning dataset from Hugging Face |
| result_sentences.pkl | Random sample of about 350K sentences from Bible, Europarl, HNC, GlobalVoices |

The .rda files usually contain R environments, so after load() the data is reached through the members of the loaded environment.

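load("data/dimodis.rda")     # restores the saved object `dimodis` into the workspace
ls(dimodis)                  # list the objects stored in the environment
str(dimodis$works$ergoes)    # reach the actual text data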

We plan to simplify this interface soon. Suggest your preferred formats in the issues.

PKL files are Python pickles and can be read with pickle.load():

import pickle

with open("result_sentences.pkl", "rb") as f:
    sentences = pickle.load(f)
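This page does not say what kind of object the pickle holds, so it may help to inspect it before relying on a particular structure. The following is a minimal sketch, not part of the repository: `describe_pickle` is a hypothetical helper name, and it only reports the object's type, its length (if it has one), and its first few items (if it is iterable).

```python
import pickle
from itertools import islice


def describe_pickle(path, n=3):
    """Load a pickle and summarize it without assuming its structure.

    Hypothetical helper for illustration; not provided by the repository.
    """
    with open(path, "rb") as f:
        obj = pickle.load(f)  # only unpickle files from a source you trust

    summary = {"type": type(obj).__name__}

    # Not every object defines len(), so fall back to None.
    try:
        summary["length"] = len(obj)
    except TypeError:
        summary["length"] = None

    # Take the first n items if the object is iterable
    # (for a dict this yields its first n keys).
    try:
        summary["sample"] = list(islice(iter(obj), n))
    except TypeError:
        summary["sample"] = None

    return summary
```

Calling `describe_pickle("result_sentences.pkl")` returns a small dictionary you can print. If the file was written from a pandas object, loading it also requires pandas to be installed; that is an assumption to check, not something this page confirms.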

See the scripts section for further ways to use the files.
