We present a text and imaging dataset of Byzantine-era Medieval Greek inscriptions, suitable as a challenging testbed for Computer Vision and Natural Language Processing tasks. The lack of sizable related training sets, together with difficulties stemming from the historical character and content of the inscriptions (natural wear of characters, systematic misspellings, etc.), makes for a context where modern resource-hungry techniques are not straightforward to apply. In the paper, we describe the dataset contents -- images, geometric and text annotation, metadata -- and discuss baselines for two Computer Vision tasks (Inscription Detection, Text Recognition) and one Natural Language Processing task (Word Classification).
Example images of the dataset, out of a total of 122 images.
Images depict historical donative Byzantine inscriptions, which list the persons or groups
that contributed to the construction of a related site or monument.
The text is written in the Greek language.
Example forms of character images found in the dataset.
We provide annotations at multiple levels:
there are a total of 504 outlined textlines,
2,776 outlined words and
10,414 outlined characters.
Map showing the positions of the sites relevant to the Bessarion data.
Our dataset comes from a total of 37 locations spanning the region of Epirus, situated in North-Western Greece.
Example ground-truth annotations for selected samples.
Method | AP | AP70 | AP50 | Parameter Size |
---|---|---|---|---|
Sparse R-CNN | 56% | 82% | 63% | 105.94 M |
Q-GAN + CC | 37% | 62% | 49% | 1.6 M |
Numerical comparison of baseline detectors. Detection accuracy in terms of mean average precision and average precision at different IoU thresholds is reported. Network sizes are cited for comparison.
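To make the IoU thresholds in the table concrete, here is a minimal sketch of how a single detection is judged at two thresholds. It assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; the dataset annotations are outlines, so this is an illustrative simplification and not the evaluation code used for the reported numbers. The function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the overlap rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold):
    """A prediction counts as a match when its IoU with the ground truth reaches the threshold."""
    return box_iou(pred, gt) >= threshold


# Hypothetical boxes: the prediction covers the top 30 of 50 rows of the ground truth.
gt_box = (0, 0, 100, 50)
pred_box = (0, 0, 100, 30)

print(box_iou(pred_box, gt_box))                # 0.6
print(is_true_positive(pred_box, gt_box, 0.5))  # True
print(is_true_positive(pred_box, gt_box, 0.7))  # False
```

The same detection is accepted at the looser threshold and rejected at the stricter one, which is why average precision is reported separately per threshold.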
Examples of recognition challenges posed by the nature of Byzantine text painting.
Example of three different types of text annotations.
From left to right: Line-level, word-level, character-level annotation.
Example text lines with the ground-truth recognition label (Original)
and the HTR model output (Predicted). A WER of
We are interested in answering:
- which person donated or contributed to the specific monument,
- when the monument was constructed.
Concerning our baseline NLP method, we aim to answer the aforementioned questions via a word classification task (is this word related to the founder / dating of the monument?).
Sample NLP task ground truth file in JSON format.
As medieval Greek is poorly represented in terms of accessible digitized corpora, we combine a (modern Greek) BERT encoder-based model with question-specific corpus augmentation methods. The baseline method achieves a correct classification ratio ranging from 75% to 99%.
- Image data + PageXML annotations can be found under the folder `dataset/image_resources`. Each folder contains data coming from a single location (= village or monastery). Data from the same location, but a different (neighbouring) monument, may be contained in a separate subfolder.
- Data related to the NLP task includes the full transcription for a number of inscriptions, plus metadata related to the content of each inscription. These are found under the folder `dataset/nlp_resources`.
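As a rough illustration of how the image annotations could be traversed, the sketch below counts line, word and character outlines in PageXML files. It relies on assumptions not confirmed by this README: that annotation files carry an `.xml` extension, and that they use the standard PAGE element names `TextLine`, `Word` and `Glyph` (the latter for characters). Tags are matched by local name so that the PAGE schema version does not matter. The helper names and the inline sample are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

LEVELS = ("TextLine", "Word", "Glyph")  # assumed PAGE element names


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def count_page_elements(root):
    """Count annotation elements per level below a parsed PageXML root."""
    counts = Counter(local_name(el.tag) for el in root.iter())
    return {name: counts.get(name, 0) for name in LEVELS}


def count_dataset(root_dir="dataset/image_resources"):
    """Sum the counts over every .xml file below root_dir, including subfolders."""
    totals = Counter({name: 0 for name in LEVELS})
    for path in sorted(Path(root_dir).rglob("*.xml")):
        totals.update(count_page_elements(ET.parse(path).getroot()))
    return dict(totals)


# Tiny hand-written sample, not taken from the dataset.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><Glyph id="g1"/><Glyph id="g2"/></Word>
        <Word id="w2"><Glyph id="g3"/></Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""

print(count_page_elements(ET.fromstring(SAMPLE)))
# {'TextLine': 1, 'Word': 2, 'Glyph': 3}
```

Calling `count_dataset()` from the repository root would aggregate over all locations, provided the assumptions above hold for the actual files.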
The title of the dataset is an homage to the Greek scholar Bessarion.
"Bessarion (Greek: Βησσαρίων; 2 January 1403 - 18 November 1472) was a Byzantine Greek Renaissance humanist, theologian, Catholic cardinal and one of the famed Greek scholars who contributed to the so-called great revival of letters in the 15th century. He was educated by Gemistus Pletho in Neoplatonic philosophy and later served as the titular Latin Patriarch of Constantinople. He eventually was named a cardinal and was twice considered for the papacy" (text from Wikipedia)
This dataset has been collected and processed in the context of the research project "Bessarion", funded through the National/EU call "Open Innovation in Culture". You can check the project website here.
You can watch the seminar that was held in Arta, Greece, discussing the work carried out for the research project "Bessarion" here. (In Greek)
If you find the dataset useful, please consider citing the paper:
@inproceedings{sfikas2024bessarion,
title={{B}essarion: {M}edieval {G}reek inscriptions on a challenging dataset for {V}ision and {NLP} tasks},
author={Sfikas, Giorgos and Dimitrakopoulos, Panagiotis and Retsinas, George and Nikou, Christophoros and Kitsiou, Pinelopi},
booktitle={International Workshop on Document Analysis Systems},
pages={393--407},
year={2024},
organization={Springer}
}