Bessarion: Medieval Greek Inscriptions on a challenging dataset for Vision and NLP tasks

We present a text and imaging dataset of Byzantine-era Medieval Greek inscriptions, suitable as a challenging testbed for Computer Vision and Natural Language Processing tasks. The lack of sizable related training sets, together with difficulties stemming from the historical character and content of the inscriptions (natural wear of characters, systematic misspellings, etc.), makes for a context where modern resource-hungry techniques are not straightforward to apply. In the paper, we describe the dataset contents -- images, geometric and text annotation, metadata -- and discuss baselines for three Computer Vision tasks (Inscription Detection, Text Recognition) and one Natural Language Processing task (Word Classification).

About

Dataset outline

Example images from the dataset, out of a total of 122 images. The images depict historical Byzantine donative inscriptions, which list the persons or groups that contributed to the construction of a related site or monument. The text is written in Greek.

Example forms of character images found in the dataset. We provide annotations at multiple levels: there are a total of 504 outlined text lines, 2,776 outlined words and 10,414 outlined characters.

Map showing the positions of the sites relevant to the Bessarion data. Our dataset comes from a total of 37 locations across the region of Epirus, in North-Western Greece.

Inscription detection

Example ground-truth annotations for selected samples.

Method	AP	AP₇₀	AP₅₀	Parameter Size
Sparse R-CNN	56%	82%	63%	105.94 M
Q-GAN + CC	37%	62%	49%	1.6 M

Numerical comparison of baseline detectors. Detection accuracy is reported in terms of mean average precision (AP) and average precision at different IoU thresholds (AP₇₀, AP₅₀). Network sizes (number of parameters) are listed for comparison.
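For readers less familiar with the metric: under the usual convention, a predicted region counts as a correct detection for AP₅₀ when its intersection-over-union (IoU) with a ground-truth region is at least 0.5, and for AP₇₀ when it is at least 0.7. The snippet below is a minimal sketch of the IoU computation for axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box format and the function name `box_iou` are assumptions made for illustration; they are not part of the dataset tooling or of the evaluation code used in the paper.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max).

    Hypothetical helper for illustration only; the box format is an assumption.
    """
    # Corners of the intersection rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])

    # Clamp to zero so that disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0
```

A detection with `box_iou(pred, gt) >= 0.7` would then be a match at the stricter of the two thresholds reported in the table.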

Text Recognition

Examples of recognition challenges posed by the nature of Byzantine text painting.

Example of three different types of text annotations. From left to right: Line-level, word-level, character-level annotation.

Example text lines with the ground-truth transcription (Original) and the output of the HTR model (Predicted). A WER of 6.1% is achieved using the HTR model of [Retsinas et al., International Workshop on Document Analysis Systems (DAS) 2022].
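As a reminder of what the reported figure measures, word error rate (WER) is the word-level edit distance between the predicted and the ground-truth transcription, divided by the number of ground-truth words. The following is a minimal sketch, assuming whitespace tokenization and no text normalization; the exact evaluation protocol behind the reported number is not specified here, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch: tokens are obtained by splitting on whitespace,
    which is an assumption and may differ from the paper's protocol.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

For a corpus-level figure, the edit distances and reference lengths are typically summed over all lines before dividing, rather than averaging per-line rates.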

Word Classification

We are interested in answering:

which person donated or contributed to the specific monument,
when the monument was constructed.

For our baseline NLP method, we aim to answer these questions via a word classification task: is a given word related to the founder or to the dating of the monument?

Sample NLP task ground truth file in JSON format.

As medieval Greek is poorly represented in terms of accessible digitized corpora, we combine a (modern Greek) BERT encoder-based model with question-specific corpus augmentation methods. The baseline method achieves a correct classification ratio ranging from 75% to 99%.

Data

Imaging data

Image data and PageXML annotations can be found under the folder dataset/image_resources. Each folder contains data from a single location (a village or monastery). Data from the same location but a different (neighbouring) monument may be contained in a separate subfolder.
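The folder layout described above can be traversed with a few lines of Python. The sketch below counts annotated elements per location, under several assumptions that should be checked against the actual files: that annotation files carry an `.xml` extension, that the first folder level under `dataset/image_resources` is the location, and that text lines, words and characters are stored as the standard PAGE elements `TextLine`, `Word` and `Glyph`. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

# PAGE element names assumed to hold line-, word- and character-level outlines.
ANNOTATION_LEVELS = ("TextLine", "Word", "Glyph")


def local_name(tag: str) -> str:
    """Strip the XML namespace, so any PAGE schema version is accepted."""
    return tag.rsplit("}", 1)[-1]


def count_page_elements(xml_path: Path) -> Counter:
    """Count TextLine / Word / Glyph elements in one PageXML file."""
    counts = Counter()
    for elem in ET.parse(xml_path).iter():
        name = local_name(elem.tag)
        if name in ANNOTATION_LEVELS:
            counts[name] += 1
    return counts


def count_by_location(root) -> dict:
    """Aggregate counts per top-level folder (assumed to be one location)."""
    root = Path(root)
    totals = {}
    # rglob also descends into per-monument subfolders of a location.
    for xml_path in sorted(root.rglob("*.xml")):
        location = xml_path.relative_to(root).parts[0]
        totals.setdefault(location, Counter()).update(count_page_elements(xml_path))
    return totals
```

Calling `count_by_location("dataset/image_resources")` from the repository root would return one `Counter` per location folder, which can be summed and compared against the totals quoted in the dataset outline.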

Natural Language data

Data related to the NLP task include the full transcription of a number of inscriptions, plus metadata related to the content of each inscription. These are found under the folder dataset/nlp_resources.

Other info

Historical note

The title of the dataset is an homage to the Greek scholar Bessarion.

"Bessarion (Greek: Βησσαρίων | 2 January 1403 - 18 November 1472) was a Byzantine Greek Renaissance humanist, theologian, Catholic cardinal and one of the famed Greek scholars who contributed to the so-called great revival of letters in the 15th century. He was educated by Gemistus Pletho in Neoplatonic philosophy and later served as the titular Latin Patriarch of Constantinople. He eventually was named a cardinal and was twice considered for the papacy" (text from Wikipedia)

Acknowledgments

This dataset has been collected and processed in the context of the research project "Bessarion", funded through the National/EU call "Open Innovation in Culture". You can check the project website here.

Bessarion project seminar

You can watch the seminar that was held in Arta, Greece, discussing the work carried out for the research project "Bessarion" here. (In Greek)

Paper citation

If you find the dataset useful, please consider citing the paper:

@inproceedings{sfikas2024bessarion,
  title={{B}essarion: {M}edieval {G}reek inscriptions on a challenging dataset for {V}ision and {NLP} tasks},
  author={Sfikas, Giorgos and Dimitrakopoulos, Panagiotis and Retsinas, George and Nikou, Christophoros and Kitsiou, Pinelopi},
  booktitle={International Workshop on Document Analysis Systems},
  pages={393--407},
  year={2024},
  organization={Springer}
}
