A text and imaging dataset of Byzantine-era Medieval Greek inscriptions.

Archaeocomputers/Bessarion

Bessarion: Medieval Greek Inscriptions on a challenging dataset for Vision and NLP tasks

We present a text and imaging dataset of Byzantine-era Medieval Greek inscriptions, suitable as a challenging testbed for Computer Vision and Natural Language Processing tasks. The lack of sizable related training sets, together with difficulties stemming from the historical character and content of the inscriptions (natural wear of characters, systematic misspellings, etc.), makes for a context where modern resource-hungry techniques are not straightforward to apply. In the paper, we describe the dataset contents (images, geometric and text annotations, metadata) and discuss baselines for two Computer Vision tasks (Inscription Detection, Text Recognition) and one Natural Language Processing task (Word Classification).

About

Dataset outline


Example images from the dataset, out of a total of 122 images. The images depict historical donative Byzantine inscriptions, which list the persons or groups that contributed to the construction of a related site or monument. The text is written in the Greek language.



Example forms of character images found in the dataset. We provide annotations in multiple levels; there are a total of 504 outlined textlines, 2,776 outlined words and 10,414 outlined characters.



Map showing positions of the sites relevant to Bessarion data. Our dataset comes from a total of 37 locations spanning the region of Epirus, situated in North-Western Greece.


Inscription detection


Example ground-truth annotations for selected samples.


| Method | AP | AP70 | AP50 | Parameter size |
|---|---|---|---|---|
| Sparse R-CNN | 56% | 82% | 63% | 105.94 M |
| Q-GAN + CC | 37% | 62% | 49% | 1.6 M |

Numerical comparison of baseline detectors. Detection accuracy in terms of mean average precision and average precision at different IoU thresholds is reported. Network sizes are cited for comparison.
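As a reminder of what an IoU threshold means in the table above, here is a minimal sketch of the overlap test that decides whether a predicted inscription region counts as a match. It assumes axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form, which is a simplification: the function names and box format are illustrative and are not taken from the paper's evaluation code.

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this format is an assumption
    made for the sketch, not the dataset's annotation format.
    """
    # Corners of the intersection rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """A prediction matches a ground-truth box if IoU reaches the threshold."""
    return box_iou(pred, gt) >= threshold


if __name__ == "__main__":
    pred, gt = (0, 0, 10, 10), (5, 0, 15, 10)
    print(box_iou(pred, gt))        # overlap of 50 over a union of 150
    print(is_match(pred, gt, 0.5))
```

Average precision at a given threshold is then computed over all predictions ranked by confidence; that ranking step is omitted here.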

Text Recognition


Examples of recognition challenges posed by the nature of Byzantine text painting.



Example of three different types of text annotations. From left to right: Line-level, word-level, character-level annotation.



Example text lines with the ground-truth recognition label (Original) and the HTR model output (Predicted). A WER of 6.1% is achieved using the HTR model of [Retsinas et al., International Workshop on Document Analysis Systems (DAS) 2022].
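For readers who want to score their own recognizer against the line-level transcriptions, the following is a minimal sketch of word error rate (WER) as word-level edit distance divided by the number of reference words. It assumes whitespace tokenization and no text normalization; the evaluation protocol behind the reported figure may differ, and the function names are illustrative.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word sequences (unit costs)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER over paired reference/predicted text lines."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        total_errors += word_edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    return total_errors / total_words if total_words else 0.0


if __name__ == "__main__":
    # Toy placeholder strings, not actual dataset transcriptions.
    print(wer(["a b c d"], ["a x c d"]))  # one substitution in four words
```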


Word Classification

We are interested in answering:

  • which person donated or contributed to the specific monument,
  • when the monument was constructed.

For our baseline NLP method, we aim to answer these questions via a word classification task (is this word related to the founder / dating of the monument?).


Sample NLP task ground truth file in JSON format.


As medieval Greek is poorly represented in terms of accessible digitized corpora, we combine a (modern Greek) BERT encoder-based model with question-specific corpus augmentation methods. The baseline method achieves a correct classification ratio ranging from 75% to 99%.

Data

Imaging data

  • Image data + PageXML annotations can be found under the folder dataset/image_resources. Each folder contains data coming from a single location (= village or monastery). Data from the same location, but from a different (neighbouring) monument, may be contained in a separate subfolder.
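A minimal sketch of reading one PageXML annotation with the Python standard library follows. It assumes the files use the usual PAGE element names (`TextLine`, `Word`, `Glyph`) and the `Coords points="x,y x,y ..."` outline format; whether character outlines are stored as `Glyph` elements in this dataset is an assumption, and the embedded sample is a toy document, not a file from the repository. Tags are matched by local name, so the sketch does not depend on a particular PAGE schema version.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy PageXML document for illustration only (not taken from the dataset).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Coords points="10,10 200,10 200,40 10,40"/>
        <Word id="w1">
          <Glyph id="g1"/>
          <Glyph id="g2"/>
        </Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def count_page_elements(xml_text):
    """Count text lines, words and glyphs in one PageXML document."""
    root = ET.fromstring(xml_text)
    counts = Counter(local_name(el.tag) for el in root.iter())
    return {name: counts.get(name, 0) for name in ("TextLine", "Word", "Glyph")}


def parse_points(points):
    """Turn a Coords 'points' attribute into a list of (x, y) tuples."""
    return [tuple(float(v) for v in pair.split(",")) for pair in points.split()]


if __name__ == "__main__":
    print(count_page_elements(SAMPLE))
    print(parse_points("10,10 200,10 200,40 10,40"))
```

To process a file on disk instead of a string, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.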

Natural Language data

  • Data related to the NLP task includes the full transcription for a number of inscriptions, plus metadata related to the content of each inscription. These are found under folder dataset/nlp_resources.

Other info

Historical note

The title of the dataset is an homage to the Greek scholar Bessarion.

"Bessarion (Greek: Βησσαρίων; 2 January 1403 - 18 November 1472) was a Byzantine Greek Renaissance humanist, theologian, Catholic cardinal and one of the famed Greek scholars who contributed to the so-called great revival of letters in the 15th century. He was educated by Gemistus Pletho in Neoplatonic philosophy and later served as the titular Latin Patriarch of Constantinople. He eventually was named a cardinal and was twice considered for the papacy" (text from Wikipedia)

Acknowledgments

This dataset has been collected and processed in the context of the research project "Bessarion", funded through the National/EU call "Open Innovation in Culture". You can check the project website here.

Bessarion project seminar

You can watch the seminar that was held in Arta, Greece, discussing the work carried out for the research project "Bessarion" here. (In Greek)

Paper citation

If you find the dataset useful, please consider citing the paper:

@inproceedings{sfikas2024bessarion,
  title={{B}essarion: {M}edieval {G}reek inscriptions on a challenging dataset for {V}ision and {NLP} tasks},
  author={Sfikas, Giorgos and Dimitrakopoulos, Panagiotis and Retsinas, George and Nikou, Christophoros and Kitsiou, Pinelopi},
  booktitle={International Workshop on Document Analysis Systems},
  pages={393--407},
  year={2024},
  organization={Springer}
}