
BembaSpeech Baseline Experiments

This repository contains resources (dataset and notebooks) for reproducing the experiments in the paper BembaSpeech: A Speech Recognition Corpus for the Bemba Language.

If you use any part of the code or data in your work or project, please consider citing:

@InProceedings{sikasote-anastasopoulos:2022:LREC,
  author    = {Sikasote, Claytone  and  Anastasopoulos, Antonios},
  title     = {BembaSpeech: A Speech Recognition Corpus for the Bemba Language},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {7277--7283},
  abstract  = {We present a preprocessed, ready-to-use automatic speech recognition corpus, BembaSpeech, consisting over 24 hours of read speech in the Bemba language, a written but low-resourced language spoken by over 30\% of the population in Zambia. To assess its usefulness for training and testing ASR systems for Bemba, we explored different approaches; supervised pre-training (training from scratch), cross-lingual transfer learning from a monolingual English pre-trained model using DeepSpeech on the portion of the dataset and fine-tuning large scale self-supervised Wav2Vec2.0 based multilingual pre-trained models on the complete BembaSpeech corpus. From our experiments, the 1 billion XLS-R parameter model gives the best results. The model achieves a word error rate (WER) of 32.91\%, results demonstrating that model capacity significantly improves performance and that multilingual pre-trained models transfers cross-lingual acoustic representation better than monolingual pre-trained English model on the BembaSpeech for the Bemba ASR. Lastly, results also show that the corpus can be used for building ASR systems for Bemba language.},
  url       = {https://aclanthology.org/2022.lrec-1.790}
}

1. DeepSpeech Experiments

In this project, we used the DeepSpeech v0.8.2 release for our experiments. We refer the reader to Mozilla DeepSpeech for the latest updates.

Dataset

The data used in this project is a 17-hour portion of the BembaSpeech corpus, consisting of audio files no longer than 10 seconds each, as required by the DeepSpeech input pipeline.

| ID | Dataset | CSV file | No. of utterances | Size | Description |
|----|-------------|-----------|-------------------|---------------|---------------------|
| 1 | training | train.csv | 10,200 | 14 hrs 20 min | Used for training |
| 2 | development | dev.csv | 1,437 | 2 hrs | Used for validation |
| 3 | testing | test.csv | 756 | 1 hr 18 min | Used for testing |
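As a rough illustration of the 10-second limit above, clip duration can be estimated directly from WAV file size, assuming 16 kHz, 16-bit, mono PCM audio (the format DeepSpeech expects). The helper names and sample rows below are hypothetical, not from the corpus:

```python
import csv
import io

# For 16 kHz, 16-bit, mono PCM WAV: ~32,000 bytes of audio per second,
# plus a 44-byte RIFF header.
BYTES_PER_SECOND = 16000 * 2
HEADER_BYTES = 44

def estimated_duration_s(wav_filesize: int) -> float:
    """Approximate clip length in seconds from WAV file size."""
    return max(wav_filesize - HEADER_BYTES, 0) / BYTES_PER_SECOND

def keep_short_clips(rows, max_seconds=10.0):
    """Keep only rows whose estimated duration is within the limit."""
    return [r for r in rows
            if estimated_duration_s(int(r["wav_filesize"])) <= max_seconds]

# Toy CSV with DeepSpeech-style columns (wav_filename, wav_filesize, transcript).
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "clip_a.wav,160044,mwapoleni mukwai\n"   # ~5 s  -> kept
    "clip_b.wav,480044,umo umo\n"            # ~15 s -> dropped
)
rows = list(csv.DictReader(sample))
short = keep_short_clips(rows)
print([r["wav_filename"] for r in short])  # ['clip_a.wav']
```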

Language Model

To create the language models for our experiments, we used two sets of Bemba text: the transcripts (from the train and dev sets), denoted [LM1], and a combination of the transcripts and JW300, denoted [LM2].

You can run and follow the lm.ipynb notebook, which provides the step-by-step process of creating the different N-gram language models with the KenLM tool.
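Conceptually, the first step KenLM performs is collecting padded N-gram counts from the text, which are then smoothed into probabilities. A minimal pure-Python sketch of that counting step (toy sentences for illustration, not KenLM's actual implementation):

```python
from collections import Counter

def ngram_counts(sentences, n=3, bos="<s>", eos="</s>"):
    """Count padded n-grams: the raw statistics an n-gram LM is built from."""
    counts = Counter()
    for sent in sentences:
        # Pad with n-1 start symbols and one end symbol, as KenLM does.
        tokens = [bos] * (n - 1) + sent.split() + [eos]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i : i + n])] += 1
    return counts

# Toy Bemba-like transcripts (illustrative strings, not corpus data).
corpus = ["mwapoleni mukwai", "mwapoleni bonse"]
trigrams = ngram_counts(corpus, n=3)
print(trigrams[("<s>", "<s>", "mwapoleni")])  # 2
```

In practice these counts are produced by KenLM's tooling and smoothed (e.g. with modified Kneser-Ney); the sketch only shows what is being counted.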

Notebooks

In the notebooks folder, you will find notebooks used in the training of the DeepSpeech Bemba ASR model.

  • lm.ipynb - used to create the N-gram language models
  • baseline.ipynb - used to train the baseline for our experiments
  • ft_model.ipynb - used to fine-tune the DeepSpeech English pretrained model without a language model.
  • ftune_5glm_trans.ipynb - used to fine-tune DeepSpeech's English pretrained model with the 5-gram LM scorer (built from the [LM1] Bemba text).
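The models trained by these notebooks are evaluated with word error rate (WER), the metric reported throughout the paper. A minimal sketch of the standard word-level edit-distance WER (the function name is ours; the notebooks may compute it differently):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of three reference words -> WER of 1/3.
print(round(wer("mwapoleni mukwai bonse", "mwapoleni bonse"), 3))
```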

DeepSpeech Bemba Models

You can download the models (both the acoustic model and the scorer) that achieved the best result, a WER of 54.78%.

2. SSL Model (XLS-R) Experiments

The code used to fine-tune the XLS-R models on the BembaSpeech corpus can be found HERE.