BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages

Overview

BhasaAnuvaad is the largest Indic-language automatic speech translation (AST) dataset, spanning over 44,400 hours of speech and 17M text segments across 13 of the 22 scheduled Indian languages, plus English.

This repository contains the code for the pipeline that generates the final dataset in the NeMo format. The pipeline uses the NeMo Forced Aligner to align sentences with their audio chunks, then performs bitext mining: it generates sentence embeddings with the SONAR text encoder and scores candidate sentence pairs by cosine similarity.
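The cosine-similarity scoring step can be illustrated with a minimal sketch. This is not the repository's implementation: the function names `cosine_similarity` and `mine_pairs`, the threshold of 0.8, and the toy 2-dimensional vectors are all assumptions made for illustration. In the actual pipeline the vectors would be SONAR sentence embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def mine_pairs(src_embs, tgt_embs, threshold=0.8):
    """For each source embedding, keep its best-scoring target
    if the score reaches the (hypothetical) threshold."""
    pairs = []
    if not tgt_embs:
        return pairs
    for i, src in enumerate(src_embs):
        scores = [cosine_similarity(src, tgt) for tgt in tgt_embs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy vectors standing in for sentence embeddings (illustrative only).
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [[2.0, 0.0], [0.0, 3.0]]
print(mine_pairs(source, target))
```

The third source vector scores about 0.71 against both targets, so it falls below the assumed threshold and produces no pair.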

Usage

Prerequisites

Conda environment with Python 3.10 installed
CUDA 12.1 support

Clone this repository and set up the environment

git clone https://github.com/AI4Bharat/BhasaAnuvaad.git
cd BhasaAnuvaad
bash setup.sh

Set all the values in config.yaml as shown in sample_pipeline_config.yaml, and generate the input manifest in the format shown in sample_input_manifest.jsonl. When translating from X -> Y, each source language X requires its own config file and input manifest.

Run the pipeline

python3 main.py -c config.yaml
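
Because each source language needs its own config file, it may be convenient to launch the runs from a small driver script. The sketch below is one possible way to do that, not part of the repository: the helper names `build_command` and `run_pipeline` and the config file names are hypothetical. Only the `python3 main.py -c <config>` invocation comes from this README.

```python
import subprocess


def build_command(config_path):
    """Return the documented pipeline invocation for one config file."""
    return ["python3", "main.py", "-c", config_path]


def run_pipeline(config_paths, dry_run=True):
    """Build one command per config; execute them only if dry_run is False."""
    commands = [build_command(path) for path in config_paths]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError on the first failing run.
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical per-source-language config files.
configs = ["config_hi.yaml", "config_ta.yaml"]
for cmd in run_pipeline(configs):
    print(" ".join(cmd))
```

With the default `dry_run=True` the script only prints the commands it would run. Pass `dry_run=False` from the repository root to execute them in sequence.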

Citation

If you use BhasaAnuvaad in your work, please cite us:

@article{jain2024bhasaanuvaad,
  title   = {BhasaAnuvaad: A Speech Translation Dataset for 14 Indian Languages},
  author  = {Sparsh Jain and Ashwin Sankar and Devilal Choudhary and Dhairya Suman and Nikhil Narasimhan and Mohammed Safi Ur Rahman Khan and Anoop Kunchukuttan and Mitesh M Khapra and Raj Dabre},
  year    = {2024},
  journal = {arXiv preprint arXiv: 2411.04699}
}

License

This dataset is released under the CC BY 4.0 license.

Contact

For any questions or feedback, please contact:

Raj Dabre ([email protected])
Sparsh Jain ([email protected])
Ashwin Sankar ([email protected])
Nikhil Narasimhan ([email protected])
Mohammed Safi Ur Rahman Khan ([email protected])
