BhasaAnuvaad is the largest Indic-language automatic speech translation (AST) dataset, spanning over 44,400 hours of speech and 17M text segments across 13 of the 22 scheduled Indian languages and English.
This repository contains the code for the pipeline that generates the final dataset in the NeMo format. The pipeline uses the NeMo Forced Aligner to align sentences with their audio chunks, then performs bitext mining by generating embeddings with the SONAR text encoder and computing cosine similarity scores.
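To make the scoring step concrete, here is a minimal sketch of cosine-similarity-based pair mining. It is not the repository's implementation: it assumes the sentence embeddings have already been produced by a text encoder, and the function names, the best-match strategy, and the `threshold` value are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def mine_pairs(src_embeddings, tgt_embeddings, threshold=0.8):
    """Pair each source embedding with its best-scoring target embedding.

    Returns (source_index, target_index, score) tuples for pairs whose
    score is at least `threshold` (an illustrative value, not one taken
    from the pipeline).
    """
    pairs = []
    if not tgt_embeddings:
        return pairs
    for i, src in enumerate(src_embeddings):
        scores = [cosine_similarity(src, tgt) for tgt in tgt_embeddings]
        best_j = max(range(len(scores)), key=scores.__getitem__)
        if scores[best_j] >= threshold:
            pairs.append((i, best_j, scores[best_j]))
    return pairs
```

In practice the embeddings would be high-dimensional vectors and the scoring would be vectorized on a GPU; the pure-Python version above only shows the logic.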
- Conda Environment with Python 3.10 installed
- Support for CUDA 12.1
- Clone this repository and set up the environment
git clone https://github.com/AI4Bharat/BhasaAnuvaad.git
cd BhasaAnuvaad
bash setup.sh
- Set all the values in config.yaml as specified in the sample_pipeline_config.yaml file, and generate the input manifest in the format specified in sample_input_manifest.jsonl. When translating from a source language X to a target language Y (X -> Y), one config file and one input manifest are required for each X.
- Run the pipeline
python3 main.py -c config.yaml
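As a rough illustration of the manifest step, the sketch below writes and reads a JSON Lines file, one JSON object per line. The field names `audio_filepath` and `text` follow the usual NeMo manifest convention and are an assumption here; the authoritative schema is the one in sample_input_manifest.jsonl, and the helper names are hypothetical.

```python
import json


def write_manifest(entries, path):
    """Write one JSON object per line (JSON Lines) to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            # ensure_ascii=False keeps Indic-script text human-readable.
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_manifest(path):
    """Read a JSON Lines manifest back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Hypothetical entries; replace the fields with those required by
    # sample_input_manifest.jsonl.
    entries = [
        {"audio_filepath": "audio/hi/clip_0001.wav", "text": "नमस्ते"},
        {"audio_filepath": "audio/hi/clip_0002.wav", "text": "धन्यवाद"},
    ]
    write_manifest(entries, "input_manifest_hi.jsonl")
    print(read_manifest("input_manifest_hi.jsonl"))
```

Since one manifest is needed per source language, a separate file such as the hypothetical input_manifest_hi.jsonl above would be generated for each X.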
If you use BhasaAnuvaad in your work, please cite us:
@article{jain2024bhasaanuvaad,
title = {BhasaAnuvaad: A Speech Translation Dataset for 14 Indian Languages},
author = {Sparsh Jain and Ashwin Sankar and Devilal Choudhary and Dhairya Suman and Nikhil Narasimhan and Mohammed Safi Ur Rahman Khan and Anoop Kunchukuttan and Mitesh M Khapra and Raj Dabre},
year = {2024},
journal = {arXiv preprint arXiv:2411.04699}
}
This dataset is released under the CC BY 4.0 license.
For any questions or feedback, please contact:
- Raj Dabre ([email protected])
- Sparsh Jain ([email protected])
- Ashwin Sankar ([email protected])
- Nikhil Narasimhan ([email protected])
- Mohammed Safi Ur Rahman Khan ([email protected])