Data Form of the MaLa-ASR #130

zsLin177 · 2024-08-28T02:11:53Z

System Info

torch 2.1

Information

The official example scripts
My own modified scripts

🐛 Describe the bug

bash decode_MaLa-ASR_withkeywords_L95.sh

Hi, I'm currently working on reproducing the results of MaLa-ASR and have downloaded the slidespeech dataset from https://www.openslr.org/144/. While running the provided decoding script, I noticed that it requires the file located at /nfs/yangguanrou.ygr/slidespeech/${split}_oracle_v1/. Could you please clarify what the format of this file is? Do I need to preprocess the downloaded data in any specific way, such as splitting the audio based on timestamps?

Error logs

no file named test_oracle_v1

Expected behavior

Could you please provide the steps for data processing and explain the format of the data? Thanks, looking forward to your reply.

yanghaha0908 · 2024-09-14T09:38:30Z

The location of the SlideSpeech dataset can be changed in the config file "mala_asr_config.py".
You can change "/nfs/yangguanrou.ygr/slidespeech/${split}_oracle_v1/." to your own path.

The dataset requires four files: "my_wav.scp", "utt2num_samples", "text", "hot_related/ocr_1gram_top50_mmr070_hotwords_list"
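As a quick sanity check before running the decoding script, something like the following could confirm that a split directory contains the four files listed above. This is only a sketch: the function name `missing_files` is hypothetical, and it assumes the four files sit directly under the split directory with the relative paths given in this reply.

```python
from pathlib import Path

# File names come from the reply above; their placement directly under
# the split directory is an assumption.
REQUIRED_FILES = [
    "my_wav.scp",
    "utt2num_samples",
    "text",
    "hot_related/ocr_1gram_top50_mmr070_hotwords_list",
]


def missing_files(split_dir):
    """Return the required files that are absent from split_dir."""
    root = Path(split_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Calling `missing_files` on your own `${split}_oracle_v1` directory should return an empty list once the data is prepared.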

"my_wav.scp" is a list of audio paths. We convert the wav files to ark files, so this file looks like
ID1 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:22
ID2 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:90445
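For illustration, each line of this form can be split into an utterance ID, an ark path, and the number after the colon. The sketch below assumes the entries follow the usual Kaldi convention, where that number is a byte offset into the ark file; the helper name `parse_scp_line` is hypothetical.

```python
def parse_scp_line(line):
    """Split 'ID path/to/file.ark:offset' into (utt_id, ark_path, offset).

    Assumes the Kaldi scp convention, where the trailing number is a
    byte offset into the ark file.
    """
    utt_id, location = line.strip().split(maxsplit=1)
    ark_path, offset = location.rsplit(":", 1)
    return utt_id, ark_path, int(offset)
```

This only parses the index; reading the audio itself would need a Kaldi-compatible ark reader.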

To generate this file, download the audio wavs from https://www.openslr.org/144/ and get the time segments from https://slidespeech.github.io/, which provides the segments, transcription text, and OCR results at https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/SlideSpeech/related_files.tar.gz (~1.37GB). You need to segment each wav by the timestamps provided in the segments file.
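The segmentation step could be sketched roughly as follows. This is not the authors' preprocessing script. It assumes the segments file uses the Kaldi layout `<utt_id> <recording_id> <start_sec> <end_sec>` and that the recordings are uncompressed PCM WAV files readable by Python's standard `wave` module. The names `parse_segments` and `cut_segment` are hypothetical.

```python
import wave


def parse_segments(lines):
    """Parse Kaldi-style lines: '<utt_id> <rec_id> <start_sec> <end_sec>'."""
    segments = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        utt_id, rec_id, start, end = parts
        segments.append((utt_id, rec_id, float(start), float(end)))
    return segments


def cut_segment(src_wav, dst_wav, start_sec, end_sec):
    """Copy [start_sec, end_sec) of a PCM WAV file; return frames written."""
    with wave.open(src_wav, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Convert timestamps to frame indices, clamped to the file length.
        start = min(int(round(start_sec * rate)), total)
        end = min(int(round(end_sec * rate)), total)
        count = max(0, end - start)
        src.setpos(start)
        frames = src.readframes(count)
    with wave.open(dst_wav, "wb") as dst:
        # The frame count in the header is corrected when the file closes.
        dst.setparams(params)
        dst.writeframes(frames)
    return count
```

Each parsed segment would then be cut from its recording with `cut_segment`, writing one wav per utterance ID. Converting those wavs to ark files is a separate step not shown here.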

This related_files.tar.gz also provides "text" and a file named "keywords". The file "keywords" corresponds to "hot_related/ocr_1gram_top50_mmr070_hotwords_list", which contains the hotword list.

"utt2num_samples" contains the length of each wav and looks like
ID1 103680
ID2 181600
...
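A file in this shape could be produced from the segmented wavs along these lines. The sketch assumes the second column is the number of samples per channel (as the file name suggests) and that the segments are PCM WAV files; `num_samples` and `utt2num_samples_lines` are hypothetical helper names.

```python
import wave


def num_samples(wav_path):
    """Return the number of frames (samples per channel) in a PCM WAV file."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getnframes()


def utt2num_samples_lines(utt_to_wav):
    """Format 'ID num_samples' lines from a dict of utterance ID -> WAV path."""
    return [f"{utt_id} {num_samples(path)}" for utt_id, path in utt_to_wav.items()]
```

Writing the returned lines to "utt2num_samples", one per line, gives the layout shown above.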

Sorry for the late reply, been busy lately, hope your reproduction goes well!

nuaalixu · 2024-10-09T03:39:27Z

@yanghaha0908 Thank you for your answer. It is strongly recommended that this answer be written into the mala README file.

ddlBoJack assigned yanghaha0908 Aug 28, 2024
