
The fine-tuning script run_speech_recognition_seq2seq_streaming.py uses interleave_datasets, which will truncate the train split #92

Open
cahya-wirawan opened this issue Dec 13, 2022 · 2 comments

@cahya-wirawan

The fine-tuning script run_speech_recognition_seq2seq_streaming.py uses interleave_datasets to combine the train and validation splits. But I think what we really want is concatenate_datasets, because according to the docs, interleave_datasets stops as soon as one of the source datasets runs out of examples (the default "first_exhausted" mode).
For example, if the train split has 100 entries and the validation split has 10 entries, the result contains only 10 entries from the validation split interleaved with 10 entries from the train split. That means we waste most of the existing train split.

For example:

>>> from datasets import Dataset, interleave_datasets, concatenate_datasets
>>> d1 = Dataset.from_dict({"a": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12]})
>>> print(interleave_datasets([d1, d2])['a'])
[0, 10, 1, 11, 2, 12]
>>> print(concatenate_datasets([d1, d2])['a'])
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
@sanchit-gandhi (Contributor) commented Dec 7, 2023

We need to use interleave_datasets for streaming datasets. Here we do not know the length of each dataset a priori, so we mix them on the fly based on the sampling probabilities that we define, potentially truncating individual datasets as soon as we have completely iterated over one of them (see "stopping strategies" in the docs).

Whereas we use concatenate_datasets for non-streaming datasets, since we know the length of each dataset a priori and can therefore mix them in their entirety. See the docs.
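
If truncation is the concern even in streaming mode, interleave_datasets also accepts stopping_strategy="all_exhausted", which resamples the shorter datasets instead of stopping at the first exhausted one. A minimal sketch, assuming a datasets version recent enough to provide Dataset.to_iterable_dataset and the stopping_strategy argument:

>>> from datasets import Dataset, interleave_datasets
>>> d1 = Dataset.from_dict({"a": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}).to_iterable_dataset()
>>> d2 = Dataset.from_dict({"a": [10, 11, 12]}).to_iterable_dataset()
>>> # "all_exhausted" keeps drawing until every dataset has been fully seen,
>>> # oversampling the shorter one rather than truncating the longer one
>>> mixed = interleave_datasets([d1, d2], stopping_strategy="all_exhausted")
>>> sorted(set(x["a"] for x in mixed))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]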

@sanchit-gandhi (Contributor)

Ideally, this is the kind of logic that we want to implement, borrowed from the Distil-Whisper training code: https://github.com/huggingface/distil-whisper/blob/914dcdf3919552d5a3826a9d5db99b059ddcc16e/training/run_distillation.py#L600
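
In spirit, that logic reduces to choosing the combination strategy based on the loading mode. The following is a rough sketch under that assumption, not the actual Distil-Whisper implementation; the helper name and signature are hypothetical:

from datasets import concatenate_datasets, interleave_datasets

def combine_splits(splits, streaming, probabilities=None, seed=None):
    """Combine several dataset splits into one training set (hypothetical helper).

    Streaming datasets have unknown length, so they are mixed on the fly with
    interleave_datasets; non-streaming datasets are simply concatenated in full.
    """
    if len(splits) == 1:
        return splits[0]
    if streaming:
        # "all_exhausted" avoids truncating the longer splits (see above)
        return interleave_datasets(
            splits,
            probabilities=probabilities,
            seed=seed,
            stopping_strategy="all_exhausted",
        )
    return concatenate_datasets(splits)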
