idiap/translation-aided-slu

This is the reference code for the paper The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation, accepted to Findings of EMNLP 2023. Pretrained models are available on Hugging Face.

The datasets developed in the paper have also been released. See SLURP-Fr/Es (synthetic portion), SLURP-Fr (real portion), and Spoken Gigaword.

Dataset preparation

Please run the corresponding scripts in corpora/ to prepare the datasets, replacing the paths in the scripts with the locations where you downloaded and extracted the data. For pretraining, run covost2.py for CoVoST2; since the MuST-C and TEDx datasets share the same structure, run mustc.py for both of them with different db_path and tmp_path. Note that, as we use ASR/ST as a pretraining task, these datasets are further cleaned. Then run minds14.py for MINDS-14, slurp.py for SLURP, and nmsqa.py for NMSQA. Scripts for the newly released benchmarks are also provided: massive.py synthesizes SLURP-Fr and gigawords.py synthesizes Spoken Gigaword; both require the Google TTS environment. A sketch of the full sequence is shown below.
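
For example, the whole preparation might look like the following, assuming each script is run from the repository root after its paths (e.g. db_path and tmp_path) have been edited; the exact invocation of each script may differ.

# Pretraining corpora (further cleaned for ASR/ST)
python corpora/covost2.py      # CoVoST2
python corpora/mustc.py        # MuST-C
python corpora/mustc.py        # TEDx: same script, rerun with the TEDx db_path/tmp_path
# Downstream corpora
python corpora/minds14.py      # MINDS-14
python corpora/slurp.py        # SLURP
python corpora/nmsqa.py        # NMSQA
# Newly released benchmarks (require the Google TTS environment)
python corpora/massive.py      # SLURP-Fr
python corpora/gigawords.py    # Spoken Gigaword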

Pretraining

To pretrain the model with ST, run

python train.py --model-dir=/temp/run/st --data-dir=/temp/data --accumulation_steps=5 --hparams=warmup_steps=10000,eval_filter_samples=True,input_length_final_lower_bound=1600,input_length_final_upper_bound=320000,filter_by_charset=True,use_src_lang_embed=False,batch_frame_limit=7e5,batch_quad_frame_limit=7e10 --src_lang=en:fr --tgt_lang=fr:en --datasets=mustc:covost2 --max_retry=100 --max_steps=130000

To evaluate the pretrained model on the test split of the cleaned datasets, run

python infer.py --model-path=/temp/run/st --output-path=/temp/run/st/test_results --data-dir=/temp/data --datasets=mustc:covost2 --hparams=eval_filter_samples=True --include_steps=120000 --eval_meta=meta.test.txt:meta.test.txt

Similarly, for ASR pretraining, run

python train.py --model-dir=/temp/run/asr --data-dir=/temp/data --accumulation_steps=5 --hparams=warmup_steps=10000,eval_filter_samples=True,input_length_final_lower_bound=1600,input_length_final_upper_bound=320000,filter_by_charset=True,data_format=Il__as,use_src_lang_embed=False,batch_frame_limit=7e5,batch_quad_frame_limit=7e10 --src_lang=en:fr --tgt_lang=fr:en --datasets=mustc:covost2 --max_retry=100 --max_steps=130000

For ST+ASR pretraining, run

python train.py --model-dir=/temp/run/joint --data-dir=/temp/data --accumulation_steps=5 --hparams=warmup_steps=10000,eval_filter_samples=True,input_length_final_lower_bound=1600,input_length_final_upper_bound=320000,filter_by_charset=True,data_format=Il__as:Il__as:Iltbas:Iltbas,use_src_lang_embed=False,batch_frame_limit=7e5,batch_quad_frame_limit=7e10 --src_lang=en:fr --tgt_lang=fr:en --datasets=mustc:covost2:mustc:covost2 --max_retry=100 --max_steps=10000

Downstream tasks

Below are example commands for running experiments on the downstream tasks.

For fine-tuning the ST-pretrained model on SLURP,

python train.py --model-dir=/temp/run/slurp/st --data-dir=/temp/data/slurp_full --train_meta=/temp/data/slurp_full/meta.train.syn.txt --accumulation_steps=4 --hparams=warmup_steps=10000,eval_filter_samples=True,input_length_final_lower_bound=1600,input_length_final_upper_bound=320000,filter_by_charset=True,data_format=nltLa,use_classifier=True,freeze_module=asr_model:decoder:adaptor,freeze_steps=10000,classifier_keep_layer_from=1,classifier_keep_layer_to=4,classifier_position_layer=12,decoder_remove_decoder=True,max_eval_samples=1000,use_src_lang_embed=False,freeze_feature_encoder=False,classifier_pooling=mean --restore_from=/temp/run/st/model.ckpt-120000 --reset_training --max_retry=100
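
To test this model, the SLURP test command shown below for the joint model should presumably work with the model path swapped in (a sketch; paths are hypothetical):

python infer.py --model-path=/temp/run/slurp/st --output-path=/temp/run/slurp/st/test_results/ --data-dir=/temp/data --datasets=slurp_full --hparams=data_format=nltLa,eval_filter_samples=False --eval_meta=meta.test.txt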

For joint ST-SLURP training,

python train.py --model-dir=/temp/run/slurp/st_joint --data-dir=/temp/data --accumulation_steps=6 --hparams=warmup_steps=10000,eval_filter_samples=True,input_length_final_lower_bound=1600,input_length_final_upper_bound=320000,filter_by_charset=True,use_classifier=True,data_format=Iltbas:Iltbas:nltLa,data_groups=2:1,classifier_keep_layer_from=1,classifier_keep_layer_to=4,classifier_position_layer=12,data_group_ratio=1:3,use_src_lang_embed=False,freeze_feature_encoder=False,classifier_pooling=mean --restore_from=/temp/run/st/model.ckpt-120000 --reset_training --max_retry=100 --src_lang=en:fr:u --tgt_lang=fr:en:u --datasets=covost2:mustc:slurp_full --train_meta=::meta.train.syn.txt

To test it,

python infer.py --model-path=/temp/run/slurp/st_joint --output-path=/temp/run/slurp/st_joint/test_results/ --data-dir=/temp/data --datasets=slurp_full --hparams=data_format=nltLa,eval_filter_samples=False --eval_meta=meta.test.txt

It can be further fine-tuned on SLURP-Fr by running

python train.py --model-dir=/temp/run/slurp_fr/slurp_st_joint --data-dir=/temp/data/slurp_fr --accumulation_steps=4 --hparams=warmup_steps=5000,decay_steps=50000,eval_filter_samples=False,input_length_final_lower_bound=1600,input_length_final_upper_bound=320000,filter_by_charset=True,data_format=nltLa,use_classifier=True,freeze_module=asr_model:decoder:adaptor,freeze_steps=5000,classifier_keep_layer_from=1,classifier_keep_layer_to=4,classifier_position_layer=12,decoder_remove_decoder=True,use_src_lang_embed=False,freeze_feature_encoder=False,classifier_pooling=mean,data_warmup_steps=5000 --restore_from=/temp/run/slurp/st_joint/model.ckpt-150000 --reset_training --max_retry=100 --max_steps=50000 --eval_interval=1000 --checkpoint_interval=5000
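
The fine-tuned SLURP-Fr model can then be evaluated analogously to SLURP; the following is a sketch that assumes the dataset directory is registered as slurp_fr (flags may need adjusting):

python infer.py --model-path=/temp/run/slurp_fr/slurp_st_joint --output-path=/temp/run/slurp_fr/slurp_st_joint/test_results/ --data-dir=/temp/data --datasets=slurp_fr --hparams=data_format=nltLa,eval_filter_samples=False --eval_meta=meta.test.txt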

For MINDS-14,

python train.py --model-dir=/temp/run/minds14/st --data-dir=/temp/data/minds14 --accumulation_steps=4 --hparams=warmup_steps=1000,decay_steps=20000,eval_filter_samples=False,input_length_final_lower_bound=1600,input_length_final_upper_bound=480000,filter_by_charset=True,data_format=nltLa,use_classifier=True,freeze_module=asr_model:decoder:adaptor,freeze_steps=1000,classifier_keep_layer_from=1,classifier_keep_layer_to=4,classifier_position_layer=12,decoder_remove_decoder=True,max_eval_samples=1000,use_src_lang_embed=False,freeze_feature_encoder=False,classifier_pooling=mean,classifier_num_targets=1,classifier_num_labels=14,data_warmup_steps=2000 --max_retry=100 --max_steps=20000 --reset_training --restore_from=/temp/run/st/model.ckpt-120000 --eval_interval=250 --checkpoint_interval=500
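
To evaluate on the MINDS-14 test split, a command modeled on the SLURP test command should work; the dataset name and hparams below are assumptions:

python infer.py --model-path=/temp/run/minds14/st --output-path=/temp/run/minds14/st/test_results/ --data-dir=/temp/data --datasets=minds14 --hparams=data_format=nltLa,eval_filter_samples=False --eval_meta=meta.test.txt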

For Spoken Gigaword,

python train.py --model-dir=/temp/run/gigaword/st_joint --data-dir=/temp/data --src_lang=en:fr --tgt_lang=fr:en:en-sum --datasets=mustc:covost2:gigaword --accumulation_steps=5 --hparams=warmup_steps=2000,eval_filter_samples=True,input_length_final_lower_bound=1600,input_length_final_upper_bound=320000,filter_by_charset=True,data_format=Iltbas:Iltbas:Iltbas,data_groups=2:1,freeze_module=asr_model,freeze_steps=2000,max_eval_samples=1000,data_group_ratio=1:1,use_src_lang_embed=False,freeze_feature_encoder=False,batch_frame_limit=7e5,batch_quad_frame_limit=7e10,decoder_dropout=0.2,data_warmup_steps=2000 --max_retry=100 --reset_training --restore_from=/temp/run/st/model.ckpt-120000 --eval_interval=500 --checkpoint_interval=2000 --max_steps=16000 --eval_meta=.:.:meta.dev.txt
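
To evaluate the summarization model on Spoken Gigaword, an invocation modeled on the pretraining evaluation command should work (a sketch; the meta file name is an assumption):

python infer.py --model-path=/temp/run/gigaword/st_joint --output-path=/temp/run/gigaword/st_joint/test_results/ --data-dir=/temp/data --datasets=gigaword --eval_meta=meta.test.txt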

For NMSQA,

python train.py --model-dir=/temp/run/nmsqa/st --data-dir=/temp/data --datasets=nmsqa --accumulation_steps=2 --hparams=warmup_steps=5000,eval_filter_samples=False,input_length_final_lower_bound=1600,input_length_final_upper_bound=1440000,filter_by_charset=True,use_classifier=True,data_format=DlStLa,freeze_module=asr_model:adaptor:decoder.model.encoder.layers.0.:decoder.model.encoder.layers.1.:decoder.model.encoder.layers.2.:decoder.model.encoder.layers.3.:decoder.model.encoder.layers.4.:decoder.model.encoder.layers.5.:decoder.model.encoder.layers.6.,freeze_steps=200000,decoder_remove_decoder=True,classifier_keep_layer_from=1,classifier_keep_layer_to=4,classifier_position_layer=12,use_src_lang_embed=False,freeze_feature_encoder=False,batch_size=32,batch_frame_limit=6e6,batch_quad_frame_limit=6e11,classifier_type=longformer,classifier_name=allenai/longformer-large-4096,classifier_head_type=qa,classifier_hidden_size=1024,data_warmup_steps=5000,qa_segment_cls=True,qa_balance_factor=1.0,qa_label_type=sample --reset_training --restore_from=/temp/run/st/model.ckpt-120000 --eval_meta=meta.dev.ex.ds.300.txt --train_meta=meta.train.ex.txt --checkpoint_interval=2000 --eval_interval=2000 --max_retry=100

This only evaluates on part of the dev split each time, for efficiency. For full evaluation on the dev split,

python infer.py --model-path=/temp/run/nmsqa/st --output-path=/temp/run/nmsqa/st/dev_results/ --data-dir=/temp/data --datasets=nmsqa --eval_meta=meta.dev.ex.txt
