Goal

Conversational agents are sweet! See Hume's EVI, Call Alice, and Fixie's ai.town. These three know when they are being interrupted, somehow.

My guesses

  • A speaker diarization model
  • A speech activity model
  • DSP in the browser

Results

Speech Activity

Build a dataset

I made a dataset of overlapped speech called grid-overlap from audio in the GRID audiovisual sentence corpus.

How?

  • I recorded six clips of myself saying something I might say to interrupt an assistant. See the filenames in interrupts.
  • I embedded the clips into half of the audio recordings from GRID. See add_overlaps.py.
    1. 50% with interruption, 50% without interruption.
    2. 10% validation, 20% test, 70% train
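The steps above could be sketched roughly as follows. This is not the contents of add_overlaps.py; it is a minimal stdlib-only illustration that assumes "embedding" means additively overlaying the interrupt clip on the base recording, and the file names, `plan_dataset`, and `mix` are hypothetical names made up for the example.

```python
import random


def plan_dataset(recordings, interrupts, seed=0):
    """Assign each recording a label, an optional interrupt clip, and a split.

    Hypothetical helper: half of the recordings get an interrupt,
    and the whole set is split 70% train / 20% test / 10% validation.
    """
    rng = random.Random(seed)
    shuffled = list(recordings)
    rng.shuffle(shuffled)

    n = len(shuffled)
    n_train = n * 7 // 10  # integer arithmetic avoids float rounding
    n_test = n * 2 // 10

    plan = []
    for i, name in enumerate(shuffled):
        overlap = i % 2 == 0  # alternate after shuffling -> 50/50 in every split
        if i < n_train:
            split = "train"
        elif i < n_train + n_test:
            split = "test"
        else:
            split = "validation"
        plan.append({
            "file": name,
            "label": "overlap" if overlap else "no_overlap",
            "interrupt": rng.choice(interrupts) if overlap else None,
            "split": split,
        })
    return plan


def mix(base, clip, offset, gain=1.0):
    """Overlay clip samples onto base starting at a non-negative offset.

    Assumption: overlap is modeled as sample-wise addition. The output
    keeps the length of base; clip samples past the end are dropped.
    """
    out = list(base)
    for j, sample in enumerate(clip):
        k = offset + j
        if k >= len(out):
            break
        out[k] += gain * sample
    return out


# Hypothetical file names, for illustration only.
recordings = [f"rec_{i:03d}.wav" for i in range(100)]
interrupts = [f"interrupt_{i}.wav" for i in range(6)]
plan = plan_dataset(recordings, interrupts)
```

Alternating the label after shuffling keeps each split close to balanced, which matters when accuracy is the metric.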

More info

audio_25k from the GRID audiovisual sentence corpus contains 1k recordings for each of 30 speakers.

Finetune wav2vec2

See this train_overlap Colab notebook

Test inference speed

  • Code for inference is at the end of the above notebook
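  • A finetuned wav2vec2 running on a T4 took about 1.41 seconds to classify a recording shorter than 1 second.
  • This is too slow: classifying takes longer than the audio itself lasts, so it can't keep up with live speech.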
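A small timing harness makes the "too slow" judgment concrete. The sketch below is not the notebook's code: `time_classifier`, `real_time_factor`, and `dummy_classify` are hypothetical names, and the dummy function stands in for the finetuned model so the example runs without a GPU or model weights. It times repeated calls and compares latency to clip length.

```python
import statistics
import time


def time_classifier(classify, audio, runs=7):
    """Return (mean, std dev) of wall-clock seconds over `runs` calls."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        classify(audio)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.pstdev(durations)


def real_time_factor(latency_s, clip_s):
    """Processing time divided by audio duration.

    A value above 1.0 means the classifier is slower than real time.
    """
    return latency_s / clip_s


def dummy_classify(audio):
    # Hypothetical stand-in for the finetuned model; swap in the real
    # classifier callable to measure actual latency.
    return [{"label": "overlap", "score": 0.5}]


mean_s, std_s = time_classifier(dummy_classify, [0.0] * 16000)
print(f"{mean_s:.6f} s +/- {std_s:.6f} s per call")

# With the figure reported above: 1.41 s to classify at most 1 s of audio.
rtf = real_time_factor(1.41, 1.0)
print(f"real-time factor >= {rtf:.2f}")
```

Since the recording was shorter than 1 second, 1.41 is a lower bound on the real-time factor; anything above 1.0 falls behind a live stream.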

DSP in the browser

See vui.

archive

Mission: WavSurfer

wav2vec in the browser

Q: How long does it take to

  1. feature-extract 1 sec of audio
  2. classify that audio as overlap or not overlap

A: 1.41 s ± 318 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

^^ With a T4 in Python, using the pipeline abstraction:

from transformers import pipeline

path_to_model = "/content/drive/MyDrive/wav2vec2-base-detect-overlap"
classifier = pipeline("audio-classification", model=path_to_model)

Recon: Is there a wav2vec model that can be used in the browser? Yes. Might be worth learning the time it takes your browser to classify audio.

Fear: Is tokenizing input going to take a long time? Maybe not, cuz browsers already ship speech recognition (ASR).

Known Unknowns

  • Dataset is in 44 kHz
  • Same speaker is used for every interrupt
  • There are only 6 interrupt variants. These are likely easily learned