Upload Very Attentive Tacotron video and add it to demo page
JulianSlzr committed Nov 1, 2024
1 parent 7d41ebf commit e7db686
Showing 2 changed files with 11 additions and 6 deletions.
17 changes: 11 additions & 6 deletions publications/very_attentive_tacotron/index.html
@@ -81,6 +81,11 @@
<h1>Audio samples from "Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech"</h1>
</header>
</article>
+<div align="center">
+<video width="540" controls>
+<source src="https://github.com/google/tacotron/raw/refs/heads/master/publications/very_attentive_tacotron/video/vat_demos.mp4">
+</video>
+</div>
<div><p><b>Paper:</b> <a href="https://arxiv.org/abs/2410.22179">arXiv</a></p></div>
<div><p><b>Authors:</b> Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, David Kao</p></div>
<div><p><b>Abstract:</b> Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backprop and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self and cross-attention operations. A system incorporating these improvements, which we call <i>Very Attentive Tacotron</i>, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.</p></div>
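As an aside on the embed added above (not part of the commit): the `<source>` element carries no `type` attribute and the `<video>` element has no fallback content, so browsers that cannot play the file fail silently. A more defensive version of the same embed might look like this — a sketch only, using the same URL from the diff:

```html
<div align="center">
  <video width="540" controls preload="metadata">
    <source src="https://github.com/google/tacotron/raw/refs/heads/master/publications/very_attentive_tacotron/video/vat_demos.mp4"
            type="video/mp4">
    <!-- Shown only when the browser cannot play the video -->
    Your browser does not support embedded video.
  </video>
</div>
```

Declaring `type="video/mp4"` lets the browser skip sources it cannot decode without downloading them, and the fallback text gives non-supporting clients something to render.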
@@ -118,12 +123,12 @@ <h1>Audio samples from "Very Attentive Tacotron: Robust and Unbounded Length Gen
<p class="toc_title">Contents</p>
<div id="toc_container">
<ul>
-<li><a href="#test-set-samples:-lessac-voice"> 1. Test Set Samples: Lessac Voice
-<li><a href="#test-set-samples:-libritts"> 2. Test Set Samples: LibriTTS
-<li><a href="#generalization-to-long-utterances:-lessac-voice"> 3. Generalization to Long Utterances: Lessac Voice
-<li><a href="#generalization-to-long-utterances:-libritts"> 4. Generalization to Long Utterances: LibriTTS
-<li><a href="#repeated-words"> 5. Repeated Words
-<li><a href="#additional-speakers:-internal-multi-speaker-dataset"> 6. Additional Speakers: Internal Multi-speaker Dataset
+<li><a href="#test-set-samples:-lessac-voice"> 1. Test Set Samples: Lessac Voice</a></li>
+<li><a href="#test-set-samples:-libritts"> 2. Test Set Samples: LibriTTS</a></li>
+<li><a href="#generalization-to-long-utterances:-lessac-voice"> 3. Generalization to Long Utterances: Lessac Voice</a></li>
+<li><a href="#generalization-to-long-utterances:-libritts"> 4. Generalization to Long Utterances: LibriTTS</a></li>
+<li><a href="#repeated-words"> 5. Repeated Words</a></li>
+<li><a href="#additional-speakers:-internal-multi-speaker-dataset"> 6. Additional Speakers: Internal Multi-speaker Dataset</a></li>
</ul>
</div>
<a name="test-set-samples:-lessac-voice"><h2>1. Test Set Samples: Lessac Voice</h2></a>
Binary file not shown.
