Upload Very Attentive Tacotron video and add it to demo page
JulianSlzr committed Nov 1, 2024
1 parent 7d41ebf commit e7db686
Showing 2 changed files with 11 additions and 6 deletions.
17 changes: 11 additions & 6 deletions publications/very_attentive_tacotron/index.html
@@ -81,6 +81,11 @@
<h1>Audio samples from "Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech"</h1>
</header>
</article>
+<div align="center">
+<video width="540" controls>
+<source src="https://github.com/google/tacotron/raw/refs/heads/master/publications/very_attentive_tacotron/video/vat_demos.mp4">
+</video>
+</div>
<div><p><b>Paper:</b> <a href="https://arxiv.org/abs/2410.22179">arXiv</a></p></div>
<div><p><b>Authors:</b> Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, David Kao</p></div>
<div><p><b>Abstract:</b> Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backprop and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self and cross-attention operations. A system incorporating these improvements, which we call <i>Very Attentive Tacotron</i>, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.</p></div>
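As an aside on the embed added above (not part of the commit): the `<source>` element carries no `type` attribute and the `<video>` element has no fallback content, so browsers that cannot play the file fail silently. A more defensive version of the same embed might look like this — a sketch only, using the same URL from the diff:

```html
<div align="center">
  <video width="540" controls preload="metadata">
    <source src="https://github.com/google/tacotron/raw/refs/heads/master/publications/very_attentive_tacotron/video/vat_demos.mp4"
            type="video/mp4">
    <!-- Shown only when the browser cannot play the video -->
    Your browser does not support embedded video.
  </video>
</div>
```

Declaring `type="video/mp4"` lets the browser skip sources it cannot decode without downloading them, and the fallback text gives non-supporting clients something to render.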
@@ -118,12 +123,12 @@ <h1>Audio samples from "Very Attentive Tacotron: Robust and Unbounded Length Gen
<p class="toc_title">Contents</p>
<div id="toc_container">
<ul>
-<li><a href="#test-set-samples:-lessac-voice"> 1. Test Set Samples: Lessac Voice
-<li><a href="#test-set-samples:-libritts"> 2. Test Set Samples: LibriTTS
-<li><a href="#generalization-to-long-utterances:-lessac-voice"> 3. Generalization to Long Utterances: Lessac Voice
-<li><a href="#generalization-to-long-utterances:-libritts"> 4. Generalization to Long Utterances: LibriTTS
-<li><a href="#repeated-words"> 5. Repeated Words
-<li><a href="#additional-speakers:-internal-multi-speaker-dataset"> 6. Additional Speakers: Internal Multi-speaker Dataset
+<li><a href="#test-set-samples:-lessac-voice"> 1. Test Set Samples: Lessac Voice</a></li>
+<li><a href="#test-set-samples:-libritts"> 2. Test Set Samples: LibriTTS</a></li>
+<li><a href="#generalization-to-long-utterances:-lessac-voice"> 3. Generalization to Long Utterances: Lessac Voice</a></li>
+<li><a href="#generalization-to-long-utterances:-libritts"> 4. Generalization to Long Utterances: LibriTTS</a></li>
+<li><a href="#repeated-words"> 5. Repeated Words</a></li>
+<li><a href="#additional-speakers:-internal-multi-speaker-dataset"> 6. Additional Speakers: Internal Multi-speaker Dataset</a></li>
</ul>
</div>
<a name="test-set-samples:-lessac-voice"><h2>1. Test Set Samples: Lessac Voice</h2></a>
Binary file not shown.
