Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting a style embedding at a single scale from the information within the current sentence. The context in neighboring utterances and the multi-scale nature of style in human speech are neglected, making it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis that captures and predicts style at different levels from a wider range of context rather than a single sentence. Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained jointly with a FastSpeech 2 based acoustic model. The predictor explores hierarchical context information by considering the structural relationships in context and predicts style embeddings at the global, sentence, and subword levels. The extractor extracts multi-scale style embeddings from the ground-truth speech and explicitly guides the style prediction. Evaluations on both in-domain and out-of-domain audiobook datasets demonstrate that the proposed method significantly outperforms the three baselines. In addition, we conduct an analysis of the context information and multi-scale style representations, which has not been discussed in previous work.
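To make the multi-scale idea concrete, the sketch below shows one plausible way the three style granularities could be combined per subword and how the predictor could be guided by the extractor via a knowledge-distillation loss. This is a minimal illustrative assumption, not the authors' implementation: all function names, shapes, and the choice of a simple MSE objective are hypothetical.

```python
import numpy as np

def combine_styles(global_style, sentence_styles, subword_styles, sent_ids):
    """Sum style embeddings at three granularities for each subword.

    global_style:    (d,)        one embedding for the whole paragraph
    sentence_styles: (n_sent, d) one embedding per sentence
    subword_styles:  (T, d)      one residual embedding per subword
    sent_ids:        (T,)        index of the sentence each subword belongs to
    """
    # Broadcast the coarser styles down to the subword sequence and add the
    # subword-level residual on top.
    return global_style[None, :] + sentence_styles[sent_ids] + subword_styles

def distillation_loss(predicted, extracted):
    """MSE between predictor outputs and extractor (teacher) embeddings."""
    return float(np.mean((predicted - extracted) ** 2))

# Toy example: 2 sentences, 5 subwords, embedding dim 4.
d, T = 4, 5
g = np.zeros(d)                      # global-level style
s = np.ones((2, d))                  # sentence-level styles
w = np.zeros((T, d))                 # subword-level residuals
ids = np.array([0, 0, 0, 1, 1])      # subword -> sentence mapping
combined = combine_styles(g, s, w, ids)   # shape (5, 4)
```

Under this reading, "without residual style embedding" in the ablations below would correspond to dropping the `subword_styles` residual term from the sum.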
Fig.1: The architecture of our proposed model.
Subjective Evaluation
To demonstrate that our proposed model can significantly improve the naturalness and expressiveness of synthesized speech, some samples are provided for comparison. GT denotes the ground truth. FastSpeech 2 denotes an open-source implementation of FastSpeech 2. WSV* denotes the word-level style variations (WSV) model with several changes, which are described in detail in the paper. HCE denotes the hierarchical context encoder (HCE) model, which predicts global-level style from the context. In addition, a well-trained HiFi-GAN is used as the vocoder to generate waveforms.
The Effect of Using Knowledge Distillation Strategy to Train the Predictor
In-domain
Target Chinese Text
MSStyleTTS
without residual style embedding
必须用你的目光逼退鹰眼射出的寒光。
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。
顺便来看看你这个小兔崽子,讨你一口儿江鲜野味儿。
她自作主张,选了两个分量足的样式,西施红着脸点了点头。
勾秀云嘴上缺个把门儿的,她调笑四海。
Out-of-domain
Target Chinese Text
MSStyleTTS
without residual style embedding
这个人叫周德兴,我们后面还要经常提到他。
这个人当然就是我们的朱重八。
最典型的疏忽大意,就是所谓的忘却法,我忘了干嘛忘了干嘛。
方向盘也断了,喇叭也坏了,玻璃也摇不下来了,我嗓子也哑了。
我以为我踩错了,又把刹车踩到底,啪两个人被撞死了。
The Effect of Using Residuals to Represent Style Variations
Target Chinese Text
MSStyleTTS
without residual style embedding
GT
您老放心,漫说开荒累不死人,就是赴汤蹈火,您侄子第一个跳进去。
勾秀云嘴上缺个把门儿的,她调笑四海。
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。
瓜尔佳氏哼了一声,呵斥道。
打开食盒,里面儿是血肠儿白肉、大馅儿包子,还有一葫芦酒。
Comparisons of Utilizing Different Ranges of Context Information in the Predictor
Target Chinese Text
L=0
L=1
L=2
L=3
L=4
每年腊月门子忙活一阵,赚到的银两都在正月里的赌场上还了人家。
本以为六格会搂席,未承想却斯文起来,端端正正儿地坐在那儿,莞尔一笑,想了半天他说。
贴上余为农,既养了家也解了自己的饥渴。
瓜尔佳氏哼了一声,呵斥道。
再说了,汪半城也脚着,就算这西施有些说道儿。
The Effect of the Multi-Scale Style Predictor
Target Chinese Text
MSStyleTTS
without residual connections
无论啥人想给猩猩怪翻案,都不是那么好相与的。
要想去掉链子,再花三十吊。
刘二华堂会来事儿,任木匠二进古城子,还住在他家的上房。
他连忙儿打开了盒子,假地契原封不动儿地还躺在里面儿。余为商抹了把汗,胆儿突突地问。
怀瑾听了若有所悟,双手合十唱了一声佛号,躬身退了出去。
Comparisons Between Global-Level, Sentence-Level and Subword-Level Style Representations
Investigation on Global-Level Style
Target Chinese Text
Proposed
without global-level style
GT
您老放心,漫说开荒累不死人,就是赴汤蹈火,您侄子第一个跳进去。
小公母儿俩一进屋儿,屋儿里又多了两个人。
西施留在了汪家,桃儿才体会到了什么叫汪大奶奶。
小施主,关老爷一生最重一个义字。
乌雅氏和勾秀云早已经捷足先登了。
Investigation on Global-Level and Sentence-Level Style
Target Chinese Text
Proposed
without global-level and sentence-level style
GT
终于有人跳下了炕,明保脑瓜皮酥了一下。
他使劲儿拍了拍穆隆阿,又使劲儿拍了拍六格,骂了一句。
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。
瓜尔佳氏哼了一声,呵斥道。
必须用你的目光逼退鹰眼射出的寒光。
Case Study
To further explore the impact of the multi-scale style modeling framework on the expressiveness and prosody of synthesized speech, two case studies are conducted to compare our MSStyleTTS with the two mono-scale baselines. The ground-truth speech is also provided as a reference.