Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting a style embedding at a single scale from the information within the current sentence. The context in neighboring utterances and the multi-scale nature of style in human speech are neglected, making it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis that captures and predicts style at different levels from a wider range of context rather than a single sentence. Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained jointly with a FastSpeech 2 based acoustic model. The predictor explores hierarchical context information by considering the structural relationships in context and predicts style embeddings at the global, sentence, and subword levels. The extractor extracts multi-scale style embeddings from the ground-truth speech and explicitly guides the style prediction. Evaluations on both in-domain and out-of-domain audiobook datasets demonstrate that the proposed method significantly outperforms the three baselines. In addition, we conduct an analysis of the context information and multi-scale style representations, which has not been discussed in previous work.
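To make the multi-scale idea concrete, the sketch below shows one plausible way the three style granularities could be combined per subword and how the predictor could be guided by the extractor via a knowledge-distillation loss. This is a minimal illustrative assumption, not the authors' implementation: all function names, shapes, and the choice of a simple MSE objective are hypothetical.

```python
import numpy as np

def combine_styles(global_style, sentence_styles, subword_styles, sent_ids):
    """Sum style embeddings at three granularities for each subword.

    global_style:    (d,)        one embedding for the whole paragraph
    sentence_styles: (n_sent, d) one embedding per sentence
    subword_styles:  (T, d)      one residual embedding per subword
    sent_ids:        (T,)        index of the sentence each subword belongs to
    """
    # Broadcast the coarser styles down to the subword sequence and add the
    # subword-level residual on top.
    return global_style[None, :] + sentence_styles[sent_ids] + subword_styles

def distillation_loss(predicted, extracted):
    """MSE between predictor outputs and extractor (teacher) embeddings."""
    return float(np.mean((predicted - extracted) ** 2))

# Toy example: 2 sentences, 5 subwords, embedding dim 4.
d, T = 4, 5
g = np.zeros(d)                      # global-level style
s = np.ones((2, d))                  # sentence-level styles
w = np.zeros((T, d))                 # subword-level residuals
ids = np.array([0, 0, 0, 1, 1])      # subword -> sentence mapping
combined = combine_styles(g, s, w, ids)   # shape (5, 4)
```

Under this reading, "without residual style embedding" in the ablations below would correspond to dropping the `subword_styles` residual term from the sum.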
Fig.1: The architecture of our proposed model.
Subjective Evaluation
To demonstrate that our proposed model can significantly improve the naturalness and expressiveness of synthesized speech, some samples are provided for comparison. GT denotes the ground truth. FastSpeech 2 denotes an open-source implementation of FastSpeech 2. WSV* denotes the word-level style variations (WSV) model with several changes, which are described in detail in the paper. HCE denotes the hierarchical context encoder (HCE) model, which predicts global-level style from the context. In addition, a well-trained HiFi-GAN is used as the vocoder to generate waveforms.
The Effect of Using Knowledge Distillation Strategy to Train the Predictor
In-domain
Target Chinese Text
MSStyleTTS
without residual style embedding
必须用你的目光逼退鹰眼射出的寒光。
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。
顺便来看看你这个小兔崽子,讨你一口儿江鲜野味儿。
她自作主张,选了两个分量足的样式,西施红着脸点了点头。
勾秀云嘴上缺个把门儿的,她调笑四海。
Out-of-domain
Target Chinese Text
MSStyleTTS
without residual style embedding
这个人叫周德兴,我们后面还要经常提到他。
这个人当然就是我们的朱重八。
最典型的疏忽大意,就是所谓的忘却法,我忘了干嘛忘了干嘛。
方向盘也断了,喇叭也坏了,玻璃也摇不下来了,我嗓子也哑了。
我以为我踩错了,又把刹车踩到底,啪两个人被撞死了。
The Effect of Using Residuals to Represent Style Variations
Target Chinese Text
MSStyleTTS
without residual style embedding
GT
您老放心,漫说开荒累不死人,就是赴汤蹈火,您侄子第一个跳进去。
勾秀云嘴上缺个把门儿的,她调笑四海。
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。
瓜尔佳氏哼了一声,呵斥道。
打开食盒,里面儿是血肠儿白肉、大馅儿包子,还有一葫芦酒。
Comparisons of Utilizing Different Ranges of Context Information in the Predictor
Target Chinese Text
L=0
L=1
L=2
L=3
L=4
每年腊月门子忙活一阵,赚到的银两都在正月里的赌场上还了人家。
本以为六格会搂席,未承想却斯文起来,端端正正儿地坐在那儿,莞尔一笑,想了半天他说。
贴上余为农,既养了家也解了自己的饥渴。
瓜尔佳氏哼了一声,呵斥道。
再说了,汪半城也脚着,就算这西施有些说道儿。
The Effect of the Multi-Scale Style Predictor
Target Chinese Text
MSStyleTTS
without residual connections
无论啥人想给猩猩怪翻案,都不是那么好相与的。
要想去掉链子,再花三十吊。
刘二华堂会来事儿,任木匠二进古城子,还住在他家的上房。
他连忙儿打开了盒子,假地契原封不动儿地还躺在里面儿。余为商抹了把汗,胆儿突突地问。
怀瑾听了若有所悟,双手合十唱了一声佛号,躬身退了出去。
Comparisons Between Global-Level, Sentence-Level and Subword-Level Style Representations
Investigation on Global-Level Style
Target Chinese Text
Proposed
without global-level style
GT
您老放心,漫说开荒累不死人,就是赴汤蹈火,您侄子第一个跳进去。
小公母儿俩一进屋儿,屋儿里又多了两个人。
西施留在了汪家,桃儿才体会到了什么叫汪大奶奶。
小施主,关老爷一生最重一个义字。
乌雅氏和勾秀云早已经捷足先登了。
Investigation on Global-Level and Sentence-Level Style
Target Chinese Text
Proposed
without global-level and sentence-level style
GT
终于有人跳下了炕,明保脑瓜皮酥了一下。
他使劲儿拍了拍穆隆阿,又使劲儿拍了拍六格,骂了一句。
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。
瓜尔佳氏哼了一声,呵斥道。
必须用你的目光逼退鹰眼射出的寒光。
Case Study
To further explore the impact of the multi-scale style modeling framework on the expressiveness and prosody of synthesized speech, two case studies are conducted to compare our MSStyleTTS with the two mono-scale baselines. The ground-truth speech is also provided as a reference.