Transformation and interpolation of language varieties for speech synthesis / by Dipl.-Ing. Markus Toman, BSc
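German title: Akustische Modellierung, Transformation und Interpolation von Sprachvarietäten für Sprachsynthese (Acoustic modeling, transformation and interpolation of language varieties for speech synthesis)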
Author: Toman, Markus
Reviewer: Rauber, Andreas
Published: Vienna, 11.1.2016
Extent: x, 124 pages : illustrations, diagrams
Thesis: Technische Universität Wien, Univ., Dissertation, 2016
Keywords (EN): Speech Processing / Speech Synthesis / Hidden Markov Model / Language Varieties / Dialects / Voice Conversion
URN: urn:nbn:at:at-ubtuw:1-1503
This thesis aims to advance the field of speech synthesis by investigating and developing new concepts for acoustic modeling, transformation and interpolation of language varieties (i.e. dialects, sociolects, foreign accents). The goal is to enable systems with speech output to adapt to the individual needs and preferences of their users. Transformation of language varieties converts a voice model in one variety into a model in another variety while retaining the voice characteristics. Interpolation between multiple voice models of different varieties makes it possible to generate intermediate varieties. Both approaches are used to widen the range of speaking styles available to speech output systems. In addition, two specific applications are investigated in this thesis: foreign accent reduction and the generation of intelligible fast speech for visually impaired users. All presented methods are evaluated through listening tests and, where appropriate, objective measures. To conduct these experiments, phone sets and recording scripts for three Austrian German dialects were created, and speech corpora from selected native dialect speakers were recorded in studio quality. We present a method for unsupervised dialect interpolation and show that listeners correctly perceive the changes in degree of dialect for different settings of the interpolation parameter. We show that transformation of dialects while retaining the original speaker characteristics is possible with the methods presented here. We also compare different approaches for generating fast synthetic speech. Our experiments show that linearly compressed natural speech signals are more intelligible than fast speech produced naturally by our professional speakers. Overall, this thesis shows how adaptive modeling can be applied to control and modify the language variety of a speech synthesis system.