In this thesis we examine the autonomous oscillator model for the synthesis of speech signals. The contributions comprise an analysis of realizations and training methods for the nonlinear function used in the oscillator model; the combination of the oscillator model with inverse filtering; and the introduction of a new technique for re-generating the noise-like signal component in speech signals. The first two contributions each significantly increase the number of `successfully' re-synthesized speech signals.
Nonlinear function models are compared in a one-dimensional modeling task with regard to their suitability for adequate re-synthesis of speech signals, in particular considering stability. The comparison also covers the structure of the nonlinear functions, with a view to possible interpolation between models for different speech sounds. Regarding both the stability of the oscillator and the premise of a nonlinear function structure that may be pre-defined, RBF networks are found to be a preferable choice. In particular, in combination with a Bayesian training algorithm, RBF networks with Gaussian basis functions outperform other nonlinear function models with respect to the requirements for application in the oscillator model.
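To make the notion of an RBF network with a pre-defined structure concrete, the following is a minimal sketch of a one-dimensional modeling task, not the implementation used in this thesis. It assumes Gaussian basis functions on a fixed grid of centers with a common width, and it fits the weights by ridge-regularized least squares as a simple stand-in for the Bayesian training algorithm, which is not reproduced here. The target function, the grid, the width, and the regularization constant are illustrative choices only.

```python
import numpy as np

def gaussian_design_matrix(x, centers, width):
    # One column per basis function: exp(-(x - c)^2 / (2 * width^2)).
    diff = x[:, None] - centers[None, :]
    return np.exp(-0.5 * (diff / width) ** 2)

def fit_rbf_weights(x, y, centers, width, ridge=1e-3):
    # Ridge-regularized least squares (assumption: a stand-in for
    # Bayesian training, chosen only to keep the sketch short).
    phi = gaussian_design_matrix(x, centers, width)
    gram = phi.T @ phi + ridge * np.eye(len(centers))
    return np.linalg.solve(gram, phi.T @ y)

def rbf_predict(x, centers, width, weights):
    # Network output: weighted sum of the Gaussian basis functions.
    return gaussian_design_matrix(x, centers, width) @ weights

# Illustrative one-dimensional task: noisy samples of a smooth function.
rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 200)
y_train = np.sin(np.pi * x_train) + 0.05 * rng.standard_normal(x_train.size)

# Pre-defined structure: centers and width are fixed before training,
# so only the weights differ between models of different sounds.
centers = np.linspace(-1.0, 1.0, 11)
width = 0.2
weights = fit_rbf_weights(x_train, y_train, centers, width)

x_test = np.linspace(-0.9, 0.9, 50)
y_test = rbf_predict(x_test, centers, width, weights)
max_err = np.max(np.abs(y_test - np.sin(np.pi * x_test)))
print("maximum absolute error on test grid:", max_err)
```

Because centers and width are shared, two such models differ only in their weight vectors, which is what makes interpolation between models for different speech sounds conceivable in the first place.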
Applying inverse filtering, in particular linear prediction as a model for speech production, in addition to nonlinear oscillator modeling allows the oscillator to model an estimated speech source signal as evoked by the oscillatory motion of the vocal folds. The combination of linear prediction inverse filtering and the nonlinear oscillator model is shown to provide a significantly higher number of stably re-synthesized vowel signals, and better spectral reconstruction, than the oscillator model applied to the full speech signal. However, for wide-band speech signals, the reconstruction of the high-frequency band is still unsatisfactory. A closer analysis shows that -- while the oscillatory component can now be reproduced satisfactorily -- a model for the noise-like component of speech signals is still missing.
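As an illustration of linear prediction inverse filtering, the sketch below estimates predictor coefficients with the autocorrelation method and applies the inverse filter $A(z) = 1 - \sum_k a_k z^{-k}$ to obtain a residual that serves as a source signal estimate. It is a generic textbook formulation under stated assumptions, not the exact procedure of this thesis: the function names, the prediction order, and the synthetic second-order test signal are all hypothetical.

```python
import numpy as np

def lpc_autocorrelation(signal, order):
    # Autocorrelation estimates r[0..order] (no windowing, for brevity).
    n = len(signal)
    r = np.array([np.dot(signal[: n - k], signal[k:]) for k in range(order + 1)])
    # Normal equations R a = r[1:], where R is Toeplitz in r[0..order-1].
    lags = np.abs(np.arange(order)[:, None] - np.arange(order)[None, :])
    return np.linalg.solve(r[lags], r[1:])

def inverse_filter(signal, coeffs):
    # Residual e(n) = s(n) - sum_k a_k s(n - k).
    a_poly = np.concatenate(([1.0], -coeffs))
    return np.convolve(signal, a_poly)[: len(signal)]

# Synthetic test signal (assumption): white noise shaped by a known,
# stable all-pole filter, standing in for a recorded speech frame.
rng = np.random.default_rng(1)
excitation = rng.standard_normal(4000)
true_coeffs = np.array([1.3, -0.8])
speech_like = np.zeros_like(excitation)
for n in range(len(excitation)):
    past = 0.0
    for k, a_k in enumerate(true_coeffs):
        if n - 1 - k >= 0:
            past += a_k * speech_like[n - 1 - k]
    speech_like[n] = excitation[n] + past

est_coeffs = lpc_autocorrelation(speech_like, order=2)
residual = inverse_filter(speech_like, est_coeffs)
print("estimated coefficients:", est_coeffs)
print("variance ratio residual/signal:", np.var(residual) / np.var(speech_like))
```

In the setting described above, the oscillator would then be trained on such a residual rather than on the full speech signal, and the all-pole filter would be re-applied after re-synthesis.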
Our remedy is to extend the oscillator model by a nonlinear predictor used to re-generate the amplitude-modulated noise-like signal component of stationary mixed-excitation speech signals (including vowels and voiced fricatives). The resulting `oscillator-plus-noise' model is able to re-generate vowel signals as well as voiced fricative signals with high fidelity in terms of time-domain waveform, signal trajectory in phase space, and spectral characteristics. Moreover, because a zero oscillatory component is determined automatically, unvoiced fricatives are also reproduced adequately, as the noise-like component only. With one instance of the proposed model, all kinds of stationary speech sounds can be re-synthesized by applying model parameters -- i.\,e., the RBF network weights and linear prediction filter coefficients -- learned from a natural speech signal for each sound.
In a first objective analysis of the naturalness of signals generated by the oscillator-plus-noise model, measures of short-term variations in fundamental frequency and amplitude are found to resemble those of the original signal more closely than is the case for the oscillator model alone, suggesting an improvement in naturalness.
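Short-term variations in fundamental frequency and amplitude are commonly quantified as jitter and shimmer. The sketch below computes one simple relative variant of each, the mean absolute difference between consecutive cycles divided by the mean value. Whether these definitions match the measures used in this thesis is an assumption, and the period and amplitude values are made-up numbers for illustration only.

```python
import numpy as np

def relative_perturbation(values):
    # Mean absolute cycle-to-cycle difference, normalized by the mean.
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(np.diff(values))) / np.mean(values)

# Hypothetical per-cycle measurements of a sustained vowel.
periods_ms = np.array([8.0, 8.2, 7.9, 8.1, 8.0])   # pitch period lengths
peak_amplitudes = np.array([1.0, 0.9, 1.1, 1.0])   # per-cycle peak values

jitter = relative_perturbation(periods_ms)
shimmer = relative_perturbation(peak_amplitudes)
print("relative jitter:", jitter)
print("relative shimmer:", shimmer)
```

A strictly periodic signal yields zero for both measures, so values for a re-synthesized signal that lie closer to those of the original recording indicate more natural cycle-to-cycle behavior.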