Post-doc ("Habilitation") thesis download page - Prof. Dr. Mixdorff

From here you can download a PDF version of my post-doc thesis, entitled "An Integrated Approach to Modeling German Prosody". A paperback version is available as Volume 25 of the series "Studientexte zur Sprachkommunikation" from w.e.b. Universitätsverlag, Dresden (www.web-univerlag.de).

Abstract (August 2002). This thesis documents the past five years of the author's work in prosody research. It starts from the quantitative model of German intonation that the author developed in his D.Eng. thesis, which uses the Fujisaki model to parametrize F0 contours. The symbolic representation of German intonation adopted is based on the concept of `tone switches', first introduced by Isacenko and further developed by Stock et al. into a theory of basic intonational units called `intonemes'. Intonemes are categorized by the type of tone switch with which they are associated, that is, by the distinctive transitions in the F0 contour that occur at accented syllables. When applied to Text-to-Speech (TTS) synthesis, the quantitative model (henceforth called MFGI, Mixdorff-Fujisaki Model of German Intonation) produces the F0 contour for a given sentence in a two-stage process: (1) Symbolic Prosody Generation, (2) F0 Contour Generation.

First, phrase boundaries and accented syllables are determined by applying accentuation and phrasing rules originally developed by Stock and further refined by the author. These yield the sequence of intonemes underlying the utterance to be generated, as well as the locations and depths of phrase boundaries, i.e. the symbolic prosody. In a second step, the F0 Control, this information is used, again by applying a set of rules, to derive the amplitudes of phrase and accent commands and the fine timing of these commands with respect to the phones in the utterance. The F0 contour proper is then generated directly by the Fujisaki model. MFGI was implemented during the course of a two-year DFG research project, integrated into the Dresden TTS system (DRESS) and evaluated in a series of perceptual studies. These studies compared several F0 control modules tested within the DRESS framework, and compared DRESS with other TTS systems for German.
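
The last step, generating an F0 contour from phrase and accent commands, can be illustrated with a minimal sketch of the standard Fujisaki model formulation: ln F0(t) is the sum of a base value ln Fb, phrase components (impulse responses scaled by Ap) and accent components (step responses scaled by Aa). This is not the MFGI implementation. The function names, the command timings and amplitudes, the base frequency of 80 Hz and the constants alpha = 2.0/s, beta = 20.0/s and gamma = 0.9 are illustrative assumptions, not values taken from the thesis.

```python
import math

def phrase_response(t, alpha=2.0):
    """Phrase control impulse response Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)

def accent_response(t, beta=20.0, gamma=0.9):
    """Accent control step response Ga(t), capped at the ceiling gamma, for t >= 0."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=2.0, beta=20.0, gamma=0.9):
    """Return F0 in Hz at time t (seconds).

    phrase_cmds: list of (T0, Ap)      onset time and magnitude of each phrase command
    accent_cmds: list of (T1, T2, Aa)  onset time, offset time and amplitude of each accent command
    """
    log_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1, beta, gamma)
                        - accent_response(t - t2, beta, gamma))
    return math.exp(log_f0)

# Hypothetical commands: one phrase command at 0 s, one accent command from 0.3 s to 0.6 s.
phrase_cmds = [(0.0, 0.5)]
accent_cmds = [(0.3, 0.6, 0.4)]

# Sample the contour every 10 ms over 1.5 s.
contour = [fujisaki_f0(i * 0.01, 80.0, phrase_cmds, accent_cmds) for i in range(150)]
```

In this sketch the rule-based F0 Control stage would be responsible for supplying the command lists; the model itself only superimposes the responses in the log-F0 domain.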

Experimental results showed that, although the model yielded higher naturalness than other approaches, the imperfections of the TTS system's rule-based duration module impaired the quality of the speech produced. Furthermore, obvious deficiencies relative to natural speech remained prominent.

Based on these observations, the author decided to investigate whether an integrated approach to modeling prosody, one that produces durations and F0 in parallel, would yield prosodically more coherent synthetic speech. The syllable was chosen as the modeling unit for anchoring the prosodic features. In a series of preliminary studies the relationship between the durational and F0 contours was examined. These showed, inter alia, that (1) the timing of intonational events is influenced significantly by the phonetic structure of the accented syllables these events are connected to, (2) the perceived prominence of syllables is closely related to the accent command amplitude Aa and the duration of the respective syllable, (3) the focal structure of an utterance is closely tied to the F0 contour, whereas boundaries are consistently signaled by lengthening of pre-boundary syllables.

In contrast to the original rule-based approach, a neural network trained on data from a speech corpus was implemented. Comparing the performance of the neural network against linear regression models as a baseline, however, did not yet show an advantage of the joint prediction implemented in the integrated model for all parameters.

An extensive evaluation was performed to assess the perceptual quality of the integrated model. An important paradigm adopted in the evaluation was the use of resynthesized stimuli created by prosodically degrading natural speech. This technique yielded a reference matrix of segmentally high-quality stimuli, each defined by its distance from the original speech in terms of the de-correlation in the F0 and duration domains. The main outcomes of the perceptual study were as follows: (1) subjects were far more sensitive to deviations in the duration contours than to deviations in the F0 contours, (2) the integrated model was perceived as more natural than degraded stimuli with comparable correlations between observed and predicted parameters, (3) the integrated model performed less acceptably on sentences from outside the corpus on which it was trained, (4) the integrated approach outperformed the original rule-based model mainly in terms of the accuracy of its duration prediction, not in terms of the quality of the F0 contour. A major conclusion from the latter result is that the input information extracted from text is insufficient for predicting relative constituent prominence, a major correlate of meaning.
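
The notion of correlation between observed and predicted parameters can be made concrete with a small sketch. The abstract does not say which correlation measure was used, so Pearson's coefficient is an assumption here, and the function name and the syllable durations below are invented purely for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical syllable durations in milliseconds (not data from the thesis).
observed = [180.0, 120.0, 240.0, 150.0, 310.0]
predicted = [170.0, 135.0, 220.0, 160.0, 280.0]

r_duration = pearson(observed, predicted)
```

Under this reading, a degraded stimulus and a model prediction are comparable when their parameter contours reach a similar coefficient against the original speech.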

Download D.Eng. Thesis (4.5 MB)

Information about this page:

Last modified: 24 July 2006
URL of this page: http://public.bht-berlin.de/~hmixdorff/thesis/habil_thesis.html
