DEPARTMENT OF LINGUISTICS
SIDGWICK AVENUE
CAMBRIDGE CB3 9DA
UNITED KINGDOM
TEL: +44 (0)1223 335010
FAX: +44 (0)1223 335053
An Integrated Prosodic Approach to Device-Independent,
Natural-Sounding Speech Synthesis
This grant runs from October 1997 to May 2000. The award holder at
Cambridge is Sarah Hawkins, in collaboration with John Local and Richard
Ogden at the University of York, and Jill House and Mark Huckvale at
University College London. The £268,000 project is funded by EPSRC grants
GR/L53069 (Cambridge), GR/L51829 (York) and GR/L52109 (UCL).
Objectives
This project explores the viability of a phonological model that rectifies
some of the phonetic weaknesses of current concatenative and formant-based
text-to-speech systems. The new model integrates timing, intonation and
systematic segmental variation. For the selected linguistic structures
modelled, the result should be high-quality, natural-sounding synthetic
speech that is robust in noise. Our objectives are:
Summary
Current text-to-speech systems, both concatenative and formant-based,
have some common shortcomings. The speech often sounds unnatural because
the rhythm, intonation and fine phonetic detail reflecting coarticulatory
patterns are poor. As a result, although intelligibility rates may be good,
listeners experience increased cognitive load and poorer perception in noise. These
shortcomings restrict the applications for which synthetic speech is useful.
This collaborative project aims to integrate and extend existing knowledge
to produce the core of a new model of computational phonology and phonetic
interpretation which will deliver high-quality speech synthesis. The complete
model will comprise a unified, language- and accent-independent linguistic
representation. The current project is developing a partial model, using
representative linguistic structures which test the viability of our approach,
applied initially to Southern British English. The three focal areas of
research are intonation, morphological structure, and systematic segmental
variation. The common factor is a temporal model that systematically structures
information from all three areas and governs the output of synthesizer
parameters. The signal generation component is based on time-domain modification
of natural speech signals, supplemented by formant-based synthesis, and
is adaptable to concatenative and formant-based methods. Evaluation includes
perceptual tests for naturalness, intelligibility and communicative success
under conditions of high cognitive load.
Progress
General information on the status of ProSynth can be obtained from the
ProSynth page and the ProSynth newsletter.
Cambridge's contribution to ProSynth is to model acoustic-phonetic fine
detail and its control in the overall structure of the synthesizer, and
to assess the intelligibility and naturalness of the synthesis. The following
describes our progress.
Related research issues
There is scope for research on grammatical and phonological determinants
of perceptually-relevant allophonic variation, including the perceptual
role of systematic long-domain cues to phonemic identity, and on developing
software for conducting intelligibility and naturalness tests. These tests
assess the perceptual salience of particular acoustic properties in various
types of adverse listening conditions, including noise and the high cognitive
load imposed by carrying out simultaneous linguistic and non-linguistic tasks.
last updated: 29 March 2000