Why High-Quality Speech Synthesis Is Still a Great Challenge
Whether industrial applications of the new spoken language processing technologies become a successful component of man-machine interfaces depends strongly on how acceptable the quality of their text-to-speech synthesis component is. Achieving so-called high-definition spoken language quality in the text-to-speech component of any automatic dialog system is therefore still a great and largely unsolved challenge. Several aspects of acceptability must be taken into account.
On the pragmatic level, the first question is how to determine the proper speaking style for a given dialog situation. For instance, when the driver of a car in heavy traffic listens to the voice of an automatic speaking system, the voice has to sound as if it is giving a warning, or at least attracting attention, or merely giving information about the choices available in the driving situation. The pragmatically adequate speaking style may even have to change within a single utterance.
Still on the pragmatic level, and depending on the nature of the application, the type of speaker itself may play a crucial role. For example, the characteristic voice properties of a synthetic speaker may be chosen to define a so-called speech logo for a corporate identity. More generally, the choice of the individual voice quality of a synthetic speaker will depend mainly on the nature of the given application. Even the gender and age of the speaker may have to be taken into account. One certainly cannot use one and the same speaker for the output of all man-machine systems: a technical information system such as a telephone directory requires a different speaker type than a toy for children, an agent in a game for adults, or a teaching system for second-language acquisition.
If more than one speaker is needed in an automatic dialog system, the distinguishability of their voices and the binding of each voice to a certain functionality also become relevant questions. Speech synthesis will also play an important role in future speech-to-speech translation systems. A desirable feature of such systems would be that the translated synthesized speech has the same speaker and emotional characteristics as the voice of the speaker being translated.
On the purely technical level, the first question is how the phonetic output of any of those pragmatically determined speakers (and speaking styles) can be controlled parametrically to achieve the desired results. It seems clear that today's state-of-the-art concatenative text-to-speech systems will not be sufficient for solving this task. In future spoken language dialog systems, concatenative speech synthesis may still be useful for producing so-called citation forms of single words, uttered in isolation by certain individual speakers. These canonical pronunciations of lexical items in a neutral speaking style provide the phonetic forms of words, which can then be parametrically modified into the prosodic form needed for fluent connected speech in a given speaking style.
Future research on high-quality speech synthesis will therefore have to find out which parameters allow different speaking styles and voice qualities to be achieved in the voice of any given individual speaker (with synthesis-by-analysis vs. analysis-by-synthesis as the research paradigm). An extension of this research is the development of cross-language conversion techniques, in which a synthetic voice is adapted across languages. Another central research question is which kinds of emotions can be expressed by systematically controlling the prosodically determined modifications of words in fluent speech.
A major topic in this new research field will be to investigate the prosodic variation of local speech tempo in connection with local voice quality, local pitch and local excitation energy. The restricted and strongly correlated combination of these locally varying parameters determines the individual voice characteristics of any naturally given or artificially defined speaker. A final but also very important question for future speech research is how speech production can be prosodically integrated into the broader multimodal context of gesture and facial expression, i.e. the ensemble of movements of the hands, arms, head, lips, chin, eyes and eyebrows, indeed of the speaker's whole body.
Social and Economic Impact
Systems that enable and enhance communication are fundamental to the evolving information society. At present, several services and applications can be accessed by telephone and by other means such as the web. Analysing the different application scenarios, we notice that mobile communication in private and commercial environments plays a dominant role in the interaction between humans and machines. Multi-modal interfaces will increasingly be used to interact with such systems. Of the different interaction modes (haptic, visual and vocal), vocal interaction is essential for mobile communication. As a consequence, speech-driven interfaces embedded in mobile applications and network-based servers will gain further importance.
The language used to access the services plays an important role. Although English is more or less accepted as a language for international communication in commerce, R&D, education and politics, the ability to speak and understand English is limited in many places because of cultural and political heritage. Furthermore, there are large economic regions such as China, South America and the Arab world where communication in English is currently difficult.
Speaking in one's mother tongue is the most natural and intuitive way of communicating and interacting with both humans and machines. Consequently, speech-driven services should interact with users in their mother tongue. The main technologies currently used to build speech-driven interfaces are speech recognition (ASR), speech synthesis (TTS) and dialog handling, to be joined in future by speech-to-speech translation. ECESS concentrates on speech synthesis. Evaluation studies of speech-driven dialogue systems have shown that users are very sensitive to the quality of the speech output, which to a certain extent outweighs the quality of other system components such as ASR. As mentioned in the previous section, current speech synthesis systems have not yet reached the quality needed for many potential application areas. The main goal of ECESS is to build a technology basis for 'application adequate' speech synthesis systems. In this way ECESS helps to build a user-friendly information society.