Speech generation in D2S

The Speech Generation Module (SGM) of D2S currently offers two output modes: phonetics-to-speech synthesis and phrase concatenation. Both are discussed in more detail below.

Phonetics-to-speech

Unlike text-to-speech, which starts from unrestricted text, phonetics-to-speech generates speech from a phonetic transcription with prosodic annotations. This means that the linguistic analysis stage of text-to-speech is skipped. Because the LGM generates an orthographic representation that has a unique phonetic representation, grapheme-to-phoneme conversion can be done without errors by lexical lookup instead of by rule. The speech output is generated by concatenating diphones: small speech segments, each consisting of the transition between two adjacent phonemes. A complete diphone inventory for a language covers all possible transitions between any two sounds of that language. The phonetics-to-speech system Calipso, developed at IPO, provides GoalGetter with PSOLA-based diphones.
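The two steps described above, lexical lookup followed by diphone concatenation, can be illustrated with a minimal sketch. It is not Calipso's implementation: the lexicon entries, the phoneme symbols, the silence symbol `_`, and the function names are all assumptions made for illustration, and prosodic annotations and the audio processing itself are left out.

```python
# Hypothetical toy lexicon: orthographic word -> phoneme symbols.
# The entries and symbols are made up for illustration.
LEXICON = {
    "goal": ["g", "o", "l"],
    "ajax": ["a", "j", "a", "k", "s"],
}

SILENCE = "_"  # assumed symbol for the silence at the utterance edges


def lookup_phonemes(words):
    """Grapheme-to-phoneme conversion by lexical lookup (no rules)."""
    phonemes = []
    for word in words:
        # Raises KeyError for unknown words; a closed vocabulary is assumed.
        phonemes.extend(LEXICON[word.lower()])
    return phonemes


def to_diphones(phonemes):
    """Return the phoneme-to-phoneme transitions that cover the utterance."""
    padded = [SILENCE, *phonemes, SILENCE]
    # Each diphone spans the transition between two adjacent sounds.
    return list(zip(padded, padded[1:]))


print(to_diphones(lookup_phonemes(["goal"])))
# [('_', 'g'), ('g', 'o'), ('o', 'l'), ('l', '_')]
```

Each pair in the result names one unit that would be fetched from the diphone inventory and joined to its neighbours.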

Phrase concatenation using prosodic variants

Phrase concatenation makes use of prerecorded material: entire words and phrases are recorded in advance and played back in different orders to form complete utterances. This method is particularly well suited to a carrier-and-slot situation, i.e., one in which there is a limited number of utterance types, with variable information to be inserted at fixed positions in those utterances. In D2S, the carriers are the syntactic templates, and these have slots for variable information.
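The carrier-and-slot idea can be sketched in a few lines. This is an illustration under stated assumptions, not the D2S code: the `{slot}` notation, the carrier contents, the phrase names, and the function `fill_carrier` are all hypothetical.

```python
import re

# Assumed notation: a carrier item written as "{name}" is a slot.
SLOT = re.compile(r"^\{(\w+)\}$")


def fill_carrier(carrier, slots):
    """Return the playback order of prerecorded phrases for one utterance."""
    sequence = []
    for item in carrier:
        match = SLOT.match(item)
        if match:
            # Variable information is inserted at a fixed position.
            sequence.append(slots[match.group(1)])
        else:
            # Fixed carrier phrase, played back as recorded.
            sequence.append(item)
    return sequence


# Hypothetical carrier with two slots.
carrier = ["{home_team}", "played_against", "{away_team}"]
print(fill_carrier(carrier, {"home_team": "ajax", "away_team": "feyenoord"}))
# ['ajax', 'played_against', 'feyenoord']
```

Each name in the resulting list stands for one recording; playing them in order yields the complete utterance.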

In the standard approach to phrase concatenation, the words and phrases to be concatenated are recorded in one prosodically neutral version only. As a result, prosodic variation is not accounted for, which lowers the quality of the speech output. In our approach, however, several prosodic variants of otherwise identical words and phrases are used. Stylizations of these prosodic variants are depicted below. Which variant is chosen when a generated text is made audible depends on the prosodic markers that have been assigned by the Prosody component of the LGM.

[Figure: stylizations of the prosodic variants]
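Selecting a recording on the basis of prosodic markers can be sketched as a simple table lookup. The marker names ("accented", "unaccented", "neutral"), the file names, and the fallback to a neutral version are assumptions made for this illustration; the actual set of variants and markers used in D2S is not specified here.

```python
# Hypothetical recording table: (phrase, prosodic marker) -> audio file.
RECORDINGS = {
    ("ajax", "accented"): "ajax_acc.wav",
    ("ajax", "unaccented"): "ajax_unacc.wav",
    ("ajax", "neutral"): "ajax_neutral.wav",
    ("won", "neutral"): "won_neutral.wav",
}


def select_variant(phrase, marker, recordings=RECORDINGS):
    """Pick the recording whose prosodic variant matches the marker."""
    key = (phrase, marker)
    if key in recordings:
        return recordings[key]
    # Assumed fallback: use the neutral version if no variant was recorded.
    return recordings[(phrase, "neutral")]


def select_all(annotated_phrases):
    """Map a list of (phrase, marker) pairs to the recordings to play."""
    return [select_variant(phrase, marker) for phrase, marker in annotated_phrases]


print(select_all([("ajax", "accented"), ("won", "accented")]))
# ['ajax_acc.wav', 'won_neutral.wav']
```

In this sketch the same word maps to different recordings depending on its marker, which is the difference from the standard single-version approach.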

The phrase concatenation that you can hear in GoalGetter is the result of our first experiment with this technique, using the speech of a colleague (not a professional speaker). You may therefore hear some flaws in the speech output. These flaws have largely been overcome in the OVIS system.

