Cynthia is the name of the text-to-speech system that I am writing in Prolog. You can read about Cynthia in the news.
There are two major parts to a text-to-speech project: sound production and the phonological engine. Among other things, the phonological engine changes the pronunciation of certain sounds in different environments. For instance, the initial vowel sound in the word "you" sounds different in isolation than it does in the phrase "did you". I have been focusing on the phonological part of Cynthia, but I will say a little about what I am doing for sound production.
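To make the idea of an environment-sensitive rule concrete, here is a minimal sketch in Python (Cynthia itself is written in Prolog, and this is not her rule set). It assumes ARPAbet-style symbols and uses one well-known English process as an illustration: a word-final D followed by a word-initial Y merges into JH, as in casual "did you". The function name `apply_palatalization` is made up for this example.

```python
# Illustrative context-sensitive rule; an assumption for this sketch,
# not a rule taken from Cynthia's source.
# A word-final D followed by a word-initial Y merges into the affricate JH.

def apply_palatalization(words):
    """words: list of phoneme lists (ARPAbet-style symbols), one list per word."""
    out = [list(w) for w in words]  # copy so the input is left untouched
    for i in range(len(out) - 1):
        cur, nxt = out[i], out[i + 1]
        if cur and nxt and cur[-1] == "D" and nxt[0] == "Y":
            cur[-1] = "JH"  # D becomes the affricate
            nxt.pop(0)      # the glide is absorbed into it
    return out

did_you = [["D", "IH", "D"], ["Y", "UW"]]
print(apply_palatalization(did_you))
```

The rule only fires when both halves of the environment are present, so "you" on its own passes through unchanged.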
I am using a database of prerecorded diphone waveforms; it is available through the MBROLA project. This database does not support allophones of phonemes. Instead it has every possible combination of two consecutive phonemes that can occur in English. A diphone is the concatenation of the second half of the first phoneme and the first half of the second phoneme.
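As a rough illustration of how a phoneme string maps onto diphones, the Python sketch below pairs each phoneme with its successor. The silence symbol `_`, the `A-B` naming, and the function name `to_diphones` are assumptions made for this example; MBROLA does this pairing internally and its database may label diphones differently.

```python
def to_diphones(phonemes, silence="_"):
    """Return the diphone sequence for a list of phonemes.

    The utterance is padded with silence on both sides so that the first
    and last phonemes also get a transition. (Padding is an assumption of
    this sketch.)
    """
    padded = [silence] + list(phonemes) + [silence]
    # Each diphone spans from the middle of one phoneme to the middle of the next.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["D", "IH", "D"]))
```

A sequence of n phonemes yields n + 1 diphones once the silences are added.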
Since the database does not contain allophones, it is missing the tap (flap), which is an allophone of /t/. It also lacks a glottal stop.
The MBROLA program reads a file generated by Cynthia that tells it which diphones to fetch from the database. The database entries are just waveforms; they carry no length or pitch information. Each phoneme in the phonetic transcription of a text is written to the input file along with a string of numbers giving its length and pitch. The MBROLA program decides which diphone should be retrieved and applies the length and pitch information.
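The sketch below shows, in Python, what writing such an input file might look like. It assumes the usual MBROLA `.pho` layout: one phoneme per line, followed by a duration in milliseconds and optional pairs of (position in percent, pitch in Hz). The phoneme symbols, durations, and pitch values here are made-up placeholders; the real symbols depend on the voice database, and `format_pho` is a hypothetical helper, not part of Cynthia or MBROLA.

```python
def format_pho(segments):
    """Render (symbol, duration_ms, pitch_points) tuples as .pho text.

    pitch_points is a list of (percent_into_phoneme, hertz) pairs and may be empty.
    """
    lines = []
    for symbol, duration_ms, pitch_points in segments:
        fields = [symbol, str(duration_ms)]
        for percent, hertz in pitch_points:
            fields += [str(percent), str(hertz)]
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"

# Placeholder values for illustration only.
segments = [
    ("_", 100, []),           # leading silence
    ("d", 60, [(0, 120)]),    # pitch target at the start of the phoneme
    ("I", 80, [(50, 125)]),   # pitch target at the midpoint
    ("d", 70, [(100, 110)]),  # pitch target at the end
    ("_", 100, []),           # trailing silence
]
pho_text = format_pho(segments)
print(pho_text, end="")
```

Keeping the timing and pitch numbers in the input file rather than in the database is what lets one set of recorded diphones serve any rhythm or intonation.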
I am getting the raw phonetic transcriptions from the CMU Pronouncing Dictionary. If a word is not contained in the dictionary, Cynthia sounds it out using pronunciation rules, much as a child does when learning to read. When Cynthia gets a word from the input stream, she looks it up in the dictionary to get the raw transcription. The CMU dictionary does not divide words into syllables, but it does have primary and secondary stress marks. Cynthia takes the transcription for a given word and divides it into syllables. Then the phonological rules are applied, resulting in a more natural transcription.
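The lookup and syllable-division steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample lines follow the CMU dictionary's layout (a word followed by ARPAbet phonemes, with a stress digit on each vowel), and the syllable split is a deliberately naive one in which each vowel takes at most one preceding consonant as its onset. Cynthia's actual syllabification rules are not described here and are likely more careful. The names `load_dictionary` and `syllabify` are hypothetical.

```python
SAMPLE_DICT = """\
;;; a few lines in the CMU dictionary's layout
DID  D IH1 D
YOU  Y UW1
HELLO  HH AH0 L OW1
"""

def load_dictionary(text):
    """Map lowercase words to their phoneme lists, skipping comment lines."""
    entries = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        entries[word.lower()] = phones
    return entries

def is_vowel(phone):
    # In the dictionary, vowels carry a stress digit: 0, 1 (primary) or 2 (secondary).
    return phone[-1].isdigit()

def syllabify(phones):
    """Naive split: one syllable per vowel, with at most one onset consonant
    for non-initial syllables. A simplification assumed for this sketch."""
    vowel_positions = [i for i, p in enumerate(phones) if is_vowel(p)]
    if not vowel_positions:
        return [list(phones)]
    starts = [0]
    for prev, cur in zip(vowel_positions, vowel_positions[1:]):
        # If a consonant sits directly before this vowel, it opens the syllable.
        starts.append(cur - 1 if cur - 1 > prev else cur)
    bounds = starts + [len(phones)]
    return [list(phones[a:b]) for a, b in zip(bounds, bounds[1:])]

dictionary = load_dictionary(SAMPLE_DICT)
print(syllabify(dictionary["hello"]))
print(dictionary.get("cynthia"))  # None here: a missing word would go to the sounding-out rules
```

Because the stress digits stay attached to the vowels, each syllable keeps its stress mark after the split, which later rules can consult.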