Cynthia is the name of the speech synthesizer that I am writing. The picture at the top of this page is a waveform of Cynthia saying the words "speech technology". I demonstrated Cynthia at the Athens Linux Fest this year.
There are two major parts to a text-to-speech project: sound production and phonology. I am currently focusing on the phonology, but I will first say a little about what I am doing for sound production.
I am using a database of prerecorded diphone waveforms; it is available through the MBROLA project. This database does not support allophones of phonemes. Instead it has every possible combination of two consecutive phonemes that can occur in English. A diphone is the concatenation of the second half of the first phoneme and the first half of the second phoneme.
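To make the diphone idea concrete, here is a minimal sketch (not Cynthia's actual code) of how a phoneme sequence maps onto diphone units. The function name `to_diphones`, the `_` silence symbol used for padding, and the phoneme symbols in the example are assumptions for illustration only.

```python
def to_diphones(phonemes, silence="_"):
    """Pair each phoneme with its successor, padding both ends with silence.

    A sequence of n phonemes yields n + 1 diphones, because the utterance
    is entered from silence and ends in silence.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{first}-{second}" for first, second in zip(padded, padded[1:])]


# Hypothetical symbols for the word "speech".
print(to_diphones(["s", "p", "i", "tS"]))
# ['_-s', 's-p', 'p-i', 'i-tS', 'tS-_']
```

Each unit such as `s-p` stands for the second half of the first phoneme joined to the first half of the second.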
Since the database does not contain allophones, it is missing the tap (flap), which is an allophone of /t/. It also does not have a glottal stop.
The MBROLA program reads a file that is generated by Cynthia; this file tells MBROLA which diphones to get from the database. The entries in the database are just waveforms. They have no length or pitch information. Each phoneme in the phonetic transcription of a text is written to the input file along with a string of numbers that specifies its length and pitch. The MBROLA program decides which diphone should be retrieved and applies the length and pitch information.
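The sketch below shows roughly what writing such an input file could look like. It assumes the usual MBROLA `.pho` layout: one phoneme per line, followed by its duration in milliseconds and then optional pairs of numbers giving a position (percent of the phoneme's duration) and a pitch target in Hz. The helper names `pho_line` and `build_pho`, the phoneme symbols, and all durations and pitch values are made up for illustration; they are not Cynthia's real output.

```python
def pho_line(phoneme, duration_ms, pitch_points=()):
    """Format one input line: phoneme, duration, then (percent, Hz) pairs."""
    fields = [phoneme, str(duration_ms)]
    for percent, hertz in pitch_points:
        fields += [str(percent), str(hertz)]
    return " ".join(fields)


def build_pho(segments):
    """Join the formatted lines into the text of an input file."""
    return "\n".join(pho_line(*segment) for segment in segments) + "\n"


# Invented timing and pitch values for a word like "speech".
segments = [
    ("_", 100),
    ("s", 90),
    ("p", 70),
    ("i", 120, [(0, 120), (100, 110)]),  # pitch falls from 120 Hz to 110 Hz
    ("tS", 110),
    ("_", 100),
]
print(build_pho(segments), end="")
```

The vowel line in this example comes out as `i 120 0 120 100 110`: a 120 ms phoneme with a pitch target of 120 Hz at its start and 110 Hz at its end.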
I am getting the raw phonetic transcriptions from the CMU pronouncing dictionary. It mostly uses a subset of ARPAbet as its phonetic alphabet, except that it uses [hh] instead of [h]. It also does not contain allophones.
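As an illustration of what a raw dictionary entry contains, the following sketch parses one line, assuming the plain-text layout of a word followed by space-separated phoneme symbols, with a stress digit (1 for primary, 2 for secondary, 0 for unstressed) attached to each vowel. The function name `parse_cmudict_line` is hypothetical.

```python
def parse_cmudict_line(line):
    """Split a dictionary line into the word and (symbol, stress) pairs.

    Vowels carry a trailing stress digit; consonants carry none, so their
    stress is reported as None.
    """
    word, *symbols = line.split()
    phones = []
    for symbol in symbols:
        if symbol[-1].isdigit():
            phones.append((symbol[:-1], int(symbol[-1])))
        else:
            phones.append((symbol, None))
    return word, phones


word, phones = parse_cmudict_line("SPEECH  S P IY1 CH")
print(word, phones)
# SPEECH [('S', None), ('P', None), ('IY', 1), ('CH', None)]
```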
When Cynthia gets a word from the input stream, she looks it up in the dictionary to get the raw transcription. The CMU dictionary does not divide the words into syllables, but it does have primary and secondary stress marks. Cynthia takes the transcription for a given word and divides it into syllables. Then the phonological rules are applied, resulting in a better transcription.
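The syllable-division step can be sketched in a few lines. The version below uses the maximal onset principle, which is a common textbook approach and not necessarily the method Cynthia uses: consonants between two vowels are given to the following syllable as long as they form a legal English onset. The `LEGAL_ONSETS` table is a deliberately tiny, incomplete sample, any single consonant is treated as a legal onset for simplicity, and the names `is_vowel` and `syllabify` are hypothetical.

```python
# A small, incomplete sample of multi-consonant onsets (assumption).
LEGAL_ONSETS = {
    ("S", "P"), ("S", "T"), ("S", "K"), ("S", "T", "R"), ("S", "P", "R"),
    ("P", "R"), ("T", "R"), ("K", "R"), ("B", "R"), ("D", "R"), ("G", "R"),
    ("P", "L"), ("K", "L"), ("B", "L"), ("G", "L"), ("F", "L"), ("F", "R"),
}


def is_vowel(symbol):
    """In CMU-style transcriptions only vowels end in a stress digit."""
    return symbol[-1].isdigit()


def syllabify(phones):
    """Divide a phoneme list into syllables using the maximal onset principle."""
    vowels = [i for i, p in enumerate(phones) if is_vowel(p)]
    if not vowels:
        return [list(phones)]
    boundaries = []  # index where each non-initial syllable starts
    for prev, nxt in zip(vowels, vowels[1:]):
        cluster = phones[prev + 1:nxt]
        # k consonants stay with the previous syllable; try k = 0 first so
        # the following syllable receives the longest legal onset.
        for k in range(len(cluster) + 1):
            onset = tuple(cluster[k:])
            if len(onset) <= 1 or onset in LEGAL_ONSETS:
                boundaries.append(prev + 1 + k)
                break
    starts = [0] + boundaries
    ends = boundaries + [len(phones)]
    return [list(phones[s:e]) for s, e in zip(starts, ends)]


print(syllabify(["T", "EH0", "K", "N", "AA1", "L", "AH0", "JH", "IY0"]))
# [['T', 'EH0', 'K'], ['N', 'AA1'], ['L', 'AH0'], ['JH', 'IY0']]
```

Because "K N" is not a legal onset, the "K" stays with the first syllable of "technology", while a cluster such as "S T R" would move to the following syllable as a whole. Phonological rules could then be applied per syllable, using the stress digit on each vowel.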