Text-to-speech for speech-to-speech translation


We designed a finite-state transducer (FST) based front-end system for text-to-speech synthesis, which processes input text and passes a feature file to an HMM-based speech synthesizer.

We posed the problem of predicting features (such as phone identity, duration, and intonation) from input text in a machine learning framework, as a statistical transformation of the input text sequence into an output feature sequence. We further proposed a novel duration modeling technique that optimally quantizes the duration space into a lower-dimensional space. Finally, we evaluated the system both objectively and subjectively and found that it competes well with state-of-the-art front-end systems.
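To make the idea of quantizing the duration space concrete, here is a minimal sketch that maps continuous phone durations onto a small codebook of representative values. It uses Lloyd's algorithm (one-dimensional k-means) as a stand-in and is not necessarily the technique used in this work. The function names `lloyd_quantize` and `encode`, the codebook size, and the sample durations are all hypothetical.

```python
def lloyd_quantize(durations, k, iters=50):
    """Build a k-level codebook for 1-D durations with Lloyd's algorithm.

    Illustrative assumption: this is a generic scalar quantizer, not the
    duration model described in the text.
    """
    xs = sorted(durations)
    n = len(xs)
    # Initialise centroids at evenly spaced quantiles of the data.
    centroids = [xs[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each duration goes to its nearest centroid.
        buckets = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda i: abs(x - centroids[i]))
            buckets[j].append(x)
        # Update step: move each centroid to the mean of its bucket;
        # an empty bucket keeps its previous centroid.
        new = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
        if new == centroids:
            break  # converged
        centroids = new
    return sorted(centroids)


def encode(duration, codebook):
    """Return the index of the codebook entry closest to `duration`."""
    return min(range(len(codebook)), key=lambda i: abs(duration - codebook[i]))


# Hypothetical phone durations in milliseconds.
sample_durations = [40, 45, 50, 100, 110, 120, 200, 210, 220]
codebook = lloyd_quantize(sample_durations, k=3)
print(codebook, encode(48, codebook))
```

With a codebook like this, a duration predictor only has to choose among `k` discrete classes rather than regress a continuous value, which is one way a sequence-to-sequence front end could treat duration as just another output symbol.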

This implementation is advantageous in that it requires minimal human input during the learning process and can enable rapid front-end creation for different languages using generic resources such as e-books.