@fastfinge Really interesting article. I'm particularly passionate about this subject; I've been fascinated with TTS for a number of years. I've trained many voices, both for Piper and for some of the newer LLM-based systems, and while I can't speak to the speed issue, training data is extremely important.
What you feed into these models has a big impact on the voice's overall performance. If you give it stuff scraped from the web, random audiobooks that weren't optimized for TTS, things like that, you're not going to get good results for the type of work screen reader users do every day. This applies to all of these systems, not just neural networks. The latency / responsiveness issue is something we'll have to solve at some point, because I don't think using TTS systems last updated in 2003 is going to work out in the long term, as much as I love Eloquence.
In my ideal world, we would have either a machine-learning-based or formant system that is easy to train / maintain. Big companies have lost interest in on-device TTS, and not just for screen reader users. Many of the solutions being put out now are cloud-based, and while developers are still creating on-device models, as the article says, they're not optimized for our needs and may never be. I think we have to take matters into our own hands and figure this out, and I believe with enough people we can make it happen.
@ZBennoui We need a good formant system. Machine learning is useful for setting the model parameters. But I think the word-to-phoneme rules can’t be a neural network, because they have to be reproducible and modifiable. Even here, though, machine learning could help. I’d love a system where a user could submit a recording of a word, and the system could create the phonetic representation.
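To make "reproducible and modifiable" concrete, here is a minimal sketch of a rule-based word-to-phoneme step, assuming a toy rule table and a user lexicon that overrides it. The names `RULES` and `phonemize`, and the phoneme symbols themselves, are hypothetical and not taken from Eloquence, Piper, or any other real synthesizer.

```python
# Toy letter-to-sound rules (hypothetical, illustrative only).
# Multi-letter patterns are matched before single letters.
RULES = {
    "sh": "SH",
    "ch": "CH",
    "ee": "IY",
    "s": "S",
    "h": "HH",
    "i": "IH",
    "e": "EH",
    "p": "P",
    "t": "T",
}


def phonemize(word, lexicon=None, rules=RULES):
    """Return a list of phoneme symbols for `word`.

    A user-supplied lexicon entry always wins, so a wrong pronunciation
    can be fixed by adding one entry. Otherwise the longest matching
    rule is applied at each position, so the same input always gives
    the same output.
    """
    word = word.lower()
    if lexicon and word in lexicon:
        return list(lexicon[word])

    longest = max(len(pattern) for pattern in rules)
    phonemes = []
    i = 0
    while i < len(word):
        # Try the longest pattern first, then shorter ones.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                phonemes.append(rules[chunk])
                i += size
                break
        else:
            i += 1  # no rule for this letter: skip it
    return phonemes


# Rules alone, then a user correction for one word.
print(phonemize("sheep"))
user_lexicon = {"ship": ["S", "IH", "P"]}
print(phonemize("ship", lexicon=user_lexicon))
```

The point of the sketch is only that every output can be traced back to a specific rule or lexicon entry that a user can read and change, which is harder to guarantee with a neural phonemizer.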
@fastfinge Yeah, I completely agree. I happen to know Philip and have been talking with him extensively about his experiments with TTS. I can't go into a ton of detail, but I'll say what he's said publicly. The system he's using is a hybrid of neural networks and formant synthesis, where he trains a model to output formant frequencies based on the audio data he feeds into it. I won't pretend to understand all the details; this is way above my pay grade. But as far as I understand, this has never been done before by another developer.
@ZBennoui Yup. I just wish he wasn’t also trying to train his own phonemizer, because I really believe that has to be reproducible and modifiable for users. I’ve swapped multiple emails with him about an NVDA add-on. But he’s pretty set on SAPI for now, until things stabilize both on the NVDA side and on his side.