🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
@hosford42 The reason I say systems-level programming is mostly that, for a text to speech system used by a blind power user, you need to keep an eye on performance. If the system crashes and the computer stops talking, the only choice the user has is to hard reset. It would be running and speaking the entire time the computer is in use, so memory leaks and other inefficiencies add up extremely quickly.
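Concretely, by "keep an eye on performance" I mean things like never allocating in the audio hot path: one preallocated buffer that gets reused for every callback, for hours on end. A rough C sketch of that shape (the names and numbers are made up, this isn't any real screen reader's code):

```c
/* Minimal sketch: a long-running synthesis callback that reuses one
 * preallocated buffer, so hours of continuous speech never touch the
 * allocator. All names here are hypothetical. */
#include <stddef.h>
#include <string.h>

#define FRAME_SAMPLES 512            /* samples rendered per audio callback */

typedef struct {
    float buffer[FRAME_SAMPLES];     /* reused every callback, never reallocated */
    double phase;                    /* whatever synthesis state persists */
} SynthState;

/* Hypothetical synthesis step: fill the fixed buffer in place. */
static void render_frame(SynthState *s)
{
    for (size_t i = 0; i < FRAME_SAMPLES; i++) {
        s->buffer[i] = 0.0f;         /* real synthesis would write samples here */
    }
}

/* Audio callback: no malloc, no free, just reuse of preallocated state. */
void audio_callback(SynthState *s, float *out, size_t n)
{
    size_t count = (n < (size_t)FRAME_SAMPLES) ? n : (size_t)FRAME_SAMPLES;
    render_frame(s);
    memcpy(out, s->buffer, count * sizeof(float));
}
```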

From what I can tell, the ideal is some sort of formant-based vocal tract model. Espeak sort of does this, but only for the voiced sounds. Plosives are generated by modeling recorded speech, so they sound weird and overly harsh to most users, and I suspect this is where most of the complaints about espeak come from. A neural network or other machine learning model could be useful to discover the best parameters and drive the model, but not to generate the audio itself, I don't think. That's because most modern LLM-based neural network models don't let you change pitch, speed, etc., since all of that comes from the training data.
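For anyone who hasn't played with formant synthesis, the core trick is small: a pulse train pushed through a few two-pole resonators tuned to the vowel's formants. Here's a toy C sketch of the Klatt-style resonator idea (the formant values for /a/ are textbook approximations, not eSpeak's actual tables); compile with something like `cc formant.c -lm`, and the raw PCM should play with something like `aplay -f S16_LE -r 16000 out.raw`:

```c
/* Toy formant synthesizer: a pulse-train excitation run through cascaded
 * two-pole (Klatt-style) resonators tuned to vowel formants.
 * Output: one second of raw mono 16-bit little-endian PCM at 16 kHz. */
#include <math.h>
#include <stdio.h>
#include <stdint.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define FS 16000.0                       /* sample rate in Hz */

typedef struct { double a, b, c, y1, y2; } Resonator;

/* Klatt-style resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2] */
static void resonator_init(Resonator *r, double freq, double bw)
{
    r->c = -exp(-2.0 * M_PI * bw / FS);
    r->b = 2.0 * exp(-M_PI * bw / FS) * cos(2.0 * M_PI * freq / FS);
    r->a = 1.0 - r->b - r->c;
    r->y1 = r->y2 = 0.0;
}

static double resonator_step(Resonator *r, double x)
{
    double y = r->a * x + r->b * r->y1 + r->c * r->y2;
    r->y2 = r->y1;
    r->y1 = y;
    return y;
}

int main(void)
{
    /* Three formants roughly matching the vowel /a/ (assumed values). */
    Resonator f1, f2, f3;
    resonator_init(&f1, 700.0, 90.0);
    resonator_init(&f2, 1220.0, 110.0);
    resonator_init(&f3, 2600.0, 170.0);

    FILE *out = fopen("out.raw", "wb");
    if (!out) return 1;

    double pitch = 120.0;                /* glottal pulse rate in Hz */
    int period = (int)(FS / pitch);
    for (int n = 0; n < (int)FS; n++) {  /* one second of audio */
        double excitation = (n % period == 0) ? 1.0 : 0.0;
        double s = resonator_step(&f3,
                       resonator_step(&f2,
                           resonator_step(&f1, excitation)));
        if (s > 1.0) s = 1.0;            /* crude clamp, no real limiter */
        if (s < -1.0) s = -1.0;
        int16_t sample = (int16_t)(s * 8000.0);
        fwrite(&sample, sizeof sample, 1, out);
    }
    fclose(out);
    return 0;
}
```

Changing the pitch or the speaking rate here is just changing a couple of numbers, which is exactly what you lose when the audio comes straight out of a trained model.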

Secondly, the phonemizer needs to be reproducible. What if, say, it mispronounces "Hermione"? With most modern text to speech systems, this is hard to fix; the output is not always the same for any given input. So a correction like "her my oh nee" might work in some circumstances but not others, because how the model decides to pronounce words, and where it puts the emphasis, is just a black box. The state of the art here remains Eloquence, which uses no machine learning at all, just hundreds of thousands of hand-coded rules and formants. But, of course, it's closed source (and as far as anyone can tell, the source has actually been lost since the early 2000s), so goodness knows what all those rules are.
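What I'd want is boring and deterministic: a user dictionary that's just an exact-match table, checked before any letter-to-sound rules ever run, so a correction applies the same way every single time. A rough C sketch of that idea (hypothetical, and definitely not how Eloquence actually does it):

```c
/* Sketch of a deterministic correction layer: user-supplied respellings are
 * checked first in a plain lookup table, so "Hermione" always comes out the
 * same way regardless of context, before any rule-based phonemization runs. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

typedef struct { const char *word; const char *respelling; } Exception;

/* User dictionary: exact-match respellings, applied before the rules. */
static const Exception user_dict[] = {
    { "hermione", "her my oh nee" },
};

/* Case-insensitive exact lookup; returns NULL if no correction exists. */
static const char *lookup_exception(const char *word)
{
    char lowered[64];
    size_t len = strlen(word);
    if (len >= sizeof lowered) return NULL;
    for (size_t i = 0; i <= len; i++)
        lowered[i] = (char)tolower((unsigned char)word[i]);
    for (size_t i = 0; i < sizeof user_dict / sizeof user_dict[0]; i++)
        if (strcmp(lowered, user_dict[i].word) == 0)
            return user_dict[i].respelling;
    return NULL;
}

/* Placeholder for the rule-based letter-to-sound pass. */
static const char *rules_phonemize(const char *word) { return word; }

const char *phonemize_word(const char *word)
{
    const char *fix = lookup_exception(word);
    return fix ? fix : rules_phonemize(word);
}

int main(void)
{
    printf("%s\n", phonemize_word("Hermione"));  /* always "her my oh nee" */
    return 0;
}
```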