Note by @fastfinge

We could, for example, attach an Eloquence audio sample, then ask for a synth that sounds similar. In case the AI couldn't make it from scratch, we could ask whether another synth could serve as the basis, for example eSpeak's Klatt variants. @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate @Tamasg

@clv1 @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate @Tamasg AI coding is nowhere near advanced enough for this.

Thank you to everyone who answered. Let's hope a solution is thought of and found soon enough. @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate @Tamasg

@clv1 so it's a lot.
1) Signal generation (the “voice box”)
This is the DSP engine: glottal source → filter(s) → radiation → output audio.
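To make that chain concrete, here is a minimal sketch of source → filters → radiation in plain Python. Everything in it is an assumption for illustration: the function names, the impulse-train source (a real engine would use a proper glottal pulse model), and the formant frequencies and bandwidths, which are only rough textbook-style values for an /a/-like vowel. It is not how Eloquence or any existing synth does it.

```python
import math

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole digital resonator (one formant), Klatt-style:
    y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(formants, f0_hz=110.0, duration_s=0.3, sample_rate=16000):
    """Return a list of float samples in [-1, 1] for a steady vowel.

    formants: list of (frequency_hz, bandwidth_hz) pairs (assumed values).
    """
    n = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))
    # Glottal source: a crude impulse train at the pitch period (assumption).
    source = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    # Filter(s): cascade one resonator per formant.
    signal = source
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    # Radiation: approximate lip radiation with a first difference.
    radiated = [signal[0]] + [signal[i] - signal[i - 1] for i in range(1, n)]
    # Output: normalize to peak amplitude 1.
    peak = max(abs(s) for s in radiated) or 1.0
    return [s / peak for s in radiated]

# Rough /a/-like formants; the bandwidths are guesses.
samples = synthesize_vowel([(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)])
```

The result is a buzzy, static vowel at best. Writing this part is the quick bit; everything after it is what makes it speech.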
2) Control model (turning phonemes into trajectories)
You need to decide how parameters move over time:
• How /a/ differs from /i/ in F1/F2
• How consonants inject noise and shape transitions
• Coarticulation: the “smearing” of neighboring sounds into each other
• Rules for duration and transitions (and exceptions)
This is where “it works” becomes “it sounds like a person instead of a kazoo.”
AI helps, but you still need a design. AI can implement whichever model you pick (Klatt-style rules, gestural targets, diphones-with-formants, etc.).
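As a rough illustration of "parameters moving over time", the sketch below turns a phoneme sequence into a per-frame (F1, F2) track by holding each vowel's target and then gliding linearly to the next one. The target values are approximate textbook-style numbers, and `VOWEL_TARGETS`, `formant_track`, and the timing defaults are all made up for this example. A real control model would use much richer rules than straight-line interpolation.

```python
# Approximate (F1, F2) targets in Hz; illustrative assumptions only.
VOWEL_TARGETS = {
    "a": (730.0, 1090.0),
    "i": (270.0, 2290.0),
}

def formant_track(phonemes, steady_ms=80, transition_ms=40, frame_ms=10):
    """Return one (F1, F2) tuple per frame for a sequence of vowels.

    Each vowel is held at its target, then glides linearly toward the
    next vowel's target: a very crude stand-in for coarticulation.
    """
    track = []
    for idx, phoneme in enumerate(phonemes):
        current = VOWEL_TARGETS[phoneme]
        track.extend([current] * (steady_ms // frame_ms))
        if idx + 1 < len(phonemes):
            following = VOWEL_TARGETS[phonemes[idx + 1]]
            steps = transition_ms // frame_ms
            for k in range(1, steps + 1):
                weight = k / (steps + 1)  # strictly between 0 and 1
                track.append(tuple(c + weight * (f - c)
                                   for c, f in zip(current, following)))
    return track

track = formant_track(["a", "i"])
```

With the defaults, `track` holds 20 frames: 8 at the /a/ target, 4 gliding frames, and 8 at the /i/ target. Feeding a track like this into the resonators frame by frame is where the rules, durations, and exceptions start piling up.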
3) Text to phonemes (G2P)
For English you can ship a dictionary + rules.
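A toy version of "dictionary + rules" might look like the sketch below: look each word up in a lexicon and fall back to naive letter-to-sound rules when it is missing. The two lexicon entries, the ARPAbet-style symbols, and the one-letter rules are illustrative assumptions; a real system would ship a full pronouncing dictionary and context-sensitive rules, and would normalize the text first.

```python
import re

# Tiny hypothetical lexicon with ARPAbet-style symbols (digits mark stress).
LEXICON = {
    "cat": ["K", "AE1", "T"],
    "speech": ["S", "P", "IY1", "CH"],
}

# Naive one-letter fallback rules; real English needs context-sensitive ones.
LETTER_RULES = {
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"],
    "f": ["F"], "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"],
    "k": ["K"], "l": ["L"], "m": ["M"], "n": ["N"], "o": ["AA"],
    "p": ["P"], "q": ["K"], "r": ["R"], "s": ["S"], "t": ["T"],
    "u": ["AH"], "v": ["V"], "w": ["W"], "x": ["K", "S"], "y": ["Y"],
    "z": ["Z"],
}

def to_phonemes(text):
    """Convert text to a flat phoneme list: dictionary first, rules second."""
    words = re.findall(r"[a-z']+", text.lower())
    phonemes = []
    for word in words:
        if word in LEXICON:
            phonemes.extend(LEXICON[word])
        else:
            for letter in word:
                # Characters with no rule (e.g. apostrophes) are skipped.
                phonemes.extend(LETTER_RULES.get(letter, []))
    return phonemes
```

For example, `to_phonemes("Cat")` comes from the lexicon, while an unknown word like "zap" falls through to the letter rules. The fallback is exactly where the quality problems start.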
• normalization (numbers, dates, abbreviations)
• tokenization
• stress rules
• phoneme mapping
5) Voice design + tuning
Even with a perfect engine, it’s easy to end up with “robotic but intelligible” rather than “pleasant.”
This is typically the biggest time sink because it’s:
• parameter tables
• hundreds of little exceptions
• endless listening tests
Rough time estimates:
• DSP engine: days to a couple weeks
• G2P + normalization: weeks
• coarticulation + durations: weeks to months
• prosody: weeks to months
• tuning to ‘nice’: open-ended
@fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate

@Tamasg @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate thanks for this overview. Indeed, we would need skilled developers, engineers and maybe linguists working full time on a project like this for a few months at least.

@clv1 yeah, I think if a team came together for it, splitting that work by person, or 1 to 2 people per section, could really work. I know I could be useful here at the later shaping stages, so do count me in; it's the architecture creation and initial rules I'm a bit out of my depth on. But yeah, not against being included.
@fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate

@Tamasg @clv1 @fastfinge @jscholes @cachondo @amir @ZBennoui @pixelate I can do nothing here other than testing, so feel free to bring me back in at the later stages. Until then, I have 0 things to contribute. Very much outside of my understanding.

@clv1 @cachondo @Tamasg @jscholes @FreakyFwoof @pixelate @ZBennoui @amir And also UX researchers, probably. I can't articulate why Eloquence is better than DECtalk, for me. Neither, I bet, could Andre articulate what makes Orpheus better than Eloquence, for him. So getting something that makes the largest number of people as happy as possible is a classic UX research problem, probably involving massive surveys, rating and ranking of samples, and so on. I work with the kind of people qualified to do this, and it's a unique skill set in and of itself.

@fastfinge Sort of my thought, sadly. It's gotten better, no doubt; you can now get AI to spit out 60 KB of slop in one go, wow, progress. xD So context has improved, maybe a slightly better skillset, but the amount of time you'd spend debugging and seeing which step it went wrong on, especially for all the low-level plumbing an engine needs, is brutal. @clv1 @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate

@Tamasg @clv1 @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate If you knew the technical requirements well enough to do it yourself, AI could do it for you slightly faster. But if you couldn’t have done it on your own, AI won’t help.