We could, for example, attach an Eloquence audio sample, then ask for a synth that sounds similar. If the AI couldn't build it from scratch, we could ask whether another synth could serve as the basis, for example eSpeak's Klatt variants. @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate @Tamasg
@clv1 so it's a lot.

1) Signal generation (the “voice box”)
This is the DSP engine: glottal source → filter(s) → radiation → output audio.

2) Control model (turning phonemes into trajectories)
You need to decide how parameters move over time:
• How /a/ differs from /i/ in F1/F2
• How consonants inject noise and shape transitions
• Coarticulation: the “smearing” of neighboring sounds into each other
• Rules for duration and transitions (and exceptions)
This is where “it works” becomes “it sounds like a person instead of a kazoo.” AI helps, but you still need a design. AI can implement whichever model you pick (Klatt-style rules, gestural targets, diphones-with-formants, etc.).

3) Text to phonemes (G2P)
For English you can ship a dictionary + rules.
• normalization (numbers, dates, abbreviations)
• tokenization
• stress rules
• phoneme mapping

5) Voice design + tuning
Even with a perfect engine, it’s easy to end up with “robotic but intelligible” rather than “pleasant.” This is typically the biggest time sink because it’s:
• parameter tables
• hundreds of little exceptions
• endless listening tests

Rough time per piece:
• DSP engine: days to a couple of weeks
• G2P + normalization: weeks
• coarticulation + durations: weeks to months
• prosody: weeks to months
• tuning to ‘nice’: open-ended

@fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate
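To make pieces 1 and 2 a bit more concrete, here is a minimal sketch in Python of a source → filter → radiation chain, with the formants gliding from /a/ to /i/. Everything specific in it is an assumption for illustration: the impulse-train source, the three-resonator cascade in the style of Klatt's second-order resonator, the rough textbook formant targets and bandwidths, the straight-line interpolation, and the names `resonator_coeffs`, `synthesize` and `write_wav`. It is not how Eloquence, DECtalk or eSpeak actually do it, and it will sound very much like the kazoo.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # Hz; an assumed rate, not taken from any real synth

# Rough textbook-style formant targets (F1, F2, F3) in Hz. Illustrative only.
VOWELS = {
    "a": (730.0, 1090.0, 2440.0),
    "i": (270.0, 2290.0, 3010.0),
}
BANDWIDTHS = (80.0, 90.0, 120.0)  # Hz per formant; assumed values


def resonator_coeffs(freq, bw, sr=SAMPLE_RATE):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (unity gain at DC)."""
    t = 1.0 / sr
    c = -math.exp(-2.0 * math.pi * bw * t)
    b = 2.0 * math.exp(-math.pi * bw * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c
    return a, b, c


def synthesize(start, end, duration=0.4, f0=110.0, sr=SAMPLE_RATE):
    """Glide from vowel `start` to vowel `end`; returns samples in [-1, 1]."""
    n_samples = round(duration * sr)
    period = int(sr / f0)  # samples between glottal pulses
    state = [[0.0, 0.0] for _ in BANDWIDTHS]  # y[n-1], y[n-2] per resonator
    prev = 0.0
    out = []
    for n in range(n_samples):
        frac = n / max(n_samples - 1, 1)  # 0.0 at the start, 1.0 at the end
        # Source: one impulse per pitch period (a crude stand-in for a glottal pulse).
        x = 1.0 if n % period == 0 else 0.0
        # Filter: cascade of resonators whose centre frequencies move over time.
        for k, bw in enumerate(BANDWIDTHS):
            f_start, f_end = VOWELS[start][k], VOWELS[end][k]
            freq = f_start + frac * (f_end - f_start)  # linear trajectory
            a, b, c = resonator_coeffs(freq, bw, sr)
            y = a * x + b * state[k][0] + c * state[k][1]
            state[k][1] = state[k][0]
            state[k][0] = y
            x = y
        # Radiation: first difference, approximating lip radiation.
        out.append(x - prev)
        prev = x
    peak = max(abs(s) for s in out) or 1.0
    return [s / peak for s in out]


def write_wav(path, samples, sr=SAMPLE_RATE):
    """Write mono 16-bit PCM so the result can be auditioned."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sr)
        wav.writeframes(struct.pack("<%dh" % len(ints), *ints))
```

Calling `write_wav("a_to_i.wav", synthesize("a", "i"))` should produce a short buzzy glide. The straight-line interpolation is the part a real control model replaces with coarticulation and duration rules, which is where the weeks-to-months estimates come from.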
@clv1 @cachondo @Tamasg @jscholes @FreakyFwoof @pixelate @ZBennoui @amir And also UX researchers, probably. I can't articulate why Eloquence is better than DECtalk, for me. Neither, I bet, could Andre articulate what makes Orpheus better than Eloquence, for him. So getting something that makes the largest number of people as happy as possible is a classic UX research problem, probably involving massive surveys, rating and ranking of samples, and so on. I work with the kind of people qualified to do this, and it's a unique skill set in and of itself.