We could, for example, attach an Eloquence audio sample, then ask for a synth that sounds similar. In case the AI couldn't make it from scratch, we could ask whether another synth could be the basis, for example eSpeak's Klatt variants. @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate @Tamasg
@clv1 so it's a lot.

1) Signal generation (the "voice box")
This is the DSP engine: glottal source → filter(s) → radiation → output audio.

2) Control model (turning phonemes into trajectories)
You need to decide how parameters move over time:
• How /a/ differs from /i/ in F1/F2
• How consonants inject noise and shape transitions
• Coarticulation: the "smearing" of neighboring sounds into each other
• Rules for duration and transitions (and exceptions)
This is where "it works" becomes "it sounds like a person instead of a kazoo." AI helps, but you still need a design. AI can implement whichever model you pick (Klatt-style rules, gestural targets, diphones-with-formants, etc.).

3) Text to phonemes (G2P)
For English you can ship a dictionary + rules.
• normalization (numbers, dates, abbreviations)
• tokenization
• stress rules
• phoneme mapping

5) Voice design + tuning
Even with a perfect engine, it's easy to end up with "robotic but intelligible" rather than "pleasant." This is typically the biggest time because it's:
• parameter tables
• hundreds of little exceptions
• endless listening tests

• DSP engine: days to a couple weeks
• G2P + normalization: weeks
• coarticulation + durations: weeks to months
• prosody: weeks to months
• tuning to "nice": open-ended

@fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate
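To make the "glottal source → filter(s) → radiation" chain from step 1 a bit more concrete, here is a minimal sketch in Python, not a real engine. It assumes a bare impulse train as the glottal source, a cascade of Klatt-style two-pole resonators as the filters, and a first difference as the radiation stage. The formant frequencies are the commonly cited Peterson and Barney adult-male averages for /a/ and /i/; the bandwidths, the sample rate, and every function name here are illustrative assumptions, not taken from Eloquence, eSpeak, or any other engine.

```python
import math

SAMPLE_RATE = 16000  # assumed sample rate for this sketch

# (frequency Hz, bandwidth Hz) for F1..F3. Frequencies follow the classic
# Peterson & Barney male averages; bandwidths are rough guesses.
VOWEL_FORMANTS = {
    "a": [(730, 90), (1090, 110), (2440, 170)],
    "i": [(270, 60), (2290, 100), (3010, 200)],
}

def glottal_source(f0_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Crude voicing source: one unit impulse per pitch period."""
    n = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (unity gain at DC)."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonate(signal, freq_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Run the signal through a single formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def radiate(signal):
    """Lip radiation approximated as a first difference (high-frequency tilt)."""
    out = []
    prev = 0.0
    for x in signal:
        out.append(x - prev)
        prev = x
    return out

def synthesize_vowel(vowel, f0_hz=110.0, duration_s=0.3, sample_rate=SAMPLE_RATE):
    """Source -> cascaded formant filters -> radiation -> peak-normalized samples."""
    signal = glottal_source(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in VOWEL_FORMANTS[vowel]:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate)
    signal = radiate(signal)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal] if peak > 0 else signal

if __name__ == "__main__":
    for vowel in ("a", "i"):
        samples = synthesize_vowel(vowel)
        print(vowel, len(samples), "samples")
```

This only gives a steady, buzzy vowel. Everything in steps 2 through 5 (moving those formant targets over time, noise for consonants, durations, prosody, tuning) is what sits on top of it, which is why the DSP part is the short end of the estimate.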
@clv1 yeah, I think if a team came together for it, splitting that work by person, or 1 to 2 people per section, could really work. I know I could be useful here at the later shaping stages, so do count me in; it's the architecture creation and initial rules I'm a bit out on. But yeah, not against being included. @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate
@clv1 @cachondo @Tamasg @jscholes @FreakyFwoof @pixelate @ZBennoui @amir And also UX researchers, probably. I can't articulate why Eloquence is better than DECtalk, for me. Neither, I bet, could Andre articulate what makes Orpheus better than Eloquence, for him. So getting something that makes the largest number of people as happy as possible is a classic UX research problem, probably involving massive surveys, rating and ranking of samples, and so on. I work with the kind of people qualified to do this, and it's a unique skill set in and of itself.
@fastfinge Sort of my thought, sadly. It's gotten better, no doubt; you can now get AI to spit out 60 KB of slop in one go, wow, progress. xD So context improved, maybe a slightly better skill set, but the amount of time you'd spend debugging and seeing which step it went wrong on, especially for all the low-level plumbing an engine needs, is brutal. @clv1 @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate