clv1 has moved @clv1@mastodon.social
5mo
We could, for example, attach an Eloquence audio sample and then ask for a synth that sounds similar. If the AI couldn't make one from scratch, we could ask whether another synth could serve as the basis, for example eSpeak's Klatt variants. @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate @Tamasg
5mo
@clv1 @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate @Tamasg AI coding is nowhere near advanced enough for this.

clv1 has moved @clv1@mastodon.social
5mo
Thank you to everyone who answered. Let's hope a solution is thought of and found soon. @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate @Tamasg
Tamas G @Tamasg@mindly.social
5mo
@clv1 so it's a lot.
1) Signal generation (the “voice box”)
This is the DSP engine: glottal source → filter(s) → radiation → output audio.
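As a rough illustration of that source → filter → radiation chain, here is a minimal Python sketch. It assumes a bare impulse train as the glottal source (real engines use shaped pulses), a cascade of Klatt-style second-order resonators for the formants, and a first difference for lip radiation. The function names and the formant/bandwidth values in the usage line are illustrative only, not taken from any existing engine.

```python
import math

def glottal_source(f0, n_samples, sr):
    """Crude glottal source: one unit impulse per pitch period (simplification)."""
    period = sr / f0
    out = [0.0] * n_samples
    t = 0.0
    while t < n_samples:
        out[int(t)] = 1.0
        t += period
    return out

def resonator(signal, freq, bw, sr):
    """Klatt-style second-order resonator modelling one formant."""
    T = 1.0 / sr
    c = -math.exp(-2 * math.pi * bw * T)
    b = 2 * math.exp(-math.pi * bw * T) * math.cos(2 * math.pi * freq * T)
    a = 1.0 - b - c  # unity gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def radiation(signal):
    """Lip radiation approximated as a first difference (roughly +6 dB/octave)."""
    prev = 0.0
    out = []
    for x in signal:
        out.append(x - prev)
        prev = x
    return out

def synth_vowel(f0, formants, duration, sr=16000):
    """Source -> cascaded formant filters -> radiation, normalized to [-1, 1]."""
    sig = glottal_source(f0, int(duration * sr), sr)
    for freq, bw in formants:  # cascade: each formant filters the previous output
        sig = resonator(sig, freq, bw, sr)
    sig = radiation(sig)
    peak = max(abs(s) for s in sig) or 1.0
    return [s / peak for s in sig]

# Usage: half a second of a rough /a/ at 120 Hz (approximate formant values).
samples = synth_vowel(120, [(730, 90), (1090, 110), (2440, 170)], 0.5)
```

The samples could then be written to a 16-bit WAV file with Python's standard `wave` module for listening tests.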
2) Control model (turning phonemes into trajectories)
You need to decide how parameters move over time:
• How /a/ differs from /i/ in F1/F2
• How consonants inject noise and shape transitions
• Coarticulation: the “smearing” of neighboring sounds into each other
• Rules for duration and transitions (and exceptions)
This is where “it works” becomes “it sounds like a person instead of a kazoo.”
AI helps, but you still need a design. AI can implement whichever model you pick (Klatt-style rules, gestural targets, diphones-with-formants, etc.).
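To make the control-model idea concrete, here is a toy Python sketch that turns a phoneme sequence into per-frame F1/F2 values. It assumes approximate average adult-male vowel targets and models coarticulation as nothing more than a linear blend near each boundary; the target table, function names, and timing parameters are illustrative assumptions, not any particular engine's rules.

```python
# Approximate average adult-male formant targets (F1, F2) in Hz.
# Illustrative values only; a real engine would tune these per voice.
TARGETS = {
    "i": (270, 2290),  # as in "beet"
    "a": (730, 1090),  # as in "father"
    "u": (300, 870),   # as in "boot"
}

def lerp(p, q, w):
    """Linear blend between two formant tuples (w=0 -> p, w=1 -> q)."""
    return tuple(a + (b - a) * w for a, b in zip(p, q))

def formant_track(segments, frame_ms=5.0, transition_ms=40.0):
    """Return one (F1, F2) tuple per frame for a list of (phoneme, duration_ms).

    Each phoneme holds its target away from its edges; within transition_ms / 2
    of a boundary the values are blended linearly with the neighbour, a very
    crude stand-in for coarticulation.
    """
    ends, total = [], 0.0
    for _, dur in segments:
        total += dur
        ends.append(total)
    half = transition_ms / 2
    track = []
    for i in range(int(total / frame_ms)):
        time = i * frame_ms
        idx = next(k for k, end in enumerate(ends) if time < end)
        start = ends[idx - 1] if idx > 0 else 0.0
        cur = TARGETS[segments[idx][0]]
        if idx > 0 and time - start < half:
            # Just after a boundary: start halfway from the previous target.
            prev = TARGETS[segments[idx - 1][0]]
            track.append(lerp(prev, cur, 0.5 + (time - start) / transition_ms))
        elif idx < len(segments) - 1 and ends[idx] - time < half:
            # Just before a boundary: drift halfway toward the next target.
            nxt = TARGETS[segments[idx + 1][0]]
            track.append(lerp(cur, nxt, 0.5 - (ends[idx] - time) / transition_ms))
        else:
            track.append(cur)
    return track

track = formant_track([("i", 100), ("a", 100)])
```

The resulting track is continuous across the /i/ to /a/ boundary, meeting at the midpoint of the two targets. Real coarticulation is far less symmetric, which is where the design work mentioned above comes in.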
3) Text to phonemes (G2P)
For English you can ship a dictionary + rules.
• normalization (numbers, dates, abbreviations)
• tokenization
• stress rules
• phoneme mapping

5) Voice design + tuning
Even with a perfect engine, it’s easy to end up with “robotic but intelligible” rather than “pleasant.”
This is typically the biggest time sink because it’s:
• parameter tables
• hundreds of little exceptions
• endless listening tests

Rough time per piece:
• DSP engine: days to a couple of weeks
• G2P + normalization: weeks
• coarticulation + durations: weeks to months
• prosody: weeks to months
• tuning to ‘nice’: open-ended
@fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate
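For the text-to-phonemes (G2P) step in section 3, here is a toy Python sketch of the pipeline order: normalization, tokenization, dictionary lookup, and a letter-rule fallback that puts primary stress on the first vowel. The two lexicon entries use ARPAbet with CMUdict-style stress digits; the letter rules, abbreviation table, and function names are hypothetical and far too naive for real English.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    """Spell out 0-99; larger numbers are read digit by digit (simplification)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    return " ".join(ONES[int(d)] for d in str(n))

def normalize(text):
    """Lowercase and expand digit runs into words."""
    text = text.lower()
    return re.sub(r"\d+", lambda m: " " + number_to_words(int(m.group())) + " ", text)

def tokenize(text):
    """Split into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text)

# Hypothetical mini-lexicon (ARPAbet, CMUdict-style stress digits).
LEXICON = {"hello": "HH AH0 L OW1", "world": "W ER1 L D"}
ABBREVIATIONS = {"dr": "doctor"}  # real abbreviations are often ambiguous
# Deliberately naive one-letter-one-sound fallback rules (hypothetical).
LETTER_RULES = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K S", "y": "Y", "z": "Z",
}
VOWELS = {"AE", "EH", "IH", "AA", "AH"}

def letters_to_phones(word):
    """Fallback: map letters to phones, stress the first vowel, unstress the rest."""
    phones, stressed = [], False
    for ch in word:
        for p in LETTER_RULES.get(ch, "").split():
            if p in VOWELS:
                p += "0" if stressed else "1"
                stressed = True
            phones.append(p)
    return " ".join(phones)

def g2p(text):
    """Text -> list of phone strings, one per word."""
    result = []
    for word in tokenize(normalize(text)):
        word = ABBREVIATIONS.get(word, word)
        result.append(LEXICON.get(word) or letters_to_phones(word))
    return result
```

For example, `g2p("Hello world")` comes straight from the lexicon, while an unknown word like "cat" falls through to the letter rules and yields `"K AE1 T"`.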
clv1 has moved @clv1@mastodon.social
5mo
@Tamasg @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate thanks for this overview. Indeed, we would need skilled developers, engineers and maybe linguists working full time on a project like this for a few months at least.
Tamas G @Tamasg@mindly.social
5mo
@clv1 Yeah, I think if a team came together for it, splitting the work by section, with one or two people per section, could really work. I know I could be useful at the later shaping stages, so do count me in; it's the architecture creation and initial rules where I'm less sure of myself. But yeah, I'm happy to be included.
@fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate
Andre Louis @FreakyFwoof@universeodon.com
5mo
@Tamasg @clv1 @fastfinge @jscholes @cachondo @amir @ZBennoui @pixelate I can do nothing here other than testing, so feel free to bring me back in at the later stages. Until then, I have nothing to contribute; this is very much outside my understanding.
Luis Carlos @luiscarlosgonzalez@mastodon.social
5mo
@clv1 @cachondo @Tamasg @jscholes @FreakyFwoof @pixelate @ZBennoui @amir And also UX researchers, probably. I can't articulate why Eloquence is better than DECtalk for me. Neither, I bet, could Andre articulate what makes Orpheus better than Eloquence for him. So getting something that makes as many people as happy as possible is a classic UX research problem, probably involving massive surveys, rating and ranking of samples, and so on. I work with the kind of people qualified to do this, and it's a unique skill set in its own right.
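One common way to turn "rating and ranking of samples" into numbers is a mean opinion score (MOS): each listener scores each voice from 1 to 5, and the voices are compared by their average. The Python sketch below assumes that setup, uses a simple normal-approximation 95% interval, and runs on made-up example ratings; it is an illustration, not a substitute for the survey design a UX researcher would do.

```python
import math
import statistics

def mos_summary(ratings):
    """Mean opinion score and an approximate 95% confidence interval per voice.

    ratings maps a voice name to a list of 1-5 listener scores (at least two).
    The interval uses a normal approximation (1.96 * standard error), which is
    only reasonable with a decent number of listeners.
    """
    summary = {}
    for voice, scores in ratings.items():
        mean = statistics.mean(scores)
        half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
        summary[voice] = (mean, mean - half_width, mean + half_width)
    return summary

def rank_voices(ratings):
    """Voice names ordered from highest to lowest mean score."""
    summary = mos_summary(ratings)
    return sorted(summary, key=lambda v: summary[v][0], reverse=True)

# Hypothetical example ratings, not real survey data.
ratings = {"voice_a": [4, 5, 4, 5], "voice_b": [3, 3, 4, 2]}
```

Overlapping intervals would suggest that more listeners are needed before declaring a winner; pairwise "which do you prefer?" tests are another option when absolute scores are noisy.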
Tamas G @Tamasg@mindly.social
5mo
@fastfinge That's sort of my thought too, sadly. It's gotten better, no doubt: you can now get AI to spit out 60 KB of slop in one go, wow, progress. xD So the context window has improved, and maybe the skill set is slightly better, but the amount of time you'd spend debugging and working out which step it went wrong on, especially in all the low-level plumbing an engine needs, is brutal. @clv1 @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate
@Tamasg @clv1 @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate If you knew the technical requirements well enough to do it yourself, AI could do it for you slightly faster. But if you couldn’t have done it on your own, AI won’t help.