clv1 has moved @clv1@mastodon.social
5mo
We could, for example, attach an Eloquence audio sample and then ask for a synth that sounds similar. If the AI couldn't build it from scratch, we could ask whether another synth could serve as the basis, for example eSpeak's Klatt variants. @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate @Tamasg
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
5mo
@clv1 @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate @Tamasg AI coding is nowhere near advanced enough for this.
clv1 has moved @clv1@mastodon.social
5mo
Thank you to everyone who answered. Let's hope a solution is found soon. @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate @Tamasg
Tamas G @Tamasg@mindly.social
5mo
@clv1 so it's a lot.
1) Signal generation (the “voice box”)
This is the DSP engine: glottal source → filter(s) → radiation → output audio.
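To make that chain concrete, here is a minimal sketch of a source-filter pipeline. Everything in it is an assumption for illustration: the impulse-train source, the 16 kHz sample rate, the rough /a/-like formant values, and all function names. The resonator uses the standard two-pole coefficient formulas found in Klatt-style synths, but this is not how Eloquence or eSpeak actually implement it.

```python
import math

SAMPLE_RATE = 16000  # Hz; assumed for illustration


def glottal_source(f0, n_samples, sample_rate=SAMPLE_RATE):
    """Crude glottal source: one unit impulse per pitch period."""
    period = round(sample_rate / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def resonator(signal, freq, bandwidth, sample_rate=SAMPLE_RATE):
    """Two-pole digital resonator modelling a single formant."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c  # normalizes gain to 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def radiation(signal):
    """Lip radiation approximated as a first difference (a simple high-pass)."""
    prev = 0.0
    out = []
    for x in signal:
        out.append(x - prev)
        prev = x
    return out


def synthesize_vowel(f0, formants, duration_s, sample_rate=SAMPLE_RATE):
    """Source -> cascaded formant filters -> radiation."""
    n = int(duration_s * sample_rate)
    audio = glottal_source(f0, n, sample_rate)
    for freq, bandwidth in formants:
        audio = resonator(audio, freq, bandwidth, sample_rate)
    return radiation(audio)


# Rough /a/-like formants as (frequency Hz, bandwidth Hz); illustrative only.
samples = synthesize_vowel(110, [(700, 80), (1200, 90), (2600, 120)], 0.25)
```

The result is a steady buzzy vowel and nothing more. A real engine would add a shaped glottal pulse, noise sources, and time-varying parameters.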
2) Control model (turning phonemes into trajectories)
You need to decide how parameters move over time:

How /a/ differs from /i/ in F1/F2

How consonants inject noise and shape transitions

Coarticulation: the “smearing” of neighboring sounds into each other

Rules for duration and transitions (and exceptions)
This is where “it works” becomes “it sounds like a person instead of a kazoo.”
AI helps, but you still need a design. AI can implement whichever model you pick (Klatt-style rules, gestural targets, diphones-with-formants, etc.).
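As a toy illustration of "phonemes into trajectories", the sketch below holds a formant target per phoneme and glides linearly into it from the previous target. The target values are rough textbook-style figures for /a/ and /i/, and the names, frame counts, and linear glide are all assumptions; a real control model has far more parameters and rules.

```python
# Rough, illustrative formant targets in Hz as (F1, F2); not tuned values.
TARGETS = {"a": (700.0, 1200.0), "i": (300.0, 2300.0)}


def formant_track(phonemes, frames_per_phoneme=10, transition_frames=4):
    """Return one (F1, F2) pair per frame for a phoneme sequence.

    Each phoneme after the first starts with a short linear glide from the
    previous target, a very crude stand-in for coarticulation.
    """
    track = []
    prev = None
    for ph in phonemes:
        target = TARGETS[ph]
        for frame in range(frames_per_phoneme):
            if prev is not None and frame < transition_frames:
                # Weight moves from just above 0 toward 1 across the glide.
                w = (frame + 1) / (transition_frames + 1)
                point = tuple(p + w * (t - p) for p, t in zip(prev, target))
            else:
                point = target
            track.append(point)
        prev = target
    return track


track = formant_track(["a", "i"])
```

Each frame's values would then drive the formant filters in the DSP engine. Swapping the linear glide for something smarter is exactly the kind of design decision this step is about.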
3) Text to phonemes (G2P)
For English you can ship a dictionary + rules.

normalization (numbers, dates, abbreviations)

tokenization

stress rules

phoneme mapping
5) Voice design + tuning
Even with a perfect engine, it’s easy to end up with “robotic but intelligible” rather than “pleasant.”
This is typically the biggest time sink because it’s:

parameter tables

hundreds of little exceptions

endless listening tests

DSP engine: days to a couple weeks

G2P + normalization: weeks

coarticulation + durations: weeks to months

prosody: weeks to months

tuning to ‘nice’: open-ended
@fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate
clv1 has moved @clv1@mastodon.social
5mo
@Tamasg @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate thanks for this overview. Indeed, we would need skilled developers, engineers and maybe linguists working full time on a project like this for a few months at least.
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
5mo
@clv1 @cachondo @Tamasg @jscholes @FreakyFwoof @pixelate @ZBennoui @amir And also UX researchers, probably. I can't articulate why Eloquence is better than DECtalk, for me. Neither, I bet, could Andre articulate what makes Orpheus better than Eloquence, for him. So getting something that makes the largest number of people as happy as possible is a classic UX research problem, probably involving massive surveys, rating and ranking of samples, and so on. I work with the kind of people qualified to do this, and it's a unique skill set in and of itself.