Note by @fastfinge

clv1 has moved @clv1@mastodon.social

5mo

We could, for example, attach an Eloquence audio sample and ask for a synth that sounds similar. If the AI couldn't build it from scratch, we could ask whether another synth could serve as the basis, for example eSpeak's Klatt variants. @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate @Tamasg

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@clv1 @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate @Tamasg AI coding is nowhere near advanced enough for this.

clv1 has moved @clv1@mastodon.social

5mo

Thank you to everyone who answered. Let's hope a solution is found soon. @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate @Tamasg

Tamas G @Tamasg@mindly.social

5mo

@clv1 so it's a lot.
1) Signal generation (the “voice box”)
This is the DSP engine: glottal source → filter(s) → radiation → output audio.
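A minimal sketch of that chain, assuming a plain impulse train as the glottal source, a cascade of Klatt-style second-order resonators as the filters, and a first difference as the radiation stage. The sample rate, the function names, and the formant values are illustrative assumptions, not taken from any existing synth.

```python
import math

SAMPLE_RATE = 16000  # Hz; assumed for this sketch


def glottal_source(f0_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Crude glottal source: one unit impulse per pitch period."""
    n = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, freq_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # gives unity gain at 0 Hz
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def radiation(signal):
    """Lip radiation approximated as a first difference."""
    out = []
    prev = 0.0
    for x in signal:
        out.append(x - prev)
        prev = x
    return out


def synthesize_vowel(formants, f0_hz=120.0, duration_s=0.3):
    """Source -> cascaded formant filters -> radiation -> peak-normalized audio."""
    audio = glottal_source(f0_hz, duration_s)
    for freq_hz, bandwidth_hz in formants:
        audio = resonator(audio, freq_hz, bandwidth_hz)
    audio = radiation(audio)
    peak = max(abs(s) for s in audio) or 1.0
    return [s / peak for s in audio]


# Approximate (frequency, bandwidth) pairs for an /a/-like vowel; illustrative only.
samples = synthesize_vowel([(700, 80), (1200, 90), (2600, 150)])
```

The result is a list of floats in the range -1 to 1 that could be written out with the standard `wave` module. A real engine would also need a better glottal pulse shape, a noise source for consonants, and time-varying filter parameters.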
2) Control model (turning phonemes into trajectories)
You need to decide how parameters move over time:
• How /a/ differs from /i/ in F1/F2
• How consonants inject noise and shape transitions
• Coarticulation: the “smearing” of neighboring sounds into each other
• Rules for duration and transitions (and exceptions)
This is where “it works” becomes “it sounds like a person instead of a kazoo.”
AI helps, but you still need a design. AI can implement whichever model you pick (Klatt-style rules, gestural targets, diphones-with-formants, etc.).
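As a rough illustration of the simplest kind of control model, here is a sketch that holds each vowel at a formant target and then glides linearly to the next one. The `TARGETS` table, the frame step, and the hold and glide durations are hypothetical values chosen for the example. Real rule systems use far richer tables and non-linear transitions.

```python
FRAME_MS = 10  # assumed control-frame step

# Hypothetical (F1, F2) targets in Hz; approximate values for illustration only.
TARGETS = {
    "a": (700.0, 1200.0),  # open vowel: high F1, lower F2
    "i": (300.0, 2300.0),  # close front vowel: low F1, high F2
}


def trajectory(phonemes, steady_ms=80, transition_ms=40):
    """Return one (F1, F2) tuple per frame: hold each target, then glide to the next."""
    steady_n = steady_ms // FRAME_MS
    glide_n = transition_ms // FRAME_MS
    frames = []
    for idx, phoneme in enumerate(phonemes):
        current = TARGETS[phoneme]
        frames.extend([current] * steady_n)
        if idx + 1 < len(phonemes):
            following = TARGETS[phonemes[idx + 1]]
            # Linear glide; the endpoints themselves belong to the steady parts.
            for k in range(1, glide_n + 1):
                w = k / (glide_n + 1)
                frames.append(tuple(c + w * (f - c) for c, f in zip(current, following)))
    return frames


frames = trajectory(["a", "i"])
```

Each frame would then set the resonator frequencies of the DSP engine for that slice of time. Coarticulation rules, consonant noise, and duration exceptions are exactly the parts this sketch leaves out.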
3) Text to phonemes (G2P)
For English you can ship a dictionary + rules.
• normalization (numbers, dates, abbreviations)
• tokenization
• stress rules
• phoneme mapping
5) Voice design + tuning
Even with a perfect engine, it’s easy to end up with “robotic but intelligible” rather than “pleasant.”
This is typically the biggest time sink because it’s:
• parameter tables
• hundreds of little exceptions
• endless listening tests
Rough time estimates:
• DSP engine: days to a couple weeks
• G2P + normalization: weeks
• coarticulation + durations: weeks to months
• prosody: weeks to months
• tuning to ‘nice’: open-ended
@fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate

clv1 has moved @clv1@mastodon.social

5mo

@Tamasg @fastfinge @jscholes @cachondo @FreakyFwoof @amir @ZBennoui @pixelate Thanks for this overview. Indeed, we would need skilled developers, engineers, and maybe linguists working full time on a project like this for at least a few months.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@clv1 @cachondo @Tamasg @jscholes @FreakyFwoof @pixelate @ZBennoui @amir And also UX researchers, probably. I can't articulate why Eloquence is better than DECtalk, for me. Neither, I bet, could Andre articulate what makes Orpheus better than Eloquence, for him. So getting something that makes the largest number of people as happy as possible is a classic UX research problem, probably involving massive surveys, rating and ranking of samples, and so on. I work with the kind of people qualified to do this, and it's a unique skill set in and of itself.
