New, experimental build of SpeechPlayer. I would like those who are brave to test it, for a new, more Eloquence-like sound. A shoutout goes to @fastfinge for helping to get this idea going and collaborating on the repository. eurpod.com/synths/nvSpeechPlayer-experimental.nvda-addon The big switch is that it no longer uses a sawtooth wave. Instead, it now uses an asymmetric cosine glottal-flow pulse (a pitch-synchronous "glottal pulse train"). So, glottal flow pulses, not continuous oscillator shapes like triangle/saw/square. This has allowed us to achieve a much smoother voice, with clearer consonants but the familiarity of the voice people know.
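For anyone wondering what an asymmetric cosine glottal-flow pulse looks like in practice, here is a rough sketch. It is not the add-on's actual code: it uses a Rosenberg-style pulse shape as a stand-in, and the names `glottal_pulse`, `pulse_train`, `open_quotient` and `speed_quotient`, along with their default values, are assumptions made for illustration only.

```python
import math

def glottal_pulse(phase, open_quotient=0.6, speed_quotient=2.0):
    """Flow value for a phase in [0, 1) of one pitch period (hypothetical helper).

    open_quotient: fraction of the period during which the glottis is open.
    speed_quotient: ratio of opening time to closing time; values above 1
    make the pulse asymmetric (slow rise, fast fall).
    """
    rise = open_quotient * speed_quotient / (1.0 + speed_quotient)
    fall = open_quotient - rise
    if phase < rise:
        # Opening phase: raised cosine climbing from 0 to 1.
        return 0.5 * (1.0 - math.cos(math.pi * phase / rise))
    if phase < open_quotient:
        # Closing phase: quarter cosine dropping from 1 to 0 (clamped at 0).
        return max(0.0, math.cos(0.5 * math.pi * (phase - rise) / fall))
    # Closed phase: no flow until the next period starts.
    return 0.0

def pulse_train(f0_hz, sample_rate, n_samples):
    """Pitch-synchronous train: one pulse per pitch period at f0_hz."""
    out = []
    phase = 0.0
    for _ in range(n_samples):
        out.append(glottal_pulse(phase))
        phase += f0_hz / sample_rate
        if phase >= 1.0:
            phase -= 1.0
    return out

# One 100 Hz period at 16 kHz, as a quick look at the shape.
one_period = pulse_train(100.0, 16000, 160)
```

Unlike a sawtooth, each period here has a smooth rise, a quicker fall, and a stretch of silence while the glottis is closed, which is roughly why the result can sound less buzzy.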
@ZBennoui@Tamasg Can you articulate what you dislike about it? I'm still not a fan the way I am of eloquence, but it's getting harder and harder for me to define why, so we can actually fix it.
@fastfinge@ZBennoui@Tamasg I wish I could say why, too. There's something ... forceful about the way it says everything, a harsher sound than eloquence. But my brain is just not coming up with the right technical terms.
@cachondo@fastfinge@ZBennoui@Tamasg I struggled to understand it. Relatively sibilant, bassy, nasal, and forceful as you said. "Folks" sounded more like "follks", and "here" was cut off.
@jscholes@cachondo@Tamasg@ZBennoui So some of the issues you're identifying have to do with the espeak phonemizer, and the way phonemes are tuned. We have a lot of work to do there, too. But I feel like if we can get the sound of the voice correct, that will be a lot easier. For example, compare speech player and eloquence saying "eeeeeeeeee". Side by side, it's clear something is still not right with speech player. It needs to be...rounder and brighter? And...those are not useful terms because I'm still struggling to define exactly what I mean by them haha.
@fastfinge@jscholes@cachondo@ZBennoui I think phonemizer yes, but also language-specific phonetic rules that I and others tune, like how we got the word "start" to no longer sound disjointed and like "st-ah-rt" as it was in earlier builds. Does nobody seriously give me any credit for improvements, and people only want to complain about how it's drifting and sounding shittier? Honestly I get more of that feedback and each time it makes me want to just give up on this entire thing fully. If you really hate it, then fix it yourself, don't put that onus all on me. Going away for the rest of the day. I'm super sad.
@Tamasg@fastfinge@cachondo@ZBennoui The thread specifically asked people to try and articulate how they felt about the voice, so I did. You are publicly burning yourself out on this and other projects in a way that constantly makes me want to tell you to get some sleep and look after yourself, so maybe temporarily stepping away would be a good thing.
@jscholes@cachondo@Tamasg@ZBennoui The other thing that makes this super, super hard is that there are like nine different systems, and all of them need tuning. And it's impossible to ask people who haven't spent pretty much four days straight thinking exclusively about this for feedback on a particular system, because they all work together to make up the voice, and you can't know where any given issue comes from. There's the rules for going from text to IPA phonemes. Then the rules for determining the way IPA phonemes are actually voiced and fit together. And then there's the intonation table. And then there's the two systems that actually make the sound. Right now I'm mostly looking at the system that actually makes the sound, IE when you do "aaaaaaaaaaaaa" or "eeeeeeeeeeeeeeeee", because that's still not right. But because it's an entire voice, it's even hard for me to separate my own perceptions and fix anything.
The important thing to remember is that Eloquence began development in 1982, by a team of about a dozen researchers. It wasn't in the state we know it until around 2002. We have existing research to build on, but no funding and fewer people, and no PhD-level speech researchers. So actually doing this, even with the help of AI, is a 20-30 year project before we get close to Eloquence levels. Because we have something that "works" and improves step by step, it's easy to lose sight of the size of the problem we're taking on, because it feels like we should be able to get there in a month or two. But that's not realistic.
@fastfinge@jscholes@cachondo@ZBennoui I will say this though. If people had a chance to try that huge 700 MB version you built with Gruut as the phonemizer, I think they'd like some of the sounds a bit better. Not perfect by any means, not at all, but like, I can also see that the eSpeak phonemizer does not do a good job at inserting things like stress marks in the IPA before vowels, causing either disjointed or too rapid speech in vowel-to-vowel transitions. So then we're adding a dozen vowel rules to "fight" eSpeak's phonemizer system. And that's not ideal either, because while we can tweak it, and as much as it's multilingual, it's not sophisticated enough, and even if my frontend has rules for things like those stress marks or word-joined markers, if the output itself isn't clean and detailed, we lose good continuity between glides, or in vowel-to-consonant transitions a bit. They're so, so critical, and part of my frustration is this fork in the road feeling about all that, and in that sense the phoneme tuning is the easier part in some ways than the linguistics of it all. The blame isn't 100% eSpeak, but it's not 100% phoneme tuning either.
@Tamasg@cachondo@jscholes@ZBennoui Yup. This all comes back to that problem of so many systems all needing tuning. I do think it would really help us to just focus on a single one. IE get this voice sounding correct with pure "aaaaaaaa" and "eeeeee" tones. No words, no pitch or intonation, nothing. Then hook that up to the rest of the systems. Then see where we land and tackle the next thing. Because thinking about the phonemizer and the IPA rules and the intonation system all at once is burning us out and distracting us from finishing the thing we have the most control over: IE the klatt synthesizer and the wave generator. Those are ours, entirely and completely. The other things are not, so those problems are harder.
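To make the "pure vowel tones, nothing else" idea concrete, here is a minimal sketch of a sustained vowel built from a bare impulse train fed through a cascade of Klatt-style second-order resonators. It is not SpeechPlayer's code. The formant frequencies are the commonly cited Peterson and Barney averages for adult male /a/ and /i/, the bandwidths are rough guesses, and the names `resonator_coeffs`, `resonate`, `VOWELS` and `sustained_vowel` are made up for the example.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (Klatt-style)."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalises the gain at 0 Hz to 1
    return a, b, c

def resonate(signal, freq_hz, bw_hz, sample_rate):
    """Run a signal through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

# (formant frequency in Hz, bandwidth in Hz); bandwidths are assumptions.
VOWELS = {
    "a": [(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)],
    "i": [(270.0, 60.0), (2290.0, 100.0), (3010.0, 200.0)],
}

def sustained_vowel(vowel, f0_hz=110.0, sample_rate=16000, seconds=0.5):
    """Flat-pitch vowel: impulse train through a cascade of formants."""
    n = int(sample_rate * seconds)
    period = int(round(sample_rate / f0_hz))
    # Simplest possible voice source: one impulse per pitch period.
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bw in VOWELS[vowel]:
        signal = resonate(signal, freq, bw, sample_rate)
    return signal

aaa = sustained_vowel("a")
eee = sustained_vowel("i")
```

With no phonemizer, no IPA rules and no intonation in the loop, whatever still sounds wrong in a tone like this has to come from the source shape or the resonators, which is the point of isolating them.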
@fastfinge@Tamasg@cachondo@ZBennoui This is all fair and, to me at least, interesting insight. But part of embarking upon such a huge project is dealing with feedback from people on what is ultimately seen as a user-facing, homogeneous blob.
Many such people won't understand or, if we're being honest, care about how the sausage is being made. Particularly if the changes between builds are too subtle for most of them to perceive.
Regardless, I'm glad this is being worked on and deeply appreciative of the effort.
@jscholes@cachondo@Tamasg@ZBennoui Yup, agreed. And also, people have different preferences. So we can and do get contradictory feedback. Sometimes even from the same person LOL. On top of it all, most people don't have the vocabulary to talk about this stuff. Heck, I don't even have it; I'm not sure the words exist in English. Brighter? Rounder? What do I mean! Do I mean the same thing that you mean? Impossible to tell.