New, experimental build of SpeechPlayer. I would like those who are brave to test it, for a new, more Eloquence-like sound. Shoutout goes to @fastfinge for helping to get this idea going and collaborating on the repository. eurpod.com/synths/nvSpeechPlayer-experimental.nvda-addon The big switch is that the voice source is no longer a sawtooth wave. Instead, it now uses an asymmetric cosine glottal-flow pulse (a pitch-synchronous "glottal pulse train"). So, glottal flow pulses, not continuous oscillator shapes like triangle/saw/square. This has allowed us to achieve a much smoother voice, with clearer consonants but the familiarity of the voice people know.
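For anyone curious what the difference looks like in numbers, here is a rough Python sketch of an asymmetric cosine pulse train next to a sawtooth. It is not the add-on's actual code: the function names, the `open_quotient` and `rise_fraction` parameters, and their default values are all assumptions made for illustration.

```python
import math

def glottal_pulse_train(f0=110.0, sample_rate=16000, duration=0.05,
                        open_quotient=0.6, rise_fraction=0.67):
    """Pitch-synchronous asymmetric cosine pulses (illustrative parameters)."""
    period = sample_rate / f0              # samples per pitch period
    open_len = open_quotient * period      # part of the period with airflow
    rise_len = rise_fraction * open_len    # slower opening phase
    fall_len = open_len - rise_len         # quicker closing phase
    out = []
    for n in range(round(duration * sample_rate)):
        t = n % period                     # position inside the current period
        if t < rise_len:
            out.append(0.5 * (1.0 - math.cos(math.pi * t / rise_len)))
        elif t < open_len:
            out.append(math.cos(0.5 * math.pi * (t - rise_len) / fall_len))
        else:
            out.append(0.0)                # closed phase: no airflow
    return out

def sawtooth(f0=110.0, sample_rate=16000, duration=0.05):
    """Plain sawtooth in the range -1..1, for comparison."""
    return [2.0 * ((n * f0 / sample_rate) % 1.0) - 1.0
            for n in range(round(duration * sample_rate))]

def largest_step(samples):
    """Biggest jump between two neighbouring samples."""
    return max(abs(b - a) for a, b in zip(samples, samples[1:]))

pulse = glottal_pulse_train()
saw = sawtooth()
print(f"largest jump, pulse train: {largest_step(pulse):.3f}")
print(f"largest jump, sawtooth:    {largest_step(saw):.3f}")
```

The sawtooth snaps from its peak back to its floor once per period, while the pulse rises and falls along cosine segments and then rests at zero. That hard snap is a plausible source of the extra harshness in the older builds, though that is a guess and not something measured here.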
@ZBennoui@Tamasg Can you articulate what you dislike about it? I'm still not a fan the way I am of eloquence, but it's getting harder and harder for me to define why, so we can actually fix it.
@fastfinge@ZBennoui@Tamasg I wish I could say why, too. There's something ... forceful about the way it says everything, a harsher sound than eloquence. But my brain is just not coming up with the right technical terms.
@cachondo@fastfinge@ZBennoui yup. I think the Sawtooth version was harsher by a lot, but this is more on the actual "voice personality" that it expresses than like, the final shape of the wave. Now it really does sound a bit like Eloquence and DECTalk had a child, ha. But yeah, there's something in the voicing that's missing which could potentially raise the bass floor a bit from being as low, still keep the crisp-ends on consonants, but introduce a bit more smoothness. Funny to me that on the old version, I liked the Adam voice, on the new experimental, Benjamin to me does sound a lot closer to what I'd like it to be.
@cachondo@fastfinge@ZBennoui@Tamasg I struggled to understand it. Relatively sibilant, bassy, nasal, and forceful as you said. "Folks" sounded more like "follks", and "here" was cut off.
@jscholes@cachondo@Tamasg@ZBennoui So some of the issues you're identifying have to do with the espeak phonemizer, and the way phonemes are tuned. We have a lot of work to do there, too. But I feel like if we can get the sound of the voice correct, that will be a lot easier. For example, compare speech player and eloquence saying "eeeeeeeeee". Side by side, it's clear something is still not right with speech player. It needs to be...rounder and brighter? And...those are not useful terms because I'm still struggling to define exactly what I mean by them haha.
@fastfinge@jscholes@cachondo@ZBennoui I think phonemizer yes, but also language-specific phonetic rules that I and others tune, like how we got the word "start" to no longer sound disjointed and like "st-ah-rt" as it was in earlier builds. Does nobody seriously give me any credit for improvements, and people only want to complain about how it's drifting and sounding shittier? Honestly I get more of that feedback and each time it makes me want to just give up on this entire thing fully. If you really hate it, then fix it yourself, don't put that onus all on me. Going away for the rest of the day. I'm super sad.
@Tamasg@cachondo@jscholes@ZBennoui Aww. This is a super hard problem. Especially because nobody can articulate what they want, we all just know when it's wrong.
@fastfinge@Tamasg@cachondo@jscholes@ZBennoui Honestly it's hard to understand. It's like someone talking with something between their teeth and in their throat. Both at once. I don't know how else to explain it, though.
@MariahL@cachondo@Tamasg@jscholes@ZBennoui I hear where you're coming from. It needs to be...rounder or wider or more relaxed or something. But I'm at the point where every change I make causes it to sound worse or introduces strange new issues.
@Tamasg@fastfinge@cachondo@ZBennoui The thread specifically asked people to try and articulate how they felt about the voice, so I did. You are publicly burning yourself out on this and other projects in a way that constantly makes me want to tell you to get some sleep and look after yourself, so maybe temporarily stepping away would be a good thing.
@jscholes@cachondo@Tamasg@ZBennoui The other thing that makes this super, super hard is that there are like nine different systems, and all of them need tuning. And it's impossible to ask people who haven't spent pretty much four days straight thinking exclusively about this for feedback on a particular system, because they all work together to make up the voice, and you can't know where any given issue comes from. There's the rules for going from text to IPA phonemes. Then the rules for determining the way IPA phonemes are actually voiced and fit together. And then there's the intonation table. And then there's the two systems that actually make the sound. Right now I'm mostly looking at the system that actually makes the sound, IE when you do "aaaaaaaaaaaaa" or "eeeeeeeeeeeeeeeee", because that's still not right. But because it's an entire voice, it's even hard for me to separate my own perceptions and fix anything.
The important thing to remember is that Eloquence began development in 1982, by a team of about a dozen researchers. It wasn't in the state we know it until around 2002. We have existing research to build on, but no funding and fewer people, and no PhD-level speech researchers. So actually doing this, even with the help of AI, is a 20-30 year project before we get close to Eloquence levels. Because we have something that "works" and improves step by step, it's easy to lose sight of the size of the problem we're taking on, because it feels like we should be able to get there in a month or two. But that's not realistic.
@cachondo@Tamasg@jscholes@ZBennoui We also lack tools as blind people. For example, I wish I could visually examine the shape and spectrogram of sounds. A useful step here would be to get eloquence to generate a pure, single-note open "aaa" or "eee" tone, and understand the shape of the resulting sound wave. But there are just no accessible tools to do that. So instead we have to go by listening and guess work.
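One partial workaround is to turn the spectrum into plain text that a screen reader can read. The sketch below measures the level of each harmonic of a steady tone and prints one line per harmonic. It is only an outline under assumptions: `harmonic_levels_db` is a made-up helper, the pitch `f0` has to be known in advance, and the demo tone is synthetic because exporting a steady "aaa" from Eloquence to a file is a separate step not covered here.

```python
import cmath
import math

def harmonic_levels_db(samples, sample_rate, f0, n_harmonics=10):
    """Level of each harmonic of f0, in dB relative to the strongest one."""
    mags = []
    for k in range(1, n_harmonics + 1):
        # Project the signal onto a complex sinusoid at k * f0 (one DFT bin).
        w = -2j * math.pi * k * f0 / sample_rate
        total = sum(x * cmath.exp(w * n) for n, x in enumerate(samples))
        mags.append(abs(total))
    ref = max(max(mags), 1e-12)            # avoid dividing by zero on silence
    return [20.0 * math.log10(max(m, 1e-12) / ref) for m in mags]

# Synthetic stand-in for a recorded steady vowel: a 100 Hz tone plus a
# second harmonic at half the amplitude, exactly 10 periods long.
sample_rate, f0 = 16000, 100.0
tone = [math.sin(2 * math.pi * f0 * n / sample_rate)
        + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / sample_rate)
        for n in range(1600)]

levels = harmonic_levels_db(tone, sample_rate, f0, n_harmonics=3)
for k, level in enumerate(levels, start=1):
    print(f"harmonic {k} ({k * f0:.0f} Hz): {level:.1f} dB")
```

Running the same function on a steady vowel from two synths would give two lists of numbers to compare line by line, which might make words like "brighter" or "rounder" a bit more concrete. The measurement is most trustworthy when the analysed stretch covers a whole number of pitch periods.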
@fastfinge@jscholes@cachondo@ZBennoui I will say this though. If people had a chance to try that huge 700 MB version you built with Gruut as the phonemizer, I think they'd like some of the sounds a bit better. Not perfect by any means, not at all, but like, I can also see that the eSpeak phonemizer does not do a good job at inserting things like stress marks in the IPA before vowels, causing either disjointed or too rapid speech in vowel-to-vowel transitions. So then we're adding a dozen vowel rules to "fight" eSpeak's phonemizer system. And that's not ideal either, because while we can tweak it, and as much as it's multilingual, it's not sophisticated enough, and even if my frontend has rules for things like those stress marks or word-joined markers, if the output itself isn't clean and detailed, we lose good continuity between glides, or in vowel-to-consonant transitions a bit. They're so, so critical, and part of my frustration is this fork in the road feeling about all that, and in that sense the phoneme tuning is the easier part in some ways than the linguistics of it all. The blame isn't 100% eSpeak, but it's not 100% phoneme tuning either.
@Tamasg@cachondo@jscholes@ZBennoui Yup. This all comes back to that problem of so many systems all needing tuning. I do think it would really help us to just focus on a single one. IE get this voice sounding correct with pure "aaaaaaaa" and "eeeeee" tones. No words, no pitch or intonation, nothing. Then hook that up to the rest of the systems. Then see where we land and tackle the next thing. Because thinking about the phonemizer and the IPA rules and the intonation system all at once is burning us out and distracting us from finishing the thing we have the most control over: IE the klatt synthesizer and the wave generator. Those are ours, entirely and completely. The other things are not, so those problems are harder.
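To make "pure tones only" concrete, here is a minimal sketch of that isolated stage: a pulse source fed through a cascade of second-order resonators, using the resonator form commonly attributed to Klatt's cascade/parallel formant synthesizer. It is not SpeechPlayer's code. The formant frequencies are approximate textbook averages for an adult male voice, the bandwidths are guesses, and the impulse train is a simple stand-in for the real voice source.

```python
import math

# (frequency Hz, bandwidth Hz) per formant; illustrative values only.
VOWELS = {
    "aa": [(730, 60), (1090, 90), (2440, 150)],
    "ee": [(270, 60), (2290, 90), (3010, 150)],
}

def resonator(signal, freq, bandwidth, sample_rate):
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c                        # gives a gain of 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0, sample_rate, duration):
    """One impulse per pitch period; a stand-in for a glottal source."""
    out, phase = [], 1.0
    for _ in range(round(duration * sample_rate)):
        phase += f0 / sample_rate
        if phase >= 1.0:
            phase -= 1.0
            out.append(1.0)
        else:
            out.append(0.0)
    return out

def steady_vowel(name, f0=110.0, sample_rate=16000, duration=0.5):
    """Flat pitch, fixed formants: no words, no intonation."""
    signal = impulse_train(f0, sample_rate, duration)
    for freq, bandwidth in VOWELS[name]:
        signal = resonator(signal, freq, bandwidth, sample_rate)
    return signal

aa = steady_vowel("aa")
ee = steady_vowel("ee")
print(len(aa), len(ee))
```

With everything else switched off like this, any change to the source or the resonators can be judged on one sustained sound at a time, before the phonemizer and intonation layers get a chance to muddy the result.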
@fastfinge@Tamasg@cachondo@ZBennoui This is all fair and, to me at least, interesting insight. But part of embarking upon such a huge project is dealing with feedback from people on what is ultimately seen as a user-facing, homogeneous blob.
Many such people won't understand or, if we're being honest, care about how the sausage is being made. Particularly if the changes between builds are too subtle for most of them to perceive.
Regardless, I'm glad this is being worked on and deeply appreciative of the effort.
@jscholes@cachondo@Tamasg@ZBennoui Yup, agreed. And also, people have different preferences. So we can and do get contradictory feedback. Sometimes even from the same person LOL. On top of it all, most people don't have the vocabulary to talk about this stuff. Heck, I don't even have it; I'm not sure the words exist in English. Brighter? Rounder? What do I mean! Do I mean the same thing that you mean? Impossible to tell.
@jscholes@fastfinge@cachondo@ZBennoui oh thanks. And I truly am sorry if I was a bit too harsh reacting to your feedback, that was never my intention. I've tuned words like "folks" and "hear" to sound way better now in the latest pack (it's only in the repo for now.) But yeah, that's the kind of feedback I ultimately do need because without it I have no baseline to work on. If I let AI tune my phonemes without direction it could easily mess them up, and I need that "other human ears" type experience to either tune it myself by hand or ask AI to create formulas that approximate the sound and waveform for me, so we tune it through that based on what feedback people give. Without a spectrogram that lets us do that visual-based tuning by hand and gives us baselines of other voices' waveforms, it's the most efficient way we can work.
@Tamasg@cachondo@jscholes@ZBennoui It really feels like AI should be able to help us here. But I'm not sure how. Some kind of system that takes a waveform and finds the closest approximation it can get by modifying our parameters.
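This does not strictly need AI; a plain parameter search already has that shape. Below is a toy Python sketch, assuming a two-parameter pulse model (`open_quotient` and `rise_fraction`, both made-up names) and a brute-force grid search that keeps whichever setting gives the smallest squared error against a target waveform. The target here is generated by the same model, so the search can recover it exactly; a real recording would not match this cleanly.

```python
import math

def pulse_period(open_quotient, rise_fraction, n=200):
    """One period of an asymmetric cosine pulse, n samples long."""
    open_len = open_quotient * n
    rise_len = rise_fraction * open_len
    fall_len = open_len - rise_len
    out = []
    for t in range(n):
        if t < rise_len:
            out.append(0.5 * (1.0 - math.cos(math.pi * t / rise_len)))
        elif t < open_len:
            out.append(math.cos(0.5 * math.pi * (t - rise_len) / fall_len))
        else:
            out.append(0.0)
    return out

def squared_error(a, b):
    """Sum of squared sample differences between two waveforms."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(target, open_grid, rise_grid):
    """Try every grid combination; return (error, open_quotient, rise_fraction)."""
    best = None
    for oq in open_grid:
        for rf in rise_grid:
            err = squared_error(pulse_period(oq, rf, len(target)), target)
            if best is None or err < best[0]:
                best = (err, oq, rf)
    return best

# Stand-in "reference voice": a pulse with settings the search does not know.
target = pulse_period(0.7, 0.6)
err, oq, rf = fit(target, [0.4, 0.5, 0.6, 0.7, 0.8], [0.5, 0.6, 0.7, 0.8])
print(f"best open_quotient={oq}, rise_fraction={rf}, error={err:.6f}")
```

With a real recording as the target, comparing spectra would probably be more robust than comparing raw samples, since phase and timing will never line up exactly. The number of parameters in the actual synth would also make a full grid impractical, so some smarter optimizer would have to replace the nested loops.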
@fastfinge@Tamasg@cachondo@jscholes@ZBennoui In any case, this is one more opportunity for us to build an Eloquence alternative. This will be vital not only on Windows systems, but on Linux and others in the future. Hence, I hope you don't give up. I won't give up working with Tamas to improve pt-br, my language.
@clv1@cachondo@Tamasg@jscholes@ZBennoui While the code is all cross-platform, and there's no reason it shouldn't work on Linux and mac, we're best to stick with Windows and NVDA exclusively until we get something we all love.
@fastfinge@cachondo@Tamasg@jscholes@ZBennoui As for disparaging comments, I recall similar episodes when I was learning to use the Internet, some 25 years ago, so they have always existed. It's up to us to be psychologically prepared. I would suggest we focus only on the people we are really working for. For example: if someone doesn't like our work, then it means that's not the person we are working for.
@clv1 It really does not mean that. If someone doesn't like your work, it often just means you haven't yet landed on a form they find pleasing or useful.
And that might be fine if, say, you're building a command line app and the only thing a given user would find acceptable is a graphical one. If you have no plans to build a graphical one, they're shit out of luck and you have to move on.
Other times, though, it does point to a real problem in the thing you're building that you should legitimately keep trying to solve. Dismissing everybody who expresses negativity outright only leads to an echo chamber.
@cachondo@ZBennoui sadly today my brain and ears both just feel too fried for more speech-player tuning. Ugh. Now I get that feeling, @fastfinge had it yesterday, today I'm really feeling that too. Haha. When your ear has to hear 30 different build versions of the same DLL with sometimes the shittiest of compression and extreme clipping as you tune knobs, it really does grate on your hearing, ha. Never would have thought that kind of thing can only be done in small chunks too, at one point I had actual ringing in my ear for like 5 minutes after testing a bad combo of sound.
@Tamasg@cachondo@ZBennoui Yup. And I find I have to swap back to eloquence frequently, or else I lose my way completely, and everything starts to sound fine to me.
@cachondo@fastfinge@Tamasg Yeah that's kinda what I'm hearing as well. It reminds me a bit of Keynote, which is a synth I really dislike. There's just something about the way Eloquence reads that I really like, and I feel like it's not easily replicated, but I'm not sure why.
@cachondo@fastfinge@Tamasg Then again, Alex on Mac OS has been my constant companion for over 10 years now, and Eloquence is really the only formant synth I can listen to for long periods of time without getting a headache.
@fastfinge@ZBennoui@Tamasg to me there's a bounciness to the speech, where Eloquence has a smooth reading. But it could always just be my perception and hearing.