Note by @fastfinge

Tamas G @Tamasg@mindly.social

5mo

I don't know. My partner (Jess) asked if I'm building this for other people or myself. I feel like I'm building it for other people but I really should be happy with it for myself at least. My only goal was to get this thing working in modern NVDA. Then US English led us to other languages, people asked if they can have theirs, and then it turned into this big sad project. But maybe I shouldn't feel so sad over it. For myself, it sounds nearly there. Yes, some words are off, and some things still stick out a bit sharp, like the word "words", ironically. But I can understand and use it, and probably 80% of the time I'm using it instead of Eloquence. In that way, mission accomplished, and we have a big robust frontend to tune, so I probably should feel less sad about it.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@Tamasg The key is to realize this can't be only your project. Think of it like you're founding an organization that needs to persist over the years. You're already doing that work, by documenting everything really well, and giving lots of people other features they can use. As well as creating tools. But your goal should be getting it into a state you, personally, like, and then in moving towards having other people in charge of different things. So all you do is final tests and sign off on releases.

Tamas G @Tamasg@mindly.social

5mo

@fastfinge I just made a massive tools update: formant_trajectory.py and the frame inspector now use lang_pack.py and a simple_yaml.py to parse the language files in a less strict way. So now it's really solid on tooling. People can run tests against them, and build languages more easily. I think you're right, hopefully it can get to a point where I'm sitting back and accepting PRs from people tuning phonemes, carefully weighing bigger changes with the community, and acting as a liaison for improving it. But this is so, so far away from that, although interest is definitely picking up, and the more I can simplify the tools and offer them in different ways, the more I'm hoping that flexibility will make it shine. Whether you use the phoneme editor or the frame inspector / trajectory tool, now you really have a way to dig into the rules.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@Tamasg So one thing to think about: If people change settings in the NVDA addon, then you release an update, it looks like their settings aren't always updated. And it's really, really easy to break subtle things. If I were you I might consider adding a reset to defaults button in the addon. Because otherwise you're going to get feedback from people who toggled a checkbox like co-articulation without thinking about any of the associated settings and now wonder why everything sounds bad.

Tamas G @Tamasg@mindly.social

5mo

@fastfinge I think we can. If we shipped a .defaults folder with the untouched language files, then the person just hits "reset to defaults" in the NV Speech Player panel, and boom. We just copy over the files from .defaults and they have unchanged settings. But sadly it can't live in the voice panel, because you cannot put buttons there. So it has to live in the NV Speech Player settings area near the top.
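
A minimal sketch of the reset idea described above, assuming the untouched language files ship in a `.defaults` subfolder next to the live ones. The function name `reset_to_defaults` and the folder layout are hypothetical illustrations, not the add-on's actual code:

```python
import shutil
from pathlib import Path
from typing import List


def reset_to_defaults(lang_dir: Path, defaults_dirname: str = ".defaults") -> List[str]:
    """Copy every file under lang_dir/.defaults back over lang_dir.

    Hypothetical helper: the real add-on may lay out its files differently.
    Returns the relative paths that were restored.
    """
    defaults_dir = lang_dir / defaults_dirname
    if not defaults_dir.is_dir():
        raise FileNotFoundError(f"No defaults folder at {defaults_dir}")

    restored = []
    for src in sorted(defaults_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(defaults_dir)
        dest = lang_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # overwrite the user's edited copy
        restored.append(rel.as_posix())
    return restored
```

Files the user added that have no counterpart in `.defaults` are left alone in this sketch; only shipped files get overwritten.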

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@Tamasg Something about the new release sounds really good! I can get up to 80 without having issues understanding. Before it was around 65.

Tamas G @Tamasg@mindly.social

5mo

@fastfinge haha maybe some of the phoneme tuning with the formant trajectory tool? Here's what's crazy. You can give an AI those tools, and if it has a container environment like Claude / OpenAI do, it will happily execute them and generate the spectrogram PNG, do the math on the frequency variation without losing the sound's shape, etc. But you have to do it in clusters or groups, particularly with sibilant phonemes. It's hard, hard work, because if you just tell an AI, "here's the phonemes, tune them", it'll make a big lispy mess out of it. I tried that too. But if you give it targeted instructions on what about the sound is off, here's your 4 tools, your YAML parser, get it done, yes, it'll happily tune with specifics like that.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg So weirdly, I'm using this voice pretty much exclusively at work. It's perfect for reading emails and generating reports and proofreading my writing and stuff. But for home use, I'm still not finding it a good fit for reading fanfic or ebooks or articles. It's...not relaxing? I can't identify if it's just Eloquence is what I'm used to, or if it's something about the voice.

Pratik Patel @ppatel@mstdn.social

4mo

@fastfinge @Tamasg There's something about these voices that sound similar to ESpeak that gives me a headache if I listen to them for more than half an hour. I can notice the fatigue every time.

Tamas G @Tamasg@mindly.social

4mo

@ppatel @fastfinge so what do y'all think. Give up on the project until we know a better phonemizer path (or multiple, if we needed that for multilingual coverage, but that gets dicey and bloated fast)? Haha. Phoneme table tuning works up to a point, and rules will help make words like "you" on their own not be cut off as fast, sure. But I don't think it will change how words get stressed and where, and besides the "classic" pitch mode that tries to really override intonation, we can't create a different shape for how things are pronounced without creating a million word-specific rules for words Espeak might sound odd on (which En-us.YAML already has a lot of, it would just need even more, sadly). Thing is though, all those rules might no longer apply when you switch phonemizers, so I really am stuck on progress until this is unblocked.

Pratik Patel @ppatel@mstdn.social

4mo

@Tamasg @fastfinge You can only take it so far. With your current set of tools, there's enough to make improvements to how ESpeak operates. I've been watching to see if your project would come close enough to Eloquence, hoping that we wouldn't have to write a completely new speech engine. I was considering starting one from scratch. I'll have more time starting next month. With good community input and contributions, I think we can have something in a year or so.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@ppatel @Tamasg Good luck with that. The more I dig in, the more complex it all gets. But you could make a big difference if you focused on the phonemizer (i.e. going from text to IPA). Then we wouldn't depend on espeak at all anymore, and it would effectively be a speech engine from scratch. The phoneme editor includes support for third party phonemizers already, so you can test easily.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge @ppatel I think it would definitely be interesting to explore non-Klatt implementations, like Diphone synthesis, although the fact that Eloquence did it with Klatt is why I kind of stuck with improving Speech Player. I've considered extending frames, etc., but then realized that the 47 we have are probably the good core set, and adding more resonators (cascade and parallel) wouldn't really improve clarity because we have all the fundamental ones needed for a proper Klatt model already. It's just a matter of (A) continuing to tune phonemes and (B) allowing for DSP tweaks like the new Tilt one that change voicing shape, etc. So no breaking the frame struct unless there's absolutely something that can't be represented in phoneme data or the DSP layer - it would have to have a clear distinction in architecture to be added.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg @ppatel I've never heard Diphone synthesis that sounds good. That's pretty much what Festival and Flite do.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge @ppatel OK y'all, new branch: voicingtone_struct_break - right now I warn you, it makes things more clicky on words like notification and thirty. But this will try to reshape filters and resonators based on the Qlatt repo. For now it adds a new tuning knob: t.noiseGlottalModDepth, which enables Qlatt-like noise AM (voiced only) and ranges from 0 to 1. Also a DSP version check function, so breaking the struct won't cause weirdness with mismatched files. This will probably be more major work as DSP improvements are added, and it won't get merged until it's not clicky for sure.
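
To illustrate why a version check matters once the frame struct changes, here is a rough sketch of the idea. The constant, exception, and function names are hypothetical and the version number is made up; the branch's real check may look quite different:

```python
# Hypothetical: the struct layout version this driver was written against.
EXPECTED_DSP_VERSION = 2


class DspVersionMismatch(RuntimeError):
    """Raised when the driver and the DSP library disagree on struct layout."""


def check_dsp_version(reported: int, expected: int = EXPECTED_DSP_VERSION) -> int:
    """Refuse to continue if the loaded DSP library reports another version.

    Passing frames with the wrong struct layout would silently scramble
    parameters, so failing loudly up front is the safer choice.
    """
    if reported != expected:
        raise DspVersionMismatch(
            f"DSP library reports version {reported}, driver expects {expected}; "
            "replace both files from the same build."
        )
    return reported
```

In this sketch the driver would call `check_dsp_version` once, right after loading the library and before sending any frames.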

Tamas G @Tamasg@mindly.social

4mo

@fastfinge @ppatel it does definitely sound different in the new branch. no clicks. added 3 new params:
• Noise glottal modulation (0-100, default 0) → maps to 0.0-1.0
• Pitch-sync F1 delta (0-100, default 50) → maps to 0-120 Hz (so 50 = 60 Hz)
• Pitch-sync B1 delta (0-100, default 50) → maps to 0-100 Hz (so 50 = 50 Hz)
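
The slider mappings listed above are all plain linear scalings. A small sketch, where the helper name `slider_to_value` is an assumption for illustration and not taken from the driver:

```python
def slider_to_value(slider: int, lo: float, hi: float) -> float:
    """Linearly map a 0-100 slider position onto the range [lo, hi]."""
    slider = max(0, min(100, slider))  # clamp out-of-range input
    return lo + (hi - lo) * slider / 100.0


# Defaults from the list above
noise_mod_depth = slider_to_value(0, 0.0, 1.0)   # 0.0
f1_delta_hz = slider_to_value(50, 0.0, 120.0)    # 60.0 Hz
b1_delta_hz = slider_to_value(50, 0.0, 100.0)    # 50.0 Hz
```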

Pratik Patel @ppatel@mstdn.social

4mo

@Tamasg @fastfinge Definitely sounds different in this new branch.

Tamas G @Tamasg@mindly.social

4mo

@ppatel @fastfinge it's nicer at keeping the voice the same at higher sample rates, so no more "woah, the voice is so different at 11025!" surprise feeling, and there's a fullness to it I can't quite explain. Hmm. But the new sliders in the NVDA driver (needs replacing of speechplayer.py / __init__.py) allow you to mess with the pitch-sync F1 delta and B1 delta values. And it makes a drastic diff when you twiddle with them at opposing levels, I'm fascinated.