Note by @fastfinge

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

Tagging @Tamasg odorediamanka600-source/FYLs-G2P: A lightweight hybrid G2P engine with less than 1.8M parameters and can be deployed on any devices (almost) github.com/odorediamanka600-source/FYLs-G2P

Tamas G @Tamasg@mindly.social

4mo

@fastfinge The homograph handling is the killer feature — "present" as noun vs verb, "desert" vs "desert", "lead" the metal vs "lead" the verb — it gets these right via POS tagging. eSpeak can't do that.

Alex Hall @alexhall@mastodon.social

4mo

@Tamasg @fastfinge I don't know if this is quite what you mean, but eSpeak can handle "present", at least. Consider:

He gave me a present. I wanted to present it to my family.

When I read the above two sentences together, such as with read line or say all, the two instances of "present" are spoken correctly.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge oooh this looks really neat. Only English - other languages would still need eSpeak or another solution. Also, no sentence-level prosody - FYLs-G2P doesn't output intonation/prosody markers. Hmm.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg Yeah, not quite what we need. But a step in the right direction...

Tamas G @Tamasg@mindly.social

4mo

@fastfinge also bundling Onnxruntime (but maybe easier as a DLL module, not full Python bloat?) for older NVDA. This would be smaller than your Gruut experiment: I'm looking at about 11 MB for the lightweight model, then another 15-20 MB for the onnxruntime DLLs, then another 20 MB for numpy, which it also requires. So in total the add-on would become 80 MB.
We could do a mechanism where we use this G2P for English only but other G2Ps or Espeak for foreign languages.
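
A minimal sketch of that routing idea, for illustration only. The two backend functions are hypothetical placeholders (they are not the FYLs-G2P or eSpeak APIs); only the dispatch on the language tag is the point.

```python
# Hypothetical backends: each takes text and returns an IPA string.
def neural_g2p_english(text: str) -> str:
    # Placeholder for a call into an English-only neural G2P.
    return f"<neural-ipa:{text}>"


def espeak_g2p(text: str, lang: str) -> str:
    # Placeholder for a call into eSpeak for the given language.
    return f"<espeak-{lang}-ipa:{text}>"


def phonemize(text: str, lang: str) -> str:
    """Route English to the neural G2P and everything else to eSpeak."""
    # Reduce tags such as "en-US" or "en_GB" to the base language "en".
    base = lang.lower().replace("_", "-").split("-")[0]
    if base == "en":
        return neural_g2p_english(text)
    return espeak_g2p(text, lang)
```

With this shape, adding another language-specific G2P later would only mean adding another branch (or a lookup table) in `phonemize`.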

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg The lack of prosody markers is a blocker, though.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge nah. You're overestimating what eSpeak actually contributes to prosody in my pipeline. The driver detects punctuation itself (lines 1186-1199 of init.py) and passes clauseType to the frontend. Then our prosody.cpp pass handles pitch contours based on position in utterance, stress marks, and clause type. All of that machinery runs on the IPA + clauseType after G2P. eSpeak has zero input into it.
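
For readers who have not seen the driver, here is a rough sketch of what "detect punctuation and pass a clause type" could look like. This is not the code from init.py; the function name, the label strings, and the set of closing characters are all assumptions made for illustration.

```python
# Characters that may trail the real clause-final punctuation (assumed set).
CLOSERS = "\"')]}»”’"

# Hypothetical labels; the real driver's clauseType values may differ.
CLAUSE_TYPES = {
    "?": "question",
    "!": "exclamation",
    ",": "comma",
    ";": "comma",
    ":": "comma",
    ".": "statement",
}


def detect_clause_type(text: str) -> str:
    """Guess a clause type from the last punctuation mark in the text."""
    stripped = text.rstrip().rstrip(CLOSERS)
    last = stripped[-1:]  # empty string if nothing is left
    return CLAUSE_TYPES.get(last, "statement")
```

The result would travel alongside the IPA string, so the prosody pass never needs anything from the G2P beyond phonemes and stress marks.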

Tamas G @Tamasg@mindly.social

4mo

@fastfinge OMG, the lag on this is horrendous when reading longer-chunked sentences, wow! But it definitely works!

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg Also some phoneme tuning: thread and threat sound almost identical.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge and it has zero tuning for single-letter names, which is great. I realized this really quickly, "G" is pronounced as "gwee" and "h" as "hu" haha! So yeah, we'd have to override every single English language letter as a normalization rule, yikes.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg I suspect you're going to wind up training your own G2P model. Eloquence can already output its phonemes, so you'd just have to write a script to convert from Eloquence to IPA, and then you could just make a bunch of training data.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge yeah but, like, lag. I'm noticing it is really really bad when doing full sentences, because it all gets pushed to the G2P. I actually don't think these are viable: they rely on Onnxruntime too much and will be even slower on weaker CPUs. What if someone runs it on something as low-end as an Intel Atom? It's going to take 10 seconds to process the ONNX conversion part. Hmm.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg It's possible to get Onnxruntime to be snappy with a correctly optimized model, though. Blastbay TTS uses it.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg I've also been meaning to look into this. They advertise streaming TTS at faster than realtime on CPU, with 100 ms of lag or less: github.com/kyutai-labs/pocket-tts?tab=readme-ov-file

Tamas G @Tamasg@mindly.social

4mo

@fastfinge I guess not worth it? Some folks made an NVDA add-on already at the AG forum topic. forum.audiogames.net/topic/58526/pocket-tts/

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg Right, but they just used Codex and probably included all of torch. There are onnxruntime versions available, and compiled versions in Rust.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge ah darn. Not being multilingual kind of stings. I think it's why I built Speechbox. Hungary has as much of an "only commercial TTS is available, and it sounds horrible with some words" problem as we have with the antiquated Eloquence problem for US English. People there only have commercial TTS options for their screen reader, and/or eSpeak. JAWS switched to using the Vocalizer Hungarian voice over the homegrown Profivox voice. Americans would cry if they had Hungary's TTS situation lol.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg "Well they should just learn English. Idiots. They should stop being blind while they're at it." -- probably some big tech CEO, somewhere

Tamas G @Tamasg@mindly.social

4mo

@fastfinge The story of what actually happened there is a bit sad. I first translated NVDA to Hungarian in 2007, then later a dedicated group came together to maintain it from Hungary. They contacted the organization that licenses the voice for JAWS (Profivox at the time) to get it into NVDA at low cost or free. The organization, which is a blindness-specific one, kind of like an "NFB" type here, said that NVDA is in direct competition with JAWS and, because they are Freedom Scientific's prime distributor, they will never allow the use of the engine in NVDA. That was that. Kind of sad when you see everyone else outside of you as "just the competition" and can't live with the idea of coexistence. It's why I don't trash other formant Klatt projects and never will; people are putting time into them.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge here you go. I won't put this in synths, but this wires up the G2P phonemizer you mentioned.
It's much, much snappier when using 64-bit NVDA and the onnxruntime directly rather than Microsoft's own x86 DLL. That was my issue. So this is more viable for NVDA's future, not for NVDA's past. But there you go.
eurpod.com/tgSpeechBox-2026-g2p.nvda-addon

Tamas G @Tamasg@mindly.social

4mo

@fastfinge But I mean look at this bad list. the lexicon has decent coverage of common words, but the neural model for OOV words is really struggling with compound words and tech terms. And even some in-lexicon words have weird outputs.
• equals → ˈikwᵊlz (OOV, wrong - sounds like "eekwulz")
• dropdown → truncated W at end
• firefox → truncated
• bluetooth → truncated
• youtube → truncated (jˈutˌu)
• github → truncated (ɡˈɪt)
• stackoverflow → truncated
• localhost → wrong
• ctrl → garbage
• alt → garbage (ˈI = "eye"??)
• spacebar → wrong (spˈAsbˌɑɹ = "spaysbahr")
• wifi → wrong stress pattern
• combobox → OOV
• focusable → OOV, wrong
Sadly, my final conclusion is that this is a research-quality G2P, not production quality for screen reader use.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg I'm not actually having any of those issues. I'm having other ones, though. Like for me "notifications" is said "norifications". But "g" is said fine. As is ctrl and alt.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge oh now that's interesting!
But you know what? This was still solid. A good lesson and it gave me a window into what I'd need.
1. A model trained on compounds and morphological decomposition (not just dictionary lookups)
2. A model that outputs a richer IPA inventory (length marks, tie bars) so it drops in without a normalization shim
3. Or even a hybrid that uses the neural model's confidence score and only overrides eSpeak when confidence is high
So yeah. Again, lots of learning here, I don't consider my time wasted on it and appreciate you bringing the link to my attention.
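
The third option in that list (a confidence-gated hybrid) could be sketched roughly as follows. Everything here is an assumption for illustration: the model is taken to expose a probability distribution per output step, the weakest step is used as the word-level confidence, and the 0.9 threshold is arbitrary.

```python
from typing import Sequence


def sequence_confidence(step_probs: Sequence[Sequence[float]]) -> float:
    """Score a prediction by its least certain step (assumed metric)."""
    if not step_probs:
        return 0.0
    # For each output step take the winning probability, then keep the worst.
    return min(max(probs) for probs in step_probs)


def choose_pronunciation(
    espeak_ipa: str,
    neural_ipa: str,
    neural_confidence: float,
    threshold: float = 0.9,  # arbitrary cut-off for this sketch
) -> str:
    """Use the neural output only when it is confident; otherwise eSpeak."""
    if neural_ipa and neural_confidence >= threshold:
        return neural_ipa
    return espeak_ipa
```

Taking the minimum rather than the average means one shaky phoneme is enough to fall back to eSpeak, which seems like the safer default for a screen reader.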

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg I feel like it's caching something, somewhere. If I delete the addon and reinstall it, it has all different problems each time.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge Ha! So that's the thing. Different numerical behavior between native Python onnxruntime bindings and a hand-rolled ctypes vtable makes total sense. Also the x86 ONNXRuntime is 22.1, because that's the last one Microsoft released precompiled. What you're getting in 64-bit NVDA is probably computing more accurate numbers from the weights.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge take it as you will, but I asked Claude 4.6 on that one.
"The "caching" / non-determinism issue — that's almost certainly ONNX Runtime's thread parallelism. When ORT runs matrix multiplications and GRU cells across multiple CPU threads, the floating-point accumulation order is non-deterministic. Tiny rounding differences (1e-7) compound through the recurrent steps, and occasionally push a different phoneme past the argmax threshold. So you get "norifications" one run, maybe "notificashions" the next. It's not caching — it's the butterfly effect in floating point math."
The fix would be setting intra_op_num_threads=1 in the ORT session options, which forces deterministic single-threaded execution. But that'd slow it down. This is actually another strike against shipping this particular model — non-deterministic pronunciation in a screen reader would drive users crazy.
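
The floating-point part of that explanation is easy to demonstrate without ONNX Runtime at all. The sketch below does not reproduce ORT's threading; it only shows that adding the same numbers in a different order can give a different result, and that a near-tie decision can flip because of it. The scores and the "t"/"r" candidates are made up for illustration.

```python
def accumulate(values):
    """Add floats strictly left to right, as one thread would."""
    total = 0.0
    for v in values:
        total += v
    return total


def pick_phoneme(score_t: float, score_r: float) -> str:
    """Argmax over two candidate phonemes (hypothetical scores)."""
    return "t" if score_t > score_r else "r"


contributions = [0.1, 0.2, 0.3]

# Same numbers, different accumulation order.
score_forward = accumulate(contributions)
score_reversed = accumulate(reversed(contributions))

rival_score = 0.6
print(score_forward == score_reversed)             # False
print(pick_phoneme(score_forward, rival_score))    # t
print(pick_phoneme(score_reversed, rival_score))   # r
```

An explicit loop is used instead of the built-in `sum()` because recent Python versions compensate for rounding error inside `sum()`, which would hide the effect.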

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg Also, could we maybe get a neural model that just overrides eSpeak's stress marks? That seems to be the main place eSpeak falls down.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge actually, this is probably the most brilliant idea there is, because it would only be a few hundred KB: a simple LSTM or logistic classifier over syllable features type of thing. We keep eSpeak's phoneme inventory and just move the stress marks, and determinism is easier to guarantee with a simpler architecture. So we just insert this "stress model" after eSpeak's IPA output and before it hits the DSP. Simple.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge If I install Claude Code and get all the LLM tools up here, I might take a crack at this honestly. CMU dict has stress annotations for ~130K words too. We honestly have a lot of data there, and I basically just built the framework to swap in any Onnxruntime thing; we just remove the shim. But if I did this I would probably integrate it more directly into the frontend.
So the cleanest frontend integration might be: a compiled stress dictionary + suffix rules, implemented in pure C++. No ONNX, no runtime dependency. A trie or hash map from word → stress pattern, maybe 2-3MB, with a suffix-based fallback for OOV. It loads alongside phonemes.yaml, runs as a post-eSpeak normalization pass, and works identically on Windows and Linux.
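
A rough Python prototype of that dictionary-plus-suffix idea (the real thing would be C++, per the post). It assumes CMUdict-style lines, where each vowel phone ends in a stress digit (0 = none, 1 = primary, 2 = secondary). The sample lines are illustrative entries in that format, and the suffix rule is a deliberately tiny, hypothetical fallback.

```python
import re

# Illustrative lines in CMUdict format (word, then ARPAbet phones).
CMUDICT_SAMPLE = """\
PRESENT  P R EH1 Z AH0 N T
PRESENT(1)  P R IY0 Z EH1 N T
NOTIFICATION  N OW2 T AH0 F AH0 K EY1 SH AH0 N
"""

# Hypothetical fallback: these suffixes pull primary stress onto the
# second-to-last syllable.
PENULT_SUFFIXES = ("tion", "sion", "ic")


def load_stress_patterns(lines):
    """Map each word to its list of stress patterns, e.g. 'present' -> ['10', '01']."""
    patterns = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and comments
            continue
        head, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", head).lower()  # drop variant markers
        stress = "".join(p[-1] for p in phones if p[-1].isdigit())
        patterns.setdefault(word, []).append(stress)
    return patterns


def stress_for(word, syllable_count, patterns):
    """Dictionary lookup first, suffix rule second, else None (keep eSpeak's stress)."""
    known = patterns.get(word.lower())
    if known:
        return known[0]  # choosing among homographs would need POS info
    if syllable_count >= 2 and word.lower().endswith(PENULT_SUFFIXES):
        return "0" * (syllable_count - 2) + "10"
    return None


patterns = load_stress_patterns(CMUDICT_SAMPLE.splitlines())
```

Returning `None` for unknown words keeps the pass conservative: if neither the dictionary nor a suffix rule has an opinion, eSpeak's own stress placement stays untouched.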

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg Also, another stress test I've been using on text to speech systems lately is the name of my friend "Hrvoje". It's Croatian, and pronounced "her voy yay". Every AI text to speech system does something new and awful with it. So does every Klatt system haha.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge oh my gosh I know that person. Cattic, I believe, is the last name. Yes. I'm very glad you two are friends :) Yep. Met in high school I think, or something like that, back in the good old days of Live Messenger. Haha.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg Yeah, that's the guy LOL. We're reinforcing the stereotype that all blind people know each other!

Tamas G @Tamasg@mindly.social

4mo

@fastfinge hahahahahaaha well I'm always more surprised when it's someone international, because honestly the blind community operates in little cliques too, especially ones that don't reach much across international boundaries. Arguably the web has improved this a lot in some groups but not all.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

4mo

@Tamasg Yeah true. I remember one time I was on the train, and a random sighted person said to me "Hey! There's another blind person just down the car! Don't you two know each other? Why aren't you sitting together? Here, I'll take you to him." I was like "No, I'm traveling alone. We're not together." Then the other blind person overheard the conversation, and it turned out we'd known each other for years. So we sat together and chatted for the rest of the train ride. I was so tempted to pretend I didn't know him at all, just so I didn't validate this random sighted person's stereotypes.

Tamas G @Tamasg@mindly.social

4mo

@fastfinge
The pass would be: eSpeak emits IPA → frontend looks up each word's stress pattern → moves ˈ/ˌ if they disagree → continues to frame generation.
Yo I might go crazy and design this.
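
A minimal sketch of the "move ˈ/ˌ if they disagree" step, written in Python for readability. It assumes the word's IPA has already been split into syllables (syllabification is out of scope here), that the wanted pattern is a string of CMUdict-style digits, and that stress marks sit right before the syllable's vowel. The vowel set is an incomplete assumption.

```python
PRIMARY, SECONDARY = "ˈ", "ˌ"
MARK_FOR = {"0": "", "1": PRIMARY, "2": SECONDARY}

# Incomplete, assumed vowel inventory; a real pass would use the frontend's own.
VOWELS = set("aeiouyɑɐæəɛɜɪɔʊʌɚɝ")


def current_pattern(syllables):
    """Read the stress digits back out of marked syllables."""
    return "".join(
        "1" if PRIMARY in s else "2" if SECONDARY in s else "0"
        for s in syllables
    )


def place_mark(syllable, mark):
    """Strip any existing mark, then put `mark` before the first vowel."""
    bare = syllable.replace(PRIMARY, "").replace(SECONDARY, "")
    if not mark:
        return bare
    for i, ch in enumerate(bare):
        if ch in VOWELS:
            return bare[:i] + mark + bare[i:]
    return mark + bare  # no vowel found: fall back to the syllable start


def restress(syllables, wanted):
    """Re-mark syllables only when the wanted pattern disagrees with them."""
    if wanted is None or len(wanted) != len(syllables):
        return list(syllables)  # no opinion or mismatch: leave as-is
    if current_pattern(syllables) == wanted:
        return list(syllables)
    return [place_mark(s, MARK_FOR[d]) for s, d in zip(syllables, wanted)]
```

For example, `restress(["wˈaɪ", "faɪ"], "12")` gives `["wˈaɪ", "fˌaɪ"]`. Note that this only moves marks; it does not change vowel quality, so a word whose vowels reduce when the stress shifts would still need more than this pass.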

Tamas G @Tamasg@mindly.social

4mo

@fastfinge ah sigh. It uses Apache V2. Until we relicense, this is also a dead end for public release. Sigh.