Tagging @Tamasg odorediamanka600-source/FYLs-G2P: A lightweight hybrid G2P engine with less than 1.8M parameters and can be deployed on any devices (almost) github.com/odorediamanka600-source/FYLs-G2P
@fastfinge oooh this looks really neat. Only English - other languages would still need eSpeak or another solution. Also, no sentence-level prosody - FYLs-G2P doesn't output intonation/prosody markers. Hmm.
@fastfinge also bundling Onnxruntime (but maybe easier as a DLL module not full Python bloat?) for older NVDA. This would be smaller than your Gruut experiment: I'm looking at about 11 MB for the lightweight model, then another 15-20 MB for the onnxruntime DLLs, then another 20 MB for numpy, which it also requires. So in total the add-on would become 80 MB. We could do a mechanism where we use this G2P for English only but other G2Ps or Espeak for foreign languages.
@fastfinge nah. You're overestimating what eSpeak actually contributes to prosody in my pipeline. The driver detects punctuation itself (lines 1186-1199 of init.py) and passes clauseType to the frontend. Then our prosody.cpp pass handles pitch contours based on position in utterance, stress marks, and clause type. All of that machinery runs on the IPA + clauseType after G2P. eSpeak has zero input into it.
@fastfinge and it has zero tuning for single-letter names, which is great. I realized this really quickly, "G" is pronounced as "gwee" and "h" as "hu" haha! So yeah, we'd have to override every single English language letter as a normalization rule, yikes.
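A minimal sketch of what such a letter-name normalization rule could look like, assuming a plain lookup that runs before the G2P is consulted. The table `LETTER_NAMES`, the function `phonemize`, and the stand-in `fake_g2p` are all hypothetical names, and the IPA values are illustrative rather than taken from any shipped lexicon.

```python
# Hypothetical override table: only a few letters shown, IPA values are illustrative.
LETTER_NAMES = {
    "g": "dʒˈiː",
    "h": "ˈeɪtʃ",
    "w": "dˈʌbəljuː",
}


def phonemize(token, g2p):
    """Return the override for a lone letter, otherwise defer to the G2P callable."""
    key = token.lower()
    if len(key) == 1 and key in LETTER_NAMES:
        return LETTER_NAMES[key]
    return g2p(token)


def fake_g2p(token):
    # Stand-in for the real model so the sketch runs on its own.
    return f"<g2p:{token}>"


print(phonemize("G", fake_g2p))       # override wins for a single letter
print(phonemize("github", fake_g2p))  # everything else goes to the model
```

A full table would need all 26 letters, which is the "yikes" part.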
@Tamasg I suspect you're going to wind up training your own g2p model. Eloquence can already output its phonemes, so you'd just have to write a script to convert from eloquence to IPA, and then you could just make a bunch of training data.
@fastfinge yeah but, like, lag. I'm noticing it is really really bad when doing full sentences, because it all gets pushed to the G2P. I actually don't think these are viable: they rely on Onnxruntime too much and will be even slower on worse CPUs - what if someone runs it on something as low-end as an Intel Atom? It's going to take 10 seconds to process the ONNX conversion part. Hmm.
@fastfinge ah darn. Not multilingual kind of stings. I think it's why I built Speechbox. Hungary has as much of an "only commercial TTS is available, and it sounds horrible with some words" problem as we have with the antiquated Eloquence problem for US English. People there only have commercial TTS options for their screen reader, and/or Espeak. JAWS switched to using the Vocalizer Hungarian voice over the homegrown Profivox voice. Americans would cry if they had Hungary's TTS situation lol.
@fastfinge The story of what actually happened there is a bit sad. I first translated NVDA to Hungarian in 2007, then later a dedicated group came together to maintain it from Hungary. They contacted the organization that licenses the voice for JAWS (Profivox at the time) to get it into NVDA at low cost or free. The organization, which is a blindness-specific one kind of like an "NFB" type here, said that NVDA is in direct competition with JAWS and, because they are Freedom Scientific's prime distributor, they would never allow the use of the engine in NVDA. That was that. Kind of sad when you see everyone else outside of you as "just the competition" and can't live with the idea of coexistence. It's why I don't trash other formant Klatt projects, and never will; people are putting time into them.
@fastfinge here you go. I won't put this in synths, but this wires up the g2p phonemizer you mentioned. It's much, much snappier when using 64-bit NVDA and the onnxruntime directly rather than Microsoft's own x86 DLL. That was my issue. So this is more viable for NVDA future, not for NVDA past. But there you go. eurpod.com/tgSpeechBox-2026-g2p.nvda-addon
@fastfinge But I mean look at this bad list. The lexicon has decent coverage of common words, but the neural model for OOV words is really struggling with compound words and tech terms. And even some in-lexicon words have weird outputs.
• equals → ˈikwᵊlz (OOV, wrong - sounds like "eekwulz")
• dropdown → truncated W at end
• firefox → truncated
• bluetooth → truncated
• youtube → truncated (jˈutˌu)
• github → truncated (ɡˈɪt)
• stackoverflow → truncated
• localhost → wrong
• ctrl → garbage
• alt → garbage (ˈI = "eye"??)
• spacebar → wrong (spˈAsbˌɑɹ = "spaysbahr")
• wifi → wrong stress pattern
• combobox → OOV
• focusable → OOV, wrong
Sadly, my final conclusion is that this is a research-quality G2P, not production-quality for screen reader use.
@Tamasg I'm not actually having any of those issues. I'm having other ones, though. Like for me "notifications" is said "norifications". But "g" is said fine. As is ctrl and alt.
@fastfinge oh now that's interesting! But you know what? This was still solid. A good lesson and it gave me a window into what I'd need. 1. A model trained on compounds and morphological decomposition (not just dictionary lookups) 2. A model that outputs a richer IPA inventory (length marks, tie bars) so it drops in without a normalization shim 3. Or even a hybrid that uses the neural model's confidence score and only overrides eSpeak when confidence is high So yeah. Again, lots of learning here, I don't consider my time wasted on it and appreciate you bringing the link to my attention.
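The third option, the confidence-gated hybrid, could be sketched roughly like this. It assumes the neural model exposes some per-word confidence between 0 and 1, which is not confirmed for FYLs-G2P; `pick_pronunciation` and the 0.9 threshold are hypothetical, and the strings are placeholders rather than real phonemizer output.

```python
def pick_pronunciation(espeak_ipa, neural_ipa, confidence, threshold=0.9):
    """Prefer the neural output only when the model is confident enough.

    `confidence` is assumed to be a score in [0, 1]; anything below the
    threshold, or an empty neural result, falls back to eSpeak.
    """
    if neural_ipa and confidence >= threshold:
        return neural_ipa
    return espeak_ipa


# Placeholder strings stand in for real phonemizer output.
print(pick_pronunciation("ESPEAK_IPA", "NEURAL_IPA", 0.97))  # NEURAL_IPA
print(pick_pronunciation("ESPEAK_IPA", "NEURAL_IPA", 0.40))  # ESPEAK_IPA
```

The design choice here is that eSpeak stays the default, so a bad or truncated neural result can only get through when the model claims to be sure.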
@fastfinge Ha! So that's the thing. Different numerical behavior between native Python onnxruntime bindings and a hand-rolled ctypes vtable makes total sense. Also the X86 ONNXRuntime is 22.1, because that's the last one Microsoft released precompiled. What you're getting in 64-bit NVDA is probably producing more accurate numbers in the output from the weights.
@fastfinge take it as you will, but I asked Claude 4.6 on that one. "The "caching" / non-determinism issue — that's almost certainly ONNX Runtime's thread parallelism. When ORT runs matrix multiplications and GRU cells across multiple CPU threads, the floating-point accumulation order is non-deterministic. Tiny rounding differences (1e-7) compound through the recurrent steps, and occasionally push a different phoneme past the argmax threshold. So you get "norifications" one run, maybe "notificashions" the next. It's not caching — it's the butterfly effect in floating point math." The fix would be setting intra_op_num_threads=1 in the ORT session options, which forces deterministic single-threaded execution. But that'd slow it down. This is actually another strike against shipping this particular model — non-deterministic pronunciation in a screen reader would drive users crazy.
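The mechanism that quoted explanation relies on can be shown in a few lines of plain Python: floating-point addition is not associative, so the same terms summed in a different order can land on either side of an argmax tie. This is only an illustration of the arithmetic, with made-up logits and candidate phonemes; it does not show that this is what ONNX Runtime is actually doing in this add-on.

```python
def argmax(scores):
    # max() returns the first index among ties, like a typical argmax.
    return max(range(len(scores)), key=lambda i: scores[i])


PHONEMES = ["t", "ɾ"]   # hypothetical candidates at one decoding step
RIVAL_LOGIT = 0.6       # made-up score for "t"

terms = [0.1, 0.2, 0.3]  # made-up partial sums feeding the "ɾ" logit

order_a = (terms[0] + terms[1]) + terms[2]  # one accumulation order
order_b = terms[0] + (terms[1] + terms[2])  # same terms, different grouping

pick_a = PHONEMES[argmax([RIVAL_LOGIT, order_a])]
pick_b = PHONEMES[argmax([RIVAL_LOGIT, order_b])]

print(order_a == order_b)  # False: the two sums differ in the last bit
print(pick_a, pick_b)      # the chosen phoneme differs between the two orders
```

With a fixed accumulation order the result is the same on every run, which is the idea behind forcing single-threaded execution.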
@fastfinge actually, this is probably the most brilliant idea there is. Because it would only be a few hundred KB. A simple LSTM or logistic classifier over syllable features type of thing. We keep Espeak's phoneme inventory and just move the stress marks, and determinism is easier to guarantee with a simpler architecture. So we just insert this "stress model" between Espeak's IPA pipeline and the DSP. Simple.
@fastfinge If I install Claude Code, get all the LLM tools up here, I might take a crack at this honestly. CMU dict has stress annotations for ~130K words too. We honestly have a lot of data there and I just basically built the framework to swap in any Onnxruntime thing, we just remove the shim. But if I did this I would probably integrate it more directly into the frontend. So the cleanest frontend integration might be: a compiled stress dictionary + suffix rules, implemented in pure C++. No ONNX, no runtime dependency. A trie or hash map from word → stress pattern, maybe 2-3MB, with a suffix-based fallback for OOV. It loads alongside phonemes.yaml, runs as a post-eSpeak normalization pass, and works identically on Windows and Linux.
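A rough sketch of the lookup side of that idea, in Python for brevity rather than the pure C++ the post has in mind. It assumes stress patterns are stored as one digit per syllable (1 = primary, 2 = secondary, 0 = unstressed), the convention CMUdict uses on its vowels. `STRESS_DICT`, `SUFFIX_RULES`, and `stress_pattern` are hypothetical, and the entries are illustrative, not a real compiled dictionary.

```python
# Hypothetical data: one digit per syllable (1 primary, 2 secondary, 0 unstressed).
STRESS_DICT = {
    "notification": "20010",
    "spacebar": "12",
}

# Hypothetical suffix rules: (suffix, position of primary stress counted from the end).
SUFFIX_RULES = [
    ("ation", 2),  # primary stress on the penultimate syllable
    ("ity", 3),    # primary stress on the antepenultimate syllable
]


def stress_pattern(word, syllable_count):
    """Dictionary hit first, then suffix fallback, else None (keep eSpeak's stress)."""
    word = word.lower()
    if word in STRESS_DICT:
        return STRESS_DICT[word]
    for suffix, from_end in SUFFIX_RULES:
        if word.endswith(suffix) and syllable_count >= from_end:
            pattern = ["0"] * syllable_count
            pattern[-from_end] = "1"
            return "".join(pattern)
    return None


print(stress_pattern("Notification", 5))  # dictionary hit
print(stress_pattern("localization", 5))  # suffix fallback
print(stress_pattern("focusable", 3))     # no opinion
```

Returning None for unknown words means the pass never makes things worse than eSpeak alone when it has nothing to say.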
@Tamasg Also, another stress test I've been using on text to speech systems lately is the name of my friend "Hrvoje". It's Croatian, and pronounced "her voy yay". Every AI text to speech system does something new and awful with it. So does every Klatt system haha.
@fastfinge oh my gosh I know that person. Cattic I believe is last name. Yes. I'm very glad you two are friends :) Yep. met in high school I think or something like that, back in the good old days of Live Messenger. Haha.
@fastfinge hahahahahaaha well I'm always more surprised when it's someone international, because honestly the blind community operates in little cliques too, especially in ones that don't cross international boundaries as far. Arguably the web has improved this a lot in some groups but not all.
@Tamasg Yeah true. I remember one time I was on the train, and a random sighted person said to me "Hey! There's another blind person just down the car! Don't you two know each other? Why aren't you sitting together? Here, I'll take you to him." I was like "No, I'm traveling alone. We're not together." Then the other blind person overheard the conversation, and it turned out we'd known each other for years. So we sat together and chatted for the rest of the train ride. I was so tempted to pretend I didn't know him at all, just so I didn't validate this random sighted person's stereotypes.
@fastfinge lol the irony in that story isn't lost at all! Hahahaa too funny. I had similar experiences when at Microsoft because I would know people who were blind working there, but then at times I'd go out to a bar or somewhere and same thing. Sighted person asks if I know the blind man sitting across the bar, I'm like "probably not, ha, ha!" but start chatting and it's one of my coworkers who's blind and happened to be in the same bar. So yeah, that feeling's real :D
@fastfinge The pass would be: eSpeak emits IPA → frontend looks up each word's stress pattern → moves ˈ/ˌ if they disagree → continues to frame generation. Yo I might go crazy and design this.
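A minimal sketch of the "move ˈ/ˌ if they disagree" step, assuming stress marks sit directly before the vowel nucleus (as in the outputs quoted earlier in the thread) and that the wanted pattern is a digit string with one digit per syllable. The vowel set is a small illustrative subset, `restress` is a hypothetical name, and adjacent vowels are treated as one nucleus, which would be wrong for words with hiatus.

```python
PRIMARY, SECONDARY = "ˈ", "ˌ"
# Small illustrative vowel set; a real pass would use the frontend's phoneme table.
VOWELS = set("aeiouɑɐæəɛɜɪɔʊʌɚɝː")


def restress(ipa, pattern):
    """Re-place stress marks before vowel nuclei according to `pattern`."""
    bare = ipa.replace(PRIMARY, "").replace(SECONDARY, "")
    # Start index of each vowel run (adjacent vowels count as one nucleus).
    nuclei = [i for i, ch in enumerate(bare)
              if ch in VOWELS and (i == 0 or bare[i - 1] not in VOWELS)]
    if len(nuclei) != len(pattern):
        return ipa  # syllable counts disagree: leave eSpeak's output alone
    marks = {"1": PRIMARY, "2": SECONDARY}
    out = []
    for i, ch in enumerate(bare):
        if i in nuclei:
            out.append(marks.get(pattern[nuclei.index(i)], ""))
        out.append(ch)
    return "".join(out)


# Illustrative input with the stress on the wrong syllable.
print(restress("wˌaɪfˈaɪ", "12"))  # wˈaɪfˌaɪ
print(restress("ɡˈɪt", "10"))      # unchanged, one nucleus vs two digits
```

Bailing out when the syllable counts disagree keeps the pass conservative: it only ever moves marks, never adds or drops phonemes.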