🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
The State of Modern AI Text To Speech Systems for Screen Reader Users: The past year has seen an explosion in new text to speech engines based on neural networks, large language models, and machine learning. But has any of this advancement offered anything to those using screen readers? stuff.interfree.ca/2026/01/05/ai-tts-for-screenreaders.html
Landon @Landon205@vee.seedy.cc
4mo
@fastfinge Oh god, what if ChatGPT gets integrated into NVDA?
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@Landon205 There are already addons that do that.
Landon @Landon205@vee.seedy.cc
4mo
@fastfinge Oh that's cool. Not sure if I'll get one or not.
Shriram Krishnamurthi @shriramk@mastodon.social
2mo
@fastfinge Super important stuff, thanks for the work and review!
Matthew @rmcpantoja@mastodon.social
4mo
@fastfinge My results show that a dedicated IPC server performs faster, e.g. the synthDriver is ready for use in just 5 seconds; response is good even on longer sentences, but this can be attributed to the 4.2m model I'm using. And when I ran the model through a streaming vocoder, the response was surprisingly realtime, suitable for a screen reader. As for voice rate, I'm using a modification of the good "audiostretchy" pip package. I can't give more details ATM, but I hope this helps in your research.
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@rmcpantoja Also, I'd love to hear if and when you release anything!
Andre Louis @FreakyFwoof@universeodon.com
4mo
@fastfinge Supertonic works way better on my 2012 machine than Kitn. That one just cuts off things all the time and just doesn't sound good at all. I'm pretty sure it's not supposed to do that. Very weird.
patricus @patricus@snac.posix.live
4mo
@fastfinge yeah, AI TTS ain't for us. As always, sighted people get the best and we get the scraps, sadly.
Matthew @rmcpantoja@mastodon.social
4mo
@fastfinge I have been developing a neural TTS system focused on screen reading for many months, which offers instant responsiveness while maintaining good synthesis quality. And, BTW, it is not recommended at all to use espeak as a phonemizer backend, as it breaks the text embeddings during model training, especially if we use linguistic information. And please consider avoiding overloading NVDA's Python environment in your add-ons.
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@rmcpantoja Yes, the only way to avoid messing with the NVDA Python environment would be to do an IPC server. But at that point, you're really just rewriting SAPI and it seems pointless.
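
For readers unfamiliar with the IPC approach discussed here, the following is a minimal sketch of how an add-on and a separate synthesizer process might frame their messages. Everything in it is an assumption for illustration: the `encode_message` and `decode_messages` helpers, the JSON payloads, and the `cmd` and `text` field names are hypothetical and are not taken from NVDA, SAPI, or any add-on mentioned in this thread.

```python
import json
import struct
from typing import List, Tuple


def encode_message(payload: dict) -> bytes:
    """Serialize a payload as a 4-byte little-endian length followed by UTF-8 JSON."""
    body = json.dumps(payload).encode("utf-8")
    return struct.pack("<I", len(body)) + body


def decode_messages(buffer: bytes) -> Tuple[List[dict], bytes]:
    """Return every complete message in the buffer, plus any unconsumed bytes.

    A pipe or socket can deliver a message in pieces, so the caller keeps
    the leftover bytes and prepends them to the next chunk it reads.
    """
    messages = []
    offset = 0
    while len(buffer) - offset >= 4:
        (length,) = struct.unpack_from("<I", buffer, offset)
        start = offset + 4
        if len(buffer) - start < length:
            break  # the body has not fully arrived yet
        messages.append(json.loads(buffer[start:start + length].decode("utf-8")))
        offset = start + length
    return messages, buffer[offset:]


# Hypothetical commands a screen reader add-on might send to the helper process.
stream = encode_message({"cmd": "speak", "text": "hello"}) + encode_message({"cmd": "stop"})
```

With framing like this, the model and its dependencies would live only in the helper process, and the add-on would need nothing beyond the standard library. Whether that is worth the effort compared with an existing interface such as SAPI is exactly the trade-off raised above.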
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@rmcpantoja The issue with not using Espeak is that it makes it impossible to have user dictionaries. When we use a neural network, linguistic rules are no longer deterministic. So it might say a word correctly with one voice, at one time, but not with another voice, or at another time. This makes it impossible for us to correct mispronounced words in a reliable way.
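
To illustrate why determinism matters for user dictionaries, here is a minimal sketch of a text-level pronunciation dictionary. It assumes the replacement spelling is always pronounced the same way by whatever comes next in the pipeline, which holds for a rule-based phonemizer but not necessarily for a neural one. The `apply_user_dictionary` function and the sample entries are hypothetical; this is not how NVDA or eSpeak implement their dictionaries.

```python
import re
from typing import Dict


def apply_user_dictionary(text: str, dictionary: Dict[str, str]) -> str:
    """Replace whole words with the user's preferred spelling before synthesis.

    The same input always yields the same output, so a correction made once
    keeps working, provided the later stages are deterministic too.
    """
    if not dictionary:
        return text
    # Try longer entries first so a short entry cannot shadow a longer one.
    words = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in words) + r")\b",
        re.IGNORECASE,
    )
    lookup = {word.lower(): replacement for word, replacement in dictionary.items()}
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


# Hypothetical user entries.
corrections = {"GIF": "jif", "NVDA": "N V D A"}
fixed = apply_user_dictionary("NVDA reads GIF files", corrections)
```

If the stage after this substitution were a neural phonemizer, the respelled word could still come out differently from one voice or one utterance to the next, which is the reliability problem described above.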
Zach Bennoui @ZBennoui@dragonscave.space
4mo
@fastfinge Really interesting article. I'm particularly passionate about this subject, I've been fascinated with TTS for a number of years. I've trained many voices, both for Piper and some of the newer LLM based systems, and while I can't speak to the speed issue, training data is extremely important.

What you feed into these models has a big impact on the voice's performance overall. If you give it stuff scraped from the web, random audiobooks that weren't optimized for TTS, things like that, you're not going to get good results for the type of work screen reader users do every day. This applies to all of these systems, not even just neural networks. The latency / responsiveness issue is something we'll have to solve at some point, because I don't think using TTS systems last updated in 2003 is going to work out in the long term, as much as I love Eloquence.

In my ideal world, we would have either a machine learning based or formant system that is easy to train / maintain. Big companies have lost interest in on device TTS, not even just for screen reader users. Many of the solutions being put out now are cloud based, and while developers are still creating on device models, as said in the article, they're not optimized for our needs and may never be. I think we have to take matters into our own hands and figure this out, but I believe with enough people we can make it happen.
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@ZBennoui Agreed. I think blast bay is close to the right track. If only it was open and the issues pronouncing words were fixed. The speed and sound of the voices are top notch.
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@ZBennoui We need a good formant system. Machine learning is useful for setting the model parameters. But I think the word to phoneme rules can’t be a neural network, because they have to be reproducible and modifiable. Even here, though, machine learning could help. I’d love a system where a user could submit a recording of a word, and the system could create the phonetic representation.
Zach Bennoui @ZBennoui@dragonscave.space
4mo
@fastfinge Yeah I completely agree, I happen to know Philip and have been talking with him extensively about his experiments with TTS. I can't go into a ton of detail, but I'll say what he said publicly. The system he's using is a hybrid approach of neural networks and formant synthesis, where he trains a model to output formant frequencies based on the audio data he feeds into it. I won't pretend to understand all the details, this is way above my pay grade, but as far as I understand this has never been done before by another developer.
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@ZBennoui Yup. I just wish he wasn’t also trying to train his own phonemizer, because I really believe that has to be reproducible and modifiable for users. I’ve swapped multiple emails with him about an NVDA addon. But he’s pretty set on SAPI for now, until things stabilize both on the NVDA side and on his side.
clv1 has moved @clv1@mastodon.social
4mo
@amir @fastfinge @ZBennoui I too think we can make it happen by taking matters into our own hands. I don't know how to code, but I'm available to work on Portuguese language support, e.g. improving pronunciation rules, when the time comes.
Paul L @polx@mastodon.online
4mo
@fastfinge isn't it possible to "pregenerate" the speech with all the necessary IDs so that you can navigate and interrupt at will?
Just as one generates SSML from rich text (including maths formulas) before generating speech.

It would be even better to capture intonations, breaths and the like unchanged, instead of letting the TTS generate a "pleasant full phrase" (a wrong expectation).

I find your post intriguingly close to the emerging reaction against the AI-generated ;-).
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@polx Maybe, but probably not. Doing that would result in a lot of wasted resources generating speech for text I'm never going to listen to. Think about the average user interface: dozens of menus, and toolbars, and ads, and comments, and so on. Plus, the text changes constantly, even on simple websites. That's not even taking into account websites that just scroll constantly. It might be possible to create some kind of algorithm to predict the most likely text I'll want next, but now we've just added another AI on top of the first AI.

I think a better solution might be to make the text to speech system run on different hardware from the computer itself. This is, in fact, how text to speech was done in the past, before computers had multi-channel soundcards. This has a few advantages. First, even if the computer itself is busy, the speech never crashes or falls behind. Second, if the computer crashes, it could be possible to actually read out the last error encountered. Third, specialized devices could perhaps be more power and CPU efficient.

The reason text to speech systems became software, instead of hardware, is largely because of cost. It's much cheaper to just download and install a program than it is to purchase another device. Also, it means you don't have to carry around another dongle and plug it into the computer.
Jayson Smith @jaybird110127@dragonscave.space
4mo
@fastfinge I assume you didn't mention the modern efforts with regard to DECtalk due to the legal situation with that source code being about as clear as mud?
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@jaybird110127 Yup. That and the fact that the source code isn't getting updated; just getting it to keep compiling is a huge effort. There is a 64-bit build, but it doesn't actually work. I consider DECtalk pretty much dead, even though the source is available.
Chris Smart @VE3RWJ@mastodon.radio
4mo
@fastfinge Malwarebytes advises me not to visit this page, and that's after I paste the URL in my browser because Tweesecake doesn't recognize it as a URL. :)
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@VE3RWJ Shrug. Nobody else has reported that issue. Probably a false positive from Malwarebytes.
Chris Smart @VE3RWJ@mastodon.radio
4mo
@fastfinge Ok. I'll go ahead and read it. nervous laugh
Amir @amir@dragonscave.space
4mo
@fastfinge What an interesting read! Needless to say, I read it with Eloquence - LOL!
Sean Randall @cachondo@defcon.social
4mo
@amir @fastfinge It's crazy that everyone is layering it in wrappers nowadays.
Do you know if Code Factory are doing the same with their new Android build?
Andre Louis @FreakyFwoof@universeodon.com
4mo
@cachondo @amir @fastfinge I sincerely hope someone will do the same for Orpheus. I'd even pay for it.
James Scholes @jscholes@dragonscave.space
4mo
@FreakyFwoof There is a 32-bit compatibility layer in the works for NVDA itself (although it currently only references SAPI4). But with any luck the need for every add-on to implement its own will go away.

github.com/nvaccess/nvda/pull/19412

@cachondo @amir @fastfinge
Sean Randall @cachondo@defcon.social
4mo
@jscholes @FreakyFwoof @amir @fastfinge It does seem incredible to cut every 32 bit thing out so suddenly.
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@cachondo @jscholes @FreakyFwoof @amir They don't have much choice. A lot of the libraries NVDA depends on are stopping 32-bit support this year.
Sean Randall @cachondo@defcon.social
4mo
@fastfinge @jscholes @FreakyFwoof @amir I guess if this had happened a decade ago it'd have excited me. I'm obviously getting too old!
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@jscholes @cachondo @FreakyFwoof @amir My understanding is that when this comes to addons, it's going to require some kind of secure addons API/layer. And it won't be ready for 2026.1, or maybe not even 2026.2.
James Scholes @jscholes@dragonscave.space
4mo
@fastfinge Where are you getting the first part of that understanding from? I.e. the dependence on the secure add-on runtime. @cachondo @FreakyFwoof @amir
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@jscholes @cachondo @FreakyFwoof @amir It was mentioned in the roadmap NVDA released a while back.
James Scholes @jscholes@dragonscave.space
4mo
@fastfinge I see the "Secure add-on runtime" on the roadmap, with the note that "The first version of the runtime will provide support for speech synthesis and braille devices."

I don't see any implication that any 32-bit compatibility layer will only work for secure add-ons, which is hopefully a bit of a leap.

Still, the fact that people don't know what will or won't be happening, or whether their preferred synthesiser(s) will work or not, continues to be a big part of the problem.
@cachondo @FreakyFwoof @amir
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@jscholes @cachondo @FreakyFwoof @amir That's my assumption because the only things that really need a 32-bit compatibility layer are speech synthesizers and braille devices.
James Scholes @jscholes@dragonscave.space
4mo
@fastfinge In the absence of actual data, we'll just have to hope that doesn't happen.

To me at least, binding the availability of a low level compatibility shim to a higher level security mechanism seems like extra work with no benefits. Whether or not an add-on uses 32-bit libraries seems architecturally irrelevant to whether or not it can be considered "secure."
@cachondo @FreakyFwoof @amir
clv1 has moved @clv1@mastodon.social
4mo
@fastfinge @jscholes @cachondo @FreakyFwoof @amir Regarding ESpeak-ng, AFAIC, the main complaint from users is its base tone, which cannot be solved by simply making new variants. In this regard, how about improving its MBrola voices?
patricus @patricus@snac.posix.live
4mo
@clv1 @fastfinge @jscholes @cachondo @FreakyFwoof @amir the biggest gripe for me is its rrrrroughness at higher WPMs, and that's why I have it maxed but without boost.
clv1 has moved @clv1@mastodon.social
4mo
@fastfinge @jscholes @cachondo @FreakyFwoof @amir And what about recording new voices for RHVoice?
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@clv1 @jscholes @cachondo @FreakyFwoof @amir The issue is that both of these are effectively concatenative or parametric systems, rather than formant systems. So they will never be as intelligible as Eloquence.
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@FreakyFwoof @cachondo @amir You should be able to get either Gemini or Codex to help you, depending on what AI you have access to. The workflow would be:
1. Download gemini-cli or codex-cli, and get it installed and configured.
2. Clone all of the source code from github.com/fastfinge/eloquence_64/
3. Delete the tts.txt and tts.pdf files, so you don't confuse it with incorrect documentation.
4. Find any API documentation for Orpheus that's available, and add it into the folder.
5. Run codex-cli or gemini-cli, and tell it something like: "Using the information about how to develop NVDA addons you can find in agents.md, and the information about the Orpheus API I've provided in the file Orpheus-documentation-filename.txt, I would like you to modify the code in this folder to work with Orpheus instead of Eloquence."

It will go away for five or ten minutes, ask you for permission to read and write the files it's interested in, and then give you something that mostly works. Now, build the addon, run it, and tell it about the errors and problems you have and ask it to fix them. In the case of errors, include the error right from the NVDA log, and for bugs and problems, tell it exactly what it's doing wrong, and exactly what you want it to do instead. Keep doing this until you wind up with a working addon.

Think of AI as a particularly stupid programmer, and you're the manager in charge of the project. You should be able to get this done without paying anyone.
Andre Louis @FreakyFwoof@universeodon.com
4mo
@fastfinge @cachondo @amir Well there's already a 32-bit addon for Orpheus floating about. I'd still rather pay someone competent to do it, even if they use AI. Proper programming terms would help narrow down the broken bits. I'm just an audio guy.
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@FreakyFwoof @cachondo @amir Yeah, you can get AI to modify the 32-bit addon for you. That's how I got the first two Eloquence prototypes; it helped me understand the problem and what approaches would work and what wouldn't. If you give it the 32-bit Orpheus addon and the 64-bit Eloquence addon, it should be able to understand the working approach to make an addon 64-bit, and make the modifications itself. The reason to give it the 64-bit Eloquence addon as an example is so it doesn't decide to go down the gRPC route and include protobuf and a bunch of other nonsense.
Hamish @mishu70@caneandable.social
4mo
@FreakyFwoof @cachondo @amir @fastfinge Oh happy days 😊 That was the voice that used to come with the Hal screen reader, wasn't it? That was my first screen reader after my accident back in 1996, and I seem to remember the plug-in synth was called something like Apollo two or thereabouts. Such happy memories 🙂 But not really; I used to sit up till about 4 am banging my head against the brick wall trying to figure it out, but hey ho.
Luis Carlos @luiscarlosgonzalez@mastodon.social
4mo
@FreakyFwoof @cachondo @amir @fastfinge And also for Kokoro and other speech synths.
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@luiscarlosgonzalez @cachondo @FreakyFwoof @amir I didn't try Kokoro, because it cannot achieve a real time factor of 1 on CPU. By that I mean, to be fit for consideration with a screen reader, a text to speech voice must be able to generate one second of speech in one second or faster. In general, Kokoro takes two seconds to generate one second of speech. So it's not suitable.
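
The real time factor described here can be written down in a few lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, and `synthesize` stands in for any engine call that returns a sequence of audio samples; it is not the API of Kokoro or any other engine named in this thread.

```python
import time
from typing import Callable, Sequence


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Seconds of computation needed per second of generated speech."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_usable_for_screen_reader(synthesis_seconds: float, audio_seconds: float) -> bool:
    """True when speech is generated at least as fast as it plays back (RTF of 1 or less)."""
    return real_time_factor(synthesis_seconds, audio_seconds) <= 1.0


def measure_rtf(synthesize: Callable[[str], Sequence], text: str, sample_rate: int) -> float:
    """Time one call to a synthesis function and return its real time factor.

    `synthesize` is a hypothetical callable returning the generated samples.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


# The figure quoted above: two seconds of work per second of speech.
kokoro_rtf = real_time_factor(2.0, 1.0)
```

An RTF of exactly 1 is only the bare minimum for keeping up with playback. It says nothing about the delay before the first sound is heard, which matters just as much when a screen reader user is navigating quickly.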
Luis Carlos @luiscarlosgonzalez@mastodon.social
4mo
@FreakyFwoof @cachondo @amir @fastfinge What even about Tortoise or Cookie or that synth I don't know of
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@luiscarlosgonzalez @cachondo @FreakyFwoof @amir It has the same problem with speed.
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@cachondo @amir I've heard from a second hand source that they are, yes. But I haven't verified that.
🏳️‍⚧️PepperTheVixen🇵🇸 @PepperTheVixen@meow.social
4mo
@fastfinge I've started using eSpeak-ng. It's grating, but I can crank the speed up way higher than any other TTS I've ever used, especially the fancy AI shit that simulates breath draws and lip movement
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@PepperTheVixen The reason it's grating is that, unlike Eloquence and DECtalk, eSpeak only uses formant synthesis for the vowel sounds. For consonants and plosives, it instead uses concatenative recordings based on human speech. That's why even when you switch to a voice that sounds less sharp, the "t", "b", "p", and other sounds are still too sharp. This seems to be the primary cause of the fatigue most people experience while using eSpeak.
Devin Prater ​:blind:​ @pixelate@tweesecake.social
4mo
@PepperTheVixen @fastfinge Lol just imagining an AI voice with lip smacking noises.
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@pixelate @PepperTheVixen If you give chatterbox-tts an ASMR recording to clone, you can absolutely get it to make lip smacking noises.
Devin Prater ​:blind:​ @pixelate@tweesecake.social
4mo
@fastfinge @PepperTheVixen Oh my goodness. Or even better, an AI voice chewing gum.
🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca
4mo
@pixelate @PepperTheVixen If you have a sample of someone talking while chewing gum, you can absolutely make that happen.
🏳️‍⚧️PepperTheVixen🇵🇸 @PepperTheVixen@meow.social
4mo
@pixelate @fastfinge@interfree.ca I think you just did psychic damage lol
Devin Prater ​:blind:​ @pixelate@tweesecake.social
4mo
@PepperTheVixen Ooo cool! I'll be in Warhammer 40K in no time as a psyker!
D.Hamlin.Music @dhamlinmusic@dragonscave.space
4mo
@PepperTheVixen @pixelate Oh how about a voice speaking while eating?
🏳️‍⚧️PepperTheVixen🇵🇸 @PepperTheVixen@meow.social
4mo
@dhamlinmusic @pixelate internal screaming