Note by @fastfinge

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

The State of Modern AI Text To Speech Systems for Screen Reader Users: The past year has seen an explosion in new text to speech engines based on neural networks, large language models, and machine learning. But has any of this advancement offered anything to those using screen readers? stuff.interfree.ca/2026/01/05/ai-tts-for-screenreaders.html #ai #tts #llm #accessibility #a11y #screenreaders

Shriram Krishnamurthi @shriramk@mastodon.social

3mo

@fastfinge Super important stuff, thanks for the work and review!

Matthew @rmcpantoja@mastodon.social

6mo

@fastfinge My results show that a dedicated IPC server performs faster, e.g. the synthDriver is ready for use in just 5 seconds; response is good even in longer sentences, but this can be attributed to the 4.2m model I'm using. And when I ran the model through a streaming vocoder, the response was surprisingly real-time, suitable for a screen reader. As for voice rate, I'm using a modification of the good "audiostretchy" pip package. I can't give more details ATM, but I hope this helps in your research.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@rmcpantoja Also, I'd love to hear if and when you release anything!

Andre Louis @FreakyFwoof@universeodon.com

6mo

@fastfinge Supertonic works way better on my 2012 machine than Kitn. That one just cuts off things all the time and just doesn't sound good at all. I'm pretty sure it's not supposed to do that. Very weird.

patricus @patricus@snac.posix.live

6mo

@fastfinge yeah, AI TTS ain't for us. as always, sighted people get the best and we get the scraps, sadly.

Zach Bennoui @ZBennoui@dragonscave.space

6mo

@fastfinge Really interesting article. I'm particularly passionate about this subject; I've been fascinated with TTS for a number of years. I've trained many voices, both for Piper and some of the newer LLM-based systems, and while I can't speak to the speed issue, training data is extremely important.

What you feed into these models has a big impact on the voice's performance overall. If you give it stuff scraped from the web, random audiobooks that weren't optimized for TTS, things like that, you're not going to get good results for the type of work screen reader users do every day. This applies to all of these systems, not even just neural networks. The latency / responsiveness issue is something we'll have to solve at some point, because I don't think using TTS systems last updated in 2003 is going to work out in the long term, as much as I love Eloquence.

In my ideal world, we would have either a machine-learning-based or formant system that is easy to train / maintain. Big companies have lost interest in on-device TTS, not even just for screen reader users. Many of the solutions being put out now are cloud-based, and while developers are still creating on-device models, as said in the article, they're not optimized for our needs and may never be. I think we have to take matters into our own hands and figure this out, but I believe with enough people we can make it happen.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@ZBennoui Agreed. I think Blastbay is close to the right track. If only it were open and the issues pronouncing words were fixed. The speed and sound of the voices are top notch.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@ZBennoui We need a good formant system. Machine learning is useful for setting the model parameters. But I think the word to phoneme rules can’t be a neural network, because they have to be reproducible and modifiable. Even here, though, machine learning could help. I’d love a system where a user could submit a recording of a word, and the system could create the phonetic representation.
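
To make "reproducible and modifiable" concrete, here is a minimal sketch of a rule-based word-to-phoneme step with a user lexicon layered on top. Everything in it is hypothetical: the `RULES` table, the phoneme symbols, and the `phonemize` function are made up for illustration and are not taken from Eloquence, eSpeak, or any other engine.

```python
from typing import Dict

# Hypothetical digraph rules; every other letter passes through unchanged.
# The phoneme symbols are illustrative only, not a real phoneme inventory.
RULES: Dict[str, str] = {"ch": "tS", "sh": "S", "th": "T"}


def phonemize(word: str, lexicon: Dict[str, str],
              rules: Dict[str, str] = RULES) -> str:
    """Return a space-separated phoneme string for one word.

    User lexicon entries win over the rules, so a correction always sticks,
    and the same input always produces the same output.
    """
    key = word.lower()
    if key in lexicon:
        return lexicon[key]
    phonemes = []
    i = 0
    while i < len(key):
        pair = key[i:i + 2]
        if len(pair) == 2 and pair in rules:
            phonemes.append(rules[pair])   # two-letter rule matched
            i += 2
        else:
            phonemes.append(key[i])        # no rule: pass the letter through
            i += 1
    return " ".join(phonemes)


user_lexicon: Dict[str, str] = {}
print(phonemize("chaos", user_lexicon))    # rules alone: "tS a o s"
user_lexicon["chaos"] = "k eI Q s"         # the user corrects one word
print(phonemize("chaos", user_lexicon))    # now: "k eI Q s"
```

The point of the sketch is that a wrong pronunciation can be fixed by editing one table entry, which is hard to guarantee when the mapping lives inside network weights.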

Zach Bennoui @ZBennoui@dragonscave.space

6mo

@fastfinge Yeah I completely agree, I happen to know Philip and have been talking with him extensively about his experiments with TTS. I can't go into a ton of detail, but I'll say what he said publicly. The system he's using is a hybrid approach of neural networks and formant synthesis, where he trains a model to output formant frequencies based on the audio data he feeds into it. I won't pretend to understand all the details, this is way above my pay grade, but as far as I understand this has never been done before by another developer.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@ZBennoui Yup. I just wish he wasn’t also trying to train his own phonemizer, because I really believe that has to be reproducible and modifiable for users. I’ve swapped multiple emails with him about an NVDA addon. But he’s pretty set on SAPI for now until things stabilize both on the NVDA side and on his side.

clv1 has moved @clv1@mastodon.social

6mo

@amir @fastfinge @ZBennoui I too think we can make it happen by taking matters into our own hands. I don't know how to code, but I'm available to work on Portuguese language support, e.g. improving pronunciation rules, when the time comes.

Paul L @polx@mastodon.online

6mo

@fastfinge isn't it possible to "pregenerate" the speech with all the necessary IDs so that you can navigate and interrupt at will?
Just as one generates SSML from rich text (including maths formulas) before generating speech.

It would be even better to capture intonations, breaths and the like unchanged, instead of letting the TTS generate a "pleasant full phrase" (a wrong expectation).

I find your post intriguingly close to the emerging reaction against the AI-generated #mundaneslop ;-).

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@polx Maybe, but probably not. Doing that would result in a lot of wasted resources generating speech for text I'm never going to listen to. Think about the average user interface: dozens of menus, and toolbars, and ads, and comments, and so on. Plus, the text changes constantly, on even simple websites. That's not even taking into account websites that just scroll constantly. It might be possible to create some kind of algorithm to predict the most likely text I'll want next, but now we've just added another AI on top of the first AI.

I think a better solution might be to make the text to speech system run on different hardware from the computer itself. This is, in fact, how text to speech was done in the past, before computers had multi-channel soundcards. This has a few advantages. First, even if the computer itself is busy, the speech never crashes or falls behind. Second, if the computer crashes, it could be possible to actually read out the last error encountered. Third, specialized devices could perhaps be more power- and CPU-efficient.

The reason text to speech systems became software, instead of hardware, is largely because of cost. It's much cheaper to just download and install a program than it is to purchase another device. Also, it means you don't have to carry around another dongle and plug it into the computer.

Jayson Smith @jaybird110127@dragonscave.space

6mo

@fastfinge I assume you didn't mention the modern efforts with regard to DECtalk due to the legal situation with that source code being about as clear as mud?

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@jaybird110127 Yup. That and the fact that the source code isn't getting updated; just getting it to keep compiling is a huge effort. There is a 64-bit build, but it doesn't actually work. I consider DECtalk pretty much dead, even though the source is available.

Chris Smart @VE3RWJ@mastodon.radio

6mo

@fastfinge Malwarebytes advises me not to visit this page, and that's after I paste the URL in my browser because Tweesecake doesn't recognize it as a URL. :)

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@VE3RWJ Shrug. Nobody else has reported that issue. Probably a false positive from Malwarebytes.

Chris Smart @VE3RWJ@mastodon.radio

6mo

@fastfinge Ok. I'll go ahead and read it. nervous laugh

Amir @amir@dragonscave.space

6mo

@fastfinge What an interesting read! Needless to say, I read it with Eloquence - LOL!

Sean Randall @cachondo@defcon.social

6mo

@amir @fastfinge It's crazy that everyone is layering it in wrappers nowadays.
Do you know if Code Factory are doing the same with their new Android build?

Andre Louis @FreakyFwoof@universeodon.com

6mo

@cachondo @amir @fastfinge I sincerely hope someone will do the same for Orpheus. I'd even pay for it.

James Scholes @jscholes@dragonscave.space

6mo

@FreakyFwoof There is a 32-bit compatibility layer in the works for NVDA itself (although it currently only references SAPI4). But with any luck the need for every add-on to implement its own will go away.

github.com/nvaccess/nvda/pull/19412

@cachondo @amir @fastfinge

Sean Randall @cachondo@defcon.social

6mo

@jscholes @FreakyFwoof @amir @fastfinge It does seem incredible to cut every 32-bit thing out so suddenly.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@cachondo @jscholes @FreakyFwoof @amir They don't have much choice. A lot of the libraries NVDA depends on are stopping 32-bit support this year.

Sean Randall @cachondo@defcon.social

6mo

@fastfinge @jscholes @FreakyFwoof @amir I guess if this had happened a decade ago it'd have excited me. I'm obviously getting too old!

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@jscholes @cachondo @FreakyFwoof @amir My understanding is that when this comes to addons, it's going to require some kind of secure addons API/layer. And it won't be ready for 2026.1, or maybe not even 2026.2.

James Scholes @jscholes@dragonscave.space

6mo

@fastfinge Where are you getting the first part of that understanding from? I.e. the dependence on the secure add-on runtime. @cachondo @FreakyFwoof @amir

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@jscholes @cachondo @FreakyFwoof @amir It was mentioned in the roadmap NVDA released a while back.

James Scholes @jscholes@dragonscave.space

6mo

@fastfinge I see the "Secure add-on runtime" on the roadmap, with the note that "The first version of the runtime will provide support for speech synthesis and braille devices."

I don't see any implication that any 32-bit compatibility layer will only work for secure add-ons, which is hopefully a bit of a leap.

Still, the fact that people don't know what will or won't be happening, or whether their preferred synthesiser(s) will work or not, continues to be a big part of the problem. @cachondo @FreakyFwoof @amir

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@jscholes @cachondo @FreakyFwoof @amir That's my assumption because the only things that really need a 32-bit compatibility layer are speech synthesizers and braille devices.

James Scholes @jscholes@dragonscave.space

6mo

@fastfinge In the absence of actual data, we'll just have to hope that doesn't happen.

To me at least, binding the availability of a low level compatibility shim to a higher level security mechanism seems like extra work with no benefits. Whether or not an add-on uses 32-bit libraries seems architecturally irrelevant to whether or not it can be considered "secure." @cachondo @FreakyFwoof @amir

clv1 has moved @clv1@mastodon.social

6mo

@fastfinge @jscholes @cachondo @FreakyFwoof @amir Regarding eSpeak-ng, AFAIC, the main complaint from users is its base tone, which cannot be solved by simply making new variants. In this regard, how about improving its MBROLA voices?

patricus @patricus@snac.posix.live

6mo

@clv1 @fastfinge @jscholes @cachondo @FreakyFwoof @amir the biggest gripe for me is its rrrrroughness at higher WPMs and it's why I have it maxed but without boost.

clv1 has moved @clv1@mastodon.social

6mo

@fastfinge @jscholes @cachondo @FreakyFwoof @amir And what about recording new voices for RHVoice?

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@clv1 @jscholes @cachondo @FreakyFwoof @amir The issue is that both of these are effectively concatenative or parametric, rather than formant, systems. So they will never be as intelligible as Eloquence.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@FreakyFwoof @cachondo @amir You should be able to get either Gemini or Codex to help you, depending on what AI you have access to. The workflow would be:
1. download gemini-cli or codex-cli, and get them installed and configured.
2. clone all of the source code from github.com/fastfinge/eloquence_64/
3. Delete the tts.txt and tts.pdf files, so you don't confuse it with incorrect documentation.
4. Find any API documentation for Orpheus that's available, and add it into the folder.
5. Run codex-cli or gemini-cli, and tell it something like: "Using the information about how to develop NVDA addons you can find in agents.md, and the information about the Orpheus API I've provided in the file Orpheus-documentation-filename.txt, I would like you to modify the code in this folder to work with Orpheus instead of eloquence."

It will go away for five or ten minutes, ask you for permission to read and write the files it's interested in, and then give you something that mostly works. Now, build the addon, run it, and tell it about the errors and problems you have and ask it to fix them. In the case of errors, include the error right from the NVDA log, and for bugs and problems, tell it exactly what it's doing wrong, and exactly what you want it to do instead. Keep doing this until you wind up with a working addon.
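
If you'd rather script the file preparation (steps 2 through 4: cloning, removing the two tts files, adding the Orpheus documentation), a rough sketch could look like this. The `https://` prefix on the repository URL, the assumption that tts.txt and tts.pdf sit in the top-level folder, and the function names are all mine, not something the repository documents.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# Repository from step 2; the https:// prefix is an assumption.
REPO_URL = "https://github.com/fastfinge/eloquence_64/"


def clone_repo(dest: Path) -> None:
    """Step 2: clone the add-on source (needs git installed and network access)."""
    subprocess.run(["git", "clone", REPO_URL, str(dest)], check=True)


def prepare_workspace(repo: Path, orpheus_docs: Path) -> List[str]:
    """Steps 3 and 4: remove the tts files, then copy in the Orpheus docs.

    Assumes tts.txt and tts.pdf live in the top-level folder of the clone.
    Returns the names of the files that were actually removed.
    """
    removed = []
    for name in ("tts.txt", "tts.pdf"):
        target = repo / name
        if target.exists():
            target.unlink()
            removed.append(name)
    # Keep the documentation's own filename so the prompt can refer to it.
    shutil.copy(orpheus_docs, repo / orpheus_docs.name)
    return removed
```

Calling `clone_repo(Path("eloquence_64"))` and then `prepare_workspace(Path("eloquence_64"), Path("your-orpheus-docs.txt"))` leaves the folder ready for the codex-cli or gemini-cli step; the documentation filename here is a placeholder.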

Think of AI as a particularly stupid programmer, and you're the manager in charge of the project. You should be able to get this done without paying anyone.

Andre Louis @FreakyFwoof@universeodon.com

6mo

@fastfinge @cachondo @amir Well there's already a 32-bit addon for Orpheus floating about. I'd still rather pay someone competent to do it, even if they use AI. Proper programming terms would help narrow down the broken bits. I'm just an audio guy.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@FreakyFwoof @cachondo @amir Yeah, you can get AI to modify the 32-bit addon for you. That's how I got the first two Eloquence prototypes; it helped me understand the problem and what approaches would work and what wouldn't. If you give it the 32-bit Orpheus addon, and the 64-bit Eloquence addon, it should be able to understand the working approach to make an addon 64-bit, and make the modifications itself. The reason to give it the 64-bit Eloquence addon as an example is so it doesn't decide to go down the gRPC route and include protobuf and a bunch of other nonsense.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@cachondo @amir I've heard from a second-hand source that they are, yes. But I haven't verified that.

🏳️‍⚧️PepperTheVixen🇵🇸 @PepperTheVixen@meow.social

6mo

@fastfinge I've started using eSpeak-ng. It's grating, but I can crank the speed up way higher than any other TTS I've ever used, especially the fancy AI shit that simulates breath draws and lip movement

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@PepperTheVixen The reason it's grating is that, unlike Eloquence and DECtalk, eSpeak only uses formant synthesis for the vowel sounds. For consonants and plosives, it instead uses concatenative recordings based on human speech. That's why even when you switch to a voice that sounds less sharp, the "t", "b", "p", and other sounds are still too sharp. This seems to be the primary cause of the fatigue most people experience while using eSpeak.
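
For anyone curious what "formant synthesis for the vowel sounds" means in practice, here is a rough sketch: a pulse train at the voice pitch is passed through a chain of second-order resonators, one per formant. This is a generic textbook-style illustration, not how eSpeak, Eloquence, or DECtalk are actually implemented. The sample rate, pitch, bandwidths, and the approximate formant values for an "ah" vowel are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

SAMPLE_RATE = 16000  # Hz, assumed for this sketch


def resonator(signal: Sequence[float], freq: float, bandwidth: float) -> List[float]:
    """Second-order digital resonator modelling one formant.

    Uses the common difference equation y[n] = a*x[n] + b*y[n-1] + c*y[n-2].
    """
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def vowel(formants: Sequence[Tuple[float, float]],
          pitch_hz: float = 120.0, seconds: float = 0.3) -> List[float]:
    """Synthesize a steady vowel from (frequency, bandwidth) formant pairs."""
    n = round(SAMPLE_RATE * seconds)
    period = int(SAMPLE_RATE / pitch_hz)
    # Glottal source approximated by one impulse per pitch period.
    samples = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bandwidth in formants:
        samples = resonator(samples, freq, bandwidth)  # cascade the formants
    return samples


# Approximate formants for an "ah" vowel; the bandwidths are guesses.
AH = [(730.0, 60.0), (1090.0, 90.0), (2440.0, 150.0)]
samples = vowel(AH)
```

Because every sound here comes from a handful of numbers rather than a recording, the parameters can be smoothed or sped up freely, which is the property the recorded consonants lack.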

Devin Prater :blind: @pixelate@tweesecake.social

6mo

@PepperTheVixen @fastfinge Lol just imagining an AI voice with lip smacking noises.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@pixelate @PepperTheVixen If you give chatterbox-tts an ASMR recording to clone, you can absolutely get it to make lip smacking noises.

Devin Prater :blind: @pixelate@tweesecake.social

6mo

@fastfinge @PepperTheVixen Oh my goodness. Or even better, an AI voice chewing gum.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

6mo

@pixelate @PepperTheVixen If you have a sample of someone talking while chewing gum, you can absolutely make that happen.

🏳️‍⚧️PepperTheVixen🇵🇸 @PepperTheVixen@meow.social

6mo

@pixelate @fastfinge I think you just did psychic damage lol

Devin Prater :blind: @pixelate@tweesecake.social

6mo

@PepperTheVixen Ooo cool! I'll be in Warhammer 40K in no time as a psyker!

D.Hamlin.Music @dhamlinmusic@dragonscave.space

6mo

@PepperTheVixen @pixelate Oh how about a voice speaking while eating?

🏳️‍⚧️PepperTheVixen🇵🇸 @PepperTheVixen@meow.social

6mo

@dhamlinmusic @pixelate internal screaming