@fastfinge

Admin

Completely blind computer geek, lover of science fiction and fantasy (especially LitRPG). I work in accessibility, but my opinions are my own, not those of my employer. Fandoms: Harry Potter, Discworld, My Little Pony: Friendship is Magic, Buffy, Dead Like Me, Glee, and I'll read fanfic of pretty much anything that crosses over with one of those.
keyoxide: aspe:keyoxide.org:PFAQDLXSBNO7MZRNPUMWWKQ7TQ

Location Ottawa

Birthday 1987-12-20

Pronouns he/him (EN)

xmpp fastfinge@im.interfree.ca

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@polx Maybe, but probably not. Doing that would result in a lot of wasted resources generating text I'm never going to listen to. Think about the average user interface: dozens of menus, and toolbars, and ads, and comments, and so on. Plus, the text changes constantly, on even simple websites. That's not even taking into account websites that just scroll constantly. It might be possible to create some kind of algorithm to predict the most likely text I'll want next, but now we've just added another AI on top of the first AI.

I think a better solution might be to make the text to speech system run on different hardware from the computer itself. This is, in fact, how text to speech was done in the past, before computers had multi-channel soundcards. This has a few advantages. First, even if the computer itself is busy, the speech never crashes or falls behind. Second, if the computer crashes, it could still be possible to read out the last error encountered. Third, specialized devices could perhaps be more power and CPU efficient.

The reason text to speech systems became software, instead of hardware, is largely because of cost. It's much cheaper to just download and install a program than it is to purchase another device. Also, it means you don't have to carry around another dongle and plug it into the computer.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@ZBennoui Yup. I just wish he wasn’t also trying to train his own phonemizer, because I really believe that has to be reproducible and modifiable for users. I’ve swapped multiple emails with him about an NVDA addon. But he’s pretty set on SAPI for now, until things stabilize both on the NVDA side and on his side.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@ZBennoui We need a good formant system. Machine learning is useful for setting the model parameters. But I think the word to phoneme rules can’t be a neural network, because they have to be reproducible and modifiable. Even here, though, machine learning could help. I’d love a system where a user could submit a recording of a word, and the system could create the phonetic representation.
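
To illustrate what "reproducible and modifiable" could mean for word to phoneme rules, here is a minimal sketch in Python. Everything in it is hypothetical: the rule table, the phoneme symbols, and the `to_phonemes` function are toy placeholders, not the rules of any real synthesizer. The point is only that the same input always gives the same output, and that a user can override any word by editing a plain dictionary.

```python
# Toy letter-to-sound rules (hypothetical symbols, checked in order).
RULES = [
    ("sh", "S"),
    ("ch", "tS"),
    ("th", "T"),
    ("ee", "i:"),
    ("oo", "u:"),
]


def to_phonemes(word, exceptions=None):
    """Convert a word to a space-separated phoneme string, deterministically."""
    word = word.lower()
    # A user-supplied exception always wins over the rules.
    if exceptions and word in exceptions:
        return exceptions[word]
    out = []
    i = 0
    while i < len(word):
        for pattern, phoneme in RULES:
            if word.startswith(pattern, i):
                out.append(phoneme)
                i += len(pattern)
                break
        else:
            # No rule matched: pass the letter through unchanged.
            out.append(word[i])
            i += 1
    return " ".join(out)


print(to_phonemes("sheep"))                                # S i: p
print(to_phonemes("colonel", {"colonel": "k 3: n @ l"}))   # k 3: n @ l
```

Because the rules are plain data, a mispronounced word can be fixed by adding one dictionary entry rather than retraining anything.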

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@ZBennoui Agreed. I think Blastbay is close to the right track. If only it were open and the issues pronouncing words were fixed. The speed and sound of the voices are top notch.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@amberhinds To be fair to your step dad, Karen and Lee, the popular Australian voices found on most GPS devices as well as on iOS, are some of the nicest text to speech voices ever made for casual listening. They were originally developed by ScanSoft, later purchased by Nuance, then transferred to Cerence Automotive, and finally ended up owned by Microsoft. Largely this quality is due to whoever was in charge of recording and curating the data back in 2002. They did an excellent job editing and aligning the recordings for use with the concatenative synthesis technology that was available at the time, resulting in the Australian voices sounding noticeably better than all of the other English options, even though they all used the same underlying methods. Because the data capture was so high quality, those voices have stayed a step ahead as technology and training methods have improved. The female version of the voice your step dad is almost certainly using is based on this woman: en.wikipedia.org/wiki/Karen_Jacobsen

If all of my favourite, fast and efficient voices were ripped away from me, those Australian voices are probably what I'd revert to. They're not as fast as I would like, but at least they're clear and accurate. Your step dad has good taste.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@phillycodehound I played briefly with this and it seemed okay on the surface, though I'm not looking for work so didn't go deep: github.com/rendercv/rendercv

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@luiscarlosgonzalez @cachondo @FreakyFwoof @amir It has the same problem with speed.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@luiscarlosgonzalez @cachondo @FreakyFwoof @amir I didn't try Kokoro, because it cannot achieve a real time factor of 1 on CPU. By that I mean, to be fit for consideration with a screen reader, a text to speech voice must be able to generate one second of speech in one second or faster. In general, Kokoro takes two seconds to generate one second of speech. So it's not suitable.
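
As a rough sketch of the measurement described above, the following Python computes a real time factor as synthesis time divided by the duration of the audio produced, so 1.0 or lower means the voice keeps up. The `synthesize` callable is an assumption: it stands in for whatever engine is being tested and is expected to return a sequence of mono samples. `fake_synthesize` is a placeholder that only produces silence.

```python
import time


def real_time_factor(synthesize, text, sample_rate):
    """Return synthesis time divided by the duration of the generated audio."""
    start = time.perf_counter()
    samples = synthesize(text)  # hypothetical engine call, returns mono samples
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesizer returned no audio")
    return elapsed / audio_seconds


def fast_enough(rtf):
    """True when one second of speech takes one second or less to generate."""
    return rtf <= 1.0


def fake_synthesize(text):
    """Placeholder engine: 0.1 seconds of silence per character at 16 kHz."""
    return [0.0] * (1600 * len(text))


rtf = real_time_factor(fake_synthesize, "hello world", 16000)
print(f"real time factor: {rtf:.4f}, usable: {fast_enough(rtf)}")
```

By this definition, a voice that takes two seconds to generate one second of speech has a real time factor of 2.0 and fails the check.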

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@clv1 @jscholes @cachondo @FreakyFwoof @amir The issue is that both of these are effectively concatenative or parametric, rather than formant, systems. So they will never be as intelligible as Eloquence.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@VE3RWJ Shrug. Nobody else has reported that issue. Probably a false positive from Malwarebytes.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@jscholes @cachondo @FreakyFwoof @amir That's my assumption because the only things that really need a 32-bit compatibility layer are speech synthesizers and braille devices.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@FreakyFwoof @cachondo @amir Yeah, you can get AI to modify the 32-bit addon for you. That's how I got the first two Eloquence prototypes; it helped me understand the problem and what approaches would work and what wouldn't. If you give it the 32-bit Orpheus addon, and the 64-bit Eloquence addon, it should be able to understand the working approach to make an addon 64-bit, and make the modifications itself. The reason to give it the 64-bit Eloquence addon as an example is so it doesn't decide to go down the gRPC route and include protobuf and a bunch of other nonsense.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@jscholes @cachondo @FreakyFwoof @amir It was mentioned in the roadmap NVDA released a while back.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@cachondo @jscholes @FreakyFwoof @amir They don't have much choice. A lot of the libraries NVDA depends on are stopping 32-bit support this year.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@jscholes @cachondo @FreakyFwoof @amir My understanding is that when this comes to addons, it's going to require some kind of secure addons API/layer. And it won't be ready for 2026.1, or maybe not even 2026.2.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@FreakyFwoof @cachondo @amir You should be able to get either Gemini or Codex to help you, depending on what AI you have access to. The workflow would be:
1. Download gemini-cli or codex-cli, and get it installed and configured.
2. Clone all of the source code from github.com/fastfinge/eloquence_64/
3. Delete the tts.txt and tts.pdf files, so you don't confuse it with incorrect documentation.
4. Find any API documentation for Orpheus that's available, and add it into the folder.
5. Run codex-cli or gemini-cli, and tell it something like: "Using the information about how to develop NVDA addons you can find in agents.md, and the information about the Orpheus API I've provided in the file Orpheus-documentation-filename.txt, I would like you to modify the code in this folder to work with Orpheus instead of Eloquence."
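
The folder preparation in the steps above (delete the old docs, add the API documentation, write the prompt) could be scripted roughly as follows. This is only a sketch under assumptions: it assumes the repository is already cloned, that tts.txt and tts.pdf sit at the top of that folder, and the function name `prepare_workspace` is made up for illustration. It does not install or run either CLI.

```python
import shutil
from pathlib import Path

# The files the steps above say to delete (assumed to be at the repo root).
STALE_DOCS = ("tts.txt", "tts.pdf")


def prepare_workspace(repo_dir, api_doc):
    """Remove the stale docs, copy the API documentation in, return the prompt."""
    repo = Path(repo_dir)
    for name in STALE_DOCS:
        # missing_ok avoids an error if a file was already removed.
        (repo / name).unlink(missing_ok=True)
    doc = Path(api_doc)
    shutil.copy(doc, repo / doc.name)
    return (
        "Using the information about how to develop NVDA addons you can find "
        "in agents.md, and the information about the Orpheus API I've provided "
        f"in the file {doc.name}, I would like you to modify the code in this "
        "folder to work with Orpheus instead of Eloquence."
    )
```

Calling `prepare_workspace("eloquence_64", "my-orpheus-docs.txt")` (both paths are placeholders) returns the text to paste into the CLI session.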

It will go away for five or ten minutes, ask you for permission to read and write the files it's interested in, and then give you something that mostly works. Now, build the addon, run it, and tell it about the errors and problems you have and ask it to fix them. In the case of errors, include the error right from the NVDA log, and for bugs and problems, tell it exactly what it's doing wrong, and exactly what you want it to do instead. Keep doing this until you wind up with a working addon.

Think of AI as a particularly stupid programmer, and you're the manager in charge of the project. You should be able to get this done without paying anyone.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@cachondo @amir I've heard from a second hand source that they are, yes. But I haven't verified that.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@pixelate @PepperTheVixen If you have a sample of someone talking while chewing gum, you can absolutely make that happen.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@pixelate @PepperTheVixen If you give chatterbox-tts an ASMR recording to clone, you can absolutely get it to make lip smacking noises.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@PepperTheVixen It's grating because, unlike Eloquence and DECtalk, eSpeak only uses formant synthesis for the vowel sounds. For consonants and plosives, it instead uses concatenative recordings based on human speech. That's why even when you switch to a voice that sounds less sharp, the "t", "b", "p", and other sounds are still too sharp. This seems to be the primary cause of the fatigue most people experience while using eSpeak.
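
For anyone wondering what formant synthesis of a vowel looks like in practice, here is a minimal Python sketch. It is not how eSpeak, Eloquence, or DECtalk are implemented; it only shows the general idea of passing a pulse train through a cascade of two-pole resonators, one per formant. The formant frequencies and bandwidths are rough illustrative values for an "ah" sound, and all function names are made up for this example.

```python
import math


def resonator_coeffs(freq, bandwidth, sample_rate):
    """Coefficients of a two-pole digital resonator with unity gain at DC."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(signal, freq, bandwidth, sample_rate):
    """Filter a signal through one formant resonator."""
    a, b, c = resonator_coeffs(freq, bandwidth, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(formants, pitch_hz=120.0, seconds=0.5, sample_rate=16000):
    """Pulse train at the pitch period, shaped by a cascade of resonators."""
    n = int(seconds * sample_rate)
    period = int(sample_rate / pitch_hz)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bandwidth in formants:
        signal = resonate(signal, freq, bandwidth, sample_rate)
    return signal


# Illustrative (frequency, bandwidth) pairs in Hz, roughly an "ah" vowel.
AH = [(730.0, 80.0), (1090.0, 90.0), (2440.0, 120.0)]
samples = synthesize_vowel(AH)
print(len(samples), "samples generated")
```

Every sound here comes from a handful of numbers, with no recordings involved. The recorded consonants described above are the part that this approach does not cover.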
