Note by @fastfinge

Aaron @hosford42@techhub.social

5mo

What are your pain points, folks? Stuff that you hate doing or dealing with, or problems you can't find a good solution to? Stuff that other people might be frustrated with, too.

I'm looking for a way to make myself valuable to other people, as a way to both help people and also earn an income to feed my family in the process.

One thing I can do really well is create reliable software to automate rote tasks, generate financial/statistical/other reports, or calculate difficult solutions. Think it can't be done without LLMs? I might surprise you!

Throw me a bone!

Please boost for reach!

#PainPoints
#WishList
#Automation
#Reporting
#ProblemSolving
#FediHire
#GetFediBHired
#FediJob

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@hosford42 Sadly, there is no money in solving any of my problems. If there was, someone would have solved them. See, for example, my complaints about text to speech systems. stuff.interfree.ca/2026/01/05/ai-tts-for-screenreaders.html

I can go into more detail about why all the options are bad if you want. But this is the sort of problem that eats years of your life, requires advanced mathematics (digital signal processing at a minimum), and advanced linguistics, on top of being a good systems-level programmer.

Aaron @hosford42@techhub.social

5mo

@fastfinge I just so happen to be an (unemployed) machine learning researcher by trade, with advanced mathematics, linguistics, and programming skills. Maybe not systems-level programming, but I could probably find someone who does that and work with them.

Given that the first two responses I've gotten were both about accessibility, there might be more of a market for this than you think, and also, it might make a good way to demo my skills even if it isn't paid work.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@hosford42 The reason I say systems-level programming is mostly because for a text to speech system used by a blind power user, you need to keep an eye on performance. If the system crashes and the computer stops talking, the only choice the user has is to hard reset. It would be running and speaking the entire time the computer is in use, so memory leaks and other inefficiencies are going to add up extremely quickly.

From what I can tell, the ideal is some sort of formant-based vocal tract model. Espeak sort of does this, but only for the voiced sounds. Plosives are generated from modeling recorded speech, so they sound weird and overly harsh to most users, and I suspect this is where most of the complaints about espeak come from. A neural network or other sort of machine learning model could be useful to discover the best parameters and run the model, but not for generating audio itself, I don't think. This is because most modern LLM-based neural network models can't allow changing of pitch, speed, etc., as all of that comes from the training data.
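
To make the point about pitch and speed concrete, here is a minimal sketch of formant synthesis for a single voiced sound, assuming a plain impulse train as the voice source and a cascade of Klatt-style second-order resonators as the vocal tract. The sample rate, the formant values for the "ah"-like vowel, and all function names are illustrative assumptions, not taken from espeak or Eloquence. Pitch and duration are ordinary arguments here, so changing them does not depend on any training data.

```python
import math

SAMPLE_RATE = 16000  # Hz; an arbitrary choice for this sketch


def resonator(signal, freq, bandwidth, rate=SAMPLE_RATE):
    """Second-order digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / rate
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def glottal_pulses(pitch_hz, duration_s, rate=SAMPLE_RATE):
    """Impulse train standing in for the voiced source (a big simplification)."""
    n = round(duration_s * rate)
    period = max(1, round(rate / pitch_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, pitch_hz=120.0, duration_s=0.3):
    """Run the source through one resonator per (frequency, bandwidth) pair."""
    signal = glottal_pulses(pitch_hz, duration_s)
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth)
    return signal


# Rough, illustrative formant values for an "ah"-like vowel (assumed).
AH = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]

low_slow = synthesize_vowel(AH, pitch_hz=100.0, duration_s=0.2)
high_fast = synthesize_vowel(AH, pitch_hz=200.0, duration_s=0.1)
```

The same formant table produces both outputs; only the source parameters differ. Plosives and other unvoiced sounds would need a noise source and time-varying parameters, which this sketch leaves out.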

Secondly, the phonemizer needs to be reproducible. What if, say, it mispronounces "Hermione"? With most modern text to speech systems, this is hard to fix; the output is not always the same for any given input. So a correction like "her my oh nee" might work in some circumstances, but not others, because how the model decides to pronounce words and where it puts the emphasis are just a black box. The state of the art, here, remains Eloquence. But it uses no machine learning at all, just hundreds of thousands of hand-coded rules and formants. But, of course, it's closed source (and as far as anyone can tell the source has actually been lost since the early 2000s), so goodness knows what all those rules are.
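
A minimal sketch of what "reproducible" could mean in practice, assuming a phonemizer built as a pure lookup with a user exception table consulted first. The function name, the ARPAbet-style phoneme strings, and the spell-it-out fallback are all hypothetical placeholders; this is not how Eloquence or any existing system is known to work.

```python
# Built-in lexicon with a deliberately wrong entry, to show the fix (illustrative).
BASE_LEXICON = {
    "hermione": "HH ER1 M IY0 OW2 N",
}


def pronounce(word, user_lexicon, base_lexicon=BASE_LEXICON):
    """Return a phoneme string for `word`.

    Lookup order is fixed: user corrections, then the built-in lexicon,
    then a fallback. There is no hidden state, so the same input always
    yields the same output.
    """
    key = word.lower()
    if key in user_lexicon:
        return user_lexicon[key]
    if key in base_lexicon:
        return base_lexicon[key]
    # Placeholder fallback: spell the word letter by letter.
    # A real system would apply ordered letter-to-sound rules here.
    return " ".join(key.upper())


before = pronounce("Hermione", user_lexicon={})

# The user's correction, stored once and applied everywhere.
corrections = {"hermione": "HH ER0 M AY1 AH0 N IY0"}
after = pronounce("Hermione", user_lexicon=corrections)
```

Because the correction is keyed on the word itself and applied before anything else, it holds in every context rather than "in some circumstances, but not others".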

Aaron @hosford42@techhub.social

5mo

@fastfinge Reading your linked article and this reply, I get the sneaking suspicion that HDC (hyperdimensional computing) or other one- or few-shot learning methods that are designed to factor the model into independent components that can be quickly recomposed in new ways might be appropriate. The idea would be to, as you suggest, learn the values for these components using machine learning, but also the mapping between them and the sounds produced, so that each becomes separately tunable on the fly.

HDC has the added advantage that it is great for working with "fuzzy", human-interpretable rule representations, is typically extremely efficient compared to neural nets, and even meshes well with neural nets and gradient descent-based optimization.
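
For readers unfamiliar with HDC, here is a minimal sketch of one common formulation (random bipolar hypervectors, binding by elementwise multiplication, bundling by majority vote). It is an assumption about the general technique, not the design Aaron has in mind, and the role and value names are hypothetical. It shows how a bundled record can be factored back into independent components, and how one component can be swapped without disturbing the others.

```python
import random

DIM = 10_000             # hypervector length; large so random vectors are nearly orthogonal
rng = random.Random(0)   # fixed seed keeps the sketch deterministic


def random_hv():
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def bind(a, b):
    """Elementwise multiply: associates a role with a value. It is its own inverse."""
    return [x * y for x, y in zip(a, b)]


def bundle(vectors):
    """Elementwise majority vote. Use an odd number of vectors to avoid ties."""
    return [1 if sum(column) > 0 else -1 for column in zip(*vectors)]


def similarity(a, b):
    return sum(x * y for x, y in zip(a, b)) / DIM


ROLES = {name: random_hv() for name in ("pitch", "rate", "voicing")}
VALUES = {name: random_hv()
          for name in ("high", "low", "fast", "slow", "voiced", "unvoiced")}


def encode(settings):
    """Pack role/value pairs into a single hypervector."""
    return bundle([bind(ROLES[role], VALUES[value])
                   for role, value in settings.items()])


def decode(record, role):
    """Unbind one role and return the closest known value name."""
    noisy = bind(record, ROLES[role])
    return max(VALUES, key=lambda name: similarity(noisy, VALUES[name]))


original = encode({"pitch": "high", "rate": "fast", "voicing": "voiced"})
retuned = encode({"pitch": "low", "rate": "fast", "voicing": "voiced"})
```

Decoding `original` for the `pitch` role recovers `high`, while `retuned` yields `low` for pitch and still `fast` for rate, which is the "separately tunable" property in miniature.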

Do you happen to have data of any sort that could be used for training?

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@hosford42 In general, for training the rules for pronouncing English, the CMU pronouncing dictionary is used: www.speech.cs.cmu.edu/cgi-bin/cmudict
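
A minimal sketch of reading entries in the plain-text layout used by CMUdict 0.7b, assuming `;;;` comment lines, a word followed by ARPAbet phonemes, and `(1)`-style suffixes for alternate pronunciations. The sample lines are written for illustration in that layout rather than copied from the file, and the function names are hypothetical.

```python
import re

# Illustrative lines in the CMUdict 0.7b layout (not copied from the real file).
SAMPLE = """\
;;; comment lines start with three semicolons
READ  R EH1 D
READ(1)  R IY1 D
SPEECH  S P IY1 CH
"""


def parse_cmudict(text):
    """Map each lowercase word to a list of pronunciations (lists of phonemes)."""
    entries = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue
        head, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", head).lower()  # drop the variant suffix
        entries.setdefault(word, []).append(phones)
    return entries


def stress_pattern(phones):
    """Collect the stress digits carried by the vowels (0, 1 or 2)."""
    return "".join(p[-1] for p in phones if p[-1].isdigit())


entries = parse_cmudict(SAMPLE)
```

Keeping every variant per word, rather than only the first, matters for words like "read" where the right choice depends on context.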

When it comes to open-source speech data, LJSpeech is the best we have, though far from perfect: keithito.com/LJ-Speech-Dataset/

And here's a link to GnuSpeech, the only open-source fully articulatory text to speech system I'm aware of: github.com/mym-br/gnuspeech_sa?tab=readme-ov-file

I'm afraid I don't have any particular data of my own.

Aaron @hosford42@techhub.social

5mo

@fastfinge thanks! I'll have a look at these.

Were you wanting to collaborate on this?

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@hosford42 Sadly, this is so far outside of my expertise and abilities it's not even funny. I have an excellent handle on what's needed, and the vague shape of the path forward, but actually doing any of it is way outside of my skillset. If it was anywhere near something I could do, I would have started already. :-)

Aaron @hosford42@techhub.social

5mo

@fastfinge how about for guidance, design, requirements, alpha testing and evaluation?

Aaron @hosford42@techhub.social

5mo

@fastfinge my thinking is that, sure, I can build a thing, just like all those other folks, but you know the actual needs it would meet firsthand. That's tremendously valuable and can make the difference between something awesome and something completely useless.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@hosford42 Absolutely yes to all of the above. I can think of another 10 people on Mastodon at minimum who are also ready and willing to help wherever they can. Just none of us with the skillset to do the actual work.

Aaron @hosford42@techhub.social

5mo

@fastfinge Awesome!

What you (and others who are interested) could do to help me right off the bat:

1. Make a list of the common issues, bugs, and failure modes you see in existing systems. Split hairs where you can on this, so I know exactly what issues to design around.

2. Make a list of the features you want, and include info about how important they are to you. Pay special attention to things that distinguish a good screen reader TTS from tools designed for sighted people.

You've already given me a lot of material on both of these, which is super helpful. I just want to make sure I'm getting a complete understanding so we are not surprised by a finished product that was subtly misaligned to your needs.

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@hosford42 When it comes to requirements, in general, if it can work with both the SAPI5 and NVDA add-on APIs, it will suit the requirements of speech dispatcher on Linux and the Mac APIs. The important thing is that most screen readers want to register indexes and callbacks. So, for example, if I press a key to stop the screen reader speaking, it needs to know exactly where the text to speech system stopped so that it can put the cursor in the right place. It also wants to know what the TTS system is reading so it can decide when to advance the cursor, get new text from the application to send for speaking, etc. I really really really wish I had a better example of how that works in NVDA than this: github.com/fastfinge/eloquence_64/blob/master/eloquence.py
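
A minimal sketch of the index-and-callback idea, assuming a toy single-threaded synthesizer. `ToySynth`, `IndexMark`, and the callback names are hypothetical and are not NVDA's or SAPI5's actual API; a real driver would render audio asynchronously and fire the callbacks from the audio thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMark:
    """Marker the screen reader places between chunks of text."""
    index: int


class ToySynth:
    def __init__(self, on_index, on_done):
        self._on_index = on_index   # called each time a mark is reached
        self._on_done = on_done     # called only if speech finishes naturally
        self._stop_requested = False
        self.last_index = None      # last mark reached before stopping
        self.spoken = []            # stands in for audio actually rendered

    def speak(self, sequence):
        self._stop_requested = False
        for item in sequence:
            if self._stop_requested:
                return              # interrupted: on_done is not fired
            if isinstance(item, IndexMark):
                self.last_index = item.index
                self._on_index(item.index)
            else:
                self.spoken.append(item)
        self._on_done()

    def stop(self):
        self._stop_requested = True


reached = []
finished = []


def on_index(index):
    reached.append(index)
    if index == 2:      # simulate the user pressing "stop" at this moment
        synth.stop()


synth = ToySynth(on_index, on_done=lambda: finished.append(True))
synth.speak([
    "First line.", IndexMark(1),
    "Second line.", IndexMark(2),
    "Third line.", IndexMark(3),
])
```

After the simulated interruption, `synth.last_index` is 2, so the screen reader knows to leave the cursor at the end of the second line, and `finished` stays empty because the utterance never completed.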

Aaron @hosford42@techhub.social

5mo

@fastfinge I think I get the gist but the code will help a lot!

🇨🇦Samuel Proulx🇨🇦 @fastfinge@interfree.ca

5mo

@hosford42 I wish it would. Unfortunately, that code is what we use to keep Eloquence alive in the 64-bit NVDA version. So it's awful, for dozens of reasons. This...is a bit clearer? Maybe? Anyway, it's the canonical example of how NVDA officially wants to interact with a text to speech system, written by the NVDA developers themselves. Any text to speech system useful for screen reader users needs to expose everything required for someone to write code like this. Not saying you could or should; there are dozens of blind folks who can do the job of integrating any text to speech system with all of the various APIs on all the screen readers and platforms. But we have to have useful hooks to do it. github.com/nvaccess/nvda/blob/master/source/synthDrivers/espeak.py