So this looks like a high quality, fast, natural, and open source TTS system in Python. A key candidate for an #NVDA #addon. Unfortunately, I find #nvdasr addon development super confusing. Is there a good template to start from or something? github.com/thewh1teagle/kokoro-onnx
Here's a much longer example of the quality of speech Kokoro TTS generates. I really do think it might be a decent #NVDA addon. The weird pauses are because I'm just giving it a big long string, rather than chunking it like I should. It generates this in real time on CPU, and faster on GPU. The code to generate it is as follows:

import soundfile as sf
from kokoro_onnx import Kokoro
from onnxruntime import InferenceSession

session = InferenceSession("kokoro-v0_19.onnx", providers=["ROCMExecutionProvider", "CPUExecutionProvider"])
kokoro = Kokoro.from_session(session, "voices.json")
samples, sample_rate = kokoro.create(
    "He wasn't sleeping very well, and he knew the people around him noticed, but he didn't know what to do about it. He had quietly gone to Madame Pomfrey, who had regretfully told him that Dreamless Sleep was highly addicting and that while she could give him the occasional dose, it would have to be spread out enough to prevent it from becoming addicting – meaning he could only take it one night out of every two weeks or so. It was one night more of productive sleep than he'd be getting otherwise, so he still did it, but it didn't help the larger issue. He wasn't under the effects of any nightmare-inducing Curses, potions, or other magical ailments, so there was nothing for Madame Pomfrey to do. The nightmares were coming from his own mind, and she was not a Mind-Healer. She'd offered to try and connect Harry with one, but when Harry discovered that it involved having someone else quite literally entering his mind with magic and helping him sort out things like trauma he couldn't. If Harry couldn't even tell Hermione the extent of what he'd suffered at the Dursley's, he wasn't about to let a stranger into his mind to see it. Let alone the 'adventures' of his Hogwarts years. So the nightmares persisted, and with the poor quality of sleep serving as the first domino, everything else slowly began to fall. His grades weren't slipping yet, but he was struggling with the study schedule Hermione had set out for them and doing his homework took more effort, more energy that he didn't have.",
    voice="af_sarah", speed=1.0, lang="en-us"
)
sf.write("audio.wav", samples, sample_rate)
print("Created audio.wav")
@fastfinge I have tried to speed those Eleven Labs voices up quite a bit using ElevenReader, and I don’t really like the results. Curious to hear what you think.
@FreakyFwoof @SeveraSnape @cachondo It's a Harry/Hermione fluff fic, so probably something you'd both enjoy. I'm surprised you don't already have it. Harry Potter and the Art of Getting Your Shit Together Posted originally on the Archive of Our Own at archiveofourown.org/works/59310490.
@FreakyFwoof @SeveraSnape @fastfinge the dedication is awesome. I need a way of knowing when the in-progress things are done, seriously. Do I perhaps need to make an ao3 account or something?
@cachondo @SeveraSnape @fastfinge I had to, because I follow sooooo many things on both ff net and ao3, so I set up a rule to forward any HP-related emails to a dedicated folder. I set it such that if the emails do not contain the text 'Harry Potter' it immediately deletes them. It's cut down on my inbox clutter 10-fold if not more, and the folder of fics is purely HP-related. I loves it.
@cachondo @SeveraSnape @FreakyFwoof Yeah, and get fanficfare configured on a server somewhere. It can monitor an imap account, find emails from FF and AO3, and auto-update your epub files.
@cachondo @FreakyFwoof @fastfinge For Andre's folder, I've begun putting status in there, I have visions of a Change log like I do for mine... but there's so much I'm doing to his folder right now it would just be crazy. But you would only need an account if you want to get author alerts. And... if you follow a bunch then you would get lots of emails
@SeveraSnape @cachondo @FreakyFwoof Coffee is the one true caffeine delivery system, and mint tea is the only acceptable hot drink without caffeine. ROFL
@fastfinge it's a very nice neutral sort of an accent. Goes a bit funny on the ends of some words; "one" and "so" are good examples in that sample. But I can see it being a great option for people who want more human-sounding voices.
@cachondo So looking at it, it looks like it just uses the phonemes generated by espeak, and passes those to the natural voices. So if you use a voice trained on American English, and ask for en-gb, it'll do it anyway and sound terrible.
@fastfinge haha that's rather funny. One of the biggest complaints from users new to screen reading when I taught was the quality of the available voices. The school paid for Vocalizer I think but that was as good as it got. I did get a few people onto the neural stuff, but it was in its infancy when I left. This sounds really smooth in comparison.
@fastfinge so can it only write direct-to-file, or could it also pass raw PCM data to a callback or have a way of reading a buffer it creates with that raw data? NVDA drivers would work infinitely simpler under that model. Sadly no real template for one exists beyond just looking at the code for drivers like DECTalk or Eloquence, Sonata, etc. and basing it off them to see which pattern best fits that synth's way of operating on things.
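To make the "raw PCM to a callback" idea concrete, here is a minimal sketch, not anything kokoro-onnx or NVDA actually provides. It assumes the synth hands back float samples in the range -1.0 to 1.0 (as the example above passes to soundfile), and the names `feed_pcm` and `on_audio` are made up for illustration.

```python
import array
from typing import Callable, Iterable


def feed_pcm(samples: Iterable[float], on_audio: Callable[[bytes], None],
             frames_per_chunk: int = 1024) -> int:
    """Convert float samples in [-1.0, 1.0] to 16-bit PCM and hand them to
    on_audio in fixed-size chunks. Returns the number of chunks delivered."""
    pcm = array.array("h")  # signed 16-bit integers, native byte order
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # guard against overshoot
        pcm.append(int(clipped * 32767))
    chunks = 0
    for start in range(0, len(pcm), frames_per_chunk):
        on_audio(pcm[start:start + frames_per_chunk].tobytes())
        chunks += 1
    return chunks
```

A driver could pass whatever function feeds its audio player as `on_audio`; collecting into a list (`feed_pcm(samples, received.append)`) is enough to try it out.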
@Tamasg @tspivey So it looks like the repo is still super active. For this to be an addon, we want: streaming samples in real time, and indication of speech starting and stopping. Anything else? I can open an issue to ask.
@fastfinge @tspivey I think yeah, a way to inject stop sequences mid-speech as well (so we could call a shut-up or stop from the main thread during playback) - having callbacks for stop can be nice, sometimes we can gather that just on the basis of the audio buffer closing itself if that's done in realtime with speech fragments.
@FreakyFwoof @SerenaTori @fastfinge ha. I know very little about how we could get it compiled right in the add-on. (I know there was a discussion of this earlier so if that build process for onnxruntime into the add-on succeeded, would love some basic copy then.) For anyone wanting to try, I think looking at something like the Brailab driver (which is super minimal and in the end all you're really going to use are the getters and setters for the synth driver, the way you do speech is obviously not at all like Brailab), and then crafting in to open the stream might work. But between the latest family emergency, work at Spotify with the new year / new projects, I'm afraid I'll be swamped for a while to give it that truly comparative look. I'd also love to see a test run at how quick it can synthesize speech on slower CPUs especially when that speech is interrupted mid-utterance - how does it handle stopping a stream and loading a new one, is there lots of latency? A simple py test that just throws lots of speech chunks like that, stops, starts, would give us an idea maybe to then know if it's worth turning into a driver just yet.
@Tamasg @SerenaTori @fastfinge Sorry to hear about family emergencies, never nice to deal with. I hope things can be sorted out for the better.
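The "simple py test" described above could look roughly like this. It is only a sketch: `fake_synth` is a stand-in that sleeps instead of running a model, and `time_to_first_chunk` is a made-up helper name. To measure a real engine, you would swap in a function that yields audio chunks from it, assuming the engine can stream at all.

```python
import time
from typing import Callable, Iterator, List


def fake_synth(text: str) -> Iterator[bytes]:
    """Stand-in for a real engine: yields one silent chunk per word."""
    for _ in text.split():
        time.sleep(0.001)  # pretend inference cost per chunk
        yield b"\x00\x00" * 240


def time_to_first_chunk(synth: Callable[[str], Iterator[bytes]],
                        utterances: List[str],
                        chunks_before_interrupt: int = 1) -> List[float]:
    """For each utterance, record the seconds until the first chunk arrives,
    then abandon the stream after a few chunks, as rapid arrowing would."""
    latencies = []
    for text in utterances:
        start = time.perf_counter()
        stream = synth(text)
        for i, _chunk in enumerate(stream):
            if i == 0:
                latencies.append(time.perf_counter() - start)
            if i + 1 >= chunks_before_interrupt:
                break  # interrupt mid-utterance
        close = getattr(stream, "close", None)
        if close is not None:
            close()  # let a generator release its resources
    return latencies


if __name__ == "__main__":
    results = time_to_first_chunk(fake_synth, ["one two three", "four five"] * 10)
    print(f"worst time to first chunk: {max(results) * 1000:.1f} ms")
```

The number that matters for a screen reader is the worst case, not the average, since one slow start is what makes arrowing through text feel laggy.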
Re slow CPU though, that's where I come in. I am right now even, using an Intel Core i5-3570K from 2012. It runs every synth very well, apart from Piper which it struggles with due to the neural aspect of it. If my machine can run... Whatever you guys end up coming up with (hopefully) then anything else should be a breeze.
@FreakyFwoof @Tamasg @SerenaTori @fastfinge I have an even slower one. Yay for countries in the middle of... Well somewhere, and computers from 2009 haha. If something can even run on that, I'd be surprised. How's that for a slow processor? It's pretty ancient. The synth sounds nice, yeah, don't like how it reads hashtag, but I guess that's me. There's also something about question marks it clearly missed, but I think it needs to be fed a bigger chunk of text to see if it'll sound better. Otherwise, for the quality, Bleh, either my ears, or something, do not consider it a great quality in the sound terms, but for a TTS, I guess it's good. Says the person who daily drives a TTS that came out in 2001. LOL.
@FreakyFwoof @Tamasg @SerenaTori @fastfinge A synth that does English people no good. Haha. And I have a Dell from 2009, it still has a 32-bit Windows 10 version, so it tells you something. :D
@FreakyFwoof @Tamasg @SerenaTori @fastfinge I also cannot tell you the full specs. Computer not here, sadly. It has a removable battery though, that gave up a long time ago, then I fell down some stairs while carrying said computer, and the pixels in the screen went poof, and no screen.
@fastfinge I wonder if Sonata would try to incorporate it? The trick with stuff like this is you might actually want to use a server process model rather than trying to run it from within NVDA itself.
@x0 Yeah, it does have a ton of dependencies. I will say all of the voices are better than Sonata/Piper, IMHO. Even if it does look like they're all Eleven Labs ripoffs.
@fastfinge I understand @mush42 has made very significant progress, for example as compared to Piper TTS. To me it looks like it's much lighter for both training and using the trained model, even enhancing audio quality and intelligibility in the process. This is just my guess, but with such an achievement it's fine not to limit it to a blind audience exclusively. This is how I am seeing #optispeech. However, I haven't played with Kokoro TTS, thus I have asked how much you like it, for example compared to something else, perhaps Piper TTS if you know that one.
@fastfinge I heard about this project a few months back when it was still just a Hugging Face demo. The model was trained on outputs from proprietary TTS systems including Eleven Labs and OpenAI, hence why the quality is so good. Really cool project, and the model is still being worked on.
@fastfinge I suspect your first big headache will be getting onnxruntime (and any other heavy dependencies) installed into the add-on's environment. Doesn't look like simple pure Python code.
@jscholes You can just do it with pip install --target=. to force pip to install a package and all dependencies to the current directory. Then import from the extension directory. The only issue is I'm not sure if onnxruntime has 32-bit binaries or if I'll need to cross-compile the wheel from source.
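As a rough sketch of the "import from the extension directory" step: assuming the packages were installed with something like `pip install --target=vendor onnxruntime`, the add-on would need to put that directory on `sys.path` before importing. The directory name `vendor` and the helper `add_vendor_dir` are made up for illustration.

```python
import os
import sys


def add_vendor_dir(base_dir: str, name: str = "vendor") -> str:
    """Put base_dir/name at the front of sys.path (once) so packages
    installed there with `pip install --target` can be imported."""
    path = os.path.join(base_dir, name)
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


# Typical use from inside the add-on's own module:
# add_vendor_dir(os.path.dirname(os.path.abspath(__file__)))
# import onnxruntime  # now resolved from the vendored copy, if present
```

This only handles the import path. It does nothing about the 32-bit binary question raised above, since a wheel built for the wrong architecture will still fail to load.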
@fastfinge outside of the dev guide and addon dev guide on github, not ... really, that I know of. Admittedly, those resources HAVE gotten a bit better as of late
Yeah, I am deeply confused about how buffers work and how to indicate when speaking is complete and do indexing and so on. If this is going to be an #NVDA addon, someone else will have to do it.
@fastfinge You need support from the synth for some features. This one doesn't have anything. Once it starts speaking, it blocks until it's done, so you can't interrupt it.
@fastfinge Taking this sentence and passing it straight through, it pauses after highly. That's not even that many words. He had quietly gone to Madame Pomfrey, who had regretfully told him that Dreamless Sleep was highly addicting and that while she could give him the occasional dose, it would have to be spread out enough to prevent it from becoming addicting – meaning he could only take it one night out of every two weeks or so.
@tspivey Also, how does NVDA chunk text it passes to a synth? Even that's not really documented anywhere LOL. I think Kokoro inference would need running in its own thread, so the thread could be killed when we wanted to stop speech rather than generating extra samples, and a new thread could be started so you could start the new speech quickly, like when someone's pressing down arrow rapidly. But I don't have the time, and I'm not smart enough.
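A sketch of the thread idea, with one caveat: Python has no supported way to kill a thread, so the usual substitute is a cancel flag that the worker checks between chunks. Everything here is hypothetical (`SpeechWorker`, `synth_chunks`, `on_audio`), and it assumes the engine can yield audio piece by piece, which the thread has not confirmed for Kokoro.

```python
import threading
from typing import Callable, Iterable, Optional


class SpeechWorker:
    """Runs synthesis on a background thread; a new speak() cancels the old one."""

    def __init__(self, synth_chunks: Callable[[str], Iterable[bytes]],
                 on_audio: Callable[[bytes], None]) -> None:
        self._synth_chunks = synth_chunks
        self._on_audio = on_audio
        self._cancel: Optional[threading.Event] = None
        self._thread: Optional[threading.Thread] = None

    def speak(self, text: str) -> None:
        self.stop()  # interrupt whatever is currently being spoken
        cancel = threading.Event()
        thread = threading.Thread(target=self._run, args=(text, cancel), daemon=True)
        self._cancel, self._thread = cancel, thread
        thread.start()

    def stop(self) -> None:
        if self._cancel is not None:
            self._cancel.set()
        self.wait()  # returns once the chunk in progress has finished
        self._cancel = self._thread = None

    def wait(self) -> None:
        if self._thread is not None:
            self._thread.join()

    def _run(self, text: str, cancel: threading.Event) -> None:
        for chunk in self._synth_chunks(text):
            if cancel.is_set():
                return  # drop remaining audio instead of generating more
            self._on_audio(chunk)
```

The catch is visible in `stop()`: interruption is only as fast as one chunk of inference, so if the engine blocks until a whole utterance is done, this does not help, and a separate server process that can be terminated is the more realistic route.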
@tspivey Yeah, I'm increasingly convinced that @x0 is correct, and this would need to be part of Sonata if this was going to happen at all. They seem to have solved those issues mostly.
@fastfinge @tspivey my wife and I had an unexpected keyword argument a few years ago. She'd never heard of the word nomenclature. I ended up with gnomes on a clay chair as a whacky present as a reminder of the utter ridiculousness of the discussion.
@fastfinge @tspivey I read an article a while ago about a young American whose dad hadn't heard of a word she'd picked up at college. It wasn't a particularly complicated or unusual word, but much was made of it in this article.
I sometimes wish I had a searchable text file of everything my screen reader ever said.
@tspivey @fastfinge I'm not sure this is the reason for pausing, but the model has a total context size of 500 characters and will not do well with input longer than that. It may also just be bad training data, sentences not ending with correct punctuation, primarily trained on paragraphs, etc. I’ve trained many TTS models over the last few years and data quality is extremely important, something lacking in most open source TTS systems out there.
@ZBennoui @tspivey I think it's something with the ONNX implementation actually. The PyTorch version doesn't have this issue. There's an open issue looking into it.
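If the 500-character limit mentioned above is the culprit, chunking at sentence boundaries before synthesis would be the obvious workaround. This is a minimal sketch: the sentence-splitting regex is a simplification, `chunk_text` is a made-up name, and it is not how kokoro-onnx itself handles long input.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 500) -> List[str]:
    """Split text into pieces of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence over the limit is cut at the last space that fits.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars  # no space available: hard cut
            chunks.append(sentence[:cut].rstrip())
            sentence = sentence[cut:].lstrip()
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate  # keep packing sentences into this chunk
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `kokoro.create(...)` in turn and the resulting sample arrays joined, which should also make the pauses land on sentence ends rather than mid-phrase.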
@tspivey That's why you start a session, so the model stays loaded in memory. Then I think you can actually stream output from onnxruntime bit by bit, I'm just not sure how.