Speech Abstraction Layer #35
I think this method severely limits what we can do with speech-dispatcher, given that we are (slowly) moving to an SSIP-based library. It would restrict a lot of the functionality which I expect to become standard in most screen readers. So having a backup where, without speech-dispatcher support, we just directly speak to |
I don't think this issue should be closed just yet: not everything is ready for implementation, and there are things to iron out before I can start implementing. |
With the hopeful rise of something like

That said, the answer to this is (probably) another speech server which implements SSIP, not a direct linking with

I believe the correct solution for this is still to have

So, a generic speech struct is fine, as long as it only supports two enums. I don't want Odilia trying to support Eiffel and pico speech engines as well. This falls fairly outside the scope of this project; perhaps an additional crate could be created to deal with this... but again, PRs to upstream speech engines to implement SSIP would still be preferred.

So I like the idea, I just think it may fall almost entirely outside of Odilia's purview. If anybody happens to know how Orca handles this, I'd be interested.

You're right, I shouldn't have closed the issue. |
If I may add something here: the thing that annoys me the most about speech-dispatcher is that you can't tweak settings that are specific to a particular synthesizer. When using espeak-ng, for instance, you can't set the rate multiplier setting (I don't remember the exact name), so even with the speech rate set to 100% I still find it painfully slow. |
Indeed. However, if we're talking about other platforms and not only covering Linux fragmentation, we have tts-rs, which abstracts what we want well enough. The only problem is that it doesn't integrate anything but speech-dispatcher on Linux.
Well, if we want espeak to be the backup, we probably have to either link with libespeak (the NG variant these days, I suppose) or statically include it in the binary. As of now we can't require many things to implement SSIP on their own; we should have a very fault-tolerant speech server instead. This falls very neatly in line with my comment about remaking the speech server and making it right this time, with asynchronous streams, compatibility with newer audio backends and so on; see above for context.
The problem here is that you can't require all speech engines to be a speech server and implement SSIP; that'd basically be Wayland all over again. Plus, if there are more speech sockets, how would Odilia see them? How would it connect to them, and to which should it connect? The ideas behind speech-dispatcher are good and needed, but a screen reader should always have an emergency fallback mode, which can be activated either when the engine reports an error when speak is called, or simply when the user presses a panic key command that basically tells the screen reader: help, I have no speech!
Can you mock out an implementation of that? I don't understand what you described well enough to implement it, sorry. Ideally it should also show how the flow of changing implementations would go, whether that's configurable in the config file, and so on.
Well, simple: it entirely and solely relies on speech-dispatcher. In comparison, tdsr and fenrir (either one, or both) have an abstraction that allows them to rely on spd, espeak, or anything that can be called through the command line, with a wrapper. While there might still be various sound-card issues and whatever else preventing speech, spd is no longer the only source of truth, which means that if there's anything wrong with it and it alone, the backup engine(s) should provide speech, allowing the user to fix the issue if it's fixable. Also, I think Odilia shouldn't only work on the desktop, which means it should also be able to scale down to a command-line/terminal screen reader mode if AT-SPI errors out. Isn't speech-dispatcher too heavy for a TTY-only environment like, say, BSD images or the Arch live ISO shell? |
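To make the wrapper idea above concrete, here is a minimal, purely illustrative sketch of the kind of command-line fallback tdsr and fenrir use, assuming a hypothetical CommandLineBackend type and espeak-ng as the external program; none of these names come from Odilia's code.

use std::io::Write;
use std::process::{Child, Command, Stdio};

// Hypothetical fallback backend that shells out to any command-line
// synthesizer (espeak-ng here), in the spirit of the tdsr/fenrir wrappers.
struct CommandLineBackend {
    program: String,
    current: Option<Child>,
}

impl CommandLineBackend {
    fn speak(&mut self, text: &str) -> std::io::Result<()> {
        // Cut off whatever is still speaking before starting a new utterance.
        self.stop();
        let mut child = Command::new(&self.program)
            .stdin(Stdio::piped())
            .spawn()?;
        if let Some(mut stdin) = child.stdin.take() {
            stdin.write_all(text.as_bytes())?;
        } // stdin is dropped here, signalling end of input to the synthesizer
        self.current = Some(child);
        Ok(())
    }

    fn stop(&mut self) {
        if let Some(mut child) = self.current.take() {
            let _ = child.kill();
        }
    }
}

With something like this as the emergency backend, the panic key described above could simply swap the active backend and re-announce the last utterance.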
It's true that Orca solely relies on speech-dispatcher. It has an abstraction layer though.
|
trait SpeechBackend {
    ...
}

impl SpeechBackend for SSIPBackend { ... }
impl SpeechBackend for ESpeakBackend { ... }

enum SpeechBackendType {
    SSIPCompatible(SSIPBackend),
    EspeakNG(ESpeakBackend),
}

So, only two variants will be supported: one which supports SSIP (for now only speech-dispatcher), and one which directly interfaces with espeak-ng. The syntax obviously won't compile, but you get the idea. |
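As a hedged aside: assuming the trait eventually gains a speak method (the sketch above leaves its contents as ...), dispatch through the enum could look roughly like this; nothing here is from Odilia's code.

// Purely illustrative: `speak` is an assumed trait method, not yet part of
// the proposal above.
fn speak_with(backend: &mut SpeechBackendType, text: &str) {
    match backend {
        SpeechBackendType::SSIPCompatible(b) => b.speak(text),
        SpeechBackendType::EspeakNG(b) => b.speak(text),
    }
}

Implementing SpeechBackend on SpeechBackendType itself (or using a crate like enum_dispatch) would hide that match from callers.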
Why is the speech backend only a marker trait, so to speak? We need some way to unify those backends, so that we can place values for them in config files, speak with those backends through the trait itself, etc. Or, if that behaviour is not desired, then why have a trait at all? Also, how would we represent backend-specific configuration in the config file? Could we derive Serialize and Deserialize on the enum and types, put the config in the types, and have it work? What about backend-agnostic settings? Or is espeak considered strictly a fallback backend that shouldn't be very configurable, since it's for emergencies only? |
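On the serialization question, here is a minimal sketch of what deriving Serialize/Deserialize could look like, assuming serde and made-up field names (nothing here reflects Odilia's actual config schema): backend-specific settings live inside the enum variant, backend-agnostic ones sit next to it.

use serde::{Deserialize, Serialize};

// Illustrative only: variant and field names are invented for the sketch.
#[derive(Serialize, Deserialize)]
#[serde(tag = "kind", rename_all = "snake_case")]
enum BackendConfig {
    Ssip { socket: Option<String> },
    EspeakNg { rate_boost: Option<f32> },
}

#[derive(Serialize, Deserialize)]
struct SpeechConfig {
    rate: u8,               // backend-agnostic
    pitch: u8,              // backend-agnostic
    backend: BackendConfig, // backend-specific block, selected by "kind"
}

A config file would then pick the backend with kind = "espeak_ng" (or "ssip"), and only the options belonging to that variant would be accepted.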
Indeed, ideally some module-specific configuration should be allowed, since some synth parameters are specific to that synth, e.g. rate boost in espeak. That could probably be fixed in spd, likely with some kind of configuration-architecture overhaul, modules needing updates and so on, but it's doable.
Have you used PipeWire recently with spd? Maybe it's only in my case, I don't know, but it introduces a lot of stuttering on my system. At the start everything is fine, but gradually, as spd keeps running, the xruns it usually produces (which the PipeWire team had to account for specially after some bugs were filed in the past) affect running apps to the point that voice calls on Matrix or whatever VoIP solution I'm using, even videos or music, become so choppy I can't understand anything. Usually, restarting PipeWire, turning off Orca, then turning it back on after enough seconds have passed for spd to shut itself down fixes it for a while. I asked in the PipeWire Matrix room and they told me other people are having issues with spd as well; they also say it's because audio is processed synchronously.

I would add that it creates a stream per module, which isn't necessarily good. On top of that, its priority system simply cuts the previously speaking stream, gives the other one the highest priority, and then allows for the possibility that that app, even if it's a screen reader, never gets the priority to speak again; I've seen that happen. Then there's the incident discussed on the audiogames.net forum, where a person installed Arch and everything but speech-dispatcher (not sound, just speech-dispatcher) wasn't working.
I firstly agree with what you said about the user-interface part; we'd probably use GTK. However, I'm dreading that day, because it's going to be interesting to make the UI in such a way that it's generated by, and depends on, what's in the config files, but that's another discussion for another day. About contributing to speech-dispatcher: firstly, it's written in C, which I don't think I know well enough; then it uses autotools, which is kind of hard to get right even for a Unix master, due to the intricacies and complexities of that system; plus most of the architecture would probably have to be rewritten if we want asynchronous streams, per-module mixing, sane priority systems and all that goodness. We can always take it incrementally, though. About it being only the synths: in the modules section of the spd reference, it says modules are encouraged to send spoken text as samples back to spd, which means spd then plays them. For a quick and dirty test, try speaking "hello world!" with espeak-ng, voice en-us+max, versus doing it through spd-say. The first version sounds like normal espeak, like it sounds on Windows and Android, while through spd it sounds compressed for some reason, so spd is surely doing something with the data before sending it to... Pulse, I guess. |
Exactly. |
If we already have a good crate (tts-rs) abstracting the speech server in Rust, why not contribute to it? If we want to have a direct espeak backend, why don't we add it to tts-rs? |
This would be great! I don't know enough about tts-rs to know if we can use it with SSML, but generally I like contributing upstream, and this seems pretty reasonable. |
I'm not sure, but I think it was at least in their plans. We would have to check the current state, and whether it's a feature they are interested in if it's not already supported. As for the backends they use, I'm almost sure at least the desktop ones all have this feature. |
The problem with tts-rs is that all its functions are synchronous, which means we'd have to use spawn_blocking on Tokio in order to do speech quasi-asynchronously. We tried that some time ago, and that's why we moved away from the tts-rs creator's speech-dispatcher binding: it presented problems when used outside the simplest take-this-piece-of-text-and-speak-it scenario. For example, there was an issue with being unable to stop speech after pressing the ctrl key.

This is because, as you know, Odilia's architecture, while not really hardcore well-defined, is mostly based around the everything-is-an-async-stream philosophy. While it allowed us to use the systems we have in probably a more efficient way, the problem is that the codebase as it currently stands isn't prepared for synchronous code, especially not in larger quantities or at the point where systems meet, which in this case happened to be the speech layer. So what happened is that the speech function either blocked until speech was finished, or spawned a blocking task; since the screen reader was able to continue processing events, I expect the latter. In any case, a thread somewhere in the application was blocked until the speak function returned, i.e. until libspeechd returned, i.e. until the speech-dispatcher server sent the speech finished/OK message.

So then, when the ctrl key was pressed, the screen reader got it, but it couldn't stop the speech. Aborting the task did nothing, since the message was still in flight. Sending a cancel request, which is what we ended up doing, didn't work either, because libspeechd probably has its own queue per connection, I think, so the cancel message we sent remained in that queue until the speak message got its reply, and by then it was already too late, since the server would probably have replied with, well, nothing to cancel, or something like that. With async, the speak message is sent via normal means, but then the future returns |
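A rough sketch of the pattern described above, with a hypothetical SyncSpeaker standing in for any synchronous speech call (tts-rs, libspeechd bindings, and so on); it only illustrates why aborting the async task does not stop an utterance that is already in flight.

use tokio::task::spawn_blocking;

// Stand-in for a synchronous speech library; imagine `speak` blocks until
// the speech server acknowledges (or finishes) the request.
struct SyncSpeaker;

impl SyncSpeaker {
    fn speak(&mut self, _text: &str) { /* blocking call into the library */ }
}

async fn speak_quasi_async(text: String) {
    // The closure runs on a blocking thread. Aborting the surrounding task,
    // or queueing a cancel afterwards, cannot interrupt this call once it
    // has started, which is exactly the ctrl-key problem described above.
    spawn_blocking(move || {
        let mut speaker = SyncSpeaker;
        speaker.speak(&text);
    })
    .await
    .expect("speech task panicked");
}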
After a thread on the audiogames.net forum about speech-dispatcher not working, leaving the user without a way to use their computer in graphical mode, I concluded that, regardless of how good speech-dispatcher may be from a Unix-philosophy standpoint, it's made of many moving parts, too many IMHO, any of which can bring the entire system down simply by failing. In a world where speech is the only way for visually impaired people to use a computer, and therefore as critical as the GPU is for sighted users, having it fail on us is simply unacceptable. Dear readers with some kind of sight: would you like your screen to go black simply because the shader cache was full, for example? Also, graphics are integrated everywhere in the stack while speech isn't, but that's another discussion for another time. So then, as I'm sure you wouldn't like that, why should we have to deal with it? I took this idea from screen readers like NVDA, and Fenrir for the TTY, where there's a speech abstraction inside the screen reader itself, not speech-dispatcher. Then there are different backends facilitating speech, which perhaps eventually give PCM wave data back to the screen reader, where it gets processed by some internal systems, for example through direct interaction with PipeWire. So, in this case, we would use a Rust trait to abstract away the concrete implementation of the backend speech provider, and then the screen reader would probably use
Box<dyn SpeechProvider>
as the interface through which to deliver speech to the user. For now, this is the draft I want to propose; feel free to modify it and suggest improvements, as this one is probably here to stay after it's implemented. Here are a few things the current implementation doesn't explain:
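As a starting point only, here is one way the proposed abstraction could be sketched, assuming the async-trait crate and purely illustrative method names; the actual draft may differ.

use async_trait::async_trait;

// Placeholder error type for the sketch.
struct SpeechError;

#[async_trait]
trait SpeechProvider: Send {
    async fn speak(&mut self, text: &str) -> Result<(), SpeechError>;
    async fn stop(&mut self) -> Result<(), SpeechError>;
}

// The screen reader holds a boxed provider and never sees the concrete backend.
struct Speech {
    provider: Box<dyn SpeechProvider>,
}

impl Speech {
    async fn say(&mut self, text: &str) -> Result<(), SpeechError> {
        self.provider.speak(text).await
    }
}

Backend-specific details (SSIP sockets, PCM buffers handed back for mixing) would live behind the trait, so switching to a fallback provider is just replacing the box.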