feat: add Whispering support #454
base: master
Some questions:
|
It works just as fast as VOSK; however, it only starts transcribing after the sentence ends. It does not produce partial results, which might make it look slow.
So far we have tested both the medium and large models, with very good performance.
We ran our tests on an OVH t1-45 VPS, which has an NVIDIA Tesla V100.
We have not tested that yet, but it seems that Whispering supports multiple connections. If you want, we plan on presenting our findings at today's Jitsi community call.
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #454 +/- ##
============================================
- Coverage 23.15% 22.39% -0.77%
Complexity 304 304
============================================
Files 69 70 +1
Lines 5812 6006 +194
Branches 790 804 +14
============================================
- Hits 1346 1345 -1
- Misses 4235 4430 +195
Partials 231 231
Continue to review full report at Codecov.
|
ctx.put("no_speech_threshold", 0.6);
ctx.put("buffer_threshold", 0.5);
ctx.put("vad_threshold", 0.5);
ctx.put("data_type", "float32");
If using a GPU, I expect it would be a bit faster with float16. But most of the time is spent waiting for audio, I guess.
My bad, this was supposed to be int64, as, from my understanding, this is the audio format Jigasi sends.
I have implemented a conversion to float32 in shirayu/whispering#36, but I will suggest float16 for better performance.
e645d49 to d8f88ae
@charles-zablit do you have a plan to finish this? It would be a great feature, as I think Whisper is currently the best open-source STT. I would like to use it for meeting notes.
Hi @charles-zablit @nikvaessen, just wondering what happened to this Whisper-related Jigasi integration, which is about a year old?

Rummaging through the current codebase, I found this file: https://github.com/jitsi/jigasi/blob/master/src/main/java/org/jitsi/jigasi/transcription/WhisperTranscriptionService.java Although it is not mentioned in the README (which refers to Google Cloud, Vosk, and LibreTranslate), there is now some recent code that links transcription to some sort of Whisper system. In contrast to what Charles was doing, the current code refers to 'a custom Whisper server' without any details, and there doesn't seem to be any documentation on what it is or how to set it up. Charles, over a year ago, was just about ready with something that would use Whispering (which is MIT licensed): https://github.com/shirayu/whispering/

Unfortunately, the PR now has conflicts, and the Whispering project has been archived by its original author given the availability of newer Whisper systems, e.g. whisper.cpp, which works with CPU inference as well as GPU. Is there any chance we could still have the Whispering PR integrated, since it uses Whisper through an open service as opposed to whatever is now in the codebase? If we had an example, it might be possible to adapt it to suit one of the newer Whisper implementations available these days. I've also seen some scripts which, if given multiple channels, will do some rough diarisation so that the transcript incorporates multiple named speakers.

Many thanks for your work on all of this. Best, M.
Where do you see this? |
The link to the source file was in my last post. Here it is again: https://github.com/jitsi/jigasi/blob/master/src/main/java/org/jitsi/jigasi/transcription/WhisperTranscriptionService.java See line 27...
Hi, we are still at a very early stage with our own Whisper live transcription implementation. We plan to make it open source in the not-so-distant future. Cheers,
@charles-zablit @nikvaessen @damencho The whisper live transcription server is now open source under the jitsi/skynet project. It should work out of the box with Jigasi. |
This PR adds support for Whispering, a streaming transcription server based on OpenAI's Whisper.
Whispering's advantage over VOSK is that it supports language detection and transcription in multiple languages.
The Whispering transcription service uses WebSockets to communicate with the Whispering server.
This is still a WIP, as we need to fix a sample rate incompatibility between Whispering and Jigasi.
Right now, we have to set EXPECTED_AUDIO_LENGTH to 25600.
We also have to change https://github.com/shirayu/whispering/blob/256bf38b4d3d751e1eac8116f0f7da07e1b9652f/whispering/serve.py#L69 to
audio = np.frombuffer(message, dtype=np.int64)
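As a rough illustration of what that change does, the sketch below decodes a raw byte message as int64 samples and then casts them to float32, along the lines of the conversion discussed in the review. The helper name decode_int64_message and the sample values are hypothetical and not taken from Whispering or Jigasi; the sketch only casts and does not scale or resample the audio.

```python
import numpy as np


def decode_int64_message(message: bytes) -> np.ndarray:
    """Interpret a raw byte message as int64 samples and cast to float32.

    Hypothetical helper for illustration only; not part of Whispering
    or Jigasi.
    """
    # np.frombuffer reinterprets the bytes without copying, using the
    # native byte order. The message length must be a multiple of
    # 8 bytes, the size of one int64 sample.
    samples = np.frombuffer(message, dtype=np.int64)
    # Cast only: no normalization or resampling is applied here.
    return samples.astype(np.float32)


# Made-up sample values, packed the same way they are decoded.
message = np.array([0, 1, -2, 300], dtype=np.int64).tobytes()
audio = decode_int64_message(message)
```

If the bytes were instead read with a different dtype than the one they were written with, the number of samples and their values would no longer match what the sender produced, which is why the dtype on the receiving side has to agree with the sender's format.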