I just ran it again and happened to get an even better time, under 7 seconds without loading and 13.08 seconds including loading. In case anyone is curious about the use of Flash Attention, I tried without it and transcription took under 10 seconds, 15.3 including loading.
Another question that's only slightly related, but while we're here...
Using OAI's paid Whisper API, you can give a text prompt to a) set the tone/style of the transcription and b) teach it technical terms, names etc that it might not be familiar with and should expect in the audio to transcribe.
Am I correct that this isn't possible with any released versions of Whisper, or is there a way to do it on my machine that I'm not aware of?
You can definitely do this with the open source version. Many transcription implementations use it to maintain context between the max-30-second chunks Whisper natively supports.
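To make that concrete, here's a minimal sketch using the open-source `openai-whisper` package's `initial_prompt` parameter, which seeds the decoder much like the paid API's prompt field (the model size and file path are placeholders):

```python
def transcribe_with_prompt(audio_path: str, prompt: str) -> str:
    """Transcribe an audio file, biasing the model with a text prompt.

    Words in `prompt` (jargon, names, preferred spellings) are fed to the
    decoder as context before the audio, so the model is more likely to
    produce them in the output.
    """
    import whisper  # pip install openai-whisper

    model = whisper.load_model("base")  # placeholder; pick any model size
    result = model.transcribe(audio_path, initial_prompt=prompt)
    return result["text"]
```

For example, `transcribe_with_prompt("meeting.wav", "Kubernetes, etcd, kubelet")` should make the model far less likely to mangle those terms.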
I'll try to understand some of how stuff like faster-whisper works when I've got time over the weekend, but I fear it may be too complex for me...
I was rather hoping for a guide on how to adapt either classic Whisper usage or one of the optimised variants like faster-whisper (which I've just set up in a Docker container, but that used up all the time I had for playing around right now) to take a text prompt along with the audio file.
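For what it's worth, faster-whisper exposes the same knob: its `transcribe()` method accepts an `initial_prompt` argument, so no adaptation of the library itself is needed. A rough sketch, assuming a CUDA GPU and with the model/path names as placeholders:

```python
def transcribe_with_faster_whisper(audio_path: str, prompt: str) -> str:
    """Transcribe with faster-whisper, passing a text prompt for context."""
    from faster_whisper import WhisperModel  # pip install faster-whisper

    # float16 on CUDA is the usual choice for a 4090; adjust to taste
    model = WhisperModel("large-v2", device="cuda", compute_type="float16")

    # transcribe() returns a generator of segments plus an info object;
    # initial_prompt biases the decoder toward the supplied vocabulary
    segments, _info = model.transcribe(audio_path, initial_prompt=prompt)
    return " ".join(segment.text.strip() for segment in segments)
```

The same `initial_prompt` string you'd send to the paid API should drop straight in here.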
Cheers, I've been wanting to do something with my 4090 other than multi-monitor simulator gaming and quad-screen workstation work - and this will get me started!
The 4090 is an absolute beast: it runs extremely quiet and simply powers through everything. DCS pushes it to the limit, but the resulting experience is simply stunning. Mine's paired with a 7800X3D, which uses hardly any power at all - absolutely love it.
If you're looking for something easy to try out, try my early demo that hooks Whisper to an LLM and TTS so you can have a real time speech conversation with your local GPU that feels like talking to a person! It's way faster than ChatGPT: https://apps.microsoft.com/detail/9NC624PBFGB7
I just can't get it to work, it errors out with 'NotImplementedError: The model type whisper is not yet supported to be used with BetterTransformer.' Did you happen to run into this problem?
Sorry, I didn't encounter that error. It worked on the first try for me. I have wished many times that the ML community didn't settle on Python for this reason...