- I saw the demo with command recognition. How do you trigger events after a command? Is there an intent handler model built in?
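  For illustration, there is no built-in intent handler as far as I can tell - the `command` example just matches transcribed text against a list of allowed commands. Below is a minimal sketch of one way to fire events from recognized text, using the `whisper_full` C API plus a hypothetical command-to-action table (the table and function names are illustrative, not part of whisper.cpp):

  ```cpp
  // Illustrative sketch only - whisper.cpp has no built-in intent handler.
  // Assumes `ctx` holds a loaded model and `pcmf32` holds 16 kHz mono audio.
  #include <functional>
  #include <map>
  #include <string>
  #include <vector>
  #include "whisper.h"

  void dispatch_commands(struct whisper_context * ctx, const std::vector<float> & pcmf32) {
      // Hypothetical command -> action table (user code, not whisper.cpp)
      static const std::map<std::string, std::function<void()>> handlers = {
          { "lights on",  [] { /* trigger your event here */ } },
          { "lights off", [] { /* trigger your event here */ } },
      };

      whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
      if (whisper_full(ctx, params, pcmf32.data(), (int) pcmf32.size()) != 0) {
          return;
      }

      // Concatenate the transcribed segments and look for a known command
      std::string text;
      for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
          text += whisper_full_get_segment_text(ctx, i);
      }
      for (const auto & [cmd, action] : handlers) {
          if (text.find(cmd) != std::string::npos) {
              action(); // fire the event for the matched command
          }
      }
  }
  ```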
- Hi! I built the M1 version following the instructions, but my stream is being translated from the original language into English and I can't switch that off. Any ideas how to do this?
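  For reference, translation to English is controlled by the `translate` flag of `whisper_full_params` (exposed as `--translate` in the `main` example). A minimal sketch of keeping the output in the source language - `ctx` and `pcmf32` are assumed to be set up by the caller:

  ```cpp
  // Minimal sketch: keep transcription in the source language.
  // Assumes `ctx` is a loaded model and `pcmf32` holds 16 kHz mono samples.
  #include <vector>
  #include "whisper.h"

  void transcribe_no_translate(struct whisper_context * ctx, const std::vector<float> & pcmf32) {
      whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
      params.translate = false;   // do not translate to English
      params.language  = "auto";  // or pin the source language explicitly, e.g. "de"
      whisper_full(ctx, params, pcmf32.data(), (int) pcmf32.size());
  }
  ```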
- Can you make an example of word-level timestamps in SwiftUI? I'm totally lost in C++.
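  For what it's worth, here is a hedged C++ sketch of reading per-token ("word-level") timestamps through the C API - a SwiftUI app would call these same `whisper.h` functions through the bridging header, as the `whisper.swiftui` example does. `token_timestamps` is marked experimental in `whisper.h`, and timestamps come back in 10 ms units:

  ```cpp
  // Sketch: per-token timestamps via the whisper.cpp C API.
  // Assumes `ctx` is a loaded model and `pcmf32` holds 16 kHz mono samples.
  #include <cstdio>
  #include <vector>
  #include "whisper.h"

  void print_token_timestamps(struct whisper_context * ctx, const std::vector<float> & pcmf32) {
      whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
      params.token_timestamps = true; // enable experimental per-token timestamps

      if (whisper_full(ctx, params, pcmf32.data(), (int) pcmf32.size()) != 0) {
          return;
      }

      for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
          for (int j = 0; j < whisper_full_n_tokens(ctx, i); ++j) {
              const whisper_token_data t = whisper_full_get_token_data(ctx, i, j);
              // t.t0 / t.t1 are start/end times in centiseconds
              printf("[%.2f -> %.2f] %s\n", t.t0 / 100.0, t.t1 / 100.0,
                     whisper_full_get_token_text(ctx, i, j));
          }
      }
  }
  ```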
- Is there any current or planned method of setting up a file to use as an additional dictionary?
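  Not a dictionary file as such, but one workaround available today is the `initial_prompt` field of `whisper_full_params`, which biases the decoder toward the vocabulary it contains. A hedged sketch (you could read the terms from your own file and pass them in):

  ```cpp
  // Hedged sketch: biasing recognition toward custom vocabulary via the
  // initial prompt - there is no separate dictionary-file mechanism.
  #include <vector>
  #include "whisper.h"

  void transcribe_with_vocab_hint(struct whisper_context * ctx, const std::vector<float> & pcmf32) {
      whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
      // Terms listed here make related tokens more likely during decoding
      params.initial_prompt = "ggml, whisper.cpp, cuBLAS, diarization";
      whisper_full(ctx, params, pcmf32.data(), (int) pcmf32.size());
  }
  ```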
Updated roadmap as a GitHub Project: https://github.com/users/ggerganov/projects/7

Roadmap (old)

In decreasing priority:

- Decoding strategies
  - Try to achieve at least parity with the OpenAI implementation
  - Target release: v1.1.0
- Memory usage reduction
  - This will allow wider application on low-memory devices. It should be possible to cut memory usage in half with a few simple changes in `ggml`
  - Target release: v1.2.0
- Core ML support
  - This will allow utilizing the Apple Neural Engine for very efficient inference of the model
  - Target release: v1.3.0
- Q4 / Q5 / Q8 integer quantization
  - Added via Integer quantisation support #540
  - Target release: v1.4.0
- GPU support
  - Partial CUDA support via cuBLAS
  - Target release: v1.4.0
- Documentation of `ggml`
  - Hopefully this leads to more contributions
- Diarization
  - A highly requested feature, but very difficult to achieve. Interesting to explore and experiment with
- Low-power mode
  - This also has the potential for some performance improvements
F.A.Q.

Is `whisper.cpp` faster or slower than PyTorch on the CPU?

The performance should be comparable. At the time of writing, the performance on Apple Silicon is better with `whisper.cpp`, since we utilise FP16 + the Accelerate framework, while PyTorch does not yet. But this will soon change. In general, it is not easy to make a proper benchmark between the two implementations. For more information, read the following comment: Benchmark results #89 (comment)
Should I use `whisper.cpp` in my project?

`whisper.cpp` is a hobby project. It does not strive to provide a production-ready implementation. The main goals of the implementation are to be educational, minimalistic, portable, hackable and performant. There are no guarantees that the implementation is correct and bug-free, and things can break at any point in the future. Support and updates will depend mostly on contributions, since with time I will move on and won't dedicate too much time to the project.

If you plan to use `whisper.cpp` in your own project, keep the above in mind. My advice is to not put all your eggs into the `whisper.cpp` basket.

How can I contribute?
Will `ggml` / `whisper.cpp` support CUDA / GPU?

One of the main goals of this implementation is to be very minimalistic and able to run on a large spectrum of hardware. The existing CPU-only implementation achieves this goal - it is bloat-free and very simple. I think it also has some educational value. Of course, not taking advantage of modern GPU hardware is a huge drawback in terms of performance. However, adding a dependency on a certain GPU framework would tie the project to the corresponding hardware and introduce some extra complexity.

With that said, adding GPU support to the project is low priority.

In any case, it would not be too difficult to add initial support. The main thing that needs to be offloaded to the GPU is the `GGML_OP_MUL_MAT` operator (see `whisper.cpp/ggml.c`, lines 6231 to 6234 in c71363f). This is where more than 90% of the computation time is currently spent. Also, I don't think it's necessary to offload the entire model to the GPU. For example, the 2 convolution layers at the start of the Encoder can easily remain on the CPU, as they are not very computationally heavy. Not uploading the full model to VRAM will make it require less memory and thus make it compatible with more video cards.
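To make the idea concrete, here is a schematic sketch of what offloading just that one operator could look like. The backend hooks below are hypothetical stand-ins, not the actual `ggml` dispatch code:

```cpp
// Schematic sketch only - the backend functions are hypothetical stubs,
// not ggml's API. The idea: route the single heavy operator to the GPU
// while every other operator keeps running on the CPU.
#include "ggml.h"

// Hypothetical backend hooks - stand-ins for a real GPU kernel launch
static void gpu_mul_mat (struct ggml_tensor * node) { /* launch GPU kernel */ }
static void cpu_fallback(struct ggml_tensor * node) { /* existing CPU path  */ }

static void compute_forward(struct ggml_tensor * node) {
    if (node->op == GGML_OP_MUL_MAT) {
        // >90% of inference time is spent here, so this one op is
        // the only thing worth offloading initially
        gpu_mul_mat(node);
    } else {
        // the Encoder's convolutions etc. are cheap enough to stay on the CPU
        cpu_fallback(node);
    }
}
```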
Another candidate GPU framework that will likely be supported in the future is Apple's Metal Performance Shaders (MPS). Currently, `ggml` supports Apple's Accelerate framework and I really like how seamlessly it integrates into the project - both on macOS and iOS. It does not feel like a third-party dependency at all, and that is why I think MPS support can be added in a similar way. In theory, it will help utilize the GPU on Apple devices and could potentially lead to some additional performance improvement. Also, the unified memory model of modern Apple Silicon devices allows seamless sharing of the model weights and the data embeddings between the CPU and the GPU, which is not the case for CUDA. So far, my initial experiments haven't shown any benefit from using MPS for the transformer inference, but maybe some more work is needed.

Edit: Sample Metal support has been demonstrated here: Metal support #127
Edit 2: There is promising CUDA support in the works through NVBLAS: Experiments with GPU CUDA acceleration...sort of #220
Edit 3: CUDA support via cuBLAS has been added: Add CUDA support via cuBLAS #834
Notable `whisper.cpp` discussions