The strategies for reliable long-form transcription in whisper.cpp differ from OpenAI's Whisper
#1461 · bobqianic started this conversation in Show and tell
Replies: 2 comments 4 replies
-
The reason the temperature increment is 0.4 is that processing is faster when the fallback triggers. After we add efficient batched decoding, we will reduce it to 0.2. We don't use
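For readers unfamiliar with the fallback mechanism being discussed: the decoder is retried at progressively higher temperatures until the output passes quality checks, so a larger increment means fewer retries (faster) at the cost of coarser temperature steps. A minimal sketch of that loop, where `decode` is a hypothetical callable and the thresholds mirror the defaults described in the Whisper paper (not whisper.cpp's actual implementation):

```python
def transcribe_with_fallback(
    decode,
    temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # whisper.cpp currently steps by 0.4
    compression_ratio_threshold=2.4,
    logprob_threshold=-1.0,
):
    """Retry decoding at increasing temperature until quality checks pass.

    `decode(t)` is a hypothetical callable returning
    (text, avg_logprob, compression_ratio) for temperature t.
    """
    result = None
    for t in temperatures:
        result = decode(t)
        text, avg_logprob, ratio = result
        # Accept the result if the text is not suspiciously repetitive
        # and the model was not too uncertain on average.
        if ratio <= compression_ratio_threshold and avg_logprob >= logprob_threshold:
            break  # decoding looks sane; stop falling back
    return result
```

With a 0.4 step the schedule would be (0.0, 0.4, 0.8), so a failing segment is re-decoded at most twice instead of five times.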
-
Hey @ggerganov and @Artoria2e5, just coming across this thread. Not sure what the current thinking is, or the timeline for implementation?
-
I suddenly wanted to take a closer look at the OpenAI Whisper paper, and one section that caught my attention is the one I highlighted in yellow. I then checked the whisper.cpp code and found two main discrepancies: the size of the temperature increment, and the method of calculating the compression ratio. The temperature in whisper.cpp increases by 0.4 each step instead of the 0.2 mentioned in the paper. Additionally, whisper.cpp uses entropy as a substitute for the gzip compression ratio, while OpenAI Whisper actually compresses the text and calculates the real gzip compression ratio. @ggerganov

Temperature:
whisper/transcribe.py
whisper.cpp/whisper.cpp (line 3833 in 0de8582)
whisper.cpp/whisper.cpp (lines 4545 to 4554 in 0de8582)

Gzip compression ratio:
whisper/utils.py
whisper.cpp/whisper.cpp (lines 4326 to 4372 in 0de8582)
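For context on the second point, the compression-ratio check in OpenAI Whisper is along these lines: the decoded text is compressed with zlib, and the ratio of raw to compressed byte length is used as a repetition detector, since degenerate looping output compresses extremely well. A sketch of that computation (a simplification of what `whisper/utils.py` does, not whisper.cpp's entropy-based substitute):

```python
import zlib

def compression_ratio(text: str) -> float:
    """Ratio of raw byte length to zlib-compressed length.

    Highly repetitive text (a common failure mode of greedy decoding)
    compresses well and therefore yields a high ratio; a threshold on
    this value (2.4 in the paper) triggers the temperature fallback.
    """
    text_bytes = text.encode("utf-8")
    return len(text_bytes) / len(zlib.compress(text_bytes))
```

For example, a looping output like `"okay okay okay ..."` scores far above 2.4, while normal prose stays well below it. An entropy estimate over the token distribution can approximate this without pulling in a compression library, which is presumably why whisper.cpp substitutes it, but the two measures are not numerically interchangeable, so the paper's 2.4 threshold does not carry over directly.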