Windowing heuristic #161

tom-huntington · 2022-11-20T22:02:11Z


It seems like you simply stride by the window length to produce the segments:

Lines 2917 to 2918 in 2065572

    
    const int start_samples = offset_samples + (i + 1)*n_samples_per_processor;
    const int n_samples_cur = (i == n_processors - 2) ? n_samples - start_samples : n_samples_per_processor;

It seems like this won't handle words split across segments very well.

ggerganov · 2022-11-21T15:54:49Z

Maintainer

The code you quoted belongs to what I call "processors". Someone requested this feature to split the audio into chunks and process the chunks separately using a single model in memory. The hope was that this approach would bring a benefit on multi-core server machines. See the following PR for more info: #110
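To make the chunking concrete, here is a minimal sketch of that kind of split. It is not the code from the repository: `Chunk` and `split_for_processors` are hypothetical names, and it assumes `n_samples_per_processor` is the integer quotient of the remaining samples by the number of processors, with the last chunk absorbing the remainder. The quoted lines index from `i + 1`, presumably because the first chunk is handled separately; the sketch simply enumerates all chunks from 0.

```cpp
#include <vector>

// Hypothetical helper type (not part of whisper.cpp): one contiguous chunk of audio.
struct Chunk {
    int start_samples; // index of the first sample in the chunk
    int n_samples;     // number of samples in the chunk
};

// Split the samples in [offset_samples, n_samples) into n_processors chunks.
// Assumption: every chunk gets the integer quotient of samples, and the last
// chunk takes whatever remains, mirroring the ternary in the quoted lines.
std::vector<Chunk> split_for_processors(int n_samples, int offset_samples, int n_processors) {
    std::vector<Chunk> chunks;
    if (n_processors <= 0 || n_samples <= offset_samples) {
        return chunks; // nothing to split
    }

    const int n_samples_per_processor = (n_samples - offset_samples) / n_processors;

    for (int i = 0; i < n_processors; ++i) {
        const int start = offset_samples + i * n_samples_per_processor;
        const int count = (i == n_processors - 1) ? n_samples - start
                                                  : n_samples_per_processor;
        chunks.push_back({start, count});
    }
    return chunks;
}
```

Note that the chunk boundaries here are fixed in advance and do not depend on the audio content, so a word can indeed straddle two chunks in this mode.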

The actual sliding window logic that you are referring to is implemented here:

whisper.cpp/whisper.cpp

Lines 2670 to 2723 in eab36eb

    
    // very basic greedy sampling strategy:
    //
    //   - always take the most probable token
    //
    // more sophisticated sampling strategies could be implemented here, but we keep it simple
    // feel free to experiment!
    //
    {
        auto token = whisper_sample_best(ctx);

        if (i == 0) {
            token.tid = whisper_token_beg(ctx);
        }

        // timestamp token - update sliding window
        if (token.id > whisper_token_beg(ctx)) {
            seek_delta = 2*(token.id - whisper_token_beg(ctx));
            result_len = i + 1;
        }

        // add it to the context
        prompt.push_back(token.id);
        tokens_cur.push_back(token);

        //{
        //    const auto tt = token.pt > 0.10 ? ctx->vocab.id_to_token[token.tid] : "[?]";
        //    printf("%s: %10s %6.3f '%s'\n", __func__, tt.c_str(), token.pt, ctx->vocab.id_to_token[token.id].c_str());
        //}

        // end of text token
        if (token.id == whisper_token_eot(ctx) || (params.max_tokens > 0 && i > params.max_tokens)) {
            if (result_len == 0) {
                if (seek + seek_delta + 100 >= seek_end) {
                    result_len = i + 1;
                } else {
                    // TODO: figure out how to resolve this
                    fprintf(stderr, "\n%s: failed to generate timestamp token - this should not happen\n\n", __func__);
                }
            }

            if (params.single_segment) {
                result_len = i + 1;
                seek_delta = 100*WHISPER_CHUNK_SIZE;
            }

            break;
        }

        // TESTS: if no tensors are loaded, it means we are running tests
        if (ctx->model.n_loaded == 0) {
            seek_delta = 100*WHISPER_CHUNK_SIZE;
            break;
        }
    }

Basically, we sample the best token, and when the token is a timestamp, we remember it in seek_delta in order to slide the window by that amount.
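As a rough illustration of that update, the sketch below isolates the arithmetic. The function names are hypothetical and `token_beg` stands in for the value returned by `whisper_token_beg(ctx)`. It assumes each timestamp token step represents 20 ms and that `seek` counts 10 ms frames, which is what the factor of 2 and the `100*WHISPER_CHUNK_SIZE` constant in the snippet suggest.

```cpp
#include <vector>

// Hypothetical helper: how far to slide the window for one sampled token.
// token_beg stands in for whisper_token_beg(ctx). IDs above it are treated as
// timestamp tokens, as in the snippet above. Returns 0 for non-timestamp tokens.
int seek_delta_from_token(int token_id, int token_beg) {
    if (token_id > token_beg) {
        // Assumed units: one timestamp step = 20 ms, one seek unit = 10 ms.
        return 2 * (token_id - token_beg);
    }
    return 0;
}

// Hypothetical driver: each entry is the last timestamp token sampled in one
// window. In the snippet above every timestamp token overwrites seek_delta,
// so only the last one in a window determines the shift.
int advance_seek(int seek, const std::vector<int> & last_timestamp_ids, int token_beg) {
    for (const int id : last_timestamp_ids) {
        seek += seek_delta_from_token(id, token_beg);
    }
    return seek;
}
```

Because the window advances only up to the last timestamp the model emitted, rather than by a fixed stride, audio after that timestamp is decoded again at the start of the next window.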


tom-huntington Nov 21, 2022
Author

Wow, so your ggml library actually uses multiple cores within a single forward pass, rather than just running multiple forward passes in parallel.

563 sec to transcribe 1h 30m of audio; I just thought you must be processing segments in parallel.

openai/whisper#208 (comment)
