Fw compliance #857

Status: Closed. 64 commits.

Commits
fc54cb9 seed, multilingual and fixes (Jiltseb, Jun 9, 2023)
84d58fa added languages in tokenizer (Jiltseb, Jun 14, 2023)
63bea66 multilingual fixes (Jiltseb, Jun 21, 2023)
b95d694 vocabulary extension fix for downloads (Jiltseb, Jun 21, 2023)
a8626bb code fixes for multilingual (Jiltseb, Jun 28, 2023)
c2ca8d4 Squash long words at window and sentence boundaries (Jiltseb, Jul 4, 2023)
9edf960 added commits specifying changes to original package (Jiltseb, Jul 26, 2023)
d008650 seed, multilingual and fixes (Jiltseb, Jun 9, 2023)
2573982 added languages in tokenizer (Jiltseb, Jun 14, 2023)
8add326 multilingual fixes (Jiltseb, Jun 21, 2023)
afc3f5c vocabulary extension fix for downloads (Jiltseb, Jun 21, 2023)
dd55c03 code fixes for multilingual (Jiltseb, Jun 28, 2023)
d34780e Squash long words at window and sentence boundaries (Jiltseb, Jul 4, 2023)
9fab8d9 added commits specifying changes to original package (Jiltseb, Jul 26, 2023)
162fbf0 modifications based on review (Jiltseb, Jul 28, 2023)
ca6a2ba removed LANGUAGES from tokenizer and added numpy requirements (Jiltseb, Oct 6, 2023)
0df6953 Merge remote-tracking branch 'upstream/master' (Jiltseb, Oct 9, 2023)
988c528 Merge local master to 'updated_js_v2.1' (Jiltseb, Oct 9, 2023)
443eb86 Merge pull request #1 from mobiusml/js_asr_v2.1_pr (Jiltseb, Oct 9, 2023)
6a51407 Update requirements.txt (Jiltseb, Oct 9, 2023)
4138e16 Merge pull request #2 from SYSTRAN/master (Jiltseb, Dec 12, 2023)
b906a98 changes to README.md (Jiltseb, Dec 13, 2023)
0464122 Added BatchedInferencePipeline (Jiltseb, Dec 13, 2023)
78b5cd7 Added language detection from multiple segments and batched inference… (Jiltseb, Dec 13, 2023)
f397e37 added additional packages (Jiltseb, Dec 13, 2023)
83895ac changes to batched inference based on the review (Jiltseb, Dec 20, 2023)
e1c1699 change in silence detection (Jiltseb, Dec 21, 2023)
b516bc8 Merge pull request #3 from mobiusml/batched_asr (Jiltseb, Dec 22, 2023)
3477d86 Merge pull request #4 from SYSTRAN/master (Jiltseb, Jan 22, 2024)
95df9eb added logic for torchaudio based feature extraction (Jiltseb, Jan 23, 2024)
0cc2d1d added requirements (Jiltseb, Jan 23, 2024)
d6624ff added feature extraction in README (Jiltseb, Jan 23, 2024)
fa69694 Merge pull request #5 from mobiusml/add_new_feat_extract (Jiltseb, Jan 23, 2024)
6698a9a removing unwanted dataclasses and non-generator transcribe function, … (Jiltseb, Mar 19, 2024)
1b6376f Merge remote-tracking branch systran/faster_whisper 'upstream/master'… (Jiltseb, Mar 19, 2024)
92867e3 uses same type annotation as faster_whisper for batched transcribe, c… (Jiltseb, Mar 25, 2024)
8452cf2 added jsons for dict conversion (Jiltseb, Mar 25, 2024)
4535963 made vad_segments as optional parameter, modified docstring (Jiltseb, Mar 25, 2024)
95671d2 made default batched asr options optional as this can be taken care d… (Jiltseb, Mar 25, 2024)
5fa21b8 Merge pull request #7 from mobiusml/fixes_and_update (Jiltseb, Mar 26, 2024)
b421086 Update requirements.txt (Jiltseb, Mar 26, 2024)
16d54e5 Update requirements.txt (Jiltseb, Mar 26, 2024)
827df36 Update requirements.txt (Jiltseb, Mar 27, 2024)
911c62d Update requirements.txt (Jiltseb, Mar 27, 2024)
fcf8519 merging with systran fw (Jiltseb, Apr 8, 2024)
e288337 adding vad model and defaults for language detection (Jiltseb, Apr 8, 2024)
9c85222 adding utility functions for vad model (Jiltseb, Apr 8, 2024)
21f4640 add pyannote dependency (Jiltseb, Apr 8, 2024)
eff5e23 adding VAD model, tests and update README (Jiltseb, Apr 9, 2024)
caaa593 update requirements (Jiltseb, Apr 10, 2024)
538366b Merge pull request #8 from mobiusml/fw_pr (Jiltseb, Apr 11, 2024)
c41e4f2 added 'use_vad_model' to better handle vad segments (Jiltseb, Apr 12, 2024)
0e8fa00 Update error message (Jiltseb, Apr 12, 2024)
0d6c62e Merge pull request #9 from mobiusml/fw_pr (Jiltseb, Apr 12, 2024)
56d68a1 added gpu implementation for vad by default (Jiltseb, Apr 28, 2024)
2812d99 adding a vad_device, modifying vad_url (Jiltseb, Apr 29, 2024)
1cd3c60 adding get_device function (Jiltseb, Apr 29, 2024)
3f27636 Merge pull request #10 from mobiusml/fw_pr_compliance (Jiltseb, Apr 29, 2024)
93c327d updating the fork (Jiltseb, May 17, 2024)
2152d11 Merge remote-tracking branch 'upstream/master' into pr_expt (Jiltseb, May 22, 2024)
10242fc updated version, credits to whisper-x, model made optional (Jiltseb, May 22, 2024)
2dde3c9 Merge branch 'master' into fw_compliance (Jiltseb, May 22, 2024)
8fd2ec0 Merge pull request #11 from mobiusml/fw_compliance (Jiltseb, May 24, 2024)
0fd5003 added compatibility for python 3.8 (Jiltseb, May 24, 2024)
README.md (30 additions, 1 deletion)
@@ -1,6 +1,6 @@
 [![CI](https://github.com/SYSTRAN/faster-whisper/workflows/CI/badge.svg)](https://github.com/SYSTRAN/faster-whisper/actions?query=workflow%3ACI) [![PyPI version](https://badge.fury.io/py/faster-whisper.svg)](https://badge.fury.io/py/faster-whisper)

-# Faster Whisper transcription with CTranslate2
+# Mobius Faster Whisper transcription with CTranslate2

 **faster-whisper** is a reimplementation of OpenAI's Whisper model using [CTranslate2](https://github.com/OpenNMT/CTranslate2/), which is a fast inference engine for Transformer models.

@@ -166,6 +166,35 @@ for segment in segments:
segments, _ = model.transcribe("audio.mp3")
segments = list(segments) # The transcription will actually run here.
```

### Multi-segment language detection

To use the model directly for improved language detection, use the following code snippet:

```python
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cuda", compute_type="float16")
language_info = model.detect_language_multi_segment("audio.mp3")
```
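
For reference, here is a minimal sketch of consuming the result. It assumes the method returns a dict-like object containing the detected language and a confidence score; the key names below are illustrative, so check the actual return value of `detect_language_multi_segment` before relying on them:

```python
# Hypothetical keys ("language_code", "language_confidence"): verify them
# against the return value of detect_language_multi_segment in this fork.
print("Detected:", language_info.get("language_code"))
print("Confidence:", language_info.get("language_confidence"))
```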

### Batched faster-whisper


The batched version of faster-whisper is inspired by [whisper-x](https://github.com/m-bain/whisperX), licensed under the BSD-4-Clause license. This product includes software developed by Max Bain. We modified this implementation and also added kaldi-based feature extraction. It improves speed by up to 10-12x compared to the OpenAI implementation and 3-4x compared to the sequential faster-whisper version. It works by transcribing semantically meaningful audio chunks as batches, leading to faster inference.

The following code snippet illustrates how to run inference with the batched version on an example audio file. Please also refer to the test scripts of batched faster-whisper.

```python
from faster_whisper import BatchedInferencePipeline, WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
result = batched_model.transcribe("audio.mp3", batch_size=16)

for segment, info in result:
print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
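
`batch_size` trades throughput for memory: larger batches generally run faster until GPU memory is exhausted. As a hedged sketch (an assumption based on the standard `WhisperModel` API, not something this diff demonstrates), the same pipeline should also run on CPU with the usual faster-whisper options:

```python
from faster_whisper import BatchedInferencePipeline, WhisperModel

# Assumption: BatchedInferencePipeline accepts any WhisperModel instance.
# int8 is the common compute type for CPU inference with faster-whisper.
cpu_model = WhisperModel("medium", device="cpu", compute_type="int8")
batched_cpu = BatchedInferencePipeline(model=cpu_model)
result = batched_cpu.transcribe("audio.mp3", batch_size=8)
```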

### Faster Distil-Whisper

The Distil-Whisper checkpoints are compatible with the Faster-Whisper package. In particular, the latest [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)
faster_whisper/__init__.py (2 additions, 1 deletion)
@@ -1,12 +1,13 @@
 from faster_whisper.audio import decode_audio
-from faster_whisper.transcribe import WhisperModel
+from faster_whisper.transcribe import BatchedInferencePipeline, WhisperModel
 from faster_whisper.utils import available_models, download_model, format_timestamp
 from faster_whisper.version import __version__

 __all__ = [
     "available_models",
     "decode_audio",
     "WhisperModel",
+    "BatchedInferencePipeline",
     "download_model",
     "format_timestamp",
     "__version__",
faster_whisper/feature_extractor.py (41 additions, 14 deletions)
@@ -1,4 +1,6 @@
 import numpy as np
+import torch
+import torchaudio.compliance.kaldi as ta_kaldi


 # Adapted from https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/feature_extraction_whisper.py # noqa: E501
@@ -21,6 +23,7 @@ def __init__(
         self.mel_filters = self.get_mel_filters(
             sampling_rate, n_fft, n_mels=feature_size
         )
+        self.n_mels = feature_size

     def get_mel_filters(self, sr, n_fft, n_mels=128, dtype=np.float32):
         # Initialize the weights
@@ -142,29 +145,53 @@ def stft(self, frames, window):
             data[f] = np.fft.fft(fft_signal, axis=0)[:num_fft_bins]
         return data.T

-    def __call__(self, waveform, padding=True, chunk_length=None):
+    def __call__(self, waveform, enable_ta=False, padding=True, chunk_length=None):
         """
         Compute the log-Mel spectrogram of the provided audio, gives similar results
-        whisper's original torch implementation with 1e-5 tolerance.
+        to whisper's original torch implementation within 1e-5 tolerance. Additionally,
+        a faster feature extraction option using kaldi fbank features is available if
+        torchaudio is installed.
         """
+        if enable_ta:
+            waveform = waveform.astype(np.float32)
+
         if chunk_length is not None:
             self.n_samples = chunk_length * self.sampling_rate
             self.nb_max_frames = self.n_samples // self.hop_length

         if padding:
             waveform = np.pad(waveform, [(0, self.n_samples)])

+        if enable_ta:
+            audio = torch.from_numpy(waveform).unsqueeze(0)
+            fbank = ta_kaldi.fbank(
+                audio,
+                sample_frequency=self.sampling_rate,
+                window_type="hanning",
+                num_mel_bins=self.n_mels,
+            )
+            log_spec = fbank.numpy().T.astype(np.float32)  # CTranslate2 does not accept float64
+
+            # normalize
+
+            # Audioset values as default mean and std for audio
+            mean_val = -4.2677393
+            std_val = 4.5689974
+            scaled_features = (log_spec - mean_val) / (std_val * 2)
+            log_spec = scaled_features
+
+        else:
+            window = np.hanning(self.n_fft + 1)[:-1]
+
+            frames = self.fram_wave(waveform)
+            stft = self.stft(frames, window=window)
+            magnitudes = np.abs(stft[:, :-1]) ** 2
+
+            filters = self.mel_filters
+            mel_spec = filters @ magnitudes
+
+            log_spec = np.log10(np.clip(mel_spec, a_min=1e-10, a_max=None))
+            log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
+            log_spec = (log_spec + 4.0) / 4.0

         return log_spec
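
Outside the diff, a minimal sketch of exercising both extraction paths; this assumes the module keeps faster-whisper's usual constructor defaults (80 mel bins, 16 kHz sampling rate) and that `torch` and `torchaudio` are installed:

```python
import numpy as np

from faster_whisper.feature_extractor import FeatureExtractor

# Assumes the default constructor arguments (feature_size=80, sampling_rate=16000).
extractor = FeatureExtractor()
waveform = np.random.randn(16000).astype(np.float32)  # 1 second of noise as a stand-in

mel_numpy = extractor(waveform)                  # original NumPy STFT path
mel_kaldi = extractor(waveform, enable_ta=True)  # torchaudio kaldi fbank path

# The two paths normalize differently (log10 clipping vs. Audioset mean/std),
# so their outputs are similar in shape but not numerically interchangeable.
print(mel_numpy.shape, mel_kaldi.shape)
```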