Support for other audio formats #1399

fingertrouble · 2023-10-28T23:46:25Z

I've actually gone back to standard Whisper, because not only do I need to convert my podcast files to 16-bit WAV before whisper-cpp can transcribe them, but it also wants a really weird sample rate (it has to be 16 kHz?).

It's just another step, and it blows through any amazing speed increase because I have to faff around with re-exporting or converting the files. Also, my podcast is 2+ hours long, so the WAV files can be massive!

Support for standard 16-bit 44.1 or 48 kHz WAV, M4A/AAC, FLAC and MP3 formats would be really useful.

At the very least MP3, because that's a standard among podcasters.

bobqianic · 2023-10-29T17:39:00Z

Collaborator

Support for standard 16 bit 44 or 48khz WAV, M4A/AAC, Flac and MP3 formats would be really useful.

OpenAI's Whisper accomplishes this by invoking ffmpeg from the command line. I believe we could do something similar in whisper.cpp.
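
For readers who want to see what "invoking ffmpeg from the command line" can look like, here is a minimal Python sketch that asks ffmpeg to decode any input to raw 16-bit, 16 kHz, mono PCM on stdout and turns the bytes into floats. It is modelled on the general approach described above and is not code from whisper.cpp; the function names (`ffmpeg_decode_cmd`, `pcm16_to_float`, `load_audio`) are hypothetical, and it assumes an `ffmpeg` binary is on the PATH.

```python
import array
import subprocess
import sys

SAMPLE_RATE = 16000  # the rate discussed in this thread


def ffmpeg_decode_cmd(path: str, sample_rate: int = SAMPLE_RATE) -> list[str]:
    """Build an ffmpeg command that writes raw mono 16-bit PCM to stdout."""
    return [
        "ffmpeg",
        "-nostdin",              # never wait for keyboard input
        "-threads", "0",
        "-i", path,
        "-f", "s16le",           # raw little-endian 16-bit samples, no WAV header
        "-ac", "1",              # downmix to mono
        "-acodec", "pcm_s16le",
        "-ar", str(sample_rate),
        "-",                     # write to stdout instead of a file
    ]


def pcm16_to_float(raw: bytes) -> list[float]:
    """Convert little-endian 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()       # the stream is little-endian regardless of host
    return [s / 32768.0 for s in samples]


def load_audio(path: str, sample_rate: int = SAMPLE_RATE) -> list[float]:
    """Decode an audio file through ffmpeg (hypothetical helper; needs ffmpeg installed)."""
    result = subprocess.run(
        ffmpeg_decode_cmd(path, sample_rate), capture_output=True, check=True
    )
    return pcm16_to_float(result.stdout)
```

Piping the decoded samples avoids writing a large intermediate WAV file to disk, which matters for multi-hour recordings, though the whole recording is still held in memory.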


ZachNagengast · 2023-12-13T05:18:37Z

Thanks for the suggestion, @bobqianic.

If anyone needs a script that does the conversion and stores the result as a temporary WAV file:

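import os
import subprocess
import tempfile

# Write the converted audio to the system's temporary directory
audio_path = os.path.join(tempfile.gettempdir(), "temp.wav")

# Run ffmpeg to convert the input to 16-bit, 16 kHz, mono WAV
cmd = [
    "ffmpeg",
    "-nostdin",
    "-y",  # overwrite temp.wav if it already exists (ffmpeg cannot prompt with -nostdin)
    "-threads", "0",
    "-i", "your_file.flac",
    "-acodec", "pcm_s16le",
    "-ar", "16000",
    "-ac", "1",
    "-f", "wav",
    audio_path,
]

subprocess.run(cmd, stderr=subprocess.DEVNULL, check=True)

# Use the converted file for inference.
# Passing a list (no shell=True) keeps paths with spaces or quotes safe.
command = ["./main", "-m", "models/ggml-base.bin", "-f", audio_path]

result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True)
prediction = result.stdout.strip()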


beaudjango · 2024-03-10T03:22:42Z

I'm finding @fingertrouble's suggestion really useful in a totally different use case. If you are recording audio on an iOS device, you won't be able to change the sample rate from the standard 48 kHz to 16 kHz, so you have to convert on the fly. I'm not deep enough into the process to know whether that costs performance, but I'd guess it does, and I have a feeling it could add latency. I wonder if there could be different models trained at different sample rates? That would be really cool: for example, if there were a Whisper tiny.en that handled 48 kHz input, you could pipe the device's default audio into Whisper more easily.
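
To give a feel for what converting on the fly involves: 48 kHz to 16 kHz is an exact factor of 3, so one common approach is to low-pass filter the signal and then keep every third sample. The pure-Python sketch below illustrates that idea only; the function names and the 63-tap filter length are assumptions chosen for the example, not anything from whisper.cpp or iOS, and a real app would more likely use the platform's own resampler or ffmpeg.

```python
import math


def lowpass_fir(num_taps: int, cutoff: float) -> list[float]:
    """Hamming-windowed sinc low-pass filter.

    cutoff is a fraction of the input sample rate (0 < cutoff < 0.5).
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response; the x == 0 case is the sinc limit
        h = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        # Hamming window to tame the ripple caused by truncating the sinc
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(h * w)
    total = sum(taps)
    return [t / total for t in taps]  # normalize so a constant signal passes unchanged


def downsample_48k_to_16k(samples: list[float], num_taps: int = 63) -> list[float]:
    """Low-pass filter, then keep every third sample (hypothetical helper)."""
    factor = 3
    # 0.15 * 48 kHz = 7.2 kHz, just under the new Nyquist limit of 8 kHz
    taps = lowpass_fir(num_taps, cutoff=0.15)
    half = num_taps // 2
    out = []
    # Only compute the outputs that are kept, so the filter runs at 16 kHz
    for i in range(0, len(samples), factor):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + k - half
            if 0 <= j < len(samples):  # treat samples beyond the ends as silence
                acc += t * samples[j]
        out.append(acc)
    return out
```

With these assumed settings, each second of audio needs about 16,000 × 63 ≈ 1 million multiply-adds, and the filter adds a fixed delay of 31 input samples (under a millisecond at 48 kHz). Whether that matters in practice would need measuring on the device.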

1 reply

slavanorm Oct 29, 2024

Collecting enough 48 kHz audio to train a model is less trivial than collecting it at the usual 44.1 kHz or at lower sample rates.
I'd think again about implementing a pipeline with ffmpeg for your project.
