Variable audio_ctx consistently produces ~3x speedup for short audio clips #1855

dscripka · 2024-02-10T21:34:46Z

When transcribing short audio clips (e.g., <30 seconds, and usually between 5-10 seconds), I've noticed that the audio_ctx parameter can greatly increase performance when set appropriately. After some experimentation, it seems that using the length of the audio clip to scale the value of audio_ctx works quite well. I have been using audio_ctx = (audio length in seconds/30 seconds)*1500 + 128 somewhat arbitrarily.
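The scaling rule above can be sketched as a small helper. This is only an illustration of the arithmetic in the post: the function name `scaled_audio_ctx`, the rounding up, and the clamp to the model's full context of 1500 are assumptions added here, not something the post specifies.

```python
import math

N_AUDIO_CTX_MAX = 1500   # full encoder context, corresponding to a 30 s window
WINDOW_SECONDS = 30.0    # length of the audio window the full context covers
MARGIN = 128             # the "somewhat arbitrary" padding from the post


def scaled_audio_ctx(audio_seconds: float) -> int:
    """Scale audio_ctx with clip length: (length / 30) * 1500 + 128.

    Assumptions (not from the post): the fractional part is rounded up,
    and the result is clamped so it never exceeds the full context.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    # Multiply before dividing to keep the arithmetic exact for typical inputs.
    scaled = math.ceil(audio_seconds * N_AUDIO_CTX_MAX / WINDOW_SECONDS)
    return min(scaled + MARGIN, N_AUDIO_CTX_MAX)


if __name__ == "__main__":
    for seconds in (3.0, 6.0, 10.0, 30.0):
        print(f"{seconds:5.1f} s -> audio_ctx = {scaled_audio_ctx(seconds)}")
```

With this sketch, a 6-second clip maps to an audio_ctx of 428, and anything close to 30 seconds falls back to the default of 1500.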

To confirm these observations I did some comparisons using clips from the Common Voice dataset, the base.en model, and this hardware configuration:

CPU: Intel i7-11700K
RAM: DDR4 - 3200 MHz

whisper.cpp build details:

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   147.46 MB (1 buffers)
whisper_model_load: model size    =  147.37 MB
whisper_init_state: kv self size  =   16.52 MB
whisper_init_state: kv cross size =   18.43 MB
whisper_init_state: compute buffer (conv)   =   16.17 MB
whisper_init_state: compute buffer (encode) =   94.42 MB
whisper_init_state: compute buffer (cross)  =    5.08 MB
whisper_init_state: compute buffer (decode) =  105.96 MB

system_info: n_threads = 2 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 |

These are the results I'm seeing for 200 random clips from Common Voice (average length ~5.7 seconds):

WER	Beam Size	Total Time (seconds)	Threads	audio_ctx
20.06	1	204	2	1500 (default)
19.2	1	60	2	(audio_length/30)*1500 + 128

I also see similar results with the tiny.en model.

Overall, this seems like a great way to get a ~3-3.5x speedup on CPU with no significant penalty to accuracy when working with shorter audio clips. Might there be some other side effects or downsides that I'm not considering when using audio_ctx in this way?

ggerganov · 2024-02-11T15:00:57Z

Nice! The idea to compute the Encoder with partial context was introduced and discussed to some extent here: #137

I often use it for short audio segments on low-end devices such as Raspberry Pis. It's surprising to see the WER being lower with partial context though - I've never measured it, but I expected the quality to be worse with a smaller audio_ctx.

dscripka · 2024-02-11T16:06:11Z

Yes, that was a great thread, and it motivated some of my experiments to better understand the impact of using partial context with the Encoder.

To your point, choosing a value for audio_ctx that is too small can have odd impacts on WER and efficiency, which might also be dependent on the underlying model. Updating the table above with more results:

Model	WER	Beam Size	Total Time (seconds)	Threads	audio_ctx
base.en	20.06	1	204	2	1500 (default)
base.en	19.2	1	60	2	(audio_length/30)*1500 + 128
base.en	43.03	1	205	2	256
tiny.en	25.81	1	121	2	1500 (default)
tiny.en	25.83	1	40	2	(audio_length/30)*1500 + 128
tiny.en	38.73	1	81	2	256
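To make the numbers in the table easier to compare, here is a rough sketch that computes the speedup of each row against its model's default run, plus the amount of audio a given audio_ctx spans. It assumes the same linear mapping as the scaling formula (1500 positions for 30 seconds, so 20 ms per position); the helper names are hypothetical. Under that assumption, a fixed audio_ctx of 256 spans about 5.12 seconds, which is shorter than the ~5.7 second average clip. That might be related to the WER jump in the 256 rows, but the thread does not confirm the cause.

```python
FRAMES_PER_WINDOW = 1500   # default audio_ctx
WINDOW_SECONDS = 30.0      # audio spanned by the default audio_ctx


def covered_seconds(audio_ctx: int) -> float:
    """Seconds of audio spanned by audio_ctx, assuming a linear mapping."""
    return audio_ctx * WINDOW_SECONDS / FRAMES_PER_WINDOW


def speedup(baseline_seconds: float, run_seconds: float) -> float:
    """How many times faster a run is than the baseline run."""
    return baseline_seconds / run_seconds


# (model, audio_ctx setting, total time in seconds), copied from the table above.
RESULTS = [
    ("base.en", "default", 204),
    ("base.en", "scaled", 60),
    ("base.en", "256", 205),
    ("tiny.en", "default", 121),
    ("tiny.en", "scaled", 40),
    ("tiny.en", "256", 81),
]

if __name__ == "__main__":
    baselines = {m: t for m, setting, t in RESULTS if setting == "default"}
    for model, setting, total in RESULTS:
        print(f"{model:8s} {setting:8s} {speedup(baselines[model], total):.2f}x")
    print(f"audio_ctx=256 spans {covered_seconds(256):.2f} s of audio")
```

The scaled rows work out to roughly 3.4x for base.en and 3.0x for tiny.en, while the fixed 256 setting gives no speedup at all for base.en.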

dscripka · 2024-02-11T16:41:28Z

Since this approach seems generally useful, I created a quick PR (#1857) to make it easier to use the audio_ctx argument across examples.

dscripka mentioned this issue Feb 11, 2024

added audio_ctx argument to main and server examples #1857

Merged

dscripka closed this as completed Feb 11, 2024

dev-msp added a commit to dev-msp/untitled-voice-assistant that referenced this issue Apr 11, 2024

Set audio_ctx according to recommendation in whisper.cpp repo

c4fd60f

ggerganov/whisper.cpp#1855

dscripka mentioned this issue May 25, 2024

Add support for quantization and custom audio context size to OpenVino #2184

Open
