Variable audio_ctx consistently produces ~3x speedup for short audio clips #1855

Closed
dscripka opened this issue Feb 10, 2024 · 3 comments

Comments

@dscripka
Contributor

dscripka commented Feb 10, 2024

When transcribing short audio clips (under 30 seconds, and usually 5-10 seconds), I've noticed that setting the audio_ctx parameter appropriately can greatly improve performance. After some experimentation, scaling audio_ctx with the length of the audio clip seems to work quite well. I have been using the following formula, chosen somewhat arbitrarily: audio_ctx = (audio length in seconds / 30 seconds) * 1500 + 128.
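As a minimal sketch, the formula above could be computed like this. The function name, the rounding, and the clamp to 1500 are assumptions made for illustration and are not part of the original experiments; the constants 30, 1500, and 128 come from the formula itself.

```python
FULL_AUDIO_CTX = 1500   # full encoder context used in the formula
WINDOW_SECONDS = 30.0   # audio window the full context corresponds to
MARGIN = 128            # extra positions added on top of the scaled value


def scaled_audio_ctx(audio_seconds: float) -> int:
    """Return (audio_seconds / 30) * 1500 + 128 as an integer.

    Hypothetical helper: rounding to the nearest integer and clamping
    to the full context are assumptions, not something the formula states.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    ctx = round(audio_seconds * FULL_AUDIO_CTX / WINDOW_SECONDS) + MARGIN
    # Assumption: never ask for more than the full context.
    return min(ctx, FULL_AUDIO_CTX)


if __name__ == "__main__":
    for seconds in (6.0, 10.0, 30.0):
        print(seconds, scaled_audio_ctx(seconds))
```

The returned value is what would be supplied as audio_ctx for that clip; for clips approaching 30 seconds the clamp simply falls back to the default of 1500.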

To confirm these observations I did some comparisons using clips from the Common Voice dataset, the base.en model, and this hardware configuration:

CPU: Intel i7-11700K
RAM: DDR4 - 3200 MHz

whisper.cpp build details:

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   147.46 MB (1 buffers)
whisper_model_load: model size    =  147.37 MB
whisper_init_state: kv self size  =   16.52 MB
whisper_init_state: kv cross size =   18.43 MB
whisper_init_state: compute buffer (conv)   =   16.17 MB
whisper_init_state: compute buffer (encode) =   94.42 MB
whisper_init_state: compute buffer (cross)  =    5.08 MB
whisper_init_state: compute buffer (decode) =  105.96 MB

system_info: n_threads = 2 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 

These are the results I'm seeing for 200 random clips from Common Voice (average length ~5.7 seconds):

| WER | Beam Size | Total Time (seconds) | Threads | audio_ctx |
|---|---|---|---|---|
| 20.06 | 1 | 204 | 2 | 1500 (default) |
| 19.2 | 1 | 60 | 2 | (audio_length/30)*1500 + 128 |

I also see similar results with the tiny.en model.

Overall, this seems like a great way to get a ~3-3.5x speedup on CPU with no significant penalty to accuracy when working with shorter audio clips. Might there be some other side effects or downsides that I'm not considering when using audio_ctx in this way?
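For reference, the speedup and WER figures for base.en can be checked with a few lines of arithmetic. This is only a sketch using the numbers reported above (204 s vs. 60 s, WER 20.06 vs. 19.2); the helper names are made up for illustration.

```python
def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """How many times faster the new run is than the baseline."""
    return baseline_seconds / new_seconds


def relative_change(baseline: float, new: float) -> float:
    """Signed relative change of a metric (negative means it went down)."""
    return (new - baseline) / baseline


# Numbers reported for base.en on the 200 Common Voice clips.
base_speedup = speedup(204, 60)
wer_change = relative_change(20.06, 19.2)

print(f"speedup: {base_speedup:.2f}x, relative WER change: {wer_change:.1%}")
```

With these inputs the speedup is 3.4x, and the WER moves down by roughly 4% in relative terms, which on 200 clips may well be within noise.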

@ggerganov
Owner

Nice! The idea of computing the Encoder with partial context was introduced and discussed to some extent here: #137

I often use it for short audio segments on low-end devices such as Raspberry Pis. It's surprising to see the WER being lower with partial context, though. I've never measured it, but I expected the quality to be worse when using a smaller audio_ctx.

@dscripka
Contributor Author

Yes, that was a great thread, and it motivated some of my experiments to better understand the impact of using partial context with the Encoder.

To your point, choosing a value for audio_ctx that is too small can have odd effects on WER and efficiency, which might also depend on the underlying model. Updating the table above with more results:

| Model | WER | Beam Size | Total Time (seconds) | Threads | audio_ctx |
|---|---|---|---|---|---|
| base.en | 20.06 | 1 | 204 | 2 | 1500 (default) |
| base.en | 19.2 | 1 | 60 | 2 | (audio_length/30)*1500 + 128 |
| base.en | 43.03 | 1 | 205 | 2 | 256 |
| tiny.en | 25.81 | 1 | 121 | 2 | 1500 (default) |
| tiny.en | 25.83 | 1 | 40 | 2 | (audio_length/30)*1500 + 128 |
| tiny.en | 38.73 | 1 | 81 | 2 | 256 |
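To put the fixed value of 256 in perspective, here is a rough sketch that converts audio_ctx to seconds of audio and recomputes the speedups from the table. It assumes encoder positions map linearly to time (1500 positions for a 30-second window, as the scaling formula implies); the helper names and the "scaled" label are hypothetical.

```python
WINDOW_SECONDS = 30.0  # audio window assumed to match the full context
FULL_AUDIO_CTX = 1500  # n_audio_ctx reported in the model load log


def seconds_covered(audio_ctx: int) -> float:
    """Approximate seconds of audio spanned by audio_ctx (linear assumption)."""
    return audio_ctx * WINDOW_SECONDS / FULL_AUDIO_CTX


# (model, audio_ctx setting, total time in seconds), copied from the table.
RESULTS = [
    ("base.en", "default", 204),
    ("base.en", "scaled", 60),
    ("base.en", "256", 205),
    ("tiny.en", "default", 121),
    ("tiny.en", "scaled", 40),
    ("tiny.en", "256", 81),
]


def speedups(results):
    """Speedup of each non-default row relative to its model's default row."""
    baseline = {m: t for m, s, t in results if s == "default"}
    return {(m, s): baseline[m] / t for m, s, t in results if s != "default"}


fixed_coverage = seconds_covered(256)
table_speedups = speedups(RESULTS)

print(f"audio_ctx=256 spans about {fixed_coverage:.2f} s of audio")
for (model, setting), factor in table_speedups.items():
    print(f"{model} {setting}: {factor:.2f}x")
```

Under this assumption, 256 positions span about 5.12 seconds, slightly less than the ~5.7 second average clip length. That may be one reason the fixed value hurts WER, but it is a guess rather than something measured here.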

@dscripka
Contributor Author

Since this approach seems generally useful, I created a quick PR (#1857) to make it easier to use the audio_ctx argument across examples.
