Adding fix for multi-byte segments in whisper.cpp #734

Merged

raivisdejus
Collaborator

Transcription in Latvian sometimes failed with the error "Failed utf-8 codec can't decode byte 0xc4 in position 0: unexpected end of data". This appears to be the problem referenced in ggerganov/whisper.cpp#1798, where the bytes of a multi-byte UTF-8 character are returned in separate segments and the UTF-8 decoder fails on the incomplete sequence. This PR fixes the issue.
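To illustrate the failure and one way around it, here is a minimal sketch, not the code from this PR. It assumes each segment arrives as a hypothetical `(start, end, raw_bytes)` tuple, and the function name `merge_partial_utf8` is made up for the example. Bytes that do not decode on their own are carried over and joined with the following segment until the buffer is valid UTF-8.

```python
from typing import List, Optional, Tuple

# Hypothetical segment shapes, used only for this illustration.
RawSegment = Tuple[int, int, bytes]   # (start, end, raw bytes)
TextSegment = Tuple[int, int, str]    # (start, end, decoded text)


def merge_partial_utf8(segments: List[RawSegment]) -> List[TextSegment]:
    """Join segments whose bytes only form valid UTF-8 when concatenated."""
    merged: List[TextSegment] = []
    pending = b""
    pending_start: Optional[int] = None

    for start, end, data in segments:
        if pending_start is None:
            pending_start = start  # remember where the buffered run began
        pending += data
        try:
            text = pending.decode("utf-8")
        except UnicodeDecodeError:
            # Incomplete multi-byte character: wait for the next segment.
            continue
        merged.append((pending_start, end, text))
        pending = b""
        pending_start = None

    # Bytes still pending here never became valid UTF-8 and are dropped.
    return merged


# "ā" is encoded as the two bytes 0xC4 0x81; here they land in different segments.
raw = [(0, 10, b" m"), (10, 20, b"\xc4"), (20, 30, b"\x81ja")]
result = merge_partial_utf8(raw)
```

Decoding the middle segment by itself raises the same "unexpected end of data" error quoted above, while the merged result keeps the start time of the first partial segment and the end time of the last.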

This PR also fixes an issue where, with the "Word-level timings" setting enabled, words were split into separate segments, which made the feature less usable in real-world situations. The changes in this PR combine whisper.cpp segments at word boundaries, using a space as the boundary.
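The space-based merging can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: segments are hypothetical `(start, end, text)` tuples, the name `combine_into_words` is invented for the example, and a segment is assumed to start a new word only when its text begins with a space.

```python
from typing import List, Tuple

# Hypothetical segment shape, used only for this illustration.
Segment = Tuple[int, int, str]  # (start, end, text)


def combine_into_words(segments: List[Segment]) -> List[Segment]:
    """Merge consecutive segments until the next one starts with a space."""
    words: List[Segment] = []
    for start, end, text in segments:
        if words and not text.startswith(" "):
            # Continuation of the current word: extend its text and end time.
            prev_start, _, prev_text = words[-1]
            words[-1] = (prev_start, end, prev_text + text)
        else:
            # A leading space (or the very first segment) opens a new word.
            words.append((start, end, text))
    return words


pieces = [(0, 10, " m"), (10, 30, "āja"), (30, 50, " ir"), (50, 70, " liel"), (70, 90, "a")]
words = combine_into_words(pieces)
```

Because the rule depends entirely on spaces, text in a language written without spaces between words would collapse into a single long segment under this sketch.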

The open question is how to handle languages where a space may not be a proper word boundary. If anyone has relevant comments on word boundaries in languages such as Chinese, I am happy to adjust the solution.

codecov bot commented May 14, 2024

Codecov Report

Attention: Patch coverage is 78.94737%, with 8 lines in your changes missing coverage. Please review.

Project coverage is 81.30%. Comparing base (d483864) to head (5b85a81).
Report is 3 commits behind head on main.

❗ Current head 5b85a81 differs from pull request most recent head 3513158. Consider uploading reports for the commit 3513158 to get more accurate results

Files Patch % Lines
buzz/transcriber/whisper_cpp.py 78.94% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #734      +/-   ##
==========================================
- Coverage   81.97%   81.30%   -0.68%     
==========================================
  Files          83       81       -2     
  Lines        3840     3610     -230     
==========================================
- Hits         3148     2935     -213     
+ Misses        692      675      -17     
Flag Coverage Δ
Linux ?
Windows 81.30% <78.94%> (-0.07%) ⬇️
macOS ?


@chidiwilliams
Owner

Awesome, thank you. I've given you commit access to the repo, in case you're interested in joining as well. Cheers.

@chidiwilliams chidiwilliams enabled auto-merge (squash) May 14, 2024 23:16
@chidiwilliams chidiwilliams merged commit 38f5d26 into chidiwilliams:main May 14, 2024
9 of 11 checks passed
@raivisdejus raivisdejus deleted the fix-multibyte-word-timestamps branch May 15, 2024 04:50
2 participants