CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: raft::offset_monitor::wait_timed_out
#6078
Comments
raft::offset_monitor::wait_aborted
This CI failure recurred today.
This is probably fixed by #6419.
Interesting, it turns out the case reproduced in the test was actually a https://buildkite.com/redpanda/vtools/builds/3568#01836dbf-ccc2-43ee-9dc7-03a3b447737e
If this were a chaos test I would just suppress the log message in the test, but this test does not intentionally stop or restart any nodes during traffic, so a timeout is unexpected unless the offset monitor is somehow waiting for data promoted from S3.
raft::offset_monitor::wait_timed_out
Using this generic exception type loses us a little information on what kind of error occurred, but gains us the generic exception handling in the RPC layer that knows how to translate a seastar timeout into a kafka protocol timeout. Fixes: redpanda-data#6078
There is plenty going on around the time of the timeout: segment hydrations are in flight, segments were recently deleted for retention.bytes enforcement, and at about the same time some requests are timing out while creating kafka/__consumer_offsets when the consumers first start. Because this timeout happens effectively at startup (when we have a slew of other timeouts from concurrent consumers all hitting init paths at once), I'm not particularly worried. We can extend the change from #6419 to throw an ss::timed_out_error exception, which will then be caught by the generic RPC logic that politely sends a timeout error code on to the client rather than logging it as an unexpected server error.
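To illustrate why switching to a generic timeout exception changes what the client sees, here is a minimal standalone sketch. It is not Redpanda or Seastar code: `timed_out_error`, `offset_monitor_wait_timed_out`, and `handle_request` are hypothetical stand-ins invented for this example, and only the two numeric Kafka error codes are taken from the Kafka protocol.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>

// Hypothetical stand-in for a generic timeout exception such as
// ss::timed_out_error. This is NOT the real Seastar type.
struct timed_out_error : std::exception {
    const char* what() const noexcept override { return "timed out"; }
};

// Hypothetical stand-in for a component-specific timeout type that the
// generic handler below knows nothing about.
struct offset_monitor_wait_timed_out : std::exception {
    const char* what() const noexcept override {
        return "offset_monitor wait timed out";
    }
};

// Subset of Kafka protocol error codes.
enum class error_code : std::int16_t {
    unknown_server_error = -1,
    none = 0,
    request_timed_out = 7,
};

// Hypothetical generic handler: it recognises only the generic timeout
// type. Anything else falls through to the catch-all branch, which in a
// real server is where an "unexpected error" would be logged.
template <typename Fn>
error_code handle_request(Fn&& fn) {
    try {
        fn();
        return error_code::none;
    } catch (const timed_out_error&) {
        // Known timeout: report it to the client without an error log.
        return error_code::request_timed_out;
    } catch (const std::exception&) {
        // Unknown exception type: treated as an unexpected server error.
        return error_code::unknown_server_error;
    }
}
```

In this sketch, a request that throws `offset_monitor_wait_timed_out` ends up in the catch-all branch, while one that throws `timed_out_error` is translated into `request_timed_out`. That is the trade-off described above: a little specificity is lost in exchange for the existing generic handling.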
Stack traces are basically identical in both tests:
This looks similar in nature to #5764 so maybe a similar fix will suffice.