CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: `raft::offset_monitor::wait_timed_out` #6078

andrwng · 2022-08-17T17:47:55Z

FAIL test: FranzGoVerifiableWithSiTestLargeSegments.test_si_with_timeboxed (1/1 runs)
  failure at 2022-08-17T07:38:31.224Z: <BadLogLines nodes=ip-172-31-50-71(7) example="ERROR 2022-08-17 04:06:17,503 [shard 3] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)">
      in job https://buildkite.com/redpanda/vtools/builds/3271#0182a9c3-a7aa-4850-a6b7-65bee8152d80
FAIL test: FranzGoVerifiableWithSiTestLargeSegments.test_si_without_timeboxed (1/1 runs)
  failure at 2022-08-17T07:38:31.224Z: <BadLogLines nodes=ip-172-31-50-71(10) example="ERROR 2022-08-17 04:09:31,897 [shard 1] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)">
      in job https://buildkite.com/redpanda/vtools/builds/3271#0182a9c3-a7aa-4850-a6b7-65bee8152d80

Stack traces are basically identical in both tests:

====================================================================================================
test_id:    rptest.scale_tests.franz_go_verifiable_test.FranzGoVerifiableWithSiTestLargeSegments.test_si_with_timeboxed
status:     FAIL
run time:   3 minutes 25.014 seconds


    <BadLogLines nodes=ip-172-31-50-71(7) example="ERROR 2022-08-17 04:06:17,503 [shard 3] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)">
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 51, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1267, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-50-71(7) example="ERROR 2022-08-17 04:06:17,503 [shard 3] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)">

This looks similar in nature to #5764 so maybe a similar fix will suffice.

The text was updated successfully, but these errors were encountered:

abhijat · 2022-09-08T11:18:36Z

This CI failure was repeated today

FAIL test: KgoVerifierWithSiTestLargeSegments.test_si_without_timeboxed (1/2 runs)
  failure at 2022-09-08T07:19:27.327Z: <BadLogLines nodes=ip-172-31-55-202(7) example="ERROR 2022-09-08 04:09:47,154 [shard 1] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)">
      in job https://buildkite.com/redpanda/vtools/builds/3487#01831b0e-22b1-4569-998d-ba2249309e87

jcsp · 2022-09-23T09:58:22Z

This is probably fixed by #6419

jcsp · 2022-09-26T09:08:22Z

Interesting, it turns out the case reproduced in the test was actually a raft::offset_monitor::wait_timed_out (once #6419 separated timeouts and shutdown-type aborts).

https://buildkite.com/redpanda/vtools/builds/3568#01836dbf-ccc2-43ee-9dc7-03a3b447737e

jcsp · 2022-09-26T09:11:11Z

If this was a chaos test I would just suppress the log message in the test, but in this test we do not intentionally stop or restart any nodes during traffic, so it's unexpected that we see a timeout, unless the offset monitor is somehow waiting for data promoted from S3.

Using this generic exception type loses us a little information on what kind of error occurred, but gains us the generic exception handling in the RPC layer that knows how to translate a seastar timeout into a kafka protocol timeout. Fixes: redpanda-data#6078

jcsp · 2022-09-26T12:45:58Z

There is plenty going on around the time of it: segment hydrations in flight, also segments recently deleted for retention.bytes enforcement, and it's about the same time some requests are getting timeouts creating kafka/__consumer_offsets when the consumers first start.

Because this timeout is happening effectively at startup (when we have a slew of other timeouts from concurrent consumers all hitting init paths at the same time), I'm not particularly worried.

We can extend the change from #6419 to use a ss::timed_out_error exception, and then it'll be caught by the generic RPC logic for politely sending a timeout error code onward to the client rather than logging it as an unexpected server error.

Using this generic exception type loses us a little information on what kind of error occurred, but gains us the generic exception handling in the RPC layer that knows how to translate a seastar timeout into a kafka protocol timeout. Fixes: redpanda-data#6078

andrwng added the kind/bug Something isn't working label Aug 17, 2022

ztlpn added the ci-failure label Aug 18, 2022

mmedenjak added area/tests area/cloud-storage Shadow indexing subsystem labels Aug 23, 2022

rystsov changed the title ~~CI failure: FranzGoVerifiableWithSiTestLargeSegments.test_si_with_timeboxed logs an error: raft::offset_monitor::wait_aborted~~ CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: raft::offset_monitor::wait_aborted Aug 25, 2022

mmedenjak assigned abhijat Sep 8, 2022

jcsp closed this as completed Sep 23, 2022

jcsp reopened this Sep 26, 2022

jcsp changed the title ~~CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: raft::offset_monitor::wait_aborted~~ CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: raft::offset_monitor::wait_timed_out Sep 26, 2022

jcsp mentioned this issue Sep 26, 2022

raft: use ss::timed_out_error in offset_monitor #6525

Merged

6 tasks

mmedenjak closed this as completed in #6525 Sep 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: `raft::offset_monitor::wait_timed_out` #6078

CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: `raft::offset_monitor::wait_timed_out` #6078

andrwng commented Aug 17, 2022

abhijat commented Sep 8, 2022

jcsp commented Sep 23, 2022

jcsp commented Sep 26, 2022 •

edited

Loading

jcsp commented Sep 26, 2022

jcsp commented Sep 26, 2022

CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: raft::offset_monitor::wait_timed_out #6078

CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: raft::offset_monitor::wait_timed_out #6078

Comments

andrwng commented Aug 17, 2022

abhijat commented Sep 8, 2022

jcsp commented Sep 23, 2022

jcsp commented Sep 26, 2022 • edited Loading

jcsp commented Sep 26, 2022

jcsp commented Sep 26, 2022

CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: `raft::offset_monitor::wait_timed_out` #6078

CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: `raft::offset_monitor::wait_timed_out` #6078

jcsp commented Sep 26, 2022 •

edited

Loading