Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: raft::offset_monitor::wait_timed_out #6078

Closed
andrwng opened this issue Aug 17, 2022 · 5 comments · Fixed by #6525
Assignees
Labels
area/cloud-storage Shadow indexing subsystem area/tests ci-failure kind/bug Something isn't working

Comments

@andrwng
Copy link
Contributor

andrwng commented Aug 17, 2022

FAIL test: FranzGoVerifiableWithSiTestLargeSegments.test_si_with_timeboxed (1/1 runs)
  failure at 2022-08-17T07:38:31.224Z: <BadLogLines nodes=ip-172-31-50-71(7) example="ERROR 2022-08-17 04:06:17,503 [shard 3] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)">
      in job https://buildkite.com/redpanda/vtools/builds/3271#0182a9c3-a7aa-4850-a6b7-65bee8152d80
FAIL test: FranzGoVerifiableWithSiTestLargeSegments.test_si_without_timeboxed (1/1 runs)
  failure at 2022-08-17T07:38:31.224Z: <BadLogLines nodes=ip-172-31-50-71(10) example="ERROR 2022-08-17 04:09:31,897 [shard 1] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)">
      in job https://buildkite.com/redpanda/vtools/builds/3271#0182a9c3-a7aa-4850-a6b7-65bee8152d80

Stack traces are basically identical in both tests:

====================================================================================================
test_id:    rptest.scale_tests.franz_go_verifiable_test.FranzGoVerifiableWithSiTestLargeSegments.test_si_with_timeboxed
status:     FAIL
run time:   3 minutes 25.014 seconds


    <BadLogLines nodes=ip-172-31-50-71(7) example="ERROR 2022-08-17 04:06:17,503 [shard 3] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)">
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 51, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1267, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-50-71(7) example="ERROR 2022-08-17 04:06:17,503 [shard 3] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)">

This looks similar in nature to #5764 so maybe a similar fix will suffice.

@andrwng andrwng added the kind/bug Something isn't working label Aug 17, 2022
@mmedenjak mmedenjak added area/tests area/cloud-storage Shadow indexing subsystem labels Aug 23, 2022
@rystsov rystsov changed the title CI failure: FranzGoVerifiableWithSiTestLargeSegments.test_si_with_timeboxed logs an error: raft::offset_monitor::wait_aborted CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: raft::offset_monitor::wait_aborted Aug 25, 2022
@abhijat
Copy link
Contributor

abhijat commented Sep 8, 2022

This CI failure was repeated today

FAIL test: KgoVerifierWithSiTestLargeSegments.test_si_without_timeboxed (1/2 runs)
  failure at 2022-09-08T07:19:27.327Z: <BadLogLines nodes=ip-172-31-55-202(7) example="ERROR 2022-09-08 04:09:47,154 [shard 1] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)">
      in job https://buildkite.com/redpanda/vtools/builds/3487#01831b0e-22b1-4569-998d-ba2249309e87

@jcsp
Copy link
Contributor

jcsp commented Sep 23, 2022

This is probably fixed by #6419

@jcsp jcsp closed this as completed Sep 23, 2022
@jcsp
Copy link
Contributor

jcsp commented Sep 26, 2022

Interesting, it turns out the case reproduced in the test was actually a raft::offset_monitor::wait_timed_out (once #6419 separated timeouts and shutdown-type aborts).

https://buildkite.com/redpanda/vtools/builds/3568#01836dbf-ccc2-43ee-9dc7-03a3b447737e

@jcsp jcsp reopened this Sep 26, 2022
@jcsp
Copy link
Contributor

jcsp commented Sep 26, 2022

If this was a chaos test I would just suppress the log message in the test, but in this test we do not intentionally stop or restart any nodes during traffic, so it's unexpected that we see a timeout, unless the offset monitor is somehow waiting for data promoted from S3.

@jcsp jcsp changed the title CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: raft::offset_monitor::wait_aborted CI failure: KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed logs an error: raft::offset_monitor::wait_timed_out Sep 26, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Sep 26, 2022
Using this generic exception type loses us a little
information on what kind of error occurred, but gains
us the generic exception handling in the RPC layer that
knows how to translate a seastar timeout into a kafka
protocol timeout.

Fixes: redpanda-data#6078
@jcsp
Copy link
Contributor

jcsp commented Sep 26, 2022

There is plenty going on around the time of it: segment hydrations in flight, also segments recently deleted for retention.bytes enforcement, and it's about the same time some requests are getting timeouts creating kafka/__consumer_offsets when the consumers first start.

Because this timeout is happening effectively at startup (when we have a slew of other timeouts from concurrent consumers all hitting init paths at the same time), I'm not particularly worried.

We can extend the change from #6419 to use a ss::timed_out_error exception, and then it'll be caught by the generic RPC logic for politely sending a timeout error code onward to the client rather than logging it as an unexpected server error.

ballard26 pushed a commit to ballard26/redpanda that referenced this issue Sep 27, 2022
Using this generic exception type loses us a little
information on what kind of error occurred, but gains
us the generic exception handling in the RPC layer that
knows how to translate a seastar timeout into a kafka
protocol timeout.

Fixes: redpanda-data#6078
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cloud-storage Shadow indexing subsystem area/tests ci-failure kind/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants