
Test BadLogLines failures with uncaught raft::offset_monitor::wait_aborted (KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed, PartitionBalancerTest.test_fuzz_admin_ops) #5154

Closed
ajfabbri opened this issue Jun 17, 2022 · 4 comments · Fixed by #6419


@ajfabbri
Contributor

rptest.scale_tests.franz_go_verifiable_test.FranzGoVerifiableWithSiTest.test_si_with_timeboxed.segment_size=10485760
  <BadLogLines nodes=ip-172-31-58-10(3) example="ERROR 2022-06-14 07:28:06,896 [shard 0] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)">

This is similar to #4489; both cases have the offset monitor wait aborted exception.

Reproduced in CDT here.

@piyushredpanda
Contributor

Assigning to @ZeDRoman, as #4489 was something he was looking at.

@jcsp
Contributor

jcsp commented Aug 4, 2022

This uncaught exception is still in the code. I'm currently seeing it around the same time as I start a bunch of clients doing idempotent writes.

I'm seeing a mixture of caught wait_aborted exceptions coming from the id_allocator machinery, and some uncaught ones making it up to the RPC handler, which logs them as ERROR:

WARN  2022-08-04 19:44:23,752 [shard 0] cluster - id_allocator_frontend.cc:252 - can not create {kafka_internal}/{id_allocator} topic - error: raft::offset_monitor::wait_aborted (offset monitor wait aborted)
WARN  2022-08-04 19:44:23,752 [shard 0] cluster - id_allocator_frontend.cc:70 - can't find {ns: {kafka_internal}, topic: {id_allocator}} in the metadata cache
WARN  2022-08-04 19:44:23,752 [shard 0] kafka - init_producer_id.cc:114 - failed to allocate pid, ec: cluster::errc:14
ERROR 2022-08-04 19:44:23,772 [shard 1] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)

@jcsp jcsp changed the title franz_go_verifiable_test...test_si_with_timeboxed: offset monitor wait aborted exception Test BadLogLines failures with uncaught raft::offset_monitor::wait_aborted (FranzGoVerifiableWithSiTest.test_si_with_timeboxed, PartitionBalancerTest.test_fuzz_admin_ops) Aug 5, 2022
@jcsp
Contributor

jcsp commented Aug 5, 2022

FAIL test: PartitionBalancerTest.test_fuzz_admin_ops (2/37 runs)
failure at 2022-08-05T07:48:34.288Z:
in job https://buildkite.com/redpanda/redpanda/builds/13659#01826c88-355c-4b07-a514-c884579adabb

@rystsov rystsov changed the title Test BadLogLines failures with uncaught raft::offset_monitor::wait_aborted (FranzGoVerifiableWithSiTest.test_si_with_timeboxed, PartitionBalancerTest.test_fuzz_admin_ops) Test BadLogLines failures with uncaught raft::offset_monitor::wait_aborted (KgoVerifierWithSiTest.test_si_with_timeboxed, PartitionBalancerTest.test_fuzz_admin_ops) Aug 25, 2022
@rystsov rystsov changed the title Test BadLogLines failures with uncaught raft::offset_monitor::wait_aborted (KgoVerifierWithSiTest.test_si_with_timeboxed, PartitionBalancerTest.test_fuzz_admin_ops) Test BadLogLines failures with uncaught raft::offset_monitor::wait_aborted (KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed, PartitionBalancerTest.test_fuzz_admin_ops) Aug 25, 2022
ztlpn added a commit to ztlpn/redpanda that referenced this issue Sep 12, 2022
Add a "raft::offset_monitor::wait_aborted" message to allow list
redpanda-data#5154 is fixed
ztlpn added a commit to ztlpn/redpanda that referenced this issue Sep 13, 2022
Add a "raft::offset_monitor::wait_aborted" message to allow list
@ztlpn
Contributor

ztlpn commented Sep 15, 2022

Relevant discussion about the wait_aborted exception: #6367 (comment)

jcsp added a commit to jcsp/redpanda that referenced this issue Sep 15, 2022
Aborts should be propagated as the standard
ss::abort_requested_exception type which is understood
by handlers to be ignored silently, as it occurs during
normal shutdown.

Timeouts remain a specific exception type in offset_monitor,
and in locations that used to catch + swallow both aborts
and timeouts, timeouts are logged at WARN severity, as they
are not necessarily indicative of a fault, but may indicate
a system not operating at its best.

Fixes: redpanda-data#5154
ztlpn added a commit to ztlpn/redpanda that referenced this issue Sep 16, 2022
Add a "raft::offset_monitor::wait_aborted" message to allow list
redpanda-data#5154 is fixed

(cherry picked from commit db0ded6)
ballard26 pushed a commit to ballard26/redpanda that referenced this issue Sep 27, 2022
Add a "raft::offset_monitor::wait_aborted" message to allow list
redpanda-data#5154 is fixed
ballard26 pushed a commit to ballard26/redpanda that referenced this issue Sep 27, 2022
Aborts should be propagated as the standard ss::abort_requested_exception type which is understood by handlers to be ignored silently, as it occurs during normal shutdown.
BenPope pushed a commit to BenPope/redpanda that referenced this issue Mar 15, 2023
Aborts should be propagated as the standard ss::abort_requested_exception type which is understood by handlers to be ignored silently, as it occurs during normal shutdown.
(cherry picked from commit 927ea66)