
Test BadLogLines failures with uncaught raft::offset_monitor::wait_aborted (KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed, PartitionBalancerTest.test_fuzz_admin_ops) #5154

Closed
ajfabbri opened this issue Jun 17, 2022 · 4 comments · Fixed by #6419


@ajfabbri
Contributor

rptest.scale_tests.franz_go_verifiable_test.FranzGoVerifiableWithSiTest.test_si_with_timeboxed.segment_size=10485760
  <BadLogLines nodes=ip-172-31-58-10(3) example="ERROR 2022-06-14 07:28:06,896 [shard 0] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)">

This is similar to #4489; both cases have the offset monitor wait aborted exception.

Reproduced in CDT here.

@piyushredpanda
Contributor

Assigning to @ZeDRoman, as #4489 was something he was looking at.

@jcsp
Contributor

jcsp commented Aug 4, 2022

This uncaught exception is still in the code. I'm currently seeing it around the same time as I start a bunch of clients doing idempotent writes.

I'm seeing a mixture of caught wait_aborted exceptions coming from the id_allocator machinery, and some uncaught ones making it up to the RPC handler, which logs them as ERROR:

WARN  2022-08-04 19:44:23,752 [shard 0] cluster - id_allocator_frontend.cc:252 - can not create {kafka_internal}/{id_allocator} topic - error: raft::offset_monitor::wait_aborted (offset monitor wait aborted)
WARN  2022-08-04 19:44:23,752 [shard 0] cluster - id_allocator_frontend.cc:70 - can't find {ns: {kafka_internal}, topic: {id_allocator}} in the metadata cache
WARN  2022-08-04 19:44:23,752 [shard 0] kafka - init_producer_id.cc:114 - failed to allocate pid, ec: cluster::errc:14
ERROR 2022-08-04 19:44:23,772 [shard 1] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)

@jcsp jcsp changed the title franz_go_verifiable_test...test_si_with_timeboxed: offset monitor wait aborted exception Test BadLogLines failures with uncaught raft::offset_monitor::wait_aborted (FranzGoVerifiableWithSiTest.test_si_with_timeboxed, PartitionBalancerTest.test_fuzz_admin_ops) Aug 5, 2022
@jcsp
Contributor

jcsp commented Aug 5, 2022

FAIL test: PartitionBalancerTest.test_fuzz_admin_ops (2/37 runs)
failure at 2022-08-05T07:48:34.288Z:
in job https://buildkite.com/redpanda/redpanda/builds/13659#01826c88-355c-4b07-a514-c884579adabb

@rystsov rystsov changed the title Test BadLogLines failures with uncaught raft::offset_monitor::wait_aborted (FranzGoVerifiableWithSiTest.test_si_with_timeboxed, PartitionBalancerTest.test_fuzz_admin_ops) Test BadLogLines failures with uncaught raft::offset_monitor::wait_aborted (KgoVerifierWithSiTest.test_si_with_timeboxed, PartitionBalancerTest.test_fuzz_admin_ops) Aug 25, 2022
@rystsov rystsov changed the title Test BadLogLines failures with uncaught raft::offset_monitor::wait_aborted (KgoVerifierWithSiTest.test_si_with_timeboxed, PartitionBalancerTest.test_fuzz_admin_ops) Test BadLogLines failures with uncaught raft::offset_monitor::wait_aborted (KgoVerifierWithSiTestLargeSegments.test_si_with_timeboxed, PartitionBalancerTest.test_fuzz_admin_ops) Aug 25, 2022
ztlpn added a commit to ztlpn/redpanda that referenced this issue Sep 12, 2022
Add a "raft::offset_monitor::wait_aborted" message to allow list
redpanda-data#5154 is fixed
ztlpn added a commit to ztlpn/redpanda that referenced this issue Sep 13, 2022
Add a "raft::offset_monitor::wait_aborted" message to allow list
@ztlpn
Contributor

ztlpn commented Sep 15, 2022

Relevant discussion about the wait_aborted exception: #6367 (comment)

jcsp added a commit to jcsp/redpanda that referenced this issue Sep 15, 2022
Aborts should be propagated as the standard
ss::abort_requested_exception type which is understood
by handlers to be ignored silently, as it occurs during
normal shutdown.

Timeouts remain a specific exception type in offset_monitor,
and in locations that used to catch + swallow both aborts
and timeouts, timeouts are logged at WARN severity, as they
are not necessarily indicative of a fault, but may indicate
a system not operating at its best.

Fixes: redpanda-data#5154
ztlpn added a commit to ztlpn/redpanda that referenced this issue Sep 16, 2022
Add a "raft::offset_monitor::wait_aborted" message to allow list
redpanda-data#5154 is fixed

(cherry picked from commit db0ded6)
ballard26 pushed a commit to ballard26/redpanda that referenced this issue Sep 27, 2022
Add a "raft::offset_monitor::wait_aborted" message to allow list
redpanda-data#5154 is fixed
ballard26 pushed a commit to ballard26/redpanda that referenced this issue Sep 27, 2022
Aborts should be propagated as the standard ss::abort_requested_exception type which is understood by handlers to be ignored silently, as it occurs during normal shutdown.
BenPope pushed a commit to BenPope/redpanda that referenced this issue Mar 15, 2023
Aborts should be propagated as the standard ss::abort_requested_exception type which is understood by handlers to be ignored silently, as it occurs during normal shutdown.
(cherry picked from commit 927ea66)