CI Failure (500) in PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade
#5589
Comments
Is this a dupe of #5332?
PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade
We haven't seen it for more than two weeks; closing on the assumption that it's fixed (if it isn't, it'll pop up as a PR blocker).
Another instance: https://buildkite.com/redpanda/redpanda/builds/15760#01837b2c-a836-4d0f-a11f-d1366b0445cd; reopening as a PR blocker.
Another instance, ARM only: FAIL test: PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade (1/46 runs)
@rystsov please comment on the severity level of this issue.
@dotnwat errors in the admin API don't affect the Kafka API, so I think the severity is low; we just need to retry the request.
Seen here: This is an error handling bug in the admin API, the case it's hitting is:
Fix is probably to figure out which raft::errc that corresponds to, and add it to the list of things in admin_server.cc that get mapped to a 503. Then the admin client in the test will retry.
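To illustrate why a 503 matters on the client side, here is a minimal sketch of retry-on-503 behaviour. It is not the actual admin client used by the tests; `request_with_retry` and its parameters are hypothetical names, and `send` stands in for whatever issues the HTTP request and returns its status code.

```python
import time
from typing import Callable, Iterable


def request_with_retry(send: Callable[[], int],
                       retriable: Iterable[int] = (503,),
                       attempts: int = 5,
                       backoff_s: float = 0.1) -> int:
    """Call send() until it returns a non-retriable status or attempts run out.

    Hypothetical helper: `send` issues one request and returns the HTTP status.
    """
    retriable = set(retriable)
    status = send()
    for _ in range(attempts - 1):
        if status not in retriable:
            break  # success or a non-retriable error such as 500
        time.sleep(backoff_s)
        status = send()
    return status
```

With a client like this, a 500 is returned to the caller on the first attempt, while a 503 is retried, which is why mapping the error to 503 on the server would let the test recover.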
Immediately following a node restart, attempts to transfer leadership to the restarted node may be misconstrued as a node that needs recovery, resulting in a raft::errc::timeout, which yields code 500 in versions that don't have redpanda-data#7096 (i.e. 22.3 and below). Since 500 is a generic error, rather than adding 500 to the accepted retriable codes, this commit opts to retry the leadership transfer until leadership is actually transferred, ignoring all errors. This entailed tweaking the return semantics of transfer_leadership_to(), which no other test appears to be using. Fixes redpanda-data#5589
Immediately following a node restart, attempts to transfer leadership to the restarted node may be misconstrued as a node that needs recovery, resulting in a raft::errc::timeout, which yields code 500 in versions that don't have redpanda-data#6388 (i.e. 22.3 and below). Since 500 is a generic error, rather than adding 500 to the accepted retriable codes, this commit opts to retry the leadership transfer until leadership is actually transferred, ignoring all errors. This entailed tweaking the return semantics of transfer_leadership_to(), which no other test appears to be using. Fixes redpanda-data#5589
Immediately following a node restart, attempts to transfer leadership to the restarted node may be misconstrued as a node that needs recovery, resulting in a raft::errc::timeout, which yields code 500 in versions that don't have redpanda-data#6388 (i.e. 22.3 and below). Since 500 is a generic error, rather than adding 500 to the accepted retriable codes, this commit opts to retry the leadership transfer until leadership is actually transferred, ignoring all errors. This entailed tweaking the return semantics of transfer_leadership_to(), which no other test appears to be using. Fixes redpanda-data#5589 (cherry picked from commit a497908)
Immediately following a node restart, attempts to transfer leadership to the restarted node may be misconstrued as a node that needs recovery, resulting in a raft::errc::timeout, which yields code 500 in versions that don't have redpanda-data#6388 (i.e. 22.3 and below). Since 500 is a generic error, rather than adding 500 to the accepted retriable codes, this commit opts to retry the leadership transfer until leadership is actually transferred, ignoring all errors. This entailed tweaking the return semantics of transfer_leadership_to(), which no other test appears to be using. Fixes redpanda-data#5589 (cherry picked from commit a497908) I dropped the changes to the tests, since they don't exist on v22.1.x
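The approach described in these commit messages (retry the transfer until leadership has actually moved, ignoring all errors) could look roughly like the sketch below. It is only an illustration under stated assumptions: `transfer_leadership_until_done`, `try_transfer`, and `get_leader` are hypothetical names, not the real `transfer_leadership_to()` signature or the test's helpers.

```python
import time
from typing import Callable, Optional


def transfer_leadership_until_done(try_transfer: Callable[[], None],
                                   get_leader: Callable[[], Optional[int]],
                                   target: int,
                                   timeout_s: float = 30.0,
                                   backoff_s: float = 1.0) -> bool:
    """Retry a leadership transfer until `target` is the observed leader.

    Hypothetical helpers: `try_transfer` issues one transfer request and may
    raise; `get_leader` returns the current leader's node id, or None.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            try_transfer()
        except Exception:
            # Any error (e.g. a generic 500) is ignored; the leader check
            # below decides whether the transfer succeeded.
            pass
        if get_leader() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(backoff_s)
```

The design choice here is that success is defined by the observed leader rather than by the response code, so no particular error code needs to be classified as retriable.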
https://buildkite.com/redpanda/redpanda/builds/12899#0182246f-263f-4db9-ba6a-0b2d60eeeb8c
Corresponding panda log