
CI Failure (500) in PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade #5589

Closed
rystsov opened this issue Jul 22, 2022 · 8 comments · Fixed by #7163
Labels
area/raft area/tests ci-failure kind/bug pr-blocker

Comments

rystsov commented Jul 22, 2022

https://buildkite.com/redpanda/redpanda/builds/12899#0182246f-263f-4db9-ba6a-0b2d60eeeb8c

Module: rptest.tests.prefix_truncate_recovery_test
Class:  PrefixTruncateRecoveryUpgradeTest
Method: test_recover_during_upgrade
HTTPError('500 Server Error: Internal Server Error for url: http://docker-rp-4:9644/v1/raft/1/transfer_leadership?target=1')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/prefix_truncate_recovery_test.py", line 191, in test_recover_during_upgrade
    MixedVersionWorkloadRunner.upgrade_with_workload(
  File "/root/tests/rptest/tests/upgrade_with_workload.py", line 46, in upgrade_with_workload
    workload_fn(node0, node1)
  File "/root/tests/rptest/tests/prefix_truncate_recovery_test.py", line 183, in _run_recovery
    self.redpanda._admin.transfer_leadership_to(
  File "/root/tests/rptest/services/admin.py", line 660, in transfer_leadership_to
    ret = self._request('post', path=path, node=leader)
  File "/root/tests/rptest/services/admin.py", line 334, in _request
    r.raise_for_status()
  File "/usr/local/lib/python3.9/dist-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)

Corresponding Redpanda log:

INFO  2022-07-22 07:29:38,510 [shard 0] admin_api_server - admin_server.cc:1206 - Leadership transfer request for raft group 1 to node {1}
INFO  2022-07-22 07:29:38,510 [shard 1] raft - [group_id:1, {kafka/topic-bcomvjecwu/0}] consensus.cc:2757 - Starting leadership transfer from {id: {2}, revision: {16}} to {id: {1}, revision: {16}} in term 1
TRACE 2022-07-22 07:29:38,510 [shard 1] raft - [group_id:1, {kafka/topic-bcomvjecwu/0}] consensus.cc:2616 - transfer leadership: preparing target={id: {1}, revision: {16}}, dirty_offset=0
TRACE 2022-07-22 07:29:38,510 [shard 1] raft - [group_id:1, {kafka/topic-bcomvjecwu/0}] consensus.cc:2621 - transfer leadership: cleared oplock
DEBUG 2022-07-22 07:29:38,510 [shard 1] raft - [group_id:1, {kafka/topic-bcomvjecwu/0}] consensus.cc:2641 - transfer leadership: starting node {id: {1}, revision: {16}} recovery
INFO  2022-07-22 07:29:38,510 [shard 1] raft - [group_id:1, {kafka/topic-bcomvjecwu/0}] consensus.cc:2660 - transfer leadership: waiting for node {id: {1}, revision: {16}} to catch up
TRACE 2022-07-22 07:29:38,510 [shard 1] raft - [follower: {id: {1}, revision: {16}}] [group_id:1, {kafka/topic-bcomvjecwu/0}] - recovery_stm.cc:536 - Finished recovery
INFO  2022-07-22 07:29:38,510 [shard 1] raft - [group_id:1, {kafka/topic-bcomvjecwu/0}] consensus.cc:2676 - transfer leadership: finished waiting on node {id: {1}, revision: {16}} recovery
WARN  2022-07-22 07:29:38,510 [shard 1] raft - [group_id:1, {kafka/topic-bcomvjecwu/0}] consensus.cc:2841 - Cannot transfer leadership: {id: {1}, revision: {16}} needs recovery (-9223372036854775808, -9223372036854775808, 0)

BenPope commented Jul 27, 2022

Is this a dupe of #5332?

@rystsov rystsov changed the title Leadership transfer failure in PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade CI Failure (500) in PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade Aug 22, 2022

rystsov commented Sep 13, 2022

We haven't seen it for more than two weeks; closing on the assumption it's fixed (if it isn't, it'll pop up again as a PR blocker).

@rystsov rystsov closed this as completed Sep 13, 2022

rystsov commented Sep 26, 2022

Another instance https://buildkite.com/redpanda/redpanda/builds/15760#01837b2c-a836-4d0f-a11f-d1366b0445cd; reopening as a pr-blocker

@rystsov rystsov reopened this Sep 26, 2022
@rystsov rystsov added the pr-blocker label Sep 26, 2022

andijcr commented Oct 21, 2022

Another instance, ARM only:

FAIL test: PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade (1/46 runs)
failure at 2022-10-20T07:31:08.986Z: HTTPError('500 Server Error: Internal Server Error for url: http://docker-rp-4:9644/v1/raft/1/transfer_leadership?target=1')
in job https://buildkite.com/redpanda/redpanda/builds/16962#0183f3e9-05ee-4fe4-b3cc-02606f5acc24

dotnwat commented Nov 2, 2022

@rystsov please comment on the severity level of this issue.

rystsov commented Nov 3, 2022

@dotnwat errors in the admin API don't affect the Kafka API, so I think the severity is low; we just need to repeat the request.
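
For illustration, a minimal sketch of that approach: a generic wrapper that simply repeats an admin API call when it fails with a transient 500. The helper name and retry parameters below are made up for the example and are not part of rptest.

```python
import time
from requests.exceptions import HTTPError

def retry_on_500(fn, *args, attempts=5, backoff_s=1, **kwargs):
    """Repeat an admin API call (e.g. Admin.transfer_leadership_to) when it
    fails with a transient 500, such as right after a node restart."""
    for attempt in range(attempts):
        try:
            return fn(*args, **kwargs)
        except HTTPError as e:
            transient = e.response is not None and e.response.status_code == 500
            if transient and attempt < attempts - 1:
                time.sleep(backoff_s)
                continue
            raise
```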

jcsp commented Nov 7, 2022

Seen here:
https://buildkite.com/redpanda/redpanda/builds/17928#018443c9-20bd-4a71-b465-e5d16d71b3d6

This is an error-handling bug in the admin API; the case it's hitting is:

WARN  2022-11-04 20:25:45,193 [shard 1] raft - [group_id:1, {kafka/topic-keqtkiwwzu/0}] consensus.cc:2841 - Cannot transfer leadership: {id: {1}, revision: {15}} needs recovery (-9223372036854775808, -9223372036854775808, 0)

The fix is probably to figure out which raft::errc that corresponds to and add it to the list of codes in admin_server.cc that get mapped to a 503. Then the admin client in the test will retry.
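
For context, a rough sketch of the client-side behavior that relies on, assuming the test's admin client retries 503s (as the comment implies) while other HTTP errors still surface immediately; the function below is illustrative, not the actual rptest implementation.

```python
import time
import requests

def admin_request_with_503_retry(method, url, attempts=10, backoff_s=1, **kwargs):
    """Illustrative only: retry on 503 (leadership transfer temporarily
    unavailable, e.g. target still recovering), but let any other error,
    such as a 500, raise immediately via raise_for_status()."""
    for attempt in range(attempts):
        r = requests.request(method, url, **kwargs)
        if r.status_code == 503 and attempt < attempts - 1:
            time.sleep(backoff_s)
            continue
        r.raise_for_status()
        return r
```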

andrwng added a commit to andrwng/redpanda that referenced this issue Nov 8, 2022
Immediately following a node restart, attempts to transfer leadership to
the restarted node may be misconstrued as a node that needs recovery,
resulting in a raft::errc::timeout, which yields code 500 in versions
that don't have redpanda-data#7096
(i.e. 22.3 and below).

Since 500 is a generic error, rather than adding 500 to the accepted
retriable codes, this commit opts to retry the leadership transfer until
leadership is actually transferred, ignoring all errors.

This entailed tweaking the return semantics of transfer_leadership_to(),
which no other test appears to be using.

Fixes redpanda-data#5589
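
In spirit, the retry described in this commit message looks roughly like the sketch below; `do_transfer`, `leader_id_for`, and the wait parameters are placeholders rather than the code actually merged in the PR.

```python
from requests.exceptions import HTTPError
from ducktape.utils.util import wait_until

def transfer_until_leader(do_transfer, leader_id_for, target_id,
                          timeout_sec=30, backoff_sec=1):
    """Keep issuing the leadership transfer and polling the observed leader;
    errors from individual attempts are ignored because success is judged
    only by whether leadership actually moved to the target node."""
    def attempt():
        try:
            do_transfer()  # e.g. a call to Admin.transfer_leadership_to
        except HTTPError:
            pass           # swallow transient errors (including 500s) and retry
        return leader_id_for() == target_id

    wait_until(attempt, timeout_sec=timeout_sec, backoff_sec=backoff_sec,
               err_msg=f"leadership did not move to node {target_id}")
```
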
andrwng added a commit to andrwng/redpanda that referenced this issue Nov 8, 2022
Immediately following a node restart, attempts to transfer leadership to
the restarted node may be misconstrued as a node that needs recovery,
resulting in a raft::errc::timeout, which yields code 500 in versions
that don't have redpanda-data#6388
(i.e. 22.3 and below).

Since 500 is a generic error, rather than adding 500 to the accepted
retriable codes, this commit opts to retry the leadership transfer until
leadership is actually transferred, ignoring all errors.

This entailed tweaking the return semantics of transfer_leadership_to(),
which no other test appears to be using.

Fixes redpanda-data#5589
andrwng added a commit to andrwng/redpanda that referenced this issue Nov 9, 2022
Immediately following a node restart, attempts to transfer leadership to
the restarted node may be misconstrued as a node that needs recovery,
resulting in a raft::errc::timeout, which yields code 500 in versions
that don't have redpanda-data#6388
(i.e. 22.3 and below).

Since 500 is a generic error, rather than adding 500 to the accepted
retriable codes, this commit opts to retry the leadership transfer until
leadership is actually transferred, ignoring all errors.

This entailed tweaking the return semantics of transfer_leadership_to(),
which no other test appears to be using.

Fixes redpanda-data#5589
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Nov 10, 2022
Immediately following a node restart, attempts to transfer leadership to
the restarted node may be misconstrued as a node that needs recovery,
resulting in a raft::errc::timeout, which yields code 500 in versions
that don't have redpanda-data#6388
(i.e. 22.3 and below).

Since 500 is a generic error, rather than adding 500 to the accepted
retriable codes, this commit opts to retry the leadership transfer until
leadership is actually transferred, ignoring all errors.

This entailed tweaking the return semantics of transfer_leadership_to(),
which no other test appears to be using.

Fixes redpanda-data#5589

(cherry picked from commit a497908)
BenPope pushed a commit to BenPope/redpanda that referenced this issue Dec 22, 2022
Immediately following a node restart, attempts to transfer leadership to
the restarted node may be misconstrued as a node that needs recovery,
resulting in a raft::errc::timeout, which yields code 500 in versions
that don't have redpanda-data#6388
(i.e. 22.3 and below).

Since 500 is a generic error, rather than adding 500 to the accepted
retriable codes, this commit opts to retry the leadership transfer until
leadership is actually transferred, ignoring all errors.

This entailed tweaking the return semantics of transfer_leadership_to(),
which no other test appears to be using.

Fixes redpanda-data#5589

(cherry picked from commit a497908)

I dropped the changes to the tests, since they don't exist on v22.1.x
This issue was closed.