
CI Failure (500) in PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade #5589

Closed
rystsov opened this issue Jul 22, 2022 · 8 comments · Fixed by #7163
Labels
area/raft area/tests ci-failure kind/bug pr-blocker

Comments

rystsov commented Jul 22, 2022

https://buildkite.com/redpanda/redpanda/builds/12899#0182246f-263f-4db9-ba6a-0b2d60eeeb8c

Module: rptest.tests.prefix_truncate_recovery_test
Class:  PrefixTruncateRecoveryUpgradeTest
Method: test_recover_during_upgrade
HTTPError('500 Server Error: Internal Server Error for url: http://docker-rp-4:9644/v1/raft/1/transfer_leadership?target=1')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/prefix_truncate_recovery_test.py", line 191, in test_recover_during_upgrade
    MixedVersionWorkloadRunner.upgrade_with_workload(
  File "/root/tests/rptest/tests/upgrade_with_workload.py", line 46, in upgrade_with_workload
    workload_fn(node0, node1)
  File "/root/tests/rptest/tests/prefix_truncate_recovery_test.py", line 183, in _run_recovery
    self.redpanda._admin.transfer_leadership_to(
  File "/root/tests/rptest/services/admin.py", line 660, in transfer_leadership_to
    ret = self._request('post', path=path, node=leader)
  File "/root/tests/rptest/services/admin.py", line 334, in _request
    r.raise_for_status()
  File "/usr/local/lib/python3.9/dist-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)

Corresponding Redpanda log:

INFO  2022-07-22 07:29:38,510 [shard 0] admin_api_server - admin_server.cc:1206 - Leadership transfer request for raft group 1 to node {1}
INFO  2022-07-22 07:29:38,510 [shard 1] raft - [group_id:1, {kafka/topic-bcomvjecwu/0}] consensus.cc:2757 - Starting leadership transfer from {id: {2}, revision: {16}} to {id: {1}, revision: {16}} in term 1
TRACE 2022-07-22 07:29:38,510 [shard 1] raft - [group_id:1, {kafka/topic-bcomvjecwu/0}] consensus.cc:2616 - transfer leadership: preparing target={id: {1}, revision: {16}}, dirty_offset=0
TRACE 2022-07-22 07:29:38,510 [shard 1] raft - [group_id:1, {kafka/topic-bcomvjecwu/0}] consensus.cc:2621 - transfer leadership: cleared oplock
DEBUG 2022-07-22 07:29:38,510 [shard 1] raft - [group_id:1, {kafka/topic-bcomvjecwu/0}] consensus.cc:2641 - transfer leadership: starting node {id: {1}, revision: {16}} recovery
INFO  2022-07-22 07:29:38,510 [shard 1] raft - [group_id:1, {kafka/topic-bcomvjecwu/0}] consensus.cc:2660 - transfer leadership: waiting for node {id: {1}, revision: {16}} to catch up
TRACE 2022-07-22 07:29:38,510 [shard 1] raft - [follower: {id: {1}, revision: {16}}] [group_id:1, {kafka/topic-bcomvjecwu/0}] - recovery_stm.cc:536 - Finished recovery
INFO  2022-07-22 07:29:38,510 [shard 1] raft - [group_id:1, {kafka/topic-bcomvjecwu/0}] consensus.cc:2676 - transfer leadership: finished waiting on node {id: {1}, revision: {16}} recovery
WARN  2022-07-22 07:29:38,510 [shard 1] raft - [group_id:1, {kafka/topic-bcomvjecwu/0}] consensus.cc:2841 - Cannot transfer leadership: {id: {1}, revision: {16}} needs recovery (-9223372036854775808, -9223372036854775808, 0)

BenPope commented Jul 27, 2022

Is this a dupe of #5332?

@rystsov rystsov changed the title Leadership transfer failure in PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade CI Failure (500) in PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade Aug 22, 2022

rystsov commented Sep 13, 2022

We haven't seen it for more than two weeks; closing on the assumption it's fixed (if it isn't, it'll pop up again as a PR blocker).

@rystsov rystsov closed this as completed Sep 13, 2022

rystsov commented Sep 26, 2022

Another instance https://buildkite.com/redpanda/redpanda/builds/15760#01837b2c-a836-4d0f-a11f-d1366b0445cd; reopening as a pr-blocker

@rystsov rystsov reopened this Sep 26, 2022
@rystsov rystsov added the pr-blocker label Sep 26, 2022

andijcr commented Oct 21, 2022

Another instance, ARM only:

FAIL test: PrefixTruncateRecoveryUpgradeTest.test_recover_during_upgrade (1/46 runs)
failure at 2022-10-20T07:31:08.986Z: HTTPError('500 Server Error: Internal Server Error for url: http://docker-rp-4:9644/v1/raft/1/transfer_leadership?target=1')
in job https://buildkite.com/redpanda/redpanda/builds/16962#0183f3e9-05ee-4fe4-b3cc-02606f5acc24

dotnwat commented Nov 2, 2022

@rystsov please comment on the severity level of this issue.

rystsov commented Nov 3, 2022

@dotnwat errors in the admin API don't affect the Kafka API, so I think the severity is low; we just need to repeat the request.
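
For illustration, a minimal sketch of that approach: a generic wrapper that simply repeats an admin API call when it fails with a transient 500. The helper name and retry parameters below are made up for the example and are not part of rptest.

```python
import time
from requests.exceptions import HTTPError

def retry_on_500(fn, *args, attempts=5, backoff_s=1, **kwargs):
    """Repeat an admin API call (e.g. Admin.transfer_leadership_to) when it
    fails with a transient 500, such as right after a node restart."""
    for attempt in range(attempts):
        try:
            return fn(*args, **kwargs)
        except HTTPError as e:
            transient = e.response is not None and e.response.status_code == 500
            if transient and attempt < attempts - 1:
                time.sleep(backoff_s)
                continue
            raise
```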

jcsp commented Nov 7, 2022

Seen here:
https://buildkite.com/redpanda/redpanda/builds/17928#018443c9-20bd-4a71-b465-e5d16d71b3d6

This is an error-handling bug in the admin API; the case it's hitting is:

WARN  2022-11-04 20:25:45,193 [shard 1] raft - [group_id:1, {kafka/topic-keqtkiwwzu/0}] consensus.cc:2841 - Cannot transfer leadership: {id: {1}, revision: {15}} needs recovery (-9223372036854775808, -9223372036854775808, 0)

The fix is probably to figure out which raft::errc that corresponds to and add it to the list of codes in admin_server.cc that get mapped to a 503. Then the admin client in the test will retry.
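
For context, a rough sketch of the client-side behavior that relies on, assuming the test's admin client retries 503s (as the comment implies) while other HTTP errors still surface immediately; the function below is illustrative, not the actual rptest implementation.

```python
import time
import requests

def admin_request_with_503_retry(method, url, attempts=10, backoff_s=1, **kwargs):
    """Illustrative only: retry on 503 (leadership transfer temporarily
    unavailable, e.g. target still recovering), but let any other error,
    such as a 500, raise immediately via raise_for_status()."""
    for attempt in range(attempts):
        r = requests.request(method, url, **kwargs)
        if r.status_code == 503 and attempt < attempts - 1:
            time.sleep(backoff_s)
            continue
        r.raise_for_status()
        return r
```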

andrwng added a commit to andrwng/redpanda that referenced this issue Nov 8, 2022
Immediately following a node restart, attempts to transfer leadership to
the restarted node may be misconstrued as a node that needs recovery,
resulting in a raft::errc::timeout, which yields code 500 in versions
that don't have redpanda-data#7096
(i.e. 22.3 and below).

Since 500 is a generic error, rather than adding 500 to the accepted
retriable codes, this commit opts to retry the leadership transfer until
leadership is actually transferred, ignoring all errors.

This entailed tweaking the return semantics of transfer_leadership_to(),
which no other test appears to be using.

Fixes redpanda-data#5589
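
In spirit, the retry described in this commit message looks roughly like the sketch below; `do_transfer`, `leader_id_for`, and the wait parameters are placeholders rather than the code actually merged in the PR.

```python
from requests.exceptions import HTTPError
from ducktape.utils.util import wait_until

def transfer_until_leader(do_transfer, leader_id_for, target_id,
                          timeout_sec=30, backoff_sec=1):
    """Keep issuing the leadership transfer and polling the observed leader;
    errors from individual attempts are ignored because success is judged
    only by whether leadership actually moved to the target node."""
    def attempt():
        try:
            do_transfer()  # e.g. a call to Admin.transfer_leadership_to
        except HTTPError:
            pass           # swallow transient errors (including 500s) and retry
        return leader_id_for() == target_id

    wait_until(attempt, timeout_sec=timeout_sec, backoff_sec=backoff_sec,
               err_msg=f"leadership did not move to node {target_id}")
```
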
andrwng added a commit to andrwng/redpanda that referenced this issue Nov 8, 2022
Immediately following a node restart, attempts to transfer leadership to
the restarted node may be misconstrued as a node that needs recovery,
resulting in a raft::errc::timeout, which yields code 500 in versions
that don't have redpanda-data#6388
(i.e. 22.3 and below).

Since 500 is a generic error, rather than adding 500 to the accepted
retriable codes, this commit opts to retry the leadership transfer until
leadership is actually transferred, ignoring all errors.

This entailed tweaking the return semantics of transfer_leadership_to(),
which no other test appears to be using.

Fixes redpanda-data#5589
andrwng added a commit to andrwng/redpanda that referenced this issue Nov 9, 2022
Immediately following a node restart, attempts to transfer leadership to
the restarted node may be misconstrued as a node that needs recovery,
resulting in a raft::errc::timeout, which yields code 500 in versions
that don't have redpanda-data#6388
(i.e. 22.3 and below).

Since 500 is a generic error, rather than adding 500 to the accepted
retriable codes, this commit opts to retry the leadership transfer until
leadership is actually transferred, ignoring all errors.

This entailed tweaking the return semantics of transfer_leadership_to(),
which no other test appears to be using.

Fixes redpanda-data#5589
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Nov 10, 2022
Immediately following a node restart, attempts to transfer leadership to
the restarted node may be misconstrued as a node that needs recovery,
resulting in a raft::errc::timeout, which yields code 500 in versions
that don't have redpanda-data#6388
(i.e. 22.3 and below).

Since 500 is a generic error, rather than adding 500 to the accepted
retriable codes, this commit opts to retry the leadership transfer until
leadership is actually transferred, ignoring all errors.

This entailed tweaking the return semantics of transfer_leadership_to(),
which no other test appears to be using.

Fixes redpanda-data#5589

(cherry picked from commit a497908)
BenPope pushed a commit to BenPope/redpanda that referenced this issue Dec 22, 2022
Immediately following a node restart, attempts to transfer leadership to
the restarted node may be misconstrued as a node that needs recovery,
resulting in a raft::errc::timeout, which yields code 500 in versions
that don't have redpanda-data#6388
(i.e. 22.3 and below).

Since 500 is a generic error, rather than adding 500 to the accepted
retriable codes, this commit opts to retry the leadership transfer until
leadership is actually transferred, ignoring all errors.

This entailed tweaking the return semantics of transfer_leadership_to(),
which no other test appears to be using.

Fixes redpanda-data#5589

(cherry picked from commit a497908)

I dropped the changes to the tests, since they don't exist on v22.1.x
This issue was closed.