
CI Failure (timeout) in TopicRecoveryTest.test_empty_segments #10219

Closed
dotnwat opened this issue Apr 19, 2023 · 9 comments · Fixed by #10535
Labels: area/cloud-storage (Shadow indexing subsystem), ci-failure, kind/bug (Something isn't working), sev/medium (Bugs that do not meet criteria for high or critical, but are more severe than low)

Comments

@dotnwat
Member

dotnwat commented Apr 19, 2023

https://buildkite.com/redpanda/redpanda/builds/27484#0187996d-6f3b-43a3-bc04-8f12c3fd182b

Module: rptest.tests.topic_recovery_test
Class:  TopicRecoveryTest
Method: test_empty_segments
Arguments:
{
  "cloud_storage_type": 2
}
====================================================================================================
test_id:    rptest.tests.topic_recovery_test.TopicRecoveryTest.test_empty_segments.cloud_storage_type=CloudStorageType.ABS
status:     FAIL
run time:   1 minute 42.935 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1354, in test_empty_segments
    self.do_run(test_case)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1312, in do_run
    self._wait_for_topic(test_case.expected_recovered_topics)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1262, in _wait_for_topic
    wait_until(verify, timeout_sec=timeout.total_seconds(), backoff_sec=1)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
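
The timeout comes from ducktape's wait_until helper: _wait_for_topic polls a verify predicate once per second and gives up after timeout.total_seconds(); because the call site passes no err_msg, the failure surfaces as a bare TimeoutError(''). A minimal sketch of that polling pattern follows (simplified and illustrative only; the real helper is ducktape.utils.util.wait_until):

import time


class TimeoutError(Exception):
    # ducktape raises its own ducktape.errors.TimeoutError;
    # this stand-in mirrors that for the sketch.
    pass


def wait_until(condition, timeout_sec, backoff_sec=1, err_msg=""):
    deadline = time.time() + timeout_sec
    last_exception = None
    while time.time() < deadline:
        try:
            if condition():
                return
        except Exception as e:
            # keep the last error so it can be chained onto the timeout
            last_exception = e
        time.sleep(backoff_sec)
    # With the default empty err_msg this produces exactly TimeoutError('')
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception


# Roughly how the test drives it: keep checking the recovered topic
# until it looks healthy, or time out.
# wait_until(verify, timeout_sec=timeout.total_seconds(), backoff_sec=1)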
dotnwat added the kind/bug (Something isn't working) and ci-failure labels Apr 19, 2023
@dotnwat
Member Author

dotnwat commented Apr 20, 2023

@dotnwat
Member Author

dotnwat commented Apr 21, 2023

@dotnwat
Member Author

dotnwat commented Apr 21, 2023

jcsp added the sev/medium (Bugs that do not meet criteria for high or critical, but are more severe than low) label Apr 26, 2023
jcsp added a commit to jcsp/redpanda that referenced this issue Apr 26, 2023
@BenPope
Member

BenPope commented May 8, 2023

@michael-redpanda
Contributor

@VladLazar
Contributor

FAIL test: TopicRecoveryTest.test_empty_segments.cloud_storage_type=CloudStorageType.S3 (1/37 runs)
  failure at 2023-05-23T07:23:45.162Z: TimeoutError('')
      on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/29652#01884735-c82c-464f-8f34-48b2ee498bb7

@BenPope
Member

BenPope commented May 31, 2023

This is a form of "stale metadata":

Node 3 becomes the leader:

INFO  2023-04-24 07:37:07,139 [shard 1] raft - [group_id:1, {kafka/panda-topic/0}] vote_stm.cc:264 - becoming the leader term:3
DEBUG 2023-04-24 07:37:07,139 [shard 1] raft - [group_id:1, {kafka/panda-topic/0}] consensus.cc:2312 - Replicating group configuration {current: {voters: {{id: {3}, revision: {18}}, {id: {1}, revision: {18}}, {id: {2}, revision: {18}}}, learners: {}}, old:{nullopt}, revision: 18, update: {nullopt}, version: 5}}
TRACE 2023-04-24 07:37:07,139 [shard 1] raft - [group_id:1, {kafka/panda-topic/0}] replicate_entries_stm.cc:187 - Self append entries - {raft_group:{1}, commit_index:{-9223372036854775808}, term:{3}, prev_log_index:{2}, prev_log_term:{2}, last_visible_index:{-9223372036854775808}}
INFO  2023-04-24 07:37:07,139 [shard 1] storage - segment.cc:745 - Creating new segment /var/lib/redpanda/data/kafka/panda-topic/0_18/3-3-v1.log

But the revisions don't line up: the raft group configuration above is replicated with revision 18, while the controller_backend records the partition addition with revision 17:

DEBUG 2023-04-24 07:37:25,044 [shard 1] cluster - controller_backend.cc:881 - [{kafka/panda-topic/0}] (retry 0) finished operation: {type: addition, revision: 17, assignment: { id: 0, group_id: 1, replicas: {{node_id: 3, shard: 1}, {node_id: 2, shard: 1}, {node_id: 1, shard: 1}} }, previous assignment: {nullopt}}

So node 3 reports itself as the leader in the metadata response, but then returns not_leader_for_partition when asked for the offsets.

TRACE 2023-04-24 07:37:28,225 [shard 0] kafka - handler.h:69 - [client_id: {rpk}] handling metadata v7 request {topics={{{name={panda-topic}}}} allow_auto_topic_creation=false include_cluster_authorized_operations=false include_topic_authorized_operations=false}
TRACE 2023-04-24 07:37:28,225 [shard 0] kafka - request_context.h:186 - [172.18.0.36:58400] sending 3:metadata for {rpk}, response {throttle_time_ms=0 brokers={{node_id=1 host=docker-rp-15 port=9092 rack={nullopt}}, {node_id=2 host=docker-rp-12 port=9092 rack={nullopt}}, {node_id=3 host=docker-rp-16 port=9092 rack={nullopt}}} cluster_id={redpanda.initializing} controller_id=1 topics={{error_code={ error_code: none [0] } name={panda-topic} is_internal=false partitions=[{error_code={ error_code: none [0] } partition_index=0 leader_id=3 leader_epoch=1 replica_nodes={{3}, {2}, {1}} isr_nodes={{3}, {2}, {1}} offline_replicas={}},] topic_authorized_operations=0}} cluster_authorized_operations=-2147483648}
TRACE 2023-04-24 07:37:28,227 [shard 1] kafka - requests.cc:94 - [172.18.0.36:58404] processing name:list_offsets, key:2, version:4 for rpk, mem_units: 8110, ctx_size: 42
TRACE 2023-04-24 07:37:28,227 [shard 1] kafka - handler.h:69 - [client_id: {rpk}] handling list_offsets v4 request {replica_id=-1 isolation_level=0 topics={{name={panda-topic} partitions={{partition_index=0 current_leader_epoch=-1 timestamp={timestamp: -2} max_num_offsets=0}}}}}
TRACE 2023-04-24 07:37:28,227 [shard 1] kafka - request_context.h:186 - [172.18.0.36:58404] sending 2:list_offsets for {rpk}, response {throttle_time_ms=0 topics={{name={panda-topic} partitions={{partition_index=0 error_code={ error_code: not_leader_for_partition [6] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-1}}}}}

I suspect the replica doesn't consider itself the leader because of this mismatch in the revisions.
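
For illustration only, a minimal client-side sketch of the symptom above, assuming kafka-python and a reachable broker (the bootstrap address and client id below are made up; only the topic name comes from the logs): the metadata request reports a leader for the partition, but the subsequent ListOffsets request comes back with an error instead of a high watermark.

from kafka import KafkaConsumer, TopicPartition
from kafka.errors import KafkaError

BOOTSTRAP = "docker-rp-16:9092"   # assumed address; node 3 in the logs above
TOPIC = "panda-topic"

consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP, client_id="leader-check")

# Metadata request: this is the path that reported leader_id=3 above.
print("partitions known from metadata:", consumer.partitions_for_topic(TOPIC))

# ListOffsets request: in the failing run the reported leader answered this
# with not_leader_for_partition instead of an offset.
tp = TopicPartition(TOPIC, 0)
try:
    print("end offsets:", consumer.end_offsets([tp]))
except KafkaError as e:
    # If the replica does not consider itself the leader (e.g. because of the
    # revision mismatch suspected above), the request keeps failing and the
    # client eventually gives up here.
    print("list_offsets failed:", e)
finally:
    consumer.close()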

@piyushredpanda
Contributor

@VladLazar could help answer any questions, perhaps?
