
CI Failure (timeout) in TopicRecoveryTest.test_empty_segments #10219

Closed
dotnwat opened this issue Apr 19, 2023 · 9 comments · Fixed by #10535
Labels: area/cloud-storage (Shadow indexing subsystem), ci-failure, kind/bug (Something isn't working), sev/medium (Bugs that do not meet criteria for high or critical, but are more severe than low)

Comments

@dotnwat
Member

dotnwat commented Apr 19, 2023

https://buildkite.com/redpanda/redpanda/builds/27484#0187996d-6f3b-43a3-bc04-8f12c3fd182b

Module: rptest.tests.topic_recovery_test
Class:  TopicRecoveryTest
Method: test_empty_segments
Arguments:
{
  "cloud_storage_type": 2
}
====================================================================================================
test_id:    rptest.tests.topic_recovery_test.TopicRecoveryTest.test_empty_segments.cloud_storage_type=CloudStorageType.ABS
status:     FAIL
run time:   1 minute 42.935 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1354, in test_empty_segments
    self.do_run(test_case)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1312, in do_run
    self._wait_for_topic(test_case.expected_recovered_topics)
  File "/root/tests/rptest/tests/topic_recovery_test.py", line 1262, in _wait_for_topic
    wait_until(verify, timeout_sec=timeout.total_seconds(), backoff_sec=1)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
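
The timeout comes from ducktape's wait_until helper: _wait_for_topic polls a verify predicate once per second and gives up after timeout.total_seconds(); because the call site passes no err_msg, the failure surfaces as a bare TimeoutError(''). A minimal sketch of that polling pattern follows (simplified and illustrative only; the real helper is ducktape.utils.util.wait_until):

import time


class TimeoutError(Exception):
    # ducktape raises its own ducktape.errors.TimeoutError;
    # this stand-in mirrors that for the sketch.
    pass


def wait_until(condition, timeout_sec, backoff_sec=1, err_msg=""):
    deadline = time.time() + timeout_sec
    last_exception = None
    while time.time() < deadline:
        try:
            if condition():
                return
        except Exception as e:
            # keep the last error so it can be chained onto the timeout
            last_exception = e
        time.sleep(backoff_sec)
    # With the default empty err_msg this produces exactly TimeoutError('')
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception


# Roughly how the test drives it: keep checking the recovered topic
# until it looks healthy, or time out.
# wait_until(verify, timeout_sec=timeout.total_seconds(), backoff_sec=1)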
dotnwat added the kind/bug (Something isn't working) and ci-failure labels Apr 19, 2023
@dotnwat
Member Author

dotnwat commented Apr 20, 2023

@dotnwat
Member Author

dotnwat commented Apr 21, 2023

@dotnwat
Member Author

dotnwat commented Apr 21, 2023

jcsp added the sev/medium (Bugs that do not meet criteria for high or critical, but are more severe than low) label Apr 26, 2023
jcsp added a commit to jcsp/redpanda that referenced this issue Apr 26, 2023
@BenPope
Member

BenPope commented May 8, 2023

@michael-redpanda
Contributor

@VladLazar
Contributor

FAIL test: TopicRecoveryTest.test_empty_segments.cloud_storage_type=CloudStorageType.S3 (1/37 runs)
  failure at 2023-05-23T07:23:45.162Z: TimeoutError('')
      on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/29652#01884735-c82c-464f-8f34-48b2ee498bb7

@BenPope
Member

BenPope commented May 31, 2023

This is a form of "stale metadata":

Node 3 becomes the leader:

INFO  2023-04-24 07:37:07,139 [shard 1] raft - [group_id:1, {kafka/panda-topic/0}] vote_stm.cc:264 - becoming the leader term:3
DEBUG 2023-04-24 07:37:07,139 [shard 1] raft - [group_id:1, {kafka/panda-topic/0}] consensus.cc:2312 - Replicating group configuration {current: {voters: {{id: {3}, revision: {18}}, {id: {1}, revision: {18}}, {id: {2}, revision: {18}}}, learners: {}}, old:{nullopt}, revision: 18, update: {nullopt}, version: 5}}
TRACE 2023-04-24 07:37:07,139 [shard 1] raft - [group_id:1, {kafka/panda-topic/0}] replicate_entries_stm.cc:187 - Self append entries - {raft_group:{1}, commit_index:{-9223372036854775808}, term:{3}, prev_log_index:{2}, prev_log_term:{2}, last_visible_index:{-9223372036854775808}}
INFO  2023-04-24 07:37:07,139 [shard 1] storage - segment.cc:745 - Creating new segment /var/lib/redpanda/data/kafka/panda-topic/0_18/3-3-v1.log

But the revisions don't line up: the raft group configuration above is replicated with revision 18, while the controller_backend records the partition addition with revision 17:

DEBUG 2023-04-24 07:37:25,044 [shard 1] cluster - controller_backend.cc:881 - [{kafka/panda-topic/0}] (retry 0) finished operation: {type: addition, revision: 17, assignment: { id: 0, group_id: 1, replicas: {{node_id: 3, shard: 1}, {node_id: 2, shard: 1}, {node_id: 1, shard: 1}} }, previous assignment: {nullopt}}

So node 3 reports itself as the leader in the metadata response, but then returns not_leader_for_partition when asked for the offsets.

TRACE 2023-04-24 07:37:28,225 [shard 0] kafka - handler.h:69 - [client_id: {rpk}] handling metadata v7 request {topics={{{name={panda-topic}}}} allow_auto_topic_creation=false include_cluster_authorized_operations=false include_topic_authorized_operations=false}
TRACE 2023-04-24 07:37:28,225 [shard 0] kafka - request_context.h:186 - [172.18.0.36:58400] sending 3:metadata for {rpk}, response {throttle_time_ms=0 brokers={{node_id=1 host=docker-rp-15 port=9092 rack={nullopt}}, {node_id=2 host=docker-rp-12 port=9092 rack={nullopt}}, {node_id=3 host=docker-rp-16 port=9092 rack={nullopt}}} cluster_id={redpanda.initializing} controller_id=1 topics={{error_code={ error_code: none [0] } name={panda-topic} is_internal=false partitions=[{error_code={ error_code: none [0] } partition_index=0 leader_id=3 leader_epoch=1 replica_nodes={{3}, {2}, {1}} isr_nodes={{3}, {2}, {1}} offline_replicas={}},] topic_authorized_operations=0}} cluster_authorized_operations=-2147483648}
TRACE 2023-04-24 07:37:28,227 [shard 1] kafka - requests.cc:94 - [172.18.0.36:58404] processing name:list_offsets, key:2, version:4 for rpk, mem_units: 8110, ctx_size: 42
TRACE 2023-04-24 07:37:28,227 [shard 1] kafka - handler.h:69 - [client_id: {rpk}] handling list_offsets v4 request {replica_id=-1 isolation_level=0 topics={{name={panda-topic} partitions={{partition_index=0 current_leader_epoch=-1 timestamp={timestamp: -2} max_num_offsets=0}}}}}
TRACE 2023-04-24 07:37:28,227 [shard 1] kafka - request_context.h:186 - [172.18.0.36:58404] sending 2:list_offsets for {rpk}, response {throttle_time_ms=0 topics={{name={panda-topic} partitions={{partition_index=0 error_code={ error_code: not_leader_for_partition [6] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-1}}}}}

I suspect the replica doesn't consider itself the leader because of this mismatch in the revisions.
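
For illustration only, a minimal client-side sketch of the symptom above, assuming kafka-python and a reachable broker (the bootstrap address and client id below are made up; only the topic name comes from the logs): the metadata request reports a leader for the partition, but the subsequent ListOffsets request comes back with an error instead of a high watermark.

from kafka import KafkaConsumer, TopicPartition
from kafka.errors import KafkaError

BOOTSTRAP = "docker-rp-16:9092"   # assumed address; node 3 in the logs above
TOPIC = "panda-topic"

consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP, client_id="leader-check")

# Metadata request: this is the path that reported leader_id=3 above.
print("partitions known from metadata:", consumer.partitions_for_topic(TOPIC))

# ListOffsets request: in the failing run the reported leader answered this
# with not_leader_for_partition instead of an offset.
tp = TopicPartition(TOPIC, 0)
try:
    print("end offsets:", consumer.end_offsets([tp]))
except KafkaError as e:
    # If the replica does not consider itself the leader (e.g. because of the
    # revision mismatch suspected above), the request keeps failing and the
    # client eventually gives up here.
    print("list_offsets failed:", e)
finally:
    consumer.close()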

@piyushredpanda
Contributor

@VladLazar could help answer any questions, perhaps?
