
cloud_storage/tests: wait for partition metadata during topic recovery tests #5757

Merged

Conversation

@abhijat abhijat commented Aug 1, 2022

Cover Letter

If a leader election happens right before we verify the topic watermark, rpk may return an error such as "not leader for partition" due to stale metadata.

Instead of failing immediately, this change makes the caller wait for up to one minute so that the new leadership information is propagated to the cluster and rpk returns the correct partition metadata.
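Roughly, the retry follows the sketch below, built on ducktape's wait_until helper. The method name, the expected_hwm parameter, and the exact readiness check are illustrative assumptions; only describe_topic, high_watermark, and the 60-second wait come from this PR's description.

from ducktape.utils.util import wait_until

def _verify_topic_watermark(self, topic_name, expected_hwm):
    # Poll rpk until every partition reports a usable high watermark.
    # While leadership metadata is still propagating after an election,
    # describe_topic() may fail or return incomplete rows, so both cases
    # are treated as "not ready yet" and retried.
    def _hwm_ready():
        try:
            hwms = [p.high_watermark
                    for p in self._rpk.describe_topic(topic_name)]
        except Exception:
            # e.g. "not leader for partition" caused by stale metadata
            return False
        if not hwms or any(h is None for h in hwms):
            return False
        return max(hwms) >= expected_hwm

    wait_until(_hwm_ready,
               timeout_sec=60,
               backoff_sec=1,
               err_msg=f"partition metadata for {topic_name} did not stabilize")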

In the failed test from the linked CI failure, the following log lines show the stale metadata returned during a leader election:

INFO  2022-07-29 18:08:38,744 [shard 1] raft - [group_id:1, {kafka/panda-topic-1/0}] vote_stm.cc:255 - becoming the leader term:5
INFO  2022-07-29 18:08:38,764 [shard 1] raft - [group_id:2, {kafka/panda-topic-1/1}] vote_stm.cc:255 - becoming the leader term:5
TRACE 2022-07-29 18:08:38,858 [shard 0] kafka - request_context.h:160 - [172.16.16.27:48330] sending 2:list_offsets response {throttle_time_ms=0 topics={{name={panda-topic-1} partitions={{partition_index=0 error_code={ error_code: not_leader_for_partition [6] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-1}}}}}
TRACE 2022-07-29 18:08:38,863 [shard 0] kafka - request_context.h:160 - [172.16.16.27:55628] sending 2:list_offsets response {throttle_time_ms=0 topics={{name={panda-topic-1} partitions={{partition_index=1 error_code={ error_code: not_leader_for_partition [6] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-1}}}}}
TRACE 2022-07-29 18:08:38,865 [shard 0] kafka - request_context.h:160 - [172.16.16.27:55628] sending 2:list_offsets response {throttle_time_ms=0 topics={{name={panda-topic-1} partitions={{partition_index=1 error_code={ error_code: not_leader_for_partition [6] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-1}}}}}
TRACE 2022-07-29 18:08:38,866 [shard 0] kafka - request_context.h:160 - [172.16.16.27:48330] sending 2:list_offsets response {throttle_time_ms=0 topics={{name={panda-topic-1} partitions={{partition_index=0 error_code={ error_code: not_leader_for_partition [6] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-1}}}}}
TRACE 2022-07-29 18:08:38,869 [shard 0] kafka - request_context.h:160 - [172.16.16.27:55628] sending 2:list_offsets response {throttle_time_ms=0 topics={{name={panda-topic-1} partitions={{partition_index=1 error_code={ error_code: not_leader_for_partition [6] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-1}}}}}
TRACE 2022-07-29 18:08:38,869 [shard 0] kafka - request_context.h:160 - [172.16.16.27:48330] sending 2:list_offsets response {throttle_time_ms=0 topics={{name={panda-topic-1} partitions={{partition_index=0 error_code={ error_code: not_leader_for_partition [6] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-1}}}}}
INFO  2022-07-29 18:08:39,027 [shard 1] raft - [group_id:1, {kafka/panda-topic-1/0}] vote_stm.cc:270 - became the leader term:5
INFO  2022-07-29 18:08:39,033 [shard 1] raft - [group_id:2, {kafka/panda-topic-1/1}] vote_stm.cc:270 - became the leader term:5

Fixes #5737

Force push: fix capturing the variable in wait_until

Release Notes

Comment on lines 182 to 183
for partition_info in self._rpk.describe_topic(topic_name):
    hwm = partition_info.high_watermark
abhijat (Contributor, Author) commented:

Should we calculate the max/min/first non-None value of the watermark over all partitions, e.g. something like the sketch after this comment?

Right now we are effectively just taking the value from the last row that rpk produced.

cc @Lazin
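For illustration only, aggregating over all partitions could look like the hypothetical alternative below (taking the maximum non-None high watermark); the names follow the snippet quoted above, and this is not necessarily what the PR ended up doing.

hwms = [p.high_watermark
        for p in self._rpk.describe_topic(topic_name)
        if p.high_watermark is not None]
# Take the largest reported high watermark instead of whichever row came last.
hwm = max(hwms) if hwms else None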

A contributor replied:

We only need high watermark for this test.

@abhijat abhijat marked this pull request as draft August 1, 2022 10:34
Lazin previously approved these changes Aug 1, 2022

@Lazin (Contributor) left a comment:

LGTM (once it's no longer a draft PR)

@abhijat abhijat force-pushed the wait-for-topic-metadata-test-fast-3 branch from 3339469 to 4cd151e Compare August 1, 2022 14:58
@abhijat abhijat marked this pull request as ready for review August 1, 2022 15:00
@abhijat (Contributor, Author) commented Aug 1, 2022

LGTM (once it's no longer a draft PR)

I had a bug in the way the watermark was captured; the latest commit should fix it.
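The capture issue mentioned here is a common Python pitfall: the predicate passed to wait_until only returns a truthy/falsy value, so the observed watermark has to be written into the enclosing scope. One way to do that is sketched below with illustrative names; this is not necessarily the code in the commit.

# assumes: from ducktape.utils.util import wait_until
def _recover_and_verify(self, topic_name):
    hwm = None

    def _update_watermark():
        # Without 'nonlocal' (or a mutable container), assigning to hwm here
        # would rebind a new local variable instead of the outer one,
        # leaving the captured value as None after wait_until() returns.
        nonlocal hwm
        for partition_info in self._rpk.describe_topic(topic_name):
            hwm = partition_info.high_watermark
        return hwm is not None

    wait_until(_update_watermark, timeout_sec=60, backoff_sec=1,
               err_msg="rpk did not report a high watermark")
    return hwm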

@abhijat abhijat requested a review from Lazin August 1, 2022 15:01
@abhijat (Contributor, Author) commented Aug 2, 2022

CI errors:
https://buildkite.com/redpanda/redpanda/builds/13415#018259ea-dff0-4d4e-8c72-89c30874ce5f

test_id:    rptest.tests.controller_upgrade_test.ControllerUpgradeTest.test_updating_cluster_when_executing_operations
status:     FAIL
run time:   5 minutes 0.884 seconds
 
    <BadLogLines nodes=docker-rp-19(1) example="ERROR 2022-08-01 15:46:09,378 [shard 1] r/heartbeat - heartbeat_manager.cc:284 - cannot find consensus group:19">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 48, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1121, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-19(1) example="ERROR 2022-08-01 15:46:09,378 [shard 1] r/heartbeat - heartbeat_manager.cc:284 - cannot find consensus group:19">

Related to #5378; should be fixed by PR #5742.

test_id:    rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=False.recovery=restart_recovery
status:     FAIL
run time:   3 minutes 30.828 seconds
 
    <BadLogLines nodes=docker-rp-11(1) example="ERROR 2022-08-01 15:53:16,410 [shard 1] rpc - transport.h:180 - Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2 reply type raft::append_entries_reply">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 48, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1121, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-11(1) example="ERROR 2022-08-01 15:53:16,410 [shard 1] rpc - transport.h:180 - Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2 reply type raft::append_entries_reply">

This seems to be an instance of #5608: "Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2 reply type raft::append_entries_reply".

@dotnwat (Member) commented Aug 2, 2022

Related to #5378; should be fixed by PR #5742.

Thank you for doing this. Could you post the stack traces on the issue related to the failure rather than on the PR that experienced the failure?

@ajfabbri (Contributor) commented Aug 12, 2022

Looks good. The only problem is that your force-push fix ended up in the wrong commit. I've done this before 😅.

If a leader election happened right before we verify topic watermark,
rpk may return an error such as "not leader for partition" due to stale
metadata.

Instead of failing immediately, this change makes the caller wait for
up to 1 minute so that the new leadership info is propagated to the
cluster and rpk returns the correct partition metadata.
@abhijat abhijat force-pushed the wait-for-topic-metadata-test-fast-3 branch from 4cd151e to d94b00a Compare August 12, 2022 06:44
@abhijat abhijat merged commit 97e8076 into redpanda-data:dev Aug 12, 2022
@abhijat (Contributor, Author) commented Aug 13, 2022

/backport v22.2.x

@abhijat (Contributor, Author) commented Aug 13, 2022

/backport v22.1.x

@mmedenjak mmedenjak added the area/cloud-storage Shadow indexing subsystem label Aug 15, 2022
Labels

area/cloud-storage (Shadow indexing subsystem), ci-failure, kind/bug (Something isn't working)
Development

Successfully merging this pull request may close these issues.

Assertion Error in TopicRecoveryTest.test_fast3