
cloud_storage/tests: wait for partition metadata during topic recovery tests #5757

Merged

Conversation

@abhijat abhijat commented Aug 1, 2022

Cover Letter

If a leader election happens right before we verify the topic watermark, rpk may return an error such as "not leader for partition" due to stale metadata.

Instead of failing immediately, this change makes the caller wait for up to one minute so that the new leadership information is propagated to the cluster and rpk returns the correct partition metadata.
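Roughly, the retry follows the sketch below, built on ducktape's wait_until helper. The method name, the expected_hwm parameter, and the exact readiness check are illustrative assumptions; only describe_topic, high_watermark, and the 60-second wait come from this PR's description.

from ducktape.utils.util import wait_until

def _verify_topic_watermark(self, topic_name, expected_hwm):
    # Poll rpk until every partition reports a usable high watermark.
    # While leadership metadata is still propagating after an election,
    # describe_topic() may fail or return incomplete rows, so both cases
    # are treated as "not ready yet" and retried.
    def _hwm_ready():
        try:
            hwms = [p.high_watermark
                    for p in self._rpk.describe_topic(topic_name)]
        except Exception:
            # e.g. "not leader for partition" caused by stale metadata
            return False
        if not hwms or any(h is None for h in hwms):
            return False
        return max(hwms) >= expected_hwm

    wait_until(_hwm_ready,
               timeout_sec=60,
               backoff_sec=1,
               err_msg=f"partition metadata for {topic_name} did not stabilize")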

In the failed test from the linked CI failure, the following log lines show the stale metadata returned during a leader election:

INFO  2022-07-29 18:08:38,744 [shard 1] raft - [group_id:1, {kafka/panda-topic-1/0}] vote_stm.cc:255 - becoming the leader term:5
INFO  2022-07-29 18:08:38,764 [shard 1] raft - [group_id:2, {kafka/panda-topic-1/1}] vote_stm.cc:255 - becoming the leader term:5
TRACE 2022-07-29 18:08:38,858 [shard 0] kafka - request_context.h:160 - [172.16.16.27:48330] sending 2:list_offsets response {throttle_time_ms=0 topics={{name={panda-topic-1} partitions={{partition_index=0 error_code={ error_code: not_leader_for_partition [6] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-1}}}}}
TRACE 2022-07-29 18:08:38,863 [shard 0] kafka - request_context.h:160 - [172.16.16.27:55628] sending 2:list_offsets response {throttle_time_ms=0 topics={{name={panda-topic-1} partitions={{partition_index=1 error_code={ error_code: not_leader_for_partition [6] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-1}}}}}
TRACE 2022-07-29 18:08:38,865 [shard 0] kafka - request_context.h:160 - [172.16.16.27:55628] sending 2:list_offsets response {throttle_time_ms=0 topics={{name={panda-topic-1} partitions={{partition_index=1 error_code={ error_code: not_leader_for_partition [6] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-1}}}}}
TRACE 2022-07-29 18:08:38,866 [shard 0] kafka - request_context.h:160 - [172.16.16.27:48330] sending 2:list_offsets response {throttle_time_ms=0 topics={{name={panda-topic-1} partitions={{partition_index=0 error_code={ error_code: not_leader_for_partition [6] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-1}}}}}
TRACE 2022-07-29 18:08:38,869 [shard 0] kafka - request_context.h:160 - [172.16.16.27:55628] sending 2:list_offsets response {throttle_time_ms=0 topics={{name={panda-topic-1} partitions={{partition_index=1 error_code={ error_code: not_leader_for_partition [6] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-1}}}}}
TRACE 2022-07-29 18:08:38,869 [shard 0] kafka - request_context.h:160 - [172.16.16.27:48330] sending 2:list_offsets response {throttle_time_ms=0 topics={{name={panda-topic-1} partitions={{partition_index=0 error_code={ error_code: not_leader_for_partition [6] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-1}}}}}
INFO  2022-07-29 18:08:39,027 [shard 1] raft - [group_id:1, {kafka/panda-topic-1/0}] vote_stm.cc:270 - became the leader term:5
INFO  2022-07-29 18:08:39,033 [shard 1] raft - [group_id:2, {kafka/panda-topic-1/1}] vote_stm.cc:270 - became the leader term:5

Fixes #5737

Force push: fix capturing the variable in wait_until

Release Notes

Comment on lines 182 to 183
for partition_info in self._rpk.describe_topic(topic_name):
    hwm = partition_info.high_watermark
abhijat (Contributor, Author) commented:

Should we calculate the max/min/first non-None value of the watermark over all partitions, e.g. something like the sketch after this comment?

Right now we are effectively just taking the value from the last row that rpk produced.

cc @Lazin
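For illustration only, aggregating over all partitions could look like the hypothetical alternative below (taking the maximum non-None high watermark); the names follow the snippet quoted above, and this is not necessarily what the PR ended up doing.

hwms = [p.high_watermark
        for p in self._rpk.describe_topic(topic_name)
        if p.high_watermark is not None]
# Take the largest reported high watermark instead of whichever row came last.
hwm = max(hwms) if hwms else None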

A contributor replied:

We only need high watermark for this test.

@abhijat abhijat marked this pull request as draft August 1, 2022 10:34
Lazin previously approved these changes Aug 1, 2022

@Lazin (Contributor) left a comment:

LGTM (once it's no longer a draft PR)

@abhijat abhijat force-pushed the wait-for-topic-metadata-test-fast-3 branch from 3339469 to 4cd151e Compare August 1, 2022 14:58
@abhijat abhijat marked this pull request as ready for review August 1, 2022 15:00
@abhijat (Contributor, Author) commented Aug 1, 2022

LGTM (once it's no longer a draft PR)

I had a bug in the way the watermark was captured; the latest commit should fix it.
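The capture issue mentioned here is a common Python pitfall: the predicate passed to wait_until only returns a truthy/falsy value, so the observed watermark has to be written into the enclosing scope. One way to do that is sketched below with illustrative names; this is not necessarily the code in the commit.

# assumes: from ducktape.utils.util import wait_until
def _recover_and_verify(self, topic_name):
    hwm = None

    def _update_watermark():
        # Without 'nonlocal' (or a mutable container), assigning to hwm here
        # would rebind a new local variable instead of the outer one,
        # leaving the captured value as None after wait_until() returns.
        nonlocal hwm
        for partition_info in self._rpk.describe_topic(topic_name):
            hwm = partition_info.high_watermark
        return hwm is not None

    wait_until(_update_watermark, timeout_sec=60, backoff_sec=1,
               err_msg="rpk did not report a high watermark")
    return hwm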

@abhijat abhijat requested a review from Lazin August 1, 2022 15:01
@abhijat (Contributor, Author) commented Aug 2, 2022

CI errors:
https://buildkite.com/redpanda/redpanda/builds/13415#018259ea-dff0-4d4e-8c72-89c30874ce5f

test_id:    rptest.tests.controller_upgrade_test.ControllerUpgradeTest.test_updating_cluster_when_executing_operations
status:     FAIL
run time:   5 minutes 0.884 seconds
 
    <BadLogLines nodes=docker-rp-19(1) example="ERROR 2022-08-01 15:46:09,378 [shard 1] r/heartbeat - heartbeat_manager.cc:284 - cannot find consensus group:19">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 48, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1121, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-19(1) example="ERROR 2022-08-01 15:46:09,378 [shard 1] r/heartbeat - heartbeat_manager.cc:284 - cannot find consensus group:19">

Related to #5378; should be fixed by PR #5742.

test_id:    rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=False.recovery=restart_recovery
status:     FAIL
run time:   3 minutes 30.828 seconds
 
    <BadLogLines nodes=docker-rp-11(1) example="ERROR 2022-08-01 15:53:16,410 [shard 1] rpc - transport.h:180 - Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2 reply type raft::append_entries_reply">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 48, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/root/tests/rptest/services/redpanda.py", line 1121, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-11(1) example="ERROR 2022-08-01 15:53:16,410 [shard 1] rpc - transport.h:180 - Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2 reply type raft::append_entries_reply">

This seems to be an instance of #5608: "Protocol violation: request version rpc::transport_version::v1 incompatible with reply version rpc::transport_version::v2 reply type raft::append_entries_reply".

@dotnwat (Member) commented Aug 2, 2022

Related to #5378; should be fixed by PR #5742.

Thank you for doing this. Could you post the stack traces on the issue related to the failure rather than on the PR that experienced the failure?

@ajfabbri (Contributor) commented Aug 12, 2022

Looks good. The only problem is that your force-push fix ended up in the wrong commit. I've done this before 😅.

If a leader election happened right before we verify topic watermark,
rpk may return an error such as "not leader for partition" due to stale
metadata.

Instead of failing immediately, this change makes the caller wait for
up to 1 minute so that the new leadership info is propagated to the
cluster and rpk returns the correct partition metadata.
@abhijat abhijat force-pushed the wait-for-topic-metadata-test-fast-3 branch from 4cd151e to d94b00a Compare August 12, 2022 06:44
@abhijat abhijat merged commit 97e8076 into redpanda-data:dev Aug 12, 2022
@abhijat (Contributor, Author) commented Aug 13, 2022

/backport v22.2.x

@abhijat (Contributor, Author) commented Aug 13, 2022

/backport v22.1.x

@mmedenjak mmedenjak added the area/cloud-storage Shadow indexing subsystem label Aug 15, 2022
Labels

area/cloud-storage (Shadow indexing subsystem), ci-failure, kind/bug (Something isn't working)
Development

Successfully merging this pull request may close these issues.

Assertion Error in TopicRecoveryTest.test_fast3