Failure in RetentionPolicyTest.test_changing_topic_retention_with_restart #2406

Closed
jcsp opened this issue Sep 23, 2021 · 11 comments · Fixed by #2428, #2477 or #5331

Comments

@jcsp
Contributor

jcsp commented Sep 23, 2021

Example failure: https://buildkite.com/vectorized/redpanda/builds/2496#a4a3cbd2-d134-4234-a7d3-6027cf50801a

The test tries to modify a topic config immediately after a cluster restart and the server rejects it.

[DEBUG - 2021-09-22 21:37:56,346 - kafka_cli_tools - _execute - lineno:231]: Error (1) executing command: Error while executing config command with args '--bootstrap-server docker_n_18:9092,docker_n_8:9092,docker_n_18:9092,docker_n_14:9092,docker_n_8:9092,docker_n_14:9092 --topic topic-juga --alter --add-config retention.bytes=15728640'
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request.
	at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
	at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
	at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:104)
	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
	at kafka.admin.ConfigCommand$.alterConfig(ConfigCommand.scala:334)
	at kafka.admin.ConfigCommand$.processCommand(ConfigCommand.scala:301)
	at kafka.admin.ConfigCommand$.main(ConfigCommand.scala:96)
	at kafka.admin.ConfigCommand.main(ConfigCommand.scala)
Caused by: org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request.
TRACE 2021-09-22 21:37:56,014 [shard 0] kafka - requests.cc:87 - Processing name:incremental_alter_configs, key:44, version:0 for adminclient-1
TRACE 2021-09-22 21:37:56,014 [shard 0] kafka - Handling request {resources={{resource_type=2 resource_name=topic-juga configs={{name=retention.bytes config_operation=0 value={15728640}}}}} validate_only=false}
TRACE 2021-09-22 21:37:56,015 [shard 0] kafka - request_context.h:150 - sending 44:incremental_alter_configs response {throttle_time_ms=0 responses={{error_code={ error_code: unknown_server_error [-1] } error_message={nullopt} resource_type=2 resource_name=topic-juga}}}

About 200 ms later, the node that rejected the client request learns of the new controller leader:

TRACE 2021-09-22 21:37:56,210 [shard 0] rpc - server.cc:147 - vectorized internal rpc protocol - Incoming connection from 172.19.0.18:61425 on ""
TRACE 2021-09-22 21:37:56,212 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1427 - Append entries request: {raft_group:{0}, commit_index:{13}, term:{2}, prev_log_index:{13}, prev_log_term:{2}}
DEBUG 2021-09-22 21:37:56,212 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1447 - Append entries request term:2 is greater than current: 1. Setting new term
TRACE 2021-09-22 21:37:56,212 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1427 - Append entries request: {raft_group:{0}, commit_index:{13}, term:{2}, prev_log_index:{13}, prev_log_term:{2}}
INFO  2021-09-22 21:37:56,212 [shard 0] cluster - leader_balancer.cc:92 - Leader balancer: controller leadership lost
jcsp added a commit to jcsp/redpanda that referenced this issue Sep 23, 2021
Pending fix for redpanda-data#2406

Signed-off-by: John Spray <jcs@vectorized.io>
@jcsp
Contributor Author

jcsp commented Sep 23, 2021

Thinking about this more, maybe this is not just a test behavior issue -- shouldn't redpanda handle the request even if the controller group is mid-election? Maybe the redpanda node should block the request until the controller group is ready to service it.

@jcsp
Contributor Author

jcsp commented Sep 23, 2021

This test is now disabled in dev; the PR that fixes it should re-enable the test too.

@twmb
Contributor

twmb commented Sep 24, 2021

Today, on a build that I think was rebased on dev:
https://buildkite.com/vectorized/redpanda/builds/2549#149ae9cd-730b-4aa6-b612-9038b39fd828

test_id:    rptest.tests.retention_policy_test.RetentionPolicyTest.test_changing_topic_retention_with_restart
status:     FAIL
run time:   2 minutes 53.115 seconds

TimeoutError('Segments were not removed')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/tests/retention_policy_test.py", line 123, in test_changing_topic_retention_with_restart
    self._wait_for_segments_removal(self.topic, 0, 16)
  File "/root/tests/rptest/tests/retention_policy_test.py", line 176, in _wait_for_segments_removal
    wait_until(done,
  File "/usr/local/lib/python3.8/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Segments were not removed

I may have forgotten to pull upstream, though, so it's possible this build was missing the change that disabled the test.

jcsp added a commit to jcsp/redpanda that referenced this issue Sep 24, 2021
Related: redpanda-data#2406

Signed-off-by: John Spray <jcs@vectorized.io>
BenPope added a commit to BenPope/redpanda that referenced this issue Sep 24, 2021
Instead of returning `unknown_server_error`, return `not_controller`

This is a retriable error code, with metadata refresh.

Fixes redpanda-data#2406

Signed-off-by: Ben Pope <ben@vectorized.io>
@jcsp
Contributor Author

jcsp commented Sep 24, 2021

Thanks for spotting that @twmb. The test was indeed still enabled; it turns out the ignore decorator is twitchy about being called with no arguments (#2427).

jcsp pushed a commit that referenced this issue Sep 24, 2021
Instead of returning `unknown_server_error`, return `not_controller`

This is a retriable error code, with metadata refresh.

Fixes #2406

Signed-off-by: Ben Pope <ben@vectorized.io>
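For context, `not_controller` is one of the error codes Kafka clients treat as retriable after a metadata refresh. A minimal sketch of that client-side pattern, in Python, with a placeholder admin client and exception class (neither is a real client API; they only illustrate the retry-after-refresh loop that `not_controller` enables and `unknown_server_error` does not):

import time

class NotControllerError(Exception):
    """Stand-in for a 'this broker is not the controller' response."""

def alter_config_with_retry(admin, topic, configs, attempts=10, backoff_s=0.5):
    # On not_controller, refresh metadata to find the new controller and
    # retry, instead of failing hard as unknown_server_error forces.
    for attempt in range(attempts):
        try:
            admin.refresh_metadata()                                 # placeholder call
            return admin.incremental_alter_configs(topic, configs)  # placeholder call
        except NotControllerError:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff_s)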
@ivotron
Member

ivotron commented Sep 29, 2021

This has reappeared in https://buildkite.com/vectorized/redpanda/builds/2706#0bfa0f0f-86a4-473f-a3eb-bea8d0699aa6. Should we reopen this or open a new issue?

@BenPope BenPope reopened this Sep 29, 2021
@jcsp
Contributor Author

jcsp commented Sep 29, 2021

Thanks for reopening - here's another failure example from test-staging (https://buildkite.com/vectorized/redpanda/builds/2703#9e549b84-79ef-482a-8db9-4c38b68aa34a).

@BenPope
Member

BenPope commented Sep 29, 2021

#2428 has worked, but it looks like kafka-configs.sh doesn't retry:

java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NotControllerException: This is not the correct controller for this cluster.
	at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
	at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
	at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:104)
	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
	at kafka.admin.ConfigCommand$.alterConfig(ConfigCommand.scala:334)
	at kafka.admin.ConfigCommand$.processCommand(ConfigCommand.scala:301)
	at kafka.admin.ConfigCommand$.main(ConfigCommand.scala:96)
	at kafka.admin.ConfigCommand.main(ConfigCommand.scala)
Caused by: org.apache.kafka.common.errors.NotControllerException: This is not the correct controller for this cluster.

jcsp added a commit to jcsp/redpanda that referenced this issue Sep 29, 2021
Pending redpanda-data#2406

Signed-off-by: John Spray <jcs@vectorized.io>
BenPope added a commit to BenPope/redpanda that referenced this issue Sep 29, 2021
PR redpanda-data#2428 changed the returned error code so that the error was
retriable with a metadata refresh, but kafka-configs.sh doesn't retry.

Call describe_topic to wait for the controller, as it performs a retry.

Fix redpanda-data#2406

Signed-off-by: Ben Pope <ben@vectorized.io>
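A minimal sketch of that workaround from the test side, assuming a KafkaCliTools-style helper whose describe_topic() retries internally until the controller answers (the method names here are illustrative, not the exact rptest API):

def set_retention_after_restart(kafka_tools, topic, retention_bytes):
    # describe_topic() is assumed to retry until the controller responds,
    # so a successful return means the cluster can accept config changes.
    kafka_tools.describe_topic(topic)
    # The alter-config call (which kafka-configs.sh will not retry) should
    # now no longer hit a not_controller error.
    kafka_tools.alter_topic_config(topic, {"retention.bytes": retention_bytes})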
jcsp added a commit to jcsp/redpanda that referenced this issue Sep 29, 2021
Pending redpanda-data#2406

Signed-off-by: John Spray <jcs@vectorized.io>
@r-vasquez
Contributor

Seeing

TimeoutError('Segments were not removed')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/retention_policy_test.py", line 99, in test_changing_topic_retention_with_restart
    wait_for_segments_removal(redpanda=self.redpanda,
  File "/root/tests/rptest/util.py", line 156, in wait_for_segments_removal
    wait_until(done,
  File "/root/tests/rptest/util.py", line 72, in wait_until
    raise TimeoutError(
ducktape.errors.TimeoutError: Segments were not removed

In: https://buildkite.com/redpanda/redpanda/builds/11813#0181ad2c-892a-4d2e-a49e-3e5a71f4cc95/1602-8550

Should we reopen this or create a new issue?

@twmb
Contributor

twmb commented Jun 29, 2022

Reopening

@jcsp
Contributor Author

jcsp commented Jul 1, 2022

(Note: this appears to be a totally different failure to the one this ticket was originally for)

jcsp added a commit to jcsp/redpanda that referenced this issue Jul 4, 2022
This test was assuming that segment sizes apply exactly,
but they are actually subject to a +/- 5% jitter.  To reliably
remove the expected number of segments, we must set our retention
bytes to 5% less than the amount we really expect to retain.

Fixes redpanda-data#2406
@jcsp
Contributor Author

jcsp commented Jul 4, 2022

Dissecting most recent failure (https://buildkite.com/redpanda/redpanda/builds/12024#0181c5f8-c7d2-4e43-a227-5d7d8020d14c)

I see that all nodes have prefix-truncated their logs, and nodes 1, 2 and 3 have 17, 15 and 16 segments remaining respectively. The test asserts that at most 16 segments may remain. Node 1 has a slightly lower start offset on its oldest segment than the others.

This may just be the segment size jitter conflicting with the test's expectation of a deterministic number of segments remaining after setting a lower retention. It makes sense that this would affect the _with_restart test and not the others in the same class, because those tests use smaller segment counts, where the probability of cumulative jitter having this effect is much lower.
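A rough worked example of the adjustment the fix applies (the nominal segment size and segment count below are illustrative, not the test's actual values):

segment_size = 1024 * 1024   # nominal bytes per segment (illustrative)
segments_to_retain = 15      # segments expected to survive retention (illustrative)

# Segment sizes carry +/- 5% jitter: if segments come out up to 5% smaller
# than nominal, a nominal retention limit would keep one segment too many.
# Setting retention 5% below the nominal total keeps the retained count
# bounded at segments_to_retain.
retention_bytes = int(segments_to_retain * segment_size * 0.95)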

jcsp added a commit to jcsp/redpanda that referenced this issue Jul 4, 2022
This test was assuming that segment sizes apply exactly,
but they are actually subject to a +/- 5% jitter.  To reliably
remove the expected number of segments, we must set our retention
bytes to 5% less than the amount we really expect to retain.

Fixes redpanda-data#2406
@jcsp jcsp assigned jcsp and unassigned BenPope Jul 4, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jul 4, 2022
This test was assuming that segment sizes apply exactly,
but they are actually subject to a +/- 5% jitter.  To reliably
remove the expected number of segments, we must set our retention
bytes to 5% less than the amount we really expect to retain.

Fixes redpanda-data#2406
@jcsp jcsp closed this as completed in #5331 Jul 4, 2022