Failure in RetentionPolicyTest.test_changing_topic_retention_with_restart #2406

Closed
jcsp opened this issue Sep 23, 2021 · 11 comments · Fixed by #2428, #2477 or #5331

Comments

@jcsp
Contributor

jcsp commented Sep 23, 2021

Example failure: https://buildkite.com/vectorized/redpanda/builds/2496#a4a3cbd2-d134-4234-a7d3-6027cf50801a

The test tries to modify a topic config immediately after a cluster restart and the server rejects it.

[DEBUG - 2021-09-22 21:37:56,346 - kafka_cli_tools - _execute - lineno:231]: Error (1) executing command: Error while executing config command with args '--bootstrap-server docker_n_18:9092,docker_n_8:9092,docker_n_18:9092,docker_n_14:9092,docker_n_8:9092,docker_n_14:9092 --topic topic-juga --alter --add-config retention.bytes=15728640'
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request.
	at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
	at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
	at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:104)
	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
	at kafka.admin.ConfigCommand$.alterConfig(ConfigCommand.scala:334)
	at kafka.admin.ConfigCommand$.processCommand(ConfigCommand.scala:301)
	at kafka.admin.ConfigCommand$.main(ConfigCommand.scala:96)
	at kafka.admin.ConfigCommand.main(ConfigCommand.scala)
Caused by: org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request.
TRACE 2021-09-22 21:37:56,014 [shard 0] kafka - requests.cc:87 - Processing name:incremental_alter_configs, key:44, version:0 for adminclient-1
TRACE 2021-09-22 21:37:56,014 [shard 0] kafka - Handling request {resources={{resource_type=2 resource_name=topic-juga configs={{name=retention.bytes config_operation=0 value={15728640}}}}} validate_only=false}
TRACE 2021-09-22 21:37:56,015 [shard 0] kafka - request_context.h:150 - sending 44:incremental_alter_configs response {throttle_time_ms=0 responses={{error_code={ error_code: unknown_server_error [-1] } error_message={nullopt} resource_type=2 resource_name=topic-juga}}}

About 200 ms later, the node that rejected the client request learns of the new controller leader:

TRACE 2021-09-22 21:37:56,210 [shard 0] rpc - server.cc:147 - vectorized internal rpc protocol - Incoming connection from 172.19.0.18:61425 on ""
TRACE 2021-09-22 21:37:56,212 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1427 - Append entries request: {raft_group:{0}, commit_index:{13}, term:{2}, prev_log_index:{13}, prev_log_term:{2}}
DEBUG 2021-09-22 21:37:56,212 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1447 - Append entries request term:2 is greater than current: 1. Setting new term
TRACE 2021-09-22 21:37:56,212 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1427 - Append entries request: {raft_group:{0}, commit_index:{13}, term:{2}, prev_log_index:{13}, prev_log_term:{2}}
INFO  2021-09-22 21:37:56,212 [shard 0] cluster - leader_balancer.cc:92 - Leader balancer: controller leadership lost
jcsp added a commit to jcsp/redpanda that referenced this issue Sep 23, 2021
Pending fix for redpanda-data#2406

Signed-off-by: John Spray <jcs@vectorized.io>
@jcsp
Contributor Author

jcsp commented Sep 23, 2021

Thinking about this more, maybe this is not just a test behavior issue -- shouldn't redpanda handle the request even if the controller group is mid-election? Maybe the redpanda node should block the request until the controller group is ready to service it.

@jcsp
Contributor Author

jcsp commented Sep 23, 2021

This test is now disabled in dev; the PR that fixes it should re-enable the test too.

@twmb
Contributor

twmb commented Sep 24, 2021

Today, on a build that I think was rebased on dev:
https://buildkite.com/vectorized/redpanda/builds/2549#149ae9cd-730b-4aa6-b612-9038b39fd828

test_id:    rptest.tests.retention_policy_test.RetentionPolicyTest.test_changing_topic_retention_with_restart
status:     FAIL
run time:   2 minutes 53.115 seconds

TimeoutError('Segments were not removed')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.8/dist-packages/ducktape/tests/runner_client.py", line 215, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/tests/retention_policy_test.py", line 123, in test_changing_topic_retention_with_restart
    self._wait_for_segments_removal(self.topic, 0, 16)
  File "/root/tests/rptest/tests/retention_policy_test.py", line 176, in _wait_for_segments_removal
    wait_until(done,
  File "/usr/local/lib/python3.8/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Segments were not removed

I may have forgotten to pull upstream, though, so it's possible this build was missing the change that disabled the test.

jcsp added a commit to jcsp/redpanda that referenced this issue Sep 24, 2021
Related: redpanda-data#2406

Signed-off-by: John Spray <jcs@vectorized.io>
BenPope added a commit to BenPope/redpanda that referenced this issue Sep 24, 2021
Instead of returning `unknown_server_error`, return `not_controller`

This is a retriable error code, with metadata refresh.

Fixes redpanda-data#2406

Signed-off-by: Ben Pope <ben@vectorized.io>
@jcsp
Contributor Author

jcsp commented Sep 24, 2021

Thanks for spotting that @twmb. The test was indeed still enabled; it turns out the ignore decorator is twitchy about being called with no arguments (#2427).

jcsp pushed a commit that referenced this issue Sep 24, 2021
Instead of returning `unknown_server_error`, return `not_controller`

This is a retriable error code, with metadata refresh.

Fixes #2406

Signed-off-by: Ben Pope <ben@vectorized.io>
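For context, `not_controller` is one of the error codes Kafka clients treat as retriable after a metadata refresh. A minimal sketch of that client-side pattern, in Python, with a placeholder admin client and exception class (neither is a real client API; they only illustrate the retry-after-refresh loop that `not_controller` enables and `unknown_server_error` does not):

import time

class NotControllerError(Exception):
    """Stand-in for a 'this broker is not the controller' response."""

def alter_config_with_retry(admin, topic, configs, attempts=10, backoff_s=0.5):
    # On not_controller, refresh metadata to find the new controller and
    # retry, instead of failing hard as unknown_server_error forces.
    for attempt in range(attempts):
        try:
            admin.refresh_metadata()                                 # placeholder call
            return admin.incremental_alter_configs(topic, configs)  # placeholder call
        except NotControllerError:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff_s)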
@ivotron
Member

ivotron commented Sep 29, 2021

This has reappeared in https://buildkite.com/vectorized/redpanda/builds/2706#0bfa0f0f-86a4-473f-a3eb-bea8d0699aa6. Should we reopen this or open a new issue?

@BenPope BenPope reopened this Sep 29, 2021
@jcsp
Contributor Author

jcsp commented Sep 29, 2021

Thanks for reopening - here's another failure example from test-staging (https://buildkite.com/vectorized/redpanda/builds/2703#9e549b84-79ef-482a-8db9-4c38b68aa34a).

@BenPope
Member

BenPope commented Sep 29, 2021

#2428 has worked, but it looks like kafka-configs.sh doesn't retry:

java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NotControllerException: This is not the correct controller for this cluster.
	at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
	at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
	at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:104)
	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
	at kafka.admin.ConfigCommand$.alterConfig(ConfigCommand.scala:334)
	at kafka.admin.ConfigCommand$.processCommand(ConfigCommand.scala:301)
	at kafka.admin.ConfigCommand$.main(ConfigCommand.scala:96)
	at kafka.admin.ConfigCommand.main(ConfigCommand.scala)
Caused by: org.apache.kafka.common.errors.NotControllerException: This is not the correct controller for this cluster.

jcsp added a commit to jcsp/redpanda that referenced this issue Sep 29, 2021
Pending redpanda-data#2406

Signed-off-by: John Spray <jcs@vectorized.io>
BenPope added a commit to BenPope/redpanda that referenced this issue Sep 29, 2021
PR redpanda-data#2428 changed the returned error code so that the error was
retriable with a metadata refresh, but kafka-configs.sh doesn't retry.

Call describe_topic to wait for the controller, as it performs a retry.

Fix redpanda-data#2406

Signed-off-by: Ben Pope <ben@vectorized.io>
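A minimal sketch of that workaround from the test side, assuming a KafkaCliTools-style helper whose describe_topic() retries internally until the controller answers (the method names here are illustrative, not the exact rptest API):

def set_retention_after_restart(kafka_tools, topic, retention_bytes):
    # describe_topic() is assumed to retry until the controller responds,
    # so a successful return means the cluster can accept config changes.
    kafka_tools.describe_topic(topic)
    # The alter-config call (which kafka-configs.sh will not retry) should
    # now no longer hit a not_controller error.
    kafka_tools.alter_topic_config(topic, {"retention.bytes": retention_bytes})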
jcsp added a commit to jcsp/redpanda that referenced this issue Sep 29, 2021
Pending redpanda-data#2406

Signed-off-by: John Spray <jcs@vectorized.io>
@r-vasquez
Contributor

Seeing

TimeoutError('Segments were not removed')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/retention_policy_test.py", line 99, in test_changing_topic_retention_with_restart
    wait_for_segments_removal(redpanda=self.redpanda,
  File "/root/tests/rptest/util.py", line 156, in wait_for_segments_removal
    wait_until(done,
  File "/root/tests/rptest/util.py", line 72, in wait_until
    raise TimeoutError(
ducktape.errors.TimeoutError: Segments were not removed

In: https://buildkite.com/redpanda/redpanda/builds/11813#0181ad2c-892a-4d2e-a49e-3e5a71f4cc95/1602-8550

Should we reopen this or create a new issue?

@twmb
Contributor

twmb commented Jun 29, 2022

Reopening

@jcsp
Contributor Author

jcsp commented Jul 1, 2022

(Note: this appears to be a totally different failure to the one this ticket was originally for)

jcsp added a commit to jcsp/redpanda that referenced this issue Jul 4, 2022
This test was assuming that segment sizes apply exactly,
but they are actually subject to a +/- 5% jitter.  To reliably
remove the expected number of segments, we must set our retention
bytes to 5% less than the amount we really expect to retain.

Fixes redpanda-data#2406
@jcsp
Contributor Author

jcsp commented Jul 4, 2022

Dissecting most recent failure (https://buildkite.com/redpanda/redpanda/builds/12024#0181c5f8-c7d2-4e43-a227-5d7d8020d14c)

I see that all nodes have prefix-truncated their logs, and nodes 1, 2 and 3 have 17, 15 and 16 segments remaining respectively. The test asserts that at most 16 segments may remain. Node 1 has a slightly lower start offset on its oldest segment than the others.

This may just be the segment size jitter conflicting with the test's expectation of a deterministic number of segments remaining after setting a lower retention. It makes sense that this would affect the _with_restart test and not the others in the same class, because those tests use smaller segment counts, where the probability of cumulative jitter having this effect is much lower.
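A rough worked example of the adjustment the fix applies (the nominal segment size and segment count below are illustrative, not the test's actual values):

segment_size = 1024 * 1024   # nominal bytes per segment (illustrative)
segments_to_retain = 15      # segments expected to survive retention (illustrative)

# Segment sizes carry +/- 5% jitter: if segments come out up to 5% smaller
# than nominal, a nominal retention limit would keep one segment too many.
# Setting retention 5% below the nominal total keeps the retained count
# bounded at segments_to_retain.
retention_bytes = int(segments_to_retain * segment_size * 0.95)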

jcsp added a commit to jcsp/redpanda that referenced this issue Jul 4, 2022
This test was assuming that segment sizes apply exactly,
but they are actually subject to a +/- 5% jitter.  To reliably
remove the expected number of segments, we must set our retention
bytes to 5% less than the amount we really expect to retain.

Fixes redpanda-data#2406
@jcsp jcsp assigned jcsp and unassigned BenPope Jul 4, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Jul 4, 2022
This test was assuming that segment sizes apply exactly,
but they are actually subject to a +/- 5% jitter.  To reliably
remove the expected number of segments, we must set our retention
bytes to 5% less than the amount we really expect to retain.

Fixes redpanda-data#2406
@jcsp jcsp closed this as completed in #5331 Jul 4, 2022