Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

operator: move centralized-configuration Kuttl tests to main bucket. #5609

Merged
merged 1 commit into from
Mar 13, 2023

Conversation

nicolaferraro
Copy link
Member

Centralized configuration and related tests seem to be stable.

cc: @jcsp

@nicolaferraro nicolaferraro requested a review from a team as a code owner July 25, 2022 09:12
@mmedenjak mmedenjak added kind/enhance New feature or request area/tests labels Jul 26, 2022
RafalKorepta
RafalKorepta previously approved these changes Jul 27, 2022
@jcsp
Copy link
Contributor

jcsp commented Jul 27, 2022

What was the outcome of investigating the most recent failures of these? (https://redpandadata.slack.com/archives/C01H6JRQX1S/p1658740986316359?thread_ts=1658736130.531769&cid=C01H6JRQX1S)

If something is still up with them, let's fix it before we reinstate them.

@jcsp
Copy link
Contributor

jcsp commented Aug 3, 2022

k8s-unstable-tests had two failures in last 24h:
FAIL test: k8s-unstable-tests.k8s-unstable-tests (2/24 runs)
failure at 2022-08-02T09:37:39.714Z: None
in job https://buildkite.com/redpanda/redpanda/builds/13453#01825dce-2bd3-4300-92a8-483046638bce
failure at 2022-08-02T10:20:34.786Z: None
in job https://buildkite.com/redpanda/redpanda/builds/13454#01825e11-5854-46fe-8699-4c617c6ce836

@nicolaferraro it would be good to get a signal on whether we have a bug here before we release 22.2

@nicolaferraro
Copy link
Member Author

@jcsp the logs highlight a strange behavior of the configuration system in redpanda:

2022-08-02T09:32:34.479154062Z stderr F 2022-08-02T09:32:34.479Z	INFO	controllers.redpanda.Cluster	Applying patch to the cluster configuration	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration", "patch": "+append_chunk_size"}
2022-08-02T09:32:34.480731898Z stderr F 2022-08-02T09:32:34.480Z	INFO	controllers.redpanda.Cluster	Patch written to the cluster	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration", "config_version": 4}
2022-08-02T09:32:34.481066838Z stderr F 2022-08-02T09:32:34.481Z	INFO	controllers.redpanda.Cluster	Centralized configuration hash has changed	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.488032683Z stderr F 2022-08-02T09:32:34.487Z	INFO	controllers.redpanda.Cluster	Node 0 restart status is false	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.488039338Z stderr F 2022-08-02T09:32:34.487Z	INFO	controllers.redpanda.Cluster	Node 1 restart status is false	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.488041858Z stderr F 2022-08-02T09:32:34.488Z	INFO	controllers.redpanda.Cluster	Node 0 is using config version 3	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.48804441Z stderr F 2022-08-02T09:32:34.488Z	INFO	controllers.redpanda.Cluster	Node 1 is using config version 3	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}

It's a 2 replicas cluster and the flow can be read as:

  • Operator sets a different value for append_chuck_size, the cluster returns configuration version 4
  • Operator then asks both nodes which configuration are they using and if they need restart, and they both say 3 and no need for restart

So, there might be some error in the way the query is performed, or redpanda changed the way to handle these cases of configuration changes. Wdyt?

@nicolaferraro
Copy link
Member Author

The information about which config is every node using comes from /v1/cluster_config/status, sent explicitly to the leader.

@jcsp
Copy link
Contributor

jcsp commented Aug 4, 2022

So, there might be some error in the way the query is performed, or redpanda changed the way to handle these cases of configuration changes. Wdyt?

Update of status is asynchronous: there is no guarantee that the version in the status will reflect the version in the response from the PUT. It's because status updates are themselves persistent writes to the controller log, separate to the write that updates the configuration. We could make the API a bit friendlier by waiting for status updates inside the PUT handler, but that would not be 100% reliable either because it's possible for controller to lose leadership between writing the config update and writing the status update.

For testing on existing code, the solution is to have a retry-wait for the status, rather than expecting it to be updated synchronously.

Making this a bit friendlier in the API is #5833

@nicolaferraro
Copy link
Member Author

I remember we did some changes in the v22.1 branch to direct calls to the leader since it was supposed to apply the configuration before returning. The current operator code requires that there's some consistency between the two calls to save-config and get-config..

@jcsp
Copy link
Contributor

jcsp commented Aug 4, 2022

Getting the config on the leader after setting it on the leader is synchronous, it's just the status specifically that's asynchonous.

jcsp added a commit to jcsp/redpanda that referenced this pull request Aug 4, 2022
Previously, after writing a config update, API clients could do
a /status query to the same node and not see any nodes (including
the leader that they just PUT to) reflect the new version.

With this change, if the client is talking to the controller leader,
it will reliably see the new config version reflected in the /status
result when querying the same node again after a PUT.

This is a little subtle and later we should make simpler rules
for this via a higher level "wait for status updates" as part
of the PUT call itself: redpanda-data#5833

Related: redpanda-data#5609
@jcsp
Copy link
Contributor

jcsp commented Aug 4, 2022

#5835

vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this pull request Aug 16, 2022
Previously, after writing a config update, API clients could do
a /status query to the same node and not see any nodes (including
the leader that they just PUT to) reflect the new version.

With this change, if the client is talking to the controller leader,
it will reliably see the new config version reflected in the /status
result when querying the same node again after a PUT.

This is a little subtle and later we should make simpler rules
for this via a higher level "wait for status updates" as part
of the PUT call itself: redpanda-data#5833

Related: redpanda-data#5609
(cherry picked from commit 6ba1128)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this pull request Aug 16, 2022
Previously, after writing a config update, API clients could do
a /status query to the same node and not see any nodes (including
the leader that they just PUT to) reflect the new version.

With this change, if the client is talking to the controller leader,
it will reliably see the new config version reflected in the /status
result when querying the same node again after a PUT.

This is a little subtle and later we should make simpler rules
for this via a higher level "wait for status updates" as part
of the PUT call itself: redpanda-data#5833

Related: redpanda-data#5609
(cherry picked from commit 6ba1128)
felixguendling pushed a commit to felixguendling/redpanda that referenced this pull request Aug 22, 2022
Previously, after writing a config update, API clients could do
a /status query to the same node and not see any nodes (including
the leader that they just PUT to) reflect the new version.

With this change, if the client is talking to the controller leader,
it will reliably see the new config version reflected in the /status
result when querying the same node again after a PUT.

This is a little subtle and later we should make simpler rules
for this via a higher level "wait for status updates" as part
of the PUT call itself: redpanda-data#5833

Related: redpanda-data#5609
@joejulian
Copy link
Contributor

Needs rebase

Centralized configuration and related tests seem to be stable.
Copy link
Contributor

@RafalKorepta RafalKorepta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, tests is passing. I will merge as the upgrade tests are failing. The #9400 will address that.

@RafalKorepta RafalKorepta merged commit d9ee027 into redpanda-data:dev Mar 13, 2023
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants