operator: move centralized-configuration Kuttl tests to main bucket. #5609

nicolaferraro · 2022-07-25T09:12:33Z

Centralized configuration and related tests seem to be stable.

jcsp · 2022-07-27T14:19:41Z

What was the outcome of investigating the most recent failures of these? (https://redpandadata.slack.com/archives/C01H6JRQX1S/p1658740986316359?thread_ts=1658736130.531769&cid=C01H6JRQX1S)

If something is still up with them, let's fix it before we reinstate them.

jcsp · 2022-08-03T14:06:37Z

k8s-unstable-tests had two failures in last 24h:
FAIL test: k8s-unstable-tests.k8s-unstable-tests (2/24 runs)
failure at 2022-08-02T09:37:39.714Z: None
in job https://buildkite.com/redpanda/redpanda/builds/13453#01825dce-2bd3-4300-92a8-483046638bce
failure at 2022-08-02T10:20:34.786Z: None
in job https://buildkite.com/redpanda/redpanda/builds/13454#01825e11-5854-46fe-8699-4c617c6ce836

@nicolaferraro it would be good to get a signal on whether we have a bug here before we release 22.2

nicolaferraro · 2022-08-04T08:02:28Z

@jcsp the logs highlight a strange behavior of the configuration system in redpanda:

2022-08-02T09:32:34.479154062Z stderr F 2022-08-02T09:32:34.479Z	INFO	controllers.redpanda.Cluster	Applying patch to the cluster configuration	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration", "patch": "+append_chunk_size"}
2022-08-02T09:32:34.480731898Z stderr F 2022-08-02T09:32:34.480Z	INFO	controllers.redpanda.Cluster	Patch written to the cluster	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration", "config_version": 4}
2022-08-02T09:32:34.481066838Z stderr F 2022-08-02T09:32:34.481Z	INFO	controllers.redpanda.Cluster	Centralized configuration hash has changed	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.488032683Z stderr F 2022-08-02T09:32:34.487Z	INFO	controllers.redpanda.Cluster	Node 0 restart status is false	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.488039338Z stderr F 2022-08-02T09:32:34.487Z	INFO	controllers.redpanda.Cluster	Node 1 restart status is false	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.488041858Z stderr F 2022-08-02T09:32:34.488Z	INFO	controllers.redpanda.Cluster	Node 0 is using config version 3	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}
2022-08-02T09:32:34.48804441Z stderr F 2022-08-02T09:32:34.488Z	INFO	controllers.redpanda.Cluster	Node 1 is using config version 3	{"redpandacluster": "kuttl-test-pretty-robin/centralized-configuration"}

It's a 2 replicas cluster and the flow can be read as:

Operator sets a different value for append_chuck_size, the cluster returns configuration version 4
Operator then asks both nodes which configuration are they using and if they need restart, and they both say 3 and no need for restart

So, there might be some error in the way the query is performed, or redpanda changed the way to handle these cases of configuration changes. Wdyt?

nicolaferraro · 2022-08-04T08:06:24Z

The information about which config is every node using comes from /v1/cluster_config/status, sent explicitly to the leader.

jcsp · 2022-08-04T09:02:24Z

So, there might be some error in the way the query is performed, or redpanda changed the way to handle these cases of configuration changes. Wdyt?

Update of status is asynchronous: there is no guarantee that the version in the status will reflect the version in the response from the PUT. It's because status updates are themselves persistent writes to the controller log, separate to the write that updates the configuration. We could make the API a bit friendlier by waiting for status updates inside the PUT handler, but that would not be 100% reliable either because it's possible for controller to lose leadership between writing the config update and writing the status update.

For testing on existing code, the solution is to have a retry-wait for the status, rather than expecting it to be updated synchronously.

Making this a bit friendlier in the API is #5833

nicolaferraro · 2022-08-04T09:39:35Z

I remember we did some changes in the v22.1 branch to direct calls to the leader since it was supposed to apply the configuration before returning. The current operator code requires that there's some consistency between the two calls to save-config and get-config..

jcsp · 2022-08-04T10:01:01Z

Getting the config on the leader after setting it on the leader is synchronous, it's just the status specifically that's asynchonous.

Previously, after writing a config update, API clients could do a /status query to the same node and not see any nodes (including the leader that they just PUT to) reflect the new version. With this change, if the client is talking to the controller leader, it will reliably see the new config version reflected in the /status result when querying the same node again after a PUT. This is a little subtle and later we should make simpler rules for this via a higher level "wait for status updates" as part of the PUT call itself: redpanda-data#5833 Related: redpanda-data#5609

jcsp · 2022-08-04T10:40:32Z

#5835

Previously, after writing a config update, API clients could do a /status query to the same node and not see any nodes (including the leader that they just PUT to) reflect the new version. With this change, if the client is talking to the controller leader, it will reliably see the new config version reflected in the /status result when querying the same node again after a PUT. This is a little subtle and later we should make simpler rules for this via a higher level "wait for status updates" as part of the PUT call itself: redpanda-data#5833 Related: redpanda-data#5609 (cherry picked from commit 6ba1128)

Previously, after writing a config update, API clients could do a /status query to the same node and not see any nodes (including the leader that they just PUT to) reflect the new version. With this change, if the client is talking to the controller leader, it will reliably see the new config version reflected in the /status result when querying the same node again after a PUT. This is a little subtle and later we should make simpler rules for this via a higher level "wait for status updates" as part of the PUT call itself: redpanda-data#5833 Related: redpanda-data#5609

joejulian · 2023-03-01T17:01:55Z

Needs rebase

Centralized configuration and related tests seem to be stable.

RafalKorepta

LGTM, tests is passing. I will merge as the upgrade tests are failing. The #9400 will address that.

nicolaferraro requested a review from a team as a code owner July 25, 2022 09:12

github-actions bot added the area/k8s label Jul 25, 2022

mmedenjak added kind/enhance New feature or request area/tests labels Jul 26, 2022

RafalKorepta previously approved these changes Jul 27, 2022

View reviewed changes

jcsp mentioned this pull request Aug 4, 2022

config: make PUT to /cluster_config friendlier by waiting for status updates #5833

Open

jcsp mentioned this pull request Aug 4, 2022

admin: read-after-write consistency for config status on leader node #5835

Merged

5 tasks

joejulian added the needs rebase label Mar 1, 2023

RafalKorepta dismissed their stale review via ffd77e5 March 13, 2023 11:31

RafalKorepta force-pushed the move-test branch from c07797b to ffd77e5 Compare March 13, 2023 11:31

RafalKorepta added the k8s/tests label Mar 13, 2023

operator: move centralized-configuration Kuttl tests to main bucket.

fce80d7

Centralized configuration and related tests seem to be stable.

RafalKorepta force-pushed the move-test branch from ffd77e5 to fce80d7 Compare March 13, 2023 12:20

RafalKorepta approved these changes Mar 13, 2023

View reviewed changes

RafalKorepta merged commit d9ee027 into redpanda-data:dev Mar 13, 2023

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

operator: move centralized-configuration Kuttl tests to main bucket. #5609

operator: move centralized-configuration Kuttl tests to main bucket. #5609

nicolaferraro commented Jul 25, 2022

jcsp commented Jul 27, 2022

jcsp commented Aug 3, 2022

nicolaferraro commented Aug 4, 2022

nicolaferraro commented Aug 4, 2022

jcsp commented Aug 4, 2022 •

edited

Loading

nicolaferraro commented Aug 4, 2022

jcsp commented Aug 4, 2022

jcsp commented Aug 4, 2022

joejulian commented Mar 1, 2023

RafalKorepta left a comment

operator: move centralized-configuration Kuttl tests to main bucket. #5609

operator: move centralized-configuration Kuttl tests to main bucket. #5609

Conversation

nicolaferraro commented Jul 25, 2022

jcsp commented Jul 27, 2022

jcsp commented Aug 3, 2022

nicolaferraro commented Aug 4, 2022

nicolaferraro commented Aug 4, 2022

jcsp commented Aug 4, 2022 • edited Loading

nicolaferraro commented Aug 4, 2022

jcsp commented Aug 4, 2022

jcsp commented Aug 4, 2022

joejulian commented Mar 1, 2023

RafalKorepta left a comment

Choose a reason for hiding this comment

jcsp commented Aug 4, 2022 •

edited

Loading