CI Failure in ClusterConfigTest.test_invalid_settings_forced #6010

Closed
rystsov opened this issue Aug 13, 2022 · 4 comments · Fixed by #6048 or #6107

rystsov commented Aug 13, 2022

https://buildkite.com/redpanda/redpanda/builds/14096#0182953a-15df-4614-8678-071dc1799efd

Module: rptest.tests.cluster_config_test
Class:  ClusterConfigTest
Method: test_invalid_settings_forced
====================================================================================================
test_id:    rptest.tests.cluster_config_test.ClusterConfigTest.test_invalid_settings_forced
status:     FAIL
run time:   18.461 seconds

    AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cluster_config_test.py", line 404, in test_invalid_settings_forced
    assert n['invalid'] == [invalid_setting[0]]
AssertionError
rystsov added the kind/bug, area/tests, and ci-failure labels on Aug 13, 2022

rystsov commented Aug 13, 2022

_wait_for_version_sync uses rpk to check that all nodes have converged to a version. Each time rpk is invoked it picks a random node and checks that node's view of the cluster, so one node may believe that all nodes have converged while another node hasn't gotten the memo yet. That is exactly what happened here.

The test waited until docker-rp-19 reported that the cluster had converged to version 3

[DEBUG - 2022-08-13 04:56:29,151 - admin - _request - lineno:303]: Dispatching GET http://docker-rp-19:9644/v1/cluster_config/status
[DEBUG - 2022-08-13 04:56:29,152 - admin - _request - lineno:326]: Response OK, JSON: [{'node_id': 1, 'restart': False, 'config_version': 3, 'invalid': ['log_message_timestamp_type'], 'unknown': []}, {'node_id': 2, 'restart': False, 'config_version': 3, 'invalid': ['log_message_timestamp_type'], 'unknown': []}, {'node_id': 3, 'restart': False, 'config_version': 3, 'invalid': ['log_message_timestamp_type'], 'unknown': []}]

then the test asked docker-rp-18, which hadn't converged yet

[DEBUG - 2022-08-13 04:56:29,154 - admin - _request - lineno:303]: Dispatching GET http://docker-rp-18:9644/v1/cluster_config/status
[DEBUG - 2022-08-13 04:56:29,155 - admin - _request - lineno:326]: Response OK, JSON: [{'node_id': 1, 'restart': False, 'config_version': 3, 'invalid': ['log_message_timestamp_type'], 'unknown': []}, {'node_id': 2, 'restart': False, 'config_version': 2, 'invalid': [], 'unknown': []}, {'node_id': 3, 'restart': False, 'config_version': 2, 'invalid': [], 'unknown': []}]

and the assert failed.
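
To make the race concrete, here is a minimal sketch of the kind of check described above. It assumes only the /v1/cluster_config/status response shape and port visible in the logs; the helper name, the host list, and the direct use of requests are illustrative, not the actual rptest code:

import random
import requests

# Illustrative sketch of the racy convergence check: each poll asks one randomly
# chosen node for its view of the cluster config status, so a node that has
# already converged can report "everyone is at version 3" while a slower node
# still reports version 2 for its peers.
def cluster_looks_converged(hosts, target_version, admin_port=9644):
    host = random.choice(hosts)  # a different node may be picked on every call
    url = f"http://{host}:{admin_port}/v1/cluster_config/status"
    statuses = requests.get(url, timeout=5).json()
    return all(s['config_version'] >= target_version for s in statuses)

# docker-rp-19 can return True here while a follow-up request to docker-rp-18
# still reports config_version 2 and an empty 'invalid' list for its peers,
# which is exactly what the failed assert observed.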

jcsp self-assigned this on Aug 15, 2022

jcsp commented Aug 15, 2022

Fixed in #5972

jcsp added a commit to jcsp/redpanda that referenced this issue Aug 15, 2022
The wait_for_version_sync was correct for waiting for the configuration to propagate across the cluster, but it was _not_ correct for waiting for the configuration status to be symmetric on all nodes (i.e. for all nodes to know the status of all other nodes).

For test cases that query the status via arbitrary nodes, a stricter wait is needed.

This issue becomes visible once wait_until is improved to avoid spurious extra sleeps, in redpanda-data#6003.

Fixes redpanda-data#6010
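
For illustration, a stricter "symmetric" wait along the lines the commit message describes might look like the following sketch. It makes the same assumptions as the sketch above (admin endpoint shape, port 9644, direct HTTP polling) and is not the actual fix:

import time
import requests

# Illustrative sketch of a symmetric wait: every node's view of
# /v1/cluster_config/status must show all peers at the target config_version
# before the test reads 'invalid' from an arbitrarily chosen node.
def wait_for_symmetric_status(hosts, target_version, timeout_s=30, admin_port=9644):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        views = [
            requests.get(f"http://{h}:{admin_port}/v1/cluster_config/status",
                         timeout=5).json()
            for h in hosts
        ]
        if all(s['config_version'] >= target_version
               for view in views for s in view):
            return
        time.sleep(1)
    raise TimeoutError(f"config status not symmetric at version {target_version}")
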
jcsp added a commit to jcsp/redpanda that referenced this issue Aug 15, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Aug 16, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Aug 16, 2022
jcsp mentioned this issue on Aug 16, 2022
jcsp closed this as completed in b7d4f2c on Aug 16, 2022
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Aug 17, 2022
(cherry picked from commit b7d4f2c)

rystsov commented Aug 19, 2022

Tested on today's dev and it isn't fixed yet - https://buildkite.com/redpanda/redpanda/builds/14311#0182b1e9-cd31-49ee-8db4-40fa5404a865

rystsov reopened this on Aug 19, 2022

rystsov commented Aug 19, 2022

Another instance - https://buildkite.com/redpanda/redpanda/builds/14311#0182b1e9-cd33-421b-a908-d7400ba039df

It isn't exactly the same assertion, but it has the same root cause: the test picked a different node and that node still had old data.

test_id:    rptest.tests.cluster_config_test.ClusterConfigTest.test_invalid_settings_forced
status:     FAIL
run time:   28.218 seconds

    AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cluster_config_test.py", line 439, in test_invalid_settings_forced
    assert n['invalid'] == []
AssertionError

jcsp added a commit to jcsp/redpanda that referenced this issue Aug 19, 2022
felixguendling pushed a commit to felixguendling/redpanda that referenced this issue Aug 22, 2022
felixguendling pushed a commit to felixguendling/redpanda that referenced this issue Aug 22, 2022
This issue was closed.