CI Failure in ClusterConfigTest.test_invalid_settings_forced #6010

Closed
rystsov opened this issue Aug 13, 2022 · 4 comments · Fixed by #6048 or #6107

rystsov commented Aug 13, 2022

https://buildkite.com/redpanda/redpanda/builds/14096#0182953a-15df-4614-8678-071dc1799efd

Module: rptest.tests.cluster_config_test
Class:  ClusterConfigTest
Method: test_invalid_settings_forced
====================================================================================================
test_id:    rptest.tests.cluster_config_test.ClusterConfigTest.test_invalid_settings_forced
status:     FAIL
run time:   18.461 seconds

    AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cluster_config_test.py", line 404, in test_invalid_settings_forced
    assert n['invalid'] == [invalid_setting[0]]
AssertionError
rystsov added the kind/bug, area/tests, and ci-failure labels on Aug 13, 2022

rystsov commented Aug 13, 2022

_wait_for_version_sync uses rpk to check that all nodes have converged to a version. Each time rpk is invoked it picks a random node and checks that node's view of the cluster, so one node may believe that all nodes have converged while another node hasn't gotten the memo yet. That is exactly what happened here.

The test waited until docker-rp-19 reported that the cluster had converged to version 3

[DEBUG - 2022-08-13 04:56:29,151 - admin - _request - lineno:303]: Dispatching GET http://docker-rp-19:9644/v1/cluster_config/status
[DEBUG - 2022-08-13 04:56:29,152 - admin - _request - lineno:326]: Response OK, JSON: [{'node_id': 1, 'restart': False, 'config_version': 3, 'invalid': ['log_message_timestamp_type'], 'unknown': []}, {'node_id': 2, 'restart': False, 'config_version': 3, 'invalid': ['log_message_timestamp_type'], 'unknown': []}, {'node_id': 3, 'restart': False, 'config_version': 3, 'invalid': ['log_message_timestamp_type'], 'unknown': []}]

then the test asked docker-rp-18, which hadn't converged yet

[DEBUG - 2022-08-13 04:56:29,154 - admin - _request - lineno:303]: Dispatching GET http://docker-rp-18:9644/v1/cluster_config/status
[DEBUG - 2022-08-13 04:56:29,155 - admin - _request - lineno:326]: Response OK, JSON: [{'node_id': 1, 'restart': False, 'config_version': 3, 'invalid': ['log_message_timestamp_type'], 'unknown': []}, {'node_id': 2, 'restart': False, 'config_version': 2, 'invalid': [], 'unknown': []}, {'node_id': 3, 'restart': False, 'config_version': 2, 'invalid': [], 'unknown': []}]

and the assert failed.
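
To make the race concrete, here is a minimal sketch of the kind of check described above. It assumes only the /v1/cluster_config/status response shape and port visible in the logs; the helper name, the host list, and the direct use of requests are illustrative, not the actual rptest code:

import random
import requests

# Illustrative sketch of the racy convergence check: each poll asks one randomly
# chosen node for its view of the cluster config status, so a node that has
# already converged can report "everyone is at version 3" while a slower node
# still reports version 2 for its peers.
def cluster_looks_converged(hosts, target_version, admin_port=9644):
    host = random.choice(hosts)  # a different node may be picked on every call
    url = f"http://{host}:{admin_port}/v1/cluster_config/status"
    statuses = requests.get(url, timeout=5).json()
    return all(s['config_version'] >= target_version for s in statuses)

# docker-rp-19 can return True here while a follow-up request to docker-rp-18
# still reports config_version 2 and an empty 'invalid' list for its peers,
# which is exactly what the failed assert observed.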

jcsp self-assigned this on Aug 15, 2022

jcsp commented Aug 15, 2022

Fixed in #5972

jcsp added a commit to jcsp/redpanda that referenced this issue Aug 15, 2022
The wait_for_version_sync was correct for waiting for the configuration to propagate across the cluster, but it was _not_ correct for waiting for the configuration status to be symmetric on all nodes (i.e. for all nodes to know the status of all other nodes).

For test cases that query the status via arbitrary nodes, a stricter wait is needed.

This issue becomes visible once wait_until is improved to avoid spurious extra sleeps, in redpanda-data#6003.

Fixes redpanda-data#6010
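
For illustration, a stricter "symmetric" wait along the lines the commit message describes might look like the following sketch. It makes the same assumptions as the sketch above (admin endpoint shape, port 9644, direct HTTP polling) and is not the actual fix:

import time
import requests

# Illustrative sketch of a symmetric wait: every node's view of
# /v1/cluster_config/status must show all peers at the target config_version
# before the test reads 'invalid' from an arbitrarily chosen node.
def wait_for_symmetric_status(hosts, target_version, timeout_s=30, admin_port=9644):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        views = [
            requests.get(f"http://{h}:{admin_port}/v1/cluster_config/status",
                         timeout=5).json()
            for h in hosts
        ]
        if all(s['config_version'] >= target_version
               for view in views for s in view):
            return
        time.sleep(1)
    raise TimeoutError(f"config status not symmetric at version {target_version}")
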
jcsp added a commit to jcsp/redpanda that referenced this issue Aug 15, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Aug 16, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Aug 16, 2022
jcsp mentioned this issue on Aug 16, 2022
jcsp closed this as completed in b7d4f2c on Aug 16, 2022
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Aug 17, 2022
(cherry picked from commit b7d4f2c)

rystsov commented Aug 19, 2022

Tested on today's dev and it isn't fixed yet - https://buildkite.com/redpanda/redpanda/builds/14311#0182b1e9-cd31-49ee-8db4-40fa5404a865

rystsov reopened this on Aug 19, 2022

rystsov commented Aug 19, 2022

Another instance - https://buildkite.com/redpanda/redpanda/builds/14311#0182b1e9-cd33-421b-a908-d7400ba039df

It isn't exactly the same assertion, but it has the same root cause: the test picked a different node and that node still had old data.

test_id:    rptest.tests.cluster_config_test.ClusterConfigTest.test_invalid_settings_forced
status:     FAIL
run time:   28.218 seconds

    AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/cluster_config_test.py", line 439, in test_invalid_settings_forced
    assert n['invalid'] == []
AssertionError

jcsp added a commit to jcsp/redpanda that referenced this issue Aug 19, 2022
felixguendling pushed a commit to felixguendling/redpanda that referenced this issue Aug 22, 2022
felixguendling pushed a commit to felixguendling/redpanda that referenced this issue Aug 22, 2022
This issue was closed.