CI Failure (connection failure in wait_for_removed) in RandomNodeOperationsTest.test_node_operations.enable_failures=True #8576

Closed
travisdowns opened this issue Feb 2, 2023 · 6 comments · Fixed by #8568
@travisdowns
Member

https://buildkite.com/redpanda/redpanda/builds/22336#018610d2-05ec-480e-83d5-d716ceac6f3d

Module: rptest.tests.random_node_operations_test
Class:  RandomNodeOperationsTest
Method: test_node_operations
Arguments:
{
  "enable_failures": true
}

Python backtraces:

----------------------------------------------------------------------------------------------------
test_id:    rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True
status:     FAIL
run time:   2 minutes 53.147 seconds


    ConnectionError(MaxRetryError("HTTPConnectionPool(host='docker-rp-28', port=9644): Max retries exceeded with url: /v1/brokers/4/decommission (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff42b7cca0>: Failed to establish a new connection: [Errno 111] Connection refused'))"))
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 159, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.10/http/client.py", line 1282, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1328, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1277, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1037, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 975, in send
    self.connect()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 187, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 171, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0xffff42b7cca0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='docker-rp-28', port=9644): Max retries exceeded with url: /v1/brokers/4/decommission (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff42b7cca0>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/random_node_operations_test.py", line 103, in test_node_operations
    executor.execute_operation(op)
  File "/root/tests/rptest/utils/node_operations.py", line 381, in execute_operation
    self.wait_for_removed(node_id)
  File "/root/tests/rptest/utils/node_operations.py", line 240, in wait_for_removed
    waiter.wait_for_removal()
  File "/root/tests/rptest/utils/node_operations.py", line 124, in wait_for_removal
    decommission_status = self.admin.get_decommission_status(
  File "/root/tests/rptest/services/admin.py", line 476, in get_decommission_status
    return self._request('get', path, node=node).json()
  File "/root/tests/rptest/services/admin.py", line 307, in _request
    r = self._session.request(verb, url, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='docker-rp-28', port=9644): Max retries exceeded with url: /v1/brokers/4/decommission (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff42b7cca0>: Failed to establish a new connection: [Errno 111] Connection refused'))
@travisdowns added the kind/bug and ci-failure labels Feb 2, 2023
@travisdowns
Member Author

This looks similar to #8362, though the exact symptom is a bit different: connection refused while connecting to a node while waiting for a node to be decommissioned.

This run includes @mmaslankaprv's fixes in #8547, which also included changes on the ducktape side, so maybe this is the same underlying issue, but with the ducktape-side stacks looking different now.
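
For illustration only, here is a minimal sketch of a status poll that treats a refused connection as transient and moves on to the next node instead of failing the test. All names here (`poll_until_done`, `fetch_status`, `is_done`) are hypothetical and not taken from rptest, and this is not necessarily what the changes in #8547 or #8568 do.

```python
import time
from typing import Callable, Sequence


def poll_until_done(
    fetch_status: Callable[[str], dict],
    nodes: Sequence[str],
    is_done: Callable[[dict], bool],
    transient_errors: tuple = (ConnectionError,),
    timeout_sec: float = 30.0,
    backoff_sec: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a status endpoint until is_done(status) is true or time runs out.

    Hypothetical helper: fetch_status(node) is assumed to raise one of
    transient_errors when the node refuses the connection (for example
    because it is being restarted by failure injection).
    """
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        for node in nodes:
            try:
                status = fetch_status(node)
            except transient_errors:
                # This node is unreachable right now; ask the next one.
                continue
            if is_done(status):
                return True
            # Got a valid "not finished" answer; back off before re-polling.
            break
        sleep(backoff_sec)
    return False
```

Whether a refused connection should be retried at all depends on the test: if no node is expected to be down, failing fast may be the more useful behaviour.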

@graphcareful
Contributor

graphcareful commented Feb 2, 2023

Another failure observed here: https://ci-artifacts.dev.vectorized.cloud/redpanda/018612b9-546d-4b6b-b59a-110785a1d4f5/vbuild/ducktape/results/2023-02-02--001/report.html

I took a quick look and saw these two interesting logs:

[INFO  - 2023-02-02 15:40:07,031 - admin - _request - lineno:313]: Connection error, retrying on node docker-rp-12 (remaining ['docker-rp-22', 'docker-rp-23', 'docker-rp-11'])
[DEBUG - 2023-02-02 15:40:07,031 - admin - _request - lineno:305]: Dispatching POST http://docker-rp-12:9644/v1/security/users
[INFO  - 2023-02-02 15:40:07,032 - admin - _request - lineno:313]: Connection error, retrying on node docker-rp-22 (remaining ['docker-rp-23', 'docker-rp-11'])
[DEBUG - 2023-02-02 15:40:07,032 - admin - _request - lineno:305]: Dispatching POST http://docker-rp-22:9644/v1/security/users
[INFO  - 2023-02-02 15:40:07,033 - admin - _request - lineno:313]: Connection error, retrying on node docker-rp-23 (remaining ['docker-rp-11'])
[DEBUG - 2023-02-02 15:40:07,034 - admin - _request - lineno:305]: Dispatching POST http://docker-rp-23:9644/v1/security/users
[INFO  - 2023-02-02 15:40:07,035 - admin - _request - lineno:313]: Connection error, retrying on node docker-rp-11 (remaining [])
[DEBUG - 2023-02-02 15:40:07,035 - admin - _request - lineno:305]: Dispatching POST http://docker-rp-11:9644/v1/security/users

This suggests that maybe all nodes were decommissioned? Later on I see this exception:

Traceback (most recent call last):
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 519, in execute_with_retries
    if op.validate(self.operation_ctx):
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 271, in validate
    users = ctx.admin().list_users()
  File "/root/tests/rptest/services/admin.py", line 649, in list_users
    return self._request("get", "security/users", node=node).json()
  File "/root/tests/rptest/services/admin.py", line 288, in _request
    node = random.choice(self.redpanda.started_nodes())
  File "/usr/lib/python3.10/random.py", line 378, in choice
    return seq[self._randbelow(len(seq))]

This is a crash while trying to fetch the list of users: random.choice throws, and the only way that can happen is on an empty list. The list in question here is self.redpanda.started_nodes().
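
A small sketch of that failure mode, plus one possible guard that raises a more descriptive error. `pick_started_node` and `NoStartedNodesError` are hypothetical names used for illustration, not part of rptest.

```python
import random


class NoStartedNodesError(RuntimeError):
    """Hypothetical error: there is no running node to send a request to."""


def pick_started_node(started_nodes):
    """Pick a random node, failing with a clear message if none are started."""
    if not started_nodes:
        raise NoStartedNodesError(
            "no started nodes available for the admin request")
    return random.choice(started_nodes)


# random.choice on an empty sequence raises IndexError, which is what
# the truncated traceback above is consistent with.
try:
    random.choice([])
except IndexError:
    print("random.choice([]) raised IndexError")
```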

@ztlpn
Contributor

ztlpn commented Feb 2, 2023

@rystsov
Contributor

rystsov commented Feb 3, 2023

@mmaslankaprv
Member

In the last failure the underlying issue will be solved by: #8568

@mmaslankaprv
Member

The traceback here:

Traceback (most recent call last):
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 519, in execute_with_retries
    if op.validate(self.operation_ctx):
  File "/root/tests/rptest/services/admin_ops_fuzzer.py", line 271, in validate
    users = ctx.admin().list_users()
  File "/root/tests/rptest/services/admin.py", line 649, in list_users
    return self._request("get", "security/users", node=node).json()
  File "/root/tests/rptest/services/admin.py", line 288, in _request
    node = random.choice(self.redpanda.started_nodes())
  File "/usr/lib/python3.10/random.py", line 378, in choice
    return seq[self._randbelow(len(seq))] 

is caused by the fact that, when the test fails, the fuzzer is still running after all redpanda nodes have been stopped
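
One way to avoid that ordering problem, sketched under assumptions: stop the background fuzzer in a `finally` block before the nodes are stopped, so it never runs against an empty cluster. `BackgroundFuzzer` and `run_test_body` are hypothetical stand-ins, not the actual admin_ops_fuzzer code, and this is not necessarily how the test was changed.

```python
import threading


class BackgroundFuzzer:
    """Hypothetical stand-in for a fuzzer that runs operations in a thread."""

    def __init__(self, run_one_op, interval_sec=0.01):
        self._run_one_op = run_one_op
        self._interval_sec = interval_sec
        self._stop_requested = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def _loop(self):
        while not self._stop_requested.is_set():
            self._run_one_op()
            # Sleep, but wake up immediately if stop() is called.
            self._stop_requested.wait(self._interval_sec)

    def stop(self):
        self._stop_requested.set()
        self._thread.join()


def run_test_body(fuzzer, test_body, stop_nodes):
    """Run test_body with the fuzzer active; always stop the fuzzer first."""
    fuzzer.start()
    try:
        test_body()
    finally:
        # Even when the test body raises, the fuzzer is joined before
        # any node is stopped.
        fuzzer.stop()
        stop_nodes()
```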

This issue was closed.