[CI-failure] Shutdown hang while stopping consensus/partition #10402

Closed
travisdowns opened this issue Apr 27, 2023 · 1 comment

@travisdowns
Member

Version & Environment

Redpanda version: 9a93a9c222

What went wrong?

Ducktape fails because one Redpanda instance fails to stop.

This is a CI failure, but I don't think it is test-specific, so I have omitted the usual details about the test.

Run: https://buildkite.com/redpanda/redpanda/builds/28110#0187bff2-a596-4eef-90d3-48f9d357a538
Report: https://ci-artifacts.dev.vectorized.cloud/redpanda/28110/0187bff2-a596-4eef-90d3-48f9d357a538/vbuild/ducktape/results/2023-04-27--001/report.html

Error:

[INFO  - 2023-04-27 00:40:23,896 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.adjacent_segment_merging_test.AdjacentSegmentMergingTest.test_reupload_of_local_segments.acks=0.cloud_storage_type=CloudStorageType.ABS: Summary: TimeoutError('Redpanda node docker-rp-6 failed to stop in 30 seconds')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 95, in wrapped
    self.redpanda.stop_and_scrub_object_storage()
  File "/root/tests/rptest/services/redpanda.py", line 3037, in stop_and_scrub_object_storage
    self.stop()
  File "/root/tests/rptest/services/redpanda.py", line 2231, in stop
    self._for_nodes(self.nodes,
  File "/root/tests/rptest/services/redpanda.py", line 1289, in _for_nodes
    return list(executor.map(cb, nodes))
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/tests/rptest/services/redpanda.py", line 2232, in <lambda>
    lambda n: self.stop_node(n, **kwargs),
  File "/root/tests/rptest/services/redpanda.py", line 2265, in stop_node
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Redpanda node docker-rp-6 failed to stop in 30 seconds
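
For readers unfamiliar with the last frames of the traceback: the timeout is raised by a polling wait that checks a condition until a deadline passes. The following is a minimal sketch of that pattern, not ducktape's actual implementation. The names `wait_until_sketch` and `WaitTimeout` are hypothetical, and the only detail taken from the traceback is that `err_msg` may be either a string or a callable.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes true (hypothetical name)."""


def wait_until_sketch(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Poll `condition` until it is truthy or `timeout_sec` has elapsed.

    Simplified illustration only; the real helper lives in
    ducktape/utils/util.py and may differ in details.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            # As in the traceback, err_msg may be a string or a callable.
            raise WaitTimeout(err_msg() if callable(err_msg) else err_msg)
        time.sleep(backoff_sec)
```

In this failure the condition (the Redpanda process on docker-rp-6 having exited) never became true within the 30 second window, so the wait raised.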

What should have happened instead?

Instance should stop.

Additional information

The hang appears to occur while stopping the test partition. After this line:

raft - [group_id:1, {kafka/panda-topic/0}] consensus.cc:255 - Stopping

No additional stop-related log lines come from shard 1. On another node in the same test, the following lines appear:

INFO  2023-04-27 00:39:20,894 [shard 1] raft - [group_id:1, {kafka/panda-topic/0}] consensus.cc:255 - Stopping
INFO  2023-04-27 00:39:20,895 [shard 1] rpc - rpc_server.cc:97 - Disconnected 172.16.16.23:52275 (stream closed)}
DEBUG 2023-04-27 00:39:20,897 [shard 1] cluster - partition.cc:448 - Stopping partition: {kafka/panda-topic/0}

So the usual behavior is to proceed quickly to stopping the partition. I don't know whether the intervening message about the RPC disconnect is coincidental.
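
The comparison above can be automated when triaging similar hangs. Below is a rough sketch that scans a node's log for partitions whose raft "Stopping" line is never followed by the corresponding "Stopping partition" line. The function name `find_stalled_stops` is hypothetical, the regexes are written only against the log lines quoted in this issue, and the sketch assumes DEBUG logging is enabled, since the partition line is logged at DEBUG.

```python
import re

# Patterns modeled on the log lines quoted above; source line numbers
# (e.g. consensus.cc:255) are matched loosely since they can change.
CONSENSUS_STOP = re.compile(
    r"\[shard (\d+)\] raft - \[group_id:\d+, \{([^}]+)\}\] "
    r"consensus\.cc:\d+ - Stopping"
)
PARTITION_STOP = re.compile(
    r"\[shard (\d+)\] cluster - partition\.cc:\d+ - "
    r"Stopping partition: \{([^}]+)\}"
)


def find_stalled_stops(lines):
    """Return sorted (shard, ntp) pairs that logged a consensus stop
    but never logged the matching partition stop."""
    pending = set()
    for line in lines:
        m = CONSENSUS_STOP.search(line)
        if m:
            pending.add((int(m.group(1)), m.group(2)))
            continue
        m = PARTITION_STOP.search(line)
        if m:
            pending.discard((int(m.group(1)), m.group(2)))
    return sorted(pending)
```

Run over the healthy node's lines quoted above, this returns an empty list. On a log that ends after the consensus "Stopping" line, it reports shard 1 and kafka/panda-topic/0.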

@andrwng
Contributor

andrwng commented Apr 27, 2023

This looks like a dupe of #10085, which had a fix go in earlier today.

@andrwng andrwng closed this as completed Apr 27, 2023