
Ducktape hangs on NodeResizeTest.test_node_resize #4634

Closed

rystsov opened this issue May 10, 2022 · 19 comments


rystsov commented May 10, 2022

https://buildkite.com/redpanda/redpanda/builds/9879#8e5893f2-3651-4574-b6fb-023068c913a4

Looking at the Buildkite log, we can see that NodeResizeTest.test_node_resize starts but never finishes:

[INFO:2022-05-07 06:56:26,930]: RunnerClient: rptest.tests.node_resize_test.NodeResizeTest.test_node_resize: Setting up...
[INFO:2022-05-07 06:56:41,704]: RunnerClient: rptest.tests.node_resize_test.NodeResizeTest.test_node_resize: Running...
rystsov added the kind/bug, area/tests, and ci-failure labels on May 10, 2022
@ajfabbri

Taking a look, thanks.

@ajfabbri

Tried a reproduction loop last night; no luck so far.

--------------------------------------------------------------------------------
test_id:    rptest.tests.node_resize_test.NodeResizeTest.test_node_resize
status:     PASS
run time:   1 minute 2.366 seconds
--------------------------------------------------------------------------------
================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.8.8
session_id:       2022-05-25--002
run time:         418 minutes 17.809 seconds
tests run:        360
passed:           360
failed:           0
ignored:          0
opassed:          0
ofailed:          0
================================================================================


rystsov commented May 26, 2022

@ajfabbri I find that some tests are resource-sensitive, so when I'm hunting a flaky test I run only that particular test in a loop while, in a parallel window, I alternate between two sufficiently far-apart redpanda commits and constantly rebuild them :) but of course there are other ways to stress the hardware.
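For reference, a rough sketch of that kind of loop in Python (the ducktape invocation and test path below are illustrative; substitute however you normally launch rptest):

import subprocess
import sys

# Illustrative command; adjust to your usual ducktape/rptest invocation.
CMD = [
    "ducktape",
    "tests/rptest/tests/node_resize_test.py::NodeResizeTest.test_node_resize",
]

for attempt in range(1, 101):
    print(f"--- attempt {attempt} ---", flush=True)
    result = subprocess.run(CMD)
    if result.returncode != 0:
        # Stop on the first failure so the logs of the failing run are preserved.
        print(f"test failed on attempt {attempt}", file=sys.stderr)
        sys.exit(result.returncode)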


ajfabbri commented Jun 2, 2022

Spent some time this evening digging into zeromq options and looking at upstream history. Will try another reproduction loop shortly.


ajfabbri commented Jun 8, 2022

Re-running again on clustered ducktape with some load added to the brokers in the background:

for i in $CLUSTER_IPS
do
  ssh ubuntu@$i -f 'nohup sh -c "( ( stress -t 600 -c 1 -m 1 -d 1 --hdd-bytes 102400 &>/dev/null) &)"'
done

Also looking at whether we can change the runner to grab logs/artifacts in this case; debugging without them is pretty hard.


dotnwat commented Jun 9, 2022

The error is occurring because the RunnerClient isn't receiving a message within the configured 30-minute timeout. It appears this could be because either (1) the test actually takes more than 30 minutes, or (2) an error somewhere else is preventing a message that is expected to be received from ever being sent.

[INFO:2022-05-09 19:34:29,558]: RunnerClient: rptest.tests.wasm_filter_test.WasmFilterTest.verify_filter_test: PASS
[INFO:2022-05-09 19:34:29,558]: RunnerClient: rptest.tests.wasm_filter_test.WasmFilterTest.verify_filter_test: Tearing down...
[INFO:2022-05-09 19:34:35,047]: RunnerClient: rptest.tests.wasm_filter_test.WasmFilterTest.verify_filter_test: Summary:
[INFO:2022-05-09 19:34:35,047]: RunnerClient: rptest.tests.wasm_filter_test.WasmFilterTest.verify_filter_test: Data: None
[ERROR:2022-05-09 20:04:35,342]: Exception receiving message: <class 'ducktape.errors.TimeoutError'>: runner client unresponsive
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner.py", line 66, in recv
    message = self.socket.recv()
  File "zmq/backend/cython/socket.pyx", line 791, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 827, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 191, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 186, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 20, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable

We can effectively rule out a 30-minute test because verify_filter_test takes around 3 minutes from when it starts to when it is reported as having passed.

That leaves (2). Observe that all of the Buildkite logs above seem to hang around the verify_filter_test wasm test, so perhaps something in that test is causing an issue within ducktape itself.
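For context, here is a minimal standalone sketch (not ducktape's actual code) of how a pyzmq socket with a receive timeout ends up raising the zmq.error.Again seen in the traceback above when the peer never sends anything:

import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PULL)          # socket type chosen for illustration
sock.setsockopt(zmq.RCVTIMEO, 5000)  # give up after 5 seconds
sock.bind("tcp://127.0.0.1:5556")    # port chosen for illustration

try:
    message = sock.recv()            # blocks until a message arrives or the timeout expires
except zmq.error.Again:
    # No sender connected within the timeout. Ducktape's runner turns this into
    # "runner client unresponsive"; the open question is why the client never sent.
    print("no message within the timeout")
finally:
    sock.close()
    ctx.term()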


dotnwat commented Jun 9, 2022

One place to look might be whether there are any wasm-related services that could get stuck on shutdown. But before looking at that, I'd try to verify what is happening between ducktape reporting PASS and the timeout occurring, to see whether some other shutdown events could get interleaved there.

I looked briefly for any sort of port usage that might conflict with the zmq receiver/sender range of 555x-566x and didn't see anything.
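For what it's worth, a quick hypothetical helper for that kind of check (ports_in_use is not part of the repo, just an illustration): try binding every port in the range and report the ones that are already taken.

import socket

def ports_in_use(start: int = 5550, end: int = 5669) -> list[int]:
    # Attempt to bind each port; a failed bind means something already holds it.
    busy = []
    for port in range(start, end + 1):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", port))
        except OSError:
            busy.append(port)
        finally:
            s.close()
    return busy

if __name__ == "__main__":
    print("ports already bound:", ports_in_use())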


ajfabbri commented Jun 9, 2022

Agreed on the behavior. The wasm thing is interesting. I'd noticed that when I first looked at this, but dismissed it in favor of adding some visibility on where the test client was stuck. The wasm tests have been disabled recently, though, which would explain why I cannot seem to reproduce it.


ajfabbri commented Jun 9, 2022

Reproduced similar behavior by accident by:

  • Writing a new fake test test_hung_test (which also subclasses EndToEndTest, just mimicking test_node_resize).
  • Killing the ducktape driver via CTRL-C.
  • Re-running the test. It quickly fails with the same stack trace:
[INFO:2022-06-09 04:27:51,709]: RunnerClient: rptest.tests.full_node_recovery_test.FullNodeRecoveryTest.test_hung_test: Running...
[ERROR:2022-06-09 04:27:52,010]: Exception receiving message: <class 'ducktape.errors.TimeoutError'>: runner client unresponsive
Traceback (most recent call last):                         
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner.py", line 66, in recv                    
    message = self.socket.recv()                                                                                       
  File "zmq/backend/cython/socket.pyx", line 791, in zmq.backend.cython.socket.Socket.recv                             
  File "zmq/backend/cython/socket.pyx", line 827, in zmq.backend.cython.socket.Socket.recv                             
  File "zmq/backend/cython/socket.pyx", line 191, in zmq.backend.cython.socket._recv_copy                              
  File "zmq/backend/cython/socket.pyx", line 186, in zmq.backend.cython.socket._recv_copy                              
  File "zmq/backend/cython/checkrc.pxd", line 20, in zmq.backend.cython.checkrc._check_rc                              
zmq.error.Again: Resource temporarily unavailable                                                                      
                                                           
During handling of the above exception, another exception occurred:                                                    
                                                           
Traceback (most recent call last):                                                                                     
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner.py", line 203, in run_all_tests          
    event = self.receiver.recv(timeout=self.session_context.test_runner_timeout)                                       
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner.py", line 68, in recv                    
    raise TimeoutError("runner client unresponsive")                                                                   
ducktape.errors.TimeoutError: runner client unresponsive 

A little bit later the runner client processes time out and exit (note that I'm running with modifications to ducktape that reduce some zmq timeouts):

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 155, in run
    self.log(logging.INFO, "FAIL: " + err_trace)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 280, in log
    self.send(self.message.log(msg, level=log_level))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 66, in send
    return self.sender.send(event)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 332, in send
    raise RuntimeError("Unable to receive response from driver")
RuntimeError: Unable to receive response from driver

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 35, in run_client
    client.run()
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 158, in run
    self.teardown_test(teardown_services=not self.session_context.no_teardown, test_status=test_status)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 244, in teardown_test
    self.log(logging.INFO, "Tearing down...")
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 280, in log
    self.send(self.message.log(msg, level=log_level))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 66, in send
    return self.sender.send(event)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 332, in send
    raise RuntimeError("Unable to receive response from driver")
RuntimeError: Unable to receive response from driver


dotnwat commented Jun 9, 2022

Nice clue. My guess at this point is that there is some unrecoverable error on the send side of the messages that the TestRunner receives. This error is the same, but in the previous Buildkite instances we can see from the log timestamps that the 30-minute timeout is being hit. But it definitely seems like you're on the right track.

@ajfabbri

Played with this a bit more tonight. Just a fake "hung" test and shortening --test-runner-timeout is enough to reproduce this stack trace. I've hacked together some stuff which (1) gets a stuck child process to dump a stack trace and (2) forces it to exit (which I'm hoping means we will collect logs). It needs some cleanup and more testing; I'll post a patch when I get a chance to do that.

[INFO:2022-06-10 06:53:47,376]: RunnerClient: rptest.tests.full_node_recovery_test.FullNodeRecoveryTest.test_hung_test: Running...                                                                                                            
[ERROR:2022-06-10 06:54:02,392]: Exception receiving message: <class 'ducktape.errors.TimeoutError'>: runner client unresponsive after 15.02 seconds.
Timeout: hung test. Sending debug signal..                                                                                                                                                                                                    
runner client unresponsive after 15.02 seconds.                                                                        
Attempting to get logs...                                                                                                                                                                                                                     
Signal received, dumping stack trace:                                                                                  
  File "/home/ubuntu/.local/bin/ducktape", line 8, in <module>                                                                                                                                                                                
    sys.exit(main())                                       
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/command_line/main.py", line 183, in main                                                                                                                                     
    test_results = runner.run_all_tests()                                                                                                                                                                                                                                                                                                                                 
<snip>                    
  File "/home/ubuntu/redpanda/tests/rptest/tests/full_node_recovery_test.py", line 150, in test_hung_test                                                                                                                                     
    self._some_func()                                      
  File "/home/ubuntu/redpanda/tests/rptest/tests/full_node_recovery_test.py", line 143, in _some_func                                                                                                                                         
    dummy_data.append(self._some_helper_func())                                                                                                                                                                                               
  File "/home/ubuntu/redpanda/tests/rptest/tests/full_node_recovery_test.py", line 136, in _some_helper_func                                                                                                                                  
    time.sleep(random.random())                                                                                                                                                                                                               
                                                                                                                       
Sending SIGINT to child pid 9514                                                                    
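Roughly, the idea looks like the following sketch (illustrative only, not the actual patch): the child registers a handler that prints its current stack, and on timeout the driver signals the stuck child and then forces it to exit.

import os
import signal
import sys
import traceback

def dump_stack(signum, frame):
    # Print the Python stack of whatever the child was doing when signalled.
    print("Signal received, dumping stack trace:", file=sys.stderr)
    traceback.print_stack(frame, file=sys.stderr)

# Child side: register the handler once at startup.
signal.signal(signal.SIGUSR1, dump_stack)

# Driver side: on a hung test, ask for a stack trace, then force the child out.
def debug_and_kill(child_pid: int) -> None:
    os.kill(child_pid, signal.SIGUSR1)  # request a stack dump
    os.kill(child_pid, signal.SIGINT)   # then interrupt it so logs can be collected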

@ajfabbri

Update: this has not reproduced for about a month. I also could not reproduce it locally, or on CDT, even with various levels of background load running. PandaResults:

FAIL test: NodeResizeTest.test_node_resize (3/793 runs)
  failure at 2022-05-12T20:38:14.888Z: Started but didn't finish!
      in job https://buildkite.com/redpanda/redpanda/builds/10078#d9101d53-f3d0-41e5-b468-93d0adaa36dc

I've added some changes to ducktape which should make it easier to debug these in the future. When those are merged, I plan to close this, with the understanding that we will reopen it as needed if it shows up again.


LenaAn commented Jun 20, 2022

From PandaResults:

Unstable test: NodeResizeTest.test_node_resize (1/66 runs failed)
  Latest failure at 2022-06-18T07:50:57.756Z: <BadLogLines nodes=docker-rp-9(1) example="ERROR 2022-06-18 06:52:16,604 [shard 0] rpc - Service handler threw an exception: seastar::condition_variable_timed_out (Condition variable timed out)">
             in job https://buildkite.com/redpanda/redpanda/builds/11465#0181756f-c616-457e-be53-8f0dcea2d29c

@ajfabbri

Looks like the latest report (previous comment here) is unrelated, as the test did not hang, but emitted a common error log line. I've filed #5461 to cover that failure.


ajfabbri commented Jul 13, 2022

This is ready to be closed; just waiting for the improved hung-test diagnostics changes for ducktape to land: redpanda-data/ducktape#10

Working theory is that one of the recently disabled WASM tests is not shutting down properly, causing the worker processes to hang. This bug has been impossible to reproduce since those tests were disabled, and the log activity seems to imply a relationship. This is likely also the cause of #4382.

@ajfabbri

Closing this since this bug has not reproduced for months. If this reappears, note the hints about wasm/coproc tests being a possible culprit. Once the linked ducktape PR is merged, we should have an easier time seeing which test is broken in the logs.
