
Ducktape hangs on NodeResizeTest.test_node_resize #4634

Closed

rystsov opened this issue May 10, 2022 · 19 comments


rystsov commented May 10, 2022

https://buildkite.com/redpanda/redpanda/builds/9879#8e5893f2-3651-4574-b6fb-023068c913a4

Looking at the Buildkite log, we can see that NodeResizeTest.test_node_resize starts but never finishes:

[INFO:2022-05-07 06:56:26,930]: RunnerClient: rptest.tests.node_resize_test.NodeResizeTest.test_node_resize: Setting up...
[INFO:2022-05-07 06:56:41,704]: RunnerClient: rptest.tests.node_resize_test.NodeResizeTest.test_node_resize: Running...
rystsov added the kind/bug, area/tests, and ci-failure labels on May 10, 2022
@ajfabbri

Taking a look, thanks.

@ajfabbri

Tried a reproduction loop last night; no luck so far.

--------------------------------------------------------------------------------
test_id:    rptest.tests.node_resize_test.NodeResizeTest.test_node_resize
status:     PASS
run time:   1 minute 2.366 seconds
--------------------------------------------------------------------------------
================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.8.8
session_id:       2022-05-25--002
run time:         418 minutes 17.809 seconds
tests run:        360
passed:           360
failed:           0
ignored:          0
opassed:          0
ofailed:          0
================================================================================


rystsov commented May 26, 2022

@ajfabbri I find that some tests are resource-sensitive, so when I'm hunting a flaky test I run only that particular test in a loop while, in a parallel window, I alternate between two sufficiently far-apart redpanda commits and constantly rebuild them :) but of course there are other ways to stress the hardware.
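For reference, a rough sketch of that kind of loop in Python (the ducktape invocation and test path below are illustrative; substitute however you normally launch rptest):

import subprocess
import sys

# Illustrative command; adjust to your usual ducktape/rptest invocation.
CMD = [
    "ducktape",
    "tests/rptest/tests/node_resize_test.py::NodeResizeTest.test_node_resize",
]

for attempt in range(1, 101):
    print(f"--- attempt {attempt} ---", flush=True)
    result = subprocess.run(CMD)
    if result.returncode != 0:
        # Stop on the first failure so the logs of the failing run are preserved.
        print(f"test failed on attempt {attempt}", file=sys.stderr)
        sys.exit(result.returncode)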


ajfabbri commented Jun 2, 2022

Spent some time this evening digging into zeromq options and looking at upstream history. Will try another reproduction loop shortly.


ajfabbri commented Jun 8, 2022

Re-running again on clustered ducktape with some load added to the brokers in the background:

for i in $CLUSTER_IPS
do
  ssh ubuntu@$i -f 'nohup sh -c "( ( stress -t 600 -c 1 -m 1 -d 1 --hdd-bytes 102400 &>/dev/null) &)"'
done

Also looking at whether we can change the runner to grab logs/artifacts in this case; debugging without them is pretty hard.


dotnwat commented Jun 9, 2022

The error is occurring because the RunnerClient isn't receiving a message within the configured 30-minute timeout. It appears this could be because either (1) the test actually takes more than 30 minutes, or (2) an error somewhere else is preventing a message that is expected to be received from ever being sent.

[INFO:2022-05-09 19:34:29,558]: RunnerClient: rptest.tests.wasm_filter_test.WasmFilterTest.verify_filter_test: PASS
[INFO:2022-05-09 19:34:29,558]: RunnerClient: rptest.tests.wasm_filter_test.WasmFilterTest.verify_filter_test: Tearing down...
[INFO:2022-05-09 19:34:35,047]: RunnerClient: rptest.tests.wasm_filter_test.WasmFilterTest.verify_filter_test: Summary:
[INFO:2022-05-09 19:34:35,047]: RunnerClient: rptest.tests.wasm_filter_test.WasmFilterTest.verify_filter_test: Data: None
[ERROR:2022-05-09 20:04:35,342]: Exception receiving message: <class 'ducktape.errors.TimeoutError'>: runner client unresponsive
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner.py", line 66, in recv
    message = self.socket.recv()
  File "zmq/backend/cython/socket.pyx", line 791, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 827, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 191, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 186, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 20, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable

We can effectively rule out a 30-minute test because verify_filter_test takes around 3 minutes from when it starts to when it is reported as having passed.

That leaves (2). Observe that all of the Buildkite logs above seem to hang around the verify_filter_test wasm test, so perhaps something in that test is causing an issue within ducktape itself.
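For context, here is a minimal standalone sketch (not ducktape's actual code) of how a pyzmq socket with a receive timeout ends up raising the zmq.error.Again seen in the traceback above when the peer never sends anything:

import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PULL)          # socket type chosen for illustration
sock.setsockopt(zmq.RCVTIMEO, 5000)  # give up after 5 seconds
sock.bind("tcp://127.0.0.1:5556")    # port chosen for illustration

try:
    message = sock.recv()            # blocks until a message arrives or the timeout expires
except zmq.error.Again:
    # No sender connected within the timeout. Ducktape's runner turns this into
    # "runner client unresponsive"; the open question is why the client never sent.
    print("no message within the timeout")
finally:
    sock.close()
    ctx.term()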


dotnwat commented Jun 9, 2022

One place to look might be whether there are any wasm-related services that could get stuck on shutdown. But before looking at that, I'd try to verify what is happening between ducktape reporting PASS and the timeout occurring, to see whether some other shutdown events could get interleaved there.

I looked briefly for any sort of port usage that might conflict with the zmq receiver/sender range of 555x-566x and didn't see anything.
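For what it's worth, a quick hypothetical helper for that kind of check (ports_in_use is not part of the repo, just an illustration): try binding every port in the range and report the ones that are already taken.

import socket

def ports_in_use(start: int = 5550, end: int = 5669) -> list[int]:
    # Attempt to bind each port; a failed bind means something already holds it.
    busy = []
    for port in range(start, end + 1):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", port))
        except OSError:
            busy.append(port)
        finally:
            s.close()
    return busy

if __name__ == "__main__":
    print("ports already bound:", ports_in_use())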


ajfabbri commented Jun 9, 2022

Agreed on the behavior. The wasm thing is interesting. I'd noticed that when I first looked at this, but dismissed it in favor of adding some visibility on where the test client was stuck. The wasm tests have been disabled recently, though, which would explain why I cannot seem to reproduce it.


ajfabbri commented Jun 9, 2022

Reproduced similar behavior by accident by:

  • Writing a new fake test test_hung_test (which also subclasses EndToEndTest, just mimicking test_node_resize).
  • Killing the ducktape driver via CTRL-C.
  • Re-running the test. It quickly fails with the same stack trace:
[INFO:2022-06-09 04:27:51,709]: RunnerClient: rptest.tests.full_node_recovery_test.FullNodeRecoveryTest.test_hung_test: Running...
[ERROR:2022-06-09 04:27:52,010]: Exception receiving message: <class 'ducktape.errors.TimeoutError'>: runner client unresponsive
Traceback (most recent call last):                         
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner.py", line 66, in recv                    
    message = self.socket.recv()                                                                                       
  File "zmq/backend/cython/socket.pyx", line 791, in zmq.backend.cython.socket.Socket.recv                             
  File "zmq/backend/cython/socket.pyx", line 827, in zmq.backend.cython.socket.Socket.recv                             
  File "zmq/backend/cython/socket.pyx", line 191, in zmq.backend.cython.socket._recv_copy                              
  File "zmq/backend/cython/socket.pyx", line 186, in zmq.backend.cython.socket._recv_copy                              
  File "zmq/backend/cython/checkrc.pxd", line 20, in zmq.backend.cython.checkrc._check_rc                              
zmq.error.Again: Resource temporarily unavailable                                                                      
                                                           
During handling of the above exception, another exception occurred:                                                    
                                                           
Traceback (most recent call last):                                                                                     
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner.py", line 203, in run_all_tests          
    event = self.receiver.recv(timeout=self.session_context.test_runner_timeout)                                       
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner.py", line 68, in recv                    
    raise TimeoutError("runner client unresponsive")                                                                   
ducktape.errors.TimeoutError: runner client unresponsive 

A little bit later the runner client processes time out and exit (note that I'm running with modifications to ducktape that reduce some zmq timeouts):

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 155, in run
    self.log(logging.INFO, "FAIL: " + err_trace)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 280, in log
    self.send(self.message.log(msg, level=log_level))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 66, in send
    return self.sender.send(event)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 332, in send
    raise RuntimeError("Unable to receive response from driver")
RuntimeError: Unable to receive response from driver

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 35, in run_client
    client.run()
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 158, in run
    self.teardown_test(teardown_services=not self.session_context.no_teardown, test_status=test_status)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 244, in teardown_test
    self.log(logging.INFO, "Tearing down...")
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 280, in log
    self.send(self.message.log(msg, level=log_level))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 66, in send
    return self.sender.send(event)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 332, in send
    raise RuntimeError("Unable to receive response from driver")
RuntimeError: Unable to receive response from driver


dotnwat commented Jun 9, 2022

Nice clue. My guess at this point is that there is some unrecoverable error on the send side of the messages that the TestRunner receives. This error is the same, but in the previous Buildkite instances we can see from the log timestamps that the 30-minute timeout is being hit. But it definitely seems like you're on the right track.

@ajfabbri

Played with this a bit more tonight. Just a fake "hung" test and shortening --test-runner-timeout is enough to reproduce this stack trace. I've hacked together some stuff which (1) gets a stuck child process to dump a stack trace and (2) forces it to exit (which I'm hoping means we will collect logs). It needs some cleanup and more testing; I'll post a patch when I get a chance to do that.

[INFO:2022-06-10 06:53:47,376]: RunnerClient: rptest.tests.full_node_recovery_test.FullNodeRecoveryTest.test_hung_test: Running...                                                                                                            
[ERROR:2022-06-10 06:54:02,392]: Exception receiving message: <class 'ducktape.errors.TimeoutError'>: runner client unresponsive after 15.02 seconds.
Timeout: hung test. Sending debug signal..                                                                                                                                                                                                    
runner client unresponsive after 15.02 seconds.                                                                        
Attempting to get logs...                                                                                                                                                                                                                     
Signal received, dumping stack trace:                                                                                  
  File "/home/ubuntu/.local/bin/ducktape", line 8, in <module>                                                                                                                                                                                
    sys.exit(main())                                       
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/command_line/main.py", line 183, in main                                                                                                                                     
    test_results = runner.run_all_tests()                                                                                                                                                                                                                                                                                                                                 
<snip>                    
  File "/home/ubuntu/redpanda/tests/rptest/tests/full_node_recovery_test.py", line 150, in test_hung_test                                                                                                                                     
    self._some_func()                                      
  File "/home/ubuntu/redpanda/tests/rptest/tests/full_node_recovery_test.py", line 143, in _some_func                                                                                                                                         
    dummy_data.append(self._some_helper_func())                                                                                                                                                                                               
  File "/home/ubuntu/redpanda/tests/rptest/tests/full_node_recovery_test.py", line 136, in _some_helper_func                                                                                                                                  
    time.sleep(random.random())                                                                                                                                                                                                               
                                                                                                                       
Sending SIGINT to child pid 9514                                                                    
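Roughly, the idea looks like the following sketch (illustrative only, not the actual patch): the child registers a handler that prints its current stack, and on timeout the driver signals the stuck child and then forces it to exit.

import os
import signal
import sys
import traceback

def dump_stack(signum, frame):
    # Print the Python stack of whatever the child was doing when signalled.
    print("Signal received, dumping stack trace:", file=sys.stderr)
    traceback.print_stack(frame, file=sys.stderr)

# Child side: register the handler once at startup.
signal.signal(signal.SIGUSR1, dump_stack)

# Driver side: on a hung test, ask for a stack trace, then force the child out.
def debug_and_kill(child_pid: int) -> None:
    os.kill(child_pid, signal.SIGUSR1)  # request a stack dump
    os.kill(child_pid, signal.SIGINT)   # then interrupt it so logs can be collected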

@ajfabbri

Update: this has not reproduced for about a month. I also could not reproduce it locally, or on CDT, even with various levels of background load running. PandaResults:

FAIL test: NodeResizeTest.test_node_resize (3/793 runs)
  failure at 2022-05-12T20:38:14.888Z: Started but didn't finish!
      in job https://buildkite.com/redpanda/redpanda/builds/10078#d9101d53-f3d0-41e5-b468-93d0adaa36dc

I've added some changes to ducktape which should make it easier to debug these in the future. When those are merged, I plan to close this, with the understanding that we will reopen it as needed if it shows up again.


LenaAn commented Jun 20, 2022

From PandaResults:

Unstable test: NodeResizeTest.test_node_resize (1/66 runs failed)
  Latest failure at 2022-06-18T07:50:57.756Z: <BadLogLines nodes=docker-rp-9(1) example="ERROR 2022-06-18 06:52:16,604 [shard 0] rpc - Service handler threw an exception: seastar::condition_variable_timed_out (Condition variable timed out)">
             in job https://buildkite.com/redpanda/redpanda/builds/11465#0181756f-c616-457e-be53-8f0dcea2d29c

@ajfabbri

Looks like the latest report (previous comment here) is unrelated, as the test did not hang, but emitted a common error log line. I've filed #5461 to cover that failure.


ajfabbri commented Jul 13, 2022

This is ready to be closed; just waiting for the improved hung-test diagnostics changes for ducktape to land: redpanda-data/ducktape#10

Working theory is that one of the recently disabled WASM tests is not shutting down properly, causing the worker processes to hang. This bug has been impossible to reproduce since those tests were disabled, and the log activity seems to imply a relationship. This is likely also the cause of #4382.

@ajfabbri

Closing this since this bug has not reproduced for months. If this reappears, note the hints about wasm/coproc tests being a possible culprit. Once the linked ducktape PR is merged, we should have an easier time seeing which test is broken in the logs.
