Add debugging for hung tests #10

ajfabbri · 2022-06-10T21:37:34Z

When debugging hard-to-reproduce "hung" test failures (like redpanda #4634), we ran into a couple of issues:

Lack of any clues as to where the child test process is stuck.
Ducktape processes failing to exit, and failing to gather any log output (related, IIUC).

The solution makes a couple of modifications:

Adding debug signal handler to child processes, which will dump stack traces for all threads.
Tolerating TimeoutError exceptions in main.py, as much as possible, to gather debug information and get all processes to exit.
Tweaking some zmq socket-related parameters to try to avoid stuck sockets on unclean (i.e. passive side of connection) TCP shutdown.

Here is an example of the new output you should see for a hanging test:

[INFO:2022-06-10 20:52:07,852]: RunnerClient: rptest.tests.full_node_recovery_test.FullNodeRecoveryTest.test_hung_test: Running...
[ERROR:2022-06-10 20:52:22,861]: Exception receiving message: <class 'ducktape.errors.TimeoutError'>: runner client unresponsive after 15.01 seconds.
Timeout: assuming hung test.                                                                                           
runner client unresponsive after 15.01 seconds.                                                                        
Hung test stacktrace: Sending SIGUSR1 to child pid 13003                                                               
Thread 0x00007f1803fff640 (most recent call first):                                                                    
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/packet.py", line 301 in read_all                      
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/packet.py", line 459 in read_message                  
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/transport.py", line 2055 in run                       
  File "/usr/lib/python3.9/threading.py", line 973 in _bootstrap_inner                                                 
  File "/usr/lib/python3.9/threading.py", line 930 in _bootstrap                                                       
                                                                                                                       
Thread 0x00007f1810e67640 (most recent call first):                                                                    
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/packet.py", line 301 in read_all                      
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/packet.py", line 459 in read_message                  
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/transport.py", line 2055 in run                       
  File "/usr/lib/python3.9/threading.py", line 973 in _bootstrap_inner                                                 
  File "/usr/lib/python3.9/threading.py", line 930 in _bootstrap                                                       
                                                                                                                       
Thread 0x00007f1811668640 (most recent call first):        
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/packet.py", line 301 in read_all                      
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/packet.py", line 459 in read_message                  
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/transport.py", line 2055 in run                       
  File "/usr/lib/python3.9/threading.py", line 973 in _bootstrap_inner                                                 
  File "/usr/lib/python3.9/threading.py", line 930 in _bootstrap                                                       
                                                                                                                       
Current thread 0x00007f181ad7afc0 (most recent call first):                                                            
  File "/home/ubuntu/redpanda/tests/rptest/tests/full_node_recovery_test.py", line 134 in _some_helper_func            
  File "/home/ubuntu/redpanda/tests/rptest/tests/full_node_recovery_test.py", line 143 in _some_func                   
  File "/home/ubuntu/redpanda/tests/rptest/tests/full_node_recovery_test.py", line 150 in test_hung_test               
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35 in wrapped                                    
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 233 in run_test         
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 141 in run              
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 39 in run_client        
  File "/usr/lib/python3.9/multiprocessing/process.py", line 108 in run                                                
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315 in _bootstrap                                         
  File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 71 in _launch                                          
  File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 19 in __init__                                         
  File "/usr/lib/python3.9/multiprocessing/context.py", line 277 in _Popen                                             
  File "/usr/lib/python3.9/multiprocessing/context.py", line 224 in _Popen                                             
  File "/usr/lib/python3.9/multiprocessing/process.py", line 121 in start                                              
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner.py", line 264 in _run_single_test        
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner.py", line 214 in run_all_tests           
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/command_line/main.py", line 182 in main               
  File "/home/ubuntu/.local/bin/ducktape", line 8 in <module>                                                          
Sending SIGINT to child pid 13003
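
The "Thread 0x... (most recent call first)" layout above is the format that Python's standard `faulthandler` module writes. The handler code itself is not shown in this conversation, so the following is only a minimal sketch, assuming the child registers `faulthandler` on SIGUSR1 (POSIX only). `install_debug_handler` and `demo` are hypothetical names, and the dump goes to a temporary file here instead of stderr so that it can be read back.

```python
import faulthandler
import signal
import tempfile
import threading


def install_debug_handler(dump_file):
    """Dump the stack of every thread to dump_file each time SIGUSR1 arrives."""
    # all_threads=True prints one section per thread, not just the current one.
    faulthandler.register(signal.SIGUSR1, file=dump_file, all_threads=True)


def demo():
    """Install the handler, signal this process, and return the dumped text."""
    stop = threading.Event()
    # A second, idle thread so the dump has more than one section.
    worker = threading.Thread(target=stop.wait, name="idle-worker")
    worker.start()
    with tempfile.TemporaryFile(mode="w+") as dump_file:
        install_debug_handler(dump_file)
        try:
            # Deliver SIGUSR1 to the calling thread; the dump is written
            # by the C-level handler before raise_signal() returns.
            signal.raise_signal(signal.SIGUSR1)
            dump_file.seek(0)
            return dump_file.read()
        finally:
            faulthandler.unregister(signal.SIGUSR1)
            stop.set()
            worker.join()


if __name__ == "__main__":
    print(demo())
```

Because the dump is produced by a C-level signal handler, it should still work when the Python code in the child is blocked, which is the situation a hung test is in.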

ajfabbri · 2022-06-10T21:54:56Z

I have a fake hanging test I can check in (marked ok to fail) to test this out in our various CI environments. LMK if you like that idea.

ducktape/tests/runner.py

- Don't double-print a stacktrace
- Don't chain our TimeoutError with zmq.Again exception, etc. The cause of these exceptions is already clear from the message, and there are only a couple of cases that cause them.
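
As a rough illustration of the "don't chain" point, Python's `raise ... from None` drops the low-level exception from the printed traceback. This sketch uses the standard library's `queue.Empty` as a stand-in for `zmq.Again`; `RunnerTimeoutError` and `recv_or_timeout` are hypothetical names, not ducktape's actual code. The message mirrors the "unresponsive after N seconds" line in the example output.

```python
import queue
import time


class RunnerTimeoutError(Exception):
    """Hypothetical stand-in for ducktape.errors.TimeoutError."""


def recv_or_timeout(inbox, timeout_s):
    """Return the next message, or raise a single, unchained timeout error."""
    start = time.monotonic()
    try:
        return inbox.get(timeout=timeout_s)
    except queue.Empty:  # stand-in for zmq.Again
        elapsed = time.monotonic() - start
        # "from None" suppresses the "During handling of the above
        # exception, another exception occurred" section of the traceback.
        raise RunnerTimeoutError(
            f"runner client unresponsive after {elapsed:.2f} seconds."
        ) from None


if __name__ == "__main__":
    try:
        recv_or_timeout(queue.Queue(), 0.05)
    except RunnerTimeoutError as err:
        print(err)
```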

The main test runner should be responsive to child process IPCs. Reduce the timeout for send and recv to parent from 3 to 1 seconds.

Instead of setting linger socket option to zero on a clean passive-side shutdown, set it to a reasonable number when we create the socket.

Add a SIGUSR1 handler to child runner client processes, which we can use to force a stuck test to print out stack traces.

The runner propagates SIGUSR1 to its child processes; this is used to force stuck child processes to print stacktraces.

To allow us to debug hard-to-reproduce "hung test" failures, this commit adds special handling of test timeouts which, for any still-running child processes:
- Sends a SIGUSR1, which causes them to print stack traces of all threads.
- Sends a SIGINT and waits for the child processes to exit.

The goals are to provide diagnostics on where child test clients are stuck, and to allow the ducktape processes to exit, which should allow CI to gather logs.
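
The timeout handling described above can be sketched roughly as follows. This is not the code from this PR: `_stuck_child`, `dump_and_interrupt` and `demo` are hypothetical names, the child writes its dump to a file rather than stderr, and the final `kill()` fallback is an added assumption so the sketch cannot hang. It assumes a POSIX host where the `fork` start method is available.

```python
import faulthandler
import multiprocessing
import os
import signal
import tempfile
import time

# Assumption: POSIX host; "fork" lets the child run a function defined here.
_ctx = multiprocessing.get_context("fork")


def _stuck_child(ready, dump_path):
    """Hypothetical stand-in for a hung test client."""
    dump_file = open(dump_path, "w")
    # On SIGUSR1, write the stack of every thread to dump_file.
    faulthandler.register(signal.SIGUSR1, file=dump_file, all_threads=True)
    # Make SIGINT raise KeyboardInterrupt even if the parent ignored it.
    signal.signal(signal.SIGINT, signal.default_int_handler)
    try:
        ready.set()
        while True:  # simulate a test that never finishes
            time.sleep(0.1)
    except KeyboardInterrupt:
        pass  # exit quietly so the parent can reap this process


def dump_and_interrupt(procs, grace_s=5.0):
    """For each live child: request stack traces, interrupt it, wait for exit."""
    for proc in procs:
        if not proc.is_alive():
            continue
        os.kill(proc.pid, signal.SIGUSR1)  # child dumps its stack traces
        os.kill(proc.pid, signal.SIGINT)   # then ask it to stop
        proc.join(grace_s)
        if proc.is_alive():  # last resort so the runner itself can exit
            proc.kill()
            proc.join()
    return [proc.exitcode for proc in procs]


def demo():
    """Start one stuck child, reap it, and return (exit codes, dump text)."""
    with tempfile.TemporaryDirectory() as tmp:
        dump_path = os.path.join(tmp, "stacks.txt")
        ready = _ctx.Event()
        child = _ctx.Process(target=_stuck_child, args=(ready, dump_path))
        child.start()
        if not ready.wait(10):
            child.kill()
            child.join()
            raise RuntimeError("child did not start")
        codes = dump_and_interrupt([child])
        with open(dump_path) as f:
            return codes, f.read()


if __name__ == "__main__":
    exit_codes, dump = demo()
    print(exit_codes)
    print(dump)
```

The parent waits for the child to signal readiness before sending SIGUSR1, because the default action for SIGUSR1 terminates a process that has not installed a handler yet.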

ajfabbri · 2022-08-16T20:17:30Z

Force-push: rebase on latest.

I finally got around to porting this to upstream ducktape and submitting a PR here.

ajfabbri · 2022-09-15T04:15:17Z

Cleaning up old PRs. Closing this one until I get upstream merged.

ajfabbri mentioned this pull request Jun 10, 2022

Ducktape hangs on NodeResizeTest.test_node_resize redpanda-data/redpanda#4634

Closed


ajfabbri force-pushed the handle-hung-tests branch 2 times, most recently from aa6bc31 to 70da78b Compare June 10, 2022 22:11

Aaron Fabbri added 8 commits August 16, 2022 12:43

tests/runner: tidy up exception output

b1b9ca9

- Don't double-print a stacktrace
- Don't chain our TimeoutError with zmq.Again exception, etc. The cause of these exceptions is already clear from the message, and there are only a couple of cases that cause them.

runner_client: reduce timeout to send/recv from runner

ad3972e

The main test runner should be responsive to child process IPCs. Reduce the timeout for send and recv to parent from 3 to 1 seconds.

runner: set reasonable LINGER_MS sockopt up front

1f197de

Instead of setting linger socket option to zero on a clean passive-side shutdown, set it to a reasonable number when we create the socket.

runner_client: add type hint for socket type

48bc00f

runner: print elapsed time when client is unresponsive

58d57f9

runner_client: add debug signal handler to dump stacktraces

e16ca7c

Add a SIGUSR1 handler to child runner client processes, which we can use to force a stuck test to print out stack traces.

runner: propagate SIGUSR1 to child processes

7dd0573

This is used to force stuck child processes to print stacktraces.

ajfabbri force-pushed the handle-hung-tests branch from 70da78b to 80360b1 Compare August 16, 2022 19:43

ajfabbri closed this Sep 15, 2022

ajfabbri deleted the handle-hung-tests branch September 15, 2022 04:15

ajfabbri restored the handle-hung-tests branch September 15, 2022 04:15

This pull request was closed.
