Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add debugging for hung tests #10

Closed
wants to merge 8 commits into from

Conversation

ajfabbri
Copy link

When debugging hard-to-reproduce "hung" test failures (like redpanda #4634), we ran into a couple of issues:

  • Lack of any clues as to where the child test process is stuck.
  • Ducktape processes failing to exit, and failing to gather any log output (related, IIUC).

The solution makes a couple of modifications:

  • Adding debug signal handler to child processes, which will dump stack traces for all threads.
  • Tolerating TimeoutError exceptions in main.py, as much as possible, to gather debug information and get all processes to exit.
  • Tweaking some zmq socket-related parameters to try to avoid stuck sockets on unclean (i.e. passive side of connection) TCP shutdown.

Here is an example of the new ouptut you should see for a hanging test:

INFO:2022-06-10 20:52:07,852]: RunnerClient: rptest.tests.full_node_recovery_test.FullNodeRecoveryTest.test_hung_test: Running...                                                                                                    [36/455]
[ERROR:2022-06-10 20:52:22,861]: Exception receiving message: <class 'ducktape.errors.TimeoutError'>: runner client unresponsive after 15.01 seconds.
Timeout: assuming hung test.                                                                                           
runner client unresponsive after 15.01 seconds.                                                                        
Hung test stacktrace: Sending SIGUSR1 to child pid 13003                                                               
Thread 0x00007f1803fff640 (most recent call first):                                                                    
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/packet.py", line 301 in read_all                      
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/packet.py", line 459 in read_message                  
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/transport.py", line 2055 in run                       
  File "/usr/lib/python3.9/threading.py", line 973 in _bootstrap_inner                                                 
  File "/usr/lib/python3.9/threading.py", line 930 in _bootstrap                                                       
                                                                                                                       
Thread 0x00007f1810e67640 (most recent call first):                                                                    
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/packet.py", line 301 in read_all                      
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/packet.py", line 459 in read_message                  
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/transport.py", line 2055 in run                       
  File "/usr/lib/python3.9/threading.py", line 973 in _bootstrap_inner                                                 
  File "/usr/lib/python3.9/threading.py", line 930 in _bootstrap                                                       
                                                                                                                       
Thread 0x00007f1811668640 (most recent call first):        
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/packet.py", line 301 in read_all                      
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/packet.py", line 459 in read_message                  
  File "/home/ubuntu/.local/lib/python3.9/site-packages/paramiko/transport.py", line 2055 in run                       
  File "/usr/lib/python3.9/threading.py", line 973 in _bootstrap_inner                                                 
  File "/usr/lib/python3.9/threading.py", line 930 in _bootstrap                                                       
                                                                                                                       
Current thread 0x00007f181ad7afc0 (most recent call first):                                                            
  File "/home/ubuntu/redpanda/tests/rptest/tests/full_node_recovery_test.py", line 134 in _some_helper_func            
  File "/home/ubuntu/redpanda/tests/rptest/tests/full_node_recovery_test.py", line 143 in _some_func                   
  File "/home/ubuntu/redpanda/tests/rptest/tests/full_node_recovery_test.py", line 150 in test_hung_test               
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35 in wrapped                                    
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 233 in run_test         
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 141 in run              
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 39 in run_client        
  File "/usr/lib/python3.9/multiprocessing/process.py", line 108 in run                                                
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315 in _bootstrap                                         
  File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 71 in _launch                                          
  File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 19 in __init__                                         
  File "/usr/lib/python3.9/multiprocessing/context.py", line 277 in _Popen                                             
  File "/usr/lib/python3.9/multiprocessing/context.py", line 224 in _Popen                                             
  File "/usr/lib/python3.9/multiprocessing/process.py", line 121 in start                                              
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner.py", line 264 in _run_single_test        
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/tests/runner.py", line 214 in run_all_tests           
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ducktape/command_line/main.py", line 182 in main               
  File "/home/ubuntu/.local/bin/ducktape", line 8 in <module>                                                          
Sending SIGINT to child pid 13003

@ajfabbri
Copy link
Author

I have a fake hanging test I can check in (with ok to fail) to test this out in our various CI environments.. LMK if you like that idea.

@ajfabbri ajfabbri force-pushed the handle-hung-tests branch 2 times, most recently from aa6bc31 to 70da78b Compare June 10, 2022 22:11
Aaron Fabbri added 8 commits August 16, 2022 12:43
- Don't double-print a stacktrace
- Don't chain our TimeoutError with zmq.Again exception, etc.
  The cause of these exceptions are already clear from the message, and
  there are only a couple of cases that cause it.
The main test runner should be responsive to child process IPCs. Reduce
the timeout for send and recv to parent from 3 to 1 seconds.
Instead of setting linger socket option to zero on a clean passive-side
shutdown, set it to a reasonable number when we create the socket.
Add a SIGUSR1 handler to child runner client processes, which we can use
to force a stuck test to print out stack traces.
This is used to force stuck child processes to print stacktraces.
To allow us to debug hard-to-reproduce "hung test" failures, this commit
adds special handling of test timeouts which, for any still-running
child processes:

- Sends a SIGUSR1 which causes them to print stack traces of all
threads.
- Sends a SIGINT and wait for children processes to exit.

The goals are to provide diagnostics on where child test clients are
stuck, and to allow the ducktape processes to exit, which should allow
CI to gather logs.
@ajfabbri
Copy link
Author

Force-push: rebase on latest.

I finally got around to porting this to upstream ducktape and submitting a PR here.

@ajfabbri
Copy link
Author

Cleaning up old PRs. Closing this one until I get upstream merged.

@ajfabbri ajfabbri closed this Sep 15, 2022
@ajfabbri ajfabbri deleted the handle-hung-tests branch September 15, 2022 04:15
@ajfabbri ajfabbri restored the handle-hung-tests branch September 15, 2022 04:15
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant