Ducktape hangs on NodeResizeTest.test_node_resize
#4634
Comments
Taking a look, thanks. |
Tried a reproduction loop last night. No luck so far.
|
@ajfabbri I find that some tests are resource-sensitive, so when I'm hunting a flaky test I run only that particular test in a loop while, in a parallel window, I alternate between two sufficiently distant redpanda commits and build them constantly :) But of course there are other ways to stress the hardware. |
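A minimal sketch of the kind of reproduction loop described above, assuming the single test can be launched as one command. The function name `repro_loop` and its parameters are hypothetical, and the background load (for example, the alternating builds) would run separately:

```python
import subprocess
import sys


def repro_loop(cmd, max_iters=100, timeout_s=1800):
    """Run cmd repeatedly and stop at the first failure or hang.

    Returns (iteration, outcome), where outcome is "failed", "hung",
    or "passed" if every iteration exited with status 0.
    """
    for i in range(1, max_iters + 1):
        try:
            # subprocess.run kills the child if the timeout expires
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return i, "hung"
        if result.returncode != 0:
            return i, "failed"
    return max_iters, "passed"
```

`cmd` is whatever argument list runs only the test under suspicion. Keeping `timeout_s` well below the runner's own timeout makes a hang show up sooner.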
Spent some time this evening digging into zeromq options and looking at upstream history. Will try another reproduction loop shortly. |
Re-running again on clustered ducktape with some load added to the brokers in the background:
Also looking at whether we can change the runner to grab logs / artifacts in this case. Debugging without them is pretty hard. |
The error occurs because the RunnerClient doesn't receive a message within the configured 30 minute timeout. This could be because either (1) the test actually takes more than 30 minutes, or (2) an error somewhere else prevents an expected message from being sent.
We can effectively rule out a 30 minute test because it takes around 3 minutes from when the test That leaves (2). Observe that all instances of the buildkite logs above seem to hang around the |
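To illustrate why the two cases look the same from the receiving side, here is a standard-library analogue of a receiver waiting with a timeout. This is not ducktape's zmq code. `wait_for_result` and `worker` are hypothetical names, and a `queue.Queue` stands in for the message channel:

```python
import queue
import threading
import time


def wait_for_result(inbox, timeout_s):
    """Block until a message arrives or timeout_s elapses."""
    try:
        return inbox.get(timeout=timeout_s)
    except queue.Empty:
        # The receiver cannot tell a slow sender, case (1),
        # from a message that was never sent, case (2).
        return None


def worker(inbox, run_time_s, send_result=True):
    time.sleep(run_time_s)  # stands in for the test body
    if send_result:         # in case (2) this send never happens
        inbox.put("PASS")
```

Only timing evidence from outside the receiver, such as the roughly 3 minute test duration noted above, separates the two cases.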
one place to look might be whether there are any wasm related services that could get stuck on shutdown. but before i looked at that i'd try to verify what is happening between ducktape reporting PASS and the timeout occurring, to see if it is possible for some other shutdown events to get interleaved there. I looked briefly for any sort of port usage that might conflict with the zmq receiver/sender range of 555x-566x and didn't see anything. |
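A quick way to script the port check mentioned above, as a sketch only. `ports_in_use` is a hypothetical helper, and reading "555x-566x" as ports 5550 through 5669 is an assumption:

```python
import errno
import socket


def ports_in_use(ports, host="127.0.0.1"):
    """Return the ports that cannot be bound on host right now."""
    busy = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
            except OSError as e:
                if e.errno == errno.EADDRINUSE:
                    busy.append(port)
                else:
                    raise  # e.g. permission errors are not "in use"
    return busy
```

Under that assumption, `ports_in_use(range(5550, 5670))` run on the node hosting the runner would list any ports already taken. It only reports what is bound at that instant, so a transient conflict during shutdown could still be missed.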
Agreed on the behavior. The wasm thing is interesting. I'd noticed that when I first looked at this, but dismissed it in favor of adding some visibility on where the test client was stuck. The wasm tests have been disabled recently, though, which would explain why I cannot seem to reproduce it. |
Accidentally reproduced similar behavior by:
A little bit later the runner client processes time out and exit (but note that I'm running with modifications to ducktape which reduce some zmq timeouts):
|
nice clue. my guess at this point is that there is some unrecoverable error in the send-side of the messages that the TestRunner receives. this error is the same, but in the previous instances in buildkite we can see the 30 minute timeout being hit from the log timestamps. but def seems like you're on the right track. |
Played with this a bit more tonight. Just a fake "hung" test and shortening
|
Update: this has not reproduced for about a month. I also could not reproduce it locally, or on CDT, even with various levels of background load running. PandaResults:
I've added some changes to ducktape which should make it easier to debug these in the future. When those are merged, I plan to close this, with the understanding that we will reopen it as needed if it shows up again. |
From PandaResults:
|
Looks like the latest report (previous comment here) is unrelated, as the test did not hang, but emitted a common error log line. I've filed #5461 to cover that failure. |
This is ready to be closed, just waiting for the ducktape changes that improve diagnostics on hung tests to get in: redpanda-data/ducktape#10 Working theory is that one of the recently disabled WASM tests is not shutting down properly, causing the worker processes to hang. This bug has been impossible to reproduce since those tests were disabled, and the log activity seems to imply a relationship. This is likely also the cause of #4382. |
Closing this since this bug has not reproduced for months. If this reappears, note the hints about wasm/coproc tests being a possible culprit. Once the linked ducktape PR is merged, we should have an easier time seeing which test is broken in the logs. |
https://buildkite.com/redpanda/redpanda/builds/9879#8e5893f2-3651-4574-b6fb-023068c913a4
If we look at the buildkite log we'll see that it starts
NodeResizeTest.test_node_resize
but the test never finishes