SSHException in CompactionEndToEndTest
test_basic_compaction
#6792
Comments
Handling the exception makes sense to me. Just searching around, it looks like paramiko has some old livelock bugs when SSH re-keying overlaps with large data transfers within a session (as in this case, where the producer generates roughly 0.5 GB of data per run). If the issue persists, I think we can also reduce the test load or try fiddling with RekeyLimit. I'm not 100% sure about the latter, but it is probably worth a try if we exhaust all other options.
I have found a discussion about this error: paramiko/paramiko#822
@ZeDRoman Ya, that seems like it could work, nice find. It's hard coded to ~536MB and the generated test output is very close to that value. Should it go in as a ducktape patch? I was hoping to do something like this in ducktape, where we pause the data transfer thread while the packetizer is rekeying, but I'm fine with either approach. Also some observations
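To make the "~536MB" figure concrete, here is a rough sketch of the arithmetic. It assumes the hard-coded limit being discussed is a byte threshold of 2**29, which is how I read the paramiko discussion linked above; the constant and helper names below are just illustrative and are not taken from the test code.

```python
# Assumption: the client re-keys after 2**29 bytes have been transferred
# in a session. This constant is a stand-in for illustration only.
ASSUMED_REKEY_BYTES = 2 ** 29


def fraction_of_rekey_limit(transferred_bytes, limit=ASSUMED_REKEY_BYTES):
    """Return how much of the assumed re-key threshold a transfer uses up."""
    return transferred_bytes / limit


def crosses_rekey_limit(transferred_bytes, limit=ASSUMED_REKEY_BYTES):
    """True if a transfer of this size would reach the assumed threshold."""
    return transferred_bytes >= limit


print(ASSUMED_REKEY_BYTES)  # 536870912
# A run producing about 0.5 GB (taken here as 500_000_000 bytes) sits just
# under the threshold, so small variations in output size can tip it over.
print(crosses_rekey_limit(500_000_000))
print(crosses_rekey_limit(540_000_000))
```

If that reading is right, a run whose output hovers around 0.5 GB would sometimes trigger a re-key mid-transfer and sometimes not, which would fit the intermittent failures.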
I think that your fix might work, but we have multiple places where this error appears. It also appears in
Right, it is hacky. Do you want to submit your patch and loop this test? I suspect we need to reset both
Changing it due to redpanda-data/redpanda#6792
in this CDT run https://buildkite.com/redpanda/vtools/builds/4312#018489c4-806d-40af-b364-fd8e3cd49a22
The failures in CDT may be related to my last comment
We may need to bump the limit on the server side to delay server side triggers for initialization. Let me poke around a bit.
I think the latest commit fixed the issue (no occurrences in last night's run). Let's reopen if it resurfaces.
@bharathv did you have more ideas for addressing this? Looks like the last changes haven't quite covered it all.
One last try here https://github.com/redpanda-data/vtools/pull/1177, other than that I don't have any new ideas atm.
IMHO we're trying to solve the wrong problem: using ssh as a channel for streaming data is really weird. Somebody introduced this pattern as a shortcut instead of implementing a service with validation logic on the remote side, and then we copied it and reused it across all the tests. As a result we have been running into the same problems over and over again for a year now. In some cases we may wrap an existing verifier as a remote service; for others we may need to write an app that tests a precise scenario. We use this approach in chaos tests and it works like a charm. I made an attempt to introduce this pattern to ducktape but ran out of steam before landing #5238; it's probably time to push it over the finish line.
Agree that SSH is not the right channel for streaming the output of such verbose commands; this can be discussed further in 5238. Closing this for now as the issue has not resurfaced in the last 96h.
Changing it due to redpanda-data/redpanda#6792 (cherry picked from commit 7ca49dd)
5x on dev in last 10 days
This is not the same as #6745 but may have a related cause, if perhaps in the compacted case the worker is not sending enough output to keep a connection alive?
https://buildkite.com/redpanda/redpanda/builds/16781#0183e27b-e51a-49b3-b16f-c6dd40a33b10
https://buildkite.com/redpanda/redpanda/builds/16753#0183d9e9-0fc6-4206-9f40-46aa62abeb64
https://buildkite.com/redpanda/redpanda/builds/16788#0183e45a-5bc5-46b2-8a8b-d1c7507ac2cc
Earliest instance:
https://buildkite.com/redpanda/redpanda/builds/16507#0183c94c-eaa8-4816-9981-e301aada4d5b
In the case I looked at, this SSHException is happening ~30 seconds after the remote process has been signalled to stop, and the VerifiableProducer log file shows that it stopped cleanly.
So I think we should handle this exception in _worker and swallow it if the service has already been asked to stop.
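Roughly what I have in mind, as a minimal sketch rather than the actual ducktape code. The `StreamingService` class, its `_stopping` flag and the `read_remote_output` callable are all hypothetical names for illustration, and `SSHException` here is a local stand-in for paramiko's exception so the snippet runs on its own.

```python
import threading


class SSHException(Exception):
    """Local stand-in for paramiko's SSHException (illustration only)."""


class StreamingService:
    """Hypothetical service whose worker reads remote output over SSH."""

    def __init__(self, read_remote_output):
        # read_remote_output: callable returning an iterable of output
        # lines; it may raise SSHException if the session breaks.
        self._read_remote_output = read_remote_output
        self._stopping = threading.Event()
        self.lines = []

    def stop(self):
        # Record the stop request before the remote process is signalled.
        self._stopping.set()

    def _worker(self):
        try:
            for line in self._read_remote_output():
                self.lines.append(line)
        except SSHException:
            if self._stopping.is_set():
                # The service was already asked to stop, so a broken
                # session at this point is treated as expected noise.
                return
            # Otherwise the failure is unexpected and should surface.
            raise
```

The point of checking the flag inside the handler is that an SSHException during normal operation still fails the test; only errors that arrive after the stop request are swallowed.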