Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SSHException in CompactionEndToEndTest test_basic_compaction #6792

Closed
jcsp opened this issue Oct 17, 2022 · 13 comments · Fixed by #6825
Closed

SSHException in CompactionEndToEndTest test_basic_compaction #6792

jcsp opened this issue Oct 17, 2022 · 13 comments · Fixed by #6825
Assignees
Labels

Comments

@jcsp
Copy link
Contributor

jcsp commented Oct 17, 2022

5x on dev in last 10 days

This is not the same as #6745 but may have related cause, if perhaps in the compacted case the worker is not sending enough output to keep a connection alive?

https://buildkite.com/redpanda/redpanda/builds/16781#0183e27b-e51a-49b3-b16f-c6dd40a33b10

https://buildkite.com/redpanda/redpanda/builds/16753#0183d9e9-0fc6-4206-9f40-46aa62abeb64

https://buildkite.com/redpanda/redpanda/builds/16753#0183d9e9-0fc6-4206-9f40-46aa62abeb64

https://buildkite.com/redpanda/redpanda/builds/16788#0183e45a-5bc5-46b2-8a8b-d1c7507ac2cc

Earliest instance:
https://buildkite.com/redpanda/redpanda/builds/16507#0183c94c-eaa8-4816-9981-e301aada4d5b

  File "/root/tests/rptest/tests/compaction_end_to_end_test.py", line 131, in test_basic_compaction
    self.producer.stop()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/background_thread.py", line 86, in stop
    self._propagate_exceptions()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/background_thread.py", line 100, in _propagate_exceptions
    raise Exception(self.errors)
Exception: VerifiableProducer-0-140085273973024-worker-1: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/background_thread.py", line 38, in _protected_worker
    self._worker(idx, node)
  File "/root/tests/rptest/services/verifiable_producer.py", line 218, in _worker
    for line in node.account.ssh_capture(cmd):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/cluster/remoteaccount.py", line 652, in next
    return next(self.iter_obj)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/cluster/remoteaccount.py", line 318, in output_generator
    for line in iter(stdout.readline, ''):
  File "/usr/local/lib/python3.10/dist-packages/paramiko/file.py", line 291, in readline
    new_data = self._read(n)
  File "/usr/local/lib/python3.10/dist-packages/paramiko/channel.py", line 1361, in _read
    return self.channel.recv(size)
  File "/usr/local/lib/python3.10/dist-packages/paramiko/channel.py", line 710, in recv
    self.transport._send_user_message(m)
  File "/usr/local/lib/python3.10/dist-packages/paramiko/transport.py", line 1914, in _send_user_message
    raise SSHException(

In the case I looked at, this SSHException is happening ~30 seconds after the remote process has been signalled to stop, and the VerifiableProducer log file shows that it stopped cleanly.

So I think we should handle this exception in _worker and swallow it if the service has already been asked to stop.

@bharathv
Copy link
Contributor

Handling the exception makes sense to me.

Just searching around, it looks like paramiko has some old live lock bugs when SSH re-keying overlaps with large data transfers within a session (like in this case where the producer is generating like 0.5gb of data per run). If the issue persists, I think we can also reduce the test load or try fiddling with RekeyLimit, I'm not 100% sure of the latter but probably worth a try if we exhaust all options.

@ZeDRoman
Copy link
Contributor

I have found a discussion about this error paramiko/paramiko#822
In last comment there is a solution, that suggests to rise paramiko.packet.Packetizer.REKEY_BYTES to some large value.
@bharathv what do you think about it?

@bharathv
Copy link
Contributor

@ZeDRoman Ya, that seems like it could work, nice find. Its hard coded to ~536MB and the generated test output is very close to that value. It should go in as a ducktape patch?

I was hoping to do something like this in ducktape where we pause the data transfer thread while packetizer is rekeying.. but I"m fine with either approach.

Also some observations

@ZeDRoman
Copy link
Contributor

I was hoping to do something like this in ducktape where we pause the data transfer thread while packetizer is rekeying.. but I"m fine with either approach.

I think that your fix might work, but we have multiple places, where this error appear. It also appears in self.producer.stop()
So we will need to add it in multiple places

@bharathv
Copy link
Contributor

Right it is hacky, do you want to submit your patch and loop this test? I suspect we need to reset both REKEY_BYTES and RekeyLimit (on server) because either side can initiate it.

ZeDRoman added a commit to redpanda-data/ducktape that referenced this issue Oct 19, 2022
ZeDRoman added a commit to redpanda-data/ducktape that referenced this issue Oct 19, 2022
ZeDRoman added a commit to redpanda-data/ducktape that referenced this issue Oct 19, 2022
ZeDRoman added a commit to redpanda-data/ducktape that referenced this issue Oct 19, 2022
@andijcr
Copy link
Contributor

andijcr commented Nov 21, 2022

in this cdt run https://buildkite.com/redpanda/vtools/builds/4312#018489c4-806d-40af-b364-fd8e3cd49a22
CompactionEndToEndTest.test_basic_compaction.key_set_cardinality=10.initial_cleanup_policy=compact.transactions=True.tx_inject_aborts=True
failed again in a similar way :
SSHException(\nparamiko.ssh_exception.SSHException: Key-exchange timed out waiting for key negotiation\n')

@andijcr andijcr reopened this Nov 21, 2022
@bharathv
Copy link
Contributor

The failures in CDT may be related to my last comment

I suspect we need to reset both REKEY_BYTES and RekeyLimit (on server) because either side can initiate it.

We may need to bump the limit on the server side to delay server side triggers for initalization. Let me poke around a bit.

@bharathv
Copy link
Contributor

I think the latest commit fixed the issue (no occurrences in the last night run). Lets reopen if it resurfaces.

@dotnwat
Copy link
Member

dotnwat commented Dec 9, 2022

@dotnwat dotnwat reopened this Dec 9, 2022
@jcsp
Copy link
Contributor Author

jcsp commented Dec 12, 2022

@bharathv did you have more ideas for addressing this? Looks like the last changes haven't quite covered it all.

@bharathv
Copy link
Contributor

One last try here https://github.com/redpanda-data/vtools/pull/1177, other than that I don't have any new ideas atm.

@rystsov
Copy link
Contributor

rystsov commented Dec 18, 2022

IMHO we're trying to solve the wrong problem, using ssh as a channel for streaming data is really weird. Somebody introduced this pattern as a shortcut instead of implementing a service with validation logic on the remote part and then we copied it and reused it across the whole tests and as a result we running into the same problems over and over again for a year now.

In some cases we may wrap an existing verifier as a remote service, for others we may need to write an app which tests a precise scenario in mind. We use this approach in chaos tests and it works like a charm.

I made an attempt to introduce this pattern to ducktape but run out of steam before landing #5238; probably it's time to push it over the finish line

@bharathv
Copy link
Contributor

Agree that SSH is not the right channel for streaming the output of such verbose commands, can be discussed further in 5238. Closing this for now as the issue has not resurfaced in the last 96h.

gousteris pushed a commit to gousteris/ducktape that referenced this issue Aug 30, 2023
gousteris pushed a commit to gousteris/ducktape that referenced this issue Aug 30, 2023
gousteris pushed a commit to gousteris/ducktape that referenced this issue Aug 30, 2023
andrewhsu pushed a commit to andrewhsu/ducktape that referenced this issue Aug 30, 2023
gousteris pushed a commit to gousteris/ducktape that referenced this issue Aug 31, 2023
savex pushed a commit to redpanda-data/ducktape that referenced this issue Sep 8, 2023
savex pushed a commit to redpanda-data/ducktape that referenced this issue Sep 12, 2023
savex pushed a commit to redpanda-data/ducktape that referenced this issue Sep 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants