Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure in rpk:test-container #2418

Closed
jcsp opened this issue Sep 23, 2021 · 24 comments · Fixed by #4565 or #4721
Closed

Failure in rpk:test-container #2418

jcsp opened this issue Sep 23, 2021 · 24 comments · Fixed by #4565 or #4721
Assignees

Comments

@jcsp
Copy link
Contributor

jcsp commented Sep 23, 2021

This test is run after all the ducktape tests & seems to be failing every 10-20 runs.

https://buildkite.com/vectorized/redpanda/builds/2528#e0afe087-83be-4b09-9899-3cea1e8c966c

Couple lines stand out

Error: Error restarting the cluster: Error response from daemon: driver failed programming external connectivity on endpoint rp-node-1 (77e690e2d23268aed897727feaa2caa7837f7c60d531150ed7f87fe35234668d): Bind for 0.0.0.0:41369 failed: port is already allocated

Error: flag needs an argument: --brokers

@jcsp jcsp added kind/bug Something isn't working area/rpk ci-failure labels Sep 23, 2021
@jcsp
Copy link
Contributor Author

jcsp commented Oct 1, 2021

@jcsp
Copy link
Contributor Author

jcsp commented Oct 22, 2021

Again today:
https://buildkite.com/vectorized/redpanda/builds/3536#8d690446-253a-4601-8b03-31a00e6aa108

Looked at the code - I think this an rpk bug. In pkg/cli/cmd/container/start.go, the vnet.GetFreePort utility is being used repeatedly to pick ports for redpanda. That function is just opening a listen port and seeing which one linux allocates to it. But when a port is used for a redpanda node, there's no guarantee that the node is up+listening on all its ports before the next node's GetFreePort calls run.

It would be more robust to generate a pool of available ports up-front before starting any redpanda nodes, and then allocate from that pool -- that way there wouldn't be a race between redpanda startup and GetFreePort.

@jcsp
Copy link
Contributor Author

jcsp commented Oct 22, 2021

Would also be nice to have the rpk info/debug log output during this test (does it just need a -v flag?)

@jcsp
Copy link
Contributor Author

jcsp commented Oct 22, 2021

Failure of this test in a different way here: https://buildkite.com/vectorized/redpanda/builds/3537#29a68ee8-bc64-488d-b4e4-4c0a9020b9c0

On that one, we see two nodes start successfully, and the third node silently fails to start ('rpk start' runs, status check output appears, but not output from redpanda binary appears).

@twmb
Copy link
Contributor

twmb commented Oct 22, 2021

FWIW. my goal is to switch to using a static set of ports, but that'll likely conflict with these tests.

@jcsp
Copy link
Contributor Author

jcsp commented Nov 1, 2021

@jcsp
Copy link
Contributor Author

jcsp commented Nov 3, 2021

This seems to be becoming more frequent for some reason https://buildkite.com/vectorized/redpanda/builds/3908#ad8d710f-2aea-4577-bdc4-bd256fd935bf

@jcsp
Copy link
Contributor Author

jcsp commented Nov 9, 2021

Another one:
https://buildkite.com/vectorized/redpanda/builds/4105#b96178db-bc6c-4922-82e1-6d61cf709ca5

@twmb how essential is the coverage provided by these tests? The failure is happening often enough that I'm inclined to switch them off until fixed.

@twmb
Copy link
Contributor

twmb commented Nov 10, 2021

AFAICT the code the tests are testing is strictly isolated to rpk container, which is effectively frozen at the moment until future maintenance happens. I think we can comment out these tests until the future work happens.

@jcsp
Copy link
Contributor Author

jcsp commented Nov 10, 2021

@dotnwat
Copy link
Member

dotnwat commented Dec 22, 2021

  • traivs is going to rewrite/refactor rpk container and will deal with ports

@twmb
Copy link
Contributor

twmb commented Feb 15, 2022

As an update, this test is disabled and the code is frozen until aspects of it are rewritten, notably the port randomization should be fixed, and the test should tolerate that.

AFAICT, this test can be completely rewritten at that point, and I'm not sure it's worth spending time to fix the current issue when in the end it'll be removed or rewritten.

@dotnwat
Copy link
Member

dotnwat commented Feb 15, 2022

AFAICT, this test can be completely rewritten at that point, and I'm not sure it's worth spending time to fix the current issue when in the end it'll be removed or rewritten.

totally agree but if rpk:container is being used, then we should also keep in mind how long we think it might be until it is rewritten so that we do have testing for a in-use feature.

do you have any guidance on either (1) where someone would start poking around to make the port randomization be more robust / correct and (2) how long that might take?

@twmb
Copy link
Contributor

twmb commented Feb 16, 2022

The port probing can be pretty easily traced through gopls's jump to definition,
but I'm not sure how simple it will be to fix, because AFAICT it's using a pretty simple algorithm right now that should be robustish. Based on John's comment above, "generate a pool of available ports up-front before starting any redpanda nodes, and then allocate from that pool", this is pretty easy.

@r-vasquez
Copy link
Contributor

Reopening due to https://buildkite.com/redpanda/redpanda/builds/9845#c84f2c25-401e-47af-8b21-379197d2b1cc

+ echo 'Cleaning up'
--
  | Cleaning up
  | + ./rpk container purge -v
  | No nodes to remove.
  | You may start a new local cluster with 'rpk container start'
  | + n_nodes=3
  | + addrs=($(./rpk container start -v --image 'localhost/redpanda:dev' -n $n_nodes --retries 30 \| grep "        127.0.0.1:" \| awk '{print $2}' \| tr -d " \t"))
  | ++ ./rpk container start -v --image localhost/redpanda:dev -n 3 --retries 30
  | ++ grep '        127.0.0.1:'
  | ++ awk '{print $2}'
  | ++ tr -d ' \t'
  | Error: error restarting the cluster: Error response from daemon: driver failed programming external connectivity on endpoint rp-node-2 (09e5b37c70640d4bef5bdc7d6a02020a09499d4fdac8923dece86238bce5e8d6): Bind for 0.0.0.0:44947 failed: port is already allocated

@r-vasquez r-vasquez reopened this May 6, 2022
@r-vasquez
Copy link
Contributor

We tested with the tip of dev: 1k nodes / 5k ports and we allocate unique ports every time. Port colliding might be related to another service so we will keep an eye out if it happens again.

@jcsp
Copy link
Contributor Author

jcsp commented May 9, 2022

@r-vasquez what changed between that failure and the tip of dev to make us think this won't happen again?

I get that it might be interference from something else, but that interference needs to be hunted down, unless we've exhausted all avenues of investigation?

@twmb
Copy link
Contributor

twmb commented May 9, 2022

Our next approach if this fails again is to just reallocate 5 new free ports and retry the container start -- we could either leave this issue open, or reopen the first time we see this same CI failure.

We do intend to eventually switch away from the random port choices and instead make the ports static and driven by flags, but that's a bit longer term.

@jcsp
Copy link
Contributor Author

jcsp commented May 9, 2022

If this had failed once a few weeks ago I could understand, but this failed Friday, the day after it was re-enabled, and it's only Monday now. This doesn't seem like a super rare failure -- if there's already a test change in mind to make this more robust, then let's make the change, instead of waiting for more failures.

@jcsp jcsp reopened this May 9, 2022
@jcsp
Copy link
Contributor Author

jcsp commented May 11, 2022

Failed overnight here: https://buildkite.com/redpanda/redpanda/builds/9991#7efb7950-1304-4027-8344-4e8644c8ab69

Error: error restarting the cluster: Error response from daemon: driver failed programming external connectivity on endpoint rp-node-2 (6e8fceb71246db8151b4237ecfe589c8f6ab20698241542c5c16bbc11d5a5ec9): Bind for 0.0.0.0:44185 failed: port is already allocated

@r-vasquez please re-disable today if the fix isn't obvious right away.

@jcsp
Copy link
Contributor Author

jcsp commented May 11, 2022

I just looked again at the patch -- I think the issue is that getFreePortPool is able to find the same port is free more than once, because after it checks the port, there is nothing still listening on it, so future calls to getFreePort can find the same port again. When getFreePort is called, it should reject any port that is already allocated in the pool

https://github.com/redpanda-data/redpanda/pull/4565/files#diff-9b8eca18dc88aa451c5884a35d0e40579fd318e29ce0820679b098e44f9f9a74R62

@jcsp
Copy link
Contributor Author

jcsp commented May 16, 2022

Reopening to track re-enabling the test -- I didn't see a vtools PR since #655 disabled it?

@jcsp jcsp reopened this May 16, 2022
@r-vasquez
Copy link
Contributor

Thanks, I just created: https://github.com/redpanda-data/vtools/pull/667

@twmb
Copy link
Contributor

twmb commented May 16, 2022

Test re-enabling merged, I think we can close this.

@twmb twmb closed this as completed May 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants