Deleting Antrea should do a complete cleanup #181

Closed · antoninbas opened this issue Dec 4, 2019 · 14 comments

Labels
  • good first issue: Good for newcomers
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.

@antoninbas (Contributor) commented Dec 4, 2019

Describe the bug
When running kubectl delete -f antrea.yml, one would expect everything created by Antrea to be deleted, including the OVS bridge on each Node, along with the gateway interface. Failure to do so may cause networking issues if someone decides to deploy another CNI after deleting Antrea.

Versions:
Antrea version:

===> Version information <===
VERSION: v0.2.0-dev
GIT_SHA: 4a446d4
GIT_TREE_STATE: clean
RELEASE_STATUS: unreleased
DOCKER_IMG_VERSION: v0.2.0-dev-4a446d4

Additional context

  1. We cannot simply delete the bridge, etc. when the agent exits. For example, we may want to preserve connectivity during some Antrea upgrades. Note that preserving flows across a restart is currently not possible anyway, since all flows are deleted when the openvswitch daemon restarts.
  2. @jianjuns and I discussed this cleanup issue in a separate PR: Fix Kind support (Linux hosts only) #137 (comment). He suggested having a cleanup DaemonSet dedicated to this purpose.
@antoninbas antoninbas added the bug label Dec 4, 2019
@jianjuns (Contributor) commented Dec 4, 2019

I think we could either have a separate DaemonSet or add a ConfigMap option that instructs antrea-agent to do the cleanup. The two approaches are not very different from each other, but with the ConfigMap approach the agent could watch the option at runtime, once we support runtime config changes later.
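
As a rough illustration of the ConfigMap idea, the option could be a single flag in the agent configuration. This is a minimal sketch only: the `cleanup` key and its placement are hypothetical, not an actual Antrea setting.

```yaml
# Hypothetical sketch: a cleanup option added to the antrea-agent config.
# The "cleanup" key is illustrative only; Antrea does not define it.
apiVersion: v1
kind: ConfigMap
metadata:
  name: antrea-config
  namespace: kube-system
data:
  antrea-agent.conf: |
    # When true, the agent deletes the OVS bridge and gateway interface
    # on its Node instead of (re)configuring them, then exits.
    cleanup: true
```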

@antoninbas antoninbas added this to the Antrea v0.3.0 release milestone Dec 19, 2019
@antoninbas antoninbas self-assigned this Dec 19, 2019
@antoninbas (Contributor, Author)

I'd like to start working on this. I like the DaemonSet approach more, since it works even if there is an issue (e.g. a buggy agent) or if Antrea has already been deleted from the cluster. Someone might have already run kubectl delete -f antrea.yml, and we need to be able to perform the cleanup without the Antrea agent / controller running.

My proposal is as follows:

  • define a YAML manifest antrea-cleanup.yaml to take care of cleaning up artifacts created by Antrea (see the sketch after the list of considerations below)
  • the manifest would define one K8s Job in charge of performing the cleanup, plus the necessary RBAC permissions
  • the Job would be in charge of creating the cleanup DaemonSet, monitoring the status of the cleanup process, and deleting the DaemonSet once cleanup is complete

A few considerations:

  • there is no notion of a DaemonSet that runs to completion and doesn't restart Pods (see Run job on each node once to help with setup kubernetes/kubernetes#64623), but we can use a Controller and have each DaemonSet Pod report completion status to the Controller using CRDs.
  • the Job should probably return an error if Antrea is still running (and recommend deleting Antrea with kubectl delete -f antrea.yml). Another option is to have the Job also delete all K8s resources with the label app=antrea. I'm open to both solutions.
  • the DaemonSet needs to be aware of the Antrea agent configuration (the OVS bridge name / gateway interface name may not be the default ones). This is a bit tricky: once Antrea has been deleted, the ConfigMap is no longer available. Rather than ask the user to modify a ConfigMap in the cleanup manifest (antrea-cleanup.yaml), I would rather modify the Antrea agent to have it save its config (antrea-agent.conf) on the host under /var/run/antrea/. I think this is a very non-intrusive change, very much in the spirit of how we use /var/run/antrea/ for other things (e.g. OVSDB) that we do not need to persist across reboots.
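
To make this concrete, here is a minimal sketch of what antrea-cleanup.yaml could look like. Every name, the image, and the RBAC grant are hypothetical placeholders (the actual manifest was never finalized); a real manifest would use a narrowly scoped ClusterRole rather than cluster-admin.

```yaml
# Hypothetical sketch of antrea-cleanup.yaml; names and image are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: antrea-cleanup
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: antrea-cleanup
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  # Overly broad on purpose to keep the sketch short; a real manifest would
  # grant only what the Job needs (manage DaemonSets, read Nodes, watch the
  # cleanup custom resources).
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: antrea-cleanup
  namespace: kube-system
---
apiVersion: batch/v1
kind: Job
metadata:
  name: antrea-cleanup
  namespace: kube-system
spec:
  backoffLimit: 2
  template:
    spec:
      serviceAccountName: antrea-cleanup
      restartPolicy: OnFailure
      containers:
      - name: cleanup-controller
        # Placeholder image: the "orchestrator" that creates the cleanup
        # DaemonSet, waits for per-Node completion, then deletes it.
        image: antrea/antrea-cleanup-controller:latest
```

The user would then run kubectl apply -f antrea-cleanup.yaml, wait for the Job to report completion, and finally kubectl delete -f antrea-cleanup.yaml.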

@jianjuns @salv-orlando @tnqn any thoughts or concerns?

@jianjuns (Contributor)

Questions:

  1. Do you plan to use the agent binary to do the cleanup, or a separate binary?
  2. For shared resources like the Controller CRDs, who will do the cleanup?

@antoninbas (Contributor, Author)

  1. I was planning to use separate binaries (one for the DaemonSet and one for the "orchestrator", which will be run by the K8s Job and talk to the apiserver).
  2. I think the "orchestrator" could take care of this easily. What do you think?

@jianjuns (Contributor)

I was thinking about running the same agent and controller binaries, to avoid introducing new ones. But I see the benefits of a separate binary too: it could be executed standalone, including manually. In that sense, do you think it would be good to make it part of the CLI, so the user could run the CLI manually to clean up?
Also note that on Windows, we could not clean up Node state with a DaemonSet.

@jianjuns (Contributor)

Just to add: it might be helpful to be able to clean up a Node even after K8s is down.

@antoninbas (Contributor, Author)

I think it's better to introduce new binaries, as I like the idea of separating the cleanup logic from the core functionality (rather than adding a new mode of operation to the existing binaries).

I definitely think we should have the ability to run the binary directly as a process. I would like to organize the code as a library that can be used by the CLI or in a DaemonSet. Even though kubectl apply -f antrea-cleanup.yaml may not work on Windows, it's a nice K8s-native way to do cleanup that doesn't require downloading any binaries manually.

antoninbas added a commit to antoninbas/antrea that referenced this issue Jan 10, 2020
This is still very much a work-in-progress, I'm opening this PR to
gather feedback on the approach.

For someone deleting Antrea, the steps would be as follows:
 * `kubectl delete -f <path to antrea.yml>`
 * `kubectl apply -f <path to antrea-cleanup.yml>`
 * check that job has completed with `kubectl -n kube-system get jobs`
 * `kubectl delete -f <path to antrea-cleanup.yml>`

The cleanup manifest creates a DaemonSet that will perform the necessary
deletion tasks on each Node. After the tasks have been completed, the
"status" is reported to the cleanup controller through a custom
resource (sketched after this commit message). Once the controller has
received enough statuses (or after a timeout of 1 minute), the
controller Job completes and the user can delete the cleanup manifest.

Known remaining items:
 * place cleanup binaries (antrea-cleanup-agent and
 antrea-cleanup-controller) in a separate docker image to avoid
 increasing the size of the main Antrea docker image
 * generate manifest with kustomize?
 * find a way to test this as part of CI?
 * update documentation
 * additional cleanup tasks: as of now we only take care of deleting the
 OVS bridge
 * place cleanup CRD in non-default namespace
 * use kubebuilder instead of the code generator directly (related to
 antrea-io#16); we probably want to punt this task to a future PR.

See antrea-io#181
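
As a sketch of how each DaemonSet Pod could report its "status" back to the cleanup controller, the custom resource might look like the following. The API group, kind, and field names are hypothetical, not the schema actually proposed in the PR:

```yaml
# Hypothetical per-Node cleanup status resource; group, kind, and fields
# are placeholders only.
apiVersion: cleanup.antrea.io/v1alpha1
kind: NodeCleanupStatus
metadata:
  # One object per Node, created by the cleanup DaemonSet Pod on that Node.
  name: node-1
  namespace: kube-system
status:
  # Set to true once the OVS bridge (and any other artifacts) are removed.
  completed: true
  error: ""
```

The controller Job would count these objects (one per Node) and exit once all Nodes have reported, or once the 1 minute timeout expires.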
@antoninbas (Contributor, Author)

Removing this from 0.3.0, as PR #307 still needs a lot of work.

@McCodeman McCodeman added the kind/bug Categorizes issue or PR as related to a bug. label Jan 29, 2020
@salv-orlando salv-orlando removed this from the Antrea v0.4.0 release milestone Feb 12, 2020
@antoninbas antoninbas added this to the Antrea v0.5.0 release milestone Feb 12, 2020
@salv-orlando (Contributor)

Unfortunately there weren't many reviews in the past few days. @antoninbas do you think we can still make a final push for 0.5.0 inclusion, or do we just move this issue to 0.6.0?

@antoninbas (Contributor, Author)

Pushing this out to the next release.

@antoninbas antoninbas removed this from the Antrea v0.8.0 release milestone Jun 17, 2020
@antoninbas antoninbas added this to the Antrea v0.13.0 release milestone Dec 11, 2020
@antoninbas antoninbas added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Dec 11, 2020
@antoninbas antoninbas removed this from the Antrea v1.0 release milestone Apr 7, 2021
@antoninbas antoninbas removed the bug label Apr 19, 2021
zyiou pushed a commit to zyiou/antrea that referenced this issue Jul 2, 2021
Delete the unneeded method that deletes a record from the
record map without a lock.
Add a method to get the flow updated time for a given flow key.
@antoninbas antoninbas added the good first issue Good for newcomers label Oct 2, 2021
@antoninbas antoninbas removed their assignment Oct 2, 2021
github-actions (bot) commented Jan 5, 2022

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2022
@antoninbas antoninbas removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2022
github-actions (bot) commented Apr 6, 2022

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 6, 2022
@antoninbas antoninbas removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 13, 2022
github-actions (bot) commented Jul 13, 2022

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 13, 2022
@antoninbas antoninbas removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 13, 2022
github-actions (bot) commented Oct 12, 2022

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 12, 2022
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jan 11, 2023