Feature/fix deprecated apis 1.16 #110

Merged

Conversation

Cryptophobia
Member

https://kubernetes.io/blog/2019/07/18/api-deprecations-in-1-16/

Per the recent blog post above, we need to migrate some of our API usage to work with the deprecations in Kubernetes 1.16.

I would really appreciate it if other people looked at these changes and tried to fix the broken tests. For now, overriding some of the URL generation in mock.py should fix the tests, but we still have broken tests with these changes.
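
For context, a rough, hypothetical sketch of what that kind of URL override for the mocks could look like (the helper name and mapping below are illustrative only, not the actual mock.py code):

# Hypothetical sketch only: illustrates switching the mocked Deployment and
# ReplicaSet URLs from extensions/v1beta1 to apps/v1. The names here do not
# match the real mock.py helpers.
API_PREFIXES = {
    "deployments": "apis/apps/v1",                  # was "apis/extensions/v1beta1"
    "replicasets": "apis/apps/v1",                  # was "apis/extensions/v1beta1"
    "ingresses": "apis/networking.k8s.io/v1beta1",  # was "apis/extensions/v1beta1"
}

def mock_resource_url(host, resource, namespace, name=None):
    """Build the URL a test mock should register for a namespaced resource."""
    url = "{}/{}/namespaces/{}/{}".format(host, API_PREFIXES[resource], namespace, resource)
    return "{}/{}".format(url, name) if name else url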

Steps to get a working development Django/Python environment:

  1. Install a local PostgreSQL database and initialize it with the postgres user, then edit /var/lib/pgsql/data/pg_hba.conf to allow local connections

  2. Install the local Django and Python dependencies in /rootfs:

     $ pip3 install -r requirements.txt --user
  3. Set these environment variables and make sure the PostgreSQL database is running on port 5432 (a quick connection sanity check is sketched after this list):

    export DEIS_DATABASE_NAME=deis
    export DEIS_DATABASE_PASSWORD=postgres
    export DEIS_DATABASE_USER=postgres
    export DEIS_DATABASE_SERVICE_HOST=localhost
    export DEIS_DATABASE_SERVICE_PORT=5432
  4. To run an individual test, run this command in /rootfs:

    python3 manage.py test --settings=api.settings.testing --noinput scheduler.tests.test_deployments
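
As mentioned in step 3, here is a quick sanity check that the test database is reachable with those variables. It assumes psycopg2 is installed (it is pulled in via requirements.txt in a typical setup, but verify in your environment):

# Sanity check: confirm the PostgreSQL server is reachable using the variables
# from step 3. Assumes psycopg2 is installed (see requirements.txt).
import os
import psycopg2

conn = psycopg2.connect(
    host=os.environ["DEIS_DATABASE_SERVICE_HOST"],
    port=os.environ["DEIS_DATABASE_SERVICE_PORT"],
    user=os.environ["DEIS_DATABASE_USER"],
    password=os.environ["DEIS_DATABASE_PASSWORD"],
    dbname=os.environ["DEIS_DATABASE_NAME"],
)
print(conn.get_dsn_parameters())  # shows the resolved connection settings
conn.close()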

@kingdonb left a comment (Member)

👍 I still need to test this more, but these changes look good to me

@kingdonb
Member

I don't think this is the final change: I was not able to see the Deployment API version apps/v1 being used when my upgraded controller created a new namespace and deployed an app via Buildpack.

I'm not sure what's wrong, but I'm revoking my checkmark; sorry for being premature.

@kingdonb left a comment (Member)

I don't think I understand the mock tests, but I'll try deploying a completely fresh cluster in case there is some stale state left on my instance contributing to the problem.

@felixbuenemann
Contributor

@Cryptophobia if you are running both Postgres and the Python app on your local system, you don't need to modify pg_hba.conf. Just make sure you have a database user that matches the user running the Python app; you can then log in to the database as that user without a password.

How do you plan on handling backwards-compatibility for pre-1.9.0 Kubernetes clusters?

The current change contains only one place where the version is checked, and the charts also hardcode the new value.

@kingdonb
Member

kingdonb commented Sep 23, 2019

How do you plan on handling backwards-compatibility for pre-1.9.0 Kubernetes clusters?

I think I understood, and I'm happy to be proven wrong (although it will make this transition a bit more complicated), that pre-1.9 K8s clusters didn't have extensions/v1beta1, but some earlier alpha API. I'm afraid this might not have been completely homogeneous across all versions, which comes back to the point that we should really be checking with the cluster API to find out which APIs it actually serves.
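
For illustration only (not code from this PR), a minimal sketch of that kind of discovery check, hitting the cluster's /apis endpoint with the requests library, could look like this:

# Illustrative only, not the controller's scheduler code: ask the cluster which
# API groups/versions it actually serves instead of guessing from its version.
import requests

def cluster_has_api(api_server, group_version, token=None, verify=True):
    """Return True if e.g. "apps/v1" appears in the cluster's /apis discovery doc."""
    headers = {"Authorization": "Bearer {}".format(token)} if token else {}
    resp = requests.get("{}/apis".format(api_server), headers=headers, verify=verify)
    resp.raise_for_status()
    groups = resp.json().get("groups", [])
    served = {v["groupVersion"] for g in groups for v in g.get("versions", [])}
    return group_version in served

# Pick the Deployment apiVersion based on what the cluster serves, for example:
# api_version = "apps/v1" if cluster_has_api(API_SERVER, "apps/v1", token=TOKEN) else "extensions/v1beta1"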

But if this was done right by upstream, for definitions of right that include "they made it easy for us," then we can simply drop support for v1beta1, substitute the v1 APIs in its place, and everything will be OK for all users on any version. Clusters with K8s versions older than 1.9 will be using deployment APIs that are older than v1beta1, and clusters with support for newer APIs will have no problem serving up the resources they manage with the newer APIs.

But like I said, I'm happy to be proven wrong. I have only done limited testing to understand the problem, and it's entirely possible there are important points I've glossed over (for example, can a 1.12 cluster really serve up extensions/v1beta1 resources through the apps/v1 API, or do you need to upgrade a cluster all the way to 1.16 to get the deprecated resources converted automatically?).

There are a lot of test scenarios which I haven't considered other than in passing, and it feels like this will be a good time for a minor version bump and an actual manual testing plan. I've started working on that.

I'm about 50/50 on whether we should go back and test those old versions. I maintained a 1.5 cluster for a long time, and once I got over RBAC it was ultimately very easy to upgrade it to 1.13; I've kept pace since then and it's on 1.15 now. I have yet to test this new controller on a 1.16 cluster that has removed the deprecated APIs, and I suspect that in its current form it will not work. (What I'm saying in a roundabout way is that versions <1.12 are no longer supported upstream, and we should not go to great lengths to ensure we maintain support for them permanently.)

So, that's where I'll start testing next, with a 1.16 cluster wherever I can put my hands on one.

@felixbuenemann
Contributor

I'm currently stuck with some clusters on 1.5.8 that are hard, but not impossible, to upgrade (mainly because they are based on an old version of kube-aws, so I need to either start with a fresh cluster and transfer all apps and databases, or keep backporting stuff from the new kube-aws), so I could test changes in a staging cluster.

@felixbuenemann
Contributor

I just tried to change the apiVersion on deploy/deis-router from extensions/v1beta1 to apps/v1 and the change was rejected, so I think the templates need some usage of Capabilities.APIVersions.Has to select the proper apiVersion.

See: https://helm.sh/docs/chart_template_guide/#built-in-objects

@felixbuenemann left a comment (Contributor)

I've added some comments with what I think is needed to handle backwards compatibility.

@@ -1,4 +1,4 @@
-apiVersion: extensions/v1beta1
+apiVersion: apps/v1
Contributor

I think this should be something like the following to keep backwards compatibility with older clusters:

Suggested change (replacing the line "apiVersion: apps/v1"):

{{- if .Capabilities.APIVersions.Has "apps/v1" }}
apiVersion: apps/v1
{{- else }}
apiVersion: extensions/v1beta1
{{- end }}

Member Author

Going to try the above and see if it works for us. 👍 We have had problems using .Capabilities.APIVersions.Has before.

Contributor

The docs mention that checking for resources (apps/v1/Deployment) requires at least Helm v2.15, but just using apps/v1 (without a resource) should be fine, if that's specific enough.

Member

You can perhaps just check for "v1". I think you might not find specific API groups, but you can count on support for apps/v1 if APIVersions has "v1".

There was an issue on the Prometheus Helm chart repo that covered this; I will look to see what they settled on to solve the backwards-compatibility question, and whether it differs.

Member Author

Following this suggestion in 5930851. 👍 I found an example in a Datadog agent chart that looks very similar to what we are doing here:

https://github.com/helm/charts/blob/2215d0f752370ae75fa54a9f62b057ed14f8e744/stable/datadog/templates/agent-clusterchecks-deployment.yaml#L2-L9

I think they are adding the final else to make sure it still gets deployed using the default apiVersion of apps/v1 even if the deploying agent (Helm or whatever...) does not have permission to view .Capabilities.APIVersions?

rootfs/scheduler/mock.py (review thread resolved)
rootfs/scheduler/mock.py (outdated, resolved)
rootfs/scheduler/resources/ingress.py (review thread resolved)
rootfs/scheduler/resources/replicaset.py (outdated, resolved)
@Cryptophobia
Member Author

Thank you for your notes, @felixbuenemann. I wanted to add everyone involved in this PR and see how we can best handle the backwards compatibility. I was not sure if we want to support <1.9 clusters in the future anyway, but it is good to get feedback from users running Hephy.

I'm going to take your notes about the charts and versions and make some more changes.

@till

till commented Sep 27, 2019

Do people need to run the kubectl convert commands on existing installations?

@kingdonb
Member

That's a good question. I think the answer is yes, but more testing is needed. In the past we have decided which API version to use strictly by a lowest-common-denominator strategy, paying attention to the kubectl API version. We have missed some opportunities to upgrade clusters >1.12 and preemptively ensure they are using the non-beta APIs before the deprecation.

I think it would be best if we supported both API versions on clusters that are pre-1.16 and can support both, but the alternative would be to upgrade everyone on clusters that should support it. I'm not sure we can reliably predict that based on the version number, but it is a strategy that has worked for us in the past.

It sounds like we could provide a doc for people who are ready to upgrade Deployment and other cluster resources to the v1 API using kubectl convert, so they can do it at their convenience, and those who would be dragged kicking and screaming may get converted automatically if needed.

I didn't know there was a kubectl convert command, and the documentation for 1.16 has led me to believe that objects in the deprecated APIs will still be available (e.g. they would be converted automatically when support for the old APIs is dropped in an upgraded cluster). I don't know if we need to do this in Hephy, but if there are environments that could be upgraded by the controller in a way that is more or less seamless, that would be awesome.

Let's make it clear what happens for cluster operators; this would be a great addition to the documentation once we have worked out exactly what the v1 API transition plan should look like. I like the way we handled the Postgres upgrades, and I don't have a problem with one-way-only transitions if necessary, but we should also consider what it might look like for an upgrade to go badly and how it could be safely reverted, if possible.

@felixbuenemann
Contributor

I don't think you need to do anything for a running cluster.

The kubectl convert command is deprecated with the following message:

kubectl convert is DEPRECATED and will be removed in a future version.
In order to convert, kubectl apply the object to the cluster, then kubectl get at the desired version.

It's not entirely clear from that message whether the cluster needs to support both the old and new representations, or if new cluster versions will still accept old resource manifests and convert them on the fly.

@kingdonb
Member

Ah, so I had misunderstood. The API version is just a facade over a single resource in the back end, and it can be queried with any supported API version. And then I guess, other than the RESTful resource interface itself, which changes shape in some ways, behavior should not be expected to change when upgrading client API versions (only kubelet versions).

@Cryptophobia
Member Author

@felixbuenemann @kingdonb @till

The API version is just a facade over a singular resource in the back-end, and can be queried with any supported API version.

That was my interpretation of the deprecation messages. With the Go client for k8s, I remember querying the same resources on different API endpoints and receiving the same payload/resources. I noticed this when querying the APIs for the Ingress and Service objects. I hope this is still the case, because it will make things much easier for us. 🐿️
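
To make that concrete, here is a small illustrative sketch (not controller code, and it only applies to clusters that still serve both groups, i.e. pre-1.16): the same Deployment can be read through either endpoint and the backing object is the same. The server address, token, and deployment name below are placeholders.

# Illustrative only: on a cluster that still serves both groups (pre-1.16), the
# same Deployment can be fetched via extensions/v1beta1 and apps/v1; only the
# apiVersion field of the response differs, the backing object is the same.
import requests

API_SERVER = "https://kubernetes.default.svc"  # placeholder
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder service-account token

def get_deployment(group_version, namespace, name):
    url = "{}/apis/{}/namespaces/{}/deployments/{}".format(
        API_SERVER, group_version, namespace, name)
    resp = requests.get(url, headers=HEADERS, verify=False)  # skip TLS verify for the sketch only
    resp.raise_for_status()
    return resp.json()

old = get_deployment("extensions/v1beta1", "deis", "deis-controller")  # example name
new = get_deployment("apps/v1", "deis", "deis-controller")
assert old["spec"]["template"] == new["spec"]["template"]  # same underlying object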

@Cryptophobia
Member Author

I've made quite a few changes now. Can you try testing this out again on a 1.16 cluster if you can? @felixbuenemann @kingdonb

Testing this requires building a new image and deploying it using the new chart updates, so that the controller gets deployed with the right apiVersion (apps/v1) going forward.

@felixbuenemann
Contributor

@kingdonb I don't have a 1.16 cluster for testing ATM, do you?

Shouldn't we also test on older clusters to make sure nothing breaks?

@Cryptophobia
Member Author

Cryptophobia commented Oct 22, 2019

Shouldn't we also test on older clusters to make sure nothing breaks?

Yes, we need to test older versions to make sure backwards compatibility works, and also check that the upgrade path to 1.16 from older versions like 1.15 and 1.14 works without problems, just in case.

@kingdonb
Member

kingdonb commented Oct 22, 2019

I do in fact have a Kubeadm cluster running 1.16.2 now, so I can run these tests on the next version.

edit:

And I have an example of the error you will get if you try to install the latest stable Hephy release on it:

$ helm install hephy/workflow --namespace deis --set global.use_rbac=true,router.host_port.enabled=true
Error: validation failed: [unable to recognize "": no matches for kind "DaemonSet" in version "extensions/v1beta1", unable to recognize "": no matches for kind "Deployment" in version "extensions/v1beta1"]

No surprises there. This is using Helm v2.15.0, the last minor version that will handle releases with a Tiller pod. I manually installed this cluster onto some Ubuntu nodes using kubeadm v1.16.2.

@kingdonb
Member

So, I updated the charts in #111 and tried this updated controller out; there are still some things to work out. Pushing the example-go app to the builder as a test, the slug builds successfully, but the deploy results in an error:

remote: 2019/10/27 15:40:59 Error running git receive hook [The controller returned an error when publishing the release: Unknown Error (400): {"detail":"(app::deploy): unorderable types: str() < NoneType()"}]

This is distinct from the previous error, so the ball is moving forward! But there is something wrong here, and I'm not sure how to debug it further by myself.
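
For anyone debugging this: "unorderable types: str() < NoneType()" is the Python 3 error raised when code compares a string to None with an ordering operator (something Python 2 silently allowed), so somewhere in the deploy path a value, probably a version or revision string, is None when it gets compared. A tiny standalone reproduction of that class of error (not the controller code):

# Reproduces the class of error above: Python 3 refuses to order mixed types,
# so a comparison like version_string < None raises instead of silently passing.
revision = "2"    # e.g. a release/revision string
previous = None   # e.g. a lookup that returned nothing

try:
    is_older = revision < previous
except TypeError as err:
    print(err)  # Python 3.5 wording: unorderable types: str() < NoneType()
                # (newer Pythons word it as "'<' not supported between instances ...")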

@kingdonb
Member

kingdonb commented Nov 2, 2019

I merged this with #111 but I can't verify that it works. It may be an issue in another component that causes the problems I'm seeing.

Signed-off-by: Cryptophobia <aouzounov@gmail.com>
@Cryptophobia
Member Author

Merging this in now that we have done testing and are preparing for the new release, Hephy Workflow v2.22.0!

@Cryptophobia Cryptophobia merged commit 20506f4 into teamhephy:master Oct 1, 2020
@Cryptophobia
Member Author

Fixes #125 in the new release, Hephy Workflow v2.22.0, coming soon!
