
Helm-controller pod is using stale tokens #479

Closed
albertschwarzkopf opened this issue May 10, 2022 · 17 comments · Fixed by #480
Labels
bug Something isn't working

Comments

@albertschwarzkopf

Hi,

the "Bound Service Account Token Volume" is graduated to stable and enabled by default in Kubernetes version 1.22.
I am using "helm-controller:v0.21.0" in AWS EKS 1.22 and I have checked, if it is using stale tokens (regarding https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html and https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshooting-boundservicetoken).

So when the API server receives requests with tokens that are older than one hour, then it annotates the pod with "annotations.authentication.k8s.io/stale-token". In my case I can see the following annotation. E.g.:
"annotations":{"authentication.k8s.io/stale-token":"subject: system:serviceaccount:flux-system:helm-controller, seconds after warning threshold: 56187"

Version:

helm-controller:v0.21.0

Cluster Details

AWS EKS 1.22

Steps to reproduce issue

  • Enable EKS Audit Logs
  • Query CW Insights (select the cluster log group) with the query below; a scripted sketch follows after it:
fields @timestamp
| filter @message like /seconds after warning threshold/
| parse @message "subject: *, seconds after warning threshold:*\"" as subject, elapsedtime
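
A hypothetical scripted version of the same query, sketched with aws-sdk-go-v2; the log group name /aws/eks/my-cluster/cluster and the 24-hour window are placeholders:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatchlogs"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	client := cloudwatchlogs.NewFromConfig(cfg)

	query := `fields @timestamp
| filter @message like /seconds after warning threshold/
| parse @message "subject: *, seconds after warning threshold:*\"" as subject, elapsedtime`

	now := time.Now()
	start, err := client.StartQuery(ctx, &cloudwatchlogs.StartQueryInput{
		// Placeholder: replace with the audit log group of your cluster.
		LogGroupName: aws.String("/aws/eks/my-cluster/cluster"),
		QueryString:  aws.String(query),
		StartTime:    aws.Int64(now.Add(-24 * time.Hour).Unix()),
		EndTime:      aws.Int64(now.Unix()),
	})
	if err != nil {
		panic(err)
	}

	// Poll until the query finishes, then print the matched fields.
	for {
		out, err := client.GetQueryResults(ctx, &cloudwatchlogs.GetQueryResultsInput{
			QueryId: start.QueryId,
		})
		if err != nil {
			panic(err)
		}
		if string(out.Status) != "Running" && string(out.Status) != "Scheduled" {
			for _, row := range out.Results {
				for _, f := range row {
					fmt.Printf("%s=%s ", aws.ToString(f.Field), aws.ToString(f.Value))
				}
				fmt.Println()
			}
			return
		}
		time.Sleep(2 * time.Second)
	}
}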
@stefanprodan
Member

@albertschwarzkopf can you please confirm this happens with kustomize-controller also?

@albertschwarzkopf
Author

@stefanprodan thanks for the fast reply!

No, helm-controller only.
kustomize-controller is running version 0.25.0.

Also no issue with notification-controller:v0.23.5 and source-controller:v0.24.4

@stefanprodan
Member

Does kustomize-controller run on the same node as helm-controller? Can you please post the output of kubectl -n flux-system get pods -owide here?

@albertschwarzkopf
Author

No, they are running on different nodes at the moment (we have several nodes).

[screenshot: kubectl -n flux-system get pods -owide output]

@stefanprodan
Member

I see that kustomize-controller was restarted recently. Please wait one hour and report back whether kustomize-controller runs into the same issue. I'm trying to figure out if this is specific to helm-controller or a general problem with Kubernetes client-go on EKS.

@pjbgf
Member

pjbgf commented May 10, 2022

Relates to fluxcd/flux2#2074

@albertschwarzkopf
Author

> I see that kustomize-controller was restarted recently. Please wait one hour and report back whether kustomize-controller runs into the same issue. I'm trying to figure out if this is specific to helm-controller or a general problem with Kubernetes client-go on EKS.

After 72 minutes no issue with kustomize-controller...

@stefanprodan stefanprodan added the bug Something isn't working label May 10, 2022
@stefanprodan
Member

I've created an EKS cluster:

$ kubectl version
Server Version: v1.22.6-eks-14c7a48

I've waited one hour:

$ kubectl -n flux-system get po
NAME                                       READY   STATUS    RESTARTS   AGE
helm-controller-88f6889c6-pwf7f            1/1     Running   0          73m
kustomize-controller-784bd54978-bckm6      1/1     Running   0          73m
notification-controller-648bbb9db7-58c2d   1/1     Running   0          73m
source-controller-79f7866bc7-k25z5         1/1     Running   0          73m

And there is no stale-token annotation on the pod:

$ kubectl -n flux-system get po helm-controller-88f6889c6-pwf7f -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    container.seccomp.security.alpha.kubernetes.io/manager: runtime/default
    kubernetes.io/psp: eks.privileged
    prometheus.io/port: "8080"
    prometheus.io/scrape: "true"
  creationTimestamp: "2022-05-10T10:08:59Z"
  generateName: helm-controller-88f6889c6-
  labels:
    app: helm-controller
    pod-template-hash: 88f6889c6
  name: helm-controller-88f6889c6-pwf7f
  namespace: flux-system

@albertschwarzkopf
Author

Yes, I can confirm this. Maybe it is visible only in the audit logs:

[screenshot: CloudWatch Logs Insights results showing the stale-token audit annotation]

@hiddeco
Member

hiddeco commented May 10, 2022

@albertschwarzkopf can you give the first image mentioned in #480 a try, and if that does not yield results, the second?

@albertschwarzkopf
Author

@hiddeco thanks! I have tried both images today. Only the image ghcr.io/hiddeco/helm-controller:head-412201a has worked as expected: with it, I can no longer see the mentioned annotation in the audit logs, even after 1 hour.

@hiddeco
Member

hiddeco commented May 11, 2022

Thanks for confirming. I'll finalize the PR in that case, and make sure it is included in the next release.

@Alan01252

Alan01252 commented May 12, 2022

Note: we even got an automated email about this from AWS!

As of April 20th 2022, we have identified the below service accounts attached to pods in one or more of your EKS clusters using stale (older than 1 hour) tokens. Service accounts are listed in the format <cluster ARN>|<namespace>:<serviceaccount>

arn:aws:eks:eu-west-2:***:cluster/prod-***|kube-system:multus
arn:aws:eks:eu-west-2:***:cluster/prod-***|flux-system:helm-controller

This also totally explains fluxcd/flux2#2074 (and the correlation between multus and helm-controller we saw).

@balonik

balonik commented May 12, 2022

Got the same message from AWS. Only the helm-controller SA was flagged. All controllers have been running for the same period of time.

NAME                                           READY   STATUS    RESTARTS      AGE
helm-controller-5676d55dff-7lgvn               1/1     Running   0             16d
image-automation-controller-6444ccb58c-8xcls   1/1     Running   0             16d
image-reflector-controller-f64677dd5-974qs     1/1     Running   0             16d
kustomize-controller-76f9d4f99f-htp8d          1/1     Running   0             16d
notification-controller-846fff6d67-h677q       1/1     Running   0             16d
source-controller-55d799ff7d-w598g             1/1     Running   0             16d

@luong-komorebi

luong-komorebi commented May 12, 2022

We got the notification message from AWS as well, but just for the helm-controller, even though all pods have been up and running for 85 days.

@valeriano-manassero

valeriano-manassero commented May 12, 2022

I can confirm the same problem here on EKS v1.22.6-eks-7d68063. Not sure if it's interesting or related, but after moving to EKS 1.22 the client authentication API changed from client.authentication.k8s.io/v1alpha1 to client.authentication.k8s.io/v1beta1.

@hiddeco
Member

hiddeco commented May 12, 2022

As already mentioned in #479 (comment), we have identified the issue and staged a patch; this will be solved in the next release.
