helm-controller Pod gets OOM-killed even with 1GB of RAM #349

Closed

zzvara opened this issue Nov 2, 2021 · 5 comments

Comments

zzvara commented Nov 2, 2021

Describe the bug

Title says it all. Here is the Pod definition:

apiVersion: v1
kind: Pod
metadata:
  name: helm-controller-f56848c5-gsd44
  generateName: helm-controller-f56848c5-
  namespace: flux-system
  uid: 5959073e-cf82-4d65-8925-9ece92fb366c
  resourceVersion: '408070363'
  creationTimestamp: '2021-11-02T11:08:39Z'
  labels:
    app: helm-controller
    pod-template-hash: f56848c5
  annotations:
    prometheus.io/port: '8080'
    prometheus.io/scrape: 'true'
  ownerReferences:
    - apiVersion: apps/v1
      kind: ReplicaSet
      name: helm-controller-f56848c5
      uid: e748e195-06e5-411d-acbf-005c180a47ed
      controller: true
      blockOwnerDeletion: true
  managedFields:
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2021-11-02T11:08:39Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            .: {}
            'f:prometheus.io/port': {}
            'f:prometheus.io/scrape': {}
          'f:generateName': {}
          'f:labels':
            .: {}
            'f:app': {}
            'f:pod-template-hash': {}
          'f:ownerReferences':
            .: {}
            'k:{"uid":"e748e195-06e5-411d-acbf-005c180a47ed"}':
              .: {}
              'f:apiVersion': {}
              'f:blockOwnerDeletion': {}
              'f:controller': {}
              'f:kind': {}
              'f:name': {}
              'f:uid': {}
        'f:spec':
          'f:containers':
            'k:{"name":"manager"}':
              .: {}
              'f:args': {}
              'f:env':
                .: {}
                'k:{"name":"RUNTIME_NAMESPACE"}':
                  .: {}
                  'f:name': {}
                  'f:valueFrom':
                    .: {}
                    'f:fieldRef':
                      .: {}
                      'f:apiVersion': {}
                      'f:fieldPath': {}
              'f:image': {}
              'f:imagePullPolicy': {}
              'f:livenessProbe':
                .: {}
                'f:failureThreshold': {}
                'f:httpGet':
                  .: {}
                  'f:path': {}
                  'f:port': {}
                  'f:scheme': {}
                'f:periodSeconds': {}
                'f:successThreshold': {}
                'f:timeoutSeconds': {}
              'f:name': {}
              'f:ports':
                .: {}
                'k:{"containerPort":8080,"protocol":"TCP"}':
                  .: {}
                  'f:containerPort': {}
                  'f:name': {}
                  'f:protocol': {}
                'k:{"containerPort":9440,"protocol":"TCP"}':
                  .: {}
                  'f:containerPort': {}
                  'f:name': {}
                  'f:protocol': {}
              'f:readinessProbe':
                .: {}
                'f:failureThreshold': {}
                'f:httpGet':
                  .: {}
                  'f:path': {}
                  'f:port': {}
                  'f:scheme': {}
                'f:periodSeconds': {}
                'f:successThreshold': {}
                'f:timeoutSeconds': {}
              'f:resources':
                .: {}
                'f:limits':
                  .: {}
                  'f:cpu': {}
                  'f:memory': {}
                'f:requests':
                  .: {}
                  'f:cpu': {}
                  'f:memory': {}
              'f:securityContext':
                .: {}
                'f:allowPrivilegeEscalation': {}
                'f:readOnlyRootFilesystem': {}
              'f:terminationMessagePath': {}
              'f:terminationMessagePolicy': {}
              'f:volumeMounts':
                .: {}
                'k:{"mountPath":"/tmp"}':
                  .: {}
                  'f:mountPath': {}
                  'f:name': {}
          'f:dnsPolicy': {}
          'f:enableServiceLinks': {}
          'f:imagePullSecrets':
            .: {}
            'k:{"name":"redacted"}':
              .: {}
              'f:name': {}
          'f:nodeSelector':
            .: {}
            'f:kubernetes.io/os': {}
          'f:restartPolicy': {}
          'f:schedulerName': {}
          'f:securityContext': {}
          'f:serviceAccount': {}
          'f:serviceAccountName': {}
          'f:terminationGracePeriodSeconds': {}
          'f:volumes':
            .: {}
            'k:{"name":"temp"}':
              .: {}
              'f:emptyDir': {}
              'f:name': {}
    - manager: kubelet
      operation: Update
      apiVersion: v1
      time: '2021-11-02T17:40:39Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:conditions':
            'k:{"type":"ContainersReady"}':
              .: {}
              'f:lastProbeTime': {}
              'f:lastTransitionTime': {}
              'f:status': {}
              'f:type': {}
            'k:{"type":"Initialized"}':
              .: {}
              'f:lastProbeTime': {}
              'f:lastTransitionTime': {}
              'f:status': {}
              'f:type': {}
            'k:{"type":"Ready"}':
              .: {}
              'f:lastProbeTime': {}
              'f:lastTransitionTime': {}
              'f:status': {}
              'f:type': {}
          'f:containerStatuses': {}
          'f:hostIP': {}
          'f:phase': {}
          'f:podIP': {}
          'f:podIPs':
            .: {}
            'k:{"ip":"10.233.79.245"}':
              .: {}
              'f:ip': {}
          'f:startTime': {}
  selfLink: /api/v1/namespaces/flux-system/pods/helm-controller-f56848c5-gsd44
status:
  phase: Running
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-11-02T11:08:39Z'
    - type: Ready
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-11-02T17:40:39Z'
    - type: ContainersReady
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-11-02T17:40:39Z'
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-11-02T11:08:39Z'
  hostIP: 10.1.44.10
  podIP: 10.233.79.245
  podIPs:
    - ip: 10.233.79.245
  startTime: '2021-11-02T11:08:39Z'
  containerStatuses:
    - name: manager
      state:
        running:
          startedAt: '2021-11-02T17:40:30Z'
      lastState:
        terminated:
          exitCode: 137
          reason: OOMKilled
          startedAt: '2021-11-02T11:08:40Z'
          finishedAt: '2021-11-02T17:40:29Z'
          containerID: >-
            docker://d9a012aaadf8fc05ab30bcb1e18eb071ddc648a6e036c8e45a599e7583438b57
      ready: true
      restartCount: 1
      image: 'ghcr.io/fluxcd/helm-controller:v0.12.1'
      imageID: >-
        docker-pullable://ghcr.io/fluxcd/helm-controller@sha256:74b0442a90350b1de9fb34e3180c326d1d7814caa14bf5501750a71a1782d10d
      containerID: >-
        docker://cfd5ac78013a3fb1d80ed4ddff1ae3eb217b8be0dd2a0eff6b37922106ea372e
      started: true
  qosClass: Burstable
spec:
  volumes:
    - name: temp
      emptyDir: {}
    - name: kube-api-access-wzqn6
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
        defaultMode: 420
  containers:
    - name: manager
      image: 'ghcr.io/fluxcd/helm-controller:v0.12.1'
      args:
        - '--events-addr=http://notification-controller/'
        - '--watch-all-namespaces=true'
        - '--log-level=debug'
        - '--log-encoding=json'
        - '--enable-leader-election'
      ports:
        - name: http-prom
          containerPort: 8080
          protocol: TCP
        - name: healthz
          containerPort: 9440
          protocol: TCP
      env:
        - name: RUNTIME_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
      resources:
        limits:
          cpu: '1'
          memory: 1Gi
        requests:
          cpu: 100m
          memory: 64Mi
      volumeMounts:
        - name: temp
          mountPath: /tmp
        - name: kube-api-access-wzqn6
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      livenessProbe:
        httpGet:
          path: /healthz
          port: healthz
          scheme: HTTP
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /readyz
          port: healthz
          scheme: HTTP
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
  restartPolicy: Always
  terminationGracePeriodSeconds: 600
  dnsPolicy: ClusterFirst
  nodeSelector:
    kubernetes.io/os: linux
  serviceAccountName: helm-controller
  serviceAccount: helm-controller
  nodeName: sigma01
  securityContext: {}
  imagePullSecrets:
    - name: redacted
  schedulerName: default-scheduler
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
  priority: 0
  enableServiceLinks: true
  preemptionPolicy: PreemptLowerPriority
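
For reference, the 1Gi memory limit shown above comes from the stock helm-controller Deployment and is usually adjusted with a kustomize patch in the flux-system Kustomization. A minimal sketch, assuming a bootstrap-style flux-system layout; the 2Gi value and the patch form are illustrative, not a recommendation:

# flux-system/kustomization.yaml (hypothetical bootstrap layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: helm-controller
    # Strategic merge patch that raises the manager container's memory limit
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: helm-controller
      spec:
        template:
          spec:
            containers:
              - name: manager
                resources:
                  limits:
                    cpu: '1'
                    memory: 2Gi   # illustrative bump over the 1Gi limit above
                  requests:
                    cpu: 100m
                    memory: 64Mi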

Steps to reproduce

Not sure how to reproduce. Probably dependent on cluster and repository size. Most of the resources (about 20-30) are set to 1-minute reconciliation.
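
For context, a 1-minute reconciliation as described above corresponds to a HelmRelease interval along these lines (names and chart are illustrative, not from this cluster):

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: example-app          # hypothetical release name
  namespace: flux-system
spec:
  interval: 1m               # reconcile every minute, as described above
  chart:
    spec:
      chart: example-chart   # hypothetical chart name
      sourceRef:
        kind: HelmRepository
        name: example-repo   # hypothetical HelmRepository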

Expected behavior

The helm-controller should run for months without being OOM-killed.

Screenshots and recordings

No response

OS / Distro

Flatcar 2905.2.6

Flux version

flux version 0.21.1

Flux check

► checking prerequisites
✗ flux 0.20.1 <0.21.0 (new version is available, please upgrade)
✔ Kubernetes 1.21.5 >=1.19.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.12.1
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.16.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.13.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.16.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.18.1
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.17.1
✔ all checks passed

Git provider

No response

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
hiddeco (Member) commented Nov 2, 2021

Possible duplicate of #345.

We are gathering details on this at present, as it looks like a recent change has introduced a serious increase in memory usage during operation. The controller itself has not seen any relevant changes besides dependency updates (Helm, K8s, kustomize, controller-runtime). If you happened to run an older Flux version before this that had a lower memory footprint (for some, 0.10.1 performed much better), it would be valuable for me to know that version.

Having looked a bit further into it just now, there are two changes that could be pointers:

If both of these versions appear to work fine, it will need a much deeper dive.

v0.11.2 seems to misbehave for people as well.

@barrydobson

I'm running the latest release, 0.25.2, and have assigned the helm-controller a limit of 2Gi, and it still gets OOM-killed. This is with around 25 HelmReleases on the cluster, reconciling every 5 minutes.

@CosminBriscaru

Running helm-controller 0.15.0 with around 20 HelmReleases and 5-minute checks, without resource limits, it reaches 3.5GB of memory and 1 CPU. We removed the limits because we were getting errors on the Helm side if the pod was restarted while upgrading.

@applike-ss

helm-controller v0.30.0 still seems to have this issue.

stefanprodan (Member) commented Oct 11, 2023

Upgrading to Flux 2.1 and configuring Helm index caching should fix this: https://fluxcd.io/flux/installation/configuration/vertical-scaling/#enable-helm-repositories-caching
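
For reference, the linked guide describes enabling Helm repository index caching by passing cache flags to the helm-controller container. A minimal sketch of such a patch in the flux-system Kustomization; flag names and values here follow the linked guide at the time of writing and should be checked against current documentation:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: helm-controller
    # JSON 6902 patch appending the Helm index cache flags to the manager args
    patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-max-size=10
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-ttl=60m
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-purge-interval=5m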
