
High Memory Usage after helm-controller v0.12.0 upgrade #345

Closed · Legion2 opened this issue Oct 24, 2021 · 39 comments
Labels: area/helm (Helm related issues and pull requests), bug (Something isn't working)

Comments

Legion2 commented Oct 24, 2021

I updated to helm-controller v0.12.1 and started using ReconcileStrategy Revision for all my local Helm charts. Now the helm-controller is restarted each time I push a commit to the GitRepository source, because it uses too much memory and is killed by Kubernetes (OOMKilled). As a result of the controller being killed, some Helm releases are stuck in the upgrade process and must be manually rolled back (helm/helm#8987).

hiddeco (Member) commented Oct 24, 2021

We have had one other report mentioning this, but they were upgrading from a much older version, making it harder to trace the issue back to a specific commit.

What was the version you were using before the upgrade?

hiddeco (Member) commented Oct 24, 2021

There is one difference, however, and that is the change in strategy.

Can you try increasing the interval at which the charts are reconciled and, in addition, maybe increase the memory limit? Chart reconciliation from certain sources can be a more memory-expensive task, as the chart needs to be loaded into memory for certain actions.

We are working to reduce this consumption, but storing a chart using the Helm SDK, for example, is done by loading the chart in full, which can't be avoided for charts from unpackaged sources.
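
A minimal sketch of the limit bump via a kustomize patch in the flux-system `kustomization.yaml` (this assumes the default `helm-controller` Deployment and `manager` container names created by `flux bootstrap`; the 2Gi figure is only an illustration, not a recommendation):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # Strategic merge patch raising the helm-controller memory limit (example value only).
  - target:
      kind: Deployment
      name: helm-controller
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: helm-controller
        namespace: flux-system
      spec:
        template:
          spec:
            containers:
              - name: manager
                resources:
                  limits:
                    memory: 2Gi
```

The chart reconcile interval itself is set per release under `spec.chart.spec.interval` of the HelmRelease, as in the example further down.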

Legion2 (Author) commented Oct 24, 2021

I updated from v0.11.2

Legion2 (Author) commented Oct 24, 2021

The current interval is 10m and the limit is 4Gi. If I push a commit, the memory usage increases by 1.5Gi. I have 26 HelmReleases and 11 of them use ReconcileStrategy Revision.

hiddeco (Member) commented Oct 24, 2021

Can you tell me more about the size of the repositories they originate from, and what chart related features you are making use of (valuesFrom, etc.)?

Legion2 (Author) commented Oct 24, 2021

All charts that use ReconcileStrategy Revision come from the flux-system Git repository, which is 1.8 MB in size. The charts have local and remote (transitive) dependencies, which are committed into the Git repository as .tgz files in the charts/ subdirectory (I don't know if this is still needed with the current helm-controller version). The HelmReleases only use interval, chart, releaseName, install, upgrade and values in the spec. All intervals are set to 10m.
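
For reference, a HelmRelease along those lines would look roughly like the sketch below (release name, chart path, and values are hypothetical; the layout follows the helm.toolkit.fluxcd.io/v2beta1 API current at the time):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-app                        # hypothetical release name
  namespace: flux-system
spec:
  interval: 10m
  releaseName: my-app
  chart:
    spec:
      chart: ./charts/my-app          # hypothetical chart directory inside the Git repository
      interval: 10m
      reconcileStrategy: Revision     # rebuild the chart artifact on every new Git revision
      sourceRef:
        kind: GitRepository
        name: flux-system
  values:
    replicaCount: 2                   # example values only
```

Switching back to the default behaviour, as done in a later comment, just means setting `reconcileStrategy: ChartVersion` (or omitting the field).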

Legion2 (Author) commented Oct 27, 2021

I have now had to change all ReconcileStrategy values to ChartVersion, because the high memory utilization made our controller nodes unresponsive and the autoscaler had to replace them.

tehlers320 commented Nov 2, 2021

I'm not sure this is just the Revision strategy. We do not use it anywhere and have seen a huge jump in memory use. I grabbed a pprof heap map.

v0.11.2...v0.12.1

hiddeco (Member) commented Nov 2, 2021

@tehlers320 can you provide more information about the version you upgraded from?

tehlers320 commented:

@hiddeco we jumped from 0.10.2 and do not use the new Revision reconcile strategy. It's been running for 5 days (when it can; it keeps getting OOMKilled). Sorry, I don't mean to walk over this issue, but I think it's perhaps not related to the reconcileStrategy, since we don't use Revision.

Should I open a new issue, or do you think this covers both?

hiddeco (Member) commented Nov 2, 2021

No, I think your observations are correct based on other reports on Slack.

I did a quick dive into it with the limited time I had available, but the helm-controller didn't really change much besides Helm, kustomize and controller-runtime updates. It would be useful if someone could pinpoint the change in resource behavior to an exact helm-controller version, which would help identify the issue.

I am at present working on Helm improvements for the source-controller in the area of Helm repository index, dependency, and chart build memory consumption. Once that's done, I have time (and am planning) to look in much greater detail at the current shape of the helm-controller (as part of https://github.com/fluxcd/helm-controller/milestone/1).

glen-uc commented Nov 3, 2021

We are facing the same issue, where the memory of the helm-controller keeps growing.

We upgraded from 0.16.2 (no issue with the helm-controller in this version) > 0.19.1 (started seeing issues) > 0.20.1.

We have around 258 HelmReleases and 8 Helm repositories in our cluster, with the interval set to 24h.

We have set 4GB as the memory limit for our helm-controller.

Here is the output of `curl -Sk -v http://<helm_controller>:8080/debug/pprof/heap`:

[pprof heap profile image]

heap.zip

stefanprodan (Member) commented Nov 3, 2021

I've spent the day digging around to find the root cause of the sudden increase in memory usage. Here is what I've found:

We can't do much in Flux, we have to wait for that PR to get merged, then wait for a Kubernetes release, then wait for a Helm release that uses the latest Kubernetes release and finally update Helm in Flux to fix the OOM issues.

I propose we revert Helm to v3.6.3 for a couple of months until the kube-openapi fixes end up in Helm.

stefanprodan changed the title from "High Memory Usage with ReconcileStrategy Revision" to "High Memory Usage after helm-controller v0.12.1 upgrade" on Nov 3, 2021
stefanprodan (Member) commented Nov 3, 2021

We've pushed a release candidate for #352; here is the image: ghcr.io/fluxcd/helm-controller:rc-4fe7a7c8

Please take it for a spin and let us know if it fixes the issue.
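
One way to switch only the helm-controller to the candidate image is a kustomize image override in the flux-system `kustomization.yaml`; a sketch, assuming the default `ghcr.io/fluxcd/helm-controller` image name from `gotk-components.yaml`:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
images:
  # Override only the helm-controller image; the other controllers keep their released tags.
  - name: ghcr.io/fluxcd/helm-controller
    newTag: rc-4fe7a7c8
```

The same override with a released tag can later be used to pin the controller should a regression reappear.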

glen-uc commented Nov 4, 2021

We have been using the helm-controller with the ghcr.io/fluxcd/helm-controller:rc-4fe7a7c8 image for a couple of hours and are already seeing a large reduction in memory footprint compared to ghcr.io/fluxcd/helm-controller:v0.12.1.

After a couple of hours in our staging cluster, ghcr.io/fluxcd/helm-controller:v0.12.1 used to consume around 4GB+ and crash (as we have set the limit to 4GB), while ghcr.io/fluxcd/helm-controller:rc-4fe7a7c8 is barely using 200MB.

We will monitor the controller for a few more days.

flux: v0.21.0
helm-controller: rc-4fe7a7c8
image-automation-controller: v0.16.0
image-reflector-controller: v0.13.0
kustomize-controller: v0.16.0
notification-controller: v0.18.1
source-controller: v0.17.1

stefanprodan (Member) commented:
@glen-uc could you help test another variant? I would like to see if we could keep Helm at the latest version and only replace kube-openapi with go-openapi.

glen-uc commented Nov 4, 2021

@stefanprodan Sure, if you give me the image I can test it out in our staging cluster.

stefanprodan (Member) commented:
Please give ghcr.io/fluxcd/helm-controller:rc-725fd784 a spin; this is built from #354.

Thank you!

glen-uc commented Nov 4, 2021

@stefanprodan I have deployed helm controller with the image provided, will monitor it for a couple of hours and look for restarts due to OOM Kill

flux: v0.21.0
helm-controller: rc-725fd784
image-automation-controller: v0.16.0
image-reflector-controller: v0.13.0
kustomize-controller: v0.16.0
notification-controller: v0.18.1
source-controller: v0.17.1

stefanprodan (Member) commented:
@glen-uc thank you very much for helping us 🤗

glen-uc commented Nov 5, 2021

@stefanprodan I'm glad I could help.

Our staging cluster has been using the helm-controller with ghcr.io/fluxcd/helm-controller:rc-725fd784 for a couple of hours now.

The memory footprint is very minimal (< 100 MB) and we haven't observed any restarts due to OOM kills.

We will monitor the controller for a few more days.

flux: v0.21.0
helm-controller: rc-725fd784
image-automation-controller: v0.16.0
image-reflector-controller: v0.13.0
kustomize-controller: v0.16.0
notification-controller: v0.18.1
source-controller: v0.17.1

hiddeco (Member) commented Nov 5, 2021

Awesome, and thanks a lot for helping out. This seems to indicate that we can at least temporarily work around the upstream problems by forcing the replacement of that specific Helm dependency, without having to stop receiving new Helm updates.

Legion2 (Author) commented Nov 7, 2021

I also tested ghcr.io/fluxcd/helm-controller:rc-725fd784 and experienced an OOM kill after 2 days at 2GiB usage. It looks like the memory usage of ghcr.io/fluxcd/helm-controller:rc-725fd784 is similar to the current release version.

hiddeco (Member) commented Nov 7, 2021

@Legion2 would you be able to grab a pprof profile, as done for the graph in one of the comments above?

You can send it to me via DM on CNCF Slack (@hidde) or mail (hidde <at> weave.works).

hiddeco (Member) commented Nov 8, 2021

Received a heap profile from @Legion2, which generates the following map:

[heap profile map]

What version was this for, Leon? This information was not included in your mail.

stefanprodan (Member) commented:
@Legion2 can you please run flux version against the cluster where the above profile was taken and post it here?

Legion2 (Author) commented Nov 8, 2021

Here is the output of flux version:

flux: v0.21.0
helm-controller: rc-725fd784
image-automation-controller: v0.16.0
image-reflector-controller: v0.13.0
kustomize-controller: v0.16.0
notification-controller: v0.18.1
source-controller: v0.17.1

stefanprodan (Member) commented:
@Legion2 can you please try rc-4fe7a7c8?

Legion2 (Author) commented Nov 8, 2021

> @Legion2 can you please try rc-4fe7a7c8?

Yes.

hiddeco pinned this issue on Nov 8, 2021
hiddeco changed the title from "High Memory Usage after helm-controller v0.12.1 upgrade" to "High Memory Usage after helm-controller v0.12.0 upgrade" on Nov 8, 2021
hiddeco added the area/helm (Helm related issues and pull requests) and bug (Something isn't working) labels on Nov 8, 2021
glen-uc commented Nov 11, 2021

An update from my end: like @Legion2, with rc-725fd784 I started seeing restarts due to OOM kills after 1 day (with the memory limit set to 1GB), so I promptly reverted the helm-controller to rc-4fe7a7c8, and it has been running for 3 days without any restarts.

Unfortunately, I don't have a memory profile of the helm-controller from when it was running on rc-725fd784.

I will monitor rc-4fe7a7c8 for a few more days.

flux: v0.21.0
helm-controller: rc-4fe7a7c8
image-automation-controller: v0.16.0
image-reflector-controller: v0.13.0
kustomize-controller: v0.16.0
notification-controller: v0.18.1
source-controller: v0.17.1

hiddeco (Member) commented Nov 11, 2021

@stefanprodan I think we should deem rc-4fe7a7c8 the successor for now.

hiddeco (Member) commented Nov 11, 2021

Thanks all for testing. This should now also be solved by updating the helm-controller Deployment image to v0.12.2.

CLI release for flux bootstrap, etc. will arrive later today.

hiddeco unpinned this issue on Nov 11, 2021
poblish commented Nov 12, 2021

We're seeing a 75% drop in helm-controller memory (peaks and base level) since picking up this version 👍

hiddeco (Member) commented Dec 9, 2021

Helm released a patch yesterday which likely addresses this issue.

However, due to the holiday period arriving pretty soon, I am hesitant to release this, as I will be on leave for 3 weeks, unless someone has specific needs for the v3.7.x release range, in which case I can provide an RC.

hiddeco added a commit that referenced this issue Dec 10, 2021
This commit updates Helm to 3.7.2, in an attempt to get to a v3.7.x
release range _without_ any memory issues (see #345), which should have
been addressed in this release.

The change in replacements has been cross-checked with the dependencies
of Helm (and more specifically, the Oras project), and confirmed to not
trigger any warnings using `trivy`.

Signed-off-by: Hidde Beydals <hello@hidde.co>
hiddeco (Member) commented Dec 10, 2021

Based on the PR above, an RC with Helm 3.7.2 is available as docker.io/fluxcd/helm-controller:rc-helm-3.7.2--a12eae8e.

stefanprodan (Member) commented:
We are going ahead with Helm v3.7.2, as v3.6.3 blocks us from fixing the containerd CVEs due to ORAS breaking changes. In case v3.7.2 brings back the memory leak, please pin helm-controller to v0.12.2.

stefanprodan pushed a commit that referenced this issue on Jan 7, 2022 (same commit message as above)
Legion2 (Author) commented Jan 18, 2022

I updated Flux to the latest version and can confirm that the memory leak is still present, so I will pin helm-controller to v0.14.1.

poblish commented Feb 4, 2022

Is it still sensible for us to try to pin the helm-controller version to avoid this? We're still using Flux v0.24.1 / helm-controller v0.14.1, but don't want to get too far behind, now that 0.26.x is released.

hiddeco (Member) commented Feb 4, 2022

This issue has been confirmed to be solved in the latest release (https://github.com/fluxcd/helm-controller/releases/tag/v0.16.0) via #409.
