
Kaniko gets OOMKilled while building vast images #1680

Closed
YevheniiSemendiak opened this issue Jun 23, 2021 · 12 comments · Fixed by #1722

Comments

@YevheniiSemendiak

Actual behavior
While building and pushing quite a large Docker image (say, 8+ GB), Kaniko's pod gets OOM-killed (resource limit is 10 GB RAM).
This behavior is strange, since no single build step uses that much RAM.

Expected behavior
Build success.

To Reproduce
Steps to reproduce the behavior:

  1. Create a Dockerfile that results in a relatively huge image (~8 GB):
FROM neuromation/neuro-extras:21.3.19
RUN wget https://raw.githubusercontent.com/neuro-inc/platform-client-python/master/build-tools/garbage-files-generator.py && \
    python3 garbage-files-generator.py 1 7Gb
  2. Execute the build in a K8s pod with a 10 GB RAM resource limit.
  3. Get an OOMKilled pod error at the final step:
INFO[0295] Taking snapshot of full filesystem...
DEBU[0498] build: composite key for command RUN wget https://raw.githubusercontent.com/neuro-inc/platform-client-python/master/build-tools/garbage-files-generator.py &&     python3 garbage-files-generator.py 1 7Gb &{[sha256:94bdc53de1c713cffb9ba645703aa370734e7fe08fda274fb3251650c31e41d1 RUN touch qwe RUN wget https://raw.githubusercontent.com/neuro-inc/platform-client-python/master/build-tools/garbage-files-generator.py &&     python3 garbage-files-generator.py 1 7Gb]}
DEBU[0498] build: cache key for command RUN wget https://raw.githubusercontent.com/neuro-inc/platform-client-python/master/build-tools/garbage-files-generator.py &&     python3 garbage-files-generator.py 1 7Gb b2a58dfda5dd029ae6a27bccaf6c45e6b98007cd5c9e698d3f86c0fb4461da26

Additional Information

  • Build Context is empty

I was testing it on different versions of Kaniko:

  • v1.6.0 - OOMKilled
  • v1.6.0 with --cache=false or default --snapshotMode - OOMKilled
  • v1.5.0 - OOMKilled
  • v1.5.0 with --cache=false or default --snapshotMode - OOMKilled
  • v1.3.0 - OK

Triage Notes for the Maintainers

Description Yes/No
Please check if this is a new feature you are proposing
Please check if the build works in docker but not in kaniko
Please check if this error is seen when you use --cache flag
Please check if your dockerfile is a multistage dockerfile
@massimeddu-sonic

I have the exact same issue. Using the 1.3 image works fine; 1.6 gets OOMKilled.

@Phylu
Contributor

Phylu commented Jul 8, 2021

This also seems to be related to #909.

After creating a huge layer, Kaniko is killed when trying to create the layer snapshot. When using the --single-snapshot option, the build runs until the end and is only killed when the single snapshot is created. So it is happening during snapshot creation.

When I monitor the Kaniko pod during its execution using watch kubectl top pod, the memory shown before the pod is killed is much lower than the memory requested for the pod (~100-300 MB usage, 2 GB requested, 3 GB limit).

@rgembalik

I had a similar problem. In my case, the image is built with a large set of files being copied into it. I tried

  • 1.6.0 with different snapshot modes, cache, no cache, copying files from tar.gz, copying unpacked files - everything gives OOM
  • 1.3.0, copying from tar.gz - OOM, but during the copy step itself; everything else doesn't matter
  • 1.3.0 directly from unpacked files - Works!

So if any of you still have problems with 1.3.0, check whether you have large files being copied/unpacked and whether you can reduce their individual size (e.g. by unpacking the tar.gz before the build). This workaround doesn't help on 1.6.0, though.

@tk42

tk42 commented Jul 11, 2021

My problem is similar. Building with the latest image fails, but with v1.3.0 it succeeded.
So I kept downgrading to see where the breakage was.
I used the images here and kept going back.

In my case, the build debug-b04399e with commit b04399e succeeds, but the next build, debug-1ad4295 with commit 1ad4295 (#1527), fails.
That is, #1527 might be the culprit. If so, I'm wondering how much memory is consumed by io.Copy in tarball/layer.go#WithCompressedCaching.
But a similar issue, #909, was reported before #1527 was merged, so another problem might still remain.
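
For illustration, here is a minimal standalone sketch (not kaniko's actual code; the layer path is made up) of why an io.Copy into an in-memory buffer grows with the layer size:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"os"
)

// compressIntoMemory gzips a file into an in-memory buffer. io.Copy keeps
// the whole compressed stream in that buffer, so a multi-GB layer tarball
// needs a comparable amount of heap.
func compressIntoMemory(path string) (*bytes.Buffer, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := io.Copy(zw, f); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}

func main() {
	buf, err := compressIntoMemory("/tmp/layer.tar") // hypothetical layer tarball
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("compressed layer held in memory: %d bytes\n", buf.Len())
}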

@dan-cohn-sabre

We are experiencing the same issue. We're building a fairly large image, and I can see the kaniko container using 24 GB of memory. With a limit of 20 GB, it still gets OOM-killed some of the time.

We're on 1.5.0.

Does anyone know of a workaround besides rolling back to 1.3.0?

@vediatoni

vediatoni commented Jul 23, 2021

Happens to me too.
I create several (QoS Guaranteed) Kaniko pods at once, and most of them crash with OOMKilled at: INFO[0057] Taking a snapshot of full filesystem... This only happens on the :latest version of Kaniko; it works fine on 1.3.0, but building is painfully slow on 1.3.0.

Additional args I used:

        - '--cache=false'
        - '--snapshotMode=redo'

Resources:

      resources:
        limits:
          memory: 1Gi
          cpu: 700m
        requests:
          memory: 1Gi
          cpu: 700m

@Phylu
Contributor

Phylu commented Aug 10, 2021

That is, #1527 might be the culprit. If so, I'm wondering how much memory is consumed by io.Copy in tarball/layer.go#WithCompressedCaching.

I have just been digging into the code of this MR a bit. The WithCompressedCaching [1] option seems to trigger memoization of the function call, thus storing the gzipped layer in memory. This is probably a performance improvement when enough memory is available to hold the layer. However, it will use all available memory if the layer is too big, leading to the kaniko pod being killed.

An easy fix would probably be to remove this option, which would lead to a performance decrease for all those images that currently build correctly (because they are small enough).

[1] https://pkg.go.dev/github.com/google/go-containerregistry/pkg/v1/tarball#WithCompressedCaching
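
For reference, a minimal standalone sketch of how that option is passed to go-containerregistry; the opener and layer path are made up for illustration, this is not kaniko's actual call site:

package main

import (
	"fmt"
	"io"
	"log"
	"os"

	"github.com/google/go-containerregistry/pkg/v1/tarball"
)

func main() {
	// Hypothetical uncompressed layer tarball, for illustration only.
	opener := func() (io.ReadCloser, error) {
		return os.Open("/tmp/layer.tar")
	}

	// With WithCompressedCaching, the gzipped layer bytes are memoized in
	// memory on first use, so a multi-GB layer can exceed the pod's memory
	// limit on its own.
	cached, err := tarball.LayerFromOpener(opener, tarball.WithCompressedCaching)
	if err != nil {
		log.Fatal(err)
	}

	// Without the option, the layer is re-read and re-compressed on demand,
	// trading CPU time for a much smaller memory footprint.
	streamed, err := tarball.LayerFromOpener(opener)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(cached != nil, streamed != nil)
}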

@tk42

tk42 commented Aug 12, 2021

@Phylu Thank you for your digging and reporting!
I was investigating OOM and GC behavior. In general, if a program uses more than 50% of the limit and then allocates new memory that exceeds the limit within 2 minutes, GC will not run automatically and an OOM will occur.
To deal with this, call runtime.GC() (perform GC manually) when memory usage reaches 50% or more. However, running GC manually is expected to affect performance. Therefore, it is better to monitor memory every few seconds with koron-go/phymem or similar and trigger GC manually depending on memory usage, taking into account the time elapsed since the last GC.

Ref. https://zenn.dev/koron/articles/b96cccfa82c0c1 (Japanese)
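
A rough sketch of that idea, using runtime.ReadMemStats instead of koron-go/phymem (so the threshold is Go heap usage rather than the cgroup limit); the numbers are placeholders:

package main

import (
	"runtime"
	"time"
)

// watchMemory polls the heap every few seconds and forces a collection once
// usage crosses the given threshold (e.g. 50% of the container limit), while
// keeping a minimum gap between forced collections to limit the performance
// impact.
func watchMemory(threshold uint64, minGap time.Duration, stop <-chan struct{}) {
	var lastGC time.Time
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			var m runtime.MemStats
			runtime.ReadMemStats(&m)
			if m.HeapAlloc > threshold && time.Since(lastGC) > minGap {
				runtime.GC()
				lastGC = time.Now()
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	// Placeholder: assume a 10 GiB limit, so force GC above ~5 GiB, at most once a minute.
	go watchMemory(5<<30, time.Minute, stop)
	// ... the actual build work would run here ...
	close(stop)
}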

@Phylu
Contributor

Phylu commented Aug 13, 2021

I just made an example build on the latest 1.6.0 release where I removed the WithCompressedCaching flag. I am happy to provide a Pull Request. However, this will – most likely – result in degraded performance for those builds that have been working before. Some ideas that I have:

  • Create a new flag that determines whether the optimisation should be used (I can try to do this myself).
  • Some logic to check how big the layer is (or something similar) to decide whether the flag should be used (I don't feel comfortable coding this).

Perhaps, @dlorenc @priyawadhwa @sharifelgamal can give some guidance here. :)

Phylu added a commit to Phylu/kaniko that referenced this issue Aug 13, 2021
Fixes GoogleContainerTools#1680
Large images cannot be built as the kaniko container will be killed due to an OOM error
Removing the tarball compression drastically reduces the memory required to push large image layers
Phylu added a commit to Phylu/kaniko that referenced this issue Aug 13, 2021
Large images cannot be built as the kaniko container will be killed due to an OOM error. Removing the tarball compression drastically reduces the memory required to push large image layers. Fixes GoogleContainerTools#1680

This change may increase the build time for smaller images. Therefore a command line option to trigger the compression or a more intelligent behaviour may be useful.
@Phylu
Contributor

Phylu commented Sep 17, 2021

I just improved my PR by adding a command line flag. Now it is possible to set --compressed-caching=false to disable the compression, which lets the build work even with those large images.

I tested this with my own breaking image: it reliably works with the flag set and breaks otherwise. The default behaviour still uses compression.
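
For illustration, a minimal sketch of how such a flag could gate the layer option; the function name and wiring here are made up and may differ from what the PR actually does:

package main

import (
	"fmt"

	"github.com/google/go-containerregistry/pkg/v1/tarball"
)

// layerOptions translates a --compressed-caching boolean into the
// corresponding go-containerregistry layer options.
func layerOptions(compressedCaching bool) []tarball.LayerOption {
	if compressedCaching {
		// Default: memoize the gzipped layer in memory for speed.
		return []tarball.LayerOption{tarball.WithCompressedCaching}
	}
	// --compressed-caching=false: recompress on demand to save memory.
	return nil
}

func main() {
	fmt.Println(len(layerOptions(true)), len(layerOptions(false)))
}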

@MatanAmoyal

Hey guys,
Any plan to fix this bug?
We can't build some big images with kaniko 1.5.0/1.6.0 that work with docker build and kaniko 1.3.0.

tejal29 pushed a commit to Phylu/kaniko that referenced this issue Oct 19, 2021
Large images cannot be built as the kaniko container will be killed due to an OOM error. Removing the tarball compression drastically reduces the memory required to push large image layers. Fixes GoogleContainerTools#1680

This change may increase the build time for smaller images. Therefore a command line option to trigger the compression or a more intelligent behaviour may be useful.
tejal29 pushed a commit that referenced this issue Oct 19, 2021
…#1722)

* Remove tarball.WithCompressedCaching flag to resolve OOM Killed error

Large images cannot be built as the kaniko container will be killed due to an OOM error. Removing the tarball compression drastically reduces the memory required to push large image layers. Fixes #1680

This change may increase the build time for smaller images. Therefore a command line option to trigger the compression or a more intelligent behaviour may be useful.

* Add new command line flag to toggle compressed caching

* Add unittest for build with --compressed-caching command line flag set to false
Phylu mentioned this issue Jan 5, 2022
@0x217

0x217 commented Nov 15, 2023

I am building vast images and I had the same issue.
Disabling compressed caching (--compressed-caching=false) helps me, but when I try to build without cache and keep the default compressed caching (using only --cache=false), the OOMKill happens again.
Does that make sense? :)

My Kaniko version is v1.15.0.
