
Kaniko gets OOMKilled while building vast images #1680

Closed
YevheniiSemendiak opened this issue Jun 23, 2021 · 12 comments · Fixed by #1722

Comments

@YevheniiSemendiak

Actual behavior
While building and pushing quite a large Docker image (say, 8+ GB), Kaniko's pod gets OOM-killed (resource limit is 10 GB RAM).
This behavior is strange, since no single build step uses that much RAM.

Expected behavior
Build success.

To Reproduce
Steps to reproduce the behavior:

  1. Create a Dockerfile that results in a relatively huge image (~8 GB):
FROM neuromation/neuro-extras:21.3.19
RUN wget https://raw.githubusercontent.com/neuro-inc/platform-client-python/master/build-tools/garbage-files-generator.py && \
    python3 garbage-files-generator.py 1 7Gb
  2. Execute the build in a K8s pod with a 10 GB RAM resource limit.
  3. Get an OOMKilled pod error at the final step:
INFO[0295] Taking snapshot of full filesystem...
DEBU[0498] build: composite key for command RUN wget https://raw.githubusercontent.com/neuro-inc/platform-client-python/master/build-tools/garbage-files-generator.py &&     python3 garbage-files-generator.py 1 7Gb &{[sha256:94bdc53de1c713cffb9ba645703aa370734e7fe08fda274fb3251650c31e41d1 RUN touch qwe RUN wget https://raw.githubusercontent.com/neuro-inc/platform-client-python/master/build-tools/garbage-files-generator.py &&     python3 garbage-files-generator.py 1 7Gb]}
DEBU[0498] build: cache key for command RUN wget https://raw.githubusercontent.com/neuro-inc/platform-client-python/master/build-tools/garbage-files-generator.py &&     python3 garbage-files-generator.py 1 7Gb b2a58dfda5dd029ae6a27bccaf6c45e6b98007cd5c9e698d3f86c0fb4461da26

Additional Information

  • Build Context is empty

I was testing it on different versions of Kaniko:

  • v1.6.0 - OOMKilled
  • v1.6.0 with --cache=false or default --snapshotMode - OOMKilled
  • v1.5.0 - OOMKilled
  • v1.5.0 with --cache=false or default --snapshotMode - OOMKilled
  • v1.3.0 - OK

Triage Notes for the Maintainers

Description Yes/No
Please check if this is a new feature you are proposing
Please check if the build works in docker but not in kaniko
Please check if this error is seen when you use --cache flag
Please check if your dockerfile is a multistage dockerfile
@massimeddu-sonic

I have the exact same issue. Using the 1.3 image works fine; 1.6 gets OOMKilled.

@Phylu
Contributor

Phylu commented Jul 8, 2021

This also seems to be related to #909.

After creating a huge layer, Kaniko is killed when trying to create the layer snapshot. When using the --single-snapshot option, the build runs until the end and is only killed when the single snapshot is created. So it is happening during snapshot creation.

When I monitor the Kaniko pod during its execution using watch kubectl top pod, the memory shown before the pod is killed is much lower than the memory requested for the pod (~100-300 MB usage, 2 GB requested, 3 GB limit).

@rgembalik

I had a similar problem. In my case, the image is built with a large set of files being copied into it. I tried

  • 1.6.0 with different snapshot modes, cache, no cache, copying files from tar.gz, copying unpacked files - everything gives OOM
  • 1.3.0, copying from tar.gz - OOM, but during the copy step itself; everything else doesn't matter
  • 1.3.0 directly from unpacked files - Works!

So if any of you still have problems with 1.3.0, check whether you have large files being copied/unpacked and whether you can reduce their individual size (e.g. by unpacking the tar.gz before the build). This workaround doesn't help on 1.6.0, though.

@tk42

tk42 commented Jul 11, 2021

My problem is similar. Building with the latest image fails, but with v1.3.0 it succeeded.
So I kept downgrading to see where the breakage was.
I used the images here and kept going back.

In my case, the build debug-b04399e with commit b04399e succeeds, but the next build, debug-1ad4295 with commit 1ad4295 (#1527), fails.
That is, #1527 might be the culprit. If so, I'm wondering how much memory is consumed by io.Copy in tarball/layer.go#WithCompressedCaching.
But a similar issue, #909, was reported before #1527 was merged, so another problem might still remain.
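
For illustration, here is a minimal standalone sketch (not kaniko's actual code; the layer path is made up) of why an io.Copy into an in-memory buffer grows with the layer size:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"os"
)

// compressIntoMemory gzips a file into an in-memory buffer. io.Copy keeps
// the whole compressed stream in that buffer, so a multi-GB layer tarball
// needs a comparable amount of heap.
func compressIntoMemory(path string) (*bytes.Buffer, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := io.Copy(zw, f); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}

func main() {
	buf, err := compressIntoMemory("/tmp/layer.tar") // hypothetical layer tarball
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("compressed layer held in memory: %d bytes\n", buf.Len())
}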

@dan-cohn-sabre

We are experiencing the same issue. We're building a fairly large image, and I can see the kaniko container using 24 GB of memory. With a limit of 20 GB, it still gets OOM-killed some of the time.

We're on 1.5.0.

Does anyone know of a workaround besides rolling back to 1.3.0?

@vediatoni

vediatoni commented Jul 23, 2021

Happens to me too.
I create several (QoS Guaranteed) Kaniko pods at once, and most of them crash with OOMKilled at: INFO[0057] Taking a snapshot of full filesystem... This only happens on the :latest version of Kaniko; it works fine on 1.3.0, but building is painfully slow on 1.3.0.

Additional args I used:

        - '--cache=false'
        - '--snapshotMode=redo'

Resources:

      resources:
        limits:
          memory: 1Gi
          cpu: 700m
        requests:
          memory: 1Gi
          cpu: 700m

@Phylu
Contributor

Phylu commented Aug 10, 2021

That is, #1527 might be the culprit. If so, I'm wondering how much memory is consumed by io.Copy in tarball/layer.go#WithCompressedCaching.

I have just been digging into the code of this MR a bit. The WithCompressedCaching [1] option seems to trigger memoization of the function call, thus storing the gzipped layer in memory. This is probably a performance improvement when enough memory is available to hold the layer. However, it will use all available memory if the layer is too big, leading to the kaniko pod being killed.

An easy fix would probably be to remove this option, which would lead to a performance decrease for all those images that currently build correctly (because they are small enough).

[1] https://pkg.go.dev/github.com/google/go-containerregistry/pkg/v1/tarball#WithCompressedCaching
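
For reference, a minimal standalone sketch of how that option is passed to go-containerregistry; the opener and layer path are made up for illustration, this is not kaniko's actual call site:

package main

import (
	"fmt"
	"io"
	"log"
	"os"

	"github.com/google/go-containerregistry/pkg/v1/tarball"
)

func main() {
	// Hypothetical uncompressed layer tarball, for illustration only.
	opener := func() (io.ReadCloser, error) {
		return os.Open("/tmp/layer.tar")
	}

	// With WithCompressedCaching, the gzipped layer bytes are memoized in
	// memory on first use, so a multi-GB layer can exceed the pod's memory
	// limit on its own.
	cached, err := tarball.LayerFromOpener(opener, tarball.WithCompressedCaching)
	if err != nil {
		log.Fatal(err)
	}

	// Without the option, the layer is re-read and re-compressed on demand,
	// trading CPU time for a much smaller memory footprint.
	streamed, err := tarball.LayerFromOpener(opener)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(cached != nil, streamed != nil)
}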

@tk42

tk42 commented Aug 12, 2021

@Phylu Thank you for your digging and reporting!
I was investigating OOM and GC behavior. In general, if a program uses more than 50% of the limit and then allocates new memory that exceeds the limit within 2 minutes, GC will not run automatically and an OOM will occur.
To deal with this, call runtime.GC() (perform GC manually) when memory usage reaches 50% or more. However, running GC manually is expected to affect performance. Therefore, it is better to monitor memory every few seconds with koron-go/phymem or similar and trigger GC manually depending on memory usage, taking into account the time elapsed since the last GC.

Ref. https://zenn.dev/koron/articles/b96cccfa82c0c1 (Japanese)
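
A rough sketch of that idea, using runtime.ReadMemStats instead of koron-go/phymem (so the threshold is Go heap usage rather than the cgroup limit); the numbers are placeholders:

package main

import (
	"runtime"
	"time"
)

// watchMemory polls the heap every few seconds and forces a collection once
// usage crosses the given threshold (e.g. 50% of the container limit), while
// keeping a minimum gap between forced collections to limit the performance
// impact.
func watchMemory(threshold uint64, minGap time.Duration, stop <-chan struct{}) {
	var lastGC time.Time
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			var m runtime.MemStats
			runtime.ReadMemStats(&m)
			if m.HeapAlloc > threshold && time.Since(lastGC) > minGap {
				runtime.GC()
				lastGC = time.Now()
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	// Placeholder: assume a 10 GiB limit, so force GC above ~5 GiB, at most once a minute.
	go watchMemory(5<<30, time.Minute, stop)
	// ... the actual build work would run here ...
	close(stop)
}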

@Phylu
Contributor

Phylu commented Aug 13, 2021

I just made an example build on the latest 1.6.0 release where I removed the WithCompressedCaching flag. I am happy to provide a Pull Request. However, this will – most likely – result in degraded performance for those builds that have been working before. Some ideas that I have:

  • Create a new flag that determines whether the optimisation should be used (I can try to do this myself).
  • Some logic to check how big the layer is (or something similar) to decide whether the flag should be used (I don't feel comfortable coding this).

Perhaps, @dlorenc @priyawadhwa @sharifelgamal can give some guidance here. :)

Phylu added a commit to Phylu/kaniko that referenced this issue Aug 13, 2021
Fixes GoogleContainerTools#1680
Large images cannot be built as the kaniko container will be killed due to an OOM error
Removing the tarball compression drastically reduces the memory required to push large image layers
Phylu added a commit to Phylu/kaniko that referenced this issue Aug 13, 2021
Large images cannot be built as the kaniko container will be killed due to an OOM error. Removing the tarball compression drastically reduces the memory required to push large image layers. Fixes GoogleContainerTools#1680

This change may increase the build time for smaller images. Therefore a command line option to trigger the compression or a more intelligent behaviour may be useful.
@Phylu
Contributor

Phylu commented Sep 17, 2021

I just improved my PR by adding a command line flag. Now it is possible to set --compressed-caching=false to disable the compression, which lets the build work even with those large images.

I tested this with my own breaking image: it reliably works with the flag set and breaks otherwise. The default behaviour still uses compression.
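
For illustration, a minimal sketch of how such a flag could gate the layer option; the function name and wiring here are made up and may differ from what the PR actually does:

package main

import (
	"fmt"

	"github.com/google/go-containerregistry/pkg/v1/tarball"
)

// layerOptions translates a --compressed-caching boolean into the
// corresponding go-containerregistry layer options.
func layerOptions(compressedCaching bool) []tarball.LayerOption {
	if compressedCaching {
		// Default: memoize the gzipped layer in memory for speed.
		return []tarball.LayerOption{tarball.WithCompressedCaching}
	}
	// --compressed-caching=false: recompress on demand to save memory.
	return nil
}

func main() {
	fmt.Println(len(layerOptions(true)), len(layerOptions(false)))
}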

@MatanAmoyal

Hey guys,
Any plan to fix this bug?
We can't build some big images with kaniko 1.5.0/1.6.0 that work with docker build and kaniko 1.3.0.

tejal29 pushed a commit to Phylu/kaniko that referenced this issue Oct 19, 2021
Large images cannot be built as the kaniko container will be killed due to an OOM error. Removing the tarball compression drastically reduces the memory required to push large image layers. Fixes GoogleContainerTools#1680

This change may increase the build time for smaller images. Therefore a command line option to trigger the compression or a more intelligent behaviour may be useful.
tejal29 pushed a commit that referenced this issue Oct 19, 2021
…#1722)

* Remove tarball.WithCompressedCaching flag to resolve OOM Killed error

Large images cannot be built as the kaniko container will be killed due to an OOM error. Removing the tarball compression drastically reduces the memory required to push large image layers. Fixes #1680

This change may increase the build time for smaller images. Therefore a command line option to trigger the compression or a more intelligent behaviour may be useful.

* Add new command line flag to toggle compressed caching

* Add unittest for build with --compressed-caching command line flag set to false
Phylu mentioned this issue Jan 5, 2022
@0x217

0x217 commented Nov 15, 2023

I am building vast images and I had the same issue.
Disabling compressed caching (--compressed-caching=false) helps me, but when I try to build without cache and keep the default compressed caching (using only --cache=false), the OOMKill happens again.
Does that make sense? :)

My Kaniko version is v1.15.0.
