
"gcr.io/kaniko-project/executor:latest" failed: step exited with non-zero status: 137 #1669

Open
snthibaud opened this issue Jun 13, 2021 · 17 comments
Labels
area/caching, area/gcb, area/performance, issue/build-fails, issue/oom, platform/cloud-build, priority/p2, work-around-available

Comments

@snthibaud

Actual behavior
I am running a build on Cloud Build. The build itself succeeds, but the caching snapshot at the end fails with the following messages:

Step #0: INFO[0154] Taking snapshot of full filesystem...
Finished Step #0
ERROR
ERROR: build step 0 "gcr.io/kaniko-project/executor:latest" failed: step exited with non-zero status: 137

Expected behavior
I would like the whole build to succeed - including caching.

To Reproduce
Steps to reproduce the behavior:

  1. Build on GCP Cloud Build using a cloudbuild.yaml with Kaniko caching enabled.
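For reference, a minimal cloudbuild.yaml along these lines reproduces the setup; the image name, TTL, and timeout below are illustrative placeholders rather than the original configuration:

steps:
- name: 'gcr.io/kaniko-project/executor:latest'
  args:
  - --destination=gcr.io/$PROJECT_ID/my-image   # placeholder image name
  - --cache=true                                # enable Kaniko layer caching
  - --cache-ttl=24h                             # placeholder cache TTL
timeout: 1800s                                  # placeholder build timeout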

Additional Information
I cannot provide the Dockerfile, but it is based on continuumio/miniconda3 and also installs tensorflow in a conda environment. I think it started failing after tensorflow was added to the list of dependencies.

@snthibaud
Author

Additionally, it builds fine with caching disabled, and also when a heavier 8-CPU machine type is used. However, I think it's strange that Kaniko caching requires more resources than the build itself.
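For reference, the heavier worker mentioned above can be requested through the build options in cloudbuild.yaml; a minimal sketch, where the machine type value is just one example of a high-CPU tier:

options:
  machineType: 'E2_HIGHCPU_8'   # example Cloud Build machine type; pick a tier with enough memory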

@hugbubby

I've been trying to work around this issue for the past several days. Kaniko consistently tries to use more memory than our Kubernetes cluster has available. It only happens with our large images.
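For context, when the executor runs as a Kubernetes pod its memory ceiling comes from the container's resource settings, so one mitigation is to give it explicit headroom. A minimal sketch assuming a generic pod spec; the pod name, destination, and sizes are placeholders, and build-context and registry-credential setup are omitted:

apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build                                 # placeholder pod name
spec:
  restartPolicy: Never
  containers:
  - name: kaniko
    image: gcr.io/kaniko-project/executor:latest
    args:
    - --destination=registry.example.com/my-image    # placeholder destination
    - --cache=true
    resources:
      requests:
        memory: 4Gi                                  # placeholder; size to the build's observed peak
      limits:
        memory: 8Gi                                  # placeholder ceiling; exceeding it still ends in OOMKilled (137)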

@dakl

dakl commented Jul 1, 2021

Any workaround available? My base image is tensorflow/tensorflow:2.4.0-gpu, which weighs 2.35 GB compressed.

@tk42

tk42 commented Jul 8, 2021

@dakl Try downgrading to v1.3.0 (as mentioned in #1680); it works for me.
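For reference, pinning the version just means swapping the step image in cloudbuild.yaml; a minimal sketch with a placeholder image name:

steps:
- name: 'gcr.io/kaniko-project/executor:v1.3.0'   # pinned tag instead of :latest
  args:
  - --destination=gcr.io/$PROJECT_ID/my-image     # placeholder image name
  - --cache=true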

@Mistic92

Any update on this topic? I have this issue with every ML-related Dockerfile where we need to use PyTorch and other libraries.

@imjasonh
Collaborator

The :latest image is quite old, pointing to :v1.6.0 due to issues with :v1.7.0

It's possible the bug is fixed at head, and while we wait for a v1.8.0 release (#1871) you can try out the latest commit-tagged release and see if that helps: gcr.io/kaniko-project/executor:09e70e44d9e9a3fecfcf70cb809a654445837631

If it's not fixed, it sounds like we need to figure out where layer contents are being buffered into memory while being cached, which it sounds like was introduced some time between v1.3 and now. If anybody investigates and finds anything useful, please add it here.

@Mistic92

Looks like it worked, but I tried with the cache disabled. On 1.6 it was failing even with the cache disabled, so that's a good sign.

@wahyueko22

Any update on this issue? I am facing the same problem when deploying an ML image with sentence-transformers and torch>=1.6.0. The image size is more than 3 GB.

@imjasonh
Collaborator

imjasonh commented Mar 8, 2022

> Any update on this issue? I am facing the same problem when deploying an ML image with sentence-transformers and torch>=1.6.0. The image size is more than 3 GB.

It sounds like #1669 (comment) says this works with a newer commit-tagged image, and with caching disabled. It sounds like caching causes filesystem contents to be buffered in memory, which causes problems with large images.

@lappazos

> The :latest image is quite old, pointing to :v1.6.0 due to issues with :v1.7.0
>
> It's possible the bug is fixed at head, and while we wait for a v1.8.0 release (#1871) you can try out the latest commit-tagged release and see if that helps: gcr.io/kaniko-project/executor:09e70e44d9e9a3fecfcf70cb809a654445837631
>
> If it's not fixed, it sounds like we need to figure out where layer contents are being buffered into memory while being cached, which it sounds like was introduced some time between v1.3 and now. If anybody investigates and finds anything useful, please add it here.

This happened to me too with a large image, and the referenced commit solved it. Any update on why it's not solved yet in v1.8.1? @imjasonh

@imjasonh
Collaborator

#2115 is the issue tracking the next release. I don't have any more information than what's in that issue.

@imjasonh
Collaborator

Does this issue still happen at the latest commit-tagged image? With and without caching enabled?

@granthamtaylor

granthamtaylor commented Aug 7, 2022

@imjasonh I am still experiencing this issue with :latest and v1.8.1 for an image with PyTorch installed.

v1.3.0 seems to work as expected. Thank you @tk42 for the suggestion!

@irg1008

irg1008 commented Sep 7, 2022

Any news on this? Still happening on v1.9.0

@spookyuser

Adding --compressed-caching=false works for me on 1.9.0.

devxpy added a commit to GooeyAI/gooey-server that referenced this issue Nov 16, 2022
@jtwigg

jtwigg commented Mar 29, 2023

--compressed-caching=false worked well for most things except for COPY <src> <dst>, and it turns out there's also --cache-copy-layers. I was still getting crushed by PyTorch installations.

This is the cloudbuild.yaml that works really well now:

steps:
- name: 'gcr.io/kaniko-project/executor:latest'
  args:
  - --destination=gcr.io/$PROJECT_ID/<name>   # push target for the built image
  - --cache=true                              # enable Kaniko layer caching
  - --cache-ttl=48h                           # keep cached layers for 48 hours
  - --compressed-caching=false                # skip compressing cached layers to cut memory use
  - --cache-copy-layers=true                  # also cache layers produced by COPY

davidcavazos added a commit to davidcavazos/beam that referenced this issue Jun 5, 2023
Disable cache compression to allow large images, like images depending on `tensorflow` or `torch`.

For more information, see: GoogleContainerTools/kaniko#1669
@aaron-prindle added the area/caching, issue/oom, work-around-available, area/gcb, platform/cloud-build, issue/build-fails, area/performance, and priority/p2 labels Jun 25, 2023
@javiercornejo

I confirm I was having the same issue in Cloud Build, and the --compressed-caching=false flag has solved the problem with :latest so far.

@aaron-prindle added and removed the priority/p0 and priority/p2 labels Mar 11, 2024