
cilium-node-init CrashLoopBackOff when running on Bottlerocket OS #1405

Closed
mmochan opened this issue Mar 19, 2021 · 10 comments
mmochan commented Mar 19, 2021

Platform I'm building on:

  • Platform: AWS EKS v1.18.9
  • Cilium version:
    Client: 1.9.5 079bdaf 2021-03-10T13:12:19-08:00 go version go1.15.8 linux/amd64
    Daemon: 1.9.5 079bdaf 2021-03-10T13:12:19-08:00 go version go1.15.8 linux/amd64
  • Kernel version (Bottlerocket OS):
    Linux ip-10-95-107-127.ap-southeast-2.compute.internal 5.4.95 #1 SMP Wed Mar 17 19:08:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • AMI: bottlerocket-aws-k8s-1.18-x86_64-v1.0.7-099d3398

What I expected to happen:
I expected the cilium-node-init pod to start successfully.

What actually happened:
cilium-node-init fails with

nsenter: failed to execute bash: No such file or directory
!!! startup-script failed! exit code '127'
stream closed
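The same error shows up in the node-init pod logs, which can be pulled with something like this (assuming the daemonset keeps its default name, cilium-node-init):

kubectl -n kube-system logs ds/cilium-node-init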

How to reproduce the problem:

  1. Deployed Cilium 1.9.5 with the following configuration:
    cni.chainingMode=aws-cni
    masquerade=false
    tunnel=disabled
    nodeinit.enabled=true
    hubble.relay.enabled=true
    hubble.listenAddress=:4244
    hubble.ui.enabled=true
  2. Added a worker node (Autoscaling group / Launch Template with Bottlerocket AMI)
  3. cilium-operator pod starts successfully
  4. cilium pods start successfully
  5. cilium-node-init fails with:
    nsenter: failed to execute bash: No such file or directory
    !!! startup-script failed! exit code '127'
    stream closed

[Screenshot: Screen Shot 2021-03-19 at 12 16 11 pm]

The daemonset for the cilium-node-init pods runs a startup script that expects bash to be available on the host, which I don't believe it is on Bottlerocket.

......
......
    spec:
      containers:
      - env:
        - name: CHECKPOINT_PATH
          value: /tmp/node-init.cilium.io
        - name: STARTUP_SCRIPT
          value: |
            #!/bin/bash

            set -o errexit
            set -o pipefail
            set -o nounset
            set -x trace

            mount | grep "/sys/fs/bpf type bpf" || {
              # Mount the filesystem until next reboot
              echo "Mounting BPF filesystem..."
              mount bpffs /sys/fs/bpf -t bpf
......
......
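To see the full script, it can be dumped straight from the daemonset; a sketch, again assuming the daemonset is named cilium-node-init:

kubectl -n kube-system get ds cilium-node-init \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="STARTUP_SCRIPT")].value}'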

The working node is an Amazon Linux 2 AMI

WilboMo (Contributor) commented Mar 22, 2021

Thank you for bringing the issue to our attention. The problem you’re experiencing is a result of the cilium-node-init container attempting to use the host’s bash. As Bottlerocket does not have a shell available on the host, accessing bash in this manner is not possible. However, we don’t believe the cilium-node-init pods are required to run Cilium on Bottlerocket successfully. Can you try that out and see if it works for you? You should be able to install Cilium via Helm and omit cilium-node-init as shown below:

helm install cilium cilium/cilium --version 1.9.5 --namespace kube-system --set eni=true --set ipam.mode=eni --set egressMasqueradeInterfaces=eth0 --set tunnel=disabled --set nodeinit.enabled=false

mmochan (Author) commented Mar 23, 2021

Thanks, I've tried that and it does work. However, it only works when cni.chainingMode is not set.

I need chaining mode enabled: cni.chainingMode=aws-cni

With it enabled, Cilium doesn't start:

cilium-operator-5f8b885d44-t76bk  0 ErrImagePull
cilium-p4dlv                      0 Init:ImagePullBackOff
cilium-rb2vw                      0 Init:ImagePullBackOff

and coredns terminates and restarts continuously
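The underlying image pull error can be inspected with, for example:

kubectl -n kube-system describe pod cilium-p4dlv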

@jhaynes jhaynes added status/needs-triage Pending triage or re-evaluation type/bug Something isn't working labels Mar 23, 2021
springroll12 commented

@mmochan Did you find a workaround for this? We're seeing the exact same issue unfortunately.

@jhaynes jhaynes added this to the next+1 milestone Mar 26, 2021
mmochan (Author) commented Mar 26, 2021

@springroll12 Unfortunately not yet.

@gregdek gregdek added the status/research This issue is being researched label Apr 2, 2021
@gregdek gregdek modified the milestones: next+1, oncall Apr 2, 2021
@gregdek gregdek added priority/p0 and removed status/needs-triage Pending triage or re-evaluation labels Apr 2, 2021
@jpculp jpculp self-assigned this May 24, 2021
jpculp (Member) commented Jun 4, 2021

After some trial and error, I was able to get Cilium v1.9.x to work after upgrading the AWS CNI to v1.7.9 (v1.7.5 seems to be the current default). You can read more about the issue and the fix on the amazon-vpc-cni-k8s repo. I did run into some different issues with v1.10.x, so the following only applies to v1.9.x.

These were the steps I used to successfully install cilium:

  1. Launch 1.18 cluster without a node group.
  2. Update AWS-CNI to v1.7.9.
    kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.7.9/config/v1.7/aws-k8s-cni.yaml
  3. Verify the new version.
    kubectl describe daemonset aws-node -n kube-system | grep Image | cut -d "/" -f 2
  4. Deploy cilium via helm with nodeinit.enabled=false:
helm install cilium cilium/cilium --version 1.9.5 \
  --namespace kube-system \
  --set cni.chainingMode=aws-cni \
  --set masquerade=false \
  --set tunnel=disabled \
  --set nodeinit.enabled=false
  5. Restart core-dns pods.
    kubectl rollout restart -n kube-system deployment/coredns
  6. Create nodegroup.
  7. Enable hubble:
helm upgrade cilium cilium/cilium --version 1.9.5 \
  --namespace kube-system \
  --reuse-values \
  --set hubble.relay.enabled=true \
  --set hubble.listenAddress=:4244 \
  --set hubble.ui.enabled=true
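As a quick sanity check after the install (assuming the agent daemonset keeps its default name, cilium), the agent's own health report can be queried with:

kubectl -n kube-system exec ds/cilium -- cilium status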

Please give these steps a shot and let us know how it goes!

Smana commented Jul 7, 2021

I have exactly the same problem: I've installed cilium successfully on Amazon Linux, but I get the same error using Bottlerocket.
k8s_version = 1.20
cilium = 1.10.2

I tried the steps described by @jpculp but that didn't work on my side.

jpculp (Member) commented Jul 7, 2021

Hi @Smana. I'm sorry to hear you're running into issues. We also saw them on our end with 1.10.x and are continuing to investigate. The fix above only applies to cilium 1.9.x.

Smana commented Jul 7, 2021

Let me know whenever a fix is ready to be tested; I'm happy to test it on my side.

@samuelkarp samuelkarp added the area/kubernetes K8s including EKS, EKS-A, and including VMW label Aug 3, 2021
jpculp (Member) commented Aug 17, 2021

Hi @Smana, sorry for the delay. Cilium 1.10.3 seems to initialize successfully with the steps detailed above, but with a slight modification to step 4: the masquerade option was renamed to enableIPv4Masquerade in Cilium 1.10.

Please give the following install configuration a shot and let us know how it goes!

helm install cilium cilium/cilium --version 1.10.3 \
  --namespace kube-system \
  --set cni.chainingMode=aws-cni \
  --set enableIPv4Masquerade=false \
  --set tunnel=disabled \
  --set nodeinit.enabled=false

If you aren't interested in AWS CNI chaining, these steps are also valid:

kubectl -n kube-system delete daemonset aws-node
helm install cilium cilium/cilium --version 1.10.3 \
  --namespace kube-system \
  --set eni.enabled=true \
  --set ipam.mode=eni \
  --set egressMasqueradeInterfaces=eth0 \
  --set tunnel=disabled \
  --set nodeinit.enabled=false
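As with the 1.9.x steps above, pods that started before Cilium (coredns in particular) may need a restart to pick up the new networking:

kubectl rollout restart -n kube-system deployment/coredns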

jpculp (Member) commented Sep 7, 2021

Hey @Smana, hopefully the updated instructions above are working for your use case. If you run into anything else, don't hesitate to open another issue and let us know.
