
cilium-node-init CrashLoopBackOff when running on Bottlerocket OS #1405

Closed
mmochan opened this issue Mar 19, 2021 · 10 comments
mmochan commented Mar 19, 2021

Platform I'm building on:

  • Platform: AWS EKS v1.18.9
  • Cilium version:
    Client: 1.9.5 079bdaf 2021-03-10T13:12:19-08:00 go version go1.15.8 linux/amd64
    Daemon: 1.9.5 079bdaf 2021-03-10T13:12:19-08:00 go version go1.15.8 linux/amd64
  • Kernel version (Bottlerocket OS):
    Linux ip-10-95-107-127.ap-southeast-2.compute.internal 5.4.95 #1 SMP Wed Mar 17 19:08:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • AMI: bottlerocket-aws-k8s-1.18-x86_64-v1.0.7-099d3398

What I expected to happen:
I expected the cilium-node-init pod to start successfully.

What actually happened:
cilium-node-init fails with

nsenter: failed to execute bash: No such file or directory
!!! startup-script failed! exit code '127'
stream closed
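The same error shows up in the node-init pod logs, which can be pulled with something like this (assuming the daemonset keeps its default name, cilium-node-init):

kubectl -n kube-system logs ds/cilium-node-init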

How to reproduce the problem:

  1. Deployed Cilium 1.9.5 with the following configuration:
    cni.chainingMode=aws-cni
    masquerade=false
    tunnel=disabled
    nodeinit.enabled=true
    hubble.relay.enabled=true
    hubble.listenAddress=:4244
    hubble.ui.enabled=true
  2. Added a worker node (Autoscaling group / Launch Template with Bottlerocket AMI)
  3. cilium-operator pod starts successfully
  4. cilium pods start successfully
  5. cilium-node-init fails with:
    nsenter: failed to execute bash: No such file or directory
    !!! startup-script failed! exit code '127'
    stream closed

[Screenshot: Screen Shot 2021-03-19 at 12 16 11 pm]

The daemonset for the cilium-node-init pods runs a startup script that expects bash to be available on the host, which I don't believe it is on Bottlerocket.

......
......
    spec:
      containers:
      - env:
        - name: CHECKPOINT_PATH
          value: /tmp/node-init.cilium.io
        - name: STARTUP_SCRIPT
          value: |
            #!/bin/bash

            set -o errexit
            set -o pipefail
            set -o nounset
            set -x trace

            mount | grep "/sys/fs/bpf type bpf" || {
              # Mount the filesystem until next reboot
              echo "Mounting BPF filesystem..."
              mount bpffs /sys/fs/bpf -t bpf
......
......
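To see the full script, it can be dumped straight from the daemonset; a sketch, again assuming the daemonset is named cilium-node-init:

kubectl -n kube-system get ds cilium-node-init \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="STARTUP_SCRIPT")].value}'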

The working node is an Amazon Linux 2 AMI

WilboMo (Contributor) commented Mar 22, 2021

Thank you for bringing the issue to our attention. The problem you’re experiencing is a result of the cilium-node-init container attempting to use the host’s bash. As Bottlerocket does not have a shell available on the host, accessing bash in this manner is not possible. However, we don’t believe the cilium-node-init pods are required to run Cilium on Bottlerocket successfully. Can you try that out and see if it works for you? You should be able to install Cilium via Helm and omit cilium-node-init as shown below:

helm install cilium cilium/cilium --version 1.9.5 --namespace kube-system --set eni=true --set ipam.mode=eni --set egressMasqueradeInterfaces=eth0 --set tunnel=disabled --set nodeinit.enabled=false

mmochan (Author) commented Mar 23, 2021

Thanks, I've tried that and it does work. However, it only works when cni.chainingMode is not set.

I need chaining mode enabled: cni.chainingMode=aws-cni

With it enabled, Cilium doesn't start:

cilium-operator-5f8b885d44-t76bk  0 ErrImagePull
cilium-p4dlv                      0 Init:ImagePullBackOff
cilium-rb2vw                      0 Init:ImagePullBackOff

and coredns terminates and restarts continuously
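The underlying image pull error can be inspected with, for example:

kubectl -n kube-system describe pod cilium-p4dlv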

@jhaynes jhaynes added status/needs-triage Pending triage or re-evaluation type/bug Something isn't working labels Mar 23, 2021
springroll12 commented

@mmochan Did you find a workaround for this? We're seeing the exact same issue unfortunately.

@jhaynes jhaynes added this to the next+1 milestone Mar 26, 2021
mmochan (Author) commented Mar 26, 2021

@springroll12 Unfortunately not yet.

@gregdek gregdek added the status/research This issue is being researched label Apr 2, 2021
@gregdek gregdek modified the milestones: next+1, oncall Apr 2, 2021
@gregdek gregdek added priority/p0 and removed status/needs-triage Pending triage or re-evaluation labels Apr 2, 2021
@jpculp jpculp self-assigned this May 24, 2021
jpculp (Member) commented Jun 4, 2021

After some trial and error, I was able to get Cilium v1.9.x to work after upgrading the AWS CNI to v1.7.9 (v1.7.5 seems to be the current default). You can read more about the issue and the fix on the amazon-vpc-cni-k8s repo. I did run into some different issues with v1.10.x, so the following only applies to v1.9.x.

These were the steps I used to successfully install cilium:

  1. Launch 1.18 cluster without a node group.
  2. Update AWS-CNI to v1.7.9.
    kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.7.9/config/v1.7/aws-k8s-cni.yaml
  3. Verify the new version.
    kubectl describe daemonset aws-node -n kube-system | grep Image | cut -d "/" -f 2
  4. Deploy cilium via helm with nodeinit.enabled=false:
helm install cilium cilium/cilium --version 1.9.5 \
  --namespace kube-system \
  --set cni.chainingMode=aws-cni \
  --set masquerade=false \
  --set tunnel=disabled \
  --set nodeinit.enabled=false
  5. Restart core-dns pods.
    kubectl rollout restart -n kube-system deployment/coredns
  6. Create nodegroup.
  7. Enable hubble:
helm upgrade cilium cilium/cilium --version 1.9.5 \
  --namespace kube-system \
  --reuse-values \
  --set hubble.relay.enabled=true \
  --set hubble.listenAddress=:4244 \
  --set hubble.ui.enabled=true
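As a quick sanity check after the install (assuming the agent daemonset keeps its default name, cilium), the agent's own health report can be queried with:

kubectl -n kube-system exec ds/cilium -- cilium status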

Please give these steps a shot and let us know how it goes!

Smana commented Jul 7, 2021

I have exactly the same problem: I've installed cilium successfully on Amazon Linux, but I get the same error using Bottlerocket.
k8s_version = 1.20
cilium = 1.10.2

I tried the steps described by @jpculp but that didn't work on my side.

jpculp (Member) commented Jul 7, 2021

Hi @Smana. I'm sorry to hear you're running into issues. We also saw them on our end with 1.10.x and are continuing to investigate. The fix above only applies to cilium 1.9.x.

Smana commented Jul 7, 2021

Let me know whenever a fix is ready to be tested; I'm happy to test it on my side.

@samuelkarp samuelkarp added the area/kubernetes K8s including EKS, EKS-A, and including VMW label Aug 3, 2021
jpculp (Member) commented Aug 17, 2021

Hi @Smana, sorry for the delay. Cilium 1.10.3 seems to initialize successfully with the steps detailed above, but with a slight modification to step 4: the masquerade option was renamed to enableIPv4Masquerade in Cilium 1.10.

Please give the following install configuration a shot and let us know how it goes!

helm install cilium cilium/cilium --version 1.10.3 \
  --namespace kube-system \
  --set cni.chainingMode=aws-cni \
  --set enableIPv4Masquerade=false \
  --set tunnel=disabled \
  --set nodeinit.enabled=false

If you aren't interested in AWS CNI chaining, these steps are also valid:

kubectl -n kube-system delete daemonset aws-node
helm install cilium cilium/cilium --version 1.10.3 \
  --namespace kube-system \
  --set eni.enabled=true \
  --set ipam.mode=eni \
  --set egressMasqueradeInterfaces=eth0 \
  --set tunnel=disabled \
  --set nodeinit.enabled=false
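As with the 1.9.x steps above, pods that started before Cilium (coredns in particular) may need a restart to pick up the new networking:

kubectl rollout restart -n kube-system deployment/coredns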

jpculp (Member) commented Sep 7, 2021

Hey @Smana, hopefully the updated instructions above are working for your use case. If you run into anything else, don't hesitate to open another issue and let us know.
