
Guest agent sometimes doesn't start (or is unreachable from host agent) #2064

jandubois opened this issue Dec 13, 2023 · 19 comments
Status: Open · Labels: bug (Something isn't working), component/qemu (QEMU)

jandubois (Member) commented Dec 13, 2023

Well, at least not initially.

[Update: it seems to start, but hostagent cannot connect to it]

We've observed this multiple times with Rancher Desktop after switching to Lima 0.19.0 (rancher-sandbox/rancher-desktop#6151).

AFAIK, this was always with QEMU and not VZ, and always when restarting an existing instance, and not when creating a new one.

Rancher Desktop is using Alpine; I have not seen it with other distros, but that doesn't mean much.

I don't think guest-agent logs are written to a file on Alpine; I will need to look into fixing that.

I've run into it twice today on different machines, but there don't seem to be any steps to reliably reproduce the problem.

  • macOS Sonoma 14.2 on M1 with QEMU 8.0.0

    time="2023-12-13T10:04:13-08:00" level=info msg="[hostagent] Waiting for the guest agent to be running"
    time="2023-12-13T10:04:13-08:00" level=info msg="[hostagent] Forwarding \"/var/run/docker.sock\" (guest) to \"/Users/jan/.rd/docker.sock\" (host)"
    time="2023-12-13T10:05:13-08:00" level=info msg="[hostagent] Waiting for the final requirement 1 of 1: \"boot scripts must have finished\""
    time="2023-12-13T10:05:13-08:00" level=info msg="[hostagent] The final requirement 1 of 1 is satisfied"
    time="2023-12-13T10:05:13-08:00" level=error msg="[guest agent does not seem to be running; port forwards will not work]"
    time="2023-12-13T10:05:13-08:00" level=warning msg="DEGRADED. The VM seems running, but file sharing and port forwarding may not work. (hint: see \"/Users/jan/Library/Application Support/rancher-desktop/lima/0/ha.stderr.log\")"
    time="2023-12-13T10:05:13-08:00" level=fatal msg="degraded, status={Running:true Degraded:true Exiting:false Errors:[guest agent does not seem to be running; port forwards will not work] SSHLocalPort:63025}"

  • macOS Monterey 12.7.2 on Intel with QEMU 7.1.0

    time="2023-12-13T10:35:24-08:00" level=info msg="[hostagent] Waiting for the guest agent to be running"
    time="2023-12-13T10:35:24-08:00" level=info msg="[hostagent] Forwarding \"/var/run/docker.sock\" (guest) to \"/Users/jan/.rd/docker.sock\" (host)"
    time="2023-12-13T10:36:24-08:00" level=info msg="[hostagent] Waiting for the final requirement 1 of 1: \"boot scripts must have finished\""
    time="2023-12-13T10:36:24-08:00" level=info msg="[hostagent] The final requirement 1 of 1 is satisfied"
    time="2023-12-13T10:36:24-08:00" level=error msg="[guest agent does not seem to be running; port forwards will not work]"
    time="2023-12-13T10:36:24-08:00" level=warning msg="DEGRADED. The VM seems running, but file sharing and port forwarding may not work. (hint: see \"/Users/jan/Library/Application Support/rancher-desktop/lima/0/ha.stderr.log\")"
    time="2023-12-13T10:36:24-08:00" level=fatal msg="degraded, status={Running:true Degraded:true Exiting:false Errors:[guest agent does not seem to be running; port forwards will not work] SSHLocalPort:62183}"
mook-as (Contributor) commented Dec 20, 2023

FWIW, I've been able to (sometimes) reproduce this with the default (Ubuntu-based) image. It's just a script that starts and stops the instance in a loop:

#!/usr/bin/env bash
set -o errexit
machine="$(limactl list --format '{{.Name}}')"
while true; do
    limactl list
    status="$(limactl list "${machine}" --format '{{ .Status }}')"
    case "$status" in
    Stopped) limactl --debug start "${machine}";;
    Running) limactl stop "${machine}";;
    Broken) exit;;
    esac
    sleep 1
done

Eventually limactl start returns an error (I don't believe I ever reached the Broken state).

This was at 04e3886. The VM was created with limactl create, accepting all defaults. QEMU (from Homebrew) 8.1.2.

stderr:
INFO[0022] [hostagent] Waiting for the guest agent to be running 
INFO[0082] [hostagent] Waiting for the final requirement 1 of 1: "boot scripts must have finished" 
DEBU[0082] [hostagent] executing script "boot scripts must have finished" 
DEBU[0082] [hostagent] executing ssh for script "boot scripts must have finished": /usr/bin/ssh [ssh -F /dev/null -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o NoHostAuthenticationForLocalhost=yes -o GSSAPIAuthentication=no -o PreferredAuthentications=publickey -o Compression=no -o BatchMode=yes -o IdentitiesOnly=yes -o Ciphers="^aes128-gcm@openssh.com,aes256-gcm@openssh.com" -o User=markyen -o ControlMaster=auto -o ControlPath=".../.lima/default/ssh.sock" -o ControlPersist=yes -p 60022 127.0.0.1 -- /bin/bash] 
DEBU[0082] [hostagent] stdout="", stderr="+ timeout 30s bash -c 'until sudo diff -q /run/lima-boot-done /mnt/lima-cidata/meta-data 2>/dev/null; do sleep 3; done'\n", err=<nil> 
INFO[0082] [hostagent] The final requirement 1 of 1 is satisfied 
DEBU[0082] received an event                             event="{2023-12-20 14:02:45.836987 -0800 PST {true true false [guest agent does not seem to be running; port forwards will not work] 60022}}"
ERRO[0082] [guest agent does not seem to be running; port forwards will not work] 
WARN[0082] DEGRADED. The VM seems running, but file sharing and port forwarding may not work. (hint: see ".../.lima/default/ha.stderr.log") 
FATA[0082] degraded, status={Running:true Degraded:true Exiting:false Errors:[guest agent does not seem to be running; port forwards will not work] SSHLocalPort:60022} 
sudo journalctl -b -u lima-guestagent
Dec 20 22:01:33 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Dec 20 22:01:34 lima-default lima-guestagent[1340]: time="2023-12-20T22:01:34Z" level=info msg="event tick: 3s"
Dec 20 22:01:34 lima-default lima-guestagent[1340]: time="2023-12-20T22:01:34Z" level=info msg="serving the guest agent on qemu serial file: /dev/virtio-ports/io.lima-vm.guest_agent.0"
Dec 20 22:01:39 lima-default systemd[1]: Stopping lima-guestagent.service - lima-guestagent...
Dec 20 22:01:39 lima-default systemd[1]: lima-guestagent.service: Deactivated successfully.
Dec 20 22:01:39 lima-default systemd[1]: Stopped lima-guestagent.service - lima-guestagent.
Dec 20 22:01:39 lima-default systemd[1]: lima-guestagent.service: Consumed 6.750s CPU time.
Dec 20 22:01:39 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Dec 20 22:01:39 lima-default lima-guestagent[2137]: time="2023-12-20T22:01:39Z" level=info msg="event tick: 3s"
Dec 20 22:01:39 lima-default lima-guestagent[2137]: time="2023-12-20T22:01:39Z" level=info msg="serving the guest agent on qemu serial file: /dev/virtio-ports/io.lima-vm.guest_agent.0"

jandubois changed the title Guest agent sometimes doesn't start → Guest agent sometimes doesn't start (or is unreachable from host agent) on Dec 20, 2023
jandubois (Member, Author) commented Dec 22, 2023

Both @mook-as and I have run git bisect with this updated script:

#!/usr/bin/env bash

mkdir -p bisect
make binaries

machine="$(limactl list --format '{{.Name}}')"
commit="$(git log -1 --format=%H)"
count=0
errors=0

while [[ "$count" -lt 100 ]]; do
    limactl list
    printf "\e[0;1;32mcount=%d errors=%d\e[0m\n" "${count}" "${errors}"
    status="$(limactl list "${machine}" --format '{{ .Status }}')"
    case "$status" in
    Stopped)
        if limactl --debug start "${machine}" >"bisect/${commit}.log" 2>&1; then
            count=$((count+1))
        else
            if grep --quiet 'guest agent does not seem to be running' "bisect/${commit}.log"; then
                exit 1
            fi
            errors=$((errors+1))
            if [[ "$errors" -gt 10 ]]; then
                exit 125
            fi
        fi
        ;;
    Running) limactl stop "${machine}" ;;
    esac
done

The script needs $PWD/_output/bin in the PATH, and you must have created a single instance for testing.
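
For reference, a rough sketch of how this script plugs into git bisect (assuming it is saved as bisect.sh at the repo root; the exit codes follow the git bisect run convention, where 125 means "skip this commit"):

export PATH="$PWD/_output/bin:$PATH"  # use the freshly built limactl
git bisect start v0.19.0 v0.18.0      # bad revision first, then good
git bisect run ./bisect.sh            # exit 0 = good, 1 = bad, 125 = skip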

Bisecting between v0.18.0 and v0.19.0, for both of us (on M1 machines) the bisect identified #2009 ("hostagent: avoid requiring fuser and ss in the guest; CI: update colima v0.5.5→v0.6.5", by AkihiroSuda) as the "first bad commit". I still have another bisect running on an Intel machine, but it is a very slow box; there may not be results until tomorrow.

I tried adding sleep 60 after the limactl stop, in case QEMU had not finished shutting down by the time the hostagent exited, but it made no difference. Extending the hostagent timeout from 1min to 5min also did not make a difference (beyond it now taking 5min to report that the guestagent is unreachable).

@AkihiroSuda Any ideas?

PS: Yes, the bisect script is explicitly looking for the error message that was introduced in the PR that was eventually found. But the old code had never indicated that there was a problem communicating with the guestagent.

Note that the guestagent is running; it is the code trying to communicate with it that is failing.

jandubois removed the guest/alpine (Guest: Alpine) label on Jan 2, 2024
balajiv113 (Member) commented:

@jandubois Are you able to view the guestagent logs? Are any "connection closed"-like errors thrown?

This might also be due to the use of the serial port as the channel. Maybe some error cases are not handled properly.

jandubois (Member, Author) commented Jan 2, 2024

@jandubois Are you able to view the guestagent logs? Are any "connection closed"-like errors thrown?

Yes, I can see the logs. Now it happened to me for the first time on a fresh start:

$ l start
? Creating an instance "default" Proceed with the current configuration
INFO[0000] Starting the instance "default" with VM driver "qemu"
INFO[0000] QEMU binary "/opt/homebrew/bin/qemu-system-aarch64" seems properly signed with the "com.apple.security.hypervisor" entitlement
[...]
INFO[0099] [hostagent] Waiting for the final requirement 1 of 1: "boot scripts must have finished"
INFO[0099] [hostagent] The final requirement 1 of 1 is satisfied
ERRO[0099] [guest agent does not seem to be running; port forwards will not work]
WARN[0099] DEGRADED. The VM seems running, but file sharing and port forwarding may not work. (hint: see "/Users/jan/.lima/default/ha.stderr.log")
FATA[0099] degraded, status={Running:true Degraded:true Exiting:false Errors:[guest agent does not seem to be running; port forwards will not work] SSHLocalPort:60022}

The guestagent log indicates that the agent was stopped and started again, which feels odd:

$ lima sudo journalctl -b -u lima-guestagent
Jan 02 11:21:14 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:21:14 lima-default systemd[1]: Stopping lima-guestagent.service - lima-guestagent...
Jan 02 11:21:14 lima-default systemd[1]: lima-guestagent.service: Deactivated successfully.
Jan 02 11:21:14 lima-default systemd[1]: Stopped lima-guestagent.service - lima-guestagent.
Jan 02 11:21:14 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:21:14 lima-default lima-guestagent[2622]: time="2024-01-02T11:21:14-08:00" level=info msg="event tick: 3s"
Jan 02 11:21:14 lima-default lima-guestagent[2622]: time="2024-01-02T11:21:14-08:00" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"

I don't think this is always the case, but I will verify this.

There is no error in ha.stderr.log and the instance no longer looks degraded, so maybe this is different from the other failures.

Attachment: ha.stderr.log

jandubois (Member, Author) commented:

I don't think this is always the case, but I will verify this.

Have reproduced with the bisect script (just running against HEAD of master):

$ lima sudo journalctl -b -u lima-guestagent
Jan 02 11:39:44 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:39:44 lima-default lima-guestagent[1346]: time="2024-01-02T11:39:44-08:00" level=info msg="event tick: 3s"
Jan 02 11:39:44 lima-default lima-guestagent[1346]: time="2024-01-02T11:39:44-08:00" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"
Jan 02 11:39:48 lima-default systemd[1]: Stopping lima-guestagent.service - lima-guestagent...
Jan 02 11:39:48 lima-default systemd[1]: lima-guestagent.service: Deactivated successfully.
Jan 02 11:39:48 lima-default systemd[1]: Stopped lima-guestagent.service - lima-guestagent.
Jan 02 11:39:48 lima-default systemd[1]: lima-guestagent.service: Consumed 5.085s CPU time.
Jan 02 11:39:48 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:39:48 lima-default lima-guestagent[2118]: time="2024-01-02T11:39:48-08:00" level=info msg="event tick: 3s"
Jan 02 11:39:48 lima-default lima-guestagent[2118]: time="2024-01-02T11:39:48-08:00" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"

This time the first run of the guest agent gets to emit at least the 2 startup log lines, but otherwise looks the same.

jandubois (Member, Author) commented:

But running on Alpine (again with HEAD of master), I don't see the guest agent being restarted (yet we still go into the degraded state because communication fails):

$ l shell alpine sudo rc-status | grep guestagent
 lima-guestagent                                       [  started 00:02:14 (0) ]

$ l shell alpine sudo cat /var/log/lima-guestagent.log
time="2024-01-02T19:49:02Z" level=info msg="event tick: 3s"
time="2024-01-02T19:49:02Z" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"

I guess it is possible that the first run of the guest agent didn't output anything, but I think the (0) part of the rc-status output means that supervise-daemon did not restart the service.

@balajiv113 It is pretty straightforward to repro the issue: you must have just a single instance, and then run the script from #2064 (comment), which will stop and restart the instance until it breaks. Or edit the script to hard-code the instance name (see the snippet below), if you have other instances that you need to keep.
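
To hard-code the name, the change is just the one assignment near the top of the script (the name "default" here is only an example):

machine="default"   # instead of: machine="$(limactl list --format '{{.Name}}')"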

Failure typically happens in the first 5-10 iterations, but sometimes takes longer.

jandubois (Member, Author) commented:

This might also be due to the use of the serial port as the channel. Maybe some error cases are not handled properly.

Possible, but in my testing simply reverting 482678b (and installing ss in the instance when using VZ) was enough to "fix" this problem. So the problem seems to be with the new "alive" checking code and not the serial port channel by itself.

jandubois (Member, Author) commented:

This might also be due to the use of the serial port as the channel. Maybe some error cases are not handled properly.

Possible, but in my testing simply reverting 482678b (and installing ss in the instance when using VZ) was enough to "fix" this problem. So the problem seems to be with the new "alive" checking code and not the serial port channel by itself.

@balajiv113 I was wrong; the problem does indeed seem to be due to the use of the serial port as a channel, and is not related to the updated readiness test. The bisect script was not testing correctly for the error condition. I've since re-run the bisect between 0.18.0 and 0.19.0 with QEMU and the default Ubuntu image with this script:

#!/usr/bin/env bash

mkdir -p bisect
make binaries

machine="$(limactl list --format '{{.Name}}')"
commit="$(git log -1 --format=%H)"
count=0
errors=0

while [[ "$count" -lt 100 ]]; do
    limactl list
    printf "\e[0;1;32mcount=%d errors=%d\e[0m\n" "${count}" "${errors}"
    status="$(limactl list "${machine}" --format '{{ .Status }}')"
    case "$status" in
    Stopped)
        if limactl --debug start "${machine}" >"bisect/${commit}.log" 2>&1; then
            count=$((count+1))
            sleep 7
            if [ "$machine" = "alpine" ]; then
                limactl shell --workdir / "${machine}" sudo timeout 10 nc -l -p 4444 -s 0.0.0.0 2>/dev/null
            else
                limactl shell --workdir / "${machine}" sudo timeout 10 nc -l 4444
            fi
            sleep 7
            if ! grep --quiet '0.0.0.0:4444' "$HOME/.lima/${machine}/ha.stderr.log"; then
                exit 1
            fi
        else
            exit 125
        fi
        ;;
    Running) limactl stop "${machine}" ;;
    esac
done

Instead of relying on indirect evidence of the guest agent running, it opens port 4444 inside the instance and then checks ha.stderr.log to verify that a forwarding request has been received.
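
For reference, the same check can be done by hand (a rough sketch, assuming an instance named "default" and nc available in the guest; the commands mirror the script above):

limactl shell --workdir / default sudo timeout 10 nc -l 4444 &  # open a listener inside the guest
sleep 7                                                         # give the hostagent time to notice the new port
grep '0.0.0.0:4444' "$HOME/.lima/default/ha.stderr.log"         # a forward should have been logged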

This bisect identifies f947de0 as the "first bad commit" causing the script to fail.

So far I have still not seen any test fail with VZ, only with QEMU, so only the serial port code seems to be affected.
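
As a quick sanity check that the serial channel device itself is present in the guest (a sketch; the device path is taken from the guest-agent log above, and the instance name "default" is an assumption):

limactl shell default -- ls -l /dev/virtio-ports/io.lima-vm.guest_agent.0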

jandubois added several commits to rancher-sandbox/lima that referenced this issue on Jan 3, 2024 (including cherry-picks of a8ef354 and 9e1a07a), all with the same (truncated) message:

…ication

The serial port sometimes doesn't seem to work: lima-vm#2064

Signed-off-by: Jan Dubois <jan.dubois@suse.com>

DennisRasey pushed a commit to DennisRasey/lima that referenced this issue on Jan 11, 2024, with the same message.
iilyak commented Jan 18, 2024

It seems like the fix has been merged. Is there a timeline for a new release, so it can be pulled with brew upgrade lima without messing with building from master?

AkihiroSuda (Member) commented:

Released v0.20.

It will be available on Homebrew in a couple of hours.

DrScho commented Apr 15, 2024

I use Lima 0.21.0 but still have the problem.

dubcl commented Apr 17, 2024

Same here, also with Lima 0.21.0.

jandubois (Member, Author) commented:

@DrScho @dubcl Are you on Linux or macOS? x86_64 or aarch64? Using QEMU or VZ?

dubcl commented Apr 29, 2024

Hi @jandubois, I'm on macOS Sonoma 14.4.1, arm64.

$ uname -a
Darwin MACHINENAME 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:10:42 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6000 arm64

jandubois (Member, Author) commented:

@dubcl Please confirm whether you are using QEMU or VZ.

dubcl commented Apr 29, 2024

sorry, QEMU

$ qemu-system-x86_64 -version
QEMU emulator version 8.2.1

jandubois (Member, Author) commented:

sorry, QEMU

Thanks! That configuration should have been "fixed" by #2112 (and I could no longer reproduce it with Lima 0.20).

Just for completeness, are you running the default template, or something else?

And does the guestagent disconnect right away, or only after the machine goes to sleep, or just after it has been running for a while? Does this happen with an empty instance, or only after you deploy some workload on it?

Sorry for all the questions, but this bug has been really time-consuming to debug before, so any information that helps reproduce the issue more reliably would help whoever investigates.

DrScho commented May 6, 2024

@DrScho @dubcl Are you on Linux or macOS? x86_64 or aarch64? Using QEMU or VZ?

I am running Linux (on macOS):
23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:10:42 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6000 arm64
and QEMU:
QEMU emulator version 9.0.0
and Lima 0.21.0 with the default instance.
I just ran 'limactl start' and there is no problem with the guest system.
But when I try to start a Docker container with 'docker run --name myName -p 127.0.0.1:1521:1521/tcp myImage',
I still receive 'docker: 'run --name' is not a docker command.', although the command appears to be correct and docker in the VM supports the --name option.

dubcl commented May 6, 2024

sorry, QEMU

Thanks! That configuration should have been "fixed" by #2112 (and I could no longer reproduce it with Lima 0.20).

I already use 0.21; I think this happened when updating from 0.20 to 0.21.

Just for completeness, are you running the default template, or something else?

Yes, I guess; this is my config:

# This template requires Lima v0.7.0 or later.
arch: "x86_64"
cpus: 4
memory: "6GiB"
vmType: "qemu"
firmware:
  legacyBIOS: true
images:
  # Try to use release-yyyyMMdd image if available. Note that release-yyyyMMdd will be removed after several months.
  - location: "https://cloud-images.ubuntu.com/releases/23.04/release-20230810/ubuntu-23.04-server-cloudimg-amd64.img"
    arch: "x86_64"
    digest: "sha256:5ad255d32a30a2cda9f0df19f0a6ce8d6f3c81b63845086a4cb5c43cf97fcb92"
  - location: "https://cloud-images.ubuntu.com/releases/23.04/release-20230810/ubuntu-23.04-server-cloudimg-arm64.img"
    arch: "aarch64"
    digest: "sha256:af62ca6ba307388f7e0a8ad1c46103e6aea0130a09122e818df8d711637bf998"
  # Fallback to the latest release image.
  # Hint: run `limactl prune` to invalidate the cache
  - location: "https://cloud-images.ubuntu.com/releases/23.04/release/ubuntu-23.04-server-cloudimg-amd64.img"
    arch: "x86_64"
  - location: "https://cloud-images.ubuntu.com/releases/23.04/release/ubuntu-23.04-server-cloudimg-arm64.img"
    arch: "aarch64"
mounts:
  - location: "~"
  - location: "/tmp/lima"
    writable: true

And does the guestagent disconnect right away, or only after the machine goes to sleep, or just after it has been running for a while?

I'm not sure; I never put it to sleep. It is always on, because I mostly use it as a desktop.

Does this happen with an empty instance, or only after you deploy some workload on it?

I noticed it when I tried to set up a port forward for an nginx image I was testing.

Sorry for all the questions, but this bug has been really time-consuming to debug before, so any information that helps to reproduce the issue more reliably would help someone to investigate.

No problem, glad to help.


Additionally, as a test, I removed the default folder and started a "new default" with the same default.yaml, and everything works OK. If I get the error again, I will come back and comment.
