Live migration failures after upgrading from 4.5.2 to 4.5.4 #372

Open
ebyrne242 opened this issue Feb 7, 2023 · 4 comments

@ebyrne242

Everything was running on oVirt 4.5.2 on Rocky 8.6 with no issues for months. On Friday I updated our standalone engine to Rocky 8.7 and oVirt 4.5.4, decommissioned one of the hosts, then built a replacement on new hardware (same CPU) on 4.5.4. I was able to migrate VMs to and from the new host successfully and called it good, then upgraded the rest of my hosts in sequence to Rocky 8.7 and oVirt 4.5.4. The only issue I ran into was that I had to manually run dnf update --allowerasing due to some ovs package changes. No issues were apparent in operation at this time.

On Monday morning, one VM appeared to have hung and was manually restarted. A couple of hours later a few VMs stopped running with this error in the web UI: "VM is down. Exit message: User shut down from within the guest". This may have been triggered by an oVirt migration attempt after the VMs were marked unresponsive. Within 10 minutes of this, the host they were on was marked as unresponsive. oVirt tried to migrate them and failed with the error below. My suspicion was that the host was holding onto the storage lease even though the VMs were dead. I saw errors like this in the log on the host marked NonOperational:

2023-02-06 11:47:31,184-0800 ERROR (jsonrpc/2) [api] FINISH detach_volume error=Managed Volume Helper failed.: ('Error executing helper: Command [\'/usr/bin/python3\', \'/usr/libexec/vdsm/managedvolume-helper\', \'detach\'] failed with rc=1 out=b\'\' err=b\'oslo.privsep.daemon: Running privsep helper: [\\\'sudo\\\', \\\'privsep-helper\\\', \\\'--privsep_context\\\', \\\'os_brick.privileged.default\\\', \\\'--privsep_sock_path\\\', \\\'/tmp/tmpebl0nwmd/privsep.sock\\\']\\noslo.privsep.daemon: Spawned new privsep daemon via rootwrap\\noslo.privsep.daemon: privsep daemon starting\\noslo.privsep.daemon: privsep process running with uid/gid: 0/0\\noslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none\\noslo.privsep.daemon: privsep daemon running as pid 3275650\\nTraceback (most recent call last):\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 139, in <module>\\n    sys.exit(main(sys.argv[1:]))\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 62, in main\\n    args.command(args)\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 134, in detach\\n    ignore_errors=False)\\n  File "/usr/lib/python3.6/site-packages/vdsm/storage/nos_brick.py", line 110, in disconnect_volume\\n    run_as_root=True)\\n  File "/usr/lib/python3.6/site-packages/os_brick/executor.py", line 53, in _execute\\n    result = self.__execute(*args, **kwargs)\\n  File "/usr/lib/python3.6/site-packages/os_brick/privileged/rootwrap.py", line 172, in execute\\n    return execute_root(*cmd, **kwargs)\\n  File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 272, in _wrap\\n    r_call_timeout)\\n  File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 215, in remote_call\\n    raise exc_type(*result[2])\\noslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\\nCommand: rbd unmap /dev/rbd/vmstore/volume-9b11821b-a9dc-4641-9c6f-00bb326a4eda --conf /tmp/brickrbd_ukuiqq_t --id vmstore --mon_host 192.168.88.75:6789 --mon_host 192.168.8.76:6789 --mon_host 192.168.8.77:6789\\nExit code: 16\\nStdout: \\\'\\\'\\nStderr: \\\'rbd: sysfs write failed\\\\nrbd: unmap failed: (16) Device or resource busy\\\\n\\\'\\n\'',) (api:114)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 107, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 1808, in detach_volume
    return managedvolume.detach_volume(sd_id, vol_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/managedvolume.py", line 63, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/managedvolume.py", line 138, in detach_volume
    run_helper("detach", vol_info)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/managedvolume.py", line 184, in run_helper
    sub_cmd, cmd_input=cmd_input)
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 38, in __call__
    return callMethod()
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 36, in <lambda>
    **kwargs)
  File "<string>", line 2, in managedvolume_run_helper
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
vdsm.storage.exception.ManagedVolumeHelperFailed: Managed Volume Helper failed.: ('Error executing helper: Command [\'/usr/bin/python3\', \'/usr/libexec/vdsm/managedvolume-helper\', \'detach\'] failed with rc=1 out=b\'\' err=b\'oslo.privsep.daemon: Running privsep helper: [\\\'sudo\\\', \\\'privsep-helper\\\', \\\'--privsep_context\\\', \\\'os_brick.privileged.default\\\', \\\'--privsep_sock_path\\\', \\\'/tmp/tmpebl0nwmd/privsep.sock\\\']\\noslo.privsep.daemon: Spawned new privsep daemon via rootwrap\\noslo.privsep.daemon: privsep daemon starting\\noslo.privsep.daemon: privsep process running with uid/gid: 0/0\\noslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none\\noslo.privsep.daemon: privsep daemon running as pid 3275650\\nTraceback (most recent call last):\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 139, in <module>\\n    sys.exit(main(sys.argv[1:]))\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 62, in main\\n    args.command(args)\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 134, in detach\\n    ignore_errors=False)\\n  File "/usr/lib/python3.6/site-packages/vdsm/storage/nos_brick.py", line 110, in disconnect_volume\\n    run_as_root=True)\\n  File "/usr/lib/python3.6/site-packages/os_brick/executor.py", line 53, in _execute\\n    result = self.__execute(*args, **kwargs)\\n  File "/usr/lib/python3.6/site-packages/os_brick/privileged/rootwrap.py", line 172, in execute\\n    return execute_root(*cmd, **kwargs)\\n  File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 272, in _wrap\\n    r_call_timeout)\\n  File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 215, in remote_call\\n    raise exc_type(*result[2])\\noslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\\nCommand: rbd unmap /dev/rbd/vmstore/volume-9b11821b-a9dc-4641-9c6f-00bb326a4eda --conf /tmp/brickrbd_ukuiqq_t --id vmstore --mon_host 192.168.8.75:6789 --mon_host 192.168.8.76:6789 --mon_host 192.168.8.77:6789\\nExit code: 16\\nStdout: \\\'\\\'\\nStderr: \\\'rbd: sysfs write failed\\\\nrbd: unmap failed: (16) Device or resource busy\\\\n\\\'\\n\'',)

And on the target migration host:

2023-02-06 18:09:37,891-0800 ERROR (jsonrpc/6) [api] FINISH detach_volume error=Managed Volume Helper failed.: ('Error executing helper: Command [\'/usr/bin/python3\', \'/usr/libexec/vdsm/managedvolume-helper\', \'detach\'] failed with rc=1 out=b\'\' err=b\'oslo.privsep.daemon: Running privsep helper: [\\\'sudo\\\', \\\'privsep-helper\\\', \\\'--privsep_context\\\', \\\'os_brick.privileged.default\\\', \\\'--privsep_sock_path\\\', \\\'/tmp/tmpezistr1k/privsep.sock\\\']\\noslo.privsep.daemon: Spawned new privsep daemon via rootwrap\\noslo.privsep.daemon: privsep daemon starting\\noslo.privsep.daemon: privsep process running with uid/gid: 0/0\\noslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none\\noslo.privsep.daemon: privsep daemon running as pid 2922225\\nTraceback (most recent call last):\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 139, in <module>\\n    sys.exit(main(sys.argv[1:]))\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 62, in main\\n    args.command(args)\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 134, in detach\\n    ignore_errors=False)\\n  File "/usr/lib/python3.6/site-packages/vdsm/storage/nos_brick.py", line 110, in disconnect_volume\\n    run_as_root=True)\\n  File "/usr/lib/python3.6/site-packages/os_brick/executor.py", line 53, in _execute\\n    result = self.__execute(*args, **kwargs)\\n  File "/usr/lib/python3.6/site-packages/os_brick/privileged/rootwrap.py", line 172, in execute\\n    return execute_root(*cmd, **kwargs)\\n  File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 272, in _wrap\\n    r_call_timeout)\\n  File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 215, in remote_call\\n    raise exc_type(*result[2])\\noslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\\nCommand: rbd unmap /dev/rbd/vmstore/volume-37800cff-ec62-4d84-9f52-6a179fde1472 --conf /tmp/brickrbd_kchs7qub --id vmstore --mon_host 192.168.8:6789 --mon_host 192.168.8.76:6789 --mon_host 192.168.8.77:6789\\nExit code: 16\\nStdout: \\\'\\\'\\nStderr: \\\'rbd: sysfs write failed\\\\nrbd: unmap failed: (16) Device or resource busy\\\\n\\\'\\n\'',) (api:114)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 107, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 1808, in detach_volume
    return managedvolume.detach_volume(sd_id, vol_id)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/managedvolume.py", line 63, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/managedvolume.py", line 138, in detach_volume
    run_helper("detach", vol_info)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/managedvolume.py", line 184, in run_helper
    sub_cmd, cmd_input=cmd_input)
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 38, in __call__
    return callMethod()
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 36, in <lambda>
    **kwargs)
  File "<string>", line 2, in managedvolume_run_helper
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
vdsm.storage.exception.ManagedVolumeHelperFailed: Managed Volume Helper failed.: ('Error executing helper: Command [\'/usr/bin/python3\', \'/usr/libexec/vdsm/managedvolume-helper\', \'detach\'] failed with rc=1 out=b\'\' err=b\'oslo.privsep.daemon: Running privsep helper: [\\\'sudo\\\', \\\'privsep-helper\\\', \\\'--privsep_context\\\', \\\'os_brick.privileged.default\\\', \\\'--privsep_sock_path\\\', \\\'/tmp/tmpezistr1k/privsep.sock\\\']\\noslo.privsep.daemon: Spawned new privsep daemon via rootwrap\\noslo.privsep.daemon: privsep daemon starting\\noslo.privsep.daemon: privsep process running with uid/gid: 0/0\\noslo.privsep.daemon: privsep process running with capabilities (eff/prm/inh): CAP_SYS_ADMIN/CAP_SYS_ADMIN/none\\noslo.privsep.daemon: privsep daemon running as pid 2922225\\nTraceback (most recent call last):\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 139, in <module>\\n    sys.exit(main(sys.argv[1:]))\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 62, in main\\n    args.command(args)\\n  File "/usr/libexec/vdsm/managedvolume-helper", line 134, in detach\\n    ignore_errors=False)\\n  File "/usr/lib/python3.6/site-packages/vdsm/storage/nos_brick.py", line 110, in disconnect_volume\\n    run_as_root=True)\\n  File "/usr/lib/python3.6/site-packages/os_brick/executor.py", line 53, in _execute\\n    result = self.__execute(*args, **kwargs)\\n  File "/usr/lib/python3.6/site-packages/os_brick/privileged/rootwrap.py", line 172, in execute\\n    return execute_root(*cmd, **kwargs)\\n  File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 272, in _wrap\\n    r_call_timeout)\\n  File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 215, in remote_call\\n    raise exc_type(*result[2])\\noslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.\\nCommand: rbd unmap /dev/rbd/vmstore/volume-37800cff-ec62-4d84-9f52-6a179fde1472 --conf /tmp/brickrbd_kchs7qub --id vmstore --mon_host 192.168.8.75:6789 --mon_host 192.168.8.76:6789 --mon_host 192.168.8:6789\\nExit code: 16\\nStdout: \\\'\\\'\\nStderr: \\\'rbd: sysfs write failed\\\\nrbd: unmap failed: (16) Device or resource busy\\\\n\\\'\\n\'',)
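
The underlying failure in both cases is the same: rbd unmap exits with 16 (EBUSY) because something on the host still holds the mapping. For reference, a quick way to see what that is (standard Ceph and util-linux commands; the device path below is just the one from the logs above):

$ rbd showmapped                                                          # list kernel RBD mappings on this host
$ findmnt /dev/rbd/vmstore/volume-9b11821b-a9dc-4641-9c6f-00bb326a4eda    # is the device still mounted anywhere?
$ fuser -v /dev/rbd/vmstore/volume-9b11821b-a9dc-4641-9c6f-00bb326a4eda   # which process (e.g. a lingering qemu-kvm) still has it open?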

After this I believe the VMs failed and were restarted manually on another host, but there were several still running on this host. I was able to successfully migrate them off to another host later. I rebooted that host and was able to migrate other VMs to and from it without issues, so I initially thought that was the end of the problem, and left it running with a small number of VMs to see if the issue reoccurs.

After this I was manually rebalancing some of the VMs across other hosts and found that some of them fail to migrate between certain hosts with the same error, but can migrate to others. These are VMs that were never on the host that had the initial problem.

Connectivity looks good. I show no disconnections on any network interface, and only a handful of packet drops on the host side on my storage network (not increasing, and likely related to the storage network being set up after the host booted).

I googled the error and found these previous occurrences:

https://bugzilla.redhat.com/show_bug.cgi?id=1755801
https://lists.ovirt.org/archives/list/users@ovirt.org/thread/PVGTQPXCTEQI4LUUSXDRLSIH3GXXQC2N/?sort=date

This sounds like either a bug that was presumed fixed, or an issue caused by an external mount of the same volume. The latter is definitely not the case here: we use a dedicated Ceph RBD volume for oVirt, along with a dedicated CephFS volume for the primary storage domain (including some of our VMs which we haven't yet migrated to RBD, since that is currently a long manual process involving VM downtime).
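
For completeness, rbd status lists the watchers (client IPs) currently attached to an image, so an external client mapping the same volume would show up there. Run from a node with the appropriate Ceph credentials (image name taken from the log above):

$ rbd status vmstore/volume-9b11821b-a9dc-4641-9c6f-00bb326a4eda   # only oVirt hosts should appear as watchers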

@bennyz
Member

bennyz commented Feb 7, 2023

Can you confirm you have the multipath blacklist rule?

$ multipathd show config | grep rbd

@ebyrne242
Author

Yep, I already checked that after seeing it mentioned in the archive post. It does appear to be blacklisted properly on all of my hosts:

        devnode "^(rbd)[0-9]*"

@sandrobonazzola sandrobonazzola transferred this issue from oVirt/ovirt-engine Feb 10, 2023
@nirs nirs changed the title LIve migration failures after upgrading from 4.5.2 to 4.5.4 Live migration failures after upgrading from 4.5.2 to 4.5.4 Feb 11, 2023
@sknolin

sknolin commented Sep 5, 2023

I just added Ceph storage to our 4.5.4-1 install on Rocky 8.7 and am seeing the same issues. I first noticed it as failures to delete test images, and now I see it when moving things around between nodes.

I've resorted to manually running 'rbd unmap' and finding and removing entries in the ovirt_cinderlib volume_attachment table for some of them.
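
Concretely, that cleanup is roughly the following (a sketch; the pool and volume IDs are placeholders, and the psql call assumes the engine's default ovirt_cinderlib database with the Cinder-style volume_attachment schema):

$ rbd showmapped                                  # find the stale /dev/rbdN mapping left behind on the host
$ rbd unmap /dev/rbd/<pool>/volume-<uuid>         # release it (add '-o force' if it is still reported busy)
$ sudo -u postgres psql ovirt_cinderlib -c "DELETE FROM volume_attachment WHERE volume_id = '<uuid>';"   # on the engine: drop the stale attachment row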

Our multipath blacklist is in place too; I checked for that.

@sknolin

sknolin commented Sep 7, 2023

To follow up, this was resolved by upgrading to both Rocky 8.8 and the oVirt nightly builds (I had missed that oVirt releases aren't happening any more). Our issues were compounded by this kernel bug on 8.7: https://bugzilla.redhat.com/show_bug.cgi?id=2079311
