Windows 10 VM running BlueIris has igfx driver crash every few days. #228

Open
bheikes1 opened this issue Mar 31, 2023 · 5 comments

@bheikes1

Greetings all,

Looking for some hints as to what might be the issue with my setup. I have a Windows 10 VM running BlueIris that started exhibiting igfx driver crashes approximately a month ago. Previously, this system was stable, with uptimes of several months and no issues.

Host system:
Proxmox 7.4-3
Kernels recently used: 6.2, 6.1, 5.19, 5.15, 5.13
Intel E-2186G, 128 GB RAM, Nvidia T1000, LSI HBA

VMs:
Ubuntu 22.04 running Pi-hole, no issues noted
TrueNAS Core, has LSI HBA passed through, no issues noted
Ubuntu 22.04 running Portainer, has Nvidia T1000 passed through, no issues noted
Windows 10 22H2, has Intel iGPU P630 passed through (GVT-d), igfx driver crashes every few days

This setup has been in place for approximately a year with virtually no issues until approximately a month ago (March 8th from my notes). In the last week or so, I've worked my way through Linux kernels 5.19, 6.1, and 6.2, and also tried GVT-g to see if I could stop the igfx driver crashes. With GVT-g, when the crash happens the VM stops responding completely and causes issues on the host as well, necessitating a host reboot. With GVT-d, only the VM needs to be rebooted.

Under the 6.1 and 6.2 (and perhaps 5.19) kernels using GVT-g, I get syslog entries (on the host) like this when a crash happens:

Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail to flush post shadow
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail to dispatch workload, skip
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c000
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c008
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c010
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c018
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c020

and

Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 13 kernel messages
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c948
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 17 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 15 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 13 kernel messages
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6ca80
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 13 kernel messages
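The flood above is hard to read by eye, and journald reports dropping messages in between. A minimal Python sketch for summarizing an excerpt like this, assuming the log text is available as a string. The function name `summarize_gvt_log` and the regular expressions are illustrative only; they are derived from the message shapes quoted in this issue, not from any kernel or systemd tool.

```python
import re
from collections import Counter

# Patterns are based only on the message shapes quoted in this issue.
MISSED_RE = re.compile(r"systemd-journald\[\d+\]: Missed (\d+) kernel messages")
GPA_RE = re.compile(r"gvt: guest page write error, gpa ([0-9a-f]+)")
FAIL_RE = re.compile(r"gvt: vgpu (\d+): fail")


def summarize_gvt_log(text):
    """Count gvt failure lines, collect faulting GPAs, and total dropped messages."""
    counts = Counter()
    gpas = []
    missed = 0
    for line in text.splitlines():
        m = MISSED_RE.search(line)
        if m:
            missed += int(m.group(1))
            continue
        m = GPA_RE.search(line)
        if m:
            gpas.append(int(m.group(1), 16))
            counts["guest page write error"] += 1
            continue
        m = FAIL_RE.search(line)
        if m:
            counts[f"vgpu {m.group(1)} fail"] += 1
    return {"counts": dict(counts), "gpas": gpas, "missed": missed}


# Four lines copied from the excerpt above.
sample = """\
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 11 kernel messages
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve systemd-journald[1702]: Missed 13 kernel messages
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c948
"""
summary = summarize_gvt_log(sample)
print(summary)
```

The `missed` total gives a rough lower bound on how many kernel messages never reached the journal, which is worth keeping in mind when reading the excerpt as if it were complete.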

Under 6.2 and 6.1 using GVT-d, I get messages like this when a crash happens:

Mar 26 07:20:45 pve kernel: DMAR: DRHD: handling fault status reg 3
Mar 26 07:20:45 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffffb8024c046000 [fault reason 0x07] Next page table ptr is invalid
Mar 29 12:08:47 pve kernel: DMAR: DRHD: handling fault status reg 3
Mar 29 12:08:47 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff8004014b4000 [fault reason 0x07] Next page table ptr is invalid
Mar 31 05:48:36 pve kernel: DMAR: DRHD: handling fault status reg 3
Mar 31 05:48:36 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff800417686000 [fault reason 0x07] Next page table ptr is invalid
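To compare faults across crashes, the DMAR lines can be broken into fields. This is a minimal Python sketch, assuming the syslog format shown above. `DmarFault` and `parse_dmar_fault` are hypothetical names, and the regular expression is fitted to these specific lines rather than to every form the kernel can print.

```python
import re
from typing import NamedTuple, Optional

# Fitted to the DMAR fault lines quoted in this issue (an assumption, not a spec).
DMAR_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: DMAR: "
    r"\[(?P<kind>[^\]]+)\] Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>0x[0-9a-f]+) "
    r"\[fault reason (?P<reason>0x[0-9a-f]+)\] (?P<text>.*)$"
)


class DmarFault(NamedTuple):
    timestamp: str
    kind: str
    device: str
    addr: int
    reason: int
    text: str


def parse_dmar_fault(line: str) -> Optional[DmarFault]:
    """Return the parsed fault, or None for lines that are not fault records."""
    m = DMAR_RE.match(line.strip())
    if m is None:
        return None
    return DmarFault(
        timestamp=m["ts"],
        kind=m["kind"],
        device=m["dev"],
        addr=int(m["addr"], 16),
        reason=int(m["reason"], 16),
        text=m["text"],
    )


lines = [
    "Mar 26 07:20:45 pve kernel: DMAR: DRHD: handling fault status reg 3",
    "Mar 26 07:20:45 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] "
    "fault addr 0xffffb8024c046000 [fault reason 0x07] Next page table ptr is invalid",
]
faults = [f for f in map(parse_dmar_fault, lines) if f is not None]
for f in faults:
    print(f.timestamp, f.device, hex(f.addr), hex(f.reason), f.text)
```

The `DRHD: handling fault status reg` lines are skipped on purpose, since they carry no device or address.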

I'm trying out older kernels now (currently 5.13) to see if there is any appreciable difference. I do realize that I am running quite a complicated system, and might be bumping up against an edge case.

Any thoughts?

@bheikes1
Author

Did some testing on Linux kernel 5.13 over the last month, and the behavior noted above was completely resolved.

Moving up to kernel 5.15 now, since it's actually being maintained.

@bheikes1
Author

Running on 5.15, I was able to get about 3 weeks out of the system before I noticed this in the syslog, and a crashed video driver on the Win10 guest.

May 16 05:26:30 pve kernel: DMAR: DRHD: handling fault status reg 3
May 16 05:26:30 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffffb8024cff5000 [fault reason 0x07] Next page table ptr is invalid

@bheikes1
Author

Same setup and versions as last time; looks like the same error.

May 26 03:05:48 pve kernel: DMAR: DRHD: handling fault status reg 3
May 26 03:05:48 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffffb8021a516000 [fault reason 0x07] Next page table ptr is invalid

@bheikes1
Author

Same setup as before.

Jun 12 02:50:44 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff800401e73000 [fault reason 0x07] Next page table ptr is invalid
Jun 12 02:50:44 pve kernel: DMAR: DRHD: handling fault status reg 2
Jun 12 02:50:44 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff800500e73000 [fault reason 0x07] Next page table ptr is invalid
Jun 12 02:50:44 pve kernel: DMAR: DRHD: handling fault status reg 2
Jun 12 02:50:44 pve kernel: DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xffff800401e70000 [fault reason 0x07] Next page table ptr is invalid
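Across all the DMAR reports in this thread, the fault addresses share one visible trait: the top 16 bits are set and the low 12 bits are zero, so each value is page-aligned and far larger than the host's 128 GB of RAM. What that implies is not established here; the Python sketch below only checks the pattern. The address list is copied from the logs in this thread, `RAM_BYTES` comes from the host description, and the helper names are illustrative.

```python
RAM_BYTES = 128 * 2**30  # 128 GB of host RAM, from the host description

# Fault addresses copied from the DMAR lines in this thread.
FAULT_ADDRS = [
    0xffffb8024c046000,  # Mar 26
    0xffff8004014b4000,  # Mar 29
    0xffff800417686000,  # Mar 31
    0xffffb8024cff5000,  # May 16
    0xffffb8021a516000,  # May 26
    0xffff800401e73000,  # Jun 12
    0xffff800500e73000,  # Jun 12
    0xffff800401e70000,  # Jun 12
]


def top16_set(addr: int) -> bool:
    """True if bits 63..48 of a 64-bit address are all ones."""
    return (addr >> 48) == 0xFFFF


def page_aligned(addr: int) -> bool:
    """True if the address sits on a 4 KiB boundary."""
    return (addr & 0xFFF) == 0


for addr in FAULT_ADDRS:
    print(
        hex(addr),
        "top16_set" if top16_set(addr) else "top16_clear",
        "aligned" if page_aligned(addr) else "unaligned",
        "above_ram" if addr >= RAM_BYTES else "within_ram",
    )
```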

@tpressure

Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c010
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 0000000000000000 guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: spt 00000000315456ba guest entry 0xffffffffffffffff type 9
Mar 15 16:03:14 pve kernel: gvt: vgpu 1: fail: shadow page 00000000315456ba guest entry 0xffffffffffffffff type 9.
Mar 15 16:03:14 pve kernel: gvt: guest page write error, gpa 4df6c018

For these kinds of errors, you can try the workaround I've posted here: #153 (comment)

It's not a 100% solution, though. Check the comments in #153.
