OS-6591: VMX support is not checking for PAT load/save support

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-02-20T18:07:06.224Z)

Fix Versions

2018-03-15 Nibelheim (Release Date: 2018-03-15)

Description

I booted @mike.gerdts's platform "20180126T183046Z" in my COAL and then tried to create a bhyve VM using this payload:

{
  "alias": "b6",
  "brand": "bhyve",
  "resolvers": [
    "8.8.8.8",
    "8.8.4.4"
  ],
  "ram": 1024,
  "vcpus": 2,
  "nics": [
    {
      "nic_tag": "external",
      "ip": "172.26.17.206",
      "netmask": "255.255.255.0",
      "gateway": "172.26.17.1",
      "model": "virtio",
      "vlan_id": 3317,
      "primary": true
    }
  ],
  "disks": [
    {
      "image_uuid": "38396fc7-2472-416b-e61b-d833b32bd088",
      "boot": true,
      "model": "virtio"
    }
  ]
}

and this panicked my COAL with:

> ::status
debugging crash dump vmcore.0 (64-bit) from headnode
operating system: 5.11 joyent_20180126T183046Z (i86pc)
image uuid: (not set)
panic message: vmcs_init error 2
dump content: kernel pages only
> ::stack
vpanic()
vmx_vminit+0x792(ffffff04265fe000, ffffff031a3dda10)
vmm`vm_init+0x49(ffffff04265fe000, 1)
vm_create+0xcb(ffffff0011bcec30, ffffff03cf6226d0)
vmmdev_do_vm_create+0x182(ffffff0011bcec30, ffffff035cdfde18)
vmm_ioctl+0x168(13000000000, 564d01, 66611a, 202403, ffffff035cdfde18, ffffff0011bceea8)
cdev_ioctl+0x39(13000000000, 564d01, 66611a, 202403, ffffff035cdfde18, ffffff0011bceea8)
spec_ioctl+0x60(ffffff0349768840, 564d01, 66611a, 202403, ffffff035cdfde18, ffffff0011bceea8)
fop_ioctl+0x55(ffffff0349768840, 564d01, 66611a, 202403, ffffff035cdfde18, ffffff0011bceea8)
ioctl+0x9b(3, 564d01, 66611a)
sys_syscall+0x19f()
>

I have uploaded the dump to thoth as 37d01332034a1443db66648c873d222f

Comments

Comment by Patrick Mooney
Created at 2018-02-08T16:59:43.187Z
For reference, this was nested virt under VMware Fusion, so something funky may have been afoot

Comment by Bryan Cantrill
Created at 2018-02-10T01:19:49.062Z
Even though this is under VMware Fusion, it's dead reproducible, and I'm wondering if the answer here may also relate to OS-6603.

Comment by Josh Wilsdon
Created at 2018-02-10T07:50:27.736Z
Update! I did some futzing around with this. While it was dead reproducible before, I went into VMware and "upgraded" the virtual hardware of this COAL VM to "hardware version 12". After doing this I was able to boot a bhyve VM! I'm not sure what the previous virtual hardware version was, but it will be whatever default comes with a `make coal` build. I am running VMware Fusion Version 8.5.10 (7527438) in case that's helpful.

Comment by Josh Wilsdon
Created at 2018-02-11T22:39:03.651Z
I just ran through a new setup of COAL and the default hardware version is 9.

The following patch seems to be all that's required for updating:

--- USB-headnode.vmx.bak	2018-02-11 11:50:19.000000000 -0800
+++ USB-headnode.vmx	2018-02-11 11:50:51.000000000 -0800
@@ -1,6 +1,6 @@
 .encoding = "UTF-8"
 config.version = "8"
-virtualHW.version = "9"
+virtualHW.version = "12"
 scsi0.present = "TRUE"
 scsi0.virtualDev = "lsilogic"
 memsize = "6144"
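
For anyone who would rather script that one-line change than apply the diff by hand, here is a minimal Python sketch. The function name `set_virtual_hw_version` is hypothetical, and the sketch assumes the .vmx file contains exactly one `virtualHW.version = "N"` line, as in the diff above. It only rewrites that line; it has not been checked against whatever else the VMware UI upgrade might do.

```python
import re


def set_virtual_hw_version(vmx_text: str, version: int) -> str:
    """Return vmx_text with its virtualHW.version line set to `version`.

    Hypothetical helper: assumes exactly one virtualHW.version line exists.
    """
    pattern = re.compile(r'^(virtualHW\.version\s*=\s*)"\d+"', re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: f'{m.group(1)}"{version}"', vmx_text
    )
    if count != 1:
        raise ValueError(
            f"expected exactly one virtualHW.version line, found {count}"
        )
    return new_text


# Sample input mirroring the first lines of USB-headnode.vmx in the diff.
sample = (
    '.encoding = "UTF-8"\n'
    'config.version = "8"\n'
    'virtualHW.version = "9"\n'
    'scsi0.present = "TRUE"\n'
)
updated = set_virtual_hw_version(sample, 12)
print(updated)
```

To apply it to a real file, read USB-headnode.vmx into a string, pass it through the function, and write the result back, keeping a backup copy as in the diff.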

Comment by Mike Gerdts
Created at 2018-02-12T14:59:02.043Z
Got tabs confused. I thought that I had added my earlier comment to a different bug and deleted it. That still stands - it looks like this is a non-issue if we require COAL to use a relatively recent version of VMware.

Comment by John Levon
Created at 2018-02-13T10:35:37.870Z
FWIW, as I had to reinstall COAL today, I took the opportunity to diff cpuid between VMware hardware versions 9 and 12:

/jlevon/public/bugs/OS-6591/cpuid.diff

Nothing seems obvious there though.

Comment by John Levon
Created at 2018-02-15T13:55:23.807Z
We are failing here:

        if ((error = vmwrite(VMCS_HOST_IA32_PAT, pat)) != 0)

and this is backed up by the IA32_VMX_EXIT_CTLS MSR, which reports
0x33ffff00036dfb. The allowed 1-settings half of this value (the upper 32 bits,
0x33ffff) does not have the bits set for VM_EXIT_LOAD_PAT or VM_EXIT_SAVE_PAT.

This is a panic rather than a clean failure because of an upstream bug: these
values are missing from VM_EXIT_CTLS_ONE_SETTING.
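
The arithmetic behind that reading can be checked with a short Python sketch. It assumes the Intel SDM layout for VMX capability MSRs (lower 32 bits are controls that must be 1, upper 32 bits are controls that may be 1) and the SDM bit positions for the PAT exit controls (bit 18 for save, bit 19 for load). The helper name is made up for illustration.

```python
# VM-exit control bit positions, per the Intel SDM.
VM_EXIT_SAVE_PAT = 1 << 18
VM_EXIT_LOAD_PAT = 1 << 19


def allowed_one_settings(msr_value: int) -> int:
    """Upper 32 bits of a VMX capability MSR: controls that may be set to 1."""
    return (msr_value >> 32) & 0xFFFFFFFF


# Value of IA32_VMX_EXIT_CTLS reported in this bug.
exit_ctls = 0x33ffff00036dfb

allowed = allowed_one_settings(exit_ctls)
print(hex(allowed))                       # 0x33ffff
print(bool(allowed & VM_EXIT_SAVE_PAT))   # False
print(bool(allowed & VM_EXIT_LOAD_PAT))   # False
```

In 0x33ffff, bits 0 through 17 and bits 20 and 21 are set, so bits 18 and 19 are exactly the gap.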

Comment by John Levon
Created at 2018-02-15T14:22:14.663Z
It's not until HW version 11 that VMware supports this feature. This corresponds (according to Wikipedia) to VMware Fusion 7, released September 2014.

On real hardware, the feature dates back to at least 2008, and this comment:
https://www.mail-archive.com/kvm@vger.kernel.org/msg03698.html
implies that it's part and parcel of EPT, in which case we do not need to worry about support.

So I'm picking up this bug to convert the panic into a proper failure to load vmm.

It would be nice if the next COAL release could use HW11, but I'm not sure if that's considered too new.

Comment by John Levon
Created at 2018-02-15T15:03:23.527Z
The upstream history of this doesn't seem to make much sense:

commit f458d8769b166545af7d9b45e377be8f224f26dc
Author: neel <neel@FreeBSD.org>
Date:   Tue Feb 24 05:35:15 2015 +0000

    Always emulate MSR_PAT on Intel processors and don't rely on PAT save/restore

That commit removed VM_EXIT_LOAD_PAT and VM_EXIT_SAVE_PAT from the checks. But the Intel manual is clear:

24.5 HOST-STATE AREA
— IA32_PAT (64 bits). This field is supported only on processors that support the 1-setting of the “load
IA32_PAT” VM-exit control.

It's unclear to me how the upstream fix could work, but for right now at least, we should require VM_EXIT_LOAD_PAT.
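
As an illustration of what requiring the control means, here is a Python sketch of checking a set of required controls against a capability MSR and returning an error instead of panicking later. This is not the code from the eventual fix. The function name `negotiate_ctls` and its return convention are hypothetical, and the second MSR value in the example is invented to show the passing case.

```python
import errno

# VM-exit control bit positions, per the Intel SDM.
VM_EXIT_SAVE_PAT = 1 << 18
VM_EXIT_LOAD_PAT = 1 << 19


def negotiate_ctls(cap_msr: int, ones_mask: int, zeros_mask: int):
    """Return (error, ctl_value) for the requested control settings.

    cap_msr:    value of a VMX capability MSR such as IA32_VMX_EXIT_CTLS
    ones_mask:  controls the hypervisor requires to be 1
    zeros_mask: controls the hypervisor requires to be 0
    """
    must_be_one = cap_msr & 0xFFFFFFFF          # lower half: forced to 1
    may_be_one = (cap_msr >> 32) & 0xFFFFFFFF   # upper half: allowed to be 1

    if ones_mask & zeros_mask:
        return errno.EINVAL, 0   # contradictory request
    if ones_mask & ~may_be_one & 0xFFFFFFFF:
        return errno.EINVAL, 0   # a required control is not supported
    if zeros_mask & must_be_one:
        return errno.EINVAL, 0   # a control we need cleared is forced on
    return 0, must_be_one | ones_mask


required = VM_EXIT_SAVE_PAT | VM_EXIT_LOAD_PAT

# MSR value from this bug (VMware HW version 9): the request is refused.
err_hw9, ctl_hw9 = negotiate_ctls(0x33ffff00036dfb, required, 0)

# Invented MSR value with bits 18 and 19 allowed: the request succeeds.
err_ok, ctl_ok = negotiate_ctls(0x3fffff00036dfb, required, 0)

print(err_hw9 == errno.EINVAL, hex(ctl_hw9))
print(err_ok, hex(ctl_ok))
```

The point of the sketch is only that a missing required control is detected up front, so the module can fail to load cleanly rather than reach the vmwrite of the host PAT field.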

Comment by Jira Bot
Created at 2018-03-07T19:07:23.219Z
illumos-joyent commit 5367c10ccf59c25a48d868138bd4e511e70b0dcf (branch master, by John Levon)

OS-6591 VMX support is not checking for PAT load/save support
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Approved by: Patrick Mooney <patrick.mooney@joyent.com>