OS-7583: bhyve goes awry on large memory systems

Details

Issue Type: Bug
Priority: 4 - Normal
Status: Resolved
Created at: 2019-02-15T14:54:51.129Z
Updated at: 2019-02-15T22:39:34.906Z

People

Created by: Patrick Mooney [X]
Reported by: Patrick Mooney [X]
Assigned to: Patrick Mooney [X]

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-02-15T22:39:34.888Z)

Fix Versions

2019-02-28 Mind Grapes (Release Date: 2019-02-28)

Related Links

Labels

bhyve

Description

One of our customers reported issues with bhyve instances failing to boot on their CN. Those guests would "panic" early in the UEFI ROM, all with the same status:

!!!! X64 Exception Type - 000000000000000E     CPU Apic ID - 00000000 !!!!
RIP  - 000000003FAB67CC, CS  - 0000000000000028, RFLAGS - 0000000000010202
ExceptionData - 0000000000000010
RAX  - 00000000FED00000, RCX - 00000000FED00000, RDX - 000000003FAE94FC
RBX  - 000000003FAB52AF, RSP - 000000003FBDDBC0, RBP - 000000003FBDDBF0
RSI  - 0000000000000007, RDI - 0000000000000000
R8   - 000000003FAB7420, R9  - 000000003F996F18, R10 - 0000000000000000
R11  - 0000000000000000, R12 - 0000000000000000, R13 - 0000000000000000
R14  - 0000000000000000, R15 - 0000000000000000
DS   - 0000000000000008, ES  - 0000000000000008, FS  - 0000000000000008
GS   - 0000000000000008, SS  - 0000000000000008
CR0  - 0000000080000033, CR2 - 000000003FAB67CC, CR3 - 000000003FB7C000
CR4  - 0000000000000668, CR8 - 0000000000000000
DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 000000003FB66E98 000000000000003F, LDTR - 0000000000000000
IDTR - 000000003F2EE018 0000000000000FFF,   TR - 0000000000000000
FXSAVE_STATE - 000000003FBDD820
!!!! Find PE image /export/ws/smartos-live/projects/illumos-extra/uefi-edk2/uefi-edk2-a36132939e-64/Build/BhyveX64/RELEASE_ILGCC/X64/PcAtChipsetPkg/HpetTimerDxe/HpetTimerDxe/DEBUG/HpetTimerDxe.dll (ImageBase=000000003FAB5000, EntryPoint=000000003FAB52AF) !!!!

The faulting %rip decodes to a function for reading an HPET register:

0x3fab67b0:                     pushq  %rbp
0x3fab67b1:                     movq   %rsp,%rbp
0x3fab67b4:                     subq   $0x30,%rsp
0x3fab67b8:                     movq   %rcx,0x10(%rbp)
0x3fab67bc:                     movq   $0x3fab66a6,%rax
0x3fab67c6:                     call   *%rax
0x3fab67c8:                     movq   0x10(%rbp),%rax
0x3fab67cc:                     movq   (%rax),%rax
0x3fab67cf:                     movq   %rax,-0x8(%rbp)
0x3fab67d3:                     movq   $0x3fab66a6,%rax
0x3fab67dd:                     call   *%rax
0x3fab67df:                     movq   -0x8(%rbp),%rax
0x3fab67e3:                     leave
0x3fab67e4:                     ret

The instruction in question (the movq at 0x3fab67cc) is performing the actual read, with %rax holding the relevant address, 0xFED00000 (the standard HPET MMIO base).

Decoding the exception data given to us (type 0xE, error code 0x10) yields a page fault: non-present page, read, supervisor mode, caused by an instruction fetch. With %cr2 containing the same address as %rip, it does seem we faulted on fetching the instruction itself, not on the address it was reading. An instruction-fetch failure in this context does not make much sense.
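For reference, the page-fault error code bits decode mechanically (the bit definitions below are from the Intel SDM, not from the bhyve source); a quick C sketch:

#include <stdint.h>
#include <stdio.h>

/* x86 page-fault error code bits */
#define	PF_P	(1 << 0)	/* 0 = non-present page, 1 = protection violation */
#define	PF_WR	(1 << 1)	/* 0 = read, 1 = write */
#define	PF_US	(1 << 2)	/* 0 = supervisor-mode, 1 = user-mode */
#define	PF_ID	(1 << 4)	/* 1 = caused by instruction fetch */

int
main(void)
{
	uint64_t err = 0x10;	/* ExceptionData from the dump above */

	printf("%s, %s, %s-mode%s\n",
	    (err & PF_P) ? "protection violation" : "non-present page",
	    (err & PF_WR) ? "write" : "read",
	    (err & PF_US) ? "user" : "supervisor",
	    (err & PF_ID) ? ", caused by instruction fetch" : "");
	return (0);
}

For err = 0x10 this prints exactly the decode quoted above.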

The customer was kind enough to grant me access to debug the failing instances. I first verified the validity of the EPT and in-guest page tables, which looked fine. After several reboots where the guest consistently experienced the same fault, I was able to capture some interesting data from the in-kernel bhyve logic:

96536858355373 ptp_hold(3fb7c000) = fffff8b744406000
96536858361914 pf_error_code(0, 5, 0, 0)
96536858371625 vm_inject_pf(fffffd6a0e9f9000, 0, 10, 3fab67cc)

96727266522633 ptp_hold(3fb7c000) = fffff88de0206000
96727266527810 pf_error_code(0, 5, 0, 6574734120656874)
96727266534354 vm_inject_pf(fffffd695df59000, 0, 10, 3fab67cc)

As part of the instruction decode for an EPT fault (due to accessing the not-backed-by-physmem vHPET address), bhyve walks the guest page table to determine the guest-physical address of %rip. It successfully gets a page hold on the top-level table pointed to by %cr3, but the PTE it fetches does not look right. (The PTE is the final argument to the pf_error_code calls: 0 in the first case and 6574734120656874 in the second.) Some part of the PTE fetch is clearly wrong.
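For context, here is a minimal sketch of how one level of such a walk might fetch a PTE. This is a simplification, not the actual vmm code; ptpphys, gla, and shift are hypothetical names standing in for the table's guest-physical address, the guest-linear address being translated, and the level's bit offset:

/*
 * One level of a guest page-table walk (simplified sketch): take a
 * hold on the guest-physical page backing the table, then read the
 * entry through the host mapping the hold returns.
 */
uint64_t *ptpbase;
uint64_t pte;
void *cookie;

ptpbase = vm_gpa_hold(vm, vcpuid, ptpphys & ~(PAGE_SIZE - 1),
    PAGE_SIZE, VM_PROT_READ, &cookie);
if (ptpbase == NULL)
	return (EFAULT);

/* each level of the table consumes 9 bits of the address */
pte = ptpbase[(gla >> shift) & 0x1ff];

If the host pointer returned through vm_gpa_hold is wrong, this read returns garbage, which is consistent with the nonsense PTE values in the traces above.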

After more rebooting and tracing (the issue did not reproduce 100% of the time), I was able to capture trace data that pointed us in the right direction:

99478544110470 vm_object_pager_heap(3fb7c000) = 8d03c86
99478544115691 ptp_hold(3fb7c000) = fffff88d03c86000
99478544122115 pf_error_code(0, 5, 0, c16)
99478544129023 vm_inject_pf(fffffd691ba22000, 0, 10, 3fab67cc)

> fffffd74efa4aab0::print 'struct vm_object'
{
    vmo_refcnt = 0x3
    vmo_type = 0
    vmo_size = 0x40000000
    vmo_pager = vm_object_pager_heap
    vmo_data = 0xfffff965ff300000
    vmo_lock = {
        _opaque = [ 0 ]
    }
    vmo_attr = '\006'
}
> fffff965ff300000+3fb7c000=K
                fffff9663ee7c000
> fffff9663ee7c000::vatopfn
        level=0 htable=0xfffffd6b167b7b90 pte=0x8d03c86263
        level=1 htable=0xfffffd68f0d1ee20 pte=0xb8ad3c3027
        level=2 htable=0xfffffd64eb121e00 pte=0x2e4e04c027
        level=3 htable=0xfffffd64eb1250a0 pte=0xb9ebfff027
Virtual address 0xfffff9663ee7c000 maps pfn 0x8d03c86

> fffff9663ee7c000/K
0xfffff9663ee7c000:             3fb7d023
> *kpm_vbase + 8d03c86000 =K
                fffff90d03c86000
> fffff90d03c86000/K
0xfffff90d03c86000:             3fb7d023

The pfn returned by vm_object_pager_heap appears to be right when verified against the mappings for that VM memory object. It is once the direct-mapped address is calculated at the end of vm_gpa_hold that things go wrong. That calculation is as follows:

return ((void *)(PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)) + pageoff));

Nothing seems wrong there. What about PHYS_TO_DMAP?

#define PHYS_TO_DMAP(x) ({                      \
        ASSERT((uintptr_t)(x) < kpm_size);      \
        (uintptr_t)(x) | (uintptr_t)kpm_vbase; })

This seems fine until one closely examines the true value of kpm_vbase and the circumstances of the CN in question. This machine has 768GB of physical memory attached. The base of the KPM range, fffff88000000000, has enough trailing zeros to cover a large range of physical addresses via the bitwise-OR, but not that large: its lowest set bit is bit 39, the 512GB boundary. Physical addresses above 512GB also have bit 39 set, so they collide with the lowest set bit in kpm_vbase, and the OR silently drops the carry that addition would propagate, aliasing each page above 512GB onto the direct mapping of the page 512GB below it. The address should be ADDed instead of ORed.
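Presumably the fix is simply to substitute addition for the OR, along these lines (a sketch, not necessarily the exact committed change):

#define PHYS_TO_DMAP(x) ({                      \
        ASSERT((uintptr_t)(x) < kpm_size);      \
        (uintptr_t)(x) + (uintptr_t)kpm_vbase; })

Worked through with the pfn from the final trace (0x8d03c86, i.e. physical address 0x8d03c86000, above the 512GB boundary):

    OR:  fffff88000000000 | 8d03c86000 = fffff88d03c86000  (the bad ptp_hold return above)
    ADD: fffff88000000000 + 8d03c86000 = fffff90d03c86000  (matches *kpm_vbase + 8d03c86000 in the mdb session)

Since addition carries out of bit 39 while OR cannot, only the ADDed form lands on the page that actually holds the guest's top-level table.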

Comments

Comment by Patrick Mooney [X]
Created at 2019-02-15T19:54:10.429Z

While I don't have a 512GB+ system to test this on, I did boot up guests on PIs (non-DEBUG and DEBUG) featuring this fix to ensure that they still functioned as expected. Additionally, I disassembled the compiled/linked functions which use the macros to verify that the expected inlining had occurred.


Comment by Jira Bot
Created at 2019-02-15T22:38:09.713Z

illumos-joyent commit cae3a8510282b8d062b677f03a671b304db5e1be (branch master, by Patrick Mooney)

OS-7583 bhyve goes awry on large memory systems
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>