OS-6923: bhyve could be more precise with vPIT

Details

Issue Type:Improvement
Priority:4 - Normal
Status:Resolved
Created at:2018-05-02T20:59:08.298Z
Updated at:2018-06-19T14:55:20.193Z

People

Created by:Former user
Reported by:Former user
Assigned to:Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-06-19T14:55:10.575Z)

Fix Versions

2018-06-21 Underwater Reactor (Release Date: 2018-06-21)

Labels

bhyve

Description

As part of an investigation into guest troubles with NTP synchronization while running under bhyve, I found that the means by which the PIT is emulated leaves some room for improvement. During testing, I was booting a Linux guest with apic=debug in the kernel params to provide detailed information about timing parameters of the guest:

[    0.102614] Using local APIC timer interrupts.
               calibrating APIC timer ...
[    0.108000] ... lapic delta = 838768
[    0.108000] ... PM-Timer delta = 358171
[    0.108000] ... PM-Timer result ok
[    0.108000] ..... delta 838768
[    0.108000] ..... mult: 36024811
[    0.108000] ..... calibration result: 536811
[    0.108000] ..... CPU clock speed is 2799.2422 MHz.
[    0.108000] ..... host bus clock speed is 134.0811 MHz.
[    0.108046] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz (family: 0x6, model: 0x3e, stepping: 0x4)

The emulated lapic timer runs at a frequency of 134.217MHz (128 x 2^20 Hz). The vPIT, running at the standard frequency of 1193182Hz, is configured to fire at the kernel's HZ rate, and the kernel samples the APIC timer value over a 100ms window. The count for that window is reported as lapic delta: 838768 APIC "ticks" over the period. Factoring in the divider setting of 16, the guest APIC should have counted 838860 ticks (134217728 / 16 x 0.1 = 838860.8), a difference of 92 from the measured value, or ~110ppm.

Perfect timing isn't to be expected here, as we're at the mercy of many factors including timer precision on the host. That said, one thing stuck out as I was reviewing the vatpit code in bhyve: The sbintime_t type was used for frequency and duration calculations for the vPIT. Normally, this type is more than adequate, with 32 bits for the fractional portion of the time. In the vatpit, however, it's used to store the inverse of the PIT frequency:

        FREQ2BT(PIT_8254_FREQ, &bt);
        vatpit->freq_sbt = bttosbt(bt);

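This is at the edge of sbintime_t's useful precision: in 32.32 fixed point, one PIT count (1/1193182s, about 838ns) is 2^32 / 1193182 ≈ 3599.59 units, which truncates to 3599, a loss of about 164ppm. That loss becomes apparent when the value is effectively multiplied out as the timer fires and resets, adding the resulting interval to the previous expiration to determine the next one.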

I wrote a throwaway program scaffold to compare these calculations using the full bintime_t vs sbintime_t:

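#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Stand-ins for illumos types plus helpers adapted from FreeBSD's
 * <sys/time.h> bintime code, so the scaffold builds outside the kernel.
 */
typedef int64_t sbintime_t;
typedef int64_t hrtime_t;
#define NANOSEC         INT64_C(1000000000)
#define PIT_8254_FREQ   UINT64_C(1193182)
struct bintime { int64_t sec; uint64_t frac; };

static void
FREQ2BT(uint64_t freq, struct bintime *bt)
{
        bt->sec = 0;
        bt->frac = ((uint64_t)0x8000000000000000 / freq) << 1;
}

static uint64_t
BT2FREQ(const struct bintime *bt)
{
        return (((uint64_t)0x8000000000000000 + (bt->frac >> 2)) /
            (bt->frac >> 1));
}

static sbintime_t
bttosbt(struct bintime bt)
{
        return (((sbintime_t)bt.sec << 32) + (bt.frac >> 32));
}

static struct bintime
sbttobt(sbintime_t sbt)
{
        struct bintime bt = { sbt >> 32, (uint64_t)sbt << 32 };
        return (bt);
}

/* from linux */
#define LHZ             250
#define PIT_LATCH       ((PIT_8254_FREQ + LHZ/2) / LHZ)
#define PIT_LATCH_NS    INT64_C(3999809)        /* calced from above */

#define PIT_LATCH25     (PIT_LATCH * 25)
#define PIT_LATCH25_NS  INT64_C(99995223)       /* calced from above */

static hrtime_t
sbttohrtime(sbintime_t sbt)
{
        return (((sbt >> 32) * NANOSEC) +
            (((uint64_t)NANOSEC * (uint32_t)sbt) >> 32));
}

int
main(void)
{
        struct bintime base, trunc;
        sbintime_t sbt;
        hrtime_t tick_sbt_hr;
        uint64_t freq;
        double err;

        /* Round-trip the PIT period through the 32.32 sbintime_t. */
        FREQ2BT(PIT_8254_FREQ, &base);
        sbt = bttosbt(base);
        trunc = sbttobt(sbt);
        freq = BT2FREQ(&trunc);

        printf("orig freq:\t%" PRIu64 "\n", PIT_8254_FREQ);
        printf("sbt freq:\t%" PRIu64 "\n", freq);

        err = ((freq - PIT_8254_FREQ) * 1.0) / PIT_8254_FREQ;
        printf("err: %" PRIu64 " / %" PRIu64 " = %f%%\n",
            freq - PIT_8254_FREQ, PIT_8254_FREQ, err * 100.0);

        /* One latch period, then 25 of them, from the truncated sbt. */
        tick_sbt_hr = sbttohrtime(sbt * (sbintime_t)PIT_LATCH);
        err = ((PIT_LATCH_NS - tick_sbt_hr) * 1.0) / PIT_LATCH_NS;
        printf("hr err: %" PRId64 "/%" PRId64 " = %f%%\n",
            PIT_LATCH_NS - tick_sbt_hr, PIT_LATCH_NS, err * 100.0);

        tick_sbt_hr = sbttohrtime(sbt * (sbintime_t)PIT_LATCH25);
        printf("%" PRId64 " %" PRId64 "\n", tick_sbt_hr, PIT_LATCH25_NS);
        err = ((PIT_LATCH25_NS - tick_sbt_hr) * 1.0) / PIT_LATCH25_NS;
        printf("hr err: %" PRId64 "/%" PRId64 " = %f%%\n",
            PIT_LATCH25_NS - tick_sbt_hr, PIT_LATCH25_NS, err * 100.0);

        return (0);
}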

The results:

orig freq:      1193182
sbt freq:       1193378
err: 196 / 1193182 = 0.016427%
hr err: 238/3999809 = 0.005950%
99989277 99995223
hr err: 5946/99995223 = 0.005946%

Here we can see a 196Hz (~164ppm) error in the effective vPIT frequency, and ~59ppm in the latch-interval comparisons, simply due to the limited precision of sbintime_t. Considering that the guest's sampled value was off by ~110ppm and that programs like NTP tolerate only 500ppm of drift, the little additional work needed to use bintime_t rather than sbintime_t seems justified.

Comments

Comment by Former user
Created at 2018-05-03T02:14:52.909Z

I took more readings of that lapic delta across several boots:

[    0.100000] ... lapic delta = 838753
[    0.108000] ... lapic delta = 838776
[    0.112000] ... lapic delta = 838774
[    0.108000] ... lapic delta = 838706
[    0.112000] ... lapic delta = 838768
[    0.108000] ... lapic delta = 838769

And with a patch which switched to bintime_t for vPIT calculations (as well as localizing PIT cyclics):

[    0.116000] ... lapic delta = 838922
[    0.116000] ... lapic delta = 838903
[    0.100000] ... lapic delta = 838908
[    0.108000] ... lapic delta = 838906
[    0.116000] ... lapic delta = 838905
[    0.124000] ... lapic delta = 838907

These correspond to the following deviations from the expected 838860 ticks (error is |deviation| / 838860):

Stock:

lapic delta   deviation   error
838753        -107        128 ppm
838776        -84         100 ppm
838774        -86         103 ppm
838706        -154        184 ppm
838768        -92         110 ppm
838769        -91         108 ppm

Patched:

lapic delta   deviation   error
838922        +62         74 ppm
838903        +43         51 ppm
838908        +48         57 ppm
838906        +46         55 ppm
838905        +45         54 ppm
838907        +47         56 ppm

Comment by Jira Bot
Created at 2018-06-19T14:55:20.193Z

illumos-joyent commit 4f723dcffd015c7d11cd1c7a3155f46aa0600646 (branch master, by Patrick Mooney)

OS-6923 bhyve could be more precise with vPIT
OS-6849 bhyve should localize vatpit resources
Reviewed by: Hans Rosenfeld <hans.rosenfeld@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Approved by: Dan McDonald <danmcd@joyent.com>