OS-7662: need a way to disable SMT

Details

Issue Type:Bug
Priority:4 - Normal
Status:Resolved
Created at:2019-03-14T12:27:42.024Z
Updated at:2019-09-04T12:10:58.824Z

People

Created by:Former user
Reported by:Former user
Assigned to:Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-05-15T19:41:45.031Z)

Fix Versions

2019-05-23 Spaceman (Release Date: 2019-05-23)

Related Issues

Description

For various reasons, we need a way to effectively disable hyper-threading in SmartOS and Triton, in a manageable way. The exact details need thrashing out, but the basic idea is that during early(ish) boot, some configuration item (possibly a kernel cmdline one, possibly in the USB key's config.inc/) will trigger a call to the kernel to offline all the sibling CPUs. Once offlined, they will no longer be available for scheduling of processes, including HVM processes.

A non-exhaustive list of things to think about here:

Comments

Comment by Former user
Created at 2019-03-25T14:26:23.293Z

For standalone SmartOS, the boot option is a pain:

So we will add support for ht_enabled in /usbkey/config. This will disable the sibling CPUs when we reach the smartdc/config service. This is much later in boot, but still before we start running any workloads. We'll modify the install script to disable HT by default (probably).
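As a rough illustration of the decision the config service would make, here is a minimal sketch. It assumes /usbkey/config holds key=value lines with # comments, and that the setting is written as ht_enabled=false. The exact value syntax, and both function names, are assumptions made for the example and not taken from the actual change.

```javascript
// Sketch only: parse key=value lines such as those assumed in /usbkey/config.
function parseConfig(text) {
    const config = {};
    for (const rawLine of text.split('\n')) {
        const line = rawLine.trim();
        // Skip blank lines and comments.
        if (line === '' || line.startsWith('#')) continue;
        const eq = line.indexOf('=');
        if (eq === -1) continue; // not a key=value line
        config[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
    }
    return config;
}

// Assumption: only an explicit "false" disables HT; a missing key leaves it on.
function shouldDisableSiblings(config) {
    return config.ht_enabled === 'false';
}

const sample = parseConfig('# example config\nhostname=cn01\nht_enabled=false\n');
console.log(shouldDisableSiblings(sample));
```

Treating a missing key as "enabled" keeps existing installs unchanged; whether the install script then writes the key by default is the separate question raised above.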


Comment by Former user
Created at 2019-03-27T14:53:07.982Z

There was concern over what disabling HT siblings might mean for how interrupts are allocated across the system. I took a look at what happens today and what the implications might be.

I more or less discounted anything pcplusmp related, on the presumption that most useful systems these days are going to be using apix. If not, pcplusmp limits MSI-X to 2 interrupts anyway, so it seems unlikely to be an issue here.

First up we have IRM-enabled drivers. Since we basically care about NIC interrupts, this means ixgbe, the only IRM-enabled driver we have right now (as discussed in OS-6786).

Essentially, IRM hands out as many interrupts as a driver can handle. The pool size is the limiting factor (aside from the number of interrupts actually supported by the PCI device), and it is more or less the very large:

ipool_totsz = (ncpus_online) * (nr HW vectors)

since apix can place any individual interrupt in any of the standard vectors, on any CPU_ENABLEd CPU.

ixgbe limits itself to a maximum of 16, but if it were to ask for more, it would get them. IRM expects that a shortage would lead to an IRM callback to reclaim interrupts, but realistically, we're unlikely to hit a shortage.

Because we typically call ht_late_init() before the NICs attach themselves, in one sense the offlining of the siblings is correct WRT IRM. When a CPU is taken into no-intr mode, it reduces the size of the IRM pool. So by the time ixgbe calls ddi_intr_alloc(), the siblings are already taken out of consideration. (This isn't true for the SmartOS method described above.)
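To put rough numbers on the pool-size relation above, here is a small sketch. The per-CPU vector count is a made-up placeholder, not a value taken from apix, and the 32-CPU system is only an example; the point is that halving the CPU count still leaves far more than ixgbe's maximum of 16.

```javascript
// ipool_totsz = (ncpus_online) * (nr HW vectors), as given above.
function ipoolTotalSize(ncpusOnline, hwVectorsPerCpu) {
    return ncpusOnline * hwVectorsPerCpu;
}

// Hypothetical value purely for illustration; not the real apix figure.
const HYPOTHETICAL_VECTORS_PER_CPU = 200;

const allCpus = 32;                  // e.g. 16 cores, two threads each
const withoutSiblings = allCpus / 2; // siblings taken into no-intr mode

const poolBefore = ipoolTotalSize(allCpus, HYPOTHETICAL_VECTORS_PER_CPU);
const poolAfter = ipoolTotalSize(withoutSiblings, HYPOTHETICAL_VECTORS_PER_CPU);
const ixgbeMax = 16;                 // ixgbe's self-imposed maximum

console.log(poolBefore, poolAfter, poolAfter >= ixgbeMax);
```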

None of this seems that useful in terms of figuring out how many interrupts a driver would actually want for things like RSS. In other words, the IRM mechanism is designed for managing maximums, not right-sizing.

It seems like the driver itself would have to try to figure it out - and clearly ixgbe doesn't even try. From this point of view, the sibling offlining makes little difference.

For non-IRM drivers, as per OS-6786, we are limited to a maximum of 8 anyway. It doesn't seem like the impact of losing the siblings is likely to be significant there. It's unclear what the cost of extra interrupts on a CPU (say, doubling up) actually is. We have headroom in terms of vector slots under apix, and otherwise it seems harmless.

If/when we actually raise the ddi_msix_alloc_limit to something more modern, we could potentially base that value dynamically on ncpus_intr_enabled (I'm going to account for this in my changes, just in case we could use it). But even then it seems like we'd want to at least double the returned maximum, with perhaps some slop?
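One possible reading of that idea, purely as a sketch: derive the limit from the number of interrupt-enabled CPUs, double it, and add a little slop. The function name, the factor of two as applied here, and the default slop of 2 are all hypothetical; nothing here reflects an actual kernel change.

```javascript
// Hypothetical helper: a dynamic MSI-X allocation limit derived from the
// count of CPUs that can take interrupts (ncpus_intr_enabled).
function msixAllocLimit(ncpusIntrEnabled, slop = 2) {
    if (!Number.isInteger(ncpusIntrEnabled) || ncpusIntrEnabled < 1) {
        throw new RangeError('ncpusIntrEnabled must be a positive integer');
    }
    // Double the per-CPU count, then add some slack.
    return 2 * ncpusIntrEnabled + slop;
}

// With 16 interrupt-enabled CPUs left after offlining siblings:
console.log(msixAllocLimit(16));
```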

And all this is aside from multiple-socket considerations, which seem likely to have a far greater impact. As OS-6786 notes, we don't seem to make very good choices there right now.


Comment by Former user
Created at 2019-03-28T13:12:45.440Z
Updated at 2019-03-28T16:19:26.630Z

To come back to: with these changes, sysinfo still reports all CPUs, not just the online ones:

CPU_Total_Cores=32

This is used by DAPI's calculateServerUnreserved():

    /* also convert to MiB and cpu_cap units */
    server.unreserved_cpu = server.sysinfo['CPU Total Cores'] * 100;
...

It's also reported in sdc-adminui. Given its name, it seems like we should leave this alone, and introduce a new "CPUs Online" sysinfo parameter, and change DAPI to use that.
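A minimal sketch of what the DAPI side of that proposal could look like. The 'CPUs Online' key is the name suggested above and may not match what sysinfo ends up reporting. The fallback to 'CPU Total Cores' for servers without the new key is an assumption added for the example, as is the standalone function name.

```javascript
// Sketch only: compute unreserved CPU in cpu_cap units (100 per CPU),
// preferring the proposed online count over the total.
function unreservedCpu(sysinfo) {
    const online = sysinfo['CPUs Online'];    // proposed parameter name
    const total = sysinfo['CPU Total Cores']; // existing parameter
    // Assumption: fall back to the total when the new key is absent.
    const cpus = (typeof online === 'number') ? online : total;
    return cpus * 100;
}

// A server with 32 CPUs, half of them offlined siblings:
console.log(unreservedCpu({ 'CPU Total Cores': 32, 'CPUs Online': 16 }));
// A server still running an older sysinfo:
console.log(unreservedCpu({ 'CPU Total Cores': 32 }));
```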


Comment by Former user
Created at 2019-05-14T19:35:50.365Z

Testing notes are in TRITON-1353


Comment by Jira Bot
Created at 2019-05-15T17:15:43.804Z

smartos-live commit a89b45abfd80497679a12f385678d5dda3fde9ad (branch master, by John Levon)

OS-7662 need a way to disable SMT
OS-7684 sysinfo mis-parses bootparams
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>


Comment by Jira Bot
Created at 2019-05-15T17:15:48.704Z

illumos-kvm commit a40ccdebb9773dc0c23527bf5a74646a3c037563 (branch master, by John Levon)

OS-7662 need a way to disable SMT
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>


Comment by Jira Bot
Created at 2019-05-15T17:15:59.993Z

illumos-joyent commit d980e527387fd27a9f615306897c215e07c5df8b (branch master, by John Levon)

OS-7662 need a way to disable SMT
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>