OS-7335: atomic ops in syscall_mstate() induce significant overhead

Details

Issue Type:Bug
Priority:4 - Normal
Status:Resolved
Created at:2018-10-30T17:18:39.751Z
Updated at:2019-01-14T16:20:50.068Z

People

Created by:John Levon [X]
Reported by:John Levon [X]
Assigned to:John Levon [X]

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-11-12T17:47:50.354Z)

Fix Versions

2018-11-22 Funcooker (Release Date: 2018-11-22)

Related Links

Description

As described in https://www.illumos.org/issues/9936, we should replace the atomic ops around the zone statistics with a more scalable method. I've prototyped a version that uses per-CPU arrays and sums them at collection time, with good results.
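
For illustration, a minimal sketch of the per-CPU approach (names, types, and sizes here are hypothetical, not the actual illumos change):

#include <stdint.h>

#define CACHELINE   64

/* One cacheline-padded counter slot per CPU. */
typedef union {
        uint64_t zc_value;
        char     zc_pad[CACHELINE];
} zone_pcpu_counter_t;

/*
 * Only the owning CPU updates its own slot, so the increment needs no
 * atomic operation and never contends on a shared cacheline.
 */
void
zone_stat_add(zone_pcpu_counter_t *counters, int cpu_id, uint64_t delta)
{
        counters[cpu_id].zc_value += delta;
}

/* At collection (kstat read) time, sum across all of the per-CPU slots. */
uint64_t
zone_stat_sum(const zone_pcpu_counter_t *counters, int ncpus)
{
        uint64_t total = 0;

        for (int i = 0; i < ncpus; i++)
                total += counters[i].zc_value;
        return (total);
}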

Comments

Comment by John Levon [X]
Created at 2018-10-30T17:21:41.512Z

My test case so far uses an OS zone with unlimited CPU on this system:

[root@volcano ~]# psrinfo -vp
The physical processor has 14 cores and 28 virtual processors (0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54)
...
    x86 (GenuineIntel 406F1 family 6 model 79 step 1 clock 2600 MHz)
      Intel(r) Xeon(r) CPU E5-2690 v4 @ 2.60GHz
The physical processor has 14 cores and 28 virtual processors (1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55)
...
    x86 (GenuineIntel 406F1 family 6 model 79 step 1 clock 2600 MHz)
      Intel(r) Xeon(r) CPU E5-2690 v4 @ 2.60GHz
[root@volcano ~]# prtconf | head
System Configuration:  Dell Inc.  i86pc
Memory size: 262050 Megabytes

I ran ./tools/build_illumos in a loop ten times and saw results very similar to the graph on the illumos bug. While this ran, I watched both the individual sys/user/wait kstats and the overall virtualized load average, and both appeared to behave correctly.


Comment by John Levon [X]
Created at 2018-10-31T13:32:28.211Z

I ran a couple of other tests too:


Comment by John Levon [X]
Created at 2018-11-02T16:32:30.508Z
Updated at 2018-11-02T16:35:05.595Z

My first version didn't pad to cacheline size, which is definitely worth doing. While investigating variants, I collected results for several different layouts. "union-64" looks like this (byte offsets):

0 zone_stime for CPU0
8-63 pad
64 zone_stime for CPU1

with separate arrays for utime and wtime; "union-128" is a 128-byte-aligned version of the same thing. "uarray", by contrast, looks like this (byte offsets):

0 zone_stime for CPU0
8 zone_utime for CPU0
16 zone_wtime for CPU0
24-63 pad
64 zone_stime for CPU1
72 zone_utime for CPU1
80 zone_wtime for CPU1
...

Beyond the layout differences, "union" uses a separate allocation for each of the u/s/wtime arrays, while "uarray" is a single allocation.
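
Roughly, the two declarations could look like this (a sketch only; the field and type names are illustrative, not the committed code):

#include <stdint.h>

/* "uarray": a single allocation, one 64-byte slot per CPU with all three counters. */
typedef union {
        struct {
                uint64_t zu_stime;      /* byte offset 0 within the slot */
                uint64_t zu_utime;      /* byte offset 8 */
                uint64_t zu_wtime;      /* byte offset 16 */
        } zu_stats;
        char zu_pad[64];                /* pad the slot out to 64 bytes */
} zone_uarray_slot_t;

/*
 * "union-64": a 64-byte slot holding a single counter; stime, utime, and
 * wtime each get their own separately allocated array of these slots.
 * "union-128" and "uarray-128" are the same shapes padded to 128 bytes.
 */
typedef union {
        uint64_t zt_time;               /* byte offset 0 within the slot */
        char     zt_pad[64];            /* bytes 8-63 are padding */
} zone_time_slot_t;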

These are the kernel compile results:

baseline:real    3m11.514s
baseline:real    3m0.152s
baseline:real    3m18.280s
baseline:real    3m18.757s
baseline:real    3m18.421s
non-aligned:real    2m23.363s
non-aligned:real    2m18.896s
non-aligned:real    2m23.524s
non-aligned:real    2m23.740s
non-aligned:real    2m22.708s
uarray:real    2m14.632s
uarray:real    2m14.192s
uarray:real    2m6.944s
uarray:real    2m8.980s
uarray:real    2m9.611s
uarray-128:real    2m21.399s
uarray-128:real    2m17.659s
uarray-128:real    2m18.188s
uarray-128:real    2m17.470s
uarray-128:real    2m17.606s
union-128:real    2m16.927s
union-128:real    2m28.121s
union-128:real    2m7.509s
union-128:real    2m6.757s
union-128:real    2m11.458s
union-64:real    2m19.802s
union-64:real    2m17.611s
union-64:real    2m18.974s
union-64:real    2m17.248s
union-64:real    2m20.900s

Basically, uarray wins here. But with the upstream micro-benchmark, which runs getppid in a loop across 60 processes (the system has 56 CPUs), the picture changes:

baseline:average:7897781
baseline:average:7894385
uarray:average:25885456
uarray:average:25464986
uarray-128:average:50031454
uarray-128:average:50345253
union-128:average:50654839
union-128:average:50506947
union-64:average:32470333
union-64:average:33126815

There's a big advantage to 128-byte padding here. The difference is cross-socket traffic, as we can see if we turn off the second socket:

baseline:average:10762302
baseline:average:10763861
uarray:average:32505923
uarray:average:32556042
uarray-128:average:32526533
uarray-128:average:32539482
union-128:average:32822138
union-64:average:32819296
union-64:average:32875118
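
For context, the getppid micro-benchmark used above is roughly this shape (a reconstruction for illustration only, not the actual upstream test; the task count and duration are assumptions):

#include <sys/wait.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NTASKS    60    /* 60 processes on this 56-CPU system */
#define DURATION  10    /* seconds per measurement */

int
main(void)
{
        for (int i = 0; i < NTASKS; i++) {
                if (fork() == 0) {
                        uint64_t calls = 0;
                        time_t end = time(NULL) + DURATION;

                        /* Spin on getppid(), each call passing through syscall_mstate(). */
                        while (time(NULL) < end) {
                                for (int j = 0; j < 1000; j++)
                                        (void) getppid();
                                calls += 1000;
                        }
                        (void) printf("%llu\n", (unsigned long long)calls);
                        exit(0);
                }
        }
        while (wait(NULL) > 0)
                ;
        return (0);
}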

Comment by John Levon [X]
Created at 2018-11-05T19:48:53.673Z

The uarray-128 results above for the illumos build aren't representative; with more testing, they're essentially the same as uarray.


Comment by John Levon [X]
Created at 2018-11-05T22:16:40.820Z

On the presumption that the 64- versus 128-byte difference in the getppid case might be due to adjacent-line prefetching, I disabled the BIOS option "Adjacent Line Pre-fetch". The results look like this:

uarray:average:41560421
uarray:average:40894510
uarray:average:41692073
uarray-128:average:57175191
uarray-128:average:57028316
uarray-128:average:57310125

IOW, there is still a significant difference, regardless of this BIOS setting.

If I run the test with just 28 tasks:

uarray:average:43290163
uarray:average:43323546
uarray:average:43331092
uarray-128:average:43873811
uarray-128:average:43879470
uarray-128:average:43852918

which again suggests that the remaining difference is cross-socket traffic.


Comment by John Levon [X]
Created at 2018-11-12T13:51:02.849Z

Testing of the latest changes was the same as described above.


Comment by Jira Bot
Created at 2018-11-12T14:56:24.925Z

illumos-joyent commit aa694ffb2f51c06ac16cadc3685f0e847b680409 (branch master, by John Levon)

OS-7335 atomic ops in syscall_mstate() induce significant overhead
OS-7339 zone secflags are not initialized correctly
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>


Comment by Jira Bot
Created at 2019-01-14T16:20:50.068Z

illumos-joyent commit f9f91f24e6485d0445061fe4b6f0d256a59a6b37 (branch master, by John Levon)

OS-7509 OS-7335 upstream causes duplicate allocation lines
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>