OS-7753: THREAD_KPRI_RELEASE does nothing of the sort

Details

Issue Type:Bug
Priority:4 - Normal
Status:Resolved
Created at:2019-04-17T23:53:27.350Z
Updated at:2019-06-14T13:45:47.076Z

People

Created by:Patrick Mooney
Reported by:Patrick Mooney
Assigned to:Patrick Mooney

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-05-23T03:24:04.752Z)

Fix Versions

2019-06-06 TGS (Release Date: 2019-06-06)

Related Issues

Related Links

Description

On several occasions, most recently in SWSUP-1426, systems have been observed in dire or deadlocked states thanks to threads running with MINCLSYSPRI acquired via THREAD_KPRI_REQUEST. With such a thread blocks on a lock with nonzero t_kpri_req, FSS and TSS will set a flag (FSSKPRI and TSSKPRI, respectively) which will persist in the scheduler struct until it is cleared in fss_trapret when the thread returns to userspace. If the thread spends a lot of time in userspace, as is common for KVM, bhyve, or workloads like sendfile(), then FSSKPRI will remain asserted, causing that thread to bypass many scheduler decisions, including expiration of its time quantum so other threads can preempt it.

This behavior has dubious origins, and in the world of fully preemptable kernel logic running on large multi-CPU systems, it seems ideal to rip it out.

Comments

Comment by Mike Gerdts
Created at 2019-04-22T14:09:20.157Z

Another instance of this is in SWSUP-1427. In this latest case the cpu flags are slightly different (cpu is quiesced) and the bhyve thread's t_pri is 0x3b rather than 1. t_epri is 0x3c in both cases.


Comment by Patrick Mooney
Created at 2019-04-26T04:08:13.718Z
Updated at 2019-04-26T04:10:14.985Z

With a stock PI and one with my change to rip out the KPRI logic, I ran full illumos-joyent nightly builds and compared the performance. When run repeatedly, the build duration was consistent within each PI, but the patched version was faster by a significant margin. Here are two such runs:

Stock:

==== Elapsed build time (non-DEBUG) ====

real    14:01.1
user  1:50:14.7
sys     55:59.8

Patched:

==== Elapsed build time (non-DEBUG) ====

real    10:37.6
user  1:52:30.6
sys     46:22.0

Between each of the builds, I clobbered much of the workspace (leaving the bootstrap tools intact) and then initiated the build after a reboot so ARC warmth would not be a factor.

At @bryan's request, I captured kstat -p before and after those two builds. The output files are attached.


Comment by Bryan Cantrill
Created at 2019-04-26T17:12:08.074Z

Based on Patrick's data here are the CPU stat deltas:

STAT                          STOCK        PATCHED          DELTA %DELTA
anonfree                          0              0              0
anonpgin                          0              0              0
anonpgout                         0              0              0
as_fault                  251918824      251290619        -628205 -0.249%
bawrite                         273            243            -30 -11%
bread                           113            113              0 0 %
bwrite                          319            283            -36 -11.3%
canch                             0              0              0
class                             0              0              0
cow_fault                  69901189       69813180         -88009 -0.126%
cpumigrate                  3307100        1738532       -1568568 -47.4%
crtime                            0              0              0
dfree                             1              0             -1 -100%
execfree                          0              0              0
execpgin                          0              0              0
execpgout                         0              0              0
fileovf                           0              0              0
fsfree                            1              0             -1 -100%
fspgin                         4418           4470             52 +1.18%
fspgout                           1              0             -1 -100%
hat_fault                         0              0              0
idle                       36390845       24360985      -12029860 -33.1%
idlethread                  7833598        4120089       -3713509 -47.4%
inodeovf                          0              0              0
intr                       19106296       19644542         538246 +2.82%
intrblk                        1586           1557            -29 -1.83%
intrthread                   493581         420577         -73004 -14.8%
inv_swtch                   2356426        1806256        -550170 -23.3%
iowait                            0              0              0
kernel                      4643373        3873787        -769586 -16.6%
kernel_asflt                      0              0              0
lread                      66992163       66994000           1837 +0.00274%
lwrite                         1182           1082           -100 -8.46%
maj_fault                      1993           2010             17 +0.853%
mdmint                            0              0              0
modload                           0              1              1
modunload                         0              0              0
msg                               0              0              0
mutex_adenters             76371288       88458955       12087667 +15.8%
namei                      29235565       29228500          -7065 -0.0242%
nthreads                     393714         393768             54 +0.0137%
outch                       4110368        2622815       -1487553 -36.2%
pgfrec                     19481008       18565124        -915884 -4.7%
pgin                           3272           3303             31 +0.947%
pgout                             1              0             -1 -100%
pgpgin                         4418           4470             52 +1.18%
pgpgout                           1              0             -1 -100%
pgrec                      19481008       18565124        -915884 -4.7%
pgrrun                            0              0              0
pgswapin                          0              0              0
pgswapout                         0              0              0
phread                            8              0             -8 -100%
phwrite                           0              0              0
physio                            0              0              0
procovf                           0              0              0
prot_fault                 72607693       72819990         212297 +0.292%
pswitch                    23034918       14945499       -8089419 -35.1%
rawch                            50              2            -48 -96%
rcvint                            0              0              0
readch                  45305994866    45303289276       -2705590 -0.00597%
rev                               0              0              0
rw_rdfails                   423967         163481        -260486 -61.4%
rw_wrfails                    23041          30654           7613 +33%
scan                              0              0              0
sema                              0              0              0
softlock                         48             48              0 0 %
swap                              0              0              0
swapin                            0              0              0
swapout                           0              0              0
syscall                   234811783      234581520        -230263 -0.0981%
sysexec                      544018         544012             -6 -0.0011%
sysfork                      240134         240128             -6 -0.0025%
sysread                     6787944        6780676          -7268 -0.107%
sysvfork                     139524         139524              0 0 %
syswrite                    2429223        2413843         -15380 -0.633%
trap                      322828420      321855365        -973055 -0.301%
ufsdirblk                      3401           3493             92 +2.71%
ufsiget                        1094           1075            -19 -1.74%
ufsinopage                        0              0              0
ufsipage                          0              0              0
user                        7168200        7293837         125637 +1.75%
wait                              0              0              0
wait_io                           0              0              0
wait_pio                          0              0              0
wait_swap                         0              0              0
writech                  8835731373     8832322426       -3408947 -0.0386%
xcalls                     13440989       15144785        1703796 +12.7%
xmtint                            0              0              0
zfod                      114131950      114120115         -11835 -0.0104%

Comment by Patrick Mooney
Created at 2019-05-02T20:10:40.976Z

I hacked up a rather ugly D script to collect information about how long threads spend blocked on rwlocks, as they too are effected by the KPRI removal:

turnstile_block:entry
/arg2 && arg3 == (uint64_t)&`rw_sobj_ops/
{
        self->t = timestamp;
        self->q = ((arg1 == 0) ? 2 : 0) | ((*((uint64_t *)arg2) & 0x4) >> 2);
}

turnstile_block:return
/self->t/
{
        this->diff = timestamp - self->t;
        self->t = 0;

        if (self->q == 3) {
                @writer_writer = quantize(this->diff);
        } else if (self->q == 2) {
                @writer_reader = quantize(this->diff);
        } else if (self->q == 1) {
                @reader_writer = quantize(this->diff);
        } else {
                @reader_want = quantize(this->diff);
        }
}

END
{
        printf("writer blocked on other writer:\n");
        printa(@writer_writer);
        printf("writer blocked on readers:\n");
        printa(@writer_reader);
        printf("reader blocked on other writer:\n");
        printa(@reader_writer);
        printf("reader blocked on writer desire:\n");
        printa(@reader_want);
}

It hooks turnstile_block specifically, rather than rw_enter_sleep since the later is much too hot and causes excessive probe drops. Even with the constrained probes, drops were common during a loaded test (building illumos).

Results from a stock PI:

writer blocked on other writer:


           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@                               654
           16384 |@@@@@@@@@@@@                             823
           32768 |@@@@@@@                                  447
           65536 |@@                                       160
          131072 |@@                                       114
          262144 |@                                        36
          524288 |                                         12
         1048576 |                                         31
         2097152 |@                                        41
         4194304 |                                         28
         8388608 |@                                        44
        16777216 |@@                                       151
        33554432 |@                                        45
        67108864 |@                                        39
       134217728 |                                         26
       268435456 |                                         0

writer blocked on readers:


           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         3
            8192 |@@@@@@@@@@@@@@@@                         742
           16384 |@@@@@@@@@@@@@@@@                         721
           32768 |@@@                                      155
           65536 |@@                                       108
          131072 |@                                        25
          262144 |                                         5
          524288 |                                         3
         1048576 |                                         9
         2097152 |                                         13
         4194304 |                                         0
         8388608 |@                                        30
        16777216 |                                         0

reader blocked on other writer:


           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@                                     2175
           16384 |@@@@@@@@@@@@@@@                          8751
           32768 |@@@@@@@@@@@                              6744
           65536 |@@@@@@                                   3791
          131072 |@@@                                      1608
          262144 |                                         292
          524288 |                                         88
         1048576 |                                         130
         2097152 |                                         33
         4194304 |                                         0

reader blocked on writer desire:


           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |                                         64
           16384 |@@@@@@@@                                 4496
           32768 |@@@@@@@@@@@@@@@@                         9088
           65536 |@@@@@@@@@@@                              6029
          131072 |@@@                                      1872
          262144 |                                         211
          524288 |                                         37
         1048576 |                                         37
         2097152 |                                         39
         4194304 |                                         27
         8388608 |                                         60
        16777216 |                                         48
        33554432 |                                         25
        67108864 |                                         10
       134217728 |                                         10
       268435456 |                                         2
       536870912 |                                         0

Results from a patched PI:

writer blocked on other writer:


           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@                               400
           16384 |@@@@@@@@@@@@@                            515
           32768 |@@@@@                                    213
           65536 |@@                                       93
          131072 |@@                                       76
          262144 |@                                        51
          524288 |@                                        56
         1048576 |@                                        36
         2097152 |@                                        21
         4194304 |                                         10
         8388608 |@                                        43
        16777216 |@                                        33
        33554432 |@                                        30
        67108864 |                                         1
       134217728 |                                         0

writer blocked on readers:


           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@@@                             444
           16384 |@@@@@@@@@@@                              435
           32768 |@@                                       71
           65536 |@                                        40
          131072 |@                                        38
          262144 |@                                        38
          524288 |@@                                       70
         1048576 |@@                                       68
         2097152 |@@                                       79
         4194304 |@@@                                      100
         8388608 |@@@                                      116
        16777216 |@                                        27
        33554432 |                                         8
        67108864 |                                         2
       134217728 |                                         0

reader blocked on other writer:


           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         11
            8192 |@@@@@                                    2845
           16384 |@@@@@@@@@@@@@@@                          9013
           32768 |@@@@@@@@@@                               6385
           65536 |@@@@@                                    3364
          131072 |@@@                                      1755
          262144 |@                                        409
          524288 |@                                        403
         1048576 |                                         257
         2097152 |                                         128
         4194304 |                                         59
         8388608 |                                         157
        16777216 |                                         30
        33554432 |                                         4
        67108864 |                                         0

reader blocked on writer desire:


           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |                                         76
           16384 |@@@@@@                                   2535
           32768 |@@@@@@@@@@@                              4367
           65536 |@@@@@@@                                  2611
          131072 |@@@                                      1343
          262144 |@                                        550
          524288 |@                                        493
         1048576 |@@                                       624
         2097152 |@@                                       768
         4194304 |@@                                       699
         8388608 |@@@                                      1103
        16777216 |@@                                       651
        33554432 |                                         151
        67108864 |                                         42
       134217728 |                                         8
       268435456 |                                         1
       536870912 |                                         0

Comment by Patrick Mooney
Created at 2019-05-02T22:55:16.386Z

I ran a build on stock bits with some additional probes added to fss_preempt, designed to fire when a thread with KPRI set would have been preempted (or at least a chance of that happening, absent the artificially boosted priority). The results are attached in kpriempt.log.

Roughly 53% of the stacks were in segvn_fault, 25% in anon_zero as part of that. 10% were in the middle of exece syscalls. 5% in forksys


Comment by Mike Gerdts
Created at 2019-05-03T03:35:07.942Z

As I was looking into some soft-lockups of bhyve guests on PA5DWL842, Patrick suggested {{::stacks c vm_localize_resources}} for determining if this bug was in play. Indeed it was this shows that one of the threads in vm_localize_resources was sitting on CPU for nearly a minute during a guest pause that was long enough to cause an ssh session to drop.

> ::stacks -c vm_localize_resources | ::print kthread_t t_cpu | ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
 22 fffffe232216e000  1b    0    0  -1   no    no t-138  fffffcc268f10c20 (idle)
                       |
            RUNNING <--+
              READY
             EXISTS
             ENABLE

 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
 36 fffffe2332b60000  1b    0    0  -1   no    no t-115  fffffcc268715c20 (idle)
                       |
            RUNNING <--+
              READY
             EXISTS
             ENABLE

 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
 43 fffffe2353094000  1b    2    0  60   no    no t-59407 fffffe28affba440 qemu-system-x86_
                       |    |
            RUNNING <--+    +-->  PRI THREAD           PROC
              READY                59 fffffe23788c37e0 bhyve
             EXISTS                59 fffffe28cdfbb880 qemu-system-x86_
             ENABLE

Comment by Trent Mick
Created at 2019-05-08T21:19:19.388Z
Updated at 2019-05-10T03:19:30.357Z

nightly testing notes

platform TRY_BRANCH build to use for testing:

$ updates-imgadm -C '*' list name=platform version=~OS-7753
UUID                                  NAME      VERSION                   FLAGS  OS     PUBLISHED
9a144117-7c7a-4a6f-9b1c-a216fb976506  platform  OS-7753-20190508T191313Z  -      other  2019-05-08T21:00:35Z

We ran this through nightly-1's full reflash/setup/test (including a single CN Manta). Results: https://gist.github.com/trentm/81bfa2fb3c098e876352afe85f260db1

That is a pretty good run for current nightly state. From a reasonably quick look I believe most of the failures there are known issues affecting nightly that are being looked at (https://jira.joyent.us/issues/?filter=11605), and nothing stands out that I would blame on this changed platform.


Comment by Patrick Mooney
Created at 2019-05-22T18:20:48.556Z

@angela.fong spun up this PI in the lab to check that it didn't adversely impact workloads associated with the Manta metadata tier.

She noted:

I managed to get 4x workload (close to prod workload) on a metadata node with your PI, plus having the load generator running on it as well, making sure the cpus work hard.
And no observable difference in throughput.

In addition to that, @mike.zeller and @danmcd tested the change with their upcoming cloud firewall logging work. Similarly, they did not see a change in performance, positive or negative, when testing with the change.


Comment by Jira Bot
Created at 2019-05-23T03:24:16.730Z

illumos-joyent commit aa451f1e540c93bda3973fc20bad14188a136989 (branch master, by Patrick Mooney)

OS-7753 THREAD_KPRI_RELEASE does nothing of the sort
Reviewed by: Bryan Cantrill <bryan@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>