OS-6390: hati_demap_func should take pagesize into account

Details

Issue Type: Bug
Priority: 2 - Critical
Status: Resolved
Created at: 2017-10-09T00:51:41.000Z
Updated at: 2018-02-12T18:57:41.810Z

People

Created by: Former user
Reported by: Former user
Assigned to: Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2017-10-27T18:24:26.000Z)

Fix Versions

2017-11-09 Edge (Release Date: 2017-11-09)

Related Issues

Description

We deployed a new Manatee image which included a patch to PostgreSQL 9.6.3 to use ISM instead of POSIX shared memory. Most shards in the DC were fine, but shard 1 (SPC ap-southeast) was decidedly not.

The machine on which the primary for the shard is deployed (HA7S8HND2) has a high load average (~40+), with no process obviously responsible as seen via prstat. Looking at a 5-second mpstat sample:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  590   0  265  2372  153  225    6   54   50    0   449    2  74   0  24
  1  782   0   89   163    4   56    2   17   52    0   178    1  85   0  14
  2 1215   0  265   375   82  368   20   76  148    0  1147    4  66   0  30
  3  313   0   52   256   40  190    8   44   18    0   357    1  75   0  23
  4  811   0  168   272   39  207   15   55   44    0   556    1  83   0  16
  5  625   0   98   171    7   38    2   10   20    0   120    0  89   0  10
  6  277   0   87   255   34  167    4   34  274    0   272    1  74   0  25
  7  979   0  157   286   58  190    4   32  125    0   591    2  76   0  23
  8  469   0   82   227   27  112    1   19   61    0   336    1  80   0  19
  9  413   0   45   245   55   98    2   12   29    0   257    1  79   0  20
 10  471   0   50   224   28   98    2   21   70    0   243    1  79   0  20
 11  899   0  195   254   46  220    4   49  125    0   425    1  88   0  11
 12  555   0  264   312   67  280    7   69   70    0   363    1  75   0  24
 13 1592   0  214   294   62  191   17   29   45    0   706    2  75   0  23
 14  669   0  175   630  376  298   21   49  102    0   329    1  74   0  25
 15 1035   0  166   544  337  140   16   23  101    0   297    1  77   0  22
 16  330   0   73   562  349  199    5   30  150    0   120    0  79   0  20
 17  167   0   94   456  272  112    2   15   15    0    92    0  86   0  14
 18 1208   0  241   744  389  630   22  100  154    0  1157    3  61   0  36
 19  380   0   50   474  277  145    6   19   21    0   123    0  76   0  23
 20  776   0  225   610  368  160   30   24   90    0   340    2  87   0  11
 21  469   0   68   292  111  123    3   14   12    0   119    0  78   0  22
 22  437   0  108   388  139  260   15   50   47    0   690    3  69   0  29
 23  313   0   39   296  116   46    2   13   13    0   113    0  85   0  14
 24  470   0   44   282  107   53    1   17   40    0   267    1  82   0  17
 25  643   0   91   291   67  309    4   45  128    0   691    2  62   0  37
 26  961   0  282   262   39  261    7   59   92    0   931    4  56   0  40
 27 1103   0  143   424  123  508    7   79  159    0  1005    2  56   0  41
 28  691   0  115   503  286  192   36   51   64    0   560    3  63   0  34
 29  436   0   90   359  156  229    4   29   33    0   374    1  64   0  34
 30  598   0  180   361  139  208    8   53  151    0   477    2  64   0  35
 31  953   0  131   291  117  141    3   27   52    0   368    1  71   0  28
 32  795   0  249   277   38  306   11   60  138    0   658    3  59   0  38
 33  745   0  185   422  161  283   16   53   71    0   680    3  57   0  41
 34  853   0  172   257   33  171   17   32   38    0   441    2  72   0  26
 35  621   0   83   213   25  177    4   34  190    0   319    1  60   0  39
 36  533   0   98   225   25  101    2   14   36    0   329    1  78   0  20
 37  776   0   80   205   18  132    7   27   45    0   237    1  66   0  34
 38  349   0   62   297   55  162    1   16   14    0   195    0  80   0  20
 39  341   0   44   263   38  230    4   48   81    0   153    0  63   0  36
 40  469   0   85   190   11   60    9   19    9    0   400    1  89   0  10
 41  809   0  101   423  120  397   14   51   22    0  1284    4  58   0  38
 42  870   0  131   188   13   69    3   17   65    0   319    1  75   0  25
 43  687   0  193   290   63  281    2   48   53    0   453    1  54   0  45
 44  591   0  105   241   43  116    1   12   23    0   195    0  79   0  21
 45  527   0  171   267   45  208    6   43   37    0   639    2  69   0  30
 46  956   0  188   384  146  292    8   64  115    0   734    3  55   0  42
 47  647   0   95   213   34  181    2   35   63    0   464    1  48   0  51
 48  889   0  221   342   77  298   18   59   64    0   627    1  74   0  24
 49  156   0   75   296   81  253    2   34   34    0   313    1  61   0  39
 50  653   0  106   234   34  132    6   26   19    0   321    1  76   0  23
 51  284   0   32   190   22   96    1   23   27    0    97    0  65   0  35
 52  681   0  162   352   87  322   14   54   57    0   818    3  62   0  35
 53  873   0  160   418  122  454    6   40  194    0   603    2  62   0  36
 54  529   0  131   205   11   84    5   21  143    0   192    0  84   0  16
 55  844   0  189   489  149  595   19   79   98    0   962    2  32   0  66

In a 60-second kernel profile, the most common stack was:

              unix`hati_demap_func+0x4c
              unix`hat_tlb_inval_range+0x1fe
              unix`handle_ranges+0x7c
              unix`hat_unload_callback+0x205
              unix`hat_unload+0x3e
              unix`hat_unshare+0x147
              genunix`segspt_shmunmap+0xa7
              genunix`as_unmap+0x19c
              shmsys`shm_detach+0x5b
              shmsys`shmexit+0x63
              genunix`relvm+0x19e
              genunix`proc_exit+0x4aa
              genunix`exit+0x15
              genunix`rexit+0x18
              unix`sys_syscall+0x19f
           365779

I took a look at shmexit() activity on the machine:

[root@HA7S8HND2 (ap-southeast-1a) /var/tmp]# dtrace -n 'fbt::shmexit:entry { self->t = timestamp; } fbt::shmexit:return /self->t/ { @[zonename, execname] = quantize(timestamp - self->t); self->t = 0; } tick-60s { printa(@); trunc(@); }' 
dtrace: description 'fbt::shmexit:entry ' matched 3 probes
CPU     ID                    FUNCTION:NAME
  8  85774                        :tick-60s 
  global                                              zabbix_agentd                                     
           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |@@@@@@@@@@@@@@@@@@@@@@@                  9        
            8192 |@@@@@@@@                                 3        
           16384 |@@@@@@@@@@                               4        
           32768 |                                         0        

  7f6d4352-c2d3-4441-8e72-37741d55d545                postgres                                          
           value  ------------- Distribution ------------- count    
       268435456 |                                         0        
       536870912 |@@@@@@@@@@@@@@@@@@@@@@@                  1070     
      1073741824 |@@@@@@@@@@@@@@@@@                        825      
      2147483648 |                                         0        

Note that there are many PostgreSQL processes exiting, and each of them takes 500-1000ms in shmexit().

Looking at the PostgreSQL log to see the source of the connections that are disconnecting, causing postgres to exit:

[root@7f6d4352 (postgres) ~]$ tail -f /var/pg/postgresql.log | awk '/disconnection:/ { print($(NF-1)); t++ } t > 250 { exit(0) }' | sort | uniq -c | sort -n
     71 host=10.160.0.82
     80 host=10.164.0.216
    100 host=10.168.0.82

These are the Moray instances for this shard; one per DC. Looking at one of the Moray instances, I see a lot of this in /var/log/moray.log:

00:29:37.901Z FATAL moray/pgpool: unable to create PG client
    Error: Connection terminated unexpectedly
        at null.<anonymous> (/opt/smartdc/moray/node_modules/pg/lib/client.js:224:28)
        at g (events.js:180:16)
        at emit (events.js:92:17)
        at Socket.<anonymous> (/opt/smartdc/moray/node_modules/pg/lib/connection.js:90:10)
        at Socket.emit (events.js:95:17)
        at TCP.close (net.js:466:12)

It's not clear if this is cause or effect. The PostgreSQL log has lots of odd messages like these:

2017-10-09 00:30:04 UTC [27248]: [3-1] user=moray,db=moray,app=[unknown],client=10.160.0.82 LOG:  could not receive data from client: Connection reset by peer
...
2017-10-09 00:30:17 UTC [18277]: [180-1] user=,db=,app=,client= LOG:  could not accept new connection: Software caused connection abort
...
2017-10-09 00:30:18 UTC [27837]: [2-1] user=[unknown],db=[unknown],app=[unknown],client=10.160.0.82 LOG:  incomplete startup packet

Looking at one particular short-lived connection:

[root@7f6d4352 (postgres) ~]$ grep -w 31906 !$ 
grep -w 31906 /var/pg/postgresql.log 
2017-10-09 00:32:03 UTC [31906]: [1-1] user=[unknown],db=[unknown],app=[unknown],client=10.160.0.82 LOG:  connection received: host=10.160.0.82 port=32891
2017-10-09 00:32:03 UTC [31906]: [2-1] user=moray,db=moray,app=[unknown],client=10.160.0.82 LOG:  connection authorized: user=moray database=moray
2017-10-09 00:32:03 UTC [31906]: [3-1] user=moray,db=moray,app=[unknown],client=10.160.0.82 LOG:  could not receive data from client: Connection reset by peer
2017-10-09 00:32:03 UTC [31906]: [4-1] user=moray,db=moray,app=[unknown],client=10.160.0.82 LOG:  disconnection: session time: 0:00:00.005 user=moray database=moray host=10.160.0.82 port=32891

It is possible that, with the machine overloaded by shmexit() calls, PostgreSQL cannot fully accept each Moray connection before Moray's (presumably hair-trigger) connection timeout has already expired.

Comments

Comment by Former user
Created at 2017-10-09T16:30:19.000Z

This sounds similar to the symptoms observed under MORAY-339 (though I have not seen the shmexit symptom before). PostgreSQL does a fair bit of work synchronously when accepting connections, and because Moray caps its connect timeout at 4 seconds, we've seen cases where modest delays in certain operations can bubble upstack to create this sort of mess. For example, if fork(2) starts to take even just one second, and Moray tries to create just four connections (across all processes in all zones for this shard), it's likely the fourth one will time out. If Morays try to create more than four connections (which was far more common prior to MORAY-397), this only gets worse as the aborted connections stack up in the accept queue and effectively guarantee that new connection attempts will time out. (It seems that the way Moray aborts these connections, the connection remains on the PostgreSQL accept queue.) Some of these cases were attributed to DTrace-related fork latency (e.g., OS-4742), but it's conceivable that shmexit() could get in the way as well.


Comment by Former user
Created at 2017-10-10T01:14:55.000Z

In investigating this, we found that the system was spending essentially all of its time in TLB shootdown. Here is the time spent in hati_demap_func, broken down by length:

[root@HA7S8HND2 (ap-southeast-1a) ~]# dtrace -n hati_demap_func:entry'{self->in = timestamp; self->len = arg2}' -n hati_demap_func:return'/self->in/{@[self->len] = quantize(timestamp - self->in); self->in = 0; self->len = 0}'
dtrace: description 'hati_demap_func:entry' matched 1 probe
dtrace: description 'hati_demap_func:return' matched 1 probe
dtrace: 156 dynamic variable drops with non-empty dirty list
dtrace: 415 dynamic variable drops with non-empty dirty list
^C

           135168
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
            4096 |                                         0        

                0
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2        
            4096 |                                         0        

            49152
           value  ------------- Distribution ------------- count    
             512 |                                         0        
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4        
            2048 |                                         0        

            61440
           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
            8192 |                                         0        

           233472
           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
            8192 |                                         0        

           368640
           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
            8192 |                                         0        

            53248
           value  ------------- Distribution ------------- count    
             512 |                                         0        
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 6        
            2048 |                                         0        

           122880
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3        
            4096 |                                         0        

            65536
           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2        
            8192 |                                         0        

           163840
           value  ------------- Distribution ------------- count    
            4096 |                                         0        
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
           16384 |                                         0        

           598016
           value  ------------- Distribution ------------- count    
            4096 |                                         0        
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
           16384 |                                         0        

           167936
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |@@@@@@@@@@@@@@@@@@@@                     1        
            4096 |                                         0        
            8192 |@@@@@@@@@@@@@@@@@@@@                     1        
           16384 |                                         0        

           229376
           value  ------------- Distribution ------------- count    
            8192 |                                         0        
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
           32768 |                                         0        

           466944
           value  ------------- Distribution ------------- count    
            8192 |                                         0        
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
           32768 |                                         0        

           561152
           value  ------------- Distribution ------------- count    
            8192 |                                         0        
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
           32768 |                                         0        

          1015808
           value  ------------- Distribution ------------- count    
            8192 |                                         0        
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
           32768 |                                         0        

          1044480
           value  ------------- Distribution ------------- count    
            8192 |                                         0        
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
           32768 |                                         0        

           221184
           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5        
            8192 |                                         0        

           102400
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     9        
            4096 |@@@@                                     1        
            8192 |                                         0        

           794624
           value  ------------- Distribution ------------- count    
           16384 |                                         0        
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
           65536 |                                         0        

          1003520
           value  ------------- Distribution ------------- count    
           16384 |                                         0        
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
           65536 |                                         0        

           299008
           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        5        
            8192 |                                         0        
           16384 |@@@@@@@                                  1        
           32768 |                                         0        

           393216
           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        5        
            8192 |                                         0        
           16384 |@@@@@@@                                  1        
           32768 |                                         0        

            36864
           value  ------------- Distribution ------------- count    
             512 |                                         0        
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              14       
            2048 |@@                                       1        
            4096 |@@@@@@                                   3        
            8192 |@@@@@@                                   3        
           16384 |                                         0        

          2064384
           value  ------------- Distribution ------------- count    
           32768 |                                         0        
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
          131072 |                                         0        

           864256
           value  ------------- Distribution ------------- count    
            4096 |                                         0        
            8192 |@@@@@@@@@@@@@@@@@@@@                     2        
           16384 |                                         0        
           32768 |@@@@@@@@@@@@@@@@@@@@                     2        
           65536 |                                         0        

          1048576
           value  ------------- Distribution ------------- count    
            8192 |                                         0        
           16384 |@@@@@@@@@@                               1        
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           3        
           65536 |                                         0        

            45056
           value  ------------- Distribution ------------- count    
             512 |                                         0        
            1024 |@@                                       3        
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          42       
            4096 |@@@@@@@                                  9        
            8192 |                                         0        

          2097152
           value  ------------- Distribution ------------- count    
           16384 |                                         0        
           32768 |@@@@@@@@@@@@@@@@@@@@                     2        
           65536 |@@@@@@@@@@@@@@@@@@@@                     2        
          131072 |                                         0        

            20480
           value  ------------- Distribution ------------- count    
             256 |                                         0        
             512 |                                         2        
            1024 |@@@@@@@                                  96       
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@                340      
            4096 |@@@@@@@@                                 106      
            8192 |                                         4        
           16384 |                                         1        
           32768 |                                         0        

           696320
           value  ------------- Distribution ------------- count    
            4096 |                                         0        
            8192 |@@@@@@@@@@@@@@@@@@@@@                    78       
           16384 |@@@@@@@                                  26       
           32768 |@@@@@@@@@@@@                             43       
           65536 |                                         0        

          7090176
           value  ------------- Distribution ------------- count    
           32768 |                                         0        
           65536 |@@@@@@@@                                 66       
          131072 |@@@@@@@@                                 66       
          262144 |@@@@@@@@@@@@@@@@@@@@@@@@                 199      
          524288 |                                         0        

             4096
           value  ------------- Distribution ------------- count    
             256 |                                         0        
             512 |@@@@@@@@@@@@@@@                          58044    
            1024 |@@@@@@@@@@@@@@                           53784    
            2048 |@@@@@@@@@                                36784    
            4096 |@                                        5338     
            8192 |                                         1737     
           16384 |                                         112      
           32768 |                                         63       
           65536 |                                         1        
          131072 |                                         0        

        874512384
           value  ------------- Distribution ------------- count    
         4194304 |                                         0        
         8388608 |@@@@@@@@@@@@@@                           64       
        16777216 |@@@@@                                    25       
        33554432 |@@@@@@@@@@@@@@@@@@@@@                    95       
        67108864 |                                         0        

      34359738368
           value  ------------- Distribution ------------- count    
       268435456 |                                         0        
       536870912 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              93       
      1073741824 |@@@@@@@@@@@@@                            45       
      2147483648 |                                         0        

This is despite the fact that the ISM segment is on large pages; from pmap -s output of one of the postgres processes:

FFFFBF7780000000   33554432K   1G rwxsR    [ ism shmid=0x4 ]
FFFFBF7F80000000     854016K   2M rwxsR    [ ism shmid=0x4 ]
FFFFBF7FB4200000        536K   4K rwxsR    [ ism shmid=0x4 ]

The issue here is that hati_demap_func does not take pagesize into account -- which means that for the 32 GB segment consisting of 32 1 GB pages, we are performing over 8 million invlpg instructions where we should need only 32! (It is not surprising that these instructions take ~100 ns each; invlpg is not an instruction designed to be executed millions of times per second.) A C paraphrase of this loop appears at the end of this comment. As an experiment, we forced hati_demap_func to always just invalidate the entire TLB (regardless of length):

[root@HA7S8HND2 (ap-southeast-1a) ~]# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs sd mm lofs idm mpt_sas sata crypto random cpc logindmux ptm kvm sppp nsmb smbsrv nfs ipc ]

> hati_demap_func::dis
hati_demap_func:                pushq  %rbp
hati_demap_func+1:              movq   %rsp,%rbp
hati_demap_func+4:              subq   $0x20,%rsp
hati_demap_func+8:              movq   %rdx,-0x18(%rbp)
hati_demap_func+0xc:            movq   %rsi,-0x10(%rbp)
hati_demap_func+0x10:           movq   %rdi,-0x8(%rbp)
hati_demap_func+0x14:           cmpq   %rdi,+0x4181f5(%rip)     <kas+0x10>
hati_demap_func+0x1b:           je     +0x1b    <hati_demap_func+0x38>
hati_demap_func+0x1d:           movq   %gs:0x10,%rax
hati_demap_func+0x26:           cmpq   %rdi,0x660(%rax)
hati_demap_func+0x2d:           je     +0x9     <hati_demap_func+0x38>
hati_demap_func+0x2f:           xorl   %eax,%eax
hati_demap_func+0x31:           leave  
hati_demap_func+0x32:           ret    
hati_demap_func+0x33:           nopl   0x0(%rax,%rax)
hati_demap_func+0x38:           cmpq   $-0x1,%rsi       <0xffffffffffffffff>
hati_demap_func+0x3c:           je     +0x22    <hati_demap_func+0x60>
hati_demap_func+0x3e:           testq  %rdx,%rdx
hati_demap_func+0x41:           je     -0x14    <hati_demap_func+0x2f>
hati_demap_func+0x43:           xorl   %eax,%eax
hati_demap_func+0x45:           nopl   (%rax)
hati_demap_func+0x48:           invlpg (%rax,%rsi)
hati_demap_func+0x4c:           addq   $0x1000,%rax
hati_demap_func+0x52:           cmpq   %rax,%rdx
hati_demap_func+0x55:           ja     -0xf     <hati_demap_func+0x48>
hati_demap_func+0x57:           xorl   %eax,%eax
hati_demap_func+0x59:           leave  
hati_demap_func+0x5a:           ret    
hati_demap_func+0x5b:           nopl   0x0(%rax,%rax)
hati_demap_func+0x60:           testb  $0x2,0x58(%rdi)
hati_demap_func+0x64:           je     +0x3f    <hati_demap_func+0xa5>
hati_demap_func+0x66:           movq   %gs:0x10,%rax
hati_demap_func+0x6f:           movq   0x668(%rax),%rax
hati_demap_func+0x76:           movq   0x90(%rdi),%rdx
hati_demap_func+0x7d:           movq   0x18(%rax),%rax
hati_demap_func+0x81:           movq   %rdx,(%rax)
hati_demap_func+0x84:           movq   0x98(%rdi),%rdx
hati_demap_func+0x8b:           movq   %rdx,0x8(%rax)
hati_demap_func+0x8f:           movq   0xa0(%rdi),%rdx
hati_demap_func+0x96:           movq   %rdx,0x10(%rax)
hati_demap_func+0x9a:           movq   0xa8(%rdi),%rdx
hati_demap_func+0xa1:           movq   %rdx,0x18(%rax)
hati_demap_func+0xa5:           movq   %cr3,%rax
hati_demap_func+0xa8:           movq   %rax,%cr3
>                                     
> hati_demap_func+0x3c/Bna
hati_demap_func+0x3c:           74      
hati_demap_func+0x3d:           
> hati_demap_func+0x3c/v eb
mdb: failed to write eb at address 0xfffffffffb8214dc: target is not open for writing
> $W
> hati_demap_func+0x3c/v eb
hati_demap_func+0x3c:           0x74    =       0xeb

This resolved the issue immediately -- but some minutes later, the machine reset for reasons not understood (though the reset may well be related to the in-place modification). Based on what we saw before the reset, we can conclude that -- at the very least -- this is an exacerbating condition whereby an initial perturbation could leave the system in a state from which it cannot recover. The source of any initial perturbation, as well as the likelihood of this, is not yet determined -- but in principle this was possible on 9.2 as well, albeit less likely due to the smaller buffer size.
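
For reference, here is a minimal C paraphrase of the per-4 KB loop visible in the disassembly above at +0x48/+0x4c (the function and macro names are illustrative; this is not the illumos source):

/*
 * The stride is hard-coded to the 4 KB base page size, no matter what page
 * size actually backs the mapping -- so a 32 GB segment of 1 GB pages still
 * costs 8,388,608 invlpg instructions.
 */
#include <stdint.h>

#define MMU_PAGESIZE    0x1000UL        /* 4 KB base page */

static inline void
invlpg(uintptr_t va)
{
        /* the privileged invlpg instruction; kernel context only */
        __asm__ __volatile__("invlpg (%0)" : : "r" (va) : "memory");
}

static void
demap_range_by_base_page(uintptr_t va, uintptr_t len)
{
        uintptr_t off;

        for (off = 0; off < len; off += MMU_PAGESIZE)
                invlpg(va + off);
}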


Comment by Former user
Created at 2017-10-13T22:57:46.000Z
Updated at 2017-12-14T17:35:44.747Z

To propagate the page size down to hati_demap_func, I have slightly changed the behavior of hat_tlb_inval_range to accept a range_info_t instead of an address and length. From the information in a range_info_t, the page size can be calculated directly.

Here is the CR: https://cr.joyent.us/#/c/2783/
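
For illustration, here is a minimal sketch of the intended behavior under that change (the struct layout and the LEVEL_SIZE macro are assumptions that mirror the rng_cnt/rng_level fields used by the D script in the testing plan below; the authoritative change is in the CR above):

#include <stdint.h>

#define MMU_PAGESHIFT   12                              /* 4 KB base pages */
#define LEVEL_SHIFT(l)  (MMU_PAGESHIFT + 9 * (l))       /* 4 KB, 2 MB, 1 GB */
#define LEVEL_SIZE(l)   ((uintptr_t)1 << LEVEL_SHIFT(l))

typedef struct range_info {
        uintptr_t       rng_va;         /* starting virtual address */
        unsigned long   rng_cnt;        /* number of pages in the range */
        int             rng_level;      /* page-table level: 0, 1, or 2 */
} range_info_t;

extern void invlpg(uintptr_t);          /* as in the sketch in the previous comment */

static void
demap_range_by_page_size(const range_info_t *rip)
{
        uintptr_t va = rip->rng_va;
        uintptr_t pgsz = LEVEL_SIZE(rip->rng_level);
        unsigned long i;

        /* one invlpg per page at the mapping's actual page size */
        for (i = 0; i < rip->rng_cnt; i++, va += pgsz)
                invlpg(va);
}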

Testing Plan

tlbshoot.c can be found here: https://gist.github.com/gwydirsam/6f25309b720350b6883dc2cd1f398349
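
As context for the numbers below, here is a hypothetical sketch of the shape of such a test (an assumption about what tlbshoot.c does; the actual program is in the gist above): create a large ISM segment, then attach and detach it in a loop so that each detach forces a TLB shootdown of the entire segment.

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <string.h>

#define SEG_SIZE        (1UL << 30)     /* 1 GB, so ISM can use large pages */

int
main(void)
{
        int i;
        int id = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);

        if (id == -1) {
                perror("shmget");
                return (1);
        }

        for (i = 0; i < 1000; i++) {
                /* SHM_SHARE_MMU requests ISM on illumos */
                void *addr = shmat(id, NULL, SHM_SHARE_MMU);

                if (addr == (void *)-1) {
                        perror("shmat");
                        break;
                }
                (void) memset(addr, 0, SEG_SIZE);       /* fault in the mapping */
                (void) shmdt(addr);                     /* triggers the shootdown */
        }

        (void) shmctl(id, IPC_RMID, NULL);
        return (0);
}
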
The D script is a bit shorter:

#!/usr/sbin/dtrace -s

#pragma D option quiet

fbt::hat_tlb_inval_range:entry
/execname == "tlbshoot-1GB"/
{
        printf("hat_tlb_inval_range: range: cnt:%d lvl:%d\n",
            args[1]->rng_cnt, args[1]->rng_level);
}

fbt::hati_demap_func:entry
/execname == "tlbshoot-1GB"/
{
        self->ts = timestamp;
        printf("hati_demap_func: len:%u\n", args[2]);
}

fbt::hati_demap_func:return
/self->ts/
{
        @[execname, "ns"] = quantize(timestamp - self->ts);
        self->ts = 0;
}

These results are very promising. Before the fix, we can see that the slowest hati_demap_func calls consistently took ~0.26 seconds to run.

  tlbshoot-1GB                                        ns
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@                 3016
            2048 |                                         6
            4096 |                                         8
            8192 |                                         1
           16384 |                                         1
           32768 |                                         0
           65536 |                                         0
          131072 |                                         0
          262144 |                                         0
          524288 |                                         1
         1048576 |                                         0
         2097152 |                                         0
         4194304 |                                         0
         8388608 |@@@@@@@@                                 1000
        16777216 |                                         0
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |@@@@@@@@                                 1000
       536870912 |                                         0

Afterwards, we can see that those same hati_demap_func calls now take ~16 µs!

  tlbshoot-1GB                                        ns
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@                 3028
            2048 |@@@@@@@@                                 1001
            4096 |                                         3
            8192 |                                         1
           16384 |@@@@@@@@                                 999
           32768 |                                         4
           65536 |                                         1
          131072 |                                         0
          262144 |                                         0
          524288 |                                         1
         1048576 |                                         0


Comment by Former user
Created at 2017-10-27T18:02:36.000Z

illumos-joyent commit 9b585df (branch master, by Sam Gwydir)

OS-6390 hati_demap_func should take pagesize into account
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>


Comment by Former user
Created at 2018-02-12T18:57:41.810Z

A small side-note: this change isn't quite right for the HAT_SHARED case, where we change "va" but not the range_info_t passed into the cross-call. This doesn't seem to have actually hurt, though, and in any case I've indirectly fixed it as part of OS-6546.