MANATEE-381: PostgreSQL fails to shmat(2): not enough space

Details

Issue Type: Bug
Priority: 2 - Critical
Status: Resolved
Created at: 2017-10-15T05:51:27.000Z
Updated at: 2019-12-23T23:15:34.009Z

People

Created by: Former user
Reported by: Former user
Assigned to: Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-09-20T13:57:31.398Z)

Fix Versions

2019-10-10 Close Talker (Release Date: 2019-10-10)

Related Issues

Description

It would seem that Manatee images that use ISM instead of POSIX shared memory can hit a new failure mode when starting up:

05:42:00.971Z ERROR manatee-sitter/PostgresMgr: PostgresMgr._primary: error (standby=tcp://postgres@10.168.0.76:5432/postgres)
    VError: postgres exited unexpectedly (code 1); stdout = , stderr = 2017-10-15 05:42:00 UTC [27792]: [1-1] user=,db=,app=,client= FATAL:  shmat(id=10, addr=0, flags=0x4000) failed: Not enough space
    2017-10-15 05:42:00 UTC [27792]: [2-1] user=,db=,app=,client= LOG:  database system is shut down
    
        at ChildProcess.<anonymous> (/opt/smartdc/manatee/node_modules/manatee/lib/postgresMgr.js:1656:19)
        at ChildProcess.EventEmitter.emit (events.js:98:17)
        at Process.ChildProcess._handle.onexit (child_process.js:797:12)
    --
    url: {
      "protocol": "tcp:",
      "slashes": true,
      "auth": "postgres",
      "host": "10.164.0.209:5432",
      "port": "5432",
      "hostname": "10.164.0.209",
      "hash": null,
      "search": null,
      "query": null,
      "pathname": "/postgres",
      "path": "/postgres",
      "href": "tcp://postgres@10.164.0.209:5432/postgres"
    }
05:42:00.973Z ERROR manatee-sitter/cluster:
    applying pg config: postgres exited unexpectedly (code 1); stdout = , stderr = 2017-10-15 05:42:00 UTC [27792]: [1-1] user=,db=,app=,client= FATAL:  shmat(id=10, addr=0, flags=0x4000) failed: Not enough space
    2017-10-15 05:42:00 UTC [27792]: [2-1] user=,db=,app=,client= LOG:  database system is shut down
    
    --
    VError: applying pg config: postgres exited unexpectedly (code 1); stdout = , stderr = 2017-10-15 05:42:00 UTC [27792]: [1-1] user=,db=,app=,client= FATAL:  shmat(id=10, addr=0, flags=0x4000) failed: Not enough space
    2017-10-15 05:42:00 UTC [27792]: [2-1] user=,db=,app=,client= LOG:  database system is shut down
    
        at finishPgApply (/opt/smartdc/manatee/node_modules/manatee/node_modules/manatee-state-machine/lib/manatee-peer.js:1087:10)
        at next (/opt/smartdc/manatee/node_modules/manatee/node_modules/manatee-state-machine/node_modules/vasync/lib/vasync.js:581:14)
        at /opt/smartdc/manatee/node_modules/manatee/lib/postgresMgr.js:637:17
        at onReconfigure (/opt/smartdc/manatee/node_modules/manatee/lib/postgresMgr.js:890:17)
        at /opt/smartdc/manatee/node_modules/manatee/lib/postgresMgr.js:1094:17
        at next (/opt/smartdc/manatee/node_modules/manatee/node_modules/vasync/lib/vasync.js:201:4)
        at f (/opt/smartdc/manatee/node_modules/manatee/node_modules/once/once.js:16:25)
        at /opt/smartdc/manatee/node_modules/manatee/lib/postgresMgr.js:1619:9
        at f (/opt/smartdc/manatee/node_modules/manatee/node_modules/once/once.js:16:25)
        at ChildProcess.<anonymous> (/opt/smartdc/manatee/node_modules/manatee/lib/postgresMgr.js:1696:16)
    Caused by: VError: postgres exited unexpectedly (code 1); stdout = , stderr = 2017-10-15 05:42:00 UTC [27792]: [1-1] user=,db=,app=,client= FATAL:  shmat(id=10, addr=0, flags=0x4000) failed: Not enough space
    2017-10-15 05:42:00 UTC [27792]: [2-1] user=,db=,app=,client= LOG:  database system is shut down
    
        at ChildProcess.<anonymous> (/opt/smartdc/manatee/node_modules/manatee/lib/postgresMgr.js:1656:19)
        at ChildProcess.EventEmitter.emit (events.js:98:17)
        at Process.ChildProcess._handle.onexit (child_process.js:797:12)

This ticket covers working around the shmat() failure (in the Manatee sitter) by sleeping for a short period and restarting PostgreSQL when we see that log message on stderr of the postgres process.
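
As an aside, here is a minimal C sketch of the failing call sequence and the retry behaviour we want. This is illustrative only, not Manatee or PostgreSQL source, and the retry count and sleep interval are placeholders. PostgreSQL's ISM attach is a shmat() with a zero address hint and SHM_SHARE_MMU, which is the flags=0x4000 visible in the log above:

#include <sys/ipc.h>
#include <sys/shm.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        /* Segment size taken from the traces later in this ticket (~8.25Gb). */
        size_t segsz = 0x2101fe000UL;
        void *addr;
        int shmid, tries;

        /* shmget() is where the shm rctls are checked. */
        if ((shmid = shmget(IPC_PRIVATE, segsz, IPC_CREAT | 0600)) == -1) {
                perror("shmget");
                return (1);
        }

        for (tries = 0; tries < 5; tries++) {
                /* Address hint of 0 plus SHM_SHARE_MMU (0x4000) requests ISM. */
                addr = shmat(shmid, NULL, SHM_SHARE_MMU);
                if (addr != (void *)-1) {
                        (void) printf("attached at %p\n", addr);
                        return (0);
                }
                if (errno != ENOMEM) {
                        perror("shmat");
                        return (1);
                }
                /* "Not enough space": back off briefly and retry. */
                (void) fprintf(stderr, "shmat: %s; retrying\n", strerror(errno));
                (void) sleep(5);
        }
        return (1);
}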

Comments

Comment by Former user
Created at 2017-10-16T16:59:14.000Z
Updated at 2017-10-16T17:18:04.000Z

Indeed. We've seen this in the past quite a bit as well (documented under MANATEE-260 and MANATEE-304). Last week, we saw this in one of the SPC regions. I disabled manatee-sitter to keep it from triggering a rebuild. At that point, there was no existing SysV IPC usage:

[root@59ed68b0 (postgres) ~]$ ipcs -A
IPC status from <running system> as of Thu Oct 12 01:08:23 UTC 2017
T         ID      KEY        MODE        OWNER    GROUP  CREATOR   CGROUP CBYTES  QNUM QBYTES LSPID LRPID   STIME    RTIME    CTIME          PROJECT
Message Queues:
T         ID      KEY        MODE        OWNER    GROUP  CREATOR   CGROUP NATTCH      SEGSZ  CPID  LPID   ATIME    DTIME    CTIME  ISMATTCH         PROJECT
Shared Memory:
T         ID      KEY        MODE        OWNER    GROUP  CREATOR   CGROUP NSEMS   OTIME    CTIME          PROJECT
Semaphores:
[root@59ed68b0 (postgres) ~]$

And the limits seem plenty high for this 32GB of shared buffers:

[root@59ed68b0 (postgres) ~]$ prctl $$ 
process: 69862: -bash
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
process.max-locked-memory
        system          16.0EB    max   deny                                 -
process.max-sigqueue-size
        basic             128       -   deny                             69862
        privileged        512       -   deny                                 -
        system           100K     max   deny                                 -
process.max-port-events
        privileged      65.5K       -   deny                                 -
        system          2.15G     max   deny                                 -
process.max-msg-messages
        privileged      8.19K       -   deny                                 -
        system          4.29G     max   deny                                 -
process.max-msg-qbytes
        privileged      64.0KB      -   deny                                 -
        system          16.0EB    max   deny                                 -
process.max-sem-ops
        privileged        512       -   deny                                 -
        system          2.15G     max   deny                                 -
process.max-sem-nsems
        privileged        512       -   deny                                 -
        system          32.8K     max   deny                                 -
process.max-address-space
        privileged      16.0EB    max   deny                                 -
        system          16.0EB    max   deny                                 -
process.max-file-descriptor
        privileged      65.5K       -   deny                                 -
        system          2.15G     max   deny                                 -
process.max-core-size
        privileged      8.00EB    max   deny                                 -
        system          8.00EB    max   deny                                 -
process.max-stack-size
        basic           10.0MB      -   deny                             69862
        privileged      32.0TB      -   deny                                 -
        system          32.0TB    max   deny                                 -
process.max-data-size
        privileged      16.0EB    max   deny                                 -
        system          16.0EB    max   deny                                 -
process.max-file-size
        privileged      8.00EB    max   deny,signal=XFSZ                     -
        system          8.00EB    max   deny                                 -
process.max-cpu-time
        privileged      18.4Es    inf   signal=XCPU                          -
        system          18.4Es    inf   none                                 -
task.max-cpu-time
        usage               0s   
        system          18.4Es    inf   none                                 -
task.max-processes
        usage               3    
        system          2.15G     max   deny                                 -
task.max-lwps
        usage               4    
        system          2.15G     max   deny                                 -
project.max-contracts
        privileged      10.0K       -   deny                                 -
        system          2.15G     max   deny                                 -
project.max-locked-memory
        usage               0B   
        system          16.0EB    max   deny                                 -
project.max-port-ids
        privileged      8.19K       -   deny                                 -
        system          65.5K     max   deny                                 -
project.max-shm-memory
        usage               0B   
        privileged       128GB      -   deny                                 -
        system          16.0EB    max   deny                                 -
project.max-shm-ids
        usage               0    
        privileged        128       -   deny                                 -
        system          16.8M     max   deny                                 -
project.max-msg-ids
        usage               0    
        privileged        128       -   deny                                 -
        system          16.8M     max   deny                                 -
project.max-sem-ids
        usage               0    
        privileged        128       -   deny                                 -
        system          16.8M     max   deny                                 -
project.max-crypto-memory
        usage               0B   
        privileged       128GB      -   deny                                 -
        system          16.0EB    max   deny                                 -
project.max-tasks
        usage               1    
        system          2.15G     max   deny                                 -
project.max-processes
        usage               3    
        system          2.15G     max   deny                                 -
project.max-lwps
        usage               4    
        system          2.15G     max   deny                                 -
project.cpu-cap
        usage               0    
        system          4.29G     inf   deny                                 -
project.cpu-shares
        usage               1    
        privileged          1       -   none                                 -
        system          65.5K     max   none                                 -
zone.max-lofi
        usage               0    
        system          18.4E     max   deny                                 -
zone.max-physical-memory
        usage            171MB   
        privileged      64.0GB      -   deny                                 -
        system          16.0EB    max   deny                                 -
zone.max-swap
        usage            134MB   
        privileged       128GB      -   deny                                 -
        system          16.0EB    max   deny                                 -
zone.max-locked-memory
        usage               0B   
        privileged      64.0GB      -   deny                                 -
        system          16.0EB    max   deny                                 -
zone.max-shm-memory
        usage               0B   
        privileged      64.0GB      -   deny                                 -
        system          16.0EB    max   deny                                 -
zone.max-shm-ids
        usage               0    
        privileged      4.10K       -   deny                                 -
        system          16.8M     max   deny                                 -
zone.max-sem-ids
        usage               0    
        privileged      4.10K       -   deny                                 -
        system          16.8M     max   deny                                 -
zone.max-msg-ids
        usage               0    
        privileged      4.10K       -   deny                                 -
        system          16.8M     max   deny                                 -
zone.max-processes
        usage              26    
        privileged       256K       -   deny                                 -
        system          2.15G     max   deny                                 -
zone.max-lwps
        usage              97    
        privileged       256K       -   deny                                 -
        system          2.15G     max   deny                                 -
zone.zfs-io-priority
        usage               0    
        privileged          0       -   none                                 -
        system          16.4K     max   none                                 -
zone.cpu-burst-time
        usage               0    
        system          2.15G     max   none                                 -
zone.cpu-baseline
        usage               0    
        system          4.29G     max   none                                 -
zone.cpu-cap
        usage               0    
        system          4.29G     inf   deny                                 -
zone.cpu-shares
        usage             256    
        privileged        256       -   none                                 -
        system          65.5K     max   none                                 -

Starting manatee-sitter again caused it to come up fine, which sounds like what Josh has seen as well.


Comment by Former user
Created at 2018-09-11T05:11:57.243Z

Looking at shmat(2), we see that shmat() fails with this error ("Not enough space" is the message for ENOMEM) when the available data space is not large enough to accommodate the shared memory segment.

In this case, we have passed a flags value of SHM_SHARE_MMU (to get ISM) and an address hint of zero. Looking in the implementation of shmat(2), it appears that there are several opportunities to fail with ENOMEM:

We'll need to reproduce the failure with some instrumentation in place to determine which of these is the cause. I suspect we'll need to then dig in to why there isn't enough free memory on the system to create and attach to this ISM segment -- is it because we need to preallocate a number of lockable pages in advance? Did this used to work correctly pre-ARC because pages on the free/cache lists could be considered candidates to satisfy this allocation, but we would now instead need to trigger and block on, e.g., shrinking the ARC to make room?


Comment by Former user
Created at 2018-09-11T05:16:01.963Z

My suspicion is that there is an OS bug here that we should fix, but also that we'll need to work around it in Manatee for the time being. Options there include:

I'm also fairly certain we were only seeing this failure mode once we elected to use ISM (for the obvious benefit of shared page tables amongst the many PostgreSQL processes using the same shared buffer segment). Though it is likely not desirable, we may also be able to go back to not using ISM for the time being.


Comment by Former user
Created at 2018-09-11T06:11:22.371Z

I'm going to prototype a fix for retrying the PostgreSQL startup on detection of this specific error message. I think we should probably file a separate OS ticket for investigation into why shmat(2) appears to be failing in error.


Comment by Michael Hicks
Created at 2019-01-04T17:04:33.012Z
Updated at 2019-01-04T17:21:08.988Z

A hypothesis was floated that perhaps we are hitting this in cases where someone has manually adjusted the zone's memory settings but not adjusted all the parameters appropriately, such that the shared memory limit ended up too small.

One data point on this idea is a shard member I rebuilt in OPS-4935.

I encountered the shmat error when starting Postgres while rebuilding the pg on VM e1463a12-b362-4f8b-9d52-b77127d10d0b; the shmat errors can be found in /poseidon/stor/logs/manatee-sitter/2018/12/04/12/e1463a12.log.

Checking the instance's zone config: max_shm_memory == max_physical_memory, and max_swap is larger, so they are not too small.

[root@HA8SCHRD2 (us-east-1b) ~]# vmadm get e1463a12-b362-4f8b-9d52-b77127d10d0b | json max_shm_ids max_physical_memory max_shm_memory max_swap
4096
65536
65536
131072

I do not fully understand what max-shm-ids gets used for, but I expect it's a cap on the number of shared memory segment identifiers the zone can allocate, similar in spirit to max-sem-ids and max-lwps, and just a guard against taking up too many system resources.

These parameters were not changed recently: they are reflected in the zone's XML config, which was last modified more than a month before this rebuild.

[root@HA8SCHRD2 (us-east-1b) ~]# stat /etc/zones/e1463a12-b362-4f8b-9d52-b77127d10d0b.xml 
  File: `/etc/zones/e1463a12-b362-4f8b-9d52-b77127d10d0b.xml'
  Size: 3323      Blocks: 4          IO Block: 3584   regular file
Device: 1690019h/23658521d	Inode: 114         Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (   10/   staff)
Access: 2018-11-06 22:15:26.000000000 +0000
Modify: 2018-11-06 22:15:26.000000000 +0000
Change: 2018-11-06 22:15:26.862064625 +0000
 Birth: -

If they had been changed and the zone needed a restart to pick them up properly, that might still be relevant; but if all it takes for changes to the shared memory settings to take effect is a manatee-sitter and postgresql process restart, both of those happened with the rebuild.


Comment by Former user
Created at 2019-01-07T16:07:04.313Z

I'm not really sold on the mal-adjusted parameters theory. We started hitting this when we put back the change to PostgreSQL to use ISM in order to reduce the page table overhead for the shared memory segments. In all of the cases that I'm aware of where we've hit this error, restarting PostgreSQL shortly afterwards causes it to start up -- which implies that it's a transient resource issue, rather than a static configuration problem.


Comment by Former user
Created at 2019-01-11T17:34:40.625Z

I spent a while trying to reproduce something like this problem, without much success. The various failure modes I could provoke happen at shmget() time, not shmat() time, and I didn't have any luck getting an error at the latter point.

Agreed on the parameters thing - at least max_shm_memory gets hit at shmget() time.

https://github.com/jlevon/grot/blob/master/manatee-381/manatee-381.d is a DTrace script which is hopefully low-impact enough to run long-term, and should give us a clearer indication of the failure point. Who can help with getting this in place on some systems where we might hit this?


Comment by Former user
Created at 2019-01-18T12:32:55.967Z

Taking for now at least


Comment by Jon Burgoyne
Created at 2019-03-01T18:20:04.418Z

I ran the above DTrace script on a number of CNs that were serving as sync in central 1a while upgrading to a new version. On CN JA501576, postgres failed to start up after the reprovision, and the DTrace output was captured in /var/tmp/manatee381.d.out:

Started tracing...
T5368473752072:pid222241t1 page_needfree(18446744073709273297); `needfree = 278319 (+ 0), `availrmem = 8038501
T5368473777140:pid222241t1 page_reclaim_mem() -> 0 after 18 iterations, availrmem now 8038501
T5368473854695:pid222241t1 anon_swap_adjust() -> 12
T5368473869185:pid222241t1 as_map() (create) -> 12
T5368474168230:pid222241t1 sptcreate() -> 12
T5368474185184:pid222241t1 shmat() -> c
T5359473556371:pid222241t1 shmat(8, 0)
T5359473570339:pid222241t1 ipc_lookup() -> -2442656931328
struct kshmid {
    kipc_perm_t shm_perm = {
        avl_node_t ipc_avl = {
            struct avl_node *[2] avl_child = [ 0, 0 ]
            uintptr_t avl_pcb = 0xfffffdc76c2a7571
        }
        list_node_t ipc_list = {
            struct list_node *list_next = 0xfffffdc76c2a7588
            struct list_node *list_prev = 0xfffffdc3e88ba120
        }
        uint_t ipc_ref = 0x1
        uid_t ipc_uid = 0x38b
        gid_t ipc_gid = 0
        uid_t ipc_cuid = 0x38b
        gid_t ipc_cgid = 0
        mode_t ipc_mode = 0x8180
        key_t ipc_key = 0x52e2c1
        kproject_t *ipc_proj = 0xfffffdd949a6d648
        uint_t ipc_id = 0x8
        zoneid_t ipc_zoneid = 0x7
        zone_ref_t ipc_zone_ref = {
            struct zone *zref_zone = 0xfffffdc7e3480580
            list_node_t zref_linkage = {
                struct list_node *list_next = 0xfffffdc76c9ec468
                struct list_node *list_prev = 0xfffffdc7e34805f8
            }
        }
    }
    size_t shm_segsz = 0x2101fe000
    struct anon_map *shm_amp = 0xfffffdc768a212e8
    ushort_t shm_lkcnt = 0
    pgcnt_t shm_lkpages = 0
    kmutex_t shm_mlock = {
        void *[1] _opaque = [ 0 ]
    }
    pid_t shm_lpid = 0
    pid_t shm_cpid = 0x36421
    ulong_t shm_ismattch = 0
    time_t shm_atime = 0
    time_t shm_dtime = 0
    time_t shm_ctime = 0x5c7975b1
    struct sptinfo *shm_sptinfo = 0
    struct seg *shm_sptseg = 0
    ulong_t shm_opts = 0
}
T5359473583727:pid222241t1 map_addr(align_hint = 40000000, size = 240000000)
T5359473595198:pid222241t1 map_addr() -> addr = fffff9fd80000000
T5359473603902:pid222241t1 sptcreate()
T5359473754890:pid222241t1 as_map() (create)
T5359473762013:pid222241t1 valid_va_range() -> 1
T5359473767661:pid222241t1 valid_usr_range() -> 0
T5359473773577:pid222241t1 as_addseg()
struct seg {
    caddr_t s_base = 0
    size_t s_size = 0x240000000
    uint_t s_szc = 0
    uint_t s_flags = 0
    struct as *s_as = 0xfffffdc76f55b7e8
    avl_node_t s_tree = {
        struct avl_node *[2] avl_child = [ 0, 0xfffffdca80a06370 ]
        uintptr_t avl_pcb = 0xfffffdcac33d384a
    }
    struct seg_ops *s_ops = 0
    void *s_data = 0
    kmutex_t s_pmtx = {
        void *[1] _opaque = [ 0 ]
    }
    pcache_link_t s_phead = {
        struct pcache_link *p_lnext = 0xfffffdc75c759c48
        struct pcache_link *p_lprev = 0xfffffdc75c759c48
    }
}
T5359473781668:pid222241t1 as_addseg() -> 0
T5359473787138:pid222241t1 seg_alloc() -> fffffdc75c759bf8
T5359473793812:pid222241t1 segspt_create()
struct segspt_crargs {
    struct seg *seg_spt = 0x4000
    struct anon_map *amp = 0xfffffdc768a212e8
    uint_t prot = 0xf
    uint_t flags = 0x4000
    uint_t szc = 0x2
}
T5359473802599:pid222241t1 anon_swap_adjust(2163198)
struct k_anoninfo {
    pgcnt_t ani_max = 0x7fbffff
    pgcnt_t ani_free = 0x7eeba63
    pgcnt_t ani_phys_resv = 0x2fb87e
    pgcnt_t ani_mem_resv = 0
    pgcnt_t ani_locked_swap = 0
}
T5359473811621:pid222241t1 page_reclaim_mem(2163198, 6153597, 1); availrmem = 7926832, t_minarmem = 25
T5359473825776:pid222241t1 page_needfree(389988); `needfree = 0 (+ 389988), `availrmem = 7926832
T5359473837590:pid222241t1 kmem_reap_common(0)
T5360473743511:pid222241t1 page_needfree(18446744073709161628); `needfree = 389988 (+ 0), `availrmem = 7942734
T5360473768052:pid222241t1 page_needfree(374086); `needfree = 0 (+ 374086), `availrmem = 7942735
T5360473777074:pid222241t1 kmem_reap_common(1)
T5361473751526:pid222241t1 page_needfree(18446744073709177530); `needfree = 374086 (+ 0), `availrmem = 8001825
T5361473773134:pid222241t1 page_needfree(314995); `needfree = 0 (+ 314995), `availrmem = 8001826
T5361473780932:pid222241t1 kmem_reap_common(1)
T5362473735358:pid222241t1 page_needfree(18446744073709236621); `needfree = 314995 (+ 0), `availrmem = 8044992
T5362473760409:pid222241t1 page_needfree(271828); `needfree = 0 (+ 271828), `availrmem = 8044992
T5362473768550:pid222241t1 kmem_reap_common(1)
T5365473725012:pid222241t1 page_needfree(18446744073709279789); `needfree = 271827 (+ 0), `availrmem = 8044993
T5365473737816:pid222241t1 page_needfree(271827); `needfree = 0 (+ 271827), `availrmem = 8044993
T5365473742671:pid222241t1 kmem_reap_common(1)
T5363473742889:pid222241t1 page_needfree(18446744073709279788); `needfree = 271828 (+ 0), `availrmem = 8044992
T5363473764664:pid222241t1 page_needfree(271828); `needfree = 0 (+ 271828), `availrmem = 8044992
T5363473773399:pid222241t1 kmem_reap_common(1)
T5364473737051:pid222241t1 page_needfree(18446744073709279788); `needfree = 271828 (+ 0), `availrmem = 8044993
T5364473757812:pid222241t1 page_needfree(271827); `needfree = 0 (+ 271827), `availrmem = 8044993
T5364473765449:pid222241t1 kmem_reap_common(1)
T5366473740970:pid222241t1 page_needfree(18446744073709279789); `needfree = 271827 (+ 0), `availrmem = 8038501
T5366473753774:pid222241t1 page_needfree(278319); `needfree = 0 (+ 278319), `availrmem = 8038501
T5366473757984:pid222241t1 kmem_reap_common(1)
T5367473736302:pid222241t1 page_needfree(18446744073709273297); `needfree = 278319 (+ 0), `availrmem = 8038501
T5367473757413:pid222241t1 page_needfree(278319); `needfree = 0 (+ 278319), `availrmem = 8038501
T5367473764987:pid222241t1 kmem_reap_common(1)

Comment by Jon Burgoyne
Created at 2019-03-01T18:32:28.796Z

Had another node fail to start: CN JA501583 (also in central 1a, with pg serving as sync). DTrace produced:

Started tracing...
T5948242990117:pid515980t1 page_needfree(18446744073708979889); `needfree = 571727 (+ 0), `availrmem = 7759055
T5948243009018:pid515980t1 page_needfree(571727); `needfree = 0 (+ 571727), `availrmem = 7759055
T5948243016169:pid515980t1 kmem_reap_common(1)
T5949242992954:pid515980t1 page_needfree(18446744073708979889); `needfree = 571727 (+ 0), `availrmem = 7759056
T5949243011669:pid515980t1 page_needfree(571726); `needfree = 0 (+ 571726), `availrmem = 7759056
T5949243018378:pid515980t1 kmem_reap_common(1)
T5950242989054:pid515980t1 page_needfree(18446744073708979890); `needfree = 571726 (+ 0), `availrmem = 7759056
T5950243000028:pid515980t1 page_needfree(571726); `needfree = 0 (+ 571726), `availrmem = 7759056
T5950243005458:pid515980t1 kmem_reap_common(1)
T5951242989604:pid515980t1 page_needfree(18446744073708979890); `needfree = 571726 (+ 0), `availrmem = 7759056
T5951243000013:pid515980t1 page_needfree(571726); `needfree = 0 (+ 571726), `availrmem = 7759056
T5951243005793:pid515980t1 kmem_reap_common(1)
T5952242988261:pid515980t1 page_needfree(18446744073708979890); `needfree = 571726 (+ 0), `availrmem = 7759056
T5952242998601:pid515980t1 page_needfree(571726); `needfree = 0 (+ 571726), `availrmem = 7759056
T5952243004492:pid515980t1 kmem_reap_common(1)
T5953242994108:pid515980t1 page_needfree(18446744073708979890); `needfree = 571726 (+ 0), `availrmem = 7759056
T5953243014484:pid515980t1 page_reclaim_mem() -> 0 after 18 iterations, availrmem now 7759056
T5953243066123:pid515980t1 anon_swap_adjust() -> 12
T5953243078943:pid515980t1 as_map() (create) -> 12
T5953243369606:pid515980t1 sptcreate() -> 12
T5953243384731:pid515980t1 shmat() -> c
T5945242994345:pid515980t1 page_needfree(18446744073708879417); `needfree = 672199 (+ 0), `availrmem = 7669334
T5945243015792:pid515980t1 page_needfree(661448); `needfree = 0 (+ 661448), `availrmem = 7669334
T5945243023504:pid515980t1 kmem_reap_common(1)
T5944243632309:pid515980t1 shmat(8, 0)
T5944243644655:pid515980t1 ipc_lookup() -> -2456478883328
struct kshmid {
    kipc_perm_t shm_perm = {
        avl_node_t ipc_avl = {
            struct avl_node *[2] avl_child = [ 0, 0 ]
            uintptr_t avl_pcb = 0xfffffdc75acd3551
        }
        list_node_t ipc_list = {
            struct list_node *list_next = 0xfffffdc75acd3568
            struct list_node *list_prev = 0xfffffdc75e83c800
        }
        uint_t ipc_ref = 0x1
        uid_t ipc_uid = 0x38b
        gid_t ipc_gid = 0
        uid_t ipc_cuid = 0x38b
        gid_t ipc_cgid = 0
        mode_t ipc_mode = 0x8180
        key_t ipc_key = 0x52e2c1
        kproject_t *ipc_proj = 0xfffffdc7ba5de738
        uint_t ipc_id = 0x8
        zoneid_t ipc_zoneid = 0x8
        zone_ref_t ipc_zone_ref = {
            struct zone *zref_zone = 0xfffffdd3fb249000
            list_node_t zref_linkage = {
                struct list_node *list_next = 0xfffffdcbfb0aeee8
                struct list_node *list_prev = 0xfffffdd3fb249078
            }
        }
    }
    size_t shm_segsz = 0x2101fe000
    struct anon_map *shm_amp = 0xfffffdc4c6bfb870
    ushort_t shm_lkcnt = 0
    pgcnt_t shm_lkpages = 0
    kmutex_t shm_mlock = {
        void *[1] _opaque = [ 0 ]
    }
    pid_t shm_lpid = 0
    pid_t shm_cpid = 0x7df8c
    ulong_t shm_ismattch = 0
    time_t shm_atime = 0
    time_t shm_dtime = 0
    time_t shm_ctime = 0x5c7977f9
    struct sptinfo *shm_sptinfo = 0
    struct seg *shm_sptseg = 0
    ulong_t shm_opts = 0
}
T5944243657623:pid515980t1 map_addr(align_hint = 40000000, size = 240000000)
T5944243671213:pid515980t1 map_addr() -> addr = fffff9fd80000000
T5944243679733:pid515980t1 sptcreate()
T5944243829767:pid515980t1 as_map() (create)
T5944243836738:pid515980t1 valid_va_range() -> 1
T5944243842414:pid515980t1 valid_usr_range() -> 0
T5944243849089:pid515980t1 as_addseg()
struct seg {
    caddr_t s_base = 0
    size_t s_size = 0x240000000
    uint_t s_szc = 0
    uint_t s_flags = 0
    struct as *s_as = 0xfffffdc6218941b0
    avl_node_t s_tree = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        uintptr_t avl_pcb = 0xfffffdc3ea30f569
    }
    struct seg_ops *s_ops = 0
    void *s_data = 0
    kmutex_t s_pmtx = {
        void *[1] _opaque = [ 0 ]
    }
    pcache_link_t s_phead = {
        struct pcache_link *p_lnext = 0xfffffdca28ed2428
        struct pcache_link *p_lprev = 0xfffffdca28ed2428
    }
}
T5944243857015:pid515980t1 as_addseg() -> 0
T5944243863060:pid515980t1 seg_alloc() -> fffffdca28ed23d8
T5944243870173:pid515980t1 segspt_create()
struct segspt_crargs {
    struct seg *seg_spt = 0x4000
    struct anon_map *amp = 0xfffffdc4c6bfb870
    uint_t prot = 0xf
    uint_t flags = 0x4000
    uint_t szc = 0x2
}
T5944243879023:pid515980t1 anon_swap_adjust(2163198)
struct k_anoninfo {
    pgcnt_t ani_max = 0x7fbffff
    pgcnt_t ani_free = 0x7ed2d1c
    pgcnt_t ani_phys_resv = 0x315cb6
    pgcnt_t ani_mem_resv = 0
    pgcnt_t ani_locked_swap = 0
}
T5944243888380:pid515980t1 page_reclaim_mem(2163198, 6167559, 1); availrmem = 7658583, t_minarmem = 25
T5944243901390:pid515980t1 page_needfree(672199); `needfree = 0 (+ 672199), `availrmem = 7658583
T5944243913555:pid515980t1 kmem_reap_common(0)
T5946242989192:pid515980t1 page_needfree(18446744073708890168); `needfree = 661448 (+ 0), `availrmem = 7719181
T5946243004235:pid515980t1 page_needfree(611601); `needfree = 0 (+ 611601), `availrmem = 7719182
T5946243007947:pid515980t1 kmem_reap_common(1)
T5947242991806:pid515980t1 page_needfree(18446744073708940015); `needfree = 611601 (+ 0), `availrmem = 7759055
T5947243013303:pid515980t1 page_needfree(571727); `needfree = 0 (+ 571727), `availrmem = 7759055
T5947243020748:pid515980t1 kmem_reap_common(1)

Former user was online and asked me to run the following, as I had noticed that zfs was very slow doing simple things like setting canmount. I think zfs had gotten faster by the time I ran this (in the middle of a production CM), but here is the output:

[root@JA501583 (eu-central-1a) /var/tmp]# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba xhci fctl stmf_sbd stmf zfs sd mm lofs idm sata mpt_sas i40e crypto random cpc fcp logindmux ptm kvm sppp nsmb smbsrv nfs ipc ]
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   28084334            109704   21%
Boot pages                  69810               272    0%
ZFS File Data            93636884            365769   70%
Anon                      3209809             12538    2%
Exec and libs               92624               361    0%
Page cache                  24819                96    0%
Free (cachelist)            23137                90    0%
Free (freelist)           8724812             34081    7%

Total                   133866229            522914
Physical                133866228            522914

Comment by Jon Burgoyne
Created at 2019-03-01T19:30:30.019Z

and another JA501589

Started tracing...
T6691726206425:pid158616t1 page_needfree(18446744073709094508); `needfree = 457108 (+ 0), `availrmem = 7865251
T6691726231698:pid158616t1 page_reclaim_mem() -> 0 after 14 iterations, availrmem now 7865251
T6691726279012:pid158616t1 anon_swap_adjust() -> 12
T6691726293970:pid158616t1 as_map() (create) -> 12
T6691726588512:pid158616t1 sptcreate() -> 12
T6691726606269:pid158616t1 shmat() -> c
T6684726687760:pid158616t1 shmat(8, 0)
T6684726702358:pid158616t1 ipc_lookup() -> -2442727058944
struct kshmid {
    kipc_perm_t shm_perm = {
        avl_node_t ipc_avl = {
            struct avl_node *[2] avl_child = [ 0, 0 ]
            uintptr_t avl_pcb = 0xfffffdc7658c15d9
        }
        list_node_t ipc_list = {
            struct list_node *list_next = 0xfffffdc7658c15f0
            struct list_node *list_prev = 0xfffffdc76b627e58
        }
        uint_t ipc_ref = 0x1
        uid_t ipc_uid = 0x38b
        gid_t ipc_gid = 0
        uid_t ipc_cuid = 0x38b
        gid_t ipc_cgid = 0
        mode_t ipc_mode = 0x8180
        key_t ipc_key = 0x52e2c1
        kproject_t *ipc_proj = 0xfffffdc76a66f970
        uint_t ipc_id = 0x8
        zoneid_t ipc_zoneid = 0x7
        zone_ref_t ipc_zone_ref = {
            struct zone *zref_zone = 0xfffffe0039396000
            list_node_t zref_linkage = {
                struct list_node *list_next = 0xfffffdc7670fbea8
                struct list_node *list_prev = 0xfffffe0039396078
            }
        }
    }
    size_t shm_segsz = 0x2101fe000
    struct anon_map *shm_amp = 0xfffffdc93126b8c8
    ushort_t shm_lkcnt = 0
    pgcnt_t shm_lkpages = 0
    kmutex_t shm_mlock = {
        void *[1] _opaque = [ 0 ]
    }
    pid_t shm_lpid = 0
    pid_t shm_cpid = 0x26b98
    ulong_t shm_ismattch = 0
    time_t shm_atime = 0
    time_t shm_dtime = 0
    time_t shm_ctime = 0x5c797ade
    struct sptinfo *shm_sptinfo = 0
    struct seg *shm_sptseg = 0
    ulong_t shm_opts = 0
}
T6684726714458:pid158616t1 map_addr(align_hint = 40000000, size = 240000000)
T6684726728038:pid158616t1 map_addr() -> addr = fffff9fd80000000
T6684726737270:pid158616t1 sptcreate()
T6684726888771:pid158616t1 as_map() (create)
T6684726895923:pid158616t1 valid_va_range() -> 1
T6684726902316:pid158616t1 valid_usr_range() -> 0
T6684726909004:pid158616t1 as_addseg()
struct seg {
    caddr_t s_base = 0
    size_t s_size = 0x240000000
    uint_t s_szc = 0
    uint_t s_flags = 0
    struct as *s_as = 0xfffffdc4160d6ab8
    avl_node_t s_tree = {
        struct avl_node *[2] avl_child = [ 0, 0xfffffdc931e0ead8 ]
        uintptr_t avl_pcb = 0xfffffdc9316e681a
    }
    struct seg_ops *s_ops = 0
    void *s_data = 0
    kmutex_t s_pmtx = {
        void *[1] _opaque = [ 0 ]
    }
    pcache_link_t s_phead = {
        struct pcache_link *p_lnext = 0xfffffdc775983488
        struct pcache_link *p_lprev = 0xfffffdc775983488
    }
}
T6684726917864:pid158616t1 as_addseg() -> 0
T6684726923876:pid158616t1 seg_alloc() -> fffffdc775983438
T6684726930721:pid158616t1 segspt_create()
struct segspt_crargs {
    struct seg *seg_spt = 0x4000
    struct anon_map *amp = 0xfffffdc93126b8c8
    uint_t prot = 0xf
    uint_t flags = 0x4000
    uint_t szc = 0x2
}
T6684726940691:pid158616t1 anon_swap_adjust(2163198)
struct k_anoninfo {
    pgcnt_t ani_max = 0x7fbffff
    pgcnt_t ani_free = 0x7ec8ad8
    pgcnt_t ani_phys_resv = 0x32099a
    pgcnt_t ani_mem_resv = 0
    pgcnt_t ani_locked_swap = 0
}
T6684726952016:pid158616t1 page_reclaim_mem(2163198, 6159135, 1); availrmem = 7770229, t_minarmem = 25
T6684726966154:pid158616t1 page_needfree(552129); `needfree = 0 (+ 552129), `availrmem = 7770229
T6684726978082:pid158616t1 kmem_reap_common(0)
T6685726220378:pid158616t1 page_needfree(18446744073708999487); `needfree = 552129 (+ 0), `availrmem = 7786058
T6685726239320:pid158616t1 page_needfree(536300); `needfree = 0 (+ 536300), `availrmem = 7786058
T6685726246859:pid158616t1 kmem_reap_common(1)
T6686726217568:pid158616t1 page_needfree(18446744073709015316); `needfree = 536300 (+ 0), `availrmem = 7861678
T6686726241383:pid158616t1 page_needfree(460680); `needfree = 0 (+ 460680), `availrmem = 7861679
T6686726248953:pid158616t1 kmem_reap_common(1)
T6687726211339:pid158616t1 page_needfree(18446744073709090936); `needfree = 460680 (+ 0), `availrmem = 7871686
T6687726221565:pid158616t1 page_needfree(450672); `needfree = 0 (+ 450672), `availrmem = 7871686
T6687726227225:pid158616t1 kmem_reap_common(1)
T6688726196425:pid158616t1 page_needfree(18446744073709100944); `needfree = 450672 (+ 0), `availrmem = 7865250
T6688726209138:pid158616t1 page_needfree(457108); `needfree = 0 (+ 457108), `availrmem = 7865250
T6688726212820:pid158616t1 kmem_reap_common(1)
T6689726206419:pid158616t1 page_needfree(18446744073709094508); `needfree = 457108 (+ 0), `availrmem = 7865250
T6689726228482:pid158616t1 page_needfree(457108); `needfree = 0 (+ 457108), `availrmem = 7865250
T6689726236492:pid158616t1 kmem_reap_common(1)
T6690726199497:pid158616t1 page_needfree(18446744073709094508); `needfree = 457108 (+ 0), `availrmem = 7865250
T6690726218335:pid158616t1 page_needfree(457108); `needfree = 0 (+ 457108), `availrmem = 7865250
T6690726225600:pid158616t1 kmem_reap_common(1)

Comment by Jon Burgoyne
Created at 2019-03-01T19:31:22.722Z

and JA501571

Started tracing...
T6689537674681:pid157204t1 page_needfree(18446744073709191124); `needfree = 360492 (+ 0), `availrmem = 7957803
T6689537697329:pid157204t1 page_reclaim_mem() -> 0 after 14 iterations, availrmem now 7957803
T6689537790827:pid157204t1 anon_swap_adjust() -> 12
T6689537804672:pid157204t1 as_map() (create) -> 12
T6689538294582:pid157204t1 sptcreate() -> 12
T6689538312918:pid157204t1 shmat() -> c
T6682537959567:pid157204t1 shmat(8, 0)
T6682537977148:pid157204t1 ipc_lookup() -> -2442756587008
struct kshmid {
    kipc_perm_t shm_perm = {
        avl_node_t ipc_avl = {
            struct avl_node *[2] avl_child = [ 0, 0 ]
            uintptr_t avl_pcb = 0xfffffdc762f62611
        }
        list_node_t ipc_list = {
            struct list_node *list_next = 0xfffffdc762f62628
            struct list_node *list_prev = 0xfffffdc768182f98
        }
        uint_t ipc_ref = 0x1
        uid_t ipc_uid = 0x38b
        gid_t ipc_gid = 0
        uid_t ipc_cuid = 0x38b
        gid_t ipc_cgid = 0
        mode_t ipc_mode = 0x8180
        key_t ipc_key = 0x52e2c1
        kproject_t *ipc_proj = 0xfffffe062a0300f8
        uint_t ipc_id = 0x8
        zoneid_t ipc_zoneid = 0x7
        zone_ref_t ipc_zone_ref = {
            struct zone *zref_zone = 0xfffffdccb6910040
            list_node_t zref_linkage = {
                struct list_node *list_next = 0xfffffdc76a7f79e8
                struct list_node *list_prev = 0xfffffdccb69100b8
            }
        }
    }
    size_t shm_segsz = 0x2101fe000
    struct anon_map *shm_amp = 0xfffffdc763b40158
    ushort_t shm_lkcnt = 0
    pgcnt_t shm_lkpages = 0
    kmutex_t shm_mlock = {
        void *[1] _opaque = [ 0 ]
    }
    pid_t shm_lpid = 0
    pid_t shm_cpid = 0x26614
    ulong_t shm_ismattch = 0
    time_t shm_atime = 0
    time_t shm_dtime = 0
    time_t shm_ctime = 0x5c797adc
    struct sptinfo *shm_sptinfo = 0
    struct seg *shm_sptseg = 0
    ulong_t shm_opts = 0
}
T6682537990224:pid157204t1 map_addr(align_hint = 40000000, size = 240000000)
T6682538004848:pid157204t1 map_addr() -> addr = fffff9fd80000000
T6682538013910:pid157204t1 sptcreate()
T6682538163922:pid157204t1 as_map() (create)
T6682538171122:pid157204t1 valid_va_range() -> 1
T6682538177671:pid157204t1 valid_usr_range() -> 0
T6682538184274:pid157204t1 as_addseg()
struct seg {
    caddr_t s_base = 0
    size_t s_size = 0x240000000
    uint_t s_szc = 0
    uint_t s_flags = 0
    struct as *s_as = 0xfffffdc41829c6c0
    avl_node_t s_tree = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        uintptr_t avl_pcb = 0xfffffdc9222df811
    }
    struct seg_ops *s_ops = 0
    void *s_data = 0
    kmutex_t s_pmtx = {
        void *[1] _opaque = [ 0 ]
    }
    pcache_link_t s_phead = {
        struct pcache_link *p_lnext = 0xfffffdcaa0020300
        struct pcache_link *p_lprev = 0xfffffdcaa0020300
    }
}
T6682538192892:pid157204t1 as_addseg() -> 0
T6682538199721:pid157204t1 seg_alloc() -> fffffdcaa00202b0
T6682538207342:pid157204t1 segspt_create()
struct segspt_crargs {
    struct seg *seg_spt = 0x4000
    struct anon_map *amp = 0xfffffdc763b40158
    uint_t prot = 0xf
    uint_t flags = 0x4000
    uint_t szc = 0x2
}
T6682538216858:pid157204t1 anon_swap_adjust(2163198)
struct k_anoninfo {
    pgcnt_t ani_max = 0x7fbffff
    pgcnt_t ani_free = 0x7f04f5b
    pgcnt_t ani_phys_resv = 0x2e8678
    pgcnt_t ani_mem_resv = 0
    pgcnt_t ani_locked_swap = 0
}
T6682538227925:pid157204t1 page_reclaim_mem(2163198, 6155071, 1); availrmem = 7836548, t_minarmem = 25
T6682538243087:pid157204t1 page_needfree(481746); `needfree = 0 (+ 481746), `availrmem = 7836548
T6682538256632:pid157204t1 kmem_reap_common(0)
T6683537686811:pid157204t1 page_needfree(18446744073709069870); `needfree = 481746 (+ 0), `availrmem = 7853212
T6683537713059:pid157204t1 page_needfree(465082); `needfree = 0 (+ 465082), `availrmem = 7853213
T6683537722479:pid157204t1 kmem_reap_common(1)
T6684537681713:pid157204t1 page_needfree(18446744073709086534); `needfree = 465082 (+ 0), `availrmem = 7925310
T6684537703457:pid157204t1 page_needfree(392984); `needfree = 0 (+ 392984), `availrmem = 7925311
T6684537711364:pid157204t1 kmem_reap_common(1)
T6686537687210:pid157204t1 page_needfree(18446744073709196748); `needfree = 354868 (+ 0), `availrmem = 7957802
T6686537708784:pid157204t1 page_needfree(360492); `needfree = 0 (+ 360492), `availrmem = 7957802
T6686537717061:pid157204t1 kmem_reap_common(1)
T6687537671408:pid157204t1 page_needfree(18446744073709191124); `needfree = 360492 (+ 0), `availrmem = 7957802
T6687537683462:pid157204t1 page_needfree(360492); `needfree = 0 (+ 360492), `availrmem = 7957802
T6687537688853:pid157204t1 kmem_reap_common(1)
T6688537669216:pid157204t1 page_needfree(18446744073709191124); `needfree = 360492 (+ 0), `availrmem = 7957802
T6688537680207:pid157204t1 page_needfree(360492); `needfree = 0 (+ 360492), `availrmem = 7957802
T6688537685750:pid157204t1 kmem_reap_common(1)
T6685537676624:pid157204t1 page_needfree(18446744073709158632); `needfree = 392984 (+ 0), `availrmem = 7963426
T6685537698502:pid157204t1 page_needfree(354868); `needfree = 0 (+ 354868), `availrmem = 7963426
T6685537707266:pid157204t1 kmem_reap_common(1)

Comment by Jon Burgoyne
Created at 2019-03-01T19:32:08.467Z

and JA700150

Started tracing...
T8245878739827:pid363695t1 page_needfree(18446744073709182589); `needfree = 369027 (+ 0), `availrmem = 7976482
T8245878764684:pid363695t1 page_reclaim_mem() -> 0 after 18 iterations, availrmem now 7976482
T8245878830802:pid363695t1 anon_swap_adjust() -> 12
T8245878843088:pid363695t1 as_map() (create) -> 12
T8245879145308:pid363695t1 sptcreate() -> 12
T8245879161014:pid363695t1 shmat() -> c
T8236879223522:pid363695t1 shmat(9, 0)
T8236879236562:pid363695t1 ipc_lookup() -> -2442919939520
struct kshmid {
    kipc_perm_t shm_perm = {
        avl_node_t ipc_avl = {
            struct avl_node *[2] avl_child = [ 0, 0 ]
            uintptr_t avl_pcb = 0xfffffdc75e620d91
        }
        list_node_t ipc_list = {
            struct list_node *list_next = 0xfffffdc75e620da8
            struct list_node *list_prev = 0xfffffdc3ead65440
        }
        uint_t ipc_ref = 0x1
        uid_t ipc_uid = 0x38b
        gid_t ipc_gid = 0
        uid_t ipc_cuid = 0x38b
        gid_t ipc_cgid = 0
        mode_t ipc_mode = 0x8180
        key_t ipc_key = 0x52e2c1
        kproject_t *ipc_proj = 0xfffffdc976a46e88
        uint_t ipc_id = 0x9
        zoneid_t ipc_zoneid = 0x6
        zone_ref_t ipc_zone_ref = {
            struct zone *zref_zone = 0xfffffdce069a6ac0
            list_node_t zref_linkage = {
                struct list_node *list_next = 0xfffffdc8f3591e68
                struct list_node *list_prev = 0xfffffdce069a6b38
            }
        }
    }
    size_t shm_segsz = 0x2101fe000
    struct anon_map *shm_amp = 0xfffffdc3e5e1be00
    ushort_t shm_lkcnt = 0
    pgcnt_t shm_lkpages = 0
    kmutex_t shm_mlock = {
        void *[1] _opaque = [ 0 ]
    }
    pid_t shm_lpid = 0
    pid_t shm_cpid = 0x58caf
    ulong_t shm_ismattch = 0
    time_t shm_atime = 0
    time_t shm_dtime = 0
    time_t shm_ctime = 0x5c7980ee
    struct sptinfo *shm_sptinfo = 0
    struct seg *shm_sptseg = 0
    ulong_t shm_opts = 0
}
T8236879247780:pid363695t1 map_addr(align_hint = 40000000, size = 240000000)
T8236879261514:pid363695t1 map_addr() -> addr = fffff9fd80000000
T8236879269606:pid363695t1 sptcreate()
T8236879515406:pid363695t1 as_map() (create)
T8236879522066:pid363695t1 valid_va_range() -> 1
T8236879528153:pid363695t1 valid_usr_range() -> 0
T8236879534650:pid363695t1 as_addseg()
struct seg {
    caddr_t s_base = 0
    size_t s_size = 0x240000000
    uint_t s_szc = 0
    uint_t s_flags = 0
    struct as *s_as = 0xfffffdc3e59d0d28
    avl_node_t s_tree = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        uintptr_t avl_pcb = 0xfffffdc75b676b79
    }
    struct seg_ops *s_ops = 0
    void *s_data = 0
    kmutex_t s_pmtx = {
        void *[1] _opaque = [ 0 ]
    }
    pcache_link_t s_phead = {
        struct pcache_link *p_lnext = 0xfffffdc757cc6d40
        struct pcache_link *p_lprev = 0xfffffdc757cc6d40
    }
}
T8236879542416:pid363695t1 as_addseg() -> 0
T8236879548313:pid363695t1 seg_alloc() -> fffffdc757cc6cf0
T8236879554823:pid363695t1 segspt_create()
struct segspt_crargs {
    struct seg *seg_spt = 0x4000
    struct anon_map *amp = 0xfffffdc3e5e1be00
    uint_t prot = 0xf
    uint_t flags = 0x4000
    uint_t szc = 0x2
}
T8236879562408:pid363695t1 anon_swap_adjust(2163198)
struct k_anoninfo {
    pgcnt_t ani_max = 0x7fbffff
    pgcnt_t ani_free = 0x7ee97a2
    pgcnt_t ani_phys_resv = 0x2fce4d
    pgcnt_t ani_mem_resv = 0
    pgcnt_t ani_locked_swap = 0
}
T8236879571028:pid363695t1 page_reclaim_mem(2163198, 6182286, 1); availrmem = 7916950, t_minarmem = 25
T8236879584423:pid363695t1 page_needfree(428559); `needfree = 0 (+ 428559), `availrmem = 7916950
T8236879595546:pid363695t1 kmem_reap_common(0)
T8237878744918:pid363695t1 page_needfree(18446744073709123057); `needfree = 428559 (+ 0), `availrmem = 7981937
T8237878766965:pid363695t1 page_needfree(363572); `needfree = 0 (+ 363572), `availrmem = 7981937
T8237878774829:pid363695t1 kmem_reap_common(1)
T8238878749677:pid363695t1 page_needfree(18446744073709188044); `needfree = 363572 (+ 0), `availrmem = 7981937
T8238878770164:pid363695t1 page_needfree(363572); `needfree = 0 (+ 363572), `availrmem = 7981937
T8238878777862:pid363695t1 kmem_reap_common(1)
T8239878745222:pid363695t1 page_needfree(18446744073709188044); `needfree = 363572 (+ 0), `availrmem = 7981937
T8239878765549:pid363695t1 page_needfree(363572); `needfree = 0 (+ 363572), `availrmem = 7981937
T8239878773624:pid363695t1 kmem_reap_common(1)
T8240878744585:pid363695t1 page_needfree(18446744073709188044); `needfree = 363572 (+ 0), `availrmem = 7981937
T8240878766922:pid363695t1 page_needfree(363572); `needfree = 0 (+ 363572), `availrmem = 7981937
T8240878774942:pid363695t1 kmem_reap_common(1)
T8241878752143:pid363695t1 page_needfree(18446744073709188044); `needfree = 363572 (+ 0), `availrmem = 7981938
T8241878772271:pid363695t1 page_needfree(363571); `needfree = 0 (+ 363571), `availrmem = 7981938
T8241878780297:pid363695t1 kmem_reap_common(1)
T8242878743082:pid363695t1 page_needfree(18446744073709188045); `needfree = 363571 (+ 0), `availrmem = 7981938
T8242878764202:pid363695t1 page_needfree(363571); `needfree = 0 (+ 363571), `availrmem = 7981938
T8242878772249:pid363695t1 kmem_reap_common(1)
T8243878737095:pid363695t1 page_needfree(18446744073709188045); `needfree = 363571 (+ 0), `availrmem = 7976482
T8243878757193:pid363695t1 page_needfree(369027); `needfree = 0 (+ 369027), `availrmem = 7976482
T8243878764867:pid363695t1 kmem_reap_common(1)
T8244878737083:pid363695t1 page_needfree(18446744073709182589); `needfree = 369027 (+ 0), `availrmem = 7976482
T8244878760721:pid363695t1 page_needfree(369027); `needfree = 0 (+ 369027), `availrmem = 7976482
T8244878769550:pid363695t1 kmem_reap_common(1)

Comment by Jon Burgoyne
Created at 2019-03-01T19:33:05.091Z

and JA700140

Started tracing...
T8991085358881:pid55846t1 page_needfree(18446744073709029751); `needfree = 521865 (+ 0), `availrmem = 7487741
T8991085379174:pid55846t1 page_reclaim_mem() -> 0 after 10 iterations, availrmem now 7487741
T8991085427488:pid55846t1 anon_swap_adjust() -> 12
T8991085441462:pid55846t1 as_map() (create) -> 12
T8991085598936:pid55846t1 sptcreate() -> 12
T8991085614257:pid55846t1 shmat() -> c
T8986085707969:pid55846t1 shmat(5, 0)
T8986085727272:pid55846t1 ipc_lookup() -> -2109714009792
struct kshmid {
    kipc_perm_t shm_perm = {
        avl_node_t ipc_avl = {
            struct avl_node *[2] avl_child = [ 0, 0 ]
            uintptr_t avl_pcb = 0xfffffe1509eac669
        }
        list_node_t ipc_list = {
            struct list_node *list_next = 0xfffffe1509eac680
            struct list_node *list_prev = 0xfffffe14b0968810
        }
        uint_t ipc_ref = 0x1
        uid_t ipc_uid = 0x38b
        gid_t ipc_gid = 0
        uid_t ipc_cuid = 0x38b
        gid_t ipc_cgid = 0
        mode_t ipc_mode = 0x8180
        key_t ipc_key = 0x52e2c1
        kproject_t *ipc_proj = 0xfffffe14b8ebca80
        uint_t ipc_id = 0x5
        zoneid_t ipc_zoneid = 0x5
        zone_ref_t ipc_zone_ref = {
            struct zone *zref_zone = 0xfffffe11487e3a80
            list_node_t zref_linkage = {
                struct list_node *list_next = 0xffffff5242eb59a8
                struct list_node *list_prev = 0xfffffe11487e3af8
            }
        }
    }
    size_t shm_segsz = 0x2101fe000
    struct anon_map *shm_amp = 0xffffff0e35587d30
    ushort_t shm_lkcnt = 0
    pgcnt_t shm_lkpages = 0
    kmutex_t shm_mlock = {
        void *[1] _opaque = [ 0 ]
    }
    pid_t shm_lpid = 0
    pid_t shm_cpid = 0xda26
    ulong_t shm_ismattch = 0
    time_t shm_atime = 0
    time_t shm_dtime = 0
    time_t shm_ctime = 0x5c7983db
    struct sptinfo *shm_sptinfo = 0
    struct seg *shm_sptseg = 0
    ulong_t shm_opts = 0
}
T8986085748133:pid55846t1 map_addr(align_hint = 40000000, size = 240000000)
T8986085768103:pid55846t1 map_addr() -> addr = fffffa7d80000000
T8986085779085:pid55846t1 sptcreate()
T8986085886533:pid55846t1 as_map() (create)
T8986085899171:pid55846t1 valid_va_range() -> 1
T8986085908441:pid55846t1 valid_usr_range() -> 0
T8986085918086:pid55846t1 as_addseg()
struct seg {
    caddr_t s_base = 0
    size_t s_size = 0x240000000
    uint_t s_szc = 0
    uint_t s_flags = 0
    struct as *s_as = 0xfffffe11e885d280
    avl_node_t s_tree = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        uintptr_t avl_pcb = 0xfffffe14b611bf29
    }
    struct seg_ops *s_ops = 0
    void *s_data = 0
    kmutex_t s_pmtx = {
        void *[1] _opaque = [ 0 ]
    }
    pcache_link_t s_phead = {
        struct pcache_link *p_lnext = 0xfffffe14b5072758
        struct pcache_link *p_lprev = 0xfffffe14b5072758
    }
}
T8986085930211:pid55846t1 as_addseg() -> 0
T8986085938331:pid55846t1 seg_alloc() -> fffffe14b5072708
T8986085948216:pid55846t1 segspt_create()
struct segspt_crargs {
    struct seg *seg_spt = 0x4000
    struct anon_map *amp = 0xffffff0e35587d30
    uint_t prot = 0xf
    uint_t flags = 0x4000
    uint_t szc = 0x2
}
T8986085958764:pid55846t1 anon_swap_adjust(2163198)
struct k_anoninfo {
    pgcnt_t ani_max = 0x7fbffff
    pgcnt_t ani_free = 0x7eed75d
    pgcnt_t ani_phys_resv = 0x2f71e7
    pgcnt_t ani_mem_resv = 0
    pgcnt_t ani_locked_swap = 0
}
T8986085971455:pid55846t1 page_reclaim_mem(2163198, 5846383, 1); availrmem = 7483933, t_minarmem = 25
T8986085986902:pid55846t1 page_needfree(525673); `needfree = 0 (+ 525673), `availrmem = 7483933
T8986086001106:pid55846t1 kmem_reap_common(0)
T8987085357138:pid55846t1 page_needfree(18446744073709025943); `needfree = 525673 (+ 0), `availrmem = 7487741
T8987085377271:pid55846t1 page_needfree(521865); `needfree = 0 (+ 521865), `availrmem = 7487741
T8987085383730:pid55846t1 kmem_reap_common(1)
T8988085363917:pid55846t1 page_needfree(18446744073709029751); `needfree = 521865 (+ 0), `availrmem = 7487741
T8988085383535:pid55846t1 page_needfree(521865); `needfree = 0 (+ 521865), `availrmem = 7487741
T8988085390120:pid55846t1 kmem_reap_common(1)
T8990085363723:pid55846t1 page_needfree(18446744073709029751); `needfree = 521865 (+ 0), `availrmem = 7487741
T8990085383237:pid55846t1 page_needfree(521865); `needfree = 0 (+ 521865), `availrmem = 7487741
T8990085389968:pid55846t1 kmem_reap_common(1)
T8989085367901:pid55846t1 page_needfree(18446744073709029751); `needfree = 521865 (+ 0), `availrmem = 7487741
T8989085388323:pid55846t1 page_needfree(521865); `needfree = 0 (+ 521865), `availrmem = 7487741
T8989085395023:pid55846t1 kmem_reap_common(1)

Comment by Jon Burgoyne
Created at 2019-03-01T20:59:37.615Z
Updated at 2019-03-01T21:00:11.773Z

When a node fails to start postgres, the sitter assumes it needs a new dataset, so you will see "zfs recv" in the process table.
To get the node going again, I follow this "kickstart" procedure:

svcadm disable pg_prefaulter
svcadm disable manatee-snapshotter
svcadm disable manatee-backupserver
svcadm disable manatee-sitter

zfs list
zfs rename ...
zfs set canmount=on zones/$(zonename)/data/manatee
zfs set mountpoint=/manatee/pg zones/$(zonename)/data/manatee
zfs mount  zones/$(zonename)/data/manatee

svcadm enable manatee-sitter

manatee-adm show

svcadm enable pg_prefaulter
svcadm enable manatee-snapshotter
svcadm enable manatee-backupserver

That will get postgres running again.


Comment by Former user
Created at 2019-03-28T13:55:48.049Z

Note that previously we will have gone through shmget() to create the segment. For that to work, the segment size is checked against the rctls; as can be seen above, these should be fine. We do, though, also pass through anon_resv().

In our case, this will basically look at:

pswap_pages = k_anoninfo.ani_max - k_anoninfo.ani_phys_resv

struct k_anoninfo {
    pgcnt_t ani_max = 0x7fbffff
    pgcnt_t ani_free = 0x7ee97a2
    pgcnt_t ani_phys_resv = 0x2fce4d
    pgcnt_t ani_mem_resv = 0
    pgcnt_t ani_locked_swap = 0
}
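
Plugging the captured values in (a worked check, assuming 4KB pages):

pswap_pages = ani_max - ani_phys_resv
            = 0x7fbffff - 0x2fce4d
            = 130,822,578 pages ~= 499Gb

against a request of 2,163,198 pages (~8.25Gb), so the physical swap reservation succeeds with room to spare.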

So, since we clearly have the ~8Gb of physical swap we need available, anon_resvmem() happily adjusts k_anoninfo.ani_phys_resv and everything is hunky-dory. It has no way of knowing that we are in fact about to request non-SHM_PAGEABLE memory when we actually try to attach, so we will not go down the path that reserves into ani_mem_resv here.

Note that when we have attached one of these segments, we should find that
those are covered by ani_locked_swap (and ani_mem_resv is at least that value too).

Taking one of these DTrace outputs and re-arranging in time order:

T8236879223522:pid363695t1 shmat(9, 0)
...
T8236879554823:pid363695t1 segspt_create()
...
T8236879562408:pid363695t1 anon_swap_adjust(2163198)

So we got as far as segspt_create(), and:

 543         if ((sptcargs->flags & SHM_PAGEABLE) == 0) {                             
 544                 if (err = anon_swap_adjust(npages))                              
 545                         return (err);                                            
 546         }

$ unit 2163198 p | grep gigabytes
8.251945 gigabytes                                  

So we're asking to lock 8Gb in memory, and need to adjust for that in the memory swap accounting.

3517         unlocked_mem_swap = k_anoninfo.ani_mem_resv                              
3518             - k_anoninfo.ani_locked_swap;                                        
3519         if (npages > unlocked_mem_swap) {                                        
3520                 spgcnt_t adjusted_swap = npages - unlocked_mem_swap;             
3521                                                                                  
3522                 /*                                                               
3523                  * if there is not enough unlocked mem swap we take missing      
3524                  * amount from phys swap and give it to mem swap                 
3525                  */                                                              
3526                 if (!page_reclaim_mem(adjusted_swap, segspt_minfree, 1)) {       
3527                         mutex_exit(&anoninfo_lock);                              
3528                         return (ENOMEM);                                         
3529                 }                                                                

unlocked_mem_swap is zero, as we discussed above. So we must now grab our 8Gb from free memory, i.e. by adjusting availrmem in page_reclaim_mem().
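
Concretely: adjusted_swap = npages - unlocked_mem_swap = 2163198 - 0 = 2163198 pages, which is exactly the first argument in the traced call page_reclaim_mem(2163198, 6182286, 1) above.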


Comment by Former user
Created at 2019-03-28T14:51:13.947Z

We're here:

5706 int                                                                              
5707 page_reclaim_mem(pgcnt_t npages, pgcnt_t epages, int adjust)

T8236879571028:pid363695t1 page_reclaim_mem(2163198, 6182286, 1); availrmem = 7916950, t_minarmem = 25

So we're asking for our npages=2163198, with segspt_minfree=6182286 pages (~23.5Gb). We'll loop, basically temporarily adjusting needfree, kicking off a kmem_reap(), and hoping enough turns up in availrmem for our needs, which is 2163198p + 6182286p + tune.t_minarmem ~= 31Gb. availrmem is at about 30Gb.
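
In rough outline, the loop looks like this (a paraphrased sketch from memory, not the literal vm_page.c source; "attempts" and "max_attempts" are stand-ins for the real retry accounting):

        /* paraphrase: try to reclaim until availrmem can cover the request */
        while (availrmem < npages + epages + tune.t_minarmem) {
                if (++attempts > max_attempts)
                        return (0);             /* give up; caller returns ENOMEM */
                page_needfree(npages);          /* advertise demand to pageout */
                kmem_reap();                    /* ask the kmem caches to shrink */
                delay(hz);                      /* wait a tick, then re-check */
                page_needfree(-(spgcnt_t)npages);
        }
        return (1);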

So we eventually find sadness:

T8245878764684:pid363695t1 page_reclaim_mem() -> 0 after 18 iterations, availrmem now 7976482

So we reaped... a bit? availrmem grew by only 59532 pages (~230Mb). Either way, we're way off what we supposedly need here.

Now, this is all clearly a little bit bonkers because of segspt_minfree. It also explains why a simple reboot of the zone can cause our issue: imagine we have just enough to satisfy the allocation initially (that is, our 8Gb plus the segspt_minfree headroom). After this, something else, such as kmem, eats into availrmem. When we shut down, we release the 8Gb back - so in theory we should be able to get it again. But in the meantime, the other user - which is not bound by segspt_minfree - has changed the equation such that we can no longer get our 8Gb with the segspt_minfree headroom on top.
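
A worked example with round, purely illustrative numbers:

first attach:   availrmem = 34Gb; we need 8Gb + 24Gb segspt_minfree = 32Gb -> OK; availrmem drops to 26Gb
while running:  kmem etc. consumes another 4Gb -> availrmem = 22Gb
zone shutdown:  our 8Gb is released -> availrmem = 30Gb
re-attach:      we need 32Gb again, but only 30Gb is there -> ENOMEM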

This could happen generally, of course. But we're really not short on memory on these systems for what we're doing here. The problem appears to be squarely the unnecessarily large segspt_minfree setting.


Comment by Former user
Created at 2019-03-28T14:58:31.582Z

  62 /*
  63  * segspt_minfree is the memory left for system after ISM                        
  64  * locked its pages; it is set up to 5% of availrmem in                          
  65  * sptcreate when ISM is created.  ISM should not use more                       
  66  * than ~90% of availrmem; if it does, then the performance                      
  67  * of the system may decrease. Machines with large memories may                  
  68  * be able to use up more memory for ISM so we set the default                   
  69  * segspt_minfree to 5% (which gives ISM max 95% of availrmem.                   
  70  * If somebody wants even more memory for ISM (risking hanging                   
  71  * the system) they can patch the segspt_minfree to smaller number.              
  72  */                                                                              
  73 pgcnt_t segspt_minfree = 0;                                                      
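
For reference, the sizing itself happens lazily in sptcreate(); from memory, the logic is essentially the following (check the current seg_spt.c before relying on it):

        if (segspt_minfree == 0)
                segspt_minfree = availrmem / 20;        /* 5% of availrmem */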

Of course, this comment is pretty ancient. The old Solaris tuning doc says this:

When to Change

    On database servers with large amounts of physical memory using ISM, this parameter can be tuned downward. If ISM segments are not used, this parameter has no effect. A maximum value of 128 Mbytes (0x4000) is almost certainly sufficient on large memory machines.

which is... slightly different... from the 23Gb we're getting right now.

It seems clear we can tune this down, especially on our systems where we should be protected by the zone rctls anyway. What's not yet obvious is what the calculation should be, and whether it's something we can deploy by default, or if we need /etc/system tweaks.


Comment by Former user
Created at 2019-03-28T15:41:32.876Z

Since this value can be varied dynamically at runtime, one thing we might try is:

echo "segspt_minfree/z 0x40000" | mdb -kw

on some asyncs (that is, 1Gb), and see how we go. We should probably verify what availrmem is first, though.
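
Sanity-checking that value: 0x40000 pages * 4096 bytes/page = 0x40000000 bytes = 1Gb. The current availrmem can be read with something like:

echo "availrmem/E" | mdb -k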


Comment by Former user
Created at 2019-03-28T22:19:14.719Z

This analysis looks great to me. I'm wondering whether, instead of doing something in /etc/system, we could modify sptcreate() where it sets segspt_minfree. We could adjust the calculation so that if we're on a machine with a larger amount of memory (say 128GB or greater), we use a smaller percentage: maybe instead of 5% (divide by 20), we could go to about 3% (divide by 35), or even less. We would leave the current calculation for smaller memory systems. We could carry this small change as a smartos-specific fix without any issues. Of course, I'll defer to you if you have a better approach for fixing this.
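
As a sketch of that proposal (illustrative only - the 128GB threshold and /35 divisor are just the numbers floated above):

        if (segspt_minfree == 0) {
                if (physmem >= btop(128ULL << 30))      /* "large memory" machine */
                        segspt_minfree = availrmem / 35;        /* ~3% */
                else
                        segspt_minfree = availrmem / 20;        /* existing 5% */
        }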


Comment by Former user
Created at 2019-03-28T22:39:08.911Z

Thanks for taking a look, Jerry. My plan is probably still to fix the segspt_minfree calculation at source, but that potentially has a wider impact. I was actually thinking of just clamping it at some maximum value - perhaps 1Gb.
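
That is, something like the following in sptcreate() (a sketch, with the suggested 1Gb clamp expressed in pages):

        if (segspt_minfree == 0)
                segspt_minfree = MIN(availrmem / 20, btop(1ULL << 30));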


Comment by Former user
Created at 2019-03-29T14:04:27.271Z

One thing that's only just occurred to me as well: this limit is instantiated lazily on the first shmat(). So when it looks at availrmem, that could in theory happen at any point in the system's lifetime. That we ended up with a high value here is to some degree happenstance. This doesn't seem good either.


Comment by Former user
Created at 2019-03-29T21:19:19.368Z

Yes, it seems like the late calculation could really cause erratic results here. Offhand I am not sure what the best solution is to this.


Comment by Former user
Created at 2019-07-08T15:12:59.809Z

A colleague collected script output across a whole set of systems in central-1a
under CM-2899.

There were basically three classes of output:

=== Output from 00000000-0000-0000-0000-ac1f6b6c02e6 (JA501584):
Host UUID: UUID='00000000-0000-0000-0000-ac1f6b6c02e6'
Total memory pages: 133868288 ( 522923 Mb )
availrmem pages: 10237319 ( 39989 Mb )
largest_ism_pages: 2163198 ( 8449 Mb )
segspt_minfree_pages: 6204938 ( 24238 Mb )
{
    ani_max = 0x7fbffff
    ani_free = 0x7a8844e
    ani_phys_resv = 0x60af91
    ani_mem_resv = 0x210203
    ani_locked_swap = 0x210203
}
A restart would require: 8368136 ( 32688 Mb )
from availrmem: 10237319 ( 39989 Mb )
System seems OK?

This first variant has the far-too-high segspt_minfree, more or less as we'd expect.
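
(The "restart would require" figure is evidently largest_ism_pages + segspt_minfree_pages - here, 2163198 + 6204938 = 8368136 pages - which the script then compares against availrmem.)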

=== Output from 00000000-0000-0000-0000-ac1f6b6c020c (JA501583):
Host UUID: UUID='00000000-0000-0000-0000-ac1f6b6c020c'
Total memory pages: 133868288 ( 522923 Mb )
availrmem pages: 5185914 ( 20257 Mb )
largest_ism_pages: 2163198 ( 8449 Mb )
segspt_minfree_pages: 6167559 ( 24092 Mb )
{
    ani_max = 0x7fbffff
    ani_free = 0x7ea4c76
    ani_phys_resv = 0x13527d
    ani_mem_resv = 0x210203
    ani_locked_swap = 0x210203
}
A restart would require: 8330757 ( 32542 Mb )
from availrmem: 5185914 ( 20257 Mb )
WARNING: restarting postgres might fail on this system

This second variant has a far-too-high setting, but also doesn't really have enough availrmem around. We thus
predict that a restart will fail.

=== Output from 00000000-0000-0000-0000-ac1f6b0580aa (HA127636):
Host UUID: UUID='00000000-0000-0000-0000-ac1f6b0580aa'
Total memory pages: 134189056 ( 524176 Mb )
availrmem pages: 11337199 ( 44285 Mb )
largest_ism_pages: 2163198 ( 8449 Mb )
segspt_minfree_pages: 285374 ( 1114 Mb )
{
    ani_max = 0x7ffffff
    ani_free = 0x78e2f38
    ani_phys_resv = 0x7eb1ff
    ani_mem_resv = 0x210203
    ani_locked_swap = 0x210203
}
A restart would require: 2448572 ( 9564 Mb )
from availrmem: 11337199 ( 44285 Mb )
System seems OK?

These last ones are interesting, as somehow segspt_minfree_pages is much more reasonable. It's
possible that this is due to my previous comment regarding happenstance. Also of interest is that the
value is very close to my suggested arbitrary clamp of 1Gb. This is hopefully reasonable evidence
that a clamp there is going to work OK.


Comment by Former user
Created at 2019-07-08T16:06:37.543Z

I put all of them through the attached awk script to produce nawk.out.

All of these systems have ~512Gb RAM, though it varies slightly. For the most
part, the minfree sizing is one of the two values mentioned above: either ~1Gb
or ~24Gb. availrmem varies from ~20Gb to ~356Gb. In all FAILing cases,
the required amount is, as expected, ~32Gb.

In summary, then, the machine data from CM-2899 is as we expected. As Jon
mentions in the sub-tasks, all the failed restarts were predicted by the script
(there are a lot of false positives too, but that's not a surprise).

I think the next step is probably to actually try out a PI with the OS-7244 fix in place.


Comment by Former user
Created at 2019-08-09T14:59:46.084Z
Updated at 2019-08-09T15:21:51.995Z

What is the next step for the OS-7244 fix? If we've done the best we can in a test environment, I'm open to rebooting some asyncs into a PI with the fix in our datacenters. Have we observed the impact of an undersized segspt_minfree?


Comment by Former user
Created at 2019-08-09T15:03:33.259Z

I think Max reproduced the failure with an artificially high minfree setting, but I'm not too sure how useful that will prove. Ultimately I think it's time to try out the PI on an async or two and closely monitor them. I don't expect issues with an undersized minfree value, so if we do have problems there, it's hard to predict what shape they might take. Certainly, monitoring availrmem, general memory usage, etc. is a good idea.


Comment by Former user
Created at 2019-08-09T15:06:47.760Z
Updated at 2019-08-09T15:06:59.405Z

Understood. Is the change hotpatchable, or do we need to reboot into a new PI to take the fix?


Comment by Former user
Created at 2019-08-09T15:10:25.941Z

So it's a bit of both. We could do:

https://github.com/jlevon/grot/blob/master/manatee-381/manatee-381.sh apply-workaround

on running systems. But that doesn't really test the initialization we do on first attach in the "proper" OS-7244 fix - and that part can't be meaningfully hot-fixed. It should stop the specific problem of MANATEE-381, though.

So we could maybe try the script first? It really depends on how invasive reboots would be from an ops POV.


Comment by Former user
Created at 2019-08-14T16:04:26.333Z

I'll build a PI based upon the release-20190718 branch and point to it here.

No additional steps for the workaround are needed, though recording the output somewhere will be useful.
I'm not yet familiar with hotpatches, but happy to help out wherever.


Comment by Former user
Created at 2019-08-15T09:19:17.595Z

I built a PI:

/jlevon/public/bits/MANATEE-381/platform-20190814T223007Z.tgz

This is the origin/release-20190718 branch of all repos, except with my fix for illumos-joyent on top.
I've briefly tested this on my lab installation.

I realised that perhaps you need a buildstamp based on the original release version, though? I can rebuild if so.


Comment by Former user
Created at 2019-09-20T13:57:31.409Z

(Hopefully) resolved via OS-7244


Comment by Jira Bot
Created at 2019-12-20T21:44:39.559Z

hotpatches commit f6e1b63612189ad2c4a4250a2c5517c3530defdb (branch master, by Michael Hicks)

added jlevon's original script with a check and hotpatch version
of the MANATEE-381 hotpatch


Comment by Jira Bot
Created at 2019-12-20T22:09:16.856Z

hotpatches commit b73c22f0c952c2bb6c3f2561bf7c0196e3e7b924 (branch master, by Michael Hicks)

added an unpatch and had the patcher output old val to a temp file
```
~/gits/joyent/hotpatches $ bash -n ./MANATEE-381/MANATEE-381-unpatch.sh
~/gits/joyent/hotpatches $ echo $?
0
~/gits/joyent/hotpatches $ bash -n ./MANATEE-381/MANATEE-381-unpatch.sh ; echo $?
0
~/gits/joyent/hotpatches $ bash -n ./MANATEE-381/MANATEE-381-hotpatch.sh ; echo $?
0
```


Comment by Jira Bot
Created at 2019-12-20T22:29:02.835Z

hotpatches commit 902e9b6cf8234701885645f08a9c7b2a59a7a941 (branch master, by Michael Hicks)

fixed missing old val stash path used for unpatching created in the MANATEE-381 hotpatch script


Comment by Jira Bot
Created at 2019-12-23T23:15:34.009Z

hotpatches commit 20b87ba484f84eadbff0b3cb3ed159c80e009b53 (branch master, by Michael Hicks)

Updated the MANATEE-381 check and patch scripts to go ahead
and patch for all values higher than we were setting