OS-6363: system went to dark side of moon for ~467 seconds

Details

Issue Type: Bug
Priority: 4 - Normal
Status: Resolved
Created at: 2017-09-23T00:11:48.000Z
Updated at: 2023-10-17T15:34:56.536Z

People

Created by: Former user
Reported by: Former user
Assigned to: Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2017-10-25T17:50:26.000Z)

Fix Versions

2017-10-26 Da-Chao (Release Date: 2017-10-26)

Description

As part of SCI-227, we had a system (HA8V3DRD2) that seemingly stopped for an extended period of time. Going through the logs of various agents that run with some periodicity revealed that this period lasted at most 467 seconds, from 2017-09-21T04:29:05.015Z to 2017-09-21T04:36:53.644Z; here (for example) is the cmon-agent log:

[root@HA8V3DRD2 (us-east-1b) ~]# cat /var/log/cmon-agent/cmon-agent_HA8V3DRD2_2017-09-21T05:00:00.log | json -ga time req.headers.date
...
2017-09-21T04:28:37.940Z Thu, 21 Sep 2017 04:28:37 GMT
2017-09-21T04:28:38.465Z Thu, 21 Sep 2017 04:28:38 GMT
2017-09-21T04:28:45.018Z Thu, 21 Sep 2017 04:28:45 GMT
2017-09-21T04:28:47.602Z Thu, 21 Sep 2017 04:28:47 GMT
2017-09-21T04:28:47.965Z Thu, 21 Sep 2017 04:28:47 GMT
2017-09-21T04:28:48.457Z Thu, 21 Sep 2017 04:28:48 GMT
2017-09-21T04:28:52.737Z Thu, 21 Sep 2017 04:28:52 GMT
2017-09-21T04:28:55.016Z Thu, 21 Sep 2017 04:28:55 GMT
2017-09-21T04:28:55.268Z Thu, 21 Sep 2017 04:28:55 GMT
2017-09-21T04:28:55.628Z Thu, 21 Sep 2017 04:28:55 GMT
2017-09-21T04:28:56.123Z Thu, 21 Sep 2017 04:28:56 GMT
2017-09-21T04:28:57.615Z Thu, 21 Sep 2017 04:28:57 GMT
2017-09-21T04:28:58.036Z Thu, 21 Sep 2017 04:28:58 GMT
2017-09-21T04:28:58.488Z Thu, 21 Sep 2017 04:28:58 GMT
2017-09-21T04:29:05.015Z Thu, 21 Sep 2017 04:29:05 GMT
2017-09-21T04:36:53.644Z Thu, 21 Sep 2017 04:32:35 GMT
2017-09-21T04:36:53.645Z Thu, 21 Sep 2017 04:32:35 GMT
2017-09-21T04:36:53.645Z Thu, 21 Sep 2017 04:32:35 GMT
2017-09-21T04:36:53.646Z Thu, 21 Sep 2017 04:32:32 GMT
2017-09-21T04:36:53.646Z Thu, 21 Sep 2017 04:32:28 GMT
2017-09-21T04:36:53.646Z Thu, 21 Sep 2017 04:32:27 GMT
2017-09-21T04:36:53.647Z Thu, 21 Sep 2017 04:32:27 GMT
2017-09-21T04:36:53.647Z Thu, 21 Sep 2017 04:32:25 GMT
2017-09-21T04:36:53.648Z Thu, 21 Sep 2017 04:32:18 GMT
...

Here are the two consecutive entries that represent the leap:

{
  "name": "cmon-agent",
  "hostname": "HA8V3DRD2",
  "pid": 8168,
  "req_id": "cbfead4c-37e2-4ff1-a15e-86a7eae8d36b",
  "route": "getmetricsforcontainer",
  "audit": true,
  "level": 30,
  "remoteAddress": "10.65.2.197",
  "remotePort": 47831,
  "req": {
    "query": "",
    "method": "GET",
    "url": "/v1/0a2134db-73f9-458c-9c79-e7e29234e566/metrics",
    "headers": {
      "accept": "text/plain",
      "user-agent": "restify/4.1.1 (x64-sunos; v8/4.5.103.46; OpenSSL/1.0.2k) node/4.8.1",
      "date": "Thu, 21 Sep 2017 04:29:05 GMT",
      "host": "10.65.1.113:9163",
      "connection": "keep-alive"
    },
    "httpVersion": "1.1",
    "trailers": {},
    "version": "*",
    "timers": {
      "basicResReq": 16,
      "bunyan": 75,
      "apiGetMetrics": 7904
    }
  },
  "res": {
    "statusCode": 200,
    "headers": {
      "content-type": "text/plain",
      "content-length": 1512,
      "date": "Thu, 21 Sep 2017 04:29:05 GMT",
      "server": "cmon-agent",
      "x-request-id": "cbfead4c-37e2-4ff1-a15e-86a7eae8d36b",
      "x-response-time": 8,
      "x-server-name": "HA8V3DRD2"
    },
    "trailer": false
  },
  "latency": 8,
  "_audit": true,
  "msg": "handled: 200",
  "time": "2017-09-21T04:29:05.015Z",
  "v": 0
}
{
  "name": "cmon-agent",
  "hostname": "HA8V3DRD2",
  "pid": 8168,
  "req_id": "960063c1-5c42-4a5d-a812-42c6f87b5a68",
  "route": "getmetricsforcontainer",
  "audit": true,
  "level": 30,
  "remoteAddress": "10.65.2.188",
  "remotePort": 43252,
  "req": {
    "query": "",
    "method": "GET",
    "url": "/v1/fb6d6ba1-bec3-46e8-8496-72a4ea458a61/metrics",
    "headers": {
      "accept": "text/plain",
      "user-agent": "restify/4.1.1 (x64-sunos; v8/4.5.103.46; OpenSSL/1.0.2k) node/4.8.1",
      "date": "Thu, 21 Sep 2017 04:32:35 GMT",
      "host": "10.65.1.113:9163",
      "connection": "keep-alive"
    },
    "httpVersion": "1.1",
    "trailers": {},
    "version": "*",
    "timers": {
      "basicResReq": 4,
      "bunyan": 15,
      "apiGetMetrics": 10771977
    }
  },
  "res": {
    "statusCode": 200,
    "headers": {
      "content-type": "text/plain",
      "content-length": 1520,
      "date": "Thu, 21 Sep 2017 04:36:53 GMT",
      "server": "cmon-agent",
      "x-request-id": "960063c1-5c42-4a5d-a812-42c6f87b5a68",
      "x-response-time": 10772,
      "x-server-name": "HA8V3DRD2"
    },
    "trailer": false
  },
  "latency": 10772,
  "_audit": true,
  "msg": "handled: 200",
  "time": "2017-09-21T04:36:53.644Z",
  "v": 0
}

Note that apiGetMetrics indicates that it took 10.77 seconds -- but that still doesn't account for the ~4 minute delay between when this request was sent and when we replied to it, nor the (reasonably inferred) ~7 minute gap between responses to requests that otherwise appear to arrive quite regularly.

In terms of what happened here, it unfortunately remains a mystery, though looking through kstats yields some important data points. First, we know that this box has spent more time in its kmem_taskq than any other box in us-east-1b:

[bcantrill@headnode (us-east-1b) ~]$ sdc-oneachnode -c "kstat -p -n kmem_taskq -s totaltime" | sort -n +2
...
HA8SCHRD2             unix:0:kmem_taskq:totaltime       1191967571064
HA8TYKRD2             unix:0:kmem_taskq:totaltime       1255331494271
HA8SJMRD2             unix:0:kmem_taskq:totaltime       1279988012649
HA8SJHRD2             unix:0:kmem_taskq:totaltime       1503290695784
HA8V3DRD2             unix:0:kmem_taskq:totaltime       2269241731871
...

This indicates that over its ~5 day uptime, the box spent 2,269 seconds running the kmem_taskq.
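
For reference, totaltime is reported in nanoseconds; a quick shell check of the figure above:

$ echo $(( 2269241731871 / 1000000000 ))
2269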

We also know that this box saw kmem reap activity in its abd_chunk cache -- which is significant because this box has a very large delta between the maximum number of buffers in the abd_chunk cache and its current number of buffers:

...
HA8V3MRD2 43784064
HA8SKDRD2 44533704
HA8V3FRD2 44561796
HA8TYKRD2 45114324
HA8SKFRD2 46961564
HA8SJHRD2 48019832
HA8SJNRD2 50316444
HA8VNCRD2 56125808
HA8V3LRD2 57432784
HA8V3DRD2 59033708

Which is to say: a kmem_reap may have freed as many as 59 million buffers. This still doesn't feel significant enough to explain the long gap, but the kmem reap activity may correlate with other activity on the machine during its pass around the dark side of the moon.
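
As a sketch of how such a per-CN delta can be gathered -- this assumes the standard kmem cache kstats buf_max and buf_total; the exact fields behind the numbers above aren't recorded here:

max=$(kstat -p unix:0:abd_chunk:buf_max | awk '{ print $2 }')
cur=$(kstat -p unix:0:abd_chunk:buf_total | awk '{ print $2 }')
echo "$(hostname) $(( max - cur ))"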

To explore this, we are running the following D script:

#pragma D option bufpolicy=ring
#pragma D option bufsize=16m
#pragma D option quiet
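
/*
 * Keep a ring buffer of 1 Hz stack samples from every CPU.  "proc" records
 * the last time a process was created; if no process has been created for
 * 20 seconds (a proxy for the system having stalled), exit and dump the
 * ring buffer, preserving the samples that cover the stall.
 */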

BEGIN
{
        proc = timestamp;
}

profile-1hz
{
        printf("time=%d cpu=%d", walltimestamp / 1000000000, cpu);
        stack(50);
}

proc:::create
{
        proc = timestamp;
}

tick-1sec
/timestamp - proc > 20000000000/
{
        exit(0);
}

This is running as OS-6363-mon.d on all metadata nodes in all DCs.

Comments

Comment by Former user
Created at 2017-09-26T16:18:35.000Z

These files show flame graphs of 1 Hz samples taken over a protracted period on the metadata nodes (specifically, those running a Moray instance) in each DC.


Comment by Former user
Created at 2017-09-28T05:39:55.000Z
Updated at 2017-10-24T20:05:04.000Z

With a second machine having entered a prolonged stall of this kind (and with the script designed to capture this activity having spuriously exited some hours earlier), we are now going to run the following as OS-6363-kmem.d:

#pragma D option quiet
#pragma D option destructive
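
/*
 * Trace entry/return of arc_kmem_reap_now() and of each kmem cache and
 * magazine reap (noting freemem at each point), count htable steals,
 * take 1 Hz stack samples whenever the kmem_taskq thread is on CPU, and
 * print a once-a-second summary of freemem, needfree, the steal count,
 * ARC size and target, ARC reclaim waiters, load, and the kmem_taskq
 * backlog.
 */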

arc_kmem_reap_now:entry
{
	printf("%d ts=%d freemem=%d -> arc_kmem_reap_now\n",
	    walltimestamp / 1000000000, timestamp, `freemem);
}

arc_kmem_reap_now:return
{
	printf("%d ts=%d freemem=%d <- arc_kmem_reap_now\n",
	    walltimestamp / 1000000000, timestamp, `freemem);
}

kmem_cache_reap:entry
{
	self->cp = args[0];
	printf("%d freemem=%d -> cache reap %s\n", walltimestamp / 1000000000,
	    `freemem, stringof(args[0]->cache_name));
}

kmem_cache_reap:return
/self->cp != NULL/
{
	printf("%d freemem=%d <- cache reap %s\n", walltimestamp / 1000000000,
	    `freemem, stringof(self->cp->cache_name));
	self->cp = NULL;
}

kmem_depot_ws_reap:entry
{
	self->depot = args[0];
	printf("%d freemem=%d -> mag reap %s\n", walltimestamp / 1000000000,
	    `freemem, stringof(args[0]->cache_name));
}

kmem_depot_ws_reap:return
/self->depot != NULL/
{
	printf("%d freemem=%d <- mag reap %s\n", walltimestamp / 1000000000,
	    `freemem, stringof(self->depot->cache_name));
	self->depot = NULL;
}

htable_steal_active:entry
{
	steals++;
}

profile-1hz
/arg0 != 0 && curthread == `kmem_taskq->tq_thr._tq_thread/
{
	printf("%d kmem_taskq cpu=%d pc=%a", walltimestamp / 1000000000, cpu, arg0);
	stack(60);
}

tick-1sec
{
	printf("%d freemem=%d needfree=%d steals=%d arc_size=%d arc_c=%d waiters=%d load=%d tq=%d\n",
	    walltimestamp / 1000000000,
	    `freemem, `needfree, steals,
	    `arc_stats.arcstat_size.value.ui64,
	    `arc_stats.arcstat_c.value.ui64,
	    ((condvar_impl_t *)&`arc_reclaim_waiters_cv)->cv_waiters,
	    `loadavg.lg_total,
	    `kmem_taskq->tq_tasks - `kmem_taskq->tq_executed);
}

Comment by Former user
Created at 2017-09-29T16:19:29.000Z

While no machine has done a full kmem reap since the script was deployed, there has been some ARC-induced reaping of kmem caches, which has allowed some opportunity to observe its behavior. This flame graph shows time spent in the kmem_taskq, which may or may not be relevant to the black hole.


Comment by Former user
Created at 2017-10-03T07:25:03.000Z

We have observed full reaps on machines that had another interesting property: high numbers of waiters on the arc_reclaim_waiters_cv. This points to a value of zfs_arc_overflow_shift that may be too high: currently, we throttle all ARC users if arc_size is 1/256th greater than arc_c. This isn't much wiggle room, and it seems that we can give the system more room to work by setting this tunable to (say) 4, yielding a 1/16th margin on arc_size versus arc_c. To do this, we are going to run the following script as OS-6363-overflow.sh:

echo "zfs_arc_overflow_shift/W 4" | mdb -kw

Short and sweet!
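
To make the margin concrete, a rough back-of-the-envelope -- this assumes the overflow threshold is simply arc_c plus (arc_c >> zfs_arc_overflow_shift), and uses a hypothetical 512 GiB arc_c:

$ arc_c=$(( 512 * 1024 * 1024 * 1024 ))
$ echo "$(( arc_c >> 8 )) bytes of headroom at the default shift of 8"
2147483648 bytes of headroom at the default shift of 8
$ echo "$(( arc_c >> 4 )) bytes of headroom at a shift of 4"
34359738368 bytes of headroom at a shift of 4

That is, only 2 GiB of slack on a 512 GiB ARC by default, versus 32 GiB at a shift of 4.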


Comment by Former user
Created at 2017-10-05T16:13:43.000Z

Looks like we saw this again on shard 15. I think we may want to change OS-6363-overflow.sh to set zfs_arc_overflow_shift to a lower value (3, 2, or even 1).


Comment by Former user
Created at 2017-10-16T16:08:03.000Z

A flame graph of the 19 minute kmem reap on HA99WLND2.


Comment by Former user
Created at 2017-10-16T18:41:21.000Z
Updated at 2017-10-31T18:14:27.000Z

The black hole from HA99WLND2 shows vividly what's happening: for reasons presumably related to a Moray restart, we see demand on system memory starting shortly after 18:47 UTC. This kicks off a kmem reap, as demonstrated by the shaded region (which denotes the number of tasks in the kmem_taskq). At the time the reap starts, we can see that the ARC size is very near its arc_c target -- but as the reap continues over the next ~19 minutes or so, the ARC continues to grow above its target. The ARC is growing because the ARC reclaim thread is itself attempting to reap its own kmem caches (pointlessly, of course, as a reap is already running) and -- importantly -- is blocked waiting for the reap to complete. This might seem sensible, but this case highlights the peril of that: a reap can take an exceedingly long time, and it's really designed to be purely asynchronous with respect to the rest of the system; blocking on a reap (if implicitly) is unbounded. That ARC reclaim is blocked on this wouldn't itself necessarily be so catastrophic were it not for the hard-throttle of arc_reclaim_waiters_cv: once the ARC is deemed to be overflowing (by arc_is_overflowing()), essentially all ARC transit will cease, corking much of the system. (On HA99WLND2, zfs_arc_overflow_shift has been tuned to 1 -- which was still not low enough to avert the hard-throttle.)
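
For anyone chasing this live, the waiter count on arc_reclaim_waiters_cv (the same field that OS-6363-kmem.d prints) is the tell; a hypothetical mdb one-liner to spot the hard-throttle in the act:

echo "arc_reclaim_waiters_cv::print condvar_impl_t cv_waiters" | mdb -k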

The fix here (at least) is that ARC reclaim-induced kmem reaping must not (ever) block on freemem-induced kmem reaping, which requires some rejiggering to assure that ARC reclaim doesn't overwhelm the kmem_taskq with work.


Comment by Former user
Created at 2017-10-18T21:33:08.000Z
Updated at 2017-10-18T23:35:50.000Z

I've been trying to test this fix. Basically, on a test system, I've been driving the system into a low-memory state while doing lots of writes of generated data.

To do this, I've been running multiple copies of a data-generator (I happened to pick the pg-tpch dbgen program: https://github.com/2ndQuadrant/pg-tpch ), concurrently with a program that just allocates memory and runs forever.
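
For the record, a hypothetical sketch of that load (the actual commands and the allocator program aren't attached here; paths, counts and scale factor are invented):

# several concurrent dbgen runs generating data to write out
for i in 1 2 3 4; do
        ( cd /var/tmp/tpch-$i && ./dbgen -s 10 ) &
done

# crude stand-in for the "allocate memory and run forever" program:
# hold ~1 GiB resident and sleep indefinitely
perl -e '$hog = "x" x (1 << 30); sleep' &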

When running the DTrace setup from the comments above on an unpatched system, I get traces which look like:

1508353355 freemem=50066 needfree=0 steals=0 arc_size=3912302608 arc_c=3904304384 waiters=0 load=7605687317 tq=0
1508353356 freemem=49271 needfree=0 steals=0 arc_size=3919178944 arc_c=3904304384 waiters=0 load=8596622741 tq=0
1508353356 ts=394272655244 freemem=48892 -> arc_kmem_reap_now
1508353356 freemem=48896 -> mag reap zio_buf_512
[ . . . additional magazine reaps . . . ]
1508353356 freemem=59613 -> mag reap zfs_file_data_32768
1508353356 freemem=59613 <- mag reap zfs_file_data_32768
1508353356 ts=394343990205 freemem=59613 <- arc_kmem_reap_now
1508353357 freemem=56991 needfree=0 steals=0 arc_size=3910295928 arc_c=3904304384 waiters=0 load=7852120793 tq=0

which shows arc_kmem_reap_now() waiting for all of its reaps to complete.

Post-patch, a similar trace looks like:

1508358387 freemem=76037 needfree=0 steals=0 arc_size=3788281528 arc_c=3856797184 waiters=0 load=9757848245 tq=0
1508358388 ts=229028507313 freemem=49066 -> arc_kmem_reap_now
1508358388 freemem=49066 -> mag reap zio_buf_512
1508358388 freemem=49066 <- mag reap zio_buf_512
1508358388 freemem=49066 -> mag reap zio_data_buf_512
1508358388 freemem=49066 <- mag reap zio_data_buf_512
1508358388 ts=229028595357 freemem=49066 <- arc_kmem_reap_now
1508358388 freemem=49081 -> mag reap zio_buf_1024
1508358388 freemem=49081 <- mag reap zio_buf_1024
[ . . . additional magazine reaps . . . ]

which shows arc_kmem_reap_now() triggering reaps, but returning without waiting for them.

Adding the task-queue length to the arc_kmem_reap_now() DTrace entry/return probes, we see this on a patched system:

[ system is running low on memory ]
1508361426 freemem=27567 needfree=0 steals=0 arc_size=20455304 arc_c=3800827392 waiters=0 load=4058247801 tq=0
1508361426 freemem=21613 -> cache reap kmem_magazine_1
[ non-arc-triggered kmem-reap ongoing ]
1508361426 freemem=21218 <- mag reap kmem_bufctl_audit_cache
1508361427 ts=3267862029396 freemem=22767 tq=447 -> arc_kmem_reap_now
1508361427 ts=3267862040183 freemem=22768 tq=447 <- arc_kmem_reap_now
1508361427 freemem=26452 <- cache reap kmem_bufctl_audit_cache
1508361427 freemem=26452 -> cache reap kmem_va_4096
[ non-arc-triggered kmem-reap ongoing ]
1508361427 freemem=133165 needfree=0 steals=0 arc_size=20455304 arc_c=904207360 waiters=0 load=4020375233 tq=79
1508361438 freemem=759052 needfree=0 steals=0 arc_size=905221224 arc_c=904207360 waiters=0 load=5649129227 tq=0
[ task-queue-length back to zero, lots of memory has been freed, reap ran for twelve seconds ]

which shows that arc_kmem_reap_now() returns without putting more work onto the kmem task queue, and without waiting for the reap that is already in progress.


Comment by Former user
Created at 2017-10-25T17:50:09.000Z

illumos-joyent commit daa3911 (branch master, by Tim Kordas)

OS-6363 system went to dark side of moon for ~467 seconds
OS-6404 ARC reclaim should throttle its calls to arc_kmem_reap_now()
Reviewed by: Bryan Cantrill <bryan@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Approved by: Bryan Cantrill <bryan@joyent.com>


Comment by Dan McDonald
Created at 2023-10-17T15:34:56.364Z

This issue is fixed upstream as https://illumos.org/issues/9284