We have convincing (but not conclusive) data suggesting that on electric-moray servers with large numbers of shards, the sheer number of fast clients, multiplied by the 30 recent requests held per client, adds up to enough memory that electric-moray spends most of its time GC'ing.
I will add a more detailed analysis to this ticket later, but for now: this change is at least extremely low-risk and appears likely to help.
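As a rough illustration of why this adds up (all numbers below are hypothetical, chosen only to show the shape of the problem, not measured from this deployment):

```javascript
/*
 * Back-of-the-envelope sketch: memory retained by recent-RPC bookkeeping
 * across all fast clients in one electric-moray process. All inputs are
 * hypothetical illustration values, not measurements.
 */
function recentRpcMemory(nshards, clientsPerShard, nrecent, bytesPerRpc) {
	/* Each fast client holds `nrecent` completed requests for debugging. */
	return (nshards * clientsPerShard * nrecent * bytesPerRpc);
}

/*
 * e.g., 100 shards, 16 clients per shard, 30 recent requests per client,
 * and roughly 20KiB retained per request object (argument buffers, timers,
 * and so on) comes to nearly 1GB of heap per process.
 */
var bytes = recentRpcMemory(100, 16, 30, 20 * 1024);
console.log(Math.round(bytes / (1024 * 1024)) + ' MB');
```

With those same assumptions, dropping the per-client count from 30 to 5 cuts the retained memory by a factor of six.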
Former user commented on 2018-02-07T23:47:46.066-0500:
Although I had to bump up the server start timeout for my test machine, PS1 passes the moray test suite using Node v0.10.48.
Former user commented on 2018-02-07T23:59:56.321-0500:
I kicked off the stress-client.js test, let it run a few seconds, then took a gcore. Then I found the fast client for the "successful RPCs" test. It shows 5 recent RPCs as expected, despite having completed many more than that:
"fc_nrecent": 5,
...
"fc_nrpc_started": 75,
"fc_nrpc_done": 74,
"fc_recentrpc": [
[...],
[...],
[...],
[...],
[...],
],
That concludes the testing I planned to do for this change.
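For context, the bookkeeping verified above can be sketched like this (a hypothetical simplification for illustration; node-fast's actual implementation differs):

```javascript
/*
 * Hypothetical sketch of bounded recent-RPC bookkeeping, in the spirit of
 * what the gcore output above verifies (not node-fast's actual code): only
 * the last `nrecent` completed requests are retained for debugging.
 */
function RecentRpcs(nrecent) {
	this.fc_nrecent = nrecent;
	this.fc_recentrpc = [];
	this.fc_nrpc_done = 0;
}

RecentRpcs.prototype.rpcDone = function (rpc) {
	this.fc_nrpc_done++;
	this.fc_recentrpc.push(rpc);

	/* Discard the oldest entry once we exceed the configured limit. */
	if (this.fc_recentrpc.length > this.fc_nrecent)
		this.fc_recentrpc.shift();
};
```

After 74 completed RPCs with a limit of 5, only the 5 most recent entries remain, matching the counters in the gcore output.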
Jira Bot commented on 2018-02-08T15:06:05.580-0500:
node-moray commit 2a55cc89030dc0c7b488a515147cab4aac6041b0 (branch master, by David Pacheco)
MORAY-455 moray asks fast to hold onto too many recent RPC requests
Reviewed by: Cody Peter Mello <cody.mello@joyent.com>
Approved by: Cody Peter Mello <cody.mello@joyent.com>
Former user commented on 2018-02-09T17:31:26.836-0500 (edited 2018-02-09T17:32:15.576-0500):
We deployed this change yesterday, with good results. Muskie p90 latency dropped from almost 40s to about 700ms and stayed there. Inbound throughput increased by about 3x. I've attached some visuals.
Former user commented on 2018-02-09T19:14:13.657-0500:
Initial observations: the report was that a Manta deployment's throughput was significantly lower than expected.
Seeing high latency at Muskie but low latency at Moray, we suspected a problem at electric-moray. We found a large number of electric-moray processes 100% on-CPU, mostly in userland, with heaps upwards of 1.3GB and over 50% of wall-clock time spent doing GC. That pattern indicates excessive memory usage and often a memory leak. We saved a core file, analyzed it with mdb_v8, and confirmed that quite a lot of memory was in use, but there did not appear to be a leak. As mentioned in earlier comments, we reduced the memory usage, and the problem appears to be resolved.
The rest of this comment has details about the investigation.
From the Starfish dashboards, we confirmed that throughput was poor and that the 90th percentile Muskie request latency was upwards of 30 seconds. Which parts of request handling were taking so long? I used an old script to sample recent requests in one Muskie process:
[root@12c710f7 (webapi) ~]$ tail -n 1000 /var/log/muskie.log | json -ga req.timers | grep \" | tr -d '"' | tr -d ',' | tr -d ':' | awk '{ x[$1] += $2; ttl += $2; if ($1 == "setup") { num++ } } END{ for (i in x) { printf("%30s %5d us %3.3g%%\n", i, x[i] / num, (x[i] / ttl) * 100.0)} }' | sort -n +1
...
commit 49646 us 0.506%
finalizingState 52126 us 0.532%
validateParts 107817 us 1.1%
loadUploadFromUrl 224125 us 2.29%
uploadContext 267001 us 2.72%
enforceDirectoryCount 509946 us 5.2%
saveMetadata 803796 us 8.2%
mkdir 1168625 us 11.9%
getMetadata 6461046 us 65.9%
Those last four phases involve talking to Moray (and so PostgreSQL).
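For readers who find the awk one-liner opaque, the same aggregation can be sketched in Node (the field names follow the muskie log format shown above; the script itself is a hypothetical standalone version):

```javascript
/*
 * Sketch of the per-phase aggregation done by the awk pipeline above: sum
 * each req.timers phase (microseconds) across sampled requests, and report
 * the per-request average and each phase's share of all sampled time. The
 * sample records in the test are made up for illustration.
 */
function summarizeTimers(records) {
	var totals = {};
	var ttl = 0;
	var nreqs = 0;

	records.forEach(function (rec) {
		if (!rec.req || !rec.req.timers)
			return;
		nreqs++;
		Object.keys(rec.req.timers).forEach(function (phase) {
			var us = rec.req.timers[phase];
			totals[phase] = (totals[phase] || 0) + us;
			ttl += us;
		});
	});

	return (Object.keys(totals).map(function (phase) {
		return ({
			'phase': phase,
			'avgUs': Math.round(totals[phase] / nreqs),
			'pct': 100 * totals[phase] / ttl
		});
	}).sort(function (a, b) { return (a.avgUs - b.avgUs); }));
}
```

Each entry corresponds to one row of the table above: average microseconds per request, plus the phase's percentage of all sampled time.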
Many similar problems in the past have been localized to a handful of shards. Was this problem affecting all shards, or just some? This (messy) data suggests the problem was reasonably widespread, though we did not internalize that right away:
[root@12c710f7 (webapi) ~]$ tail -n 1000 /var/log/muskie.log | json -ga -c 'this.req && this.req.timers && this.req.timers.getMetadata' parentShard req.timers.getMetadata | column -t | sort -k2n,2
...
tcp://26.moray.us-east.scloud.host:2020 363016
tcp://5.moray.us-east.scloud.host:2020 367124
tcp://56.moray.us-east.scloud.host:2020 368866
tcp://19.moray.us-east.scloud.host:2020 370108
tcp://35.moray.us-east.scloud.host:2020 375937
tcp://26.moray.us-east.scloud.host:2020 383283
tcp://21.moray.us-east.scloud.host:2020 401681
tcp://86.moray.us-east.scloud.host:2020 402450
tcp://54.moray.us-east.scloud.host:2020 405683
tcp://10.moray.us-east.scloud.host:2020 405964
tcp://5.moray.us-east.scloud.host:2020 407500
tcp://61.moray.us-east.scloud.host:2020 408916
tcp://26.moray.us-east.scloud.host:2020 412143
tcp://46.moray.us-east.scloud.host:2020 413128
tcp://47.moray.us-east.scloud.host:2020 418513
tcp://26.moray.us-east.scloud.host:2020 418915
tcp://47.moray.us-east.scloud.host:2020 419380
tcp://77.moray.us-east.scloud.host:2020 419833
tcp://52.moray.us-east.scloud.host:2020 421722
tcp://8.moray.us-east.scloud.host:2020 423984
tcp://5.moray.us-east.scloud.host:2020 424352
tcp://5.moray.us-east.scloud.host:2020 425202
tcp://37.moray.us-east.scloud.host:2020 426466
tcp://56.moray.us-east.scloud.host:2020 427677
tcp://5.moray.us-east.scloud.host:2020 428106
tcp://67.moray.us-east.scloud.host:2020 429180
tcp://85.moray.us-east.scloud.host:2020 436967
tcp://51.moray.us-east.scloud.host:2020 437009
tcp://49.moray.us-east.scloud.host:2020 439774
tcp://47.moray.us-east.scloud.host:2020 440059
tcp://6.moray.us-east.scloud.host:2020 442011
tcp://46.moray.us-east.scloud.host:2020 447536
tcp://5.moray.us-east.scloud.host:2020 447769
tcp://74.moray.us-east.scloud.host:2020 451427
tcp://96.moray.us-east.scloud.host:2020 452438
tcp://5.moray.us-east.scloud.host:2020 456407
tcp://5.moray.us-east.scloud.host:2020 460179
tcp://75.moray.us-east.scloud.host:2020 461761
tcp://47.moray.us-east.scloud.host:2020 465151
tcp://45.moray.us-east.scloud.host:2020 469654
tcp://56.moray.us-east.scloud.host:2020 469819
tcp://3.moray.us-east.scloud.host:2020 475731
tcp://39.moray.us-east.scloud.host:2020 479237
tcp://5.moray.us-east.scloud.host:2020 482867
tcp://5.moray.us-east.scloud.host:2020 483865
tcp://4.moray.us-east.scloud.host:2020 485582
tcp://47.moray.us-east.scloud.host:2020 515240
tcp://72.moray.us-east.scloud.host:2020 524402
tcp://58.moray.us-east.scloud.host:2020 526243
tcp://47.moray.us-east.scloud.host:2020 530493
tcp://47.moray.us-east.scloud.host:2020 555140
tcp://26.moray.us-east.scloud.host:2020 560617
tcp://5.moray.us-east.scloud.host:2020 583658
tcp://5.moray.us-east.scloud.host:2020 589482
tcp://47.moray.us-east.scloud.host:2020 591113
tcp://56.moray.us-east.scloud.host:2020 596472
tcp://5.moray.us-east.scloud.host:2020 601205
tcp://75.moray.us-east.scloud.host:2020 605774
tcp://26.moray.us-east.scloud.host:2020 615104
tcp://5.moray.us-east.scloud.host:2020 621318
tcp://96.moray.us-east.scloud.host:2020 624142
tcp://80.moray.us-east.scloud.host:2020 631290
tcp://5.moray.us-east.scloud.host:2020 642458
tcp://5.moray.us-east.scloud.host:2020 645895
tcp://77.moray.us-east.scloud.host:2020 654660
tcp://47.moray.us-east.scloud.host:2020 659172
tcp://5.moray.us-east.scloud.host:2020 665612
tcp://93.moray.us-east.scloud.host:2020 670649
tcp://8.moray.us-east.scloud.host:2020 679947
tcp://19.moray.us-east.scloud.host:2020 710645
tcp://9.moray.us-east.scloud.host:2020 717478
tcp://19.moray.us-east.scloud.host:2020 723501
tcp://5.moray.us-east.scloud.host:2020 728770
tcp://5.moray.us-east.scloud.host:2020 776906
tcp://26.moray.us-east.scloud.host:2020 821125
tcp://5.moray.us-east.scloud.host:2020 845780
tcp://33.moray.us-east.scloud.host:2020 857100
tcp://92.moray.us-east.scloud.host:2020 872305
tcp://55.moray.us-east.scloud.host:2020 919903
tcp://47.moray.us-east.scloud.host:2020 929816
tcp://90.moray.us-east.scloud.host:2020 953224
tcp://5.moray.us-east.scloud.host:2020 958381
tcp://59.moray.us-east.scloud.host:2020 1134685
tcp://84.moray.us-east.scloud.host:2020 1156787
tcp://26.moray.us-east.scloud.host:2020 1156883
tcp://69.moray.us-east.scloud.host:2020 1180591
tcp://47.moray.us-east.scloud.host:2020 1183319
tcp://55.moray.us-east.scloud.host:2020 1194584
tcp://5.moray.us-east.scloud.host:2020 1209703
tcp://65.moray.us-east.scloud.host:2020 1225275
tcp://47.moray.us-east.scloud.host:2020 1233474
tcp://15.moray.us-east.scloud.host:2020 1239848
tcp://2.moray.us-east.scloud.host:2020 1249655
tcp://47.moray.us-east.scloud.host:2020 1353483
tcp://7.moray.us-east.scloud.host:2020 1437121
tcp://67.moray.us-east.scloud.host:2020 1439713
tcp://56.moray.us-east.scloud.host:2020 1518395
tcp://69.moray.us-east.scloud.host:2020 1538901
tcp://5.moray.us-east.scloud.host:2020 1540172
tcp://47.moray.us-east.scloud.host:2020 1551037
tcp://72.moray.us-east.scloud.host:2020 1580172
tcp://33.moray.us-east.scloud.host:2020 1650416
tcp://67.moray.us-east.scloud.host:2020 1652487
tcp://87.moray.us-east.scloud.host:2020 1669403
tcp://44.moray.us-east.scloud.host:2020 1672096
tcp://47.moray.us-east.scloud.host:2020 1763761
tcp://26.moray.us-east.scloud.host:2020 1769579
tcp://47.moray.us-east.scloud.host:2020 1806149
tcp://26.moray.us-east.scloud.host:2020 1908526
tcp://46.moray.us-east.scloud.host:2020 1940662
tcp://26.moray.us-east.scloud.host:2020 1985713
tcp://18.moray.us-east.scloud.host:2020 1990411
tcp://32.moray.us-east.scloud.host:2020 2016277
tcp://49.moray.us-east.scloud.host:2020 2057253
tcp://92.moray.us-east.scloud.host:2020 2064018
tcp://71.moray.us-east.scloud.host:2020 2084743
tcp://20.moray.us-east.scloud.host:2020 2289017
tcp://40.moray.us-east.scloud.host:2020 2554969
tcp://5.moray.us-east.scloud.host:2020 2574345
tcp://5.moray.us-east.scloud.host:2020 2619103
tcp://70.moray.us-east.scloud.host:2020 2626960
tcp://5.moray.us-east.scloud.host:2020 2666086
tcp://26.moray.us-east.scloud.host:2020 2730198
tcp://22.moray.us-east.scloud.host:2020 2844709
tcp://95.moray.us-east.scloud.host:2020 2907366
tcp://58.moray.us-east.scloud.host:2020 2946195
tcp://40.moray.us-east.scloud.host:2020 2957908
tcp://64.moray.us-east.scloud.host:2020 3039152
tcp://76.moray.us-east.scloud.host:2020 3057028
tcp://39.moray.us-east.scloud.host:2020 3077053
tcp://45.moray.us-east.scloud.host:2020 3124088
tcp://65.moray.us-east.scloud.host:2020 3128127
tcp://79.moray.us-east.scloud.host:2020 3245905
tcp://47.moray.us-east.scloud.host:2020 3322923
tcp://19.moray.us-east.scloud.host:2020 3467860
tcp://96.moray.us-east.scloud.host:2020 3906101
tcp://80.moray.us-east.scloud.host:2020 3925999
tcp://58.moray.us-east.scloud.host:2020 4014856
tcp://33.moray.us-east.scloud.host:2020 4064324
tcp://36.moray.us-east.scloud.host:2020 4069759
tcp://79.moray.us-east.scloud.host:2020 4246555
tcp://47.moray.us-east.scloud.host:2020 4275319
tcp://47.moray.us-east.scloud.host:2020 4941529
tcp://26.moray.us-east.scloud.host:2020 5593912
tcp://47.moray.us-east.scloud.host:2020 5642534
tcp://70.moray.us-east.scloud.host:2020 6360992
tcp://7.moray.us-east.scloud.host:2020 7173814
tcp://26.moray.us-east.scloud.host:2020 7690236
tcp://39.moray.us-east.scloud.host:2020 9061851
tcp://82.moray.us-east.scloud.host:2020 9131650
tcp://15.moray.us-east.scloud.host:2020 10068922
tcp://42.moray.us-east.scloud.host:2020 10217248
tcp://10.moray.us-east.scloud.host:2020 10616267
tcp://48.moray.us-east.scloud.host:2020 10752417
tcp://31.moray.us-east.scloud.host:2020 11644691
tcp://20.moray.us-east.scloud.host:2020 11656785
tcp://84.moray.us-east.scloud.host:2020 12991829
tcp://47.moray.us-east.scloud.host:2020 14627716
tcp://67.moray.us-east.scloud.host:2020 14627831
tcp://14.moray.us-east.scloud.host:2020 14630212
tcp://26.moray.us-east.scloud.host:2020 14792437
tcp://26.moray.us-east.scloud.host:2020 14834542
tcp://47.moray.us-east.scloud.host:2020 14900862
tcp://78.moray.us-east.scloud.host:2020 14903257
tcp://47.moray.us-east.scloud.host:2020 15101355
tcp://26.moray.us-east.scloud.host:2020 15468453
tcp://26.moray.us-east.scloud.host:2020 15523131
tcp://76.moray.us-east.scloud.host:2020 15584077
tcp://5.moray.us-east.scloud.host:2020 15608610
tcp://25.moray.us-east.scloud.host:2020 15616676
tcp://53.moray.us-east.scloud.host:2020 15655375
tcp://5.moray.us-east.scloud.host:2020 17347181
tcp://48.moray.us-east.scloud.host:2020 17606735
tcp://27.moray.us-east.scloud.host:2020 17730324
tcp://7.moray.us-east.scloud.host:2020 17733314
tcp://20.moray.us-east.scloud.host:2020 18333384
tcp://26.moray.us-east.scloud.host:2020 18353752
tcp://26.moray.us-east.scloud.host:2020 18512679
tcp://26.moray.us-east.scloud.host:2020 18720355
tcp://72.moray.us-east.scloud.host:2020 19163592
tcp://79.moray.us-east.scloud.host:2020 19195098
tcp://85.moray.us-east.scloud.host:2020 19223039
tcp://30.moray.us-east.scloud.host:2020 19438555
tcp://21.moray.us-east.scloud.host:2020 20875726
tcp://68.moray.us-east.scloud.host:2020 20879502
tcp://63.moray.us-east.scloud.host:2020 21893865
tcp://12.moray.us-east.scloud.host:2020 21929995
tcp://55.moray.us-east.scloud.host:2020 22780459
tcp://74.moray.us-east.scloud.host:2020 23148618
tcp://26.moray.us-east.scloud.host:2020 23987474
tcp://47.moray.us-east.scloud.host:2020 24369824
tcp://28.moray.us-east.scloud.host:2020 25505994
tcp://3.moray.us-east.scloud.host:2020 25565677
tcp://86.moray.us-east.scloud.host:2020 25638019
tcp://23.moray.us-east.scloud.host:2020 25652623
tcp://42.moray.us-east.scloud.host:2020 25739857
tcp://62.moray.us-east.scloud.host:2020 25920723
tcp://71.moray.us-east.scloud.host:2020 25939737
tcp://29.moray.us-east.scloud.host:2020 25968700
tcp://6.moray.us-east.scloud.host:2020 26026207
tcp://47.moray.us-east.scloud.host:2020 26195057
tcp://59.moray.us-east.scloud.host:2020 26645651
tcp://61.moray.us-east.scloud.host:2020 26748257
tcp://70.moray.us-east.scloud.host:2020 26932752
tcp://47.moray.us-east.scloud.host:2020 27154101
tcp://74.moray.us-east.scloud.host:2020 27382162
tcp://47.moray.us-east.scloud.host:2020 28532642
tcp://12.moray.us-east.scloud.host:2020 28595113
tcp://26.moray.us-east.scloud.host:2020 28714868
tcp://5.moray.us-east.scloud.host:2020 31763599
tcp://63.moray.us-east.scloud.host:2020 31997004
tcp://47.moray.us-east.scloud.host:2020 32238472
tcp://5.moray.us-east.scloud.host:2020 32246206
tcp://66.moray.us-east.scloud.host:2020 32361778
tcp://2.moray.us-east.scloud.host:2020 33957811
tcp://5.moray.us-east.scloud.host:2020 33959850
tcp://18.moray.us-east.scloud.host:2020 47990832
tcp://26.moray.us-east.scloud.host:2020 49338844
tcp://69.moray.us-east.scloud.host:2020 52199491
Another way to slice it: here's the distribution of shards with getMetadata operations taking over 20 seconds, sampling 1000 recent requests:
[root@12c710f7 (webapi) ~]$ tail -n 1000 /var/log/muskie.log | json -ga -c 'this.req && this.req.timers && this.req.timers.getMetadata > 20000000' entryShard | sort | uniq -c | sort -n
1 tcp://13.moray.us-east.scloud.host:2020
1 tcp://14.moray.us-east.scloud.host:2020
1 tcp://16.moray.us-east.scloud.host:2020
1 tcp://18.moray.us-east.scloud.host:2020
1 tcp://20.moray.us-east.scloud.host:2020
1 tcp://21.moray.us-east.scloud.host:2020
1 tcp://33.moray.us-east.scloud.host:2020
1 tcp://34.moray.us-east.scloud.host:2020
1 tcp://42.moray.us-east.scloud.host:2020
1 tcp://43.moray.us-east.scloud.host:2020
1 tcp://45.moray.us-east.scloud.host:2020
1 tcp://46.moray.us-east.scloud.host:2020
1 tcp://54.moray.us-east.scloud.host:2020
1 tcp://6.moray.us-east.scloud.host:2020
1 tcp://60.moray.us-east.scloud.host:2020
1 tcp://62.moray.us-east.scloud.host:2020
1 tcp://65.moray.us-east.scloud.host:2020
1 tcp://66.moray.us-east.scloud.host:2020
1 tcp://69.moray.us-east.scloud.host:2020
1 tcp://73.moray.us-east.scloud.host:2020
1 tcp://74.moray.us-east.scloud.host:2020
1 tcp://75.moray.us-east.scloud.host:2020
1 tcp://8.moray.us-east.scloud.host:2020
1 tcp://84.moray.us-east.scloud.host:2020
1 tcp://94.moray.us-east.scloud.host:2020
2 tcp://15.moray.us-east.scloud.host:2020
2 tcp://26.moray.us-east.scloud.host:2020
2 tcp://50.moray.us-east.scloud.host:2020
2 tcp://56.moray.us-east.scloud.host:2020
2 tcp://81.moray.us-east.scloud.host:2020
2 tcp://86.moray.us-east.scloud.host:2020
2 tcp://93.moray.us-east.scloud.host:2020
3 tcp://47.moray.us-east.scloud.host:2020
3 tcp://5.moray.us-east.scloud.host:2020
3 tcp://61.moray.us-east.scloud.host:2020
21
We also observed that the p90 latency at Moray was pretty quick (around 100ms) for all shards.
Nevertheless, we spent a bit too long attached to the idea that this might be shard-specific. I went to shard 5's primary and found an autovacuum running, but there was no evidence it was causing a problem. In fact, there were no long-running queries on that shard. Disks were busy, but not tapped out. There was CPU headroom. Most of PostgreSQL's time (even CPU time) was spent in "read(2)", which isn't too surprising. PostgreSQL appeared to be fine. To seal the deal, here are the transaction latencies at PostgreSQL over a 60-second period:
[root@bc91f46b (postgres) ~]$ dtrace -n 'postgresql*:::transaction-start{ self->start = timestamp; } postgresql*:::transaction-commit,postgresql*:::transaction-abort/self->start/{ @[probename] = quantize(timestamp - self->start); self->start = 0; }' -c 'sleep 60'
dtrace: description 'postgresql*:::transaction-start' matched 657 probes
dtrace: pid 49152 has exited
transaction-abort
value ------------- Distribution ------------- count
524288 | 0
1048576 |@@@@@@@@@@@@@@@@@@@@ 1
2097152 |@@@@@@@@@@@@@@@@@@@@ 1
4194304 | 0
transaction-commit
value ------------- Distribution ------------- count
8192 | 0
16384 |@ 5839
32768 |@@@@@@@@@@@@ 65729
65536 |@@@@@@ 34053
131072 | 1213
262144 | 332
524288 |@@@@@@@@@@ 53790
1048576 |@@@@@@@@@ 50893
2097152 | 2250
4194304 | 762
8388608 | 943
16777216 | 947
33554432 | 998
67108864 | 445
134217728 | 206
268435456 | 144
536870912 | 31
1073741824 | 301
2147483648 | 146
4294967296 | 0
Outliers of 2-4 seconds aren't fast, but they can't easily account for 40-second Muskie requests. And the average transaction time is only 5.5ms:
[root@bc91f46b (postgres) ~]$ dtrace -n 'postgresql*:::transaction-start{ self->start = timestamp; } postgresql*:::transaction-commit,postgresql*:::transaction-abort/self->start/{ @[probename] = avg(timestamp - self->start); self->start = 0; }' -c 'sleep 60'
dtrace: description 'postgresql*:::transaction-start' matched 657 probes
dtrace: pid 50926 has exited
transaction-abort 1691374
transaction-commit 5508063
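DTrace's timestamp variable is in nanoseconds, so the avg() figures above convert to milliseconds like this:

```javascript
/* DTrace's timestamp variable is nanoseconds, so avg() output is too. */
function nsToMs(ns) {
	return (ns / 1e6);
}

console.log('commit: ' + nsToMs(5508063).toFixed(1) + ' ms');
console.log('abort:  ' + nsToMs(1691374).toFixed(1) + ' ms');
```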
Could the latency be coming from Moray? Cody reported that the Moray queue depth was essentially zero on all shards, so there's no reason Moray should be adding much latency on top of PostgreSQL. Plus, we'd observed earlier that the latency reported by Moray was pretty low. As one more sanity check, we looked at CPU usage on a shard 5 Moray and found that it was busy, but had headroom:
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
38743 nobody 37 2.2 0.0 0.0 0.0 0.0 60 0.3 371 37 4K 0 node/1
38747 nobody 35 2.0 0.1 0.0 0.0 0.0 62 0.5 270 300 3K 0 node/1
38745 nobody 34 2.1 0.0 0.0 0.0 0.0 63 0.5 241 50 2K 0 node/1
38737 nobody 34 2.2 0.0 0.0 0.0 0.0 64 0.3 349 28 4K 0 node/1
38743 nobody 0.2 0.5 0.0 0.0 0.0 52 47 0.3 672 8 672 0 node/2
38747 nobody 0.2 0.3 0.0 0.0 0.0 56 43 0.2 608 2 607 0 node/2
38737 nobody 0.2 0.3 0.0 0.0 0.0 56 43 0.4 637 2 635 0 node/2
38745 nobody 0.2 0.2 0.0 0.0 0.0 58 41 0.3 552 9 550 0 node/2
85840 root 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 16 0 170 0 prstat/1
38536 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 15 0 105 0 nscd/6
38735 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 1 0 2 0 rsyslogd/7
38735 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 1 0 2 0 rsyslogd/2
38572 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 1 0 4 0 node/1
38743 nobody 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 1 0 1 0 node/3
38735 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 1 0 4 0 rsyslogd/5
Total: 25 processes, 98 lwps, load averages: 1.33, 1.51, 1.54
[root@8fd7baf1 (moray) ~]$
In this zone, "moraystat.d" reported that each process was using only 4-7 of its 16 connections, with 0 queue length. There's no evidence of saturation at Moray.
What about electric-moray? Looking at one zone, we immediately found 3 of the 4 processes at over 99% CPU usage, all in userland, and all with heaps over 1GB. I checked whether the fix for another recent memory issue (node-fast#15) was missing, but it was present. Still, a lot of time was being spent in GC:
[root@00b74968 (electric-moray) ~]$ dtrace -n 'node*:::gc-start{ gcstarts[pid] = timestamp; } node*:::gc-done/gcstarts[pid]/{ @[pid] = sum(timestamp - gcstarts[pid]); gcstarts[pid] = 0; }' -c 'sleep 10'
dtrace: description 'node*:::gc-start' matched 14 probes
dtrace: pid 25577 has exited
55859 5022876584
55828 5994427569
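The sums are nanoseconds of GC time accumulated during the 10-second sample, so these two processes spent roughly 50% and 60% of wall-clock time in GC:

```javascript
/*
 * The DTrace sums above are nanoseconds of GC time observed during a
 * 10-second sample; divide by the sample length to get the GC fraction.
 */
function gcFraction(gcNs, sampleSeconds) {
	return (gcNs / (sampleSeconds * 1e9));
}

console.log('pid 55859: ' +
    (gcFraction(5022876584, 10) * 100).toFixed(0) + '% in GC');
console.log('pid 55828: ' +
    (gcFraction(5994427569, 10) * 100).toFixed(0) + '% in GC');
```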
I collected heap sizes across the fleet:
[root@headnode (us-east-1a) ~]# manta-oneach -s electric-moray 'ps -ozone= -o pid= -o rss= -o vsz= -p "$(pgrep -c "$(svcs -H -o ctid electric-moray | tr "\n" " ")")"'
=== Output from 00b74968-5005-4dfb-9757-7a419101865d on HAA21413 (electric-moray):
00b74968-5005-4dfb-9757-7a419101865d 55801 1511388 1526228
00b74968-5005-4dfb-9757-7a419101865d 55828 1459600 1475656
00b74968-5005-4dfb-9757-7a419101865d 55859 1469328 1472584
00b74968-5005-4dfb-9757-7a419101865d 55904 1565264 1580512
=== Output from 013ff871-339a-4486-82bb-18f7e5f0e16b on HAA21425 (electric-moray):
013ff871-339a-4486-82bb-18f7e5f0e16b 2476 1448260 1466536
013ff871-339a-4486-82bb-18f7e5f0e16b 2528 1336492 1353964
013ff871-339a-4486-82bb-18f7e5f0e16b 2590 1268300 1285732
013ff871-339a-4486-82bb-18f7e5f0e16b 2698 1401412 1420496
=== Output from 02a4e9f0-de37-45af-a55f-643f87eb596e on HAA21386 (electric-moray):
02a4e9f0-de37-45af-a55f-643f87eb596e 23396 1364764 1380740
02a4e9f0-de37-45af-a55f-643f87eb596e 23429 1394656 1414152
02a4e9f0-de37-45af-a55f-643f87eb596e 23480 1235392 1252760
02a4e9f0-de37-45af-a55f-643f87eb596e 23523 1269884 1288168
=== Output from 12ded71d-d46e-44e7-b662-b631d276b2a1 on HAA19403 (electric-moray):
12ded71d-d46e-44e7-b662-b631d276b2a1 32806 1402324 1419644
12ded71d-d46e-44e7-b662-b631d276b2a1 32866 1449808 1466248
12ded71d-d46e-44e7-b662-b631d276b2a1 32897 1497820 1514084
12ded71d-d46e-44e7-b662-b631d276b2a1 32949 1366672 1383488
=== Output from 25af1fb5-a11e-492a-a287-96a4c8b81a94 on HAA19409 (electric-moray):
25af1fb5-a11e-492a-a287-96a4c8b81a94 43402 1427564 1444272
25af1fb5-a11e-492a-a287-96a4c8b81a94 43465 1411968 1429200
25af1fb5-a11e-492a-a287-96a4c8b81a94 43508 1249700 1265996
25af1fb5-a11e-492a-a287-96a4c8b81a94 43539 1375304 1408588
=== Output from 270a3a92-b978-4d92-b95c-4da2a4539983 on HA8VNJRD2 (electric-moray):
270a3a92-b978-4d92-b95c-4da2a4539983 9833 1346000 1363620
270a3a92-b978-4d92-b95c-4da2a4539983 9867 1352324 1369040
270a3a92-b978-4d92-b95c-4da2a4539983 9893 1447828 1466116
270a3a92-b978-4d92-b95c-4da2a4539983 9927 1502872 1519912
=== Output from 28ac1a90-c887-4534-91a1-5a9a783de946 on HAA19412 (electric-moray):
28ac1a90-c887-4534-91a1-5a9a783de946 45172 1423152 1440224
28ac1a90-c887-4534-91a1-5a9a783de946 45207 1346780 1380396
28ac1a90-c887-4534-91a1-5a9a783de946 45236 1225272 1241204
28ac1a90-c887-4534-91a1-5a9a783de946 45280 1334828 1351740
=== Output from 30f9910c-c7dc-4827-bcb9-baa94c9e4d3f on HAA21400 (electric-moray):
30f9910c-c7dc-4827-bcb9-baa94c9e4d3f 95344 1431736 1448720
30f9910c-c7dc-4827-bcb9-baa94c9e4d3f 95377 1485800 1501544
30f9910c-c7dc-4827-bcb9-baa94c9e4d3f 95410 1395372 1413228
30f9910c-c7dc-4827-bcb9-baa94c9e4d3f 95469 1563464 1581160
=== Output from 424e6d8d-bb0c-4645-ae20-d82d68fe80c2 on HAA19401 (electric-moray):
424e6d8d-bb0c-4645-ae20-d82d68fe80c2 14029 1475912 1491596
424e6d8d-bb0c-4645-ae20-d82d68fe80c2 14094 1313912 1330108
424e6d8d-bb0c-4645-ae20-d82d68fe80c2 14137 1246240 1263756
424e6d8d-bb0c-4645-ae20-d82d68fe80c2 14203 1410996 1427368
=== Output from 44f5bdf6-6e50-4114-a408-38e844b0030a on HAA21393 (electric-moray):
44f5bdf6-6e50-4114-a408-38e844b0030a 6637 1438660 1455696
44f5bdf6-6e50-4114-a408-38e844b0030a 6678 1274288 1290968
44f5bdf6-6e50-4114-a408-38e844b0030a 6712 1339348 1356812
44f5bdf6-6e50-4114-a408-38e844b0030a 6758 1445528 1463164
=== Output from 4950e48c-cc0b-4e57-aa64-9957fc0102cf on HAA19370 (electric-moray):
4950e48c-cc0b-4e57-aa64-9957fc0102cf 33690 1356316 1373188
4950e48c-cc0b-4e57-aa64-9957fc0102cf 33731 1468516 1485780
4950e48c-cc0b-4e57-aa64-9957fc0102cf 33801 1465208 1482964
4950e48c-cc0b-4e57-aa64-9957fc0102cf 33822 1493868 1510856
=== Output from 4fbd6707-ffe9-4eec-99ea-db95bcd77b63 on HAA19410 (electric-moray):
4fbd6707-ffe9-4eec-99ea-db95bcd77b63 52143 1527952 1545764
4fbd6707-ffe9-4eec-99ea-db95bcd77b63 52226 1488288 1505108
4fbd6707-ffe9-4eec-99ea-db95bcd77b63 52256 1452952 1471208
4fbd6707-ffe9-4eec-99ea-db95bcd77b63 52328 1348348 1365128
=== Output from 573d9843-1fa6-47c2-9553-00a870a2bc15 on HAA21387 (electric-moray):
573d9843-1fa6-47c2-9553-00a870a2bc15 4548 1261576 1278476
573d9843-1fa6-47c2-9553-00a870a2bc15 4594 1298680 1315420
573d9843-1fa6-47c2-9553-00a870a2bc15 4621 1323828 1340704
573d9843-1fa6-47c2-9553-00a870a2bc15 4655 1461560 1494924
=== Output from 8d4d02ab-5ecf-44bc-a87c-c5c988affa4b on HAA21383 (electric-moray):
8d4d02ab-5ecf-44bc-a87c-c5c988affa4b 34974 1373928 1390732
8d4d02ab-5ecf-44bc-a87c-c5c988affa4b 35028 1221552 1238560
8d4d02ab-5ecf-44bc-a87c-c5c988affa4b 35059 1340388 1358100
8d4d02ab-5ecf-44bc-a87c-c5c988affa4b 35088 1353520 1372048
=== Output from 90258031-2e63-4235-850c-23036c8810cd on HAA19411 (electric-moray):
90258031-2e63-4235-850c-23036c8810cd 87080 1276496 1293032
90258031-2e63-4235-850c-23036c8810cd 87109 1473088 1489560
90258031-2e63-4235-850c-23036c8810cd 87161 1464564 1481872
90258031-2e63-4235-850c-23036c8810cd 87193 1374532 1390764
=== Output from 9f98dbea-06e7-486a-8833-954b8700b448 on HA8V5JRD2 (electric-moray):
9f98dbea-06e7-486a-8833-954b8700b448 12372 1390024 1407280
9f98dbea-06e7-486a-8833-954b8700b448 12400 1415308 1435832
9f98dbea-06e7-486a-8833-954b8700b448 12453 1407736 1424956
9f98dbea-06e7-486a-8833-954b8700b448 12492 1422036 1450524
=== Output from a326b5ff-8e2f-4550-9d07-d85df1a8bc75 on HAA21411 (electric-moray):
a326b5ff-8e2f-4550-9d07-d85df1a8bc75 79361 1309896 1326476
a326b5ff-8e2f-4550-9d07-d85df1a8bc75 79395 1351828 1368148
a326b5ff-8e2f-4550-9d07-d85df1a8bc75 79425 1409232 1426628
a326b5ff-8e2f-4550-9d07-d85df1a8bc75 79454 1355024 1372740
=== Output from adfe3503-6317-466c-a5cb-b5b31b019e2e on HAA21379 (electric-moray):
adfe3503-6317-466c-a5cb-b5b31b019e2e 54716 1302064 1318884
adfe3503-6317-466c-a5cb-b5b31b019e2e 54796 1505372 1522988
adfe3503-6317-466c-a5cb-b5b31b019e2e 54876 1565448 1582868
adfe3503-6317-466c-a5cb-b5b31b019e2e 54937 1442532 1468428
=== Output from b5e8cf20-050b-4fc1-9409-9850826f38a2 on HA8VPCRD2 (electric-moray):
b5e8cf20-050b-4fc1-9409-9850826f38a2 99177 1355560 1372824
b5e8cf20-050b-4fc1-9409-9850826f38a2 99255 1425412 1441852
b5e8cf20-050b-4fc1-9409-9850826f38a2 99305 1408688 1425188
b5e8cf20-050b-4fc1-9409-9850826f38a2 99335 1354828 1373932
=== Output from c23085bc-1788-4c2d-9e36-1de5f30a4d56 on HA8V5MRD2 (electric-moray):
c23085bc-1788-4c2d-9e36-1de5f30a4d56 66299 1453184 1471100
c23085bc-1788-4c2d-9e36-1de5f30a4d56 66413 1170720 1190464
c23085bc-1788-4c2d-9e36-1de5f30a4d56 66485 1359860 1378504
c23085bc-1788-4c2d-9e36-1de5f30a4d56 66524 1290204 1307748
=== Output from c40d4ff2-0d81-4847-818f-b3b2f936a32a on HAA21399 (electric-moray):
c40d4ff2-0d81-4847-818f-b3b2f936a32a 90558 1215944 1232740
c40d4ff2-0d81-4847-818f-b3b2f936a32a 90699 1181704 1198248
c40d4ff2-0d81-4847-818f-b3b2f936a32a 90775 1384376 1401076
c40d4ff2-0d81-4847-818f-b3b2f936a32a 90824 1286200 1301880
=== Output from c85aef32-20ce-4f3c-af0f-8e2f8f99cdbd on HAA21408 (electric-moray):
c85aef32-20ce-4f3c-af0f-8e2f8f99cdbd 97417 1316528 1332044
c85aef32-20ce-4f3c-af0f-8e2f8f99cdbd 97529 1320996 1336724
c85aef32-20ce-4f3c-af0f-8e2f8f99cdbd 97590 1375764 1392008
c85aef32-20ce-4f3c-af0f-8e2f8f99cdbd 97616 1465076 1481128
=== Output from e085343a-5372-4f06-9022-8481eb23deed on HAA21396 (electric-moray):
e085343a-5372-4f06-9022-8481eb23deed 96413 1425516 1442896
e085343a-5372-4f06-9022-8481eb23deed 96496 1327048 1344360
e085343a-5372-4f06-9022-8481eb23deed 96550 1366316 1382780
e085343a-5372-4f06-9022-8481eb23deed 96602 1239280 1256304
=== Output from e930846e-bce6-4c9b-bd92-57f8ce9c10b7 on HAA21407 (electric-moray):
e930846e-bce6-4c9b-bd92-57f8ce9c10b7 23406 1223832 1241488
e930846e-bce6-4c9b-bd92-57f8ce9c10b7 23483 1337336 1354496
e930846e-bce6-4c9b-bd92-57f8ce9c10b7 23511 1448500 1465256
e930846e-bce6-4c9b-bd92-57f8ce9c10b7 23548 1464728 1481716
=== Output from efb3d3d5-c5c1-4513-8e97-c9224e3cbaef on HAA21377 (electric-moray):
efb3d3d5-c5c1-4513-8e97-c9224e3cbaef 43455 1332236 1350800
efb3d3d5-c5c1-4513-8e97-c9224e3cbaef 43535 1308816 1325544
efb3d3d5-c5c1-4513-8e97-c9224e3cbaef 43564 1312056 1328992
efb3d3d5-c5c1-4513-8e97-c9224e3cbaef 43595 1487104 1505124
=== Output from f08c8726-9227-4998-a1da-d28a557cac05 on HAA19393 (electric-moray):
f08c8726-9227-4998-a1da-d28a557cac05 74362 1301404 1318760
f08c8726-9227-4998-a1da-d28a557cac05 74416 1421032 1437784
f08c8726-9227-4998-a1da-d28a557cac05 74450 1422764 1439320
f08c8726-9227-4998-a1da-d28a557cac05 74480 1309048 1325632
=== Output from f8db1e0b-1620-42e2-95ee-5b20090c67ae on HAA19398 (electric-moray):
f8db1e0b-1620-42e2-95ee-5b20090c67ae 87 1367940 1385020
f8db1e0b-1620-42e2-95ee-5b20090c67ae 126 1423560 1440724
f8db1e0b-1620-42e2-95ee-5b20090c67ae 163 1389276 1407876
f8db1e0b-1620-42e2-95ee-5b20090c67ae 99990 1352512 1371432
=== Output from fb37df57-8764-4f68-9b1b-e7a3c5901141 on HA8SCNRD2 (electric-moray):
fb37df57-8764-4f68-9b1b-e7a3c5901141 81465 1530420 1547816
fb37df57-8764-4f68-9b1b-e7a3c5901141 81538 1547500 1564264
fb37df57-8764-4f68-9b1b-e7a3c5901141 81570 1423684 1442604
fb37df57-8764-4f68-9b1b-e7a3c5901141 81606 1418820 1437648
=== Output from fd9934fc-71b3-4e4a-ad9d-0aa5313789c2 on HAA21403 (electric-moray):
fd9934fc-71b3-4e4a-ad9d-0aa5313789c2 8768 1297152 1313940
fd9934fc-71b3-4e4a-ad9d-0aa5313789c2 8799 1387576 1404080
fd9934fc-71b3-4e4a-ad9d-0aa5313789c2 8834 1370328 1387800
fd9934fc-71b3-4e4a-ad9d-0aa5313789c2 8943 1431040 1448528
[root@headnode (us-east-1b) ~]# manta-oneach -s electric-moray 'ps -ozone= -o pid= -o rss= -o vsz= -p "$(pgrep -c "$(svcs -H -o ctid electric-moray | tr "\n" " ")")"'
=== Output from 04ed27e8-6a57-473e-802a-2c766b4c8126 on HAA18511 (electric-moray):
04ed27e8-6a57-473e-802a-2c766b4c8126 20255 1415500 1431568
04ed27e8-6a57-473e-802a-2c766b4c8126 20281 1419196 1435840
04ed27e8-6a57-473e-802a-2c766b4c8126 20317 1476400 1493720
04ed27e8-6a57-473e-802a-2c766b4c8126 20367 1406820 1424660
=== Output from 08d5b5b4-9311-482a-ae62-63eb0b252952 on HAA18517 (electric-moray):
08d5b5b4-9311-482a-ae62-63eb0b252952 45201 1383060 1399336
08d5b5b4-9311-482a-ae62-63eb0b252952 45247 1368968 1386032
08d5b5b4-9311-482a-ae62-63eb0b252952 45296 1598416 1615572
08d5b5b4-9311-482a-ae62-63eb0b252952 45342 1416240 1432824
=== Output from 0b849498-a284-4c9e-b277-a87999f5dd15 on HAB01434 (electric-moray):
0b849498-a284-4c9e-b277-a87999f5dd15 65882 1369376 1385620
0b849498-a284-4c9e-b277-a87999f5dd15 65915 1417820 1434656
0b849498-a284-4c9e-b277-a87999f5dd15 65965 1324532 1343752
0b849498-a284-4c9e-b277-a87999f5dd15 66002 1441680 1457872
=== Output from 1efbbe2c-57de-46e3-be32-99bd2a128332 on HAA19380 (electric-moray):
1efbbe2c-57de-46e3-be32-99bd2a128332 22711 1225708 1243408
1efbbe2c-57de-46e3-be32-99bd2a128332 22746 1315444 1332124
1efbbe2c-57de-46e3-be32-99bd2a128332 22784 1209544 1226976
1efbbe2c-57de-46e3-be32-99bd2a128332 22812 1377640 1395472
=== Output from 2382c8c0-0912-4e44-8f07-1492c7d1ab50 on HA8TZGRD2 (electric-moray):
2382c8c0-0912-4e44-8f07-1492c7d1ab50 37893 1355232 1371928
2382c8c0-0912-4e44-8f07-1492c7d1ab50 37934 1389208 1406116
2382c8c0-0912-4e44-8f07-1492c7d1ab50 37982 1441848 1458892
2382c8c0-0912-4e44-8f07-1492c7d1ab50 38132 1393264 1409772
=== Output from 2eb9353b-a8a5-425d-90b5-fe344ac8c399 on HAB01441 (electric-moray):
2eb9353b-a8a5-425d-90b5-fe344ac8c399 77575 1353584 1369732
2eb9353b-a8a5-425d-90b5-fe344ac8c399 77623 1264228 1280820
2eb9353b-a8a5-425d-90b5-fe344ac8c399 77659 1309628 1327900
2eb9353b-a8a5
Former user commented on 2018-02-09T19:19:38.696-0500:
There's some additional data related to a blind alley that I wanted to record here. At one point, I thought there might be a leak in the node-fast client when we close a connection, so I tried to induce that with this hacked-up program based on fastcall: it opens a fixed number of concurrent connections, makes a fixed number of requests on each, and whenever a connection's requests have all completed, it closes that connection and opens a new one, repeating indefinitely.
#!/usr/bin/env node
/* vim: set ft=javascript: */

/*
 * This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/.
 */

/*
 * Copyright (c) 2018, Joyent, Inc.
 */

/*
 * fast-test.js: tests a specific fast workload involving a large number of
 * clients making a fixed number of requests each.  The purpose is to observe
 * the memory usage of this workload.
 * XXX This is largely copied from "fastcall".
 */

var VError = require('verror');
var mod_assertplus = require('assert-plus');
var mod_bunyan = require('bunyan');
var mod_cmdutil = require('cmdutil');
var mod_fast = require('fast');
var mod_net = require('net');
var mod_vasync = require('vasync');

var nreqsperclient = 30;
var clientconcurrency = 1000;
var nclientcycles = 0;

function main()
{
	var argv, host, port, rpcmethod, rpcargs, i;

	mod_cmdutil.configure({
	    'synopses': [ 'HOST PORT METHOD ARGS' ],
	    'usageMessage': [
		'    HOST      DNS name or IP address for remote server',
		'    PORT      TCP port for remote server',
		'    METHOD    Name of remote RPC method call',
		'    ARGS      JSON-encoded arguments for RPC method call'
	    ].join('\n')
	});
	mod_cmdutil.exitOnEpipe();

	argv = process.argv.slice(2);
	if (argv.length != 4) {
		mod_cmdutil.usage('expected four non-option arguments');
	}

	host = argv[0];
	port = parseInt(argv[1], 10);
	if (isNaN(port) || port < 1 || port > 65535) {
		mod_cmdutil.usage('invalid TCP port: %s\n', argv[1]);
	}

	rpcmethod = argv[2];
	try {
		rpcargs = JSON.parse(argv[3]);
	} catch (ex) {
		mod_cmdutil.usage(new VError(ex, 'parsing RPC arguments'));
	}

	if (!Array.isArray(rpcargs)) {
		mod_cmdutil.usage(new Error('RPC arguments: expected array'));
	}

	for (i = 0; i < clientconcurrency; i++) {
		runOneClient({
		    'host': host,
		    'port': port + (i % 8),
		    'rpcmethod': rpcmethod,
		    'rpcargs': rpcargs,
		    'reqsPerClient': nreqsperclient
		});
	}

	console.log('pid %d', process.pid);
	setImmediate(tick);
}

/*
 * Periodically report how many client connect-request-close cycles have
 * completed so far.
 */
function tick()
{
	console.log('%s: cycles done: %d',
	    new Date().toISOString(), nclientcycles);
	setTimeout(tick, 5000);
}

/*
 * Open one connection, issue "reqsPerClient" RPCs on it, and when they have
 * all completed, tear down the connection and start over.
 */
function runOneClient(args)
{
	var log, conn;
	var rpcmethod, rpcargs;

	mod_assertplus.object(args, 'args');
	mod_assertplus.string(args.host, 'args.host');
	mod_assertplus.number(args.port, 'args.port');
	mod_assertplus.optionalNumber(args.timeout, 'args.timeout');
	mod_assertplus.string(args.rpcmethod, 'args.rpcmethod');
	mod_assertplus.array(args.rpcargs, 'args.rpcargs');
	mod_assertplus.number(args.reqsPerClient, 'args.reqsPerClient');

	rpcmethod = args.rpcmethod;
	rpcargs = args.rpcargs;
	log = new mod_bunyan({
	    'name': 'fasttest',
	    'level': process.env['LOG_LEVEL'] || 'fatal'
	});

	conn = mod_net.createConnection(args.port, args.host);
	conn.on('connect', function onConnect() {
		var fastconn, req, i;
		var barrier;

		fastconn = new mod_fast.FastClient({
		    'log': log,
		    'transport': conn,
		    'nRecentRequests': args.reqsPerClient
		});

		barrier = mod_vasync.barrier();
		for (i = 0; i < args.reqsPerClient; i++) {
			(function (i) {
				barrier.start('request ' + i);
				req = fastconn.rpc({
				    'rpcmethod': rpcmethod,
				    'rpcargs': rpcargs
				});

				req.on('error', function (err) {
					barrier.done('request ' + i);
				});

				req.on('data', function (message) {});
				req.on('end', function () {
					barrier.done('request ' + i);
				});
			})(i);
		}

		barrier.on('drain', function () {
			nclientcycles++;
			conn.destroy();
			setImmediate(runOneClient, args);
		});
	});
}

main();
I monitored this with:
while sleep 10; do printf "%s %s\n" "$(date -u +%FT%TZ)" "$(ps -o pid= -o etime= -o rss= -o vsz= -p 68694)"; done
The data suggested no leak.
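To make that conclusion concrete from the monitoring loop's output (timestamp, pid, etime, rss, vsz per line), one can compare the first and last RSS samples and confirm the footprint plateaus rather than growing without bound. This is a hypothetical post-processing sketch; the log path and the sample values in the heredoc are made up for illustration:

```shell
#!/bin/bash
# Sample lines in the format emitted by the monitoring loop above; in
# practice this would be the captured log, not a heredoc.
cat > /tmp/fasttest-mon.log <<'EOF'
2018-02-12T00:00:00Z 68694 00:10 81234 91234
2018-02-12T00:00:10Z 68694 00:20 82456 92456
2018-02-12T00:10:00Z 68694 10:10 82470 92470
EOF

# RSS is the fourth column.  Print the first sample, the last sample, and
# the delta between them; a small, bounded delta over a long run argues
# against a leak.
awk '{ rss = $4; if (NR == 1) first = rss }
    END { printf "first=%d last=%d delta=%d\n", first, rss, rss - first }' \
    /tmp/fasttest-mon.log
```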
Former user commented on 2018-02-12T19:11:33.141-0500:
The core file for this analysis is in thoth dump f2aae714bfaf6ba7.