
Using flamegraph to Solve IP Scaling Issue (DCE)

The two flame graphs shown below were taken from two machines running essentially the same set of processes and handling the same amount of network traffic. Read on to find out how these were created, what problem they were used to solve, and how you can do it too.


Slow Flame Graph

Recently I worked on a fun performance problem (they’re all fun once they’re solved) at a customer site. The customer was using Zeus load balancers on SmartMachines. External internet facing traffic was routed by the Zeus load balancers to back-end apache engines. Occasionally, Zeus would fail over (another Zeus SmartMachine would take over the IP address and traffic would then be handled by this Zeus).


Fast Flame Graph

The problem was to identify why Zeus was failing over. Basically, Zeus fails over when it determines that an active Zeus is not responding. Internally, Zeus uses UDP to determine whether the other Zeuses in the configuration are up and running. If a Zeus engine does not respond within some configurable time, it initiates failover. This was occurring in a production environment, and resulted in dropped connections due to timeouts. Often, Zeus would fail over from the active to the passive engine, almost immediately determine that the active engine was actually up, and fail back. Very soon after, the passive node would again decide that the active Zeus was not responding and fail over again, resulting in a “flip-flop” between the active and passive Zeus engines.

Attempts to reproduce the problem in a test environment were generally unsuccessful, or the problem would show up only during the last hour I was on site. Fixing the problem generally meant rebooting the Zeus SmartMachine that had failed over.

Much time was spent determining whether the configuration of Zeus was correct. During this work, the customer noted that the load (as measured by uptime or prstat) on SmartMachines running active Zeus modules would slowly increase over time. It would start at approximately 0.2 and, after 2-3 weeks, would be up around 2.0. When it got to around 2.0, failover would typically start occurring. In addition to running Zeus, the SmartMachine was running Nagios for monitoring. If either Zeus or Nagios was stopped, the load would come down, but nowhere near the 0.2 seen at startup. If both were stopped, the load would go to 0. If either was then restarted, the load would climb well above the initial 0.2. The only way to get back to the initial load was to reboot the SmartMachine.
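To track this trend over days or weeks, a trivial shell loop can record the load average periodically. This is just a sketch; the log path and the ten-minute interval are arbitrary choices, not from the original investigation:

# while true; do echo "$(date) $(uptime)" >> /var/tmp/load.log; sleep 600; done &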

Here is the uptime for one of these SmartMachines, a few hours after boot:

# uptime
6:11pm up 2:33, 2 users, load average: 0.19, 0.18, 0.17
#

And here is the load for a SmartMachine that was up for a while:

# uptime
6:56pm up 14 days 7:24, 2 users, load average: 0.70, 0.92, 1.09
#

mpstat also showed some interesting results. Here is mpstat output for the SmartMachine that was recently booted:

# mpstat 10
...
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 299 0 2 403 125 446 0 30 14 8 1243 1 1 0 98
1 106 0 0 114 44 329 0 9 14 6 890 0 1 0 99
2 116 0 17 124 18 323 0 19 12 3 699 0 2 0 97
3 5 0 23 164 72 405 0 10 17 6 1135 0 0 0 99
4 56 0 31 97 18 271 0 10 14 5 732 0 0 0 99
5 32 0 0 74 27 179 1 7 11 6 393 0 0 0 100
6 21 0 0 73 18 273 0 8 12 2 950 0 0 0 99
7 6 0 0 128 51 361 0 9 11 7 936 0 0 0 99
8 75 0 0 84 26 238 0 5 11 2 697 0 0 0 99
9 42 0 0 86 42 208 1 7 8 3 539 0 0 0 99
10 8 0 0 87 30 319 0 6 16 4 885 0 1 0 99
11 7 0 25 357 168 466 0 9 12 3 758 0 1 0 99
12 3 0 0 66 24 280 0 7 13 5 766 0 0 0 99
13 5 0 0 138 61 364 0 10 17 4 768 0 0 0 99
14 3 0 0 46 13 155 0 3 7 6 370 0 0 0 100
15 4 0 2 145 64 298 0 7 10 7 544 0 0 0 99
16 5 0 0 75 26 303 0 7 12 6 843 0 0 0 99
17 10 0 0 178 80 438 0 9 11 7 968 0 0 0 99
18 1 0 24 44 8 128 0 3 6 2 504 0 0 0 99
19 65 0 2 981 900 285 0 10 10 1 783 0 1 0 99
20 164 0 0 1193 1073 308 0 8 36 5 862 0 2 0 97
21 10 0 29 130 66 370 0 7 19 3 1012 1 1 0 99
22 5 0 0 2828 2784 140 0 4 6 5 329 0 1 0 99
23 454 0 9 3584 3329 403 1 14 20 5 741 1 5 0 94
...

And here is output from the SmartMachine that has been up for a while:

# mpstat 10
...
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 8850 0 4 425 118 514 5 59 21 1 1444 4 31 0 65
1 817 0 101 115 35 300 2 29 20 0 1557 6 4 0 90
2 608 0 45 133 28 731 8 77 32 2 2095 11 5 0 84
3 271 0 27 128 43 429 2 31 19 1 1346 3 4 0 93
4 149 0 24 90 22 420 5 42 26 1 1080 7 2 0 90
5 46 0 4 95 29 274 1 21 15 0 646 1 3 0 96
6 222 0 25 117 28 550 5 50 27 1 1593 9 4 0 86
7 8 0 0 109 31 366 1 28 20 1 1054 2 4 0 93
8 42 0 35 70 24 430 4 27 21 2 1195 5 3 0 92
9 0 1 40 106 30 204 0 19 18 0 573 1 6 0 93
10 1 0 1 108 30 556 5 46 22 2 1436 5 4 0 91
11 54 0 1 125 28 372 2 27 14 1 991 2 5 0 93
12 30 0 20 73 21 382 4 35 17 2 1043 5 3 0 93
13 29 1 10 133 44 357 1 29 19 1 718 3 4 0 93
14 43 0 0 53 15 314 3 27 12 2 687 4 3 0 93
15 41 0 2 86 22 273 1 22 23 0 720 2 2 0 97
16 6 0 11 105 29 580 6 46 23 3 1394 6 6 0 88
17 61 0 2 106 32 346 1 26 18 1 1044 5 5 0 90
18 1 0 0 62 21 365 3 27 19 0 846 7 3 0 90
19 140 0 9 1626 1544 338 1 24 21 1 1061 3 3 0 93
20 6 0 0 1913 1698 536 6 48 41 1 1305 11 8 0 81
21 196 0 33 338 173 410 2 27 31 2 1264 3 4 0 93
22 55 0 0 2088 2024 457 6 39 18 2 1388 9 4 0 88
23 28 0 1 1412 1172 122 2 16 57 3 151 1 71 0 28
...

Both machines show a lot of interrupt activity on cpu 23. The recently booted machine is handling more than twice as many interrupts on that cpu, yet its cpu 23 is 94% idle, compared with only 28% idle on the machine that has been up for a while. Moreover, the long-running machine is spending most of its cpu 23 time (71%) in system mode.

For more information on uptime and mpstat, take a look at http://dtrace.org/blogs/brendan/2011/06/24/load-average-video/ and http://dtrace.org/blogs/brendan/2011/06/18/mpstat-videos/.

Both systems show that cpu 23 is handling igb interrupts (i.e., in this case, interrupts from the external-facing network controller). Note that mdb -k can be run only from within the global zone.

# echo "::interrupts" | mdb -k
IRQ Vect IPL Bus Trg Type CPU Share APIC/INT# ISR(s)
...
64 0x64 6 PCI Edg MSI-X 19 1 - igb_intr_tx_other
65 0x65 6 PCI Edg MSI-X 20 1 - igb_intr_rx
66 0x66 6 PCI Edg MSI-X 22 1 - igb_intr_tx_other
67 0x67 6 PCI Edg MSI-X 23 1 - igb_intr_rx
160 0xa0 0 Edg IPI all 0 - poke_cpu
208 0xd0 14 Edg IPI all 1 - kcpc_hw_overflow_intr
209 0xd3 14 Edg IPI all 1 - cbe_fire
...
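As a cross-check (not something done in the original investigation), intrstat(1M) can also be run from the global zone; per sampling interval, it shows how much time each device's interrupt handlers are consuming on each cpu:

# intrstat 10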

At this point, I decided to use DTrace to determine what was causing the load on cpu 23.

I ran the following on both the machine with low load and the machine with higher load:

 dtrace -x stackframes=100 -n 'profile-997 /arg0 && cpu==23/ { @[stack()] = count(); } tick-60s { exit(0); }' -o out.kern_stacks 

This samples kernel stacks at 997 Hz for one minute, restricted to cpu 23 (the arg0 test skips samples taken while the cpu was running in user mode). At this point, looking at the output file (out.kern_stacks) would probably have shown me what I needed to know, but I wanted to use flame graphs. You can read about flame graphs (and download the necessary scripts) here.
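The stackcollapse.pl and flamegraph.pl scripts used below are part of Brendan Gregg's FlameGraph tools; if the link above is not handy, they can also be fetched directly from the repository:

# git clone https://github.com/brendangregg/FlameGraph
# cd FlameGraph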

First, the system with the low load average:

#
# ./stackcollapse.pl out.kern_stacks > out.kern_folded
# ./flamegraph.pl out.kern_folded > kernel.fast.svg
#

And now the system with the high load average:

#
# ./stackcollapse.pl out.kern_stacks > out.kern_folded
# ./flamegraph.pl out.kern_folded > kernel.slow.svg
#

And here are the resulting flame graphs. Click for the SVG, where you can mouse over elements and see percentages.

From the above, cpu 23 on the system with the high load average is spending most of its time in ip`dce_lookup_and_add_v4(). On the low load average system, this routine is one of the thinnest lines on the graph. A quick look at the source code for this routine shows that it hashes into a table (of 256 entries), and follows a linked list from a hash bucket. The code is in usr/src/uts/common/inet/ip/ip_dce.c, for those interested. The entries in the linked list are dce_t entries. Let’s see how many dce_t entries are in use.

# netstat -dn
24973067 24973067 424542139
#

When I first started looking at this, I was using mdb to determine the number of entries in the DCE cache. The “-d” option to the netstat command is not documented on the man page, but it lists DCE entries.
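For reference, one way to get a similar count from mdb is to look at the kernel memory cache that holds the dce_t structures; this assumes the cache is named dce_cache, as it is in the illumos ip_dce.c source:

# echo "::kmastat" | mdb -k | grep dce_cache   # assumes the dce_t cache is named "dce_cache"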

So, approximately 25 million entries! With 256 hash buckets, each hash chain holds ~100,000 entries (assuming the hashing algorithm spreads entries reasonably evenly). Using DTrace, one can see that dce_lookup_and_add_v4(), in addition to being called at interrupt time, is called from various other code paths. Each call is taking approximately 3 milliseconds to complete.

# dtrace -n 'dce_lookup_and_add_v4:entry{@[stack()]=count();}'
dtrace: description 'dce_lookup_and_add_v4:entry' matched 1 probe
...
ip`ip_set_destination_v4+0x1cf
ip`ip_attr_connect+0xc6
ip`conn_connect+0xeb
ip`tcp_set_destination+0x66
ip`tcp_connect_ipv4+0x135
ip`tcp_do_connect+0x1d1
ip`tcp_connect+0x9f
sockfs`so_connect+0xf8
sockfs`socket_connect+0x47
sockfs`connect+0xec
unix`sys_syscall+0x17a
1216

ip`ip_set_destination_v4+0x1cf
ip`ip_attr_connect+0xc6
ip`conn_connect+0xeb
ip`tcp_set_destination+0x66
ip`tcp_input_listener+0x7ca
ip`tcp_input_listener_unbound+0x16e
ip`squeue_drain+0x1f8
ip`squeue_worker+0x132
unix`thread_start+0x8
2211

ip`ip_set_destination_v4+0x1cf
ip`ip_attr_connect+0xc6
ip`conn_connect+0xeb
ip`tcp_set_destination+0x66
ip`tcp_connect_ipv4+0x135
ip`tcp_do_connect+0x20f
ip`tcp_connect+0x9f
sockfs`so_connect+0xf8
sockfs`socket_connect+0x47
sockfs`connect+0xec
unix`sys_syscall+0x17a
5826

ip`ip_set_destination_v4+0x1cf
ip`ip_attr_connect+0xc6
ip`conn_connect+0xeb
ip`tcp_set_destination+0x66
ip`tcp_input_listener+0x7ca
ip`tcp_input_listener_unbound+0x16e
ip`squeue_enter+0x440
ip`ip_fanout_v4+0x48d
ip`ire_recv_local_v4+0x366
ip`ill_input_short_v4+0x6ce
ip`ip_input+0x23b
dls`i_dls_link_rx+0x2e7
mac`mac_rx_deliver+0x5d
mac`mac_rx_soft_ring_process+0x17a
mac`mac_rx_srs_proto_fanout+0x4e5
mac`mac_rx_srs_drain+0x26e
mac`mac_rx_srs_process+0x1a9
mac`mac_rx_classify+0x159
mac`mac_rx_flow+0x54
mac`mac_rx_common+0x1f6
14081

The above is after ~30 seconds.

# dtrace -n 'dce_lookup_and_add_v4:entry{self->t = timestamp;} dce_lookup_and_add_v4:return/self->t/{@=avg(timestamp-self->t);} tick-30s{printa("average: %@d ns\n", @);trunc(@);}'
dtrace: description 'dce_lookup_and_add_v4:entry' matched 3 probes
CPU ID FUNCTION:NAME

22 69877 :tick-30s average: 2903005 ns

22 69877 :tick-30s average: 3010001 ns
...
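Averages can hide outliers, so a variant of the same one-liner (a sketch, not taken from the original investigation) uses quantize() to show the distribution of lookup times instead:

# dtrace -n 'dce_lookup_and_add_v4:entry{self->t = timestamp;} dce_lookup_and_add_v4:return/self->t/{@ = quantize(timestamp - self->t); self->t = 0;} tick-30s{printa(@); trunc(@);}'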

So, when do entries get freed? According to the comment above ip_dce_reclaim() (in ip_dce.c), reclamation occurs only when the system is running low on memory. On a machine with plenty of memory, that may never happen.

The proposed fix is to add an ndd tunable that allows the number of hash slots to be increased, and another tunable that causes old entries in the cache to be reaped sooner. A workaround is to make the system think it is short on memory by using mdb to set the lotsfree variable larger than freemem (wait a second or two, then set lotsfree back to its normal value).
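Here is a rough sketch of that workaround, run with mdb -kw from the global zone; the angle-bracket values are placeholders that must be taken from the running system, and the original lotsfree value should be restored afterwards:

# mdb -kw
> freemem/E
> lotsfree/E
> lotsfree/Z 0t<value larger than freemem>
> $q
# sleep 2
# mdb -kw
> lotsfree/Z 0t<original lotsfree value>
> $q

Raising lotsfree above freemem makes the kernel believe it is short on memory, which should cause the memory reclaim callbacks, including ip_dce_reclaim(), to run and trim the DCE cache.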

Interestingly, the same problem had been reported on Solaris 11 just one day earlier. See http://comments.gmane.org/gmane.os.solaris.opensolaris.networking/14278 for details.

Using flame graphs enabled me to identify the problem quickly. A “good” flame graph will have either lots of short-lived peaks or a large plateau of idle time; lots of short-lived peaks means the processor is busy, but not stuck doing any one thing. Flame graphs make it easy to visualize where the cpu(s) are spending their time.
