OS-7751: Soft Ring Sets switch from polling to interrupts too readily


Issue Type: Task
Priority: 4 - Normal
Created at: 2019-04-17T21:31:59.088Z
Updated at: 2019-11-12T06:13:50.626Z


Created by: Former user
Reported by: Former user
Assigned to: Former user


While doing some performance regression testing for OS-7334, I noticed that when running iperf with a single connection, it would start out transferring at around 3-4 Gbits/sec and then eventually jump to 16-17 Gbits/sec after an unpredictable period of time, ranging from 3 seconds to sometimes more than a minute. At first I thought that perhaps it was due to a small initial congestion window that grew over time, but connstat(1M) showed that not to be the case. Instead of a changing cwnd, it revealed that the RTT was actually dropping from 2000-3000 microseconds to about 500 microseconds. Since the two boxes were directly connected to each other, with no intermediate switch and no serious system activity on either box beyond iperf, this was unexpected.
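As a rough sanity check on those numbers: if the window really is constant (as connstat suggested), throughput should scale inversely with RTT. The sketch below only compares ratios taken from the figures above; the helper names are made up for illustration, and the fixed-window assumption is a simplification.

```python
# Sanity check: with a fixed window, throughput is proportional to 1 / RTT,
# so the throughput gain should match the RTT reduction.
# Helper names are hypothetical; the input numbers come from the ticket text.

def expected_gain(rtt_before_us, rtt_after_us):
    """Range of throughput gain predicted from (low, high) RTT ranges."""
    lo_before, hi_before = rtt_before_us
    lo_after, hi_after = rtt_after_us
    return (lo_before / hi_after, hi_before / lo_after)

def observed_gain(before_gbps, after_gbps):
    """Range of throughput gain from (low, high) measured throughput ranges."""
    return (after_gbps[0] / before_gbps[1], after_gbps[1] / before_gbps[0])

expected = expected_gain((2000, 3000), (500, 500))  # RTT in microseconds
observed = observed_gain((3, 4), (16, 17))          # throughput in Gbits/sec

print("expected gain: %.2fx to %.2fx" % expected)
print("observed gain: %.2fx to %.2fx" % observed)
```

The observed gain falls inside the range predicted from the RTT change alone, which is consistent with the RTT, rather than cwnd, being the variable that moved.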

Looking at the RTT measurements being passed to tcp_set_rto, I noticed that the short RTTs were measured when packets were dequeued during polling, and the long ones when they were dequeued during interrupts. To help visualize this, I collected data using rtt-statemap.d and processed it with rtt-statemap.awk to generate input for statemap. If you look at the output statemaps, you can see the correlation between shorter RTTs and polling:

Statemap    Stable At (second offset)
Trace 1     16.451 seconds
Trace 2     35.176 seconds
Trace 3     11.366 seconds
Trace 4     0.595 seconds
Trace 5     1.939 seconds

(The rtt-statemap.cache file contains the mappings from assigned stack IDs to actual stacks.)
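For anyone who wants to check the polling/interrupt correlation numerically rather than visually, a minimal sketch follows. It assumes the samples have already been reduced to (dequeue path, RTT in microseconds) pairs; that input format, the function name, and the sample values are all hypothetical and are not the actual output of rtt-statemap.d.

```python
from collections import defaultdict
from statistics import median

def median_rtt_by_path(samples):
    """Group RTT samples by dequeue path and return the median RTT per path.

    samples: iterable of (path, rtt_us) pairs, where path is a label such as
    "poll" or "interrupt". This format is an assumption for illustration.
    """
    by_path = defaultdict(list)
    for path, rtt_us in samples:
        by_path[path].append(rtt_us)
    return {path: median(rtts) for path, rtts in by_path.items()}

# Made-up values in the same ballpark as the RTTs described in this ticket.
samples = [
    ("interrupt", 2400), ("interrupt", 2900), ("poll", 520),
    ("poll", 480), ("interrupt", 2100), ("poll", 500),
]

medians = median_rtt_by_path(samples)
for path, rtt_us in sorted(medians.items()):
    print(path, rtt_us)
```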

It makes sense that the short RTT eventually stabilizes, since it allows for sending packets more frequently, which results in more frequent acknowledgements, which in turn helps sustain a nonzero sr_poll_pkt_cnt. It also explains why running with 2 iperf connections doesn't show the same fluctuations: it generates twice as many acknowledgements. What I didn't understand, though, was what triggered the stabilization, and why the periods with short RTTs before stabilization didn't last, until I looked at the percentage of nonzero measurements over time:

[Attachment: rtt-dtrace-1-srs.png]
[Attachment: rtt-dtrace-2-srs.png]
[Attachment: rtt-dtrace-3-srs.png]
[Attachment: rtt-dtrace-4-srs.png]
[Attachment: rtt-dtrace-5-srs.png]

Looking at the above graphs, you can see that while reaching at least 35% of time spent polling is enough to improve the polling time, it's not until we reach ~55% that the system is able to stay there.
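The percentage plotted above could be derived from raw samples roughly as follows. This is a sketch under stated assumptions, not the script that produced the graphs: it assumes each sample is a (time in seconds, sr_poll_pkt_cnt) pair, that "nonzero measurements" means samples with a nonzero sr_poll_pkt_cnt, and that one-second windows are a reasonable bucket size. The function names and sample values are hypothetical; only the 55% figure comes from the observation above.

```python
def nonzero_pct_by_window(samples, window_s=1.0):
    """Return [(window_start_s, pct_nonzero)] for fixed-size time windows.

    samples: iterable of (t_seconds, sr_poll_pkt_cnt) pairs (assumed format).
    """
    counts = {}
    for t, cnt in samples:
        w = int(t // window_s)
        total, nonzero = counts.get(w, (0, 0))
        counts[w] = (total + 1, nonzero + (1 if cnt > 0 else 0))
    return [(w * window_s, 100.0 * nz / total)
            for w, (total, nz) in sorted(counts.items())]

def first_sustained(pcts, threshold):
    """Start of the earliest window after which every window stays at or
    above threshold, or None if the final window is below it."""
    start = None
    for window_start, pct in pcts:
        if pct >= threshold:
            if start is None:
                start = window_start
        else:
            start = None  # dipped below the threshold, so start over
    return start

# Made-up samples: polling briefly reaches 50% in the second window, drops
# back, and then stays above 55% from the fourth window onward.
samples = [
    (0.1, 0), (0.3, 0), (0.6, 2), (0.9, 0),
    (1.1, 1), (1.4, 0), (1.6, 3), (1.8, 0),
    (2.2, 0), (2.5, 0), (2.7, 1), (2.9, 0),
    (3.1, 2), (3.3, 4), (3.6, 0), (3.8, 1),
    (4.0, 1), (4.2, 2), (4.5, 3), (4.9, 1),
]

pcts = nonzero_pct_by_window(samples)
stable_at = first_sustained(pcts, 55.0)
print(pcts)
print("stays at or above 55% from t =", stable_at)
```

In this toy data, the second window clears 35% but does not hold, which mirrors the short-lived improvements seen before stabilization in the traces.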