OS-7151: ZFS loading and unloading metaslabs at audible frequency

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-12-17T22:09:13.865Z)

Fix Versions

2018-12-20 Hazel Wassername (Release Date: 2018-12-20)

Description

We recently experienced a serious performance problem on a number of storage servers with large ZFS pools. These pools are on the order of 250TB, and they had been filled to around 93% capacity. Previous testing in MANTA-3571 had led us to believe that we would be able to reach 96% capacity before experiencing performance issues, but that testing was very limited in scope and did not account for things like fragmentation in the pool.

While the performance issue was ongoing, even asynchronous writes were taking a long time.

Comments

Comment by Joshua M. Clulow
Created at 2018-08-20T18:02:52.634Z
I looked closer at one of the misbehaving machines to attempt to characterise the work it was doing. The machine was mostly idle, in terms of both CPU and disk activity. Kernel profiling revealed that most of the work was happening in the routines responsible for assembling the in-memory representation of the space map for ZFS metaslabs.

[attached image: oncpu.png]

Note that we are in the metaslab_group_alloc_normal() routine, attempting to allocate blocks in the ZFS pool for a write. Most of the time we are either sitting on the mutex waiting, or doing work in space_map_load(), where we reconstitute the in-memory representation (AVL-based range tree) of the on-disk space map for some metaslab or other.
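
The exact profiling invocation isn't recorded here, but a standard DTrace kernel on-CPU sampling one-liner along these lines (an assumed reconstruction, not the original command) produces the kind of stack data shown in the attached profile:

# dtrace -x stackframes=100 \
    -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-60s { exit(0); }'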

Comment by Joshua M. Clulow
Created at 2018-08-20T18:08:02.759Z
The storage pool in server MSB05760 has three raidz2 vdevs, each with 160 metaslabs. When I was looking at the server, it appeared that around half of the metaslabs were loaded. With mdb -k you can get a census of loaded vs unloaded slabs:

    > ::walk spa | ::print spa_t spa_normal_class->mc_rotor | ::list metaslab_group_t mg_next | ::print -ta metaslab_group_t mg_metaslab_tree | ::walk avl | ::print metaslab_t ms_loaded ! sort | uniq -c | sort -n
     216 ms_loaded = 0 (0)
     264 ms_loaded = 0x1 (B_TRUE)

(Note that spa_normal_class is all non-slog vdevs in the pool.)

There are a number of heuristic thresholds for metaslab and pool capacity which can change the way the allocator works. Looking at the pool with zdb -mmm zones, it seems that most of the metaslabs have a largest free segment of around 156KB and roughly 10% free space.

156KB is still bigger than metaslab_df_alloc_threshold (128KB), below which we apparently use a more aggressive (pessimal?) allocation strategy. 10% is bigger than metaslab_df_free_pct (4%), below which we will switch from first-fit to best-fit.
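
For reference, both tunables can be read on a live system with mdb -k (a quick sketch, not output captured from this machine):

    > metaslab_df_alloc_threshold/E
    > metaslab_df_free_pct/D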

Comment by Joshua M. Clulow
Created at 2018-08-20T18:17:08.402Z
I did a quick survey of machines that had very little data in their ZFS pools, and even under some write load those pools had only a handful (e.g., 3-6) of metaslabs loaded at any particular time.

We can trace load/unload activity with this D script:
#!/usr/sbin/dtrace -qCs

#define NOW     ((walltimestamp - START) / 1000000)

BEGIN
{
        START = walltimestamp;
}

fbt::metaslab_load:entry
{
        self->ts = timestamp;
        self->ms_id = args[0]->ms_id;
        self->vd_id = args[0]->ms_group->mg_vd->vdev_id;

        printf("%8u E   load vd %4u ms %4u\n", NOW, self->vd_id,
            self->ms_id);
}

fbt::metaslab_load:return
/self->ts/
{
        this->d = (timestamp - self->ts) / 1000000; /* milliseconds */

        printf("%8u R   load vd %4u ms %4u dur %u\n", NOW, self->vd_id,
            self->ms_id, this->d);

        self->ts = 0;
        self->ms_id = 0;
        self->vd_id = 0;
}

fbt::metaslab_unload:entry
{
        self->ts = timestamp;
        self->ms_id = args[0]->ms_id;
        self->vd_id = args[0]->ms_group->mg_vd->vdev_id;

        printf("%8u E unload vd %4u ms %4u\n", NOW, self->vd_id,
            self->ms_id);
}

fbt::metaslab_unload:return
/self->ts/
{
        this->d = (timestamp - self->ts) / 1000000; /* milliseconds */

        printf("%8u R unload vd %4u ms %4u dur %u\n", NOW, self->vd_id,
            self->ms_id, this->d);

        self->ts = 0;
        self->ms_id = 0;
        self->vd_id = 0;
}

fbt::spa_sync:entry
{
        printf("%8u E   sync ---- txg %u\n", NOW, args[1]);

        self->spa_ts = timestamp;
        self->txg = args[1];
}

fbt::spa_sync:return
/self->spa_ts/
{
        this->d = (timestamp - self->spa_ts) / 1000000; /* milliseconds */

        printf("%8u R   sync ---- txg %u dur %u\n", NOW, self->txg, this->d);

        self->spa_ts = 0;
        self->txg = 0;
}

When we are experiencing this performance problem, there is some set of metaslabs being loaded and unloaded constantly. It helps to add the spa_sync entry/exit and duration, because much of the unloading happens during the sync (flush), for reasons such as metaslab_unload_delay transaction groups passing without an allocation from a particular metaslab.

It is also important to note that loading metaslabs on these systems (3.5" 7200RPM SAS/SATA) can take around a second at times, and when we do a lot of it in SPA sync context, the flush time for a transaction group can blow out past a minute. This can have a serious impact on the system's ability to service even asynchronous writes.

I was able to increase flush time from under a second to 75 seconds just by writing into /var/tmp with, e.g., dd if=/dev/urandom of=/var/tmp/rando bs=1M count=200.

We can prevent the unloading of metaslabs by setting metaslab_debug_unload to 1. In some basic testing this appears to improve the situation, possibly even resolving the performance issue entirely. I don't think we want to just set this flag and forget it, because some pools have a large number of metaslabs and this may substantially increase the memory footprint of the metaslab allocator.
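
For reference, the flag can be flipped on a live system with mdb -kw, or set persistently in /etc/system; this is a sketch of the usual tunable mechanism, not a command captured from this investigation:

# echo 'metaslab_debug_unload/W 1' | mdb -kw

    set zfs:metaslab_debug_unload = 1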

Comment by Joshua M. Clulow
Created at 2018-08-20T18:26:12.493Z
When allocations in a metaslab fail, they generally do so with a reason code. We can trace that with something like this D script:

#!/usr/sbin/dtrace -qCs

#define NOW     ((walltimestamp - START) / 1000000)

BEGIN
{
        START = walltimestamp;
}

inline string atype[int64_t t] =
    (t == -1) ? "ALLOC_FAILURE" :
    (t == -2) ? "TOO_SMALL" :
    (t == -3) ? "FORCE_GANG" :
    (t == -4) ? "NOT_ALLOCATABLE" :
    (t == -5) ? "GROUP_FAILURE" :
    (t == -6) ? "ENOSPC" :
    (t == -7) ? "CONDENSING" :
    (t == -8) ? "VDEV_ERROR" :
    "?";

fbt::metaslab_group_alloc_normal:entry
{
        self->mgan = 1;
}

fbt::metaslab_group_alloc_normal:return
{
        self->mgan = 0;
}

fbt::metaslab_trace_add:entry
/self->mgan && ((int64_t)arg5) < 0/
{
        this->ms_id = args[2]->ms_id;
        this->vd_id = args[2]->ms_group->mg_vd->vdev_id;

#if 0
        printf("%8u E  trace vd %4u ms %4u reas %s\n", NOW, this->vd_id,
            this->ms_id, atype[args[5]]);
#endif
        @trace[this->vd_id, this->ms_id, atype[(int64_t)arg5]] = count();
}

tick-1s
{
        printf("%Y\n", walltimestamp);
        printa("%@10d   vd %4u ms %4u reas %s\n", @trace);
        trunc(@trace);
}

This was towards the end of the time I had to investigate, but I recall seeing a lot of TOO_SMALL failures on many metaslabs, which I believe triggers a switch to a different metaslab in the rotor.

One last thing to understand is that metaslabs are kept in memory sorted by their "weight". For modern ZFS pools, where each metaslab has a spacemap histogram, the weight is effectively a combination of (from most to least significant) the size class of the largest free segments in the metaslab (a power-of-two bucket from the histogram) and the number of free segments in that size class.

On one system, I located the AVL tree in which the metaslabs are sorted and walked it, printing the weights. It seemed that there were a few metaslabs that had just a handful (1-10) of regions of one weight (e.g., 256KB), and then a large population of metaslabs with the next smaller weight (e.g., 128KB) and a large number of regions. It's still just conjecture, but it's possible we're confusing the heuristics here by oscillating between two rungs on the power-of-two ladder. When we re-sort the metaslab list (on allocation failure?), the discontinuous nature of the sort order may move some set of metaslabs wildly up and down, especially if we're both allocating and freeing regions.
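
For reference, the per-metaslab weights can be dumped with the same mdb pipeline used for the load census above, just printing ms_weight as well (a sketch; ms_weight holds the encoded weight value):

    > ::walk spa | ::print spa_t spa_normal_class->mc_rotor | ::list metaslab_group_t mg_next | ::print -ta metaslab_group_t mg_metaslab_tree | ::walk avl | ::print metaslab_t ms_id ms_weight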

Comment by Jerry Jelinek
Created at 2018-08-23T12:17:14.274Z
I think a better alternative to metaslab_debug_unload would be to tune metaslab_unload_delay. metaslab_unload_delay is currently only 8 txgs, and during my testing so far I see that we frequently unload a metaslab that is in the 10-100 txg range. This can happen pretty fast under a heavy write load. I need to collect more data, but setting metaslab_unload_delay to at least 100 seems reasonable, and we might want to go even higher, especially as we push all of the metaslabs down below 10% free. It might make sense for metaslab_unload_delay to automatically tune itself higher once most of the metaslabs are in the single-digit percent free range, but manually setting this higher is a reasonable first step.
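
For reference, the tunable can be bumped on a live system with mdb -kw (0t100 is decimal 100), or set persistently in /etc/system; a sketch of the usual mechanism:

# echo 'metaslab_unload_delay/W 0t100' | mdb -kw

    set zfs:metaslab_unload_delay = 100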

Comment by Jerry Jelinek
Created at 2018-08-23T19:48:15.835Z
Updated at 2018-08-23T20:15:58.181Z
I ran an llquantize on the milliseconds taken for spa_sync with metaslab_unload_delay at the default 8 txg value, and then again a little later with it set to 100. I had a continuous dd write workload running during each test. Here are the results.
metaslab_unload_delay: 8
           value  ------------- Distribution ------------- count    
               0 |                                         0        
             100 |@                                        1        
             200 |@@@@@                                    6        
             300 |@@@@@                                    6        
             400 |@@                                       2        
             500 |@@@@@                                    7        
             600 |@@@@                                     5        
             700 |@@                                       3        
             800 |@@                                       3        
             900 |@@@@@                                    6        
            1000 |                                         0        
            1100 |@@                                       2        
            1200 |@                                        1        
            1300 |@@                                       2        
            1400 |@@                                       2        
            1500 |                                         0        
            1600 |                                         0        
            1700 |@@                                       2        
            1800 |                                         0        
            1900 |                                         0        
            2000 |                                         0        
            2100 |                                         0        
            2200 |                                         0        
            2300 |                                         0        
            2400 |                                         0        
            2500 |@                                        1        
            2600 |@                                        1        
            2700 |                                         0        
            2800 |                                         0        
            2900 |@                                        1        
            3000 |                                         0        
            3100 |                                         0        
            3200 |                                         0        
            3300 |                                         0        
            3400 |@                                        1        
            3500 |                                         0        
            3600 |                                         0        
            3700 |                                         0        
            3800 |                                         0        
            3900 |@                                        1        
            4000 |                                         0        
metaslab_unload_delay: 100
           value  ------------- Distribution ------------- count    
               0 |                                         0        
             100 |@@                                       4        
             200 |@@@@@                                    9        
             300 |@@@@                                     8        
             400 |@@@@@@@@                                 16       
             500 |@@@@@@@@                                 16       
             600 |@@@@@                                    10       
             700 |@@@@                                     7        
             800 |@                                        1        
             900 |@                                        1        
            1000 |@                                        1        
            1100 |@                                        1        
            1200 |@                                        1        
            1300 |@                                        1        
            1400 |                                         0        
            1500 |                                         0        
            1600 |                                         0        
            1700 |                                         0        
            1800 |                                         0        
            1900 |@                                        1        
            2000 |                                         0        
So, that definitely tightens up the outliers. I'd be interested to see this on a machine under an actual SPC write load.
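
The distributions above were collected by aggregating the spa_sync duration in milliseconds. A minimal sketch of such a collection script follows; the exact llquantize() bucket parameters used for the output above aren't recorded here, so the ones below are an assumption.

#!/usr/sbin/dtrace -qs

fbt::spa_sync:entry
{
        self->ts = timestamp;
}

fbt::spa_sync:return
/self->ts/
{
        /* spa_sync duration in milliseconds; aggregation prints on exit (^C) */
        @ms = llquantize((timestamp - self->ts) / 1000000, 10, 2, 4, 10);
        self->ts = 0;
}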

It's important to note that increasing metaslab_unload_delay won't reduce the spa_sync time if we have to load a metaslab, so we can still hit outliers, but a larger metaslab_unload_delay will reduce how often we have to load metaslabs.

In terms of the kernel memory cost for this, the amount of memory consumed by the AVL tree for each slab will vary. On the system I was testing, the zpool is at 90% full and the average in-memory space for the AVL tree for each loaded metaslab is in the 50-60MB range. We have 3 top-level vdevs with 160 metaslabs each, so if we lock all 480 metaslabs in memory using metaslab_debug_unload, that would consume around 27GB of kernel memory (480 x ~55MB) in the current state; this can vary a lot, depending on the free space on each disk.
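
A rough way to gauge the aggregate memory used by loaded range trees (a sketch of one approach, not necessarily how the numbers above were measured) is to look at the range_seg_cache kmem cache with mdb -k:

    > ::kmastat ! grep range_seg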

Comment by Jerry Jelinek
Created at 2018-10-26T23:50:32.264Z
Updated at 2018-10-26T23:52:35.957Z
One of the eu-central-1a machines which was hitting this problem is MSA08605. I have an SPC lab machine set up with an identical zpool. I have filled the zpool (using a file size distribution similar to production) to an even higher capacity than the production machine, and I have similar fragmentation.

MSA08605
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zones   240T   219T  21.3T         -    59%    91%  1.00x  ONLINE  -
Lab Machine
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zones   240T   227T  13.4T        -         -    56%    94%  1.00x  ONLINE  -
However, I have tried many different workload combinations on the lab machine (including the dd test above) and I cannot induce this metaslab load/unload issue.

I think the only remaining difference is related to the details of the fragmentation. The top-level zpool fragmentation number is a summary of the actual spacemap histogram details. Looking at the production machine (using zdb -mm), the spacemap histogram data is very uniform. Here is a sample:
        metaslab      0   offset            0   spacemap     80   free    43.8G
        On-disk histogram:              fragmentation 59
                         13:  68459 ************
                         14: 113621 *******************
                         15: 150264 *************************
                         16: 245511 ****************************************
                         17:  85277 **************
All of the spacemaps have free space in the 45-55GB range, and fragmentation is consistent across metaslabs, ranging from 59% to 60%.

However, on the lab machine, the spacemaps are not nearly as consistent. Here are 3 examples.
        metaslab     0   offset            0   spacemap     80   free    4.48G
        On-disk histogram:              fragmentation 63
                         13:  20290 *****************************
                         14:  28925 ****************************************
                         15:  26907 **************************************
                         16:  16538 ***********************
                         17:   6832 **********
                         18:     11 *
                         19:      5 *
                         20:      4 *
                         21:      2 *
                         22:      1 *
...
        metaslab      8   offset  40000000000   spacemap    739   free     145G
        On-disk histogram:              fragmentation 6
                         13:  40986 *****************************
                         14:  56555 ****************************************
                         15:  28976 *********************
                         16:  19005 **************
                         17:   9178 *******
                         18:   2836 ***
                         19:   6534 *****
                         20:   4005 ***
                         21:     62 *
                         22:      1 *
                         23:      0
                         24:      0
                         25:      0
                         26:      0
                         27:      0
                         28:      0
                         29:      0
                         30:      0
                         31:      0
                         32:      0
                         33:      0
                         34:      0
                         35:      0
                         36:      1 *
...
        metaslab     40   offset 140000000000   spacemap    737   free     460G
        On-disk histogram:              fragmentation 0
                         13:   3738 **************
                         14:   3131 ************
                         15:   3099 ***********
                         16:   2553 **********
                         17:   1115 ****
                         18:    416 **
                         19:    174 *
                         20:    343 **
                         21:    544 **
                         22:    697 ***
                         23:   3226 ************
                         24:  11292 ****************************************
                         25:   3977 ***************
                         26:      7 *
Most of the spacemaps on the lab machine are in the 4-7GB free range, but then there are some with big chunks free (as shown above). Fragmentation across metaslabs ranges from the high 60% range to 0.

Thus, I think this metaslab load/unload problem is closely related to the metaslab spacemap profile as I've shown here. My next step is to do more analysis across production to see if all of the storage nodes are consistently similar to MSA08605. Also, I need to figure out how to fill/fragment my lab machine with a similar spacemap profile to production.

Comment by Jerry Jelinek
Created at 2018-10-29T20:38:52.098Z
Updated at 2018-10-29T22:34:35.270Z
The profile of fragmentation appears different on the newer shrimps vs. the older ones. I wrote a little script to consolidate the zdb -mm per-metaslab free-space and fragmentation information for the top-level vdevs, ignoring the slog vdev (a rough sketch of such a script appears at the end of this comment). Here is the slab data for the original node we were looking at, MSA08605.
# zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zones   240T   219T  21.3T         -    59%    91%  1.00x  ONLINE  -

free:
     10G 1
     20G 2
     30G 17
     40G 66
     50G 299
     60G 94
     70G 1
frag:
     59% 311
     60% 152
     61% 12
     62% 1
     63% 1
     64% 3
Here is the data for one of the newer shrimps (MS114899) which is filled to the same level.
# zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zones   300T   273T  26.7T         -    42%    91%  1.00x  ONLINE  -

free:
     30G 46
     40G 31
     50G 22
     60G 11
     70G 15
     80G 10
     90G 7
    100G 5
    110G 13
    120G 40
    130G 48
    140G 30
    150G 4
    160G 1
    170G 3
    200G 2 
    220G 2
    230G 1
    240G 2
    250G 1
    260G 2
    270G 1
    300G 2
    390G 1
frag:
     25% 2
     26% 4
     27% 11
     28% 27
     29% 26
     30% 36
     31% 28
     32% 8
     33% 8
     34% 7
     35% 6
     36% 5
     37% 1
     38% 2
     39% 2
     40% 2
     41% 3
     42% 2
     43% 7
     44% 6
     45% 4
     46% 6
     47% 1
     48% 1
     49% 3
     51% 3
     53% 1
     56% 1
     59% 1
     60% 2
     61% 1
     62% 3
     63% 4
     64% 20
     65% 46
     66% 10
The top-level fragmentation value reflects this to some extent, but we can see that the profile summary shows a much wider spread than on the older shrimp. This node would be able to absorb the current write workload without any issue, even if metaslab_debug_unload wasn't set.

For comparison, here is the lab machine I set up to try to replicate this issue.
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zones   240T   227T  13.4T        -         -    56%    94%  1.00x  ONLINE  -

free:
      0G 5
     10G 422
     20G 3
     30G 5
     40G 2
     50G 2
     60G 1
     80G 2
    100G 1
    110G 1
    130G 1
    150G 6
    170G 2
    200G 1
    210G 1
    220G 1
    230G 2
    250G 1
    270G 1
    300G 2
    370G 5
    380G 4
    470G 5
    490G 1
    500G 3
frag:
      0% 14
      1% 10
      2% 4
      3% 3
      4% 4
      5% 1
      6% 3
      8% 1
     12% 1
     16% 1
     18% 1
     23% 1
     24% 1
     27% 1
     28% 1
     31% 1
     33% 2
     35% 2
     40% 1
     45% 1
     48% 1
     51% 1
     52% 3
     53% 1
     54% 2
     55% 3
     56% 2
     57% 16
     58% 23
     59% 27
     60% 31
     61% 43
     62% 32
     63% 45
     64% 31
     65% 48
     66% 108
     67% 5
     68% 2
     69% 1
     72% 1
This helps illustrate why I have been unable to reproduce the problem in the lab, even though the top-level fragmentation is comparable.

Looking at east, I don't see the problematic fragmentation profile that we see in eu-central. In east-1a, I only see one shrimp that is similar. The other shrimps that are getting close to full look more like the second shrimp example below.
MS941003
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zones   240T   221T  19.2T         -    59%    91%  1.00x  ONLINE  -

free:
     20G 5
     30G 53
     40G 158
     50G 167
     60G 95
     70G 1
     80G 1
frag:
     56% 1
     57% 5
     58% 17
     59% 297
     60% 133
     61% 21
     62% 3
     65% 3
     
MS941005
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT  
zones   240T   217T  22.6T         -    57%    90%  1.00x  ONLINE  -

free:
     10G 3
     20G 8
     30G 41
     40G 103
     50G 147
     60G 90
     70G 41
     80G 20
     90G 16
    100G 3
    110G 3
    120G 3
    130G 1
    160G 1
frag:
     44% 1
     45% 2
     46% 3
     47% 2
     48% 9
     49% 11
     50% 13
     51% 9
     52% 10
     53% 13
     54% 19
     55% 13
     56% 11
     57% 25
     58% 49
     59% 159
     60% 103
     61% 16
     62% 5
     63% 1
     65% 1
     66% 1
     69% 1
     70% 1
     84% 1
     87% 1
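
As referenced above, here is a rough sketch of the kind of consolidation script used. This is not the original script: it assumes the zdb -mm line format shown in the previous comment, reports free space in 10G buckets, does not sort its output, and makes no attempt to exclude the slog vdev.

#!/bin/sh
#
# Rough sketch of a 'zdb -mm' consolidator: count metaslabs per 10G
# free-space bucket and per fragmentation percentage.
#
zdb -mm zones | awk '
        $1 == "metaslab" && $(NF - 1) == "free" {
                free = $NF;                     # e.g. "43.8G", "145G"
                sub(/G$/, "", free);
                freecnt[int(free / 10) * 10 "G"]++;
        }
        $(NF - 1) == "fragmentation" {
                fragcnt[$NF "%"]++;
        }
        END {
                print "free:";
                for (b in freecnt)
                        printf("    %6s %d\n", b, freecnt[b]);
                print "frag:";
                for (f in fragcnt)
                        printf("    %6s %d\n", f, fragcnt[f]);
        }'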

Comment by Jerry Jelinek
Created at 2018-11-21T15:26:28.770Z
Updated at 2018-11-21T15:50:22.942Z
This is a summary of testing my prototype fix on the production machine MSA08605. This machine has the "clustered" fragmentation profile which seems to be related to the metaslab load/unload oscillation. Also, this machine was one on which we could easily reproduce this behavior with the simple 'dd' test described above. With the test platform, we're running with metaslab_debug_unload set back to the default of 0 (so metaslab unloading is allowed).

As expected, the fix has tamped down the oscillation. I can see that metaslabs are being unloaded periodically, but only after a reasonable period when they have not been used. I ran many iterations of the 'dd' test. This initially caused the metaslab_unload_delay to be raised to 16 (from the default of 8). I left my monitoring running overnight while the machine was running the normal storage server workload, and I can see the expected periodic behavior of metaslab loading and unloading, but again only after acceptable periods of residency. The metaslab_unload_delay did get automatically increased to 64 overnight. This tamps down oscillation even more.
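
For reference, the current value of the (now auto-tuned) delay can be watched over time by reading the kernel global with a trivial DTrace invocation (a sketch):

# dtrace -qn 'tick-60s { printf("%Y metaslab_unload_delay = %d\n", walltimestamp, `metaslab_unload_delay); }'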

There was one unusual event of long zil_commit and spa_sync latency during the initial testing when I was running the 'dd' load. This unusual event is not related to metaslab load latency, so I'm going to open a separate ticket for that. I only saw this event one time, despite repeated 'dd' test runs, and it never recurred overnight. I'm not yet sure what caused that anomaly, since sd_send_scsi_SYNCHRONIZE_CACHE latency was normal. I opened OS-7390 for this issue.

Comment by Jerry Jelinek
Created at 2018-12-06T16:54:27.528Z
Updated at 2018-12-06T16:54:58.413Z
The latest round of testing using the new, linear ramping algorithm gives the same basic behavior as the previous scaling algorithm. On MSA08605 we see that metaslab_unload_delay has ramped up to 38 under the normal write load this box is experiencing. Hitting the box with the 'dd' write load we've used hasn't caused the delay to go up any further. We can still see a spike in spa_sync latency when we hit the box with the 'dd' write load and there are not enough metaslabs loaded. For example, I saw this when I did the first 'dd' shortly after the box was booted. I did not see it again when I did more 'dd' an hour or several hours later, but if I come back the next day and do it again, it looks like enough metaslabs were unloaded in the intervening time that we get another spa_sync latency spike due to metaslab loading.

Comment by Jerry Jelinek
Created at 2018-12-17T14:02:35.818Z
Updated at 2018-12-17T14:18:59.820Z
We ran the test platform on a metadata tier async node (JA701919). There were several periods when this node would have started thrashing on unload/load, but instead it ramped up the delay. After running over the weekend, the delay is currently 118. There were a few instances of longer spa_sync latency when metaslabs had to be loaded, but just one instance where the latency was in the 2.5 second range. For comparison, there are 10 instances in the baseline where the latency is 2.5 seconds or more. The baseline data was collected with metaslab unloading disabled. Of course, we could experience similar latency results with the new algorithm, but the point is that the new algorithm is not showing worse results than the baseline.

Comment by Jira Bot
Created at 2018-12-17T16:33:47.423Z
illumos-joyent commit f4cb444376b6efe8770e521115e58d5df2379c97 (branch master, by Jerry Jelinek)

OS-7151 ZFS loading and unloading metaslabs at audible frequency
Reviewed by: Joshua Clulow <jmc@joyent.com>
Reviewed by: Kody Kantor <kody.kantor@joyent.com>
Approved by: Kody Kantor <kody.kantor@joyent.com>


Comment by Jira Bot
Created at 2019-03-12T16:44:48.071Z
hotpatches commit 6fcc5d5c21ae68cd8698126a5c2d0a3981384971 (branch master, by Angela Fong)

Add still in-use manta patches into hotpatches.git
(MANTA-2948, MANTA-3970, MANATEE-343, OS-7151)