OS-7917: excessive zil_commit latency

Details

Issue Type: Bug
Priority: 4 - Normal
Status: Resolved
Created at: 2019-07-31T21:05:44.988Z
Updated at: 2019-08-16T13:56:46.277Z

People

Created by: Former user
Reported by: Former user
Assigned to: Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-08-16T13:56:46.262Z)

Fix Versions

2019-08-29 Zoo York (Release Date: 2019-08-29)

Related Issues

Description

Any application that does synchronous writes, or issues an fsync to ensure its data is on stable storage, goes through the zil_commit code path in ZFS. This writes the ZIL to stable storage before returning to the application. In particular, in Manta, nginx always issues an fsync at the end of an object write, which blocks the nginx process until the syscall returns.

We usually deploy our servers with an SSD slog device, so we would expect zil_commit to be fast, and it usually is. However, we have recently observed very long zil_commit times on storage servers that are ~96% full. Since the slog is a separate device, it would seem that how full the zpool is should be irrelevant to zil_commit latency.
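
To see this latency directly, zil_commit itself can be timed with DTrace. This is just a sketch (it assumes fbt entry/return probes for zil_commit are available on this build and uses an arbitrary 30-second window), quantizing zil_commit time in nanoseconds per process:

# dtrace -n 'fbt::zil_commit:entry { self->ts = timestamp; } fbt::zil_commit:return /self->ts/ { @[execname] = quantize(timestamp - self->ts); self->ts = 0; }' -c 'sleep 30'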

However, what is actually happening is that the zil_commit and spa_sync code paths share the same set of write queues in the zio pipeline. If there are spa_sync zios ahead of the zil_commit zios, the zil_commit is blocked until the spa_sync zios are done. With a relatively full zpool, we can hit the case in spa_sync where we have to load multiple metaslabs in the zio_dva_allocate pipeline stage. As we saw in OS-7151, loading metaslabs can take an exorbitant amount of time, particularly on zpools that are very full.

OS-7151 was an attempt to alleviate that, but it still allows metaslabs to be unloaded, so once we hit a long, slow series of metaslab loads, any concurrent zil_commit that is enqueued behind them sees a long delay.
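
The other half of this is how long the metaslab loads themselves take. A similar sketch (again assuming fbt probes exist, here for metaslab_load) quantizes the load time in nanoseconds:

# dtrace -n 'fbt::metaslab_load:entry { self->ts = timestamp; } fbt::metaslab_load:return /self->ts/ { @["metaslab_load time (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }' -c 'sleep 60'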

Comments

Comment by Former user
Created at 2019-08-09T20:13:04.414Z

I have considered a variety of ways to approach a fix for this issue, but most of them require relatively invasive code changes that would have to go through a lot of testing first and then be taken upstream. In the end, I think we should stick with the simple approach for SmartOS. The behavior around zil_slog_bulk seems misguided to me. It might make a small improvement under some conditions, but the increased risk of bad application latency is severe and relatively easy to hit, as we have seen in production. To keep our code changes simple, we'll essentially just disable this behavior on SmartOS. If someone really needs the old behavior, they can still tune zil_slog_bulk to the appropriate level.
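
For reference, a sketch of what that tuning looks like in /etc/system (the 2GB value shown is only an example, and this assumes zil_slog_bulk lives in the zfs module along with the other ZIL tunables):

set zfs:zil_slog_bulk = 0x80000000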


Comment by Former user
Created at 2019-08-14T21:58:41.580Z

This sounds like a good approach to me. During code review I was curious what sort of zl_cur_used values we see on our storage nodes and metadata nodes under constant write load.

[root@storage_node ~]# dtrace -n 'zil_lwb_write_open:entry { @ = quantize(args[0]->zl_cur_used)}' -c 'sleep 30'
dtrace: description 'zil_lwb_write_open:entry ' matched 1 probe
dtrace: pid 419122 has exited


           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |                                         271
               1 |                                         0
               2 |                                         0
               4 |                                         0
               8 |                                         0
              16 |                                         0
              32 |                                         0
              64 |                                         194
             128 |                                         159
             256 |                                         13
             512 |                                         2
            1024 |                                         0
            2048 |                                         1
            4096 |                                         2
            8192 |                                         139
           16384 |@                                        302
           32768 |@                                        544
           65536 |@@                                       898
          131072 |@@                                       1414
          262144 |@@@                                      2025
          524288 |@@@@@@                                   3719
         1048576 |@@@@@@@@@@                               5670
         2097152 |@@@@@@@@@@@                              6273
         4194304 |@@@                                      1763
         8388608 |                                         0

So the storage nodes fairly regularly see zl_cur_used exceed the default zil_slog_bulk threshold of 786432 bytes (768K). We already knew this from what Jerry found in SCI-799.

The Postgres checkpointer also does some surprisingly large sync writes that will hit this condition:

[root@metadata_node ~]# dtrace -n 'zil_lwb_write_open:entry /args[0]->zl_cur_used > 786432/{ @[ustack()] = count()}' -c 'sleep 30'
dtrace: description 'zil_lwb_write_open:entry ' matched 1 probe
dtrace: pid 693661 has exited


              libc.so.1`__fdsync+0xa
              postgres`pg_fsync_no_writethrough+0x20
              postgres`pg_fsync+0x15
              postgres`FileSync+0x40
              postgres`mdsync+0x1e8
              postgres`smgrsync+0x52
              postgres`CheckPointBuffers+0x41
              postgres`CheckPointGuts+0x46
              postgres`CreateCheckPoint+0x64d
              postgres`CheckpointerMain+0x51d
              postgres`AuxiliaryProcessMain+0x4c7
              postgres`StartChildProcess+0xc8
              postgres`reaper+0x244
              libc.so.1`__sighndlr+0x6
              libc.so.1`call_user_handler+0x1db
              libc.so.1`sigacthandler+0x116
              libc.so.1`__pollsys+0xa
              libc.so.1`pselect+0x26b
              libc.so.1`select+0x5a
              postgres`ServerLoop+0xde
             2312

That being said, the proposed value in code review looks good to me!


Comment by Former user
Created at 2019-08-15T14:06:49.320Z
Updated at 2019-08-15T14:08:13.242Z

Because the proposed fix just changes the zil_slog_bulk setting, we have essentially already tested it with our tuning throughout the SE DC. We first observed the production behavior without any tuning, then with a 64MB value (which helped, but we still saw some latency from larger writes up to the 512MB bucket), and finally with a 2GB value (which eliminated all of the latency issues).
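
For reference, that kind of tuning can also be applied to a running system by writing the 64-bit zil_slog_bulk variable with mdb; a sketch, with values corresponding to the 64MB and 2GB settings mentioned above:

# echo 'zil_slog_bulk/Z 0x4000000' | mdb -kw    # 64MB
# echo 'zil_slog_bulk/Z 0x80000000' | mdb -kw   # 2GB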


Comment by Jira Bot
Created at 2019-08-15T22:33:15.028Z

illumos-joyent commit adf665b5e735d87cf7d6b4e8932fb7dce88e42ad (branch master, by Jerry Jelinek)

OS-7917 excessive zil_commit latency
Reviewed by: Kody Kantor <kody.kantor@joyent.com>
Approved by: Kody Kantor <kody.kantor@joyent.com>