Issue Type: | Bug
---|---
Priority: | 4 - Normal
Status: | Resolved
Created at: | 2019-07-31T21:05:44.988Z
Updated at: | 2019-08-16T13:56:46.277Z
Created by: | Former user
Reported by: | Former user
Assigned to: | Former user
Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-08-16T13:56:46.262Z)
2019-08-29 Zoo York (Release Date: 2019-08-29)
Any application that does synchronous writes, or issues an fsync to ensure data is on stable storage, will go through the zil_commit code path in ZFS. This writes the ZIL to stable storage before returning to the application. In particular, in manta, nginx will always issue an fsync at the end of an object write, and this blocks the nginx process until the syscall returns.
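For reference, a minimal DTrace sketch (assuming the fbt provider can attach to zil_commit on the build in question) to quantize per-process zil_commit latency and confirm which applications are stalling on it:

```
# Quantize zil_commit latency (ns) per process over a 30 second window.
dtrace -n '
fbt::zil_commit:entry { self->ts = timestamp; }
fbt::zil_commit:return /self->ts/
{
        @[execname] = quantize(timestamp - self->ts);
        self->ts = 0;
}' -c 'sleep 30'
```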
We usually deploy our servers with an SSD slog device, so we would expect zil_commit to be fast, and it usually is. However, we have recently observed very long zil_commit times on storage servers which are ~96% full. Since the slog is a separate device, it seems logical to assume that a relatively full zpool should be irrelevant.

However, what is happening is that the zil_commit and spa_sync code share the same set of write queues for the zio pipeline. If there are spa_sync zio's ahead of the zil_commit zio's, they will block the zil_commit until the spa_sync zio's are done. With a relatively full zpool, we can hit the case in spa_sync where we have to load multiple metaslabs in the zio_dva_allocate pipeline stage. As we saw in the past with OS-7151, loading metaslabs can take an exorbitant amount of time, particularly on zpools that are very full. OS-7151 was an attempt to alleviate that, but it still allows metaslabs to be unloaded, so once we hit a long, slow set of metaslab loads, we will cause a long delay for any concurrent zil_commit that is subsequently enqueued.
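To check whether slow metaslab loads are the culprit on a given system, a rough sketch along the same lines (again assuming fbt can probe metaslab_load on this build) is to quantize how long each load takes while the pool is under write load:

```
# Quantize metaslab_load duration (ns) over a 30 second window.
dtrace -n '
fbt::metaslab_load:entry { self->ts = timestamp; }
fbt::metaslab_load:return /self->ts/
{
        @["metaslab_load (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}' -c 'sleep 30'
```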
I have considered a variety of ways to approach a fix to this issue, but most of them require relatively invasive code changes which would first have to go through a lot of testing and then be upstreamed. In the end, I think we should stick with the simple approach for SmartOS. The behavior around zil_slog_bulk seems misguided to me. It might make a small improvement under some conditions, but the increased risk of bad application latency is severe and relatively easy to hit, as we have seen throughout production. To keep our code changes simple, we'll essentially just disable this behavior on SmartOS. If someone really needs the old behavior, they can still tune it back to the appropriate level.
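As an illustration of what that re-tuning could look like (a sketch only; it assumes the zil_slog_bulk symbol is still exported by the zfs module on the running system), the threshold can be raised live with mdb or persistently via /etc/system:

```
# Raise the threshold to 2GB on a live system (takes effect immediately).
echo "zil_slog_bulk/Z 0x80000000" | mdb -kw

# Or persist it across reboots by adding this line to /etc/system:
#   set zfs:zil_slog_bulk = 0x80000000
```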
This sounds like a good approach to me. During code review I was curious what sort of zl_cur_used values we see on our storage nodes and metadata nodes under constant write load.
```
[root@storage_node ~]# dtrace -n 'zil_lwb_write_open:entry { @ = quantize(args[0]->zl_cur_used)}' -c 'sleep 30'
dtrace: description 'zil_lwb_write_open:entry ' matched 1 probe
dtrace: pid 419122 has exited

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |                                         271
               1 |                                         0
               2 |                                         0
               4 |                                         0
               8 |                                         0
              16 |                                         0
              32 |                                         0
              64 |                                         194
             128 |                                         159
             256 |                                         13
             512 |                                         2
            1024 |                                         0
            2048 |                                         1
            4096 |                                         2
            8192 |                                         139
           16384 |@                                        302
           32768 |@                                        544
           65536 |@@                                       898
          131072 |@@                                       1414
          262144 |@@@                                      2025
          524288 |@@@@@@                                   3719
         1048576 |@@@@@@@@@@                               5670
         2097152 |@@@@@@@@@@@                              6273
         4194304 |@@@                                      1763
         8388608 |                                         0
```
So the storage nodes seem to hit this condition fairly regularly. We already knew this from what Jerry found in SCI-799.
The Postgres checkpointer also does some surprisingly large sync writes that will hit this condition:
```
[root@metadata_node ~]# dtrace -n 'zil_lwb_write_open:entry /args[0]->zl_cur_used > 786432/{ @[ustack()] = count()}' -c 'sleep 30'
dtrace: description 'zil_lwb_write_open:entry ' matched 1 probe
dtrace: pid 693661 has exited

              libc.so.1`__fdsync+0xa
              postgres`pg_fsync_no_writethrough+0x20
              postgres`pg_fsync+0x15
              postgres`FileSync+0x40
              postgres`mdsync+0x1e8
              postgres`smgrsync+0x52
              postgres`CheckPointBuffers+0x41
              postgres`CheckPointGuts+0x46
              postgres`CreateCheckPoint+0x64d
              postgres`CheckpointerMain+0x51d
              postgres`AuxiliaryProcessMain+0x4c7
              postgres`StartChildProcess+0xc8
              postgres`reaper+0x244
              libc.so.1`__sighndlr+0x6
              libc.so.1`call_user_handler+0x1db
              libc.so.1`sigacthandler+0x116
              libc.so.1`__pollsys+0xa
              libc.so.1`pselect+0x26b
              libc.so.1`select+0x5a
              postgres`ServerLoop+0xde
             2312
```
That being said, the proposed value in code review looks good to me!
Because the proposed fix is just a setting for zil_slog_bulk, we have essentially already tested this with our tuning throughout the SE DC. We first observed the production behavior without the tuning, then with a 64MB tuning value (which helped, but we still saw some latency due to some larger writes up to the 512MB bucket), and finally with a 2GB tuning (which eliminated all latency issues).
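For completeness, a quick way to confirm which value a node is actually running with (again assuming the symbol name is unchanged) is to read it back with mdb:

```
# Print the current zil_slog_bulk value as an 8-byte hex integer.
echo "zil_slog_bulk/J" | mdb -k
```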
illumos-joyent commit adf665b5e735d87cf7d6b4e8932fb7dce88e42ad (branch master, by Jerry Jelinek)
OS-7917 excessive zil_commit latency
Reviewed by: Kody Kantor <kody.kantor@joyent.com>
Approved by: Kody Kantor <kody.kantor@joyent.com>