OS-7314: zil_commit should omit cache thrash

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-12-03T15:49:59.751Z)

Fix Versions

2018-12-06 Grizz (Release Date: 2018-12-06)

Related Issues

Description

The changes and testing required for OS-7300 have spurred a renewed focus on ZFS performance in the face of large async-write workloads featuring interspersed fsync() calls to flush pending data to the disks. In typical test scenarios, when the async writes were progressing, IO to disks underlying the pool (3 pairs of mirrored SSDs, no SLOG) would look good, each disk pushing ~100MB/s at latencies around 1ms. With a large amount of data still outstanding, the calls to fsync() would change the performance profile drastically. The disks still report 100% busy via iostat, but their IOPS and throughput would fall dramatically, with average latency shooting up to 50ms or more. When the workload was updated to omit the fsync() (or at least delay it until the cached writes had completed), throughput to the disks would remain high and the total write workload would complete in a timely fashion.

Comments

Comment by Bryan Cantrill
Created at 2018-10-19T21:29:46.702Z
Updated at 2018-10-19T21:46:42.211Z
One peculiar aspect of the observed pathological performance was its asymmetry with respect to devices. Some devices seemed to be exhibiting normal (if slightly degraded) latency, while other devices exhibiting shockingly poor latency (image link):

!PC2SWBC42-all.png!

We also observed that there were more I/O operations queued from the device layer than we would expect (namely, more than the I/O scheduler should allow). Further investigation revealed that these were tremendous numbers of SYNCHRONIZE_CACHE commands emanating from zil_commit via repeated calls to zio_flush.

To explore the performance consequences of the SYNCHRONIZE_CACHE commands, the following script was written:

#pragma D option quiet
#pragma D option bufsize=8m
#pragma D option strsize=16
#pragma D option dynvarsize=16m

BEGIN
{
        stime = timestamp;
}

io:::start
{
        printf("%d -> dev=%s type=%c blkno=%d size=%d %p\n",
            timestamp - stime, args[1]->dev_statname,
            args[0]->b_flags & B_READ ? 'R' : 'W',
            args[0]->b_lblkno, args[0]->b_bufsize,
            args[0]->b_addr);
        start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/this->ts = start[args[0]->b_edev, args[0]->b_blkno]/
{
        start[args[0]->b_edev, args[0]->b_blkno] = 0;
        printf("%d <- dev=%s type=%c blkno=%d size=%d latency=%d %p\n",
            timestamp - stime, args[1]->dev_statname,
            args[0]->b_flags & B_READ ? 'R' : 'W',
            args[0]->b_lblkno, args[0]->b_bufsize,
            timestamp - this->ts,
            args[0]->b_addr);
}

sd_xbuf_strategy:entry
{
        this->u = (struct sd_lun *)arg2;
        this->dip = (struct dev_info *)this->u->un_sd->sd_dev;

        printf("%d dev=sd%d cmds=%d\n", timestamp - stime,
            this->dip->devi_instance, this->u->un_ncmds_in_transport);
}

sd_send_scsi_SYNCHRONIZE_CACHE:entry
{
        self->un = args[0];
}

sd_send_scsi_SYNCHRONIZE_CACHE:return
/self->un/
{
        self->un = NULL;
}

sd_uscsi_strategy:entry
/(this->u = self->un) != NULL/
{
        fstart[(caddr_t)args[0]] = timestamp;
        un[args[0]] = this->u;
        this->dip = (struct dev_info *)this->u->un_sd->sd_dev;
        printf("%d -> dev=sd%d SYNCHRONIZE_CACHE cmds=%d %p\n",
            timestamp - stime, this->dip->devi_instance,
            this->u->un_ncmds_in_transport, args[0]);
        self->un = NULL;
}

sd_send_scsi_SYNCHRONIZE_CACHE_biodone:entry
/this->start = fstart[(caddr_t)args[0]]/
{
        this->u = un[args[0]];
        this->dip = (struct dev_info *)this->u->un_sd->sd_dev;

        printf("%d <- dev=sd%d SYNCHRONIZE_CACHE cmds=%d latency=%d %p\n",
            timestamp - stime, this->dip->devi_instance,
            this->u->un_ncmds_in_transport, timestamp - this->start, args[0]);

        fstart[(caddr_t)args[0]] = 0;
        un[args[0]] = NULL;
}

The full trace is attached (external link) -- and the results are damning in that it is clear that periods of poor performance correlate with tremendous numbers of outstanding SYNCHRONIZE_CACHE commands, e.g. (image link):

!PC2SWBC42-sync-sd6.png!

(For the sake of completeness, here are external links for sd1, sd2, sd3, sd4, sd5, and sd6.)

It appears that the I/Os themselves are having to be enqueued behind these commands; as each one takes ~5 ms even in the best case on this particular SSD, being enqueued behind 10, 20 or more of them quickly spikes I/O latency and corks the device -- and increases the likelihood that additional SYNCHRONIZE_CACHE commands will arrive.

Clearly, this is deeply suboptimal. zil_commit should be reworked such that exactly one SYNCHRONIZE_CACHE command is issued per device after all other I/O has been issued.

Comment by Patrick Mooney
Created at 2018-12-03T15:49:43.693Z
Addressed by illumos#9962, which was merged as git cab3a55.