OS-1952

sdopen spends 4.7s in cmlb with mislabeled disks

Status:: Open
Created:: 2013-02-25T13:52:01.000-0500
Updated:: 2014-05-16T16:09:40.000-0400

Description

It appears that a certain combination of bad disk labels will cause cmlb_partinfo() to take about 4.7s and issue 344 SCSI commands. On a working system, this takes 0.02s and issues 7 SCSI commands.
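
For scale, the figures above can be broken down as follows. This is plain arithmetic on the numbers quoted in this description. The per-command averages assume the elapsed time is spread evenly across the commands, which the attached traces may or may not bear out.

```python
# Figures quoted in the description above.
BROKEN_SECONDS, BROKEN_COMMANDS = 4.7, 344
WORKING_SECONDS, WORKING_COMMANDS = 0.02, 7


def per_command_ms(seconds, commands):
    """Average wall-clock time per SCSI command, in milliseconds."""
    return seconds / commands * 1000.0


broken_ms = per_command_ms(BROKEN_SECONDS, BROKEN_COMMANDS)
working_ms = per_command_ms(WORKING_SECONDS, WORKING_COMMANDS)

# How many times more commands, and how much more time, the bad case needs.
command_ratio = BROKEN_COMMANDS / WORKING_COMMANDS
time_ratio = BROKEN_SECONDS / WORKING_SECONDS

print(f"broken:  {broken_ms:.2f} ms per command")
print(f"working: {working_ms:.2f} ms per command")
print(f"commands: {command_ratio:.1f}x, elapsed time: {time_ratio:.0f}x")
```

On these numbers the bad case issues roughly 49 times as many commands, and each command is also several times slower on average, so the extra commands alone do not account for the whole delay.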

Unfortunately, it appears that the labels are at both the beginning and end of the device; we have only the label from the beginning, which is not by itself sufficient to trigger the problem. However, we do also have a trace from the broken labels showing the exact control flow, as well as a trace with only the first label bad. This should be enough to make progress; see attached.

This problem may manifest itself as timeouts and failures during node setup. The workaround is to relabel all affected disks using echo "label y quit" | format /dev/rdsk/<disk>p0.

Comments (2)

Former user commented on 2013-04-22T20:03:16.000-0400:

This is even more unlikely than previously believed. First, this problem occurs only with disks larger than 2 TB in size. Smaller disks will enter the VTOC path and have a default label cons'd up, avoiding most of the problem since this need only be done once rather than on every open of any partition. Second, the painfully slow total time requires that the disk be extremely slow. Using a 3 TB HGST disk, the only way I was able to approach the 4.7s time noted here was to disable the read cache (thus forcing the partition table reads to miss in the cache every time). By doing so, I achieved a 2.9s per disk time for disk_size.

Given the difficulty of hitting this, I'm lowering it to an RFE. Specifically, we should cons up a phony label for large disks the way we do for small ones.

Former user commented on 2013-11-19T20:57:26.000-0500:

After exhaustive analysis, I've learned the following:

1. Opening a "p0" device with O_NDELAY will bypass the geometry and label validation checks until you actually try to read or write the device. This would allow tools that only do ioctls, specifically disk_size and removable_disk (but probably not diskinfo, since it needs the topo map which needs the devid) to run quickly even when the disks are unlabeled. Since there is no drawback to doing so, we should enhance these tools accordingly. A separate bug will cover this.

2. Even when opened O_NDELAY, even "p0" devices will attempt to forcibly revalidate the geometry and label via cmlb_validate() if the device did not previously have a valid label as determined by cmlb. On every read. And every write. And many ioctls. Since every attempted validation of a large, empty disk requires hundreds of I/Os, the effective throughput of the device becomes approximately 50 bytes per second. Bytes, not KB or MB. 50 B/s.

3. The zpool import code that searches disks for pools takes about 1 minute 35 seconds to run on a system with 40 unlabeled disks, despite using 64 threads to check devices in parallel. This is a direct result of (2). Our startup code, which can do this several times (such as in fs-joyent and joysetup), compounds the problem.

4. The code path in cmlb that creates labels does so only for disks with fdisk (MBR) type labels, and then only if there is a "SOLARIS2" fdisk partition in which to create the traditional VTOC16. Otherwise, it leaves the device unlabeled and in the same broken state we are seeing with the larger disks that cannot use MBR-style labels. So this path is not really a solution unless we want to risk writing to any blank writable media that the system finds, which seems profoundly risky although maybe not so terrible as one might imagine.
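
The open pattern from (1) can be sketched as follows. This is a minimal illustration in Python, not the code of disk_size or removable_disk; the function names open_for_ioctl and is_device_node are hypothetical. The ioctl itself is left out, since this ticket does not name the requests those tools issue. The only point shown is that the descriptor is opened with O_NDELAY and is never read or written.

```python
import os
import stat


def open_for_ioctl(path):
    """Open a device node with O_NDELAY, intended for ioctl-only use.

    Hypothetical helper. Per item (1), the open itself should skip the
    geometry and label checks; per item (2), any later read or write on
    the descriptor can still trigger them.
    """
    return os.open(path, os.O_RDONLY | os.O_NDELAY)


def is_device_node(path):
    """Return True if path is a character or block device.

    Uses only fstat() on the descriptor, so no read or write is issued.
    """
    fd = open_for_ioctl(path)
    try:
        mode = os.fstat(fd).st_mode
        return stat.S_ISCHR(mode) or stat.S_ISBLK(mode)
    finally:
        os.close(fd)
```

A tool following this pattern would issue its ioctls where the fstat() call sits and close the descriptor without ever reading from it.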

Out of all that, it seems like the simplest and lowest-risk path forward here is to do something like the idea in (4) but from userland, during startup before we do anything else that might possibly touch disks. Any non-removable device with an invalid label would be given a simple EFI label. This doesn't help with enumeration, which it seems will still go through the slow path in the kernel.

Because of (2) above, simply implementing a negative cache for label status is insufficient without further changes. A negative cache seems like the obvious fix, but many of the code paths we follow during startup will send sd back through the force-revalidation path in cmlb, so without a major rework of cmlb to avoid revalidating when the geometry and label cannot possibly have changed, it would be ineffective. It appears that much of this is the way it is to handle a case that can never occur in our world: the remotely resized SAN LU. We could remove much of this code from cmlb and replace it with a simple cache invalidated by sd (and cmdk, I suppose) when the device is written to or a removable device is changed. The danger of course is overlooking the decades of accumulated edge cases that cmlb currently addresses. Hence the userland approach.
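
For reference, the behaviour of the cache considered and set aside above can be modelled in a few lines. This is a toy userland model in Python, not cmlb or sd code; LabelCache, is_labeled and invalidate are hypothetical names, and the validate callback stands in for the expensive label check. It records negative results as well as positive ones, and forgets a device only when told that it was written to or that its media changed.

```python
class LabelCache:
    """Toy model of a per-device label-status cache (not cmlb code)."""

    def __init__(self, validate):
        self._validate = validate  # expensive check: device name -> bool
        self._status = {}          # device name -> cached True/False
        self.validations = 0       # how often the expensive check ran

    def is_labeled(self, device):
        """Return the cached label status, validating only on a miss."""
        if device not in self._status:
            self.validations += 1
            self._status[device] = self._validate(device)
        return self._status[device]

    def invalidate(self, device):
        """Forget a device; to be called on a write or a media change."""
        self._status.pop(device, None)


# Usage: an unlabeled disk is validated once, not once per access.
cache = LabelCache(lambda device: False)
for _ in range(1000):
    cache.is_labeled("disk0")
validations_before_write = cache.validations

cache.invalidate("disk0")  # for example, after a write to the device
cache.is_labeled("disk0")
validations_after_write = cache.validations
```

In this model a thousand accesses to an unlabeled disk cost one validation, and a write costs one more. It deliberately ignores the edge cases that cmlb handles today, which is the risk called out above.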