OS-7981: mpt_sas hangs after config header request timeout

Details

Issue Type:Bug
Priority:4 - Normal
Status:Resolved
Created at:2019-09-06T11:34:06.035Z
Updated at:2019-09-11T14:20:14.312Z

People

Created by:Hans Rosenfeld
Reported by:Hans Rosenfeld
Assigned to:Hans Rosenfeld

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-09-11T14:20:14.297Z)

Fix Versions

2019-09-12 Art Vandelay (Release Date: 2019-09-12)

Related Links

Description

On a customer system, we ran into these warnings at boot roughly once in ten reboots:

WARNING: /pci@14,0/pci8086,2030@0/pci1028,1f45@0 (mpt_sas0):
       config header request timeout
WARNING: /pci@14,0/pci8086,2030@0/pci1028,1f45@0 (mpt_sas0):
       Failed to get sas phy page 0 for each phy
WARNING: /pci@14,0/pci8086,2030@0/pci1028,1f45@0 (mpt_sas0):
       mptsas phy update failed

The system would then take hours to complete its boot and not find any disks/SSDs attached to the affected SAS HBA.

This is caused by a command timeout in mptsas_access_config_page(), as indicated by the first warning. The remaining warnings are direct consequences of this failure. The timeout for config header requests (and for config page requests later in the function) is hardcoded to 60s.

There is an oversight in mptsas_access_config_page(): as in other places in mpt_sas, it already contains code to reset the I/O controller (IOC) after a command timeout. That code is just never executed, which leaves the controller in a hung state.

This is easy to fix, although it is still unclear what causes these timeouts in the first place. With the IOC reset done properly, the timeouts may still happen occasionally at reboot, but they cause no more harm than a single 60s delay at boot, and all disks/SSDs are visible and usable.
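For illustration only, here is a minimal sketch of how reset code can be present yet never run. This ticket does not say why the reset in mptsas_access_config_page() is unreachable, so the mechanism shown (a timeout flag that the error path forgets to set) is an assumption, and every name in it (fake_ioc, SKETCH_CMD_TIMEOUT, fake_restart_ioc, sketch_access_config_page) is hypothetical and not taken from the mpt_sas driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for controller state; not the driver's real type. */
struct fake_ioc {
	bool hung;	/* true while a timed-out command has wedged the IOC */
	int resets;	/* number of resets performed so far */
};

/* Hypothetical flag meaning "a command timed out, reset at cleanup". */
#define	SKETCH_CMD_TIMEOUT	0x1

/* Hypothetical reset routine: un-wedges the simulated controller. */
static void
fake_restart_ioc(struct fake_ioc *ioc)
{
	ioc->hung = false;
	ioc->resets++;
}

/*
 * Returns 0 on success, -1 on failure.
 * timed_out:      simulates the config header request timing out.
 * record_timeout: true models the fixed path, false the assumed bug,
 *                 where the timeout is never recorded in the flags.
 */
static int
sketch_access_config_page(struct fake_ioc *ioc, bool timed_out,
    bool record_timeout)
{
	unsigned int flags = 0;
	int rval = 0;

	assert(ioc != NULL);

	if (timed_out) {
		ioc->hung = true;
		if (record_timeout)
			flags |= SKETCH_CMD_TIMEOUT;
		rval = -1;
		goto done;
	}

	/* Normal config page processing would happen here. */

done:
	/* The reset exists, but only runs if the flag was actually set. */
	if (flags & SKETCH_CMD_TIMEOUT)
		fake_restart_ioc(ioc);

	return (rval);
}
```

In this sketch, calling sketch_access_config_page() with timed_out set and record_timeout clear returns -1 and leaves the simulated controller hung with no reset performed. With record_timeout set, the call still fails, but the controller is reset and usable afterwards, which mirrors the behaviour described above: the request fails once and nothing stays wedged.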

Comments

Comment by Hans Rosenfeld
Created at 2019-09-10T16:13:04.510Z

Testing: I manually rebooted the customer's system a few times until the timeouts happened at boot. Previously, a timeout meant the system would take hours to boot and then see none of its SSDs. With these changes, the system experienced a short delay (60s) due to the timeout, but once it finished booting, all SSDs were visible and usable, and the system behaved as expected.


Comment by Jira Bot
Created at 2019-09-11T14:16:56.286Z

illumos-joyent commit 526bf199ed8900945f9dffc3041ec322a758285c (branch master, by Hans Rosenfeld)

OS-7981 mpt_sas hangs after config header request timeout
Reviewed by: Robert Mustacchi <rm+illumos@fingolfin.org>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>