OS-3506: dls and mac lock ordering isn't honored

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2017-01-30T20:38:36.000Z)

Fix Versions

2017-02-02 ICELAND ESCALATION (Release Date: 2017-02-02)

Related Issues

Description

mac.c defines the lock ordering between mac and dls as that one should grab the i_dls_devnet_lock before the mac perimeter. However, this also means that one needs to grab a dls hold before grabbing the perimeter. This is important because we definitely don't do this in a lot of places. Consider things like:

owner of the master mac perimeter:

> ffffff443362b068::print mac_impl_t mi_driver | ::print vnic_t vn_mc_handles[0] | ::print
 mac_client_impl_t mci_mip->mi_perim_owner | ::findstack -v
stack pointer for thread ffffff4360d34520: ffffff01e9fbc810
[ ffffff01e9fbc810 _resume_from_idle+0xf4() ]
  ffffff01e9fbc840 swtch+0x141()
  ffffff01e9fbc8f0 turnstile_block+0x262(ffffff437c9e4688, 0, fffffffffbd096e0,
  fffffffffbc08cc0, 0, 0)
  ffffff01e9fbc960 rw_enter_sleep+0x19b(fffffffffbd096e0, 0)
  ffffff01e9fbc9d0 dls_devnet_hold_common+0x57(24, ffffff01e9fbca18, 1)
  ffffff01e9fbc9f0 dls_devnet_hold_tmp+0x1a(24, ffffff01e9fbca18)
  ffffff01e9fbca50 dls_devnet_hold_link+0x2b(24, ffffff01e9fbcad0, ffffff01e9fbcac8)
  ffffff01e9fbca90 stubs_common_code+0x51()
  ffffff01e9fbcb20 mac_link_flow_walk+0x57(24, fffffffff79b7480, ffffff01e9fbcb40)
  ffffff01e9fbcb90 dld_walk_flow+0xca(ffffff444feaa780, 8066a30, ffffff43e60934c0)
  ffffff01e9fbcbd0 drv_ioc_walkflow+0x24(ffffff444feaa780, 8066a30, 100003,
  ffffff43e60934c0, ffffff01e9fbce58)
  ffffff01e9fbcc70 drv_ioctl+0x1e4(1200000000, d1d0019, 8066a30, 100003, ffffff43e60934c0
  , ffffff01e9fbce58)
  ffffff01e9fbccb0 cdev_ioctl+0x39(1200000000, d1d0019, 8066a30, 100003, ffffff43e60934c0
  , ffffff01e9fbce58)
  ffffff01e9fbcd00 spec_ioctl+0x60(ffffff431de03440, d1d0019, 8066a30, 100003,
  ffffff43e60934c0, ffffff01e9fbce58, 0)
  ffffff01e9fbcd90 fop_ioctl+0x55(ffffff431de03440, d1d0019, 8066a30, 100003,
  ffffff43e60934c0, ffffff01e9fbce58, 0)
  ffffff01e9fbceb0 ioctl+0x9b(4, d1d0019, 8066a30)
  ffffff01e9fbcf10 _sys_sysenter_post_swapgs+0x153()


The holder of the dls devnet, that wants the mac perimeter is:

> ffffff43e72b9b80::findstack -v
stack pointer for thread ffffff43e72b9b80: ffffff01f28f6880
[ ffffff01f28f6880 _resume_from_idle+0xf4() ]
  ffffff01f28f68b0 swtch+0x141()
  ffffff01f28f68f0 cv_wait+0x70(ffffff431f436c04, ffffff431f436bf0)
  ffffff01f28f6930 i_mac_perim_enter+0x63(ffffff443362b068)
  ffffff01f28f6960 mac_perim_enter_by_mh+0x23(ffffff443362b068, ffffff01f28f69e8)
  ffffff01f28f69b0 mac_perim_enter_by_macname+0x33(ffffff443558e91c, ffffff01f28f69e8)
  ffffff01f28f6a50 i_dls_devnet_setzid+0x6b(ffffff443558e8f8, 0, 0, 0)
  ffffff01f28f6ab0 dls_devnet_unset+0x24a(ffffff443362b080, ffffff01f28f6b4c, 1)
  ffffff01f28f6b20 dls_devnet_destroy+0x46(ffffff443362b068, ffffff01f28f6b4c, 1)
  ffffff01f28f6b90 vnic_dev_delete+0x96(68, 0, ffffff443aad0ed0)
  ffffff01f28f6bd0 vnic_ioc_delete+0x28(ffffff435ce8faa0, fe364d14, 100003,
  ffffff443aad0ed0, ffffff01f28f6e58)
  ffffff01f28f6c70 drv_ioctl+0x1e4(1200000000, 1710002, fe364d14, 100003, ffffff443aad0ed0
  , ffffff01f28f6e58)
  ffffff01f28f6cb0 cdev_ioctl+0x39(1200000000, 1710002, fe364d14, 100003, ffffff443aad0ed0
  , ffffff01f28f6e58)
  ffffff01f28f6d00 spec_ioctl+0x60(ffffff431de03440, 1710002, fe364d14, 100003,
  ffffff443aad0ed0, ffffff01f28f6e58, 0)
  ffffff01f28f6d90 fop_ioctl+0x55(ffffff431de03440, 1710002, fe364d14, 100003,
  ffffff443aad0ed0, ffffff01f28f6e58, 0)
  ffffff01f28f6eb0 ioctl+0x9b(0, 1710002, fe364d14)
  ffffff01f28f6f10 sys_syscall32+0x109()

Problems like this are littered all over the place. It's really unfortunate, but something we need to go work out.

Comments

Comment by Robert Mustacchi
Created at 2016-03-04T23:48:27.000Z
Updated at 2016-03-04T23:48:55.000Z
We hit another case where everything appears to be stuck making calls to
dlmgmtd. Lots of things are backed up, including flowadm and dladm.
Here's an example thread:

[root@RM08212 (us-east-3) ~]# ptree 94842
44085 zoneadmd -z 73e67324-6c13-4d1a-83a2-85dd3af40c82
  45344 /bin/ksh -p /usr/lib/brand/joyent-minimal/poststate 73e67324-6c13-4d1a-83a2-85d
    45401 /bin/ksh -p /usr/lib/brand/joyent-minimal/statechange post 73e67324-6c13-4d1a-8
      94842 dladm create-vnic -t -l ixgbe1 -p mtu=1500,zone=73e67324-6c13-4d1a-83a2-85dd3af

If we look at that dladm:

[root@RM08212 (us-east-3) ~]# pstack 94842
94842:  dladm create-vnic -t -l ixgbe1 -p mtu=1500,zone=73e67324-6c13-4d1a-83a
 feec02d8 door     (7, 80475b8, 0, 0, 0, 3)
 feeace72 door_call (7, 80475b8, 50, fed6d22e) + ed
 fed6d296 dladm_door_call (8105fd8, 8047638, 28, 8047620, 804761c, 8047838) + 76
 fed6dda4 dladm_zname2info (8105fd8, 0, 8047838, 8047858, 0, 0) + 6c
 fed6de45 dladm_name2info (8105fd8, 8047838, 8047858, 0, 0, 0) + 2c
 0805f69f do_create_vnic (b) + 5b5
 08058f2e main     (80478ac, fef38728, 80478e0, 8056ac7, c, 80478ec) + b9
 08056ac7 _start   (c, 8047a38, 8047a3e, 8047a4a, 8047a4d, 8047a50) + 83
[root@RM08212 (us-east-3) ~]# pfiles 94842
94842:  dladm create-vnic -t -l ixgbe1 -p mtu=1500,zone=73e67324-6c13-4d1a-83a
  Current rlimit: 65536 file descriptors
   0: S_IFIFO mode:0000 dev:550,0 ino:205840741 uid:0 gid:0 rdev:0,0
      O_RDWR
   1: S_IFIFO mode:0000 dev:550,0 ino:205990788 uid:0 gid:0 rdev:0,0
      O_RDWR
   2: S_IFIFO mode:0000 dev:550,0 ino:205990788 uid:0 gid:0 rdev:0,0
      O_RDWR
   3: S_IFCHR mode:0666 dev:552,0 ino:9437188 uid:0 gid:3 rdev:18,0
      O_RDWR|O_LARGEFILE
      /devices/pseudo/dld@0:ctl
      offset:0
   4: S_IFCHR mode:0666 dev:552,0 ino:9437188 uid:0 gid:3 rdev:18,0
      O_RDWR
      /devices/pseudo/dld@0:ctl
      offset:0
   5: S_IFDOOR mode:0644 dev:563,0 ino:63 uid:15 gid:65 rdev:562,0
      O_RDONLY|O_LARGEFILE  door to dlmgmtd[21]
   6: S_IFCHR mode:0666 dev:552,0 ino:19922952 uid:0 gid:3 rdev:38,2
      O_RDWR|O_LARGEFILE
      /devices/pseudo/mm@0:null
      offset:0
   7: S_IFDOOR mode:0644 dev:563,0 ino:63 uid:15 gid:65 rdev:562,0
      O_RDONLY  door to dlmgmtd[21]

So, dlmgmtd is our problem. We already know that it's likely not going to be
something we can use pstack on.

THREAD           STATE    SOBJ                COUNT
ffffff431db7d3e0 STOPPED  <NONE>               1360
                 swtch+0x141
                 stop+0x386
                 issig_forreal+0x3e4
                 issig+0x25
                 door_return+0x3ef
                 doorfs32+0x180
                 sys_syscall32+0x14a

ffffff42a9e33840 SLEEP    CV                    219
                 swtch+0x141
                 cv_wait_sig_swap_core+0x1b9
                 cv_wait_sig_swap+0x17
                 cv_waituntil_sig+0xbd
                 lwp_park+0x15e
                 syslwp_park+0x63
                 sys_syscall32+0x14a

ffffff43607d1c40 SLEEP    SHUTTLE               104
                 swtch+0x141
                 shuttle_swtch+0x203
                 door_return+0x214
                 doorfs32+0x180
                 sys_syscall32+0x14a

ffffff4531383400 STOPPED  <NONE>                  2
                 swtch+0x141
                 stop+0x386
                 issig_forreal+0x3e4
                 issig+0x25
                 cv_wait_sig_swap_core+0x303
                 cv_wait_sig_swap+0x17
                 cv_waituntil_sig+0xbd
                 lwp_park+0x15e
                 syslwp_park+0x63
                 sys_syscall32+0x14a

ffffff46904e8ae0 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait+0x70
                 i_mac_perim_enter+0x63
                 mac_perim_enter_by_mh+0x23
                 mac_perim_enter_by_macname+0x33
                 i_dls_devnet_setzid+0x6b
                 dls_devnet_unset+0x24a
                 dls_devnet_destroy+0x46
                 vnic_dev_delete+0x96
                 vnic_ioc_delete+0x28
                 drv_ioctl+0x1e4
                 cdev_ioctl+0x39
                 spec_ioctl+0x60
                 fop_ioctl+0x55
                 ioctl+0x9b
                 sys_syscall32+0x14a

ffffff616a3bf7e0 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait_sig+0x185
                 lwp_suspend+0xa4
                 syslwp_suspend+0x48
                 sys_syscall32+0x14a

ffffff431de30b60 STOPPED  <NONE>                  1
                 swtch+0x141
                 stop+0x386
                 issig_forreal+0x3e4
                 issig+0x25
                 cv_wait_sig_swap_core+0x303
                 cv_wait_sig_swap+0x17
                 pause+0x45
                 sys_syscall32+0x14a

Well, thread ffffff46904e8ae0 is not in a happy place. It's blocked on
the perimeter, trying to see a zone id. So, let's see whose perimeter it
is and what exactly is going on:

> ffffff46904e8ae0::findstack -v
stack pointer for thread ffffff46904e8ae0: ffffff0207706880
[ ffffff0207706880 _resume_from_idle+0x112() ]
  ffffff02077068b0 swtch+0x141()
  ffffff02077068f0 cv_wait+0x70(ffffff431f2b9c04, ffffff431f2b9bf0)
  ffffff0207706930 i_mac_perim_enter+0x63(ffffff4475ea8bb8)
  ffffff0207706960 mac_perim_enter_by_mh+0x23(ffffff4475ea8bb8, ffffff02077069e8)
  ffffff02077069b0 mac_perim_enter_by_macname+0x33(ffffff455198d78c, ffffff02077069e8)
  ffffff0207706a50 i_dls_devnet_setzid+0x6b(ffffff455198d768, 0, 0, 0)
  ffffff0207706ab0 dls_devnet_unset+0x24a(ffffff4475ea8bd0, ffffff0207706b4c, 1)
  ffffff0207706b20 dls_devnet_destroy+0x46(ffffff4475ea8bb8, ffffff0207706b4c, 1)
  ffffff0207706b90 vnic_dev_delete+0x96(19d34, 0, ffffff651be2fe10)
  ffffff0207706bd0 vnic_ioc_delete+0x28(ffffff447cdb25f8, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58)
  ffffff0207706c70 drv_ioctl+0x1e4(1200000000, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58)
  ffffff0207706cb0 cdev_ioctl+0x39(1200000000, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58)
  ffffff0207706d00 spec_ioctl+0x60(ffffff431dcfa680, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58, 0)
  ffffff0207706d90 fop_ioctl+0x55(ffffff431dcfa680, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58, 0)
  ffffff0207706eb0 ioctl+0x9b(0, 1710002, d0921d04)
  ffffff0207706f10 sys_syscall32+0x14a()

Because this is a VNIC the real MAC is the lower MAC:

> ffffff4475ea8bb8::print mac_impl_t mi_driver | ::print vnic_t vn_mc_handles[0] | ::print mac_client_impl_t mci_mip
mci_mip = 0xffffff431f2b9b18
> 0xffffff431f2b9b18::print mac_impl_t mi_perim_owner
mi_perim_owner = 0xffffff4497847820
> 0xffffff431f2b9b18::print mac_impl_t mi_perim_owner | ::findstack -v
stack pointer for thread ffffff4497847820: ffffff020457e950
[ ffffff020457e950 _resume_from_idle+0x112() ]
  ffffff020457e980 swtch+0x141()
  ffffff020457ea20 turnstile_block+0x21a(ffffff46902ee1f0, 0, fffffffffbd09740, fffffffffbc08ce0, 0, 0)
  ffffff020457ea90 rw_enter_sleep+0x19b(fffffffffbd09740, 0)
  ffffff020457eb60 dls_devnet_rename+0xc0(19daf, 0, ffffff4dfb659158, 1)
  ffffff020457ebd0 drv_ioc_rename+0xbb(ffffff4dfb659150, 80477d4, 100003, ffffff811a484648, ffffff020457ee58)
  ffffff020457ec70 drv_ioctl+0x1e4(1200000000, d1d0011, 80477d4, 100003, ffffff811a484648, ffffff020457ee58)
  ffffff020457ecb0 cdev_ioctl+0x39(1200000000, d1d0011, 80477d4, 100003, ffffff811a484648, ffffff020457ee58)
  ffffff020457ed00 spec_ioctl+0x60(ffffff431dcfa680, d1d0011, 80477d4, 100003, ffffff811a484648, ffffff020457ee58, 0)
  ffffff020457ed90 fop_ioctl+0x55(ffffff431dcfa680, d1d0011, 80477d4, 100003, ffffff811a484648, ffffff020457ee58, 0)
  ffffff020457eeb0 ioctl+0x9b(4, d1d0011, 80477d4)
  ffffff020457ef10 _sys_sysenter_post_swapgs+0x153()

So we're trying to rename a device. It's a sleep. Why is it asleep on the rw
lock, who owns it?

> fffffffffbd09740::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
fffffffffbd09740 ffffff46904e8ae0  B111 ffffff4561bb8040 (W)
                                    ||| ffffff479544e7a0 (W)
                 WRITE_LOCKED ------+|| ffffff44862964c0 (W)
                 WRITE_WANTED -------+| ffffff4497847820 (W)
                  HAS_WAITERS --------+ ffffff447c087740 (W)
                                        ffffff4484d2f0a0 (W)
> ffffff46904e8ae0::findstack -v
stack pointer for thread ffffff46904e8ae0: ffffff0207706880
[ ffffff0207706880 _resume_from_idle+0x112() ]
  ffffff02077068b0 swtch+0x141()
  ffffff02077068f0 cv_wait+0x70(ffffff431f2b9c04, ffffff431f2b9bf0)
  ffffff0207706930 i_mac_perim_enter+0x63(ffffff4475ea8bb8)
  ffffff0207706960 mac_perim_enter_by_mh+0x23(ffffff4475ea8bb8, ffffff02077069e8)
  ffffff02077069b0 mac_perim_enter_by_macname+0x33(ffffff455198d78c, ffffff02077069e8)
  ffffff0207706a50 i_dls_devnet_setzid+0x6b(ffffff455198d768, 0, 0, 0)
  ffffff0207706ab0 dls_devnet_unset+0x24a(ffffff4475ea8bd0, ffffff0207706b4c, 1)
  ffffff0207706b20 dls_devnet_destroy+0x46(ffffff4475ea8bb8, ffffff0207706b4c, 1)
  ffffff0207706b90 vnic_dev_delete+0x96(19d34, 0, ffffff651be2fe10)
  ffffff0207706bd0 vnic_ioc_delete+0x28(ffffff447cdb25f8, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58)
  ffffff0207706c70 drv_ioctl+0x1e4(1200000000, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58)
  ffffff0207706cb0 cdev_ioctl+0x39(1200000000, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58)
  ffffff0207706d00 spec_ioctl+0x60(ffffff431dcfa680, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58, 0)
  ffffff0207706d90 fop_ioctl+0x55(ffffff431dcfa680, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58, 0)
  ffffff0207706eb0 ioctl+0x9b(0, 1710002, d0921d04)
  ffffff0207706f10 sys_syscall32+0x14a()

So what we hae here is a problem. We have one thread that owns the
perimeter and is trying to go through and grab the rw lock as a writer.
On the other hand we have a thread that has the rw lock as a writer and
is trying to grab the mac perimeter. That can never end well.


Comment by Robert Mustacchi
Created at 2016-04-25T21:58:49.000Z
We found another machine that appears to be in this state. So here's
it's dlmgmtd:

> ::pgrep dlmgmtd | ::walk thread | ::stacks
THREAD           STATE    SOBJ                COUNT
ffffff431dc72000 SLEEP    SHUTTLE              2412
                 swtch+0x141
                 shuttle_swtch+0x203
                 door_return+0x214
                 doorfs32+0x180
                 sys_syscall32+0x14a

ffffff43576e94e0 SLEEP    SHUTTLE                35
                 swtch_to+0xb6
                 shuttle_resume+0x2af
                 door_return+0x3e0
                 doorfs32+0x180
                 sys_syscall32+0x14a

ffffff44a67c9800 SLEEP    CV                      9
                 swtch+0x141
                 cv_timedwait_sig_hires+0x39d
                 cv_waituntil_sig+0xfa
                 nanosleep+0x19f
                 sys_syscall32+0x14a

ffffff4597fea0c0 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait+0x70
                 i_mac_perim_enter+0x63
                 mac_perim_enter_by_mh+0x23
                 mac_perim_enter_by_macname+0x33
                 i_dls_devnet_setzid+0x6b
                 dls_devnet_unset+0x24a
                 dls_devnet_destroy+0x46
                 vnic_dev_delete+0x96
                 vnic_ioc_delete+0x28
                 drv_ioctl+0x1e4
                 cdev_ioctl+0x39
                 spec_ioctl+0x60
                 fop_ioctl+0x55
                 ioctl+0x9b
                 sys_syscall32+0x14a

ffffff431de58b60 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait_sig_swap_core+0x1b9
                 cv_wait_sig_swap+0x17
                 pause+0x45
                 sys_syscall32+0x14a

> ::walk thread | ::stacks
THREAD   STATE    SOBJ        COUNT
2        UNPARKED <NONE>       2440
         dlmgmt_handler+0xf4
         libc.so.1`__door_return+0x3d

cd       UNPARKED <NONE>          9
         libc.so.1`sleep+0x2d
         dlmgmt_zonehalt+0x6f
         dlmgmt_handler+0x96
         libc.so.1`__door_return+0x3d

98d      UNPARKED <NONE>          5
         libc.so.1`door_create_func+0x29
         libc.so.1`_thrp_setup+0x8c
         libc.so.1`_lwp_start

5b1      UNPARKED <NONE>          2
         dlmgmt_handler+0x10b
         libc.so.1`__door_return+0x3d

525      UNPARKED <NONE>          1

1        UNPARKED <NONE>          1
         main+0xe2
         _start+0x83

Obviously seeing ourselves blocked on the mac perimeter is not a good
sign. So let's verify what's going on there.

> ffffff4597fea0c0::findstack -v
stack pointer for thread ffffff4597fea0c0: ffffff0205cf9880
[ ffffff0205cf9880 _resume_from_idle+0x112() ]
  ffffff0205cf98b0 swtch+0x141()
  ffffff0205cf98f0 cv_wait+0x70(ffffff431f28cc04, ffffff431f28cbf0)
  ffffff0205cf9930 i_mac_perim_enter+0x63(ffffff89b8df26d0)
  ffffff0205cf9960 mac_perim_enter_by_mh+0x23(ffffff89b8df26d0, ffffff0205cf99e8)
  ffffff0205cf99b0 mac_perim_enter_by_macname+0x33(ffffff449a8aa034, ffffff0205cf99e8)
  ffffff0205cf9a50 i_dls_devnet_setzid+0x6b(ffffff449a8aa010, 0, 0, 0)
  ffffff0205cf9ab0 dls_devnet_unset+0x24a(ffffff89b8df26e8, ffffff0205cf9b4c, 1)
  ffffff0205cf9b20 dls_devnet_destroy+0x46(ffffff89b8df26d0, ffffff0205cf9b4c, 1)
  ffffff0205cf9b90 vnic_dev_delete+0x96(29257, 0, ffffff72e441caf8)
  ffffff0205cf9bd0 vnic_ioc_delete+0x28(ffffff465732d968, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58)
  ffffff0205cf9c70 drv_ioctl+0x1e4(1200000000, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58)
  ffffff0205cf9cb0 cdev_ioctl+0x39(1200000000, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58)
  ffffff0205cf9d00 spec_ioctl+0x60(ffffff431ddedd80, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58, 0)
  ffffff0205cf9d90 fop_ioctl+0x55(ffffff431ddedd80, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58, 0)
  ffffff0205cf9eb0 ioctl+0x9b(0, 1710002, ac841d04)
  ffffff0205cf9f10 sys_syscall32+0x14a()
> ffffff89b8df26d0::print mac_impl_t mi_driver | ::print vnic_t vn_mc_handles[0] | ::print mac_client_impl_t mci_mip->mi_perim_owner | ::findstack 
-v
stack pointer for thread ffffff457b4810e0: ffffff01ea06e870
[ ffffff01ea06e870 _resume_from_idle+0x112() ]
  ffffff01ea06e8a0 swtch+0x141()
  ffffff01ea06e940 turnstile_block+0x21a(ffffffb47aa4c1d8, 0, fffffffffbd09740, fffffffffbc08ce0, 0, 0)
  ffffff01ea06e9b0 rw_enter_sleep+0x19b(fffffffffbd09740, 0)
  ffffff01ea06ea20 dls_devnet_hold_common+0x57(29274, ffffff01ea06ea68, 0)
  ffffff01ea06ea40 dls_devnet_hold+0x17(29274, ffffff01ea06ea68)
  ffffff01ea06eac0 dls_devnet_setzid+0x8e(ffffff44ee427d10, 22ab, 1)
  ffffff01ea06eb90 drv_ioc_prop_common+0x3f2(ffffff44c8e85bc0, 8113e08, 1, ffffffdb9c922190, 100003)
  ffffff01ea06ebd0 drv_ioc_setprop+0x29(ffffff44c8e85bc0, 8113e08, 100003, ffffffdb9c922190, ffffff01ea06ee58)
  ffffff01ea06ec70 drv_ioctl+0x1e4(1200000000, d1d001b, 8113e08, 100003, ffffffdb9c922190, ffffff01ea06ee58)
  ffffff01ea06ecb0 cdev_ioctl+0x39(1200000000, d1d001b, 8113e08, 100003, ffffffdb9c922190, ffffff01ea06ee58)
  ffffff01ea06ed00 spec_ioctl+0x60(ffffff431ddedd80, d1d001b, 8113e08, 100003, ffffffdb9c922190, ffffff01ea06ee58, 0)
  ffffff01ea06ed90 fop_ioctl+0x55(ffffff431ddedd80, d1d001b, 8113e08, 100003, ffffffdb9c922190, ffffff01ea06ee58, 0)
  ffffff01ea06eeb0 ioctl+0x9b(4, d1d001b, 8113e08)
  ffffff01ea06ef10 _sys_sysenter_post_swapgs+0x153()

Okay, so who holds that lock?

> fffffffffbd09740::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
fffffffffbd09740 ffffff4597fea0c0  B111 ffffff44a6915c20 (W)
                                    ||| ffffff81eabd4c00 (W)
                 WRITE_LOCKED ------+|| ffffff47a1dffc00 (W)
                 WRITE_WANTED -------+| ffffff45498e4c60 (W)
                  HAS_WAITERS --------+ ffffff474d2e3860 (W)
                                        ffffff45ed5b18a0 (W)
                                        ffffff45146e70c0 (W)
                                        ffffff45cb871740 (W)
                                        ffffff457b46e740 (W)
                                        ffffff44e9b12ae0 (W)
                                        ffffff44c4522080 (W)
                                        ffffff44b003a060 (W)
                                        ffffff598e28b0c0 (W)
                                        ffffff4597feeb20 (W)
                                        ffffff467d415b60 (W)
                                        ffffff45db6ebae0 (W)
                                        ffffff4379edf740 (W)
                                        ffffff45ed8a2480 (W)
                                        ffffff44ee7e3c20 (W)
...
> ffffff4597fea0c0::findstack -v      
stack pointer for thread ffffff4597fea0c0: ffffff0205cf9880
[ ffffff0205cf9880 _resume_from_idle+0x112() ]
  ffffff0205cf98b0 swtch+0x141()
  ffffff0205cf98f0 cv_wait+0x70(ffffff431f28cc04, ffffff431f28cbf0)
  ffffff0205cf9930 i_mac_perim_enter+0x63(ffffff89b8df26d0)
  ffffff0205cf9960 mac_perim_enter_by_mh+0x23(ffffff89b8df26d0, ffffff0205cf99e8)
  ffffff0205cf99b0 mac_perim_enter_by_macname+0x33(ffffff449a8aa034, ffffff0205cf99e8)
  ffffff0205cf9a50 i_dls_devnet_setzid+0x6b(ffffff449a8aa010, 0, 0, 0)
  ffffff0205cf9ab0 dls_devnet_unset+0x24a(ffffff89b8df26e8, ffffff0205cf9b4c, 1)
  ffffff0205cf9b20 dls_devnet_destroy+0x46(ffffff89b8df26d0, ffffff0205cf9b4c, 1)
  ffffff0205cf9b90 vnic_dev_delete+0x96(29257, 0, ffffff72e441caf8)
  ffffff0205cf9bd0 vnic_ioc_delete+0x28(ffffff465732d968, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58)
  ffffff0205cf9c70 drv_ioctl+0x1e4(1200000000, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58)
  ffffff0205cf9cb0 cdev_ioctl+0x39(1200000000, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58)
  ffffff0205cf9d00 spec_ioctl+0x60(ffffff431ddedd80, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58, 0)
  ffffff0205cf9d90 fop_ioctl+0x55(ffffff431ddedd80, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58, 0)
  ffffff0205cf9eb0 ioctl+0x9b(0, 1710002, ac841d04)
  ffffff0205cf9f10 sys_syscall32+0x14a()

Wow, there are a lot of waiters on that rwlock. So there we have it. One
thread has the mac perimeter and wants a dls lock. The other has a dls
lock and wants the mac perimeter. Well, that doesn't end well.


Comment by Ryan Zezeski
Created at 2016-11-08T19:52:44.000Z
Here's a dtrace script to monitor the relative acquisition order of the MAC perim and DLS devnet lock. For the vm start/stop (devnet_rename/devnet_destroy) deadlock scenario the window is super tight because the dls_devnet_rename() function immediately grabs the DLS devnet lock after entering the MAC perim -- things have to be scheduled just perfectly. Even using dtrace chill() I wasn't able to hit it. However, using this script you can clearly see the order of events and see when one operation is entering the MAC perim and then grabbing the DLS devnet lock and when another operation is performing these steps in the opposite order: in which case a deadlock could happen.

#pragma D option quiet

#define MIS_IS_VNIC 0x0002
#define IS_VNIC(mip) (mip->mi_state_flags & MIS_IS_VNIC)
#define mac_lower(upper) ((mac_client_impl_t *)((vnic_t *)upper->mi_driver)->vn_mc_handles[0])->mci_mip
#define MAC_PERIM_CNT(mip) (IS_VNIC(mip) ? mac_lower(mip)->mi_perim_ocnt : mip->mi_perim_ocnt)
#define IS_MAC_PERIM_CNT(mip, cnt) (MAC_PERIM_CNT(mip) == cnt)

#define IS_DEVNET_LOCK(l) ((uintptr_t)l == (uintptr_t)&`i_dls_devnet_lock)

BEGIN {
	self->dls = 0;
	self->mac = 0;
}

lockstat:::rw-acquire /IS_DEVNET_LOCK(arg0)/ {
	self->dls = 1;
}

lockstat:::rw-acquire /IS_DEVNET_LOCK(arg0) && self->mac/ {
	printf("MAC -> DLS %s (%d)\n", execname, pid);
	stack();
}

lockstat:::rw-release /IS_DEVNET_LOCK(arg0) && self->dls/ {
	self->dls = 0;
}

fbt::i_mac_perim_enter:entry {
	self->mac = 1;
}

/*
 * If MAC perim count is 0 then this is the first entry into the
 * perim.
 */
fbt::i_mac_perim_enter:entry /self->dls && IS_MAC_PERIM_CNT(args[0], 0)/ {
	printf("DLS -> MAC %s (%d)\n", execname, pid);
	stack();
}

/*
 * Last ref to the perim is about to be let go.
 */
fbt::i_mac_perim_exit:entry /IS_MAC_PERIM_CNT(args[0], 1)/ {
	self->mac = 0;
}

Comment by Ryan Zezeski
Created at 2016-11-08T19:58:54.000Z
Using the lock ordering script you can see that vmadm start uses dls_devnet_rename() as part of boot, and this function enters the MAC and then grabs the DLS devnet lock (MAC -> DLS):

MAC -> DLS dladm (125046)

              dls`dls_devnet_set+0x6b
              dls`dls_devnet_create+0x4c
              0xfffffffff81de5ca
              0xfffffffff81dd8cd
              dld`drv_ioctl+0x1e4
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`_sys_sysenter_post_swapgs+0x241
MAC -> DLS dladm (125046)

              dls`dls_devnet_hold_common+0x53
              dls`dls_devnet_hold+0x17
              dls`dls_devnet_setzid+0x8e
              dld`drv_ioc_prop_common+0x44c
              dld`drv_ioc_setprop+0x29
              dld`drv_ioctl+0x1e4
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`_sys_sysenter_post_swapgs+0x241
MAC -> DLS dladm (125047)

              dls`dls_devnet_rename+0xc0
              dld`drv_ioc_rename+0xbb
              dld`drv_ioctl+0x1e4
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`_sys_sysenter_post_swapgs+0x241

And in the case of the vmadm stop we see dls_devnet_destroy() does DLS -> MAC:

MAC -> DLS flowadm (124904)

              dls`dls_devnet_hold_common+0x53
              dls`dls_devnet_hold_tmp+0x1a
              dls`dls_devnet_hold_link+0x2f
              unix`stubs_common_code+0x59
              mac`mac_link_flow_walk+0x57
              dld`dld_walk_flow+0xca
              dld`drv_ioc_walkflow+0x24
              dld`drv_ioctl+0x1e4
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`_sys_sysenter_post_swapgs+0x241
DLS -> MAC dlmgmtd (100016)

              mac`mac_perim_enter_by_mh+0x23
              mac`mac_perim_enter_by_macname+0x33
              dls`i_dls_devnet_setzid+0x6b
              dls`dls_devnet_unset+0x2b2
              dls`dls_devnet_destroy+0x46
              vnic`vnic_dev_delete+0x96
              vnic`vnic_ioc_delete+0x28
              dld`drv_ioctl+0x1e4
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`sys_syscall32+0x227
DLS -> MAC dlmgmtd (100016)

              mac`mac_close+0x18
              mac`mac_perim_exit+0x33
              dls`i_dls_devnet_setzid+0xfc
              dls`dls_devnet_unset+0x2b2
              dls`dls_devnet_destroy+0x46
              vnic`vnic_dev_delete+0x96
              vnic`vnic_ioc_delete+0x28
              dld`drv_ioctl+0x1e4
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`sys_syscall32+0x227

Thus, if a stop and start operation run concurrently, and are scheduled just right, we will deadlock.

Comment by Ryan Zezeski
Created at 2016-11-09T02:00:00.000Z
Updated at 2016-12-19T23:48:30.000Z
Robert's initial hunch was to make sure all consumers grab the locks in the order DLS -> MAC. After several days of review I have come to the opposite conclusion. It is my finding that all operations perform MAC -> DLS, except for dls_devnet_destroy() which is our problem child and the reason for this ticket. If you try to change the order to DLS -> MAC you will receive a recursive RW lock panic (when grabbing it for write, RW_READER_STARVEWRITER will not save you here). Furthermore, there is a comment in DLS that indicates the lock order should always be MAC -> DLS, contrary to some comments in mac.c. Following are various reasons dls_devnet_destroy() should use MAC -> DLS ordering like all the other functions.

First, if you look at the dtrace output in the previous comment it's clear to see all operations besides dls_devnet_destroy are using MAC -> DLS order.

Second, it's not easy to change the order to DLS -> MAC everywhere. E.g., the dls_devnet_rename function must do MAC -> DLS or else it will cause a recursive rw panic during its call to mac_perim_enter_by_linkid() because this leads to dls_devnet_hold_common() which grabs the i_dls_devnet_lock. It's obvious, from the code, that this will happen but I also built an image with the order rearranged in dls_devnet_rename and sure enough it induced a rw recursive enter panic.

Third, and this brings me to the more general reason why I think it should always be MAC -> DLS: the MAC perim is built in such a way that it can be entered recursively: it's just a mutex guarding an owner + counter field. Whereas the i_dls_devnet_lock is a vanilla lock that can only be entered oncce. Because MAC and DLS are so intertwined there are places where MAC is a client to DLS and grabs the i_dls_devnet_lock: if you grabbed this lock before entering the MAC perim then you better make damn sure you never call a MAC function that needs to use DLS -- which is hard to do given how intertwined this code is. So it makes more sense to just grab the MAC perim as early as possible (e.g. where the ioctl first hits DLS/MAC) and hold the DLS lock for specific calls inside of that.

This brings me to my fourth point that helps shed some light on why DLS/MAC seem to blend into each other so much, this comment:

mac_flow.c:1258

	/*
	 * dls will eventually be merged with mac so it's ok
	 * to call dls' internal functions.
	 */
	err = dls_devnet_hold_link(linkid, &dlh, &dlp);

My fifth point is another code comment, from dls_mgmt.c, that supports the claim that the order should always be MAC -> DLS. This comment is saying DLS is part of the "GLD framework" and that framework locks (e.g., i_dls_devnet_lock) are never held across calls to a MAC layer function that needs to aquire the MAC perim (e.g., dls_devnet_rename()).

dls_mgmt.c:47

/*
 * This vanity name management module is treated as part of the GLD framework
 * and we don't hold any GLD framework lock across a call to any mac
 * function that needs to acquire the mac perimeter. The hierarchy is
 * mac perimeter -> framework locks
 */

It's my finding that the fix for this particular deadlock is to enter the perimeter earlier in the dls_devnet_destroy() function, before it makes any calls into DLS. I built an image doing just that and my dtrace script above confirms the proper order:

MAC -> DLS dlmgmtd (100015)

              dls`dls_devnet_unset+0x36
              dls`dls_devnet_destroy+0x52
              vnic`vnic_dev_delete+0x96
              vnic`vnic_ioc_delete+0x28
              dld`drv_ioctl+0x1e4
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`sys_syscall32+0x227


Comment by Bot Bot [X]
Created at 2017-01-30T17:20:52.000Z

illumos-joyent commit aefd5a8 (branch master, by Ryan Zezeski)

OS-3506 dls and mac lock ordering isn't honored
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>


Comment by Bot Bot [X]
Created at 2017-05-19T19:53:27.000Z

illumos-joyent commit 87b3eba (branch master, by Ryan Zezeski)

OS-6056 dls_devnet_unset() and DLS temp holds can deadlock
    OS-5989 OS-3506 introduced deadlock in dls_devnet_prop_task
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>