Issue Type: | Bug |
---|---|
Priority: | 2 - Critical |
Status: | Resolved |
Created at: | 2014-10-31T01:08:11.000Z |
Updated at: | 2017-05-19T19:53:27.000Z |
Created by: | Former user |
---|---|
Reported by: | Former user |
Assigned to: | Former user |
Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2017-01-30T20:38:36.000Z)
2017-02-02 ICELAND ESCALATION (Release Date: 2017-02-02)
mac.c defines the lock ordering between mac and dls as that one should grab the i_dls_devnet_lock before the mac perimeter. However, this also means that one needs to grab a dls hold before grabbing the perimeter. This is important because we definitely don't do this in a lot of places. Consider things like:
owner of the master mac perimeter: > ffffff443362b068::print mac_impl_t mi_driver | ::print vnic_t vn_mc_handles[0] | ::print mac_client_impl_t mci_mip->mi_perim_owner | ::findstack -v stack pointer for thread ffffff4360d34520: ffffff01e9fbc810 [ ffffff01e9fbc810 _resume_from_idle+0xf4() ] ffffff01e9fbc840 swtch+0x141() ffffff01e9fbc8f0 turnstile_block+0x262(ffffff437c9e4688, 0, fffffffffbd096e0, fffffffffbc08cc0, 0, 0) ffffff01e9fbc960 rw_enter_sleep+0x19b(fffffffffbd096e0, 0) ffffff01e9fbc9d0 dls_devnet_hold_common+0x57(24, ffffff01e9fbca18, 1) ffffff01e9fbc9f0 dls_devnet_hold_tmp+0x1a(24, ffffff01e9fbca18) ffffff01e9fbca50 dls_devnet_hold_link+0x2b(24, ffffff01e9fbcad0, ffffff01e9fbcac8) ffffff01e9fbca90 stubs_common_code+0x51() ffffff01e9fbcb20 mac_link_flow_walk+0x57(24, fffffffff79b7480, ffffff01e9fbcb40) ffffff01e9fbcb90 dld_walk_flow+0xca(ffffff444feaa780, 8066a30, ffffff43e60934c0) ffffff01e9fbcbd0 drv_ioc_walkflow+0x24(ffffff444feaa780, 8066a30, 100003, ffffff43e60934c0, ffffff01e9fbce58) ffffff01e9fbcc70 drv_ioctl+0x1e4(1200000000, d1d0019, 8066a30, 100003, ffffff43e60934c0 , ffffff01e9fbce58) ffffff01e9fbccb0 cdev_ioctl+0x39(1200000000, d1d0019, 8066a30, 100003, ffffff43e60934c0 , ffffff01e9fbce58) ffffff01e9fbcd00 spec_ioctl+0x60(ffffff431de03440, d1d0019, 8066a30, 100003, ffffff43e60934c0, ffffff01e9fbce58, 0) ffffff01e9fbcd90 fop_ioctl+0x55(ffffff431de03440, d1d0019, 8066a30, 100003, ffffff43e60934c0, ffffff01e9fbce58, 0) ffffff01e9fbceb0 ioctl+0x9b(4, d1d0019, 8066a30) ffffff01e9fbcf10 _sys_sysenter_post_swapgs+0x153() The holder of the dls devnet, that wants the mac perimeter is: > ffffff43e72b9b80::findstack -v stack pointer for thread ffffff43e72b9b80: ffffff01f28f6880 [ ffffff01f28f6880 _resume_from_idle+0xf4() ] ffffff01f28f68b0 swtch+0x141() ffffff01f28f68f0 cv_wait+0x70(ffffff431f436c04, ffffff431f436bf0) ffffff01f28f6930 i_mac_perim_enter+0x63(ffffff443362b068) ffffff01f28f6960 mac_perim_enter_by_mh+0x23(ffffff443362b068, ffffff01f28f69e8) ffffff01f28f69b0 mac_perim_enter_by_macname+0x33(ffffff443558e91c, ffffff01f28f69e8) ffffff01f28f6a50 i_dls_devnet_setzid+0x6b(ffffff443558e8f8, 0, 0, 0) ffffff01f28f6ab0 dls_devnet_unset+0x24a(ffffff443362b080, ffffff01f28f6b4c, 1) ffffff01f28f6b20 dls_devnet_destroy+0x46(ffffff443362b068, ffffff01f28f6b4c, 1) ffffff01f28f6b90 vnic_dev_delete+0x96(68, 0, ffffff443aad0ed0) ffffff01f28f6bd0 vnic_ioc_delete+0x28(ffffff435ce8faa0, fe364d14, 100003, ffffff443aad0ed0, ffffff01f28f6e58) ffffff01f28f6c70 drv_ioctl+0x1e4(1200000000, 1710002, fe364d14, 100003, ffffff443aad0ed0 , ffffff01f28f6e58) ffffff01f28f6cb0 cdev_ioctl+0x39(1200000000, 1710002, fe364d14, 100003, ffffff443aad0ed0 , ffffff01f28f6e58) ffffff01f28f6d00 spec_ioctl+0x60(ffffff431de03440, 1710002, fe364d14, 100003, ffffff443aad0ed0, ffffff01f28f6e58, 0) ffffff01f28f6d90 fop_ioctl+0x55(ffffff431de03440, 1710002, fe364d14, 100003, ffffff443aad0ed0, ffffff01f28f6e58, 0) ffffff01f28f6eb0 ioctl+0x9b(0, 1710002, fe364d14) ffffff01f28f6f10 sys_syscall32+0x109()
Problems like this are littered all over the place. It's really unfortunate, but something we need to go work out.
We hit another case where everything appears to be stuck making calls to
dlmgmtd. Lots of things are backed up, including flowadm and dladm.
Here's an example thread:
[root@RM08212 (us-east-3) ~]# ptree 94842 44085 zoneadmd -z 73e67324-6c13-4d1a-83a2-85dd3af40c82 45344 /bin/ksh -p /usr/lib/brand/joyent-minimal/poststate 73e67324-6c13-4d1a-83a2-85d 45401 /bin/ksh -p /usr/lib/brand/joyent-minimal/statechange post 73e67324-6c13-4d1a-8 94842 dladm create-vnic -t -l ixgbe1 -p mtu=1500,zone=73e67324-6c13-4d1a-83a2-85dd3af
If we look at that dladm:
[root@RM08212 (us-east-3) ~]# pstack 94842 94842: dladm create-vnic -t -l ixgbe1 -p mtu=1500,zone=73e67324-6c13-4d1a-83a feec02d8 door (7, 80475b8, 0, 0, 0, 3) feeace72 door_call (7, 80475b8, 50, fed6d22e) + ed fed6d296 dladm_door_call (8105fd8, 8047638, 28, 8047620, 804761c, 8047838) + 76 fed6dda4 dladm_zname2info (8105fd8, 0, 8047838, 8047858, 0, 0) + 6c fed6de45 dladm_name2info (8105fd8, 8047838, 8047858, 0, 0, 0) + 2c 0805f69f do_create_vnic (b) + 5b5 08058f2e main (80478ac, fef38728, 80478e0, 8056ac7, c, 80478ec) + b9 08056ac7 _start (c, 8047a38, 8047a3e, 8047a4a, 8047a4d, 8047a50) + 83 [root@RM08212 (us-east-3) ~]# pfiles 94842 94842: dladm create-vnic -t -l ixgbe1 -p mtu=1500,zone=73e67324-6c13-4d1a-83a Current rlimit: 65536 file descriptors 0: S_IFIFO mode:0000 dev:550,0 ino:205840741 uid:0 gid:0 rdev:0,0 O_RDWR 1: S_IFIFO mode:0000 dev:550,0 ino:205990788 uid:0 gid:0 rdev:0,0 O_RDWR 2: S_IFIFO mode:0000 dev:550,0 ino:205990788 uid:0 gid:0 rdev:0,0 O_RDWR 3: S_IFCHR mode:0666 dev:552,0 ino:9437188 uid:0 gid:3 rdev:18,0 O_RDWR|O_LARGEFILE /devices/pseudo/dld@0:ctl offset:0 4: S_IFCHR mode:0666 dev:552,0 ino:9437188 uid:0 gid:3 rdev:18,0 O_RDWR /devices/pseudo/dld@0:ctl offset:0 5: S_IFDOOR mode:0644 dev:563,0 ino:63 uid:15 gid:65 rdev:562,0 O_RDONLY|O_LARGEFILE door to dlmgmtd[21] 6: S_IFCHR mode:0666 dev:552,0 ino:19922952 uid:0 gid:3 rdev:38,2 O_RDWR|O_LARGEFILE /devices/pseudo/mm@0:null offset:0 7: S_IFDOOR mode:0644 dev:563,0 ino:63 uid:15 gid:65 rdev:562,0 O_RDONLY door to dlmgmtd[21]
So, dlmgmtd is our problem. We already know that it's likely not going to be
something we can use pstack on.
THREAD STATE SOBJ COUNT ffffff431db7d3e0 STOPPED <NONE> 1360 swtch+0x141 stop+0x386 issig_forreal+0x3e4 issig+0x25 door_return+0x3ef doorfs32+0x180 sys_syscall32+0x14a ffffff42a9e33840 SLEEP CV 219 swtch+0x141 cv_wait_sig_swap_core+0x1b9 cv_wait_sig_swap+0x17 cv_waituntil_sig+0xbd lwp_park+0x15e syslwp_park+0x63 sys_syscall32+0x14a ffffff43607d1c40 SLEEP SHUTTLE 104 swtch+0x141 shuttle_swtch+0x203 door_return+0x214 doorfs32+0x180 sys_syscall32+0x14a ffffff4531383400 STOPPED <NONE> 2 swtch+0x141 stop+0x386 issig_forreal+0x3e4 issig+0x25 cv_wait_sig_swap_core+0x303 cv_wait_sig_swap+0x17 cv_waituntil_sig+0xbd lwp_park+0x15e syslwp_park+0x63 sys_syscall32+0x14a ffffff46904e8ae0 SLEEP CV 1 swtch+0x141 cv_wait+0x70 i_mac_perim_enter+0x63 mac_perim_enter_by_mh+0x23 mac_perim_enter_by_macname+0x33 i_dls_devnet_setzid+0x6b dls_devnet_unset+0x24a dls_devnet_destroy+0x46 vnic_dev_delete+0x96 vnic_ioc_delete+0x28 drv_ioctl+0x1e4 cdev_ioctl+0x39 spec_ioctl+0x60 fop_ioctl+0x55 ioctl+0x9b sys_syscall32+0x14a ffffff616a3bf7e0 SLEEP CV 1 swtch+0x141 cv_wait_sig+0x185 lwp_suspend+0xa4 syslwp_suspend+0x48 sys_syscall32+0x14a ffffff431de30b60 STOPPED <NONE> 1 swtch+0x141 stop+0x386 issig_forreal+0x3e4 issig+0x25 cv_wait_sig_swap_core+0x303 cv_wait_sig_swap+0x17 pause+0x45 sys_syscall32+0x14a
Well, thread ffffff46904e8ae0 is not in a happy place. It's blocked on
the perimeter, trying to see a zone id. So, let's see whose perimeter it
is and what exactly is going on:
> ffffff46904e8ae0::findstack -v stack pointer for thread ffffff46904e8ae0: ffffff0207706880 [ ffffff0207706880 _resume_from_idle+0x112() ] ffffff02077068b0 swtch+0x141() ffffff02077068f0 cv_wait+0x70(ffffff431f2b9c04, ffffff431f2b9bf0) ffffff0207706930 i_mac_perim_enter+0x63(ffffff4475ea8bb8) ffffff0207706960 mac_perim_enter_by_mh+0x23(ffffff4475ea8bb8, ffffff02077069e8) ffffff02077069b0 mac_perim_enter_by_macname+0x33(ffffff455198d78c, ffffff02077069e8) ffffff0207706a50 i_dls_devnet_setzid+0x6b(ffffff455198d768, 0, 0, 0) ffffff0207706ab0 dls_devnet_unset+0x24a(ffffff4475ea8bd0, ffffff0207706b4c, 1) ffffff0207706b20 dls_devnet_destroy+0x46(ffffff4475ea8bb8, ffffff0207706b4c, 1) ffffff0207706b90 vnic_dev_delete+0x96(19d34, 0, ffffff651be2fe10) ffffff0207706bd0 vnic_ioc_delete+0x28(ffffff447cdb25f8, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58) ffffff0207706c70 drv_ioctl+0x1e4(1200000000, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58) ffffff0207706cb0 cdev_ioctl+0x39(1200000000, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58) ffffff0207706d00 spec_ioctl+0x60(ffffff431dcfa680, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58, 0) ffffff0207706d90 fop_ioctl+0x55(ffffff431dcfa680, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58, 0) ffffff0207706eb0 ioctl+0x9b(0, 1710002, d0921d04) ffffff0207706f10 sys_syscall32+0x14a()
Because this is a VNIC the real MAC is the lower MAC:
> ffffff4475ea8bb8::print mac_impl_t mi_driver | ::print vnic_t vn_mc_handles[0] | ::print mac_client_impl_t mci_mip mci_mip = 0xffffff431f2b9b18 > 0xffffff431f2b9b18::print mac_impl_t mi_perim_owner mi_perim_owner = 0xffffff4497847820 > 0xffffff431f2b9b18::print mac_impl_t mi_perim_owner | ::findstack -v stack pointer for thread ffffff4497847820: ffffff020457e950 [ ffffff020457e950 _resume_from_idle+0x112() ] ffffff020457e980 swtch+0x141() ffffff020457ea20 turnstile_block+0x21a(ffffff46902ee1f0, 0, fffffffffbd09740, fffffffffbc08ce0, 0, 0) ffffff020457ea90 rw_enter_sleep+0x19b(fffffffffbd09740, 0) ffffff020457eb60 dls_devnet_rename+0xc0(19daf, 0, ffffff4dfb659158, 1) ffffff020457ebd0 drv_ioc_rename+0xbb(ffffff4dfb659150, 80477d4, 100003, ffffff811a484648, ffffff020457ee58) ffffff020457ec70 drv_ioctl+0x1e4(1200000000, d1d0011, 80477d4, 100003, ffffff811a484648, ffffff020457ee58) ffffff020457ecb0 cdev_ioctl+0x39(1200000000, d1d0011, 80477d4, 100003, ffffff811a484648, ffffff020457ee58) ffffff020457ed00 spec_ioctl+0x60(ffffff431dcfa680, d1d0011, 80477d4, 100003, ffffff811a484648, ffffff020457ee58, 0) ffffff020457ed90 fop_ioctl+0x55(ffffff431dcfa680, d1d0011, 80477d4, 100003, ffffff811a484648, ffffff020457ee58, 0) ffffff020457eeb0 ioctl+0x9b(4, d1d0011, 80477d4) ffffff020457ef10 _sys_sysenter_post_swapgs+0x153()
So we're trying to rename a device. It's a sleep. Why is it asleep on the rw
lock, who owns it?
> fffffffffbd09740::rwlock ADDR OWNER/COUNT FLAGS WAITERS fffffffffbd09740 ffffff46904e8ae0 B111 ffffff4561bb8040 (W) ||| ffffff479544e7a0 (W) WRITE_LOCKED ------+|| ffffff44862964c0 (W) WRITE_WANTED -------+| ffffff4497847820 (W) HAS_WAITERS --------+ ffffff447c087740 (W) ffffff4484d2f0a0 (W) > ffffff46904e8ae0::findstack -v stack pointer for thread ffffff46904e8ae0: ffffff0207706880 [ ffffff0207706880 _resume_from_idle+0x112() ] ffffff02077068b0 swtch+0x141() ffffff02077068f0 cv_wait+0x70(ffffff431f2b9c04, ffffff431f2b9bf0) ffffff0207706930 i_mac_perim_enter+0x63(ffffff4475ea8bb8) ffffff0207706960 mac_perim_enter_by_mh+0x23(ffffff4475ea8bb8, ffffff02077069e8) ffffff02077069b0 mac_perim_enter_by_macname+0x33(ffffff455198d78c, ffffff02077069e8) ffffff0207706a50 i_dls_devnet_setzid+0x6b(ffffff455198d768, 0, 0, 0) ffffff0207706ab0 dls_devnet_unset+0x24a(ffffff4475ea8bd0, ffffff0207706b4c, 1) ffffff0207706b20 dls_devnet_destroy+0x46(ffffff4475ea8bb8, ffffff0207706b4c, 1) ffffff0207706b90 vnic_dev_delete+0x96(19d34, 0, ffffff651be2fe10) ffffff0207706bd0 vnic_ioc_delete+0x28(ffffff447cdb25f8, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58) ffffff0207706c70 drv_ioctl+0x1e4(1200000000, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58) ffffff0207706cb0 cdev_ioctl+0x39(1200000000, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58) ffffff0207706d00 spec_ioctl+0x60(ffffff431dcfa680, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58, 0) ffffff0207706d90 fop_ioctl+0x55(ffffff431dcfa680, 1710002, d0921d04, 100003, ffffff651be2fe10, ffffff0207706e58, 0) ffffff0207706eb0 ioctl+0x9b(0, 1710002, d0921d04) ffffff0207706f10 sys_syscall32+0x14a()
So what we hae here is a problem. We have one thread that owns the
perimeter and is trying to go through and grab the rw lock as a writer.
On the other hand we have a thread that has the rw lock as a writer and
is trying to grab the mac perimeter. That can never end well.
We found another machine that appears to be in this state. So here's
it's dlmgmtd:
> ::pgrep dlmgmtd | ::walk thread | ::stacks THREAD STATE SOBJ COUNT ffffff431dc72000 SLEEP SHUTTLE 2412 swtch+0x141 shuttle_swtch+0x203 door_return+0x214 doorfs32+0x180 sys_syscall32+0x14a ffffff43576e94e0 SLEEP SHUTTLE 35 swtch_to+0xb6 shuttle_resume+0x2af door_return+0x3e0 doorfs32+0x180 sys_syscall32+0x14a ffffff44a67c9800 SLEEP CV 9 swtch+0x141 cv_timedwait_sig_hires+0x39d cv_waituntil_sig+0xfa nanosleep+0x19f sys_syscall32+0x14a ffffff4597fea0c0 SLEEP CV 1 swtch+0x141 cv_wait+0x70 i_mac_perim_enter+0x63 mac_perim_enter_by_mh+0x23 mac_perim_enter_by_macname+0x33 i_dls_devnet_setzid+0x6b dls_devnet_unset+0x24a dls_devnet_destroy+0x46 vnic_dev_delete+0x96 vnic_ioc_delete+0x28 drv_ioctl+0x1e4 cdev_ioctl+0x39 spec_ioctl+0x60 fop_ioctl+0x55 ioctl+0x9b sys_syscall32+0x14a ffffff431de58b60 SLEEP CV 1 swtch+0x141 cv_wait_sig_swap_core+0x1b9 cv_wait_sig_swap+0x17 pause+0x45 sys_syscall32+0x14a > ::walk thread | ::stacks THREAD STATE SOBJ COUNT 2 UNPARKED <NONE> 2440 dlmgmt_handler+0xf4 libc.so.1`__door_return+0x3d cd UNPARKED <NONE> 9 libc.so.1`sleep+0x2d dlmgmt_zonehalt+0x6f dlmgmt_handler+0x96 libc.so.1`__door_return+0x3d 98d UNPARKED <NONE> 5 libc.so.1`door_create_func+0x29 libc.so.1`_thrp_setup+0x8c libc.so.1`_lwp_start 5b1 UNPARKED <NONE> 2 dlmgmt_handler+0x10b libc.so.1`__door_return+0x3d 525 UNPARKED <NONE> 1 1 UNPARKED <NONE> 1 main+0xe2 _start+0x83
Obviously seeing ourselves blocked on the mac perimeter is not a good
sign. So let's verify what's going on there.
> ffffff4597fea0c0::findstack -v stack pointer for thread ffffff4597fea0c0: ffffff0205cf9880 [ ffffff0205cf9880 _resume_from_idle+0x112() ] ffffff0205cf98b0 swtch+0x141() ffffff0205cf98f0 cv_wait+0x70(ffffff431f28cc04, ffffff431f28cbf0) ffffff0205cf9930 i_mac_perim_enter+0x63(ffffff89b8df26d0) ffffff0205cf9960 mac_perim_enter_by_mh+0x23(ffffff89b8df26d0, ffffff0205cf99e8) ffffff0205cf99b0 mac_perim_enter_by_macname+0x33(ffffff449a8aa034, ffffff0205cf99e8) ffffff0205cf9a50 i_dls_devnet_setzid+0x6b(ffffff449a8aa010, 0, 0, 0) ffffff0205cf9ab0 dls_devnet_unset+0x24a(ffffff89b8df26e8, ffffff0205cf9b4c, 1) ffffff0205cf9b20 dls_devnet_destroy+0x46(ffffff89b8df26d0, ffffff0205cf9b4c, 1) ffffff0205cf9b90 vnic_dev_delete+0x96(29257, 0, ffffff72e441caf8) ffffff0205cf9bd0 vnic_ioc_delete+0x28(ffffff465732d968, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58) ffffff0205cf9c70 drv_ioctl+0x1e4(1200000000, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58) ffffff0205cf9cb0 cdev_ioctl+0x39(1200000000, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58) ffffff0205cf9d00 spec_ioctl+0x60(ffffff431ddedd80, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58, 0) ffffff0205cf9d90 fop_ioctl+0x55(ffffff431ddedd80, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58, 0) ffffff0205cf9eb0 ioctl+0x9b(0, 1710002, ac841d04) ffffff0205cf9f10 sys_syscall32+0x14a() > ffffff89b8df26d0::print mac_impl_t mi_driver | ::print vnic_t vn_mc_handles[0] | ::print mac_client_impl_t mci_mip->mi_perim_owner | ::findstack -v stack pointer for thread ffffff457b4810e0: ffffff01ea06e870 [ ffffff01ea06e870 _resume_from_idle+0x112() ] ffffff01ea06e8a0 swtch+0x141() ffffff01ea06e940 turnstile_block+0x21a(ffffffb47aa4c1d8, 0, fffffffffbd09740, fffffffffbc08ce0, 0, 0) ffffff01ea06e9b0 rw_enter_sleep+0x19b(fffffffffbd09740, 0) ffffff01ea06ea20 dls_devnet_hold_common+0x57(29274, ffffff01ea06ea68, 0) ffffff01ea06ea40 dls_devnet_hold+0x17(29274, ffffff01ea06ea68) ffffff01ea06eac0 dls_devnet_setzid+0x8e(ffffff44ee427d10, 22ab, 1) ffffff01ea06eb90 drv_ioc_prop_common+0x3f2(ffffff44c8e85bc0, 8113e08, 1, ffffffdb9c922190, 100003) ffffff01ea06ebd0 drv_ioc_setprop+0x29(ffffff44c8e85bc0, 8113e08, 100003, ffffffdb9c922190, ffffff01ea06ee58) ffffff01ea06ec70 drv_ioctl+0x1e4(1200000000, d1d001b, 8113e08, 100003, ffffffdb9c922190, ffffff01ea06ee58) ffffff01ea06ecb0 cdev_ioctl+0x39(1200000000, d1d001b, 8113e08, 100003, ffffffdb9c922190, ffffff01ea06ee58) ffffff01ea06ed00 spec_ioctl+0x60(ffffff431ddedd80, d1d001b, 8113e08, 100003, ffffffdb9c922190, ffffff01ea06ee58, 0) ffffff01ea06ed90 fop_ioctl+0x55(ffffff431ddedd80, d1d001b, 8113e08, 100003, ffffffdb9c922190, ffffff01ea06ee58, 0) ffffff01ea06eeb0 ioctl+0x9b(4, d1d001b, 8113e08) ffffff01ea06ef10 _sys_sysenter_post_swapgs+0x153()
Okay, so who holds that lock?
> fffffffffbd09740::rwlock ADDR OWNER/COUNT FLAGS WAITERS fffffffffbd09740 ffffff4597fea0c0 B111 ffffff44a6915c20 (W) ||| ffffff81eabd4c00 (W) WRITE_LOCKED ------+|| ffffff47a1dffc00 (W) WRITE_WANTED -------+| ffffff45498e4c60 (W) HAS_WAITERS --------+ ffffff474d2e3860 (W) ffffff45ed5b18a0 (W) ffffff45146e70c0 (W) ffffff45cb871740 (W) ffffff457b46e740 (W) ffffff44e9b12ae0 (W) ffffff44c4522080 (W) ffffff44b003a060 (W) ffffff598e28b0c0 (W) ffffff4597feeb20 (W) ffffff467d415b60 (W) ffffff45db6ebae0 (W) ffffff4379edf740 (W) ffffff45ed8a2480 (W) ffffff44ee7e3c20 (W) ... > ffffff4597fea0c0::findstack -v stack pointer for thread ffffff4597fea0c0: ffffff0205cf9880 [ ffffff0205cf9880 _resume_from_idle+0x112() ] ffffff0205cf98b0 swtch+0x141() ffffff0205cf98f0 cv_wait+0x70(ffffff431f28cc04, ffffff431f28cbf0) ffffff0205cf9930 i_mac_perim_enter+0x63(ffffff89b8df26d0) ffffff0205cf9960 mac_perim_enter_by_mh+0x23(ffffff89b8df26d0, ffffff0205cf99e8) ffffff0205cf99b0 mac_perim_enter_by_macname+0x33(ffffff449a8aa034, ffffff0205cf99e8) ffffff0205cf9a50 i_dls_devnet_setzid+0x6b(ffffff449a8aa010, 0, 0, 0) ffffff0205cf9ab0 dls_devnet_unset+0x24a(ffffff89b8df26e8, ffffff0205cf9b4c, 1) ffffff0205cf9b20 dls_devnet_destroy+0x46(ffffff89b8df26d0, ffffff0205cf9b4c, 1) ffffff0205cf9b90 vnic_dev_delete+0x96(29257, 0, ffffff72e441caf8) ffffff0205cf9bd0 vnic_ioc_delete+0x28(ffffff465732d968, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58) ffffff0205cf9c70 drv_ioctl+0x1e4(1200000000, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58) ffffff0205cf9cb0 cdev_ioctl+0x39(1200000000, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58) ffffff0205cf9d00 spec_ioctl+0x60(ffffff431ddedd80, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58, 0) ffffff0205cf9d90 fop_ioctl+0x55(ffffff431ddedd80, 1710002, ac841d04, 100003, ffffff72e441caf8, ffffff0205cf9e58, 0) ffffff0205cf9eb0 ioctl+0x9b(0, 1710002, ac841d04) ffffff0205cf9f10 sys_syscall32+0x14a()
Wow, there are a lot of waiters on that rwlock. So there we have it. One
thread has the mac perimeter and wants a dls lock. The other has a dls
lock and wants the mac perimeter. Well, that doesn't end well.
Here's a dtrace script to monitor the relative acquisition order of the MAC perim and DLS devnet lock. For the vm start/stop (devnet_rename/devnet_destroy) deadlock scenario the window is super tight because the dls_devnet_rename()
function immediately grabs the DLS devnet lock after entering the MAC perim -- things have to be scheduled just perfectly. Even using dtrace chill() I wasn't able to hit it. However, using this script you can clearly see the order of events and see when one operation is entering the MAC perim and then grabbing the DLS devnet lock and when another operation is performing these steps in the opposite order: in which case a deadlock could happen.
#pragma D option quiet #define MIS_IS_VNIC 0x0002 #define IS_VNIC(mip) (mip->mi_state_flags & MIS_IS_VNIC) #define mac_lower(upper) ((mac_client_impl_t *)((vnic_t *)upper->mi_driver)->vn_mc_handles[0])->mci_mip #define MAC_PERIM_CNT(mip) (IS_VNIC(mip) ? mac_lower(mip)->mi_perim_ocnt : mip->mi_perim_ocnt) #define IS_MAC_PERIM_CNT(mip, cnt) (MAC_PERIM_CNT(mip) == cnt) #define IS_DEVNET_LOCK(l) ((uintptr_t)l == (uintptr_t)&`i_dls_devnet_lock) BEGIN { self->dls = 0; self->mac = 0; } lockstat:::rw-acquire /IS_DEVNET_LOCK(arg0)/ { self->dls = 1; } lockstat:::rw-acquire /IS_DEVNET_LOCK(arg0) && self->mac/ { printf("MAC -> DLS %s (%d)\n", execname, pid); stack(); } lockstat:::rw-release /IS_DEVNET_LOCK(arg0) && self->dls/ { self->dls = 0; } fbt::i_mac_perim_enter:entry { self->mac = 1; } /* * If MAC perim count is 0 then this is the first entry into the * perim. */ fbt::i_mac_perim_enter:entry /self->dls && IS_MAC_PERIM_CNT(args[0], 0)/ { printf("DLS -> MAC %s (%d)\n", execname, pid); stack(); } /* * Last ref to the perim is about to be let go. */ fbt::i_mac_perim_exit:entry /IS_MAC_PERIM_CNT(args[0], 1)/ { self->mac = 0; }
Using the lock ordering script you can see that vmadm start uses dls_devnet_rename()
as part of boot, and this function enters the MAC and then grabs the DLS devnet lock (MAC -> DLS):
MAC -> DLS dladm (125046) dls`dls_devnet_set+0x6b dls`dls_devnet_create+0x4c 0xfffffffff81de5ca 0xfffffffff81dd8cd dld`drv_ioctl+0x1e4 genunix`cdev_ioctl+0x39 specfs`spec_ioctl+0x60 genunix`fop_ioctl+0x55 genunix`ioctl+0x9b unix`_sys_sysenter_post_swapgs+0x241 MAC -> DLS dladm (125046) dls`dls_devnet_hold_common+0x53 dls`dls_devnet_hold+0x17 dls`dls_devnet_setzid+0x8e dld`drv_ioc_prop_common+0x44c dld`drv_ioc_setprop+0x29 dld`drv_ioctl+0x1e4 genunix`cdev_ioctl+0x39 specfs`spec_ioctl+0x60 genunix`fop_ioctl+0x55 genunix`ioctl+0x9b unix`_sys_sysenter_post_swapgs+0x241 MAC -> DLS dladm (125047) dls`dls_devnet_rename+0xc0 dld`drv_ioc_rename+0xbb dld`drv_ioctl+0x1e4 genunix`cdev_ioctl+0x39 specfs`spec_ioctl+0x60 genunix`fop_ioctl+0x55 genunix`ioctl+0x9b unix`_sys_sysenter_post_swapgs+0x241
And in the case of the vmadm stop we see dls_devnet_destroy()
does DLS -> MAC:
MAC -> DLS flowadm (124904) dls`dls_devnet_hold_common+0x53 dls`dls_devnet_hold_tmp+0x1a dls`dls_devnet_hold_link+0x2f unix`stubs_common_code+0x59 mac`mac_link_flow_walk+0x57 dld`dld_walk_flow+0xca dld`drv_ioc_walkflow+0x24 dld`drv_ioctl+0x1e4 genunix`cdev_ioctl+0x39 specfs`spec_ioctl+0x60 genunix`fop_ioctl+0x55 genunix`ioctl+0x9b unix`_sys_sysenter_post_swapgs+0x241 DLS -> MAC dlmgmtd (100016) mac`mac_perim_enter_by_mh+0x23 mac`mac_perim_enter_by_macname+0x33 dls`i_dls_devnet_setzid+0x6b dls`dls_devnet_unset+0x2b2 dls`dls_devnet_destroy+0x46 vnic`vnic_dev_delete+0x96 vnic`vnic_ioc_delete+0x28 dld`drv_ioctl+0x1e4 genunix`cdev_ioctl+0x39 specfs`spec_ioctl+0x60 genunix`fop_ioctl+0x55 genunix`ioctl+0x9b unix`sys_syscall32+0x227 DLS -> MAC dlmgmtd (100016) mac`mac_close+0x18 mac`mac_perim_exit+0x33 dls`i_dls_devnet_setzid+0xfc dls`dls_devnet_unset+0x2b2 dls`dls_devnet_destroy+0x46 vnic`vnic_dev_delete+0x96 vnic`vnic_ioc_delete+0x28 dld`drv_ioctl+0x1e4 genunix`cdev_ioctl+0x39 specfs`spec_ioctl+0x60 genunix`fop_ioctl+0x55 genunix`ioctl+0x9b unix`sys_syscall32+0x227
Thus, if a stop and start operation run concurrently, and are scheduled just right, we will deadlock.
Robert's initial hunch was to make sure all consumers grab the locks in the order DLS -> MAC
. After several days of review I have come to the opposite conclusion. It is my finding that all operations perform MAC -> DLS
, except for dls_devnet_destroy()
which is our problem child and the reason for this ticket. If you try to change the order to DLS -> MAC
you will receive a recursive RW lock panic (when grabbing it for write, RW_READER_STARVEWRITER
will not save you here). Furthermore, there is a comment in DLS that indicates the lock order should always be MAC -> DLS
, contrary to some comments in mac.c
. Following are various reasons dls_devnet_destroy()
should use MAC -> DLS
ordering like all the other functions.
First, if you look at the dtrace output in the previous comment it's clear to see all operations besides dls_devnet_destroy
are using MAC -> DLS
order.
Second, it's not easy to change the order to DLS -> MAC
everywhere. E.g., the dls_devnet_rename
function must do MAC -> DLS
or else it will cause a recursive rw panic during its call to mac_perim_enter_by_linkid()
because this leads to dls_devnet_hold_common()
which grabs the i_dls_devnet_lock
. It's obvious, from the code, that this will happen but I also built an image with the order rearranged in dls_devnet_rename
and sure enough it induced a rw recursive enter panic.
Third, and this brings me to the more general reason why I think it should always be {{MAC > DLS}}: the MAC perim is built in such a way that it can be entered recursively: it's just a mutex guarding an owner + counter field. Whereas the - which is hard to do given how intertwined this code is. So it makes more sense to just grab the MAC perim as early as possible (e.g. where the ioctl first hits DLS/MAC) and hold the DLS lock for specific calls inside of that.i_dls_devnet_lock
is a vanilla lock that can only be entered oncce. Because MAC and DLS are so intertwined there are places where MAC is a client to DLS and grabs the i_dls_devnet_lock
: if you grabbed this lock before entering the MAC perim then you better make damn sure you never call a MAC function that needs to use DLS
This brings me to my fourth point that helps shed some light on why DLS/MAC seem to blend into each other so much, this comment:
mac_flow.c:1258
/* * dls will eventually be merged with mac so it's ok * to call dls' internal functions. */ err = dls_devnet_hold_link(linkid, &dlh, &dlp);
My fifth point is another code comment, from dls_mgmt.c
, that supports the claim that the order should always be MAC -> DLS
. This comment is saying DLS is part of the "GLD framework" and that framework locks (e.g., i_dls_devnet_lock
) are never held across calls to a MAC layer function that needs to aquire the MAC perim (e.g., dls_devnet_rename()
).
dls_mgmt.c:47
/* * This vanity name management module is treated as part of the GLD framework * and we don't hold any GLD framework lock across a call to any mac * function that needs to acquire the mac perimeter. The hierarchy is * mac perimeter -> framework locks */
It's my finding that the fix for this particular deadlock is to enter the perimeter earlier in the dls_devnet_destroy()
function, before it makes any calls into DLS. I built an image doing just that and my dtrace script above confirms the proper order:
MAC -> DLS dlmgmtd (100015) dls`dls_devnet_unset+0x36 dls`dls_devnet_destroy+0x52 vnic`vnic_dev_delete+0x96 vnic`vnic_ioc_delete+0x28 dld`drv_ioctl+0x1e4 genunix`cdev_ioctl+0x39 specfs`spec_ioctl+0x60 genunix`fop_ioctl+0x55 genunix`ioctl+0x9b unix`sys_syscall32+0x227
illumos-joyent commit aefd5a8 (branch master, by Ryan Zezeski)
OS-3506 dls and mac lock ordering isn't honored
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>
illumos-joyent commit 87b3eba (branch master, by Ryan Zezeski)
OS-6056 dls_devnet_unset() and DLS temp holds can deadlock
OS-5989 OS-3506 introduced deadlock in dls_devnet_prop_task
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>