OS-4260: dlmgmtd three-way deadlock

Details

Issue Type:Bug
Priority:4 - Normal
Status:Resolved
Created at:2015-05-05T19:13:50.000Z
Updated at:2016-12-09T18:15:48.000Z

People

Created by:Former user
Reported by:Former user
Assigned to:Former user

Resolution

Duplicate: The problem is a duplicate of an existing issue.
(Resolution Date: 2016-12-09T18:15:48.000Z)

Related Issues

Description

We've encountered a three-way deadlock here in dlmgmtd.

We have a thread that is trying to fork:

 feefd395 lwp_suspend (1e)
 feef521b suspend_fork (febd5240, 8059f2c, 4, 0, 0, fef6f400) + 94
 feeeb98b forkx    (0, 0, fe261760, 8057965, 5) + 11e
 feeebab3 fork     (0, 4, feecab96, fef6b000, 8059a48, fe262458) + 1e
 08057984 dlmgmt_zfop (fe261be4, 734, 8057c10, fe261784, 81cdb20, 0) + bf
 08057bb1 dlmgmt_zfopen (fe261be4, 8059c13, 734, fe261fe4, fe261be4, 400) + b8
 08058f35 dlmgmt_process_db_onereq (81cdb20, 0, 0, 0, 0, 0) + 54
 08059077 dlmgmt_process_db_req (81cdb20, 8059c8d, 80593eb, 0, 0, 0) + 4a
 0805923a dlmgmt_db_init (734, fe262928, 41, 400) + cc
 08054d51 dlmgmt_zone_init (734, fe262df8, fe262d50, 805543c, 806b5e0, fe262d80) + 12e
 0805544d dlmgmt_zoneboot (fe262df8, fe262d80, fe262dac, 0, 8134a90, fed44200) + 55
 08055220 dlmgmt_handler (0, fe262df8, 8, 0, 0, 805518a) + 96
 feefdc9b __door_return () + 4b

While in this state, the thread is ultimately blocked trying to suspend thread 0x17. Thread 0x17 is itself in the kernel, performing the following VNIC deletion:

[ ffffff01646c89e0 _resume_from_idle+0xf4() ]
  ffffff01646c8a10 swtch+0x141()
  ffffff01646c8ab0 turnstile_block+0x21a(ffffff2bf1d1eec0, 0, ffffffffc011dad0, fffffffffbc08cc0, 0, 0)
  ffffff01646c8b20 rw_enter_sleep+0x19b(ffffffffc011dad0, 0)
  ffffff01646c8b90 vnic_dev_delete+0x43(8c8, 0, ffffff2d7fb5fc98)
  ffffff01646c8bd0 vnic_ioc_delete+0x28(ffffff2a06278818, fd65fd04, 100003, ffffff2d7fb5fc98, ffffff01646c8e58)
  ffffff01646c8c70 drv_ioctl+0x1e4(1200000000, 1710002, fd65fd04, 100003, ffffff2d7fb5fc98, ffffff01646c8e58)
  ffffff01646c8cb0 cdev_ioctl+0x39(1200000000, 1710002, fd65fd04, 100003, ffffff2d7fb5fc98, ffffff01646c8e58)
  ffffff01646c8d00 spec_ioctl+0x60(ffffff2377ba5000, 1710002, fd65fd04, 100003, ffffff2d7fb5fc98, ffffff01646c8e58, 0)
  ffffff01646c8d90 fop_ioctl+0x55(ffffff2377ba5000, 1710002, fd65fd04, 100003, ffffff2d7fb5fc98, ffffff01646c8e58, 0)
  ffffff01646c8eb0 ioctl+0x9b(0, 1710002, fd65fd04)
  ffffff01646c8f10 _sys_sysenter_post_swapgs+0x153()

So who owns the rwlock that it's blocked on?

            ADDR      OWNER/COUNT FLAGS          WAITERS
ffffffffc011dad0 ffffff3c54548880  B111 ffffff241f117b80 (W)
                                    ||| ffffff3409922b80 (W)
                 WRITE_LOCKED ------+|| ffffff2a744c1b80 (W)
                 WRITE_WANTED -------+| ffffff39a478b860 (W)
                  HAS_WAITERS --------+
stack pointer for thread ffffff3c54548880: ffffff016164e3c0
[ ffffff016164e3c0 _resume_from_idle+0xf4() ]
  ffffff016164e400 swtch_to+0xb6(ffffff2dc01e9860)
  ffffff016164e450 shuttle_resume+0x2af(ffffff2dc01e9860, ffffffffc0015fd0)
  ffffff016164e500 door_upcall+0x212(ffffff237834e400, ffffff016164e5e0, ffffff2353e78e18, ffffffffffffffff, 0)
  ffffff016164e580 door_ki_upcall_limited+0x67(ffffff2377ae9f58, ffffff016164e5e0, ffffff2353e78e18, ffffffffffffffff, 0)
  ffffff016164e5c0 stubs_common_code+0x51()
  ffffff016164e660 i_dls_mgmt_upcall+0xbf(ffffff016164e6b0, 8, ffffff016164e680, 30)
  ffffff016164e720 dls_mgmt_get_linkinfo+0x66(1907, ffffff25cb84a1c8, 0, 0, 0)
  ffffff016164e760 stubs_common_code+0x51()
  ffffff016164e7e0 mac_client_open+0x201(ffffff43c921b168, ffffff420abbf470, 0, 10)
  ffffff016164e830 i_dls_link_create+0x83(ffffff43c921b180, ffffff016164e848)
  ffffff016164e890 dls_link_hold_common+0x73(ffffff43c921b180, ffffff016164e8e8, 1)
  ffffff016164e8b0 dls_link_hold_create+0x1a(ffffff43c921b180, ffffff016164e8e8)
  ffffff016164e920 dls_devnet_create+0x67(ffffff43c921b168, 1907, 0)
  ffffff016164eaf0 vnic_dev_create+0x5cb(1907, 0, ffffff016164eb74, ffffff016164eb7c, ffffff016164eb50, ffffff016164eb78, ffffff230000000c, ffffff2300000000, ffffff0100000000, ffffffff00000000, ffffff37cdeb2044, ffffff2300000002, ffffff016164eb70, ffffff23d2dc1640)
  ffffff016164ebd0 vnic_ioc_create+0xfd(ffffff37cdeb2000, 8041590, 100003, ffffff23d2dc1640, ffffff016164ee58)
  ffffff016164ec70 drv_ioctl+0x1e4(1200000000, 1710001, 8041590, 100003, ffffff23d2dc1640, ffffff016164ee58)
  ffffff016164ecb0 cdev_ioctl+0x39(1200000000, 1710001, 8041590, 100003, ffffff23d2dc1640, ffffff016164ee58)
  ffffff016164ed00 spec_ioctl+0x60(ffffff2377ba5000, 1710001, 8041590, 100003, ffffff23d2dc1640, ffffff016164ee58, 0)
  ffffff016164ed90 fop_ioctl+0x55(ffffff2377ba5000, 1710001, 8041590, 100003, ffffff23d2dc1640, ffffff016164ee58, 0)
  ffffff016164eeb0 ioctl+0x9b(3, 1710001, 8041590)
  ffffff016164ef10 _sys_sysenter_post_swapgs+0x153()

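The owner is in the middle of a door upcall into dlmgmtd, which of course is blocked trying to grab the table lock in userland. That lock is presumably held by the forking thread, which closes the cycle. So we need to figure out the right way forward here with the table lock. We need a better way to serialize our door upcalls and, basically, quiesce them while we fork.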
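
As a rough illustration of what "quiesce the upcalls while we fork" could look like, here is a minimal sketch of a gate that door handlers pass through. It is not taken from dlmgmtd: upcall_gate_t, upcall_enter(), upcall_exit(), fork_quiesce() and fork_resume() are all hypothetical names, and the sketch assumes plain POSIX threads rather than anything door-specific.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical gate; none of these names exist in dlmgmtd. */
typedef struct upcall_gate {
	pthread_mutex_t	ug_lock;
	pthread_cond_t	ug_cv;
	unsigned int	ug_active;	/* handlers currently inside the gate */
	bool		ug_forking;	/* a fork is pending or in progress */
} upcall_gate_t;

#define	UPCALL_GATE_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, false }

/* Called at the top of a handler: wait out any fork, then count ourselves. */
void
upcall_enter(upcall_gate_t *g)
{
	pthread_mutex_lock(&g->ug_lock);
	while (g->ug_forking)
		pthread_cond_wait(&g->ug_cv, &g->ug_lock);
	g->ug_active++;
	pthread_mutex_unlock(&g->ug_lock);
}

/* Called when a handler is done: drop our count and wake any waiters. */
void
upcall_exit(upcall_gate_t *g)
{
	pthread_mutex_lock(&g->ug_lock);
	assert(g->ug_active > 0);
	g->ug_active--;
	pthread_cond_broadcast(&g->ug_cv);
	pthread_mutex_unlock(&g->ug_lock);
}

/*
 * Called before fork(). "held" is the number of gate entries the caller
 * itself holds (1 if it is a handler, 0 otherwise); they are given up for
 * the duration so the caller does not wait on itself.
 */
void
fork_quiesce(upcall_gate_t *g, unsigned int held)
{
	pthread_mutex_lock(&g->ug_lock);
	assert(g->ug_active >= held);
	g->ug_active -= held;
	pthread_cond_broadcast(&g->ug_cv);
	while (g->ug_forking)		/* one forker at a time */
		pthread_cond_wait(&g->ug_cv, &g->ug_lock);
	g->ug_forking = true;		/* stop new handlers from entering */
	while (g->ug_active != 0)	/* drain handlers already inside */
		pthread_cond_wait(&g->ug_cv, &g->ug_lock);
	pthread_mutex_unlock(&g->ug_lock);
}

/* Called after fork() returns in the parent: reopen the gate. */
void
fork_resume(upcall_gate_t *g, unsigned int held)
{
	pthread_mutex_lock(&g->ug_lock);
	g->ug_forking = false;
	g->ug_active += held;		/* take back the caller's entries */
	pthread_cond_broadcast(&g->ug_cv);
	pthread_mutex_unlock(&g->ug_lock);
}
```

In a scheme like this, the forking thread would have to call fork_quiesce() before taking the table lock; otherwise handlers blocked on that lock could never drain. It also only covers threads that pass through the gate, so a thread stuck in the kernel for some other reason would still need separate handling.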

Comments

Comment by Former user
Created at 2015-06-23T05:34:08.000Z

@accountid:62431b8f258562006fa2866a discovered that we hit this again on:

us-east-1 CN MS08214
   (https://east1-adminui.joyent.us/servers/00000000-0000-0000-0000-00259094bf40):
   plat=7.0/20141226T032659Z, adminIps=10.0.129.138, traits={"internal": "Manta Node"},
   comments="Manta Node"

He is injecting an NMI at this time (~2015-06-23T05:34:01.090Z).


Comment by Former user
Created at 2016-12-08T21:33:16.000Z

This is the same as OS-5363.