OS-7879: zfs recv is broken

Details

Issue Type: Bug
Priority: 3 - Elevated
Status: Resolved
Created at: 2019-07-03T10:35:59.990Z
Updated at: 2019-07-03T20:27:25.526Z

People

Created by: Former user
Reported by: Former user
Assigned to: Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-07-03T20:27:25.512Z)

Fix Versions

2019-07-04 Verdukianism (Release Date: 2019-07-04)

Related Issues

Description

I noticed (as did nightly; see TRITON-1791 and https://jenkins.joyent.us/job/nightly-triton-sdcadm-tests/124/execution/node/49/log/) that the sdcadm ha manatee test was hanging. A quick look indicated that the backupserver zfs send was making no progress.

Independently, I noticed that Manta setup failed in a similar fashion: it was unable to set up the sync peer, as the zfs send and recv were stuck in the kernel in str_waitq().

I tried usb-master-20190624T090054Z-g574c7dc.tgz (reflashed) and we got past initial HA setup:

# touch /lib/sdc/.sdc-test-no-production-data
# /opt/smartdc/sdcadm/test/runtests -f post-setup
...
# /opt/smartdc/sdcadm/test/runtests -f manatee
# Copying test files to /var/tmp/sdcadmtest
# Setup a clean output dir (/var/tmp/sdcadmtest/results).

# Running filtered set of test files: test/manatee.test.js
# manatee.test.js
TAP version 13
# setup
ok 1 cnapi error
ok 2 empty stdout
ok 3 Get instances error
ok 4 Get instances empty stderr
ok 5 Manatee array of instances
ok 6 Manatee instances
ok 7 Execution error
ok 8 Empty stderr
# get initial zk-state
{"name":"manatee.test","hostname":"headnode","pid":79450,"component":"ZKSession","level":30,"msg":"created zookeeper session 016bb75c29330015 with timeout 30000 ms","time":"2019-07-03T10:42:28.524Z","v":0}
ok 9 should be truthy
ok 10 null
{"name":"manatee.test","hostname":"headnode","pid":79450,"component":"ZKConnectionFSM","zkAddress":"172.25.15.11","zkPort":2181,"sessionId":"016bb75c29330015","level":30,"xid":1,"msg":"sent CLOSE_SESSION request","time":"2019-07-03T10:42:28.532Z","v":0}
{"name":"manatee.test","hostname":"headnode","pid":79450,"component":"ZKSession","id":"016bb75c29330015","level":30,"msg":"ZK session closed","time":"2019-07-03T10:42:28.536Z","v":0}
# post-setup help ha-manatee
ok 11 null
ok 12 should not be equal
ok 13 should be equal
# update non-HA
ok 14 Execution error
ok 15 Empty stderr
ok 16 check update string present
ok 17 check update string present
ok 18 check update string present
ok 19 check update string present
ok 20 check update string present
ok 21 check update string present
ok 22 check update string present
# post-setup bogus servers
ok 23 Bogus servers err
ok 24 Invalid servers stderr
# post-setup w/o servers
ok 25 post-setup w/o servers error
ok 26 No servers stderr
# post-setup OK


ok 27 Execution error
ok 28 Empty stderr
ok 29 check update string present
ok 30 check update string present
ok 31 check update string present
ok 32 check update string present
ok 33 check update string present
ok 34 check update string present
ok 35 check update string present
ok 36 check update string present
# update HA
ok 37 Execution error
ok 38 Empty stderr
ok 39 check update HA string present
ok 40 check update HA string present
ok 41 check update HA string present
ok 42 check update HA string present
ok 43 check update HA string present
ok 44 check update HA string present
ok 45 check update HA string present
ok 46 check update HA string present
ok 47 check update HA string present
# teardown
{"name":"manatee.test","hostname":"headnode","pid":79450,"component":"ZKSession","level":30,"msg":"created zookeeper session 016bb75c2933004e with timeout 30000 ms","time":"2019-07-03T10:50:28.774Z","v":0}
ok 48 Get instances error
ok 49 Get instances empty stderr
ok 50 Manatee array of instances
ok 51 Manatee instances
ok 52 should be truthy
ok 53 null
{"name":"manatee.test","hostname":"headnode","pid":79450,"component":"ZKConnectionFSM","zkAddress":"172.25.15.11","zkPort":2181,"sessionId":"016bb75c2933004e","level":30,"xid":3,"msg":"sent CLOSE_SESSION request","time":"2019-07-03T10:51:14.954Z","v":0}
{"name":"manatee.test","hostname":"headnode","pid":79450,"component":"ZKSession","id":"016bb75c2933004e","level":30,"msg":"ZK session closed","time":"2019-07-03T10:51:14.955Z","v":0}

1..53
# tests 53
# pass  53

# ok


# Test output in /var/tmp/sdcadmtest/results:
#    /var/tmp/sdcadmtest/results/manatee.test.js.tap

Next, reset, and try the same after updating the PI to see if it's OS-related.

Comments

Comment by Former user
Created at 2019-07-03T11:40:38.092Z

That USB key has PI:

# mdb -ke ::status
debugging live kernel (64-bit) on headnode
operating system: 5.11 joyent_20190621T175631Z (i86pc)
git branch: master
git rev: c6ce7a25abbb33fa38edd55230b7992c5a36ce91
image uuid: (not set)

Comment by Former user
Created at 2019-07-03T12:13:57.592Z

Booted into:

# mdb -ke ::status
debugging live kernel (64-bit) on headnode
operating system: 5.11 joyent_20190702T115700Z (i86pc)
git branch: master
git rev: fb1c4f28ae25e51923dfe4f1242ff5b87af83bd1
image uuid: (not set)

then:

# /opt/smartdc/sdcadm/test/runtests -f manatee
# Copying test files to /var/tmp/sdcadmtest
# Setup a clean output dir (/var/tmp/sdcadmtest/results).

# Running filtered set of test files: test/manatee.test.js
# manatee.test.js
TAP version 13
# setup
ok 1 cnapi error
ok 2 empty stdout
ok 3 Get instances error
ok 4 Get instances empty stderr
ok 5 Manatee array of instances
ok 6 Manatee instances
ok 7 Execution error
ok 8 Empty stderr
# get initial zk-state
{"name":"manatee.test","hostname":"headnode","pid":34337,"component":"ZKSession","level":30,"msg":"created zookeeper session 016bb7b628b90022 with timeout 30000 ms","time":"2019-07-03T12:06:17.337Z","v":0}
ok 9 should be truthy
ok 10 null
{"name":"manatee.test","hostname":"headnode","pid":34337,"component":"ZKConnectionFSM","zkAddress":"172.25.15.11","zkPort":2181,"sessionId":"016bb7b628b90022","level":30,"xid":1,"msg":"sent CLOSE_SESSION request","time":"2019-07-03T12:06:17.346Z","v":0}
{"name":"manatee.test","hostname":"headnode","pid":34337,"component":"ZKSession","id":"016bb7b628b90022","level":30,"msg":"ZK session closed","time":"2019-07-03T12:06:17.349Z","v":0}
# post-setup help ha-manatee
ok 11 null
ok 12 should not be equal
ok 13 should be equal
# update non-HA


ok 14 Execution error
ok 15 Empty stderr
ok 16 check update string present
ok 17 check update string present
ok 18 check update string present
ok 19 check update string present
ok 20 check update string present
ok 21 check update string present
ok 22 check update string present
# post-setup bogus servers
ok 23 Bogus servers err
ok 24 Invalid servers stderr
# post-setup w/o servers
ok 25 post-setup w/o servers error
ok 26 No servers stderr
# post-setup OK
<hang>
[root@headnode (emy-15) ~]# vmadm list | grep manatee
61f65917-2217-4076-8a0a-f09d69b93c54  OS    2048     running           manatee0
7948c62d-f436-4b69-b29b-75dacc3b2b8e  OS    2048     running           manatee1

# on manatee0 (primary)
...
  36158 sudo -u postgres node /opt/smartdc/manatee/node_modules/manatee/backupserver.js
    36162 node /opt/smartdc/manatee/node_modules/manatee/backupserver.js -v -f etc/backup
      38437 /usr/sbin/zfs send -v -P zones/61f65917-2217-4076-8a0a-f09d69b93c54/data/manate
...

# tail -f manta-application-manatee-backupserver:default.log | bunyan | tee /var/tmp/l.out
...
[2019-07-03T12:12:59.909Z]  INFO: manatee-backupserver/BackupSender/36162 on 61f65917-2217-4076-8a0a-f09d69b93c54 (/opt/smartdc/manatee/node_modules/manatee/lib/backupSender.js:19:
    zfs send stderr:  12:12:59  2315576 zones/61f65917-2217-4076-8a0a-f09d69b93c54/data/manatee@1562155726942
    
[2019-07-03T12:12:59.909Z]  INFO: manatee-backupserver/BackupSender/36162 on 61f65917-2217-4076-8a0a-f09d69b93c54 (/opt/smartdc/manatee/node_modules/manatee/lib/backupSender.js:20)
[2019-07-03T12:12:59.956Z]  INFO: manatee-backupserver/BackupQueue/36162 on 61f65917-2217-4076-8a0a-f09d69b93c54 (/opt/smartdc/manatee/node_modules/manatee/lib/backupQueue.js:99 ia
[2019-07-03T12:12:59.957Z]  INFO: manatee-backupserver/BackupQueue/36162 on 61f65917-2217-4076-8a0a-f09d69b93c54 (/opt/smartdc/manatee/node_modules/manatee/lib/backupQueue.js:104 :
    found backup job { uuid: '31073862-a4ea-4c4b-9f4c-13d0a865ddba',
      host: '172.25.15.41',
      port: 1234,
      dataset: 'zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee',
      done: 0,
      size: '79236376',
      completed: '2315576' }
 ... rinse and repeat ...

Comment by Former user
Created at 2019-07-03T12:19:24.970Z

zfs send stdout:

# pfiles 38437
38437:  /usr/sbin/zfs send -v -P zones/61f65917-2217-4076-8a0a-f09d69b93c54/da
  Current rlimit: 65536 file descriptors
   0: S_IFSOCK mode:0666 dev:579,0 ino:58064 uid:0 gid:0 rdev:0,0
      O_RDWR
        SOCK_STREAM
        SO_SNDBUF(16384),SO_RCVBUF(5120)
        sockname: AF_UNIX 
        peer: node[36162] zone: 61f65917-2217-4076-8a0a-f09d69b93c54[23]
   1: S_IFSOCK mode:0666 dev:579,0 ino:58067 uid:0 gid:0 rdev:0,0
      O_RDWR
        SOCK_STREAM
        SO_SNDBUF(16384),SO_RCVBUF(5120)
        sockname: AF_UNIX 
        peer: node[36162] zone: 61f65917-2217-4076-8a0a-f09d69b93c54[23]
   2: S_IFSOCK mode:0666 dev:579,0 ino:20154 uid:0 gid:0 rdev:0,0
      O_RDWR
        SOCK_STREAM
        SO_SNDBUF(16384),SO_RCVBUF(5120)
        sockname: AF_UNIX 
        peer: node[36162] zone: 61f65917-2217-4076-8a0a-f09d69b93c54[23]
...

# pfiles 36162
36162:  node /opt/smartdc/manatee/node_modules/manatee/backupserver.js -v -f e
  Current rlimit: 65536 file descriptors
   0: S_IFCHR mode:0666 dev:571,23 ino:3078891182 uid:0 gid:3 rdev:38,2
      O_RDONLY|O_LARGEFILE FD_CLOEXEC
      /dev/null
      offset:0
   1: S_IFREG mode:0644 dev:90,65629 ino:66202 uid:0 gid:0 size:984349
      O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE FD_CLOEXEC
      /var/svc/log/manta-application-manatee-backupserver:default.log
      offset:984349
   2: S_IFREG mode:0644 dev:90,65629 ino:66202 uid:0 gid:0 size:984349
      O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE FD_CLOEXEC
      /var/svc/log/manta-application-manatee-backupserver:default.log
      offset:984349
   3: S_IFIFO mode:0000 dev:568,0 ino:26437 uid:907 gid:907 rdev:0,0
      O_RDWR FD_CLOEXEC
   4: S_IFIFO mode:0000 dev:568,0 ino:26437 uid:907 gid:907 rdev:0,0
      O_RDWR FD_CLOEXEC
   5: S_IFPORT mode:0000 dev:582,0 uid:907 gid:907 size:0
   6: S_IFIFO mode:0000 dev:568,0 ino:26440 uid:907 gid:907 rdev:0,0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
   7: S_IFIFO mode:0000 dev:568,0 ino:26440 uid:907 gid:907 rdev:0,0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
   8: S_IFIFO mode:0000 dev:568,0 ino:26443 uid:907 gid:907 rdev:0,0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
   9: S_IFIFO mode:0000 dev:568,0 ino:26443 uid:907 gid:907 rdev:0,0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
  10: S_IFDIR mode:0755 dev:90,65629 ino:8 uid:0 gid:0 size:20
      O_RDONLY|O_LARGEFILE FD_CLOEXEC
      /
      offset:0
  11: S_IFSOCK mode:0666 dev:579,0 ino:63635 uid:0 gid:0 rdev:0,0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        SOCK_STREAM
        SO_REUSEADDR,SO_SNDBUF(128000),SO_RCVBUF(1048576)
        sockname: AF_INET 0.0.0.0  port: 12345
  12: S_IFSOCK mode:0666 dev:579,0 ino:33749 uid:0 gid:0 rdev:0,0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        SOCK_STREAM
        SO_REUSEADDR,SO_SNDBUF(128000),SO_RCVBUF(1049800)
        sockname: AF_INET 172.25.15.16  port: 12345
        peername: AF_INET 172.25.15.41  port: 59512
  13: S_IFSOCK mode:0666 dev:579,0 ino:42976 uid:0 gid:0 rdev:0,0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        SOCK_STREAM
        SO_SNDBUF(128000),SO_RCVBUF(1049800)
        sockname: AF_INET 172.25.15.16  port: 53672
        peername: AF_INET 172.25.15.41  port: 1234
  16: S_IFSOCK mode:0666 dev:579,0 ino:26690 uid:0 gid:0 rdev:0,0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        SOCK_STREAM
        SO_SNDBUF(16384),SO_RCVBUF(5120)
        sockname: AF_UNIX 
        peer: node[36162] zone: 61f65917-2217-4076-8a0a-f09d69b93c54[23]
  18: S_IFSOCK mode:0666 dev:579,0 ino:43339 uid:0 gid:0 rdev:0,0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        SOCK_STREAM
        SO_SNDBUF(16384),SO_RCVBUF(5120)
        sockname: AF_UNIX 
        peer: node[36162] zone: 61f65917-2217-4076-8a0a-f09d69b93c54[23]
  20: S_IFSOCK mode:0666 dev:579,0 ino:53392 uid:0 gid:0 rdev:0,0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        SOCK_STREAM
        SO_SNDBUF(16384),SO_RCVBUF(5120)
        sockname: AF_UNIX 
        peer: node[36162] zone: 61f65917-2217-4076-8a0a-f09d69b93c54[23]


Comment by Former user
Created at 2019-07-03T12:21:47.076Z

We find it stuck in the kernel:

> ::walk thread | ::findstack -v
stack pointer for thread 1: 80446f8
[ 080446f8 libc.so.1`ioctl+0x15() ]
  08045cf8 libzfs.so.1`dump_ioctl+0x1a8(8144a48, 8047354, 0, 0, 0, 1)
  08045d78 libzfs.so.1`dump_snapshot+0x3d1(8144a48, 804734c)
  08045dd8 libzfs.so.1`zfs_iter_snapshots_sorted+0x65(8144cc8, feda51ea, 804734c)
  08046fd8 libzfs.so.1`dump_filesystem+0x180(8144cc8, 804734c)
  08047048 libzfs.so.1`dump_filesystems+0x2d(8144cc8, 804734c)
  080479d8 libzfs.so.1`zfs_send+0x9a5(8144cc8, 0, 8047e45, 8047bec, 1, 0)
  08047c38 zfs_do_send+0x764(4, 8047cf4)
  08047c88 main+0x22c(8047c8c, fef42388)
  08047cc8 _start_crt+0x97(5, 8047cf0, fefd0f60, 0, 0, 0)
  08047ce4 _start+0x1a(5, 8047df4, 8047e02, 8047e07, 8047e0a, 8047e0d)
stack pointer for thread 2: fe9add78
[ fe9add78 libc.so.1`__nanosleep+0x15() ]
  fe9adda8 libc.so.1`sleep+0x31(1)
  fe9aefc8 libzfs.so.1`send_progress_thread+0xae(8045d54)
  fe9aefe8 libc.so.1`_thrp_setup+0x88(fe6f0240)
  fe9aeff8 libc.so.1`_lwp_start(fe6f0240, 0, 0, 0, 0, 0)
[root@headnode (emy-15) ~]# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba fctl stmf_sbd stmf zfs mm sd lofs idm mpt_sas sata crypto random cpc fcp logindmux ptm kvm sppp nsmb smbsrv nfs ipc ]
> 0t38437::pid2proc | ::walk thread | ::findstack
stack pointer for thread fffffe2398cc1020 (zfs/1): fffffcc2723ba070
[ fffffcc2723ba070 _resume_from_idle+0x12b() ]
  fffffcc2723ba0a0 swtch+0x141()
  fffffcc2723ba110 cv_wait_sig+0x185()
  fffffcc2723ba140 str_cv_wait+0x27()
  fffffcc2723ba1f0 strwaitq+0x2c3()
  fffffcc2723ba290 strwrite_common+0x19c()
  fffffcc2723ba2c0 strwrite+0x1b()
  fffffcc2723ba370 sotpi_sendmsg+0x51f()
  fffffcc2723ba3d0 socket_sendmsg+0x56()
  fffffcc2723ba470 socket_vop_write+0x6c()
  fffffcc2723ba510 fop_write+0xf3()
  fffffcc2723ba5f0 vn_rdwr+0x27a()
  fffffcc2723ba670 dump_bytes+0x8e()
  fffffcc2723ba6c0 dump_record+0x88()
  fffffcc2723ba760 dump_write+0x2cb()
  fffffcc2723ba810 do_dump+0x64d()
  fffffcc2723ba9d0 dmu_send_impl+0x7e3()
  fffffcc2723bab10 dmu_send_obj+0x19f()
  fffffcc2723babb0 zfs_ioc_send+0x105()
..

Comment by Former user
Created at 2019-07-03T12:26:13.179Z

stream info:

  fffffcc2723ba1f0 strwaitq+0x2c3(fffffe269bb03bc0, 1, 0, 2, ffffffffffffffff, fffffcc2723ba23c)
...
> fffffe269bb03bc0::stream

+-----------------------+-----------------------+
| 0xfffffe26539cc3f8    | 0xfffffe26539cc300    |
| strwhead              | strrhead              |
|                       |                       |
| cnt = 0t0             | cnt = 0t0             |
| flg = 0x00004022      | flg = 0x00044032      |
+-----------------------+-----------------------+
            |                       ^
            v                       |
+-----------------------+-----------------------+
| 0xfffffe267946b0f8    | 0xfffffe267946b000    | 
| tl                    | tl                    | 
|                       |                       |
| cnt = 0t17008         | cnt = 0t0             | 
| flg = 0x0024402c      | flg = 0x00204032      | 
+-----------------------+-----------------------+
> fffffe269bb03bc0::str2mate
fffffe269bb03bc0

Comment by Former user
Created at 2019-07-03T12:35:41.871Z

Taking a look at our peer's manatee-sitter log:

[2019-07-03T12:08:47.803Z]  INFO: manatee-sitter/ZfsClient/38300 on 7948c62d-f436-4b69-b29b-75dacc3b2b8e: dataset "zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee" does not exist; not preserving (prefix=autorebuild, zfs_op="isolate dataset")
    dataset: zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee
[2019-07-03T12:08:47.804Z]  INFO: manatee-sitter/ZfsClient/38300 on 7948c62d-f436-4b69-b29b-75dacc3b2b8e: ZfsClient.restore: entering (serverUrl=http://172.25.15.16:12345, pollInterval=1000)
    dataset: zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee
[2019-07-03T12:08:47.804Z]  INFO: manatee-sitter/ZfsClient/38300 on 7948c62d-f436-4b69-b29b-75dacc3b2b8e: zfsClient._receive: starting zfs recv zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee
[2019-07-03T12:08:47.815Z]  INFO: manatee-sitter/ZfsClient/38300 on 7948c62d-f436-4b69-b29b-75dacc3b2b8e: listening for zfs send on (port=1234, host=172.25.15.41)
[2019-07-03T12:08:47.815Z]  INFO: manatee-sitter/ZfsClient/38300 on 7948c62d-f436-4b69-b29b-75dacc3b2b8e: ZfsClient.postRestoreRequest: entering (zfsHost=172.25.15.41, zfsPort=1234, serverUrl=http://172.25.15.16:12345, pollInterval=1000)
[2019-07-03T12:08:47.846Z]  INFO: manatee-sitter/ZfsClient/38300 on 7948c62d-f436-4b69-b29b-75dacc3b2b8e: ZfsClient.postRestoreRequest: exiting (err=null, serverUrl=http://172.25.15.16:12345, pollInterval=1000)
    dataset: zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee
    --
    response: {
      "jobid": "31073862-a4ea-4c4b-9f4c-13d0a865ddba",
      "jobPath": "/backup/31073862-a4ea-4c4b-9f4c-13d0a865ddba"
    }
[2019-07-03T12:08:47.846Z]  INFO: manatee-sitter/ZfsClient/38300 on 7948c62d-f436-4b69-b29b-75dacc3b2b8e: zfsClient.pollRestoreCompletion: entering (serverUrl=http://172.25.15.16:12345, jobPath=/backup/31073862-a4ea-4c4b-9f4c-13d0a865ddba, pollInterval=1000)
[2019-07-03T12:08:47.855Z]  INFO: manatee-sitter/ZfsClient/38300 on 7948c62d-f436-4b69-b29b-75dacc3b2b8e: ZFSClient._receive: got socket, piping to zfs recv
[2019-07-03T12:08:47.877Z]  INFO: manatee-sitter/ZfsClient/38300 on 7948c62d-f436-4b69-b29b-75dacc3b2b8e:
    zfs recv stdout:  receiving full stream of zones/61f65917-2217-4076-8a0a-f09d69b93c54/data/manatee@1562155726942 into zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee@1562155726942

[2019-07-03T12:08:47.877Z]  INFO: manatee-sitter/ZfsClient/38300 on 7948c62d-f436-4b69-b29b-75dacc3b2b8e:
    zfs recv stderr:  cannot open 'zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee': dataset does not exist
    cannot receive new filesystem stream: dataset does not exist

[2019-07-03T12:08:47.880Z] ERROR: manatee-sitter/ZfsClient/38300 on 7948c62d-f436-4b69-b29b-75dacc3b2b8e: zfsClient._receive: zfs recv failed
    VError: zfs recv: cannot open 'zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee': dataset does not exist
    cannot receive new filesystem stream: dataset does not exist
     1
        at ChildProcess.<anonymous> (/opt/smartdc/manatee/node_modules/manatee/lib/zfsClient.js:809:32)
        at ChildProcess.emit (events.js:98:17)
        at Process.ChildProcess._handle.onexit (child_process.js:820:12)
[2019-07-03T12:08:47.880Z]  INFO: manatee-sitter/ZfsClient/38300 on 7948c62d-f436-4b69-b29b-75dacc3b2b8e: zfsClient._receive: completed zfs recv
...
# zfs list
NAME                                              USED  AVAIL  REFER  MOUNTPOINT
zones                                             213G  7.81T  2.13M  /zones
zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e       23.4M  50.0G  1.32G  /zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e
zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data   205K  50.0G   205K  /zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data

Comment by Former user
Created at 2019-07-03T13:13:22.242Z

So that dataset most definitely does, and did, exist, as smartdc-mdata:execute.log shows:

[2019-07-03T12:07:59Z] /opt/smartdc/boot/setup.sh:90: common_manatee_setup(): zfs set compress=lz4 zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data
[2019-07-03T12:07:59Z] /opt/smartdc/boot/setup.sh:92: common_manatee_setup(): echo 'setting recordsize to 8K on manatee dataset'
setting recordsize to 8K on manatee dataset
[2019-07-03T12:07:59Z] /opt/smartdc/boot/setup.sh:93: common_manatee_setup(): zfs set recordsize=8k zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data

Comment by Former user
Created at 2019-07-03T13:15:16.315Z

If I shut down the primary, I can send it directly myself from the gz:

zfs send zones/61f65917-2217-4076-8a0a-f09d69b93c54/data/manatee@1562155726942 | zfs recv zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee

Comment by Former user
Created at 2019-07-03T13:19:57.202Z

Between the two zones, though:

[root@61f65917-2217-4076-8a0a-f09d69b93c54 (emy-15:manatee0) ~]# zfs send zones/61f65917-2217-4076-8a0a-f09d69b93c54/data/manatee@1562155726942 | nc  172.25.15.41 4444
..
# nc -l 172.25.15.41 4444 | zfs recv zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee.test
cannot open 'zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee.test': dataset does not exist
cannot receive new filesystem stream: dataset does not exist

Comment by Former user
Created at 2019-07-03T13:31:20.727Z

The ENOENT is coming from:

[root@headnode (emy-15) ~]# dtrace -n 'zfs_secpolicy_write_perms:return  /execname == "zfs"/ { print(execname); stack(); ustack(); print(arg1); }'
dtrace: description 'zfs_secpolicy_write_perms:return  ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  1  65453 zfs_secpolicy_write_perms:return string "zfs"

              zfs`zfs_secpolicy_recv+0x40
              zfs`zfsdev_ioctl+0x12a
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`_sys_sysenter_post_swapgs+0x153

              0xfeec6995
              0xfeda03fa
              0xfeda32af
              0xfeda33e5
              0x805c13f
              0x80600f2
              0x8056418
              0x80562ea
int64_t 0x2

Comment by Former user
Created at 2019-07-03T13:54:17.104Z

Namely:

  1  69097           dsl_dataset_hold:entry string "zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee.test"
string "zfs"

              zfs`zfs_secpolicy_write_perms+0x87
              zfs`zfs_secpolicy_recv+0x40
              zfs`zfsdev_ioctl+0x12a
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`_sys_sysenter_post_swapgs+0x153

...
  1  69098          dsl_dataset_hold:return string "zfs"

              zfs`zfs_secpolicy_write_perms+0x87
              zfs`zfs_secpolicy_recv+0x40
              zfs`zfsdev_ioctl+0x12a
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`_sys_sysenter_post_swapgs+0x153

              0xfeec6995
              0xfeda03fa
              0xfeda32af
              0xfeda33e5
              0x805c13f
              0x80600f2
              0x8056418
              0x80562ea
int64_t 0x2


Comment by Former user
Created at 2019-07-03T14:34:08.011Z

libzfs seems to be broken:

(emy-15:manatee1) ~]# nc -l 172.25.15.41 4444 | truss -o /tmp/t zfs recv -v zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee.test
(emy-15:manatee0) ~]# /usr/sbin/zfs send zones/61f65917-2217-4076-8a0a-f09d69b93c54/data/manafoo  | nc  172.25.15.41 4444

This is happy with joyent_20190621T175631Z, where:

 51  69927         zfs_secpolicy_recv:entry char [1024] "zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data"

Under the newer PI, this becomes:

  3  65478         zfs_secpolicy_recv:entry char [1024] "zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee.test"

i.e. we're no longer passing the parent along to the ioctl, but the (not-yet-existing) destination itself.


Comment by Former user
Created at 2019-07-03T14:56:22.071Z

Broken:

# mdb -ke ::status
debugging live kernel (64-bit) on headnode
operating system: 5.11 joyent_20190626T201442Z (i86pc)
git branch: master
git rev: 356e1b4941ba70261404f9bb0255f70891042331
image uuid: (not set)

This is still after the ZFS encryption changes landed.


Comment by Former user
Created at 2019-07-03T15:49:36.855Z

Works (mostly) with:

# mdb -ke ::status
debugging live kernel (64-bit) on headnode
operating system: 5.11 joyent_20190624T212418Z (i86pc)
git branch: master
git rev: 394dadbc86535e5075b260851683ccbb3f571c6d
image uuid: (not set)

I see:

# nc -l 172.25.15.41 4444 | truss -o /tmp/t zfs recv -v zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee.test
receiving full stream of zones/61f65917-2217-4076-8a0a-f09d69b93c54/data/manafoo@--head-- into zones/7948c62d-f436-4b69-b29b-75dacc3b2b8e/data/manatee.test@--head--
received 46.9KB stream in 1 seconds (46.9KB/sec)
SMF Initialization problems..svc:/network/nfs/server:default
SMF Initialization problems..svc:/network/nfs/nlockmgr:default
SMF Initialization problems..svc:/network/nfs/nlockmgr:default
SMF Initialization problems..svc:/network/nfs/nlockmgr:default
SMF Initialization problems..svc:/network/nfs/nlockmgr:default
SMF Initialization problems..svc:/network/nfs/server:default
SMF Initialization problems..svc:/network/nfs/server:default
SMF Initialization problems..svc:/network/nfs/client:default
SMF Initialization problems..svc:/network/nfs/client:default
SMF Initialization problems..svc:/network/nfs/server:default
SMF Initialization problems..svc:/network/nfs/mapid:default
SMF Initialization problems..svc:/network/nfs/server:default
SMF Initialization problems..svc:/network/nfs/server:default
SMF Initialization problems..svc:/network/nfs/server:default
SMF Initialization problems..svc:/network/nfs/server:default
SMF Initialization problems..svc:/network/nfs/server:default
SMF Initialization problems..svc:/network/nfs/server:default
SMF Initialization problems..svc:/network/nfs/server:default
SMF Initialization problems..svc:/network/nfs/status:default
NFS plugin problem with SMF repository: unknown -1
SMF Initialization problems..svc:/system/filesystem/autofs
SMF Initialization problems..svc:/system/filesystem/autofs
SMF Initialization problems..svc:/system/filesystem/autofs
SMF Initialization problems..svc:/system/filesystem/autofs
SMF Initialization problems..svc:/system/filesystem/autofs
SMF Initialization problems..svc:/system/filesystem/autofs
AUTOFS plugin problem with SMF properties: unknown -1
libshare SMF initialization problem: entity not found

Comment by Former user
Created at 2019-07-03T17:33:38.527Z

I built a standalone script and some DTrace, and I can see the problem before and after the crypto changes.


Comment by Former user
Created at 2019-07-03T18:55:44.789Z

I have a one-line fix that corrects zc_name back to how it worked on illumos before the crypto wad. I've done a zfs-test run of all the send/recv tests on both OI (illumos-gate) and SmartOS (illumos-joyent), before and after the fix. The tests all pass both ways, so the fix is compatible with the rest of the system.


Comment by Jira Bot
Created at 2019-07-03T20:26:36.902Z

illumos-joyent commit 7b68e6271c0d65aaf13b11f96df43d8572db22fb (branch master, by Jerry Jelinek)

OS-7879 zfs recv is broken
Reviewed by: Kody Kantor <kody.kantor@joyent.com>
Reviewed by: Jason King <jason.king@joyent.com>
Approved by: Kody Kantor <kody.kantor@joyent.com>