OS-7483: excessive page destruction caused by 6602

Details

Issue Type: Bug
Priority: 4 - Normal
Status: Resolved
Created at: 2019-01-04T19:52:18.299Z
Updated at: 2019-01-08T02:33:29.692Z

People

Created by: Former user
Reported by: Former user
Assigned to: Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-01-08T02:33:29.676Z)

Fix Versions

2019-01-17 Jenna Maroney (Release Date: 2019-01-17)

Description

As part of the work for illumos#6602, logic was added to flush any dirty pages when destroying a lofi instance:

        if (lsp->ls_vp != NULL) {
                (void) VOP_PUTPAGE(lsp->ls_vp, 0, 0, B_INVAL, credp, NULL);
                (void) VOP_CLOSE(lsp->ls_vp, lsp->ls_openflag,
                    1, 0, credp, NULL);
                VN_RELE(lsp->ls_vp);
        }

At the time, questions were raised about the necessity of this call. That said, the vdev_file logic in ZFS, which could be considered a similar layering use case, performs a similar flush of pages on close.

This has come under scrutiny due to an ugly interaction between the above lofi logic and tmpfs. During the live image portion of a SmartOS build, lofi is used on top of tmpfs in order to create the ramdisk resources. After the filesystem operations are complete, the lofi instances are torn down so the resulting files can be used for the boot archive. The lofi_destroy() call during tear-down leads to the VOP_PUTPAGE() operation being called on a tmpfs vnode. Although tmp_putpage() handles most flag combinations with a no-op (it is just in RAM, after all), it reads B_INVAL as an edict to flush all vnode pages to its swap backing store before freeing the page_t instances. Not only does this cause a large amount of writes to swap, those writes are also all done a single page at a time, since B_INVAL acts as an explicit block to klustering. All in all, it results in a large number of small sync writes to swap for an operation that was put on tmpfs to avoid FS writes in the first place.

Stepping back for a moment, let's examine the difference between the (in this case, ill-fated) behavior of B_INVAL vs B_FREE. Both will flush modified pages to the backing resource (absent other modifiers like B_TRUNC) before releasing the page_t resource. B_FREE will place the page on the cachelist, where it still retains its contents and its association with the vnode, so it can be quickly brought back into service if needed. B_INVAL severs the page's association with the vnode and places it directly on the freelist.

In the case of lofi, releasing pages to the cachelist (using B_FREE), where they can be quickly pressed back into service for that file resource, seems like the ideal situation. This is in stark contrast to some other existing B_INVAL users, like the NFS client, where an expiring remote resource might necessitate the elimination of any cached data associated with it.
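If that reasoning holds, the change is presumably confined to the flag passed in the teardown path. A sketch of the adjusted call from the snippet above, with the surrounding code unchanged (the ticket does not show the final commit, so whether the fix took exactly this shape is an assumption):

```c
        if (lsp->ls_vp != NULL) {
                /* B_FREE instead of B_INVAL: flush dirty pages but leave
                 * them on the cachelist, still associated with the vnode. */
                (void) VOP_PUTPAGE(lsp->ls_vp, 0, 0, B_FREE, credp, NULL);
                (void) VOP_CLOSE(lsp->ls_vp, lsp->ls_openflag,
                    1, 0, credp, NULL);
                VN_RELE(lsp->ls_vp);
        }
```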

Comments

Comment by Former user
Created at 2019-01-05T14:40:51.645Z

To test the performance implications of this, I drew up a small script to act upon a lofi device instantiated over tmpfs:

#!/bin/bash
set -x

testfile=`mktemp -t lofitest_XXXX`
size=64

mkfile ${size}m $testfile

lodev=`pfexec lofiadm -a $testfile`

if [[ -n $lodev ]]; then
        dd if=/dev/zero of=$lodev bs=1024k count=$size
fi
ptime pfexec lofiadm -d $lodev
dd if=$testfile of=/dev/null bs=1024k

Without the fix, performance was slow as expected:

++ mktemp -t lofitest_XXXX
+ testfile=/tmp/lofitest_5eTt
+ size=64
+ mkfile 64m /tmp/lofitest_5eTt
++ pfexec lofiadm -a /tmp/lofitest_5eTt
+ lodev=/dev/lofi/2
+ [[ -n /dev/lofi/2 ]]
+ dd if=/dev/zero of=/dev/lofi/2 bs=1024k count=64
64+0 records in
64+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 0.104588 s, 642 MB/s
+ ptime pfexec lofiadm -d /dev/lofi/2

real       13.044614293
user        0.000563764
sys         0.341955549
+ dd if=/tmp/lofitest_5eTt of=/dev/null bs=1024k
64+0 records in
64+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 0.308707 s, 217 MB/s

With the patch, things picked up quite a bit:

++ mktemp -t lofitest_XXXX
+ testfile=/tmp/lofitest_OWY7
+ size=64
+ mkfile 64m /tmp/lofitest_OWY7
++ pfexec lofiadm -a /tmp/lofitest_OWY7
+ lodev=/dev/lofi/2
+ [[ -n /dev/lofi/2 ]]
+ dd if=/dev/zero of=/dev/lofi/2 bs=1024k count=64
64+0 records in
64+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 0.111955 s, 599 MB/s
+ ptime pfexec lofiadm -d /dev/lofi/2

real        0.007979472
user        0.001244637
sys         0.004401436
+ dd if=/tmp/lofitest_OWY7 of=/dev/null bs=1024k
64+0 records in
64+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 0.0354094 s, 1.9 GB/s

Comment by Former user
Created at 2019-01-08T00:59:06.509Z

I did a functional smoke test to see if known content copied into a lofi device had the same checksum in the file:

mkfile 20m <filename>
lofiadm -a <filename>
dd if=/dev/zero of=/dev/lofi/#
dd of=/dev/lofi/#
... ENTER FIXED CONTENT
^D
lofiadm -d /dev/lofi/#
digest -a sha1 <filename>
*compare to same test done on "vanilla" platform*

Additionally, I did a full smartos-live build on a machine with this patch and then booted the resulting live image to check for any problems there.

Both tests showed no sign of issues.


Comment by Jira Bot
Created at 2019-01-08T01:26:40.771Z

illumos-joyent commit 1681fea05e271e6e0ec1aa68bf37b34cbcf5854e (branch master, by Patrick Mooney)

OS-7483 excessive page destruction caused by 6602
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Mike Gerdts <mike.gerdts@joyent.com>
Reviewed by: Ryan Zezeski <rpz@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>