
New Video: Running Without a ZFS Root Pool

(Apologies to Bill for this taking so long! I was hoping to find a way to improve the audio quality, but… I didn’t.)

At ZFS Day, Bill Pijewski presented the following talk:

One of the main design principles of ZFS is merging the management of physical volumes with individual filesystems. Instead of relying on an underlying volume manager, ZFS manages disks directly and aggregates them into pools from which individual filesystems are allocated. Storage servers using ZFS typically configure two pools: one pool onto which the system’s root filesystem is installed, and a second for the data to be managed by that system.

At Joyent we’ve taken a different approach and discarded the root pool in favor of a single system-wide pool. Not only does this approach free up an additional two drives to be used for main storage, it also provides us flexibility in upgrading system software, higher customer multitenancy, and ease of deploying new machines. In this talk, I’ll describe our overall architecture, talk about challenges we faced in constructing such an architecture, and characterize our experiences having deployed this model in production over the last 18 months.



• Why ZFS is important to Joyent

• Evolution of USB and PXE boot architectures

• Running with no system pool

ZFS at Joyent

• We run a production cloud with many servers in datacenters worldwide

• Two kinds of zones (covered in detail in other talks):

• Zones: sparse zones share libraries with the platform

• VMs: fully virtualized GNU/Linux, Windows, FreeBSD, etc. machines

• Use small number of NFS machines to provide additional storage capacity in each datacenter

ZFS for Zones and VMs

• Zones are allocated two ZFS datasets

• One dataset for data in that zone

• Another for core files — to prevent cores from exceeding quota

• VMs have a ZFS volume into which the VM image is installed, plus one or more additional volumes presented to guest as disks

• Guest filesystems are installed into volumes
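A minimal sketch of the dataset layout described above, written as Python that only builds `zfs` command lines and does not run them. The pool name `zones`, the `cores/` parent dataset, the zone and volume names, and the quota and size values are all assumptions for illustration; the talk does not give Joyent's actual naming or sizes.

```python
import shlex


def zone_dataset_cmds(pool, zone, data_quota, cores_quota):
    """Build commands for a zone's two datasets: one for data, one for cores.

    The <pool>/cores/<zone> layout is a hypothetical naming choice.
    """
    return [
        ["zfs", "create", "-o", f"quota={data_quota}", f"{pool}/{zone}"],
        # -p creates the parent "cores" dataset if it does not exist yet.
        ["zfs", "create", "-p", "-o", f"quota={cores_quota}",
         f"{pool}/cores/{zone}"],
    ]


def vm_volume_cmd(pool, volume, size):
    """Build the command for a ZFS volume (zvol) presented to a guest as a disk."""
    return ["zfs", "create", "-V", size, f"{pool}/{volume}"]


if __name__ == "__main__":
    # Hypothetical names and sizes, printed rather than executed.
    for cmd in zone_dataset_cmds("zones", "web01", "10G", "2G"):
        print(shlex.join(cmd))
    print(shlex.join(vm_volume_cmd("zones", "vm01-disk0", "20G")))
```

Because the cores dataset carries its own quota, a large core file counts against that quota rather than against the zone's data quota.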

ZFS in different contexts

• For Joyent, two main contexts: SmartOS and SDC

• SmartOS: community distribution, illumos + lightweight virtualization tools

• SmartDataCenter (SDC): SmartOS + full cloud management and orchestration stack

Important ZFS features

• As with all ZFS users, we rely on end-to-end data integrity

• Copy-on-write architecture: snapshots, clones

• Compression

• Space management tools: quotas and reservations

• Replication to move customers around between different machines
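As a rough illustration of the snapshot, clone, and replication features listed above, the following Python sketch assembles the corresponding `zfs` command lines without executing them. The dataset names, snapshot name, and destination host `cn2` are hypothetical, and using `ssh` as the transport is an assumption; the talk does not describe how Joyent actually moves customers between machines.

```python
import shlex


def snapshot_cmd(dataset, snap):
    """Command for a point-in-time, copy-on-write snapshot."""
    return ["zfs", "snapshot", f"{dataset}@{snap}"]


def clone_cmd(dataset, snap, target):
    """Command for a writable clone backed by an existing snapshot."""
    return ["zfs", "clone", f"{dataset}@{snap}", target]


def replicate_pipeline(dataset, snap, dest_host, dest_dataset):
    """Shell pipeline streaming a snapshot to another machine.

    ssh as the transport is an assumption made for this sketch.
    """
    send = ["zfs", "send", f"{dataset}@{snap}"]
    recv = ["ssh", dest_host, "zfs", "receive", dest_dataset]
    return f"{shlex.join(send)} | {shlex.join(recv)}"


if __name__ == "__main__":
    print(shlex.join(snapshot_cmd("zones/web01", "migrate")))
    print(shlex.join(clone_cmd("zones/web01", "migrate", "zones/web01-test")))
    print(replicate_pipeline("zones/web01", "migrate", "cn2", "zones/web01"))
```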

Tuesday, October 2, 2012

Delegated administration!

• In our next SDC release, we enable delegated administration

• Allows customers to:

• Take snapshots outside of Joyent’s API

• Create child datasets

• Snapshot and clone datasets

• Replicate or migrate data between instances

• Open work: basic limits on delegated activity to avoid denial of service (DoS)

ZFS Performance

• SSDs for ZIL

• ARC

• We hold back some portion of a server’s total memory, knowing that a good portion of this memory will be consumed by the ARC

• Committing memory achieves greater I/O performance

• ZFS I/O throttle for QoS controls

• For more information, check out Brendan Gregg’s excellent talk next door
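The ARC memory hold-back mentioned above comes down to simple arithmetic. This Python sketch shows the idea only; the 256 GiB server and the 25% hold-back fraction are hypothetical numbers chosen for illustration, since the talk does not state what portion Joyent reserves.

```python
def provisionable_memory_gib(total_gib, holdback_fraction):
    """Memory left for customer zones and VMs after reserving a share for the ARC.

    holdback_fraction is an illustrative parameter, not a Joyent setting.
    """
    if not 0 <= holdback_fraction < 1:
        raise ValueError("holdback_fraction must be in [0, 1)")
    return total_gib * (1 - holdback_fraction)


if __name__ == "__main__":
    # Hypothetical server: 256 GiB of RAM with a quarter held back.
    print(provisionable_memory_gib(256, 0.25))
```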

Read-only system pool

• At Fishworks, we decided to have a read-only system pool

• Necessary for OS install as well as analytics data

• Simplified some things:

• No unnecessary customizations from customers

• Discouraged hot patching

• Other disadvantages:

• Upgrade, rollback, and factory reset were tricky

SmartOS USB Boot

• Instead of installing OS to root disks, SmartOS boots from a USB key

• Entire kernel and userland fit in about 200 MB (compressed)

• Other software can be installed from pkgsrc

• Single ZFS pool for all zones

USB Boot Advantages

• All disks are available for zone/VM storage, thereby increasing both performance and capacity

• Encourages users to provision a zone for each application rather than using the global zone

• Discourages customization and one-off patching

• Fast to get up and running

• Easy to “bring your OS with you”

SmartDataCenter (SDC) Architecture

• Two kinds of servers: head nodes and compute nodes

• Head nodes run management, provisioning, monitoring, and boot services

• Compute nodes contain customer zones

• Head nodes are similar to SmartOS installs

• Each compute node PXE boots its platform from the head node

• Both head nodes and compute nodes have a single ZFS pool

PXE Boot Advantages

• Ben Rockwood, 10/1/2012: “Apparently other people spend time installing software. I think that’s stupid.”

• As with SmartOS, operators are encouraged to put applications in zones instead of the global zone

• Upgrade = rollback = reboot, nothing more

• Newer platforms can be staged and machines rebooted later

• Any machine which hits a known, already-fixed problem will automatically boot onto a fresh platform

Storage pools!

• Most OSes assume the existence of a “system” pool: a pool onto which the OS, applications, and configuration information are installed

• Joyent moving away from single-vdev pools backed by hardware RAID

• Embracing hybrid storage pool (HSP) using an SSD for the ZFS intent log (ZIL)

• Everything else worked on RAID-Z pools except for saving a crash dump
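A hybrid storage pool of the kind described above can be sketched as a single `zpool create` invocation: a RAID-Z group of data disks plus a separate SSD log device for the ZIL. The Python below only builds the command line. The pool name `zones` and the device names are hypothetical, and the talk does not specify Joyent's actual disk counts or layout.

```python
import shlex


def hybrid_pool_cmd(pool, data_disks, log_ssd, layout="raidz"):
    """Build a zpool create command: one RAID-Z vdev plus a separate log device."""
    if len(data_disks) < 2:
        raise ValueError("a RAID-Z vdev needs at least two disks")
    return ["zpool", "create", pool, layout, *data_disks, "log", log_ssd]


if __name__ == "__main__":
    # Hypothetical device names, printed rather than executed.
    cmd = hybrid_pool_cmd("zones", ["c0t0d0", "c0t1d0", "c0t2d0"], "c0t3d0")
    print(shlex.join(cmd))
```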

RAID-Z Crash Dump

• Problem: have only one RAID-Z or mirrored pool but cannot save crash dump on said pool

• Implement crash dumps on RAID-Z (majority of work) and pools with multiple vdevs

• Not necessary to save parity bits for crash dump data:

• Crash dump is immediately saved upon reboot

• Needs to be reliable, simple, and (hopefully) fast

Why no parity bits?

• Since DVAs on the dump device are preallocated, use those 128K blocks for each write

• Most calls into dump entry point are not block aligned

• Rather than write variable size, use original 128K

• At first I calculated parity bits, but my test machine then took three hours to save a crash dump

• No parity calculated — on a pool with vdevs, each write could require n-1 (synchronous) reads
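To make the n-1 reads concrete, here is a small Python illustration of XOR parity, which is how single-parity RAID-Z computes its parity column. It is a sketch of the general technique, not of the dump code path itself. It assumes that n refers to the number of data columns in a stripe, which the slide does not spell out, and the byte values are made up.

```python
from functools import reduce


def xor_parity(columns):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*columns))


def reads_to_recompute_parity(data_columns, columns_written):
    """Columns that must be read back to recompute parity from scratch.

    Assumes n (data_columns) counts the data columns in one stripe.
    """
    return data_columns - columns_written


if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]  # made-up data columns
    parity = xor_parity(stripe)
    # Any one missing column can be rebuilt from parity plus the others.
    rebuilt = xor_parity([parity, stripe[1], stripe[2]])
    print(parity.hex(), rebuilt == stripe[0])
    # Overwriting one column of a 4-column stripe means reading the other 3.
    print(reads_to_recompute_parity(4, 1))
```

Since a correct parity value depends on every data column in the stripe, a write that changes only one column cannot update parity without first reading the others, and those reads would sit on the dump path.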

Other system components

• Swap device (thankfully) supports RAID-Z pools

• /var, /opt have their own datasets

• /etc not persistent

• /root also not persistent, again incentivizing people to configure applications in zones rather than using the GZ


• The single ZFS pool has simplified Joyent’s deployment

• Delegated administration has given customers more power

• ZFS has been and will continue to be a crucial component of our architecture for many years

