ZFS

ZFS is a combined logical volume manager and filesystem originally developed at Sun Microsystems with a focus on data integrity and performance.

Features

  • Fast (constant time) snapshots and restores, enabled by a copy-on-write integrity model.
  • Real-time deduplication (roughly 5GB of RAM required per TB of storage).
  • Hierarchical nesting of datasets, with cascading properties (e.g. compression and storage quotas), sketched below.
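
As a minimal sketch of cascading properties (the pool and dataset names below are hypothetical), a property set on a parent dataset is inherited by its descendants unless they override it:

  # create a parent dataset with compression enabled
  zfs create -o compression=lz4 tank/projects

  # children inherit compression from the parent by default;
  # the SOURCE column of zfs get reports where the value came from
  zfs create tank/projects/website
  zfs get compression tank/projects/website

  # a quota on the parent caps space usage for the whole subtree
  zfs set quota=100G tank/projects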

History

  • 2001: development starts at Sun Microsystems as part of Solaris.
  • 2005: released as part of OpenSolaris under the CDDL.
  • 2008: released as part of FreeBSD.
  • 2010: OpenSolaris development ceases and illumos is forked.
  • 2013: the OpenZFS project is founded, bringing the illumos, FreeBSD, Linux and OS X ports under one umbrella; ZFS on Linux reaches its first stable release.
  • 2016: Ubuntu 16.04 ships ZFS support out of the box.

Concepts

ZFS storage topology

  • A zpool is a pool of storage vdevs. A vdev cannot be shared between multiple zpools. The pool itself is not redundant: loss of a storage or special vdev effectively loses the pool.
    • A vdev (virtual device) represents either a single disk or a set of disks, all of approximately the same size; think of it as a container for one or more storage devices (see the first sketch after this list). Broadly speaking, vdevs fall into three categories:
      • Devices:
        • disks are typical block devices, usually in /dev.
        • files are regular files, and shouldn't be used outside of experimentation.
      • Standard (storage):
        • stripe stripes data across a set of disks.
        • mirror mirrors data across a set of disks.
        • RAID-Z (data disks ideally a power of two, plus parity; e.g. 2+1 for raidz1, 2+2 for raidz2, 8+3 for raidz3):
          • raidz1 (or just raidz) offers single parity, similar to RAID-5.
          • raidz2 offers double parity, similar to RAID-6.
          • raidz3 offers triple parity.
        • dRAID is a variant of RAID-Z that provides hot spares for faster resilvering. dRAID vdevs are constructed from multiple raid groups, each with D data devices and P parity devices.
          • draid<parity 1-3> (or just draid for draid1) offers the specified parity.
          • draid<parity>[:<data>d][:<children>c][:<spares>s] sets up a custom dRAID configuration with the specified number of data devices per group (default 8), total child devices (unset by default; no check) and distributed spares (default 0).
      • Support (optional):
        • The log (SLOG) device provides fast, non-volatile storage for the ZIL (ZFS intent log). The ZIL is where synchronous writes (e.g. `sync()`) can be committed quickly before being written to final storage as part of a transaction group (TXG).
        • cache (L2ARC) provides a read cache operating below the ARC (in RAM). It's not an ARC, but rather a read buffer containing blocks evicted from the ARC.
        • special (Special allocation) vdevs allow offloading storage of system metadata and small writes independently of main storage. Loss of a special allocation vdev leads to loss of the pool.
        • spares are pseudo-vdevs that are used to track hot spares available for a zpool.
        • dedup devices are used to store deduplication tables. Redundancy should match the other pool devices. Multiple devices will be subject to load-balancing.
    • Blocks are the most basic unit of allocation and are sized according to the dataset's recordsize property (volblocksize for volumes).
    • Checkpoints are pool-wide snapshots that can be used to rewind the pool to an earlier state. Only one checkpoint can be retained per-pool at any one time.
  • Datasets are filesystems or block devices hosted on a zpool. Filesystem mount points are managed by ZFS itself; there's no need to edit /etc/fstab.
    • Snapshots are read-only copies of a filesystem or volume at a point in time. They're quick to create, use no additional disk space at creation time, and only consume space as the dataset changes. They're accessible within the filesystem under the .zfs/snapshot directory if the snapdir property is set to visible (see the second sketch after this list), and can be sent between zpools on the same host and even between networked hosts.
    • Bookmarks are similar to snapshots, but are not visible to the filesystem.
    • Clones are writeable copies of a dataset that are created from a snapshot, and their existence prevents deletion of this snapshot. A parent-child relationship is created between the clone and the "origin" filesystem.
    • Properties associate information with a dataset:
      • Native properties either export internal ZFS statistics or allow configuring dataset behaviour. They're defined in zfsprops(7).
      • User-defined properties allow association of arbitrary metadata with a dataset and must contain a colon (:). They should be named in the format module:property, though this is not enforced.
  • Volumes (zvols) are block devices allocated on the pool which can be used to back swap devices, other filesystems, or be passed through to VMs.
  • Resilvering is the process of rebuilding an array after disk replacement, by reading only the data required to restore redundancy.
  • Scrubbing is a zpool integrity check similar to resilvering except that it reads all data. Unlike traditional fsck it can take place with the pool online, though with performance degradation. Identified data corruption will be rectified using data on other disks in the pool where possible. Pools should undergo regular scrubbing, though the frequency depends on the nature and scale of the data.
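
As a rough sketch of the topology above (the pool name and device paths are hypothetical), a single zpool can combine storage vdevs with optional support vdevs:

  # two mirrored pairs for storage, plus a log, cache and hot spare
  zpool create tank \
      mirror /dev/sda /dev/sdb \
      mirror /dev/sdc /dev/sdd \
      log /dev/nvme0n1 \
      cache /dev/nvme1n1 \
      spare /dev/sde

  # show the resulting vdev tree and pool health
  zpool status tank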
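
A second sketch (again with hypothetical names, and assuming the default mountpoint of /POOL/DATASET) showing snapshot visibility via the snapdir property:

  # make the .zfs directory show up in directory listings for this dataset
  zfs set snapdir=visible tank/projects

  # take a snapshot and browse its contents read-only
  zfs snapshot tank/projects@before-upgrade
  ls /tank/projects/.zfs/snapshot/before-upgrade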

Management

Management of ZFS is split across two main commands:

  • zpool manages devices (disks, vdevs, zpools); see the first sketch after this list.
    • create POOL MODE DEVICES [...MODE DEVICES] creates a pool with the specified vdevs.
    • destroy POOL destroys the named pool.
    • list lists active zpools.
    • events lists events generated by the kernel module, usually consumed by zed to enable response to hardware failures.
      • -v prints the full payload rather than just the name.
      • -f follows the events.
    • checkpoint POOL creates a checkpoint of the current state.
      • -d discards the existing checkpoint.
    • export POOL removes a pool from the running system after unmounting any volumes.
    • import POOL imports a pool into the running system.
      • -d DEVICE_NODE_PATH lets ZFS determine which devices need to be imported from the contents of the specified path (e.g. /dev/disk/by-id).
      • --rewind-to-checkpoint rewinds the pool to the associated checkpoint as it's imported.
    • offline POOL DEVICE takes the named device in the named pool offline, leaving the rest of the pool online.
    • online POOL DEVICE takes the named device back online.
    • resilver POOL [...POOL] initiates a resilver.
    • scrub POOL initiates a scrub.
      • -s stops an in-progress scrub.
      • -p pauses a scrub until it's resumed by a subsequent zpool scrub.
      • -w waits for the scrub to complete.
  • zfs manages datasets; see the second sketch after this list.
    • create POOL/DATASET creates a dataset, by default a ZFS filesystem.
      • -s creates a sparse volume (used with -V), without a storage reservation.
      • -V VOLUME_SIZE creates a volume (zvol) of the specified size instead of a ZFS filesystem. This allows using pooled storage to back non-ZFS consumers.
    • destroy POOL/VOLUME removes the named volume from a pool.
    • rename OLD NEW changes the name of a volume, and can move volumes between parents in the hierarchy.
    • get PROPERTY|all POOL/VOLUME gets a property on a volume.
    • set PROPERTY=VALUE POOL/VOLUME sets a property on a volume.
    • snapshot DATASET@SNAPSHOT creates a snapshot with the specified name of the named dataset.
      • -r causes the command to recurse into descendent datasets.
    • bookmark SNAPSHOT|BOOKMARK NEW_BOOKMARK creates a bookmark of the specified snapshot or existing bookmark, which can be used as an incremental source for a zfs send.
    • send DATASET|VOLUME|SNAPSHOT generates a stream of the named item for use with zfs receive.
      • -i SNAPSHOT|BOOKMARK limits the stream to changes from the specified incremental snapshot.
    • receive DATASET|VOLUME|SNAPSHOT receives a stream generated by zfs send and writes it to the specified destination.
    • clone SNAPSHOT DATASET|VOLUME creates a read-write copy of an existing dataset from the named snapshot.
    • promote CLONE removes the parent-child relationship between a clone and its source, allowing deletion of the source snapshot.
  • zdb presents information about the internal state of a ZFS pool, and may be useful for troubleshooting and debugging.
  • fsck.zfs provides a thin wrapper around zpool status for compatibility with the existing filesystem infrastructure.
  • mount.zfs provides a mount helper.
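
A first sketch of day-to-day zpool management (the pool and device names are hypothetical):

  # kick off a scrub and wait for it to finish
  zpool scrub -w tank

  # take a checkpoint before a risky change, then discard it once happy
  zpool checkpoint tank
  zpool checkpoint -d tank

  # take a device offline for maintenance, then bring it back online
  zpool offline tank /dev/sdc
  zpool online tank /dev/sdc

  # move the pool to another machine
  zpool export tank
  zpool import -d /dev/disk/by-id tank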
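
A second sketch of dataset management: snapshot, replicate to another pool, then clone (all names are hypothetical):

  # snapshot a dataset and replicate it to a backup pool
  zfs snapshot tank/projects@monday
  zfs send tank/projects@monday | zfs receive backup/projects

  # later, send only the changes since the previous snapshot
  zfs snapshot tank/projects@tuesday
  zfs send -i tank/projects@monday tank/projects@tuesday | zfs receive backup/projects

  # create a writable clone from a snapshot, then break its dependency
  # on the origin so the snapshot can eventually be destroyed
  zfs clone tank/projects@tuesday tank/projects-experiment
  zfs promote tank/projects-experiment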

Ecosystem

Boot Environments

Boot environments associate the state of an operating system installation with the underlying storage, allowing for rollback after failed configuration changes or upgrades. The feature originated in Solaris, but has since made its way to the BSDs.
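
On FreeBSD, for example, boot environments on a ZFS root are managed with bectl(8); a minimal sketch (the environment name is hypothetical):

  # list existing boot environments and create a new one before an upgrade
  bectl list
  bectl create pre-upgrade

  # if the upgrade goes wrong, reactivate the old environment and reboot into it
  bectl activate pre-upgrade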

Internals

ZFS uses 128-bit addressing, meaning you'll run out of physical hardware long before hitting its limits:

  • 2^48 entries in a directory
  • 16EB file size
  • 256ZB zpool size
  • 2^64 zpools per system
  • 2^64 datasets in a zpool

It follows a number of principles to ensure integrity:

  • Existing blocks are never overwritten.
  • The filesystem is kept consistent at all times, atomically moving between consistent states (checkpoints).
    • This is the reason snapshots are relatively cheap (illustrated in the sketch after this list).
  • Data is checksummed on write and subsequent read, ensuring that data retrieved from the filesystem is as it was when it was originally written.
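
A quick sketch of why copy-on-write makes snapshots cheap (the dataset name is hypothetical): a fresh snapshot references the same blocks as the live dataset, so it consumes essentially no space until the two diverge:

  # a brand new snapshot uses (almost) no space of its own
  zfs snapshot tank/projects@now
  zfs list -t snapshot -o name,used,referenced tank/projects@now

  # after files under tank/projects are modified or deleted, the snapshot's
  # USED figure grows, since it has to keep the old blocks alive
  zfs list -t snapshot -o name,used,referenced tank/projects@now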

Children
  1. Scrubbing with systemd
