ZFS
ZFS is a combined logical volume manager and filesystem originally developed at Sun Microsystems with a focus on data integrity and performance.
Features
- Fast (constant time) snapshots and restores, enabled by a copy-on-write integrity model.
- Realtime deduplication (~5GB of RAM required per TB of storage).
- Hierarchical nesting of datasets, with cascading properties (e.g. compression and storage quotas).
History
- 2001: development starts at Sun Microsystems as part of Solaris.
- 2005: released as part of OpenSolaris under the CDDL.
- 2008: released as part of FreeBSD.
- 2010: OpenSolaris development ceases and illumos is forked.
- 2013: ZFS on Linux effort begins under the OpenZFS project.
- 2016: Ubuntu ships ZFS by default.
Concepts
- A `zpool` is a pool of storage `vdev`s. A `vdev` cannot be shared between multiple `zpool`s. The pool itself is not redundant: loss of a storage or special `vdev` effectively loses the pool.
- A `vdev` (virtual device) represents either a single disk or a set of disks, all of approximately the same size. You can think of them as containers for one or more storage devices. Broadly speaking we can group these into three categories:
  - Devices:
    - `disk`s are typical block devices, usually in `/dev`.
    - `file`s are regular files, and shouldn't be used outside of experimentation.
  - Standard (storage):
    - `stripe` stripes data across a set of disks.
    - `mirror` mirrors data across a set of disks.
  - RAID-Z (follow power of 2 plus parity; e.g. 2+1 for `raidz1`, 2+2 for `raidz2`, 8+3 for `raidz3`):
    - `raidz1` (or just `raidz`) offers single parity, similar to RAID-5.
    - `raidz2` offers double parity, similar to RAID-6.
    - `raidz3` offers triple parity.
  - dRAID is a variant of RAID-Z that provides hot spares for faster resilvering. dRAID `vdev`s are constructed from multiple `raidz` groups, each with D data devices and P parity devices.
    - `draid<parity 1-3>` (or just `draid` for `draid1`) offers the specified parity.
    - `draid<parity 1-3>[:data d (8)][:children c (null; no check)][:spares s (0)]` sets up a custom dRAID configuration with the specified number of data, child and spare devices.
  - Support (optional):
    - The `log` (SLOG) device provides fast, non-volatile storage for the ZIL. The ZIL is where dirty data can be quickly written to satisfy `sync()` before being written to final storage via a TXG.
    - `cache` (L2ARC) devices provide a read cache operating below the ARC (in RAM). The L2ARC is not itself an ARC, but rather a read buffer containing blocks evicted from the ARC.
    - `special` (special allocation) `vdev`s allow offloading storage of system metadata and small writes independently of main storage. Loss of a special allocation `vdev` leads to loss of the pool.
    - `spare`s are pseudo-`vdev`s used to track hot spares available for a `zpool`.
    - `dedup` devices are used to store deduplication tables. Their redundancy should match the other pool devices. Multiple devices will be subject to load-balancing.
- Blocks are the most basic unit of allocation and are sized according to the dataset's `recordsize` property.
- Checkpoints are pool-wide snapshots that can be used to rewind the pool to an earlier state. Only one checkpoint can be retained per pool at any one time.
- Datasets are filesystems or block devices hosted on a `zpool`. ZFS dataset mount points are managed for you; no need to edit `/etc/fstab`.
  - Snapshots are read-only copies of a filesystem or volume at a point in time. They're quick to create and use no additional disk space at creation time, and will only consume space as the dataset changes. They're accessible within the filesystem under the `.zfs/snapshot` directory, if the `snapdir` property is enabled. They can be shared between `zpool`s on the same host and even between networked hosts.
  - Bookmarks are similar to snapshots, but are not visible to the filesystem.
  - Clones are writeable copies of a dataset that are created from a snapshot, and their existence prevents deletion of that snapshot. A parent-child relationship is created between the clone and the "origin" filesystem.
  - Properties associate information with a dataset:
    - Native properties either export internal ZFS statistics or allow configuring dataset behaviour. They're defined in `zfsprops(7)`.
    - User-defined properties allow association of arbitrary metadata with a dataset and must contain a colon (`:`). They should be named in the format `module:property`, though this is not enforced.
- Volumes (`zvol`s) are block devices allocated on the pool which can be used to back swap devices, other filesystems, or be passed through to VMs.
- Resilvering is the process of rebuilding an array after disk replacement, by reading only the data required to restore redundancy.
- Scrubbing is a `zpool` integrity check similar to resilvering except that it reads all data. Unlike traditional `fsck` it can take place with the pool online, though with performance degradation. Identified data corruption will be rectified using data on other disks in the pool where possible. Pools should undergo regular scrubbing, though the frequency depends on the nature and scale of the data.
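The topologies above can be illustrated with a few `zpool create` invocations. This is a minimal sketch: the pool name (`tank`) and device paths are hypothetical, and in practice stable `/dev/disk/by-id` paths are preferable to `/dev/sd*` names.

```shell
# Hypothetical devices; prefer stable /dev/disk/by-id paths in practice.

# Two-way mirror: either disk can fail without data loss.
zpool create tank mirror /dev/sda /dev/sdb

# raidz1 (2+1): two disks of data, one of parity.
zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc

# raidz2 (2+2) storage plus support vdevs: a fast SLOG and an L2ARC cache.
zpool create tank \
    raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd \
    log /dev/nvme0n1 \
    cache /dev/nvme1n1
```

Loss of the `log` or `cache` device is survivable (at worst, in-flight synchronous writes or cached reads are lost); loss of a `special` vdev, as noted above, is not.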
Management
Management of ZFS is split across two main commands, plus a few auxiliary utilities:
- `zpool` manages devices (disks, vdevs, zpools).
  - `create POOL MODE DEVICES [...MODE DEVICES]` creates a pool with the specified `vdev`s.
  - `destroy POOL` destroys the named pool.
  - `list` lists active zpools.
  - `events` lists events generated by the kernel module, usually consumed by `zed` to enable response to hardware failures.
    - `-v` prints the full payload rather than just the name.
    - `-f` follows the events.
  - `checkpoint POOL` creates a checkpoint of the current state.
    - `-d` discards the existing checkpoint.
  - `export POOL` removes a pool from the running system after unmounting any volumes.
  - `import POOL` imports a pool into the running system.
    - `-d DEVICE_NODE_PATH` lets ZFS determine which devices need to be imported from the contents of the specified path (e.g. `/dev/disk/by-id`).
    - `--rewind-to-checkpoint` rewinds the pool to the associated checkpoint as it's imported.
  - `offline POOL DEVICE` takes the named device in the named pool offline, leaving the rest of the pool online.
  - `online POOL DEVICE` brings the named device back online.
  - `resilver POOL [...POOL]` initiates a resilver.
  - `scrub POOL` initiates a scrub.
    - `-s` stops an in-progress scrub.
    - `-p` pauses a scrub until a future issue of `zpool scrub`.
    - `-w` waits for the scrub to complete.
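As a sketch of how the checkpoint and scrub subcommands combine in practice (the pool name `tank` is hypothetical):

```shell
# Checkpoint the pool before a risky change, then scrub and wait for completion.
zpool checkpoint tank
zpool scrub -w tank
zpool status tank

# To undo the change: export, then re-import rewound to the checkpoint.
zpool export tank
zpool import --rewind-to-checkpoint tank

# Once satisfied, discard the checkpoint, since it pins old blocks.
zpool checkpoint -d tank
```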
- `zfs` manages datasets.
  - `create POOL/VOLUME` creates a volume, by default hosting a ZFS dataset.
    - `-s` creates a sparse volume, without a storage reservation.
    - `-V VOLUME_SIZE` creates a volume of the specified size and skips creation of a ZFS filesystem. This allows using pooled storage for non-ZFS volumes.
  - `destroy POOL/VOLUME` removes the named volume from a pool.
  - `rename OLD NEW` changes the name of a volume, and can move volumes between parents in the hierarchy.
  - `get PROPERTY|all POOL/VOLUME` gets a property on a volume.
  - `set PROPERTY=VALUE POOL/VOLUME` sets a property on a volume.
  - `snapshot DATASET@SNAPSHOT` creates a snapshot with the specified name of the named dataset.
    - `-r` causes the command to recurse into descendent datasets.
  - `bookmark SNAPSHOT|BOOKMARK` creates a bookmark of the specified snapshot or existing bookmark, which can be used as an incremental source for a `zfs send`.
  - `send DATASET|VOLUME|SNAPSHOT` generates a stream of the named item for use with `zfs receive`.
    - `-i SNAPSHOT|BOOKMARK` limits the stream to changes from the specified incremental source.
  - `receive DATASET|VOLUME|SNAPSHOT` receives a stream generated by `zfs send` and writes it to the specified destination.
  - `clone SNAPSHOT DATASET|VOLUME` creates a read-write copy of an existing dataset from the named snapshot.
  - `promote CLONE` removes the parent-child relationship between a clone and its origin, allowing deletion of the origin snapshot.
- `zdb` presents information about the internal state of a ZFS pool, and may be useful for troubleshooting and debugging.
- `fsck.zfs` provides a thin wrapper around `zpool status` for compatibility with the existing filesystem infrastructure.
- `mount.zfs` provides a mount helper.
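Combining the `zfs` subcommands above, here is a sketch of an incremental backup between two pools; the `tank`/`backup` pool names and dataset layout are hypothetical.

```shell
# Create a dataset and take an initial snapshot.
zfs create -o compression=lz4 tank/home
zfs snapshot tank/home@monday

# Full send to the backup pool; bookmark the snapshot so it can be
# deleted later while still serving as an incremental source.
zfs send tank/home@monday | zfs receive backup/home
zfs bookmark tank/home@monday tank/home#monday

# Later: a second snapshot, sent incrementally from the bookmark.
zfs snapshot tank/home@tuesday
zfs send -i tank/home#monday tank/home@tuesday | zfs receive backup/home
```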
Ecosystem
Boot Environments
Boot environments associate the state of an operating system installation with the underlying storage, allowing rollback after failed configuration changes or upgrades. The feature originated in Solaris, but has since made its way to the BSDs.
Internals
ZFS uses 128-bit addressing, meaning you'll run out of hardware before being unable to address it:
- 2^48 entries in a directory
- 16EB file size
- 256ZB zpool size
- 2^64 zpools per system
- 2^64 datasets in a zpool
It follows a number of principles to ensure integrity:
- Existing blocks are never overwritten.
- The filesystem is kept consistent at all times, atomically moving between consistent states (checkpoints).
- This is the reason snapshots are relatively cheap.
- Data is checksummed on write and subsequent read, ensuring that data retrieved from the filesystem is as it was when it was originally written.
References
- ZFS for Newbies
- ZFS Features & Concepts TOI
- ZFS Topology FAQ
- ZFS On-Disk Specification
- `zfsconcepts(7)` man page