Borg

Borg is Google's distributed cluster manager.

Concepts

  • Cells are clusters of machines, defined by the underlying network fabric, within a single datacentre.
    • Median size of around 10k machines, with one large cell per datacentre and a number of smaller test/special-purpose cells.
    • Comprise heterogeneous machine types, but these differences are not directly exposed to users.
  • Jobs are made up of one or more identical Tasks.
    • Have name and owner metadata.
    • Use a mix of hard and soft constraints to hint to the scheduler what they need.
    • Granular resource specifications, no fixed-size buckets/increments.
  • Tasks can be indefinitely running servers or batch jobs like MapReduce.
  • Tasks are scheduled and monitored by Borg, which performs bin-packing.
  • The Borg Name Service (BNS) gives each task a stable name, built from the cell name, job name and task index, which can be used to identify and reach the task.
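
The interplay of constraints and bin-packing described above can be illustrated with a small Python sketch. This is not Borg's actual algorithm: the Machine, Job and schedule names are hypothetical, and the scoring rule (most soft constraints satisfied first, then tightest fit) is an assumption chosen for illustration. Hard constraints filter machines out, soft constraints only influence ranking, and resource requests are arbitrary fractional amounts rather than fixed-size buckets.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    cpu: float        # free CPU cores; fractional values allowed
    ram_gib: float    # free memory in GiB
    attrs: dict = field(default_factory=dict)


@dataclass
class Job:
    name: str
    owner: str
    task_count: int   # number of identical tasks
    cpu: float        # per-task CPU request
    ram_gib: float    # per-task memory request
    hard: list = field(default_factory=list)  # predicates that must hold
    soft: list = field(default_factory=list)  # predicates that are preferred


def schedule(job, machines):
    """Place each task of a job; unplaceable tasks map to None."""
    placements = {}
    for index in range(job.task_count):
        task = f"{job.name}/{index}"
        # Feasibility: enough free resources and every hard constraint holds.
        feasible = [
            m for m in machines
            if m.cpu >= job.cpu
            and m.ram_gib >= job.ram_gib
            and all(pred(m) for pred in job.hard)
        ]
        if not feasible:
            placements[task] = None
            continue

        def score(m):
            # Assumed ranking: more soft constraints met, then least leftover.
            soft_hits = sum(1 for pred in job.soft if pred(m))
            leftover = (m.cpu - job.cpu) + (m.ram_gib - job.ram_gib)
            return (-soft_hits, leftover)

        best = min(feasible, key=score)
        best.cpu -= job.cpu          # reserve the resources on that machine
        best.ram_gib -= job.ram_gib
        placements[task] = best.name
    return placements


machines = [
    Machine("m1", cpu=4.0, ram_gib=16.0, attrs={"ssd": True}),
    Machine("m2", cpu=8.0, ram_gib=32.0, attrs={"ssd": False}),
    Machine("m3", cpu=2.0, ram_gib=8.0, attrs={"ssd": True, "arch": "arm"}),
]
job = Job(
    name="web", owner="alice", task_count=3, cpu=1.5, ram_gib=4.0,
    hard=[lambda m: m.attrs.get("arch", "x86") == "x86"],  # must be x86
    soft=[lambda m: m.attrs.get("ssd", False)],            # prefers SSD
)
placements = schedule(job, machines)
print(placements)
```

In this example the first two tasks land on m1 (x86 with SSD), the third falls back to m2 once m1 runs out of CPU, and m3 is never considered because it fails the hard constraint.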

Interaction

  • Via RPC, usually issued from a CLI.
  • Job descriptions are written in BCL, a variant of GCL.

Attempted replacement with Omega

  • The replacement was chasing a moving target: Borg kept evolving while it was being built.

  • Early estimates of the difficulty of enhancing Borg were overly pessimistic; improvements turned out to be simpler in practice.

  • The migration was far more complex in practice than expected, because Borg is pervasive in system configurations.

    This shelved development project led to many features being incorporated into future Borg iterations, and kickstarted development of Kubernetes.
