Datacentre architecture

The Azure datacentre infrastructure is managed by MCIO, formerly known as GFS.

By the numbers

10 languages
19 currencies
60 regions
Availability in > 140 countries
Azure datacentres contain > 1m servers

Locations

Azure is divided into a number of locations (also known as regions), each comprising one or more datacentres. Locations are at least 100 miles from one another, providing geographic redundancy.

Each location is paired with one or two other regions within the same geographic location, connected by a low-latency backbone network that allows data to be replicated across regions. One of these paired regions will be broken into a number of distinct availability zones. Geo-redundant storage resources are designed to replicate across these paired regions.

During planning, consider:

Regional service availability, especially for newer locations and services. Rollouts of minor Azure features can take up to two weeks, and larger features can take several months.
Network latency, via the Azure speed test service.
Region access restrictions, such as a requirement for a billing address in a specific location (e.g. China and Australia), or Azure Government access requirements. Inaccessible regions generally won't be displayed in the portal.
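
As a rough illustration of the latency consideration above, the sketch below picks the candidate region with the lowest median round-trip time. It assumes the measurements have already been collected (for example, by hand from a speed test); the function name, region names, and figures are all made up for illustration and are not an Azure API.

```python
from statistics import median


def pick_lowest_latency(samples_ms):
    """Return the region with the lowest median latency.

    samples_ms maps a region name to a list of round-trip times in
    milliseconds. Regions with no samples are ignored.
    """
    medians = {
        region: median(times)
        for region, times in samples_ms.items()
        if times
    }
    if not medians:
        raise ValueError("no latency samples supplied")
    return min(medians, key=medians.get)


# Hypothetical measurements, for illustration only.
samples = {
    "northeurope": [28.0, 31.5, 29.2],
    "westeurope": [22.4, 90.0, 21.9],
    "uksouth": [12.1, 13.0, 12.6],
}
best = pick_lowest_latency(samples)
print(best)
```

The median is used rather than the mean so that a single slow sample (such as the 90.0 ms outlier above) does not dominate the comparison.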

Design

40-50 servers per rack, depending on use case.
A top-of-rack switch in each rack, connected to aggregation switches.
Stamps (clusters) are groups of 20 racks (approximately 1,000 servers), homogeneous by processor generation.
Some racks contain fabric controllers (5 per stamp): servers responsible for monitoring the health of resources and performing lifecycle management operations within the cluster.
Leaf and spine network topology.
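
The figures above can be sanity-checked with some quick arithmetic. This is a back-of-the-envelope sketch only: the constants come from the notes, while the function names and the assumption that every stamp is fully populated are illustrative.

```python
import math

RACKS_PER_STAMP = 20
SERVERS_PER_RACK_RANGE = (40, 50)  # depends on use case


def stamp_server_range(racks=RACKS_PER_STAMP, per_rack=SERVERS_PER_RACK_RANGE):
    """Return the (minimum, maximum) number of servers in one stamp."""
    low, high = per_rack
    return racks * low, racks * high


def stamps_needed(total_servers, servers_per_stamp):
    """Return how many stamps are needed to house total_servers."""
    return math.ceil(total_servers / servers_per_stamp)


low, high = stamp_server_range()
print(low, high)  # 800 1000

# Order-of-magnitude estimate: one million servers at 1,000 per stamp.
print(stamps_needed(1_000_000, high))  # 1000
```

So 20 racks of 40-50 servers gives 800 to 1,000 servers per stamp, which matches the "approximately 1,000" figure at the upper end of the range.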

High availability

The Azure 99.95% SLA applies only if the environment is correctly configured. In the case of virtual machines, ensure that two or more instances are placed in the same availability set.
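
To put the SLA figure in context, the sketch below converts an availability percentage into a downtime budget. It assumes a 30-day month for simplicity; the function name is illustrative, and the exact measurement period used for SLA purposes should be taken from the SLA itself.

```python
def allowed_downtime_minutes(sla_percent, days=30):
    """Return the downtime, in minutes, that an SLA permits over a period."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# 99.95% over a 30-day month allows roughly 21.6 minutes of downtime.
print(round(allowed_downtime_minutes(99.95), 1))
```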

When designing infrastructure, consider redundancy at the following levels:

Hardware issues, such as failed hypervisor components or disks.
Datacentre loss due to power failure, cooling failure, or misconfiguration.
Region loss due to natural disasters.
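
The effect of adding redundancy at any of these levels can be estimated with simple probability. The sketch below is a simplified model, not an Azure calculation: it assumes failures are independent, which real outages (shared power, shared region) often are not, and the availability figures are made up.

```python
import math


def parallel_availability(availability, replicas):
    """Availability of N redundant replicas, where any one is sufficient.

    Assumes replicas fail independently of one another.
    """
    return 1 - (1 - availability) ** replicas


def serial_availability(availabilities):
    """Availability of a chain of components that must all be up."""
    return math.prod(availabilities)


# Two redundant replicas, each hypothetically 99.9% available.
print(parallel_availability(0.999, 2))

# Two dependent tiers, each hypothetically 99.9% available.
print(serial_availability([0.999, 0.999]))
```

Under these assumptions, redundant replicas raise availability (about 99.9999% for the pair), while chaining dependent components lowers it (about 99.8%), which is why redundancy needs to be considered at each level rather than only one.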

Guard against the two different categories of outage:

Planned maintenance such as operating system updates or hypervisor maintenance.
Unexpected downtime such as hypervisor failure, datacentre outages due to cooling or power loss, and natural disasters.