Batch

AWS Batch is a managed batch computing service built on ECS (though it can also schedule jobs on EKS). It's designed to optimise for both time and cost, and supports Spot Instances to reduce compute cost.

Use cases

Batch is well-suited to:

  • Machine learning.
  • Post-trade analytics.
  • High-performance computing.
  • Rendering, transcoding, or other long-running file format conversion.

Concepts

  • The Scheduler examines submitted jobs and the compute environments assigned to their queue. It can either place one container per instance or bin-pack multiple containers onto an instance.
  • Jobs are containerised workloads, run approximately in submission order. They can have dependencies on one another, and can target successful completion of specific elements of array jobs.
  • Job Queues contain jobs until they're scheduled to a compute resource and for 24 hours afterwards.
  • Compute Environments come in two forms:
    • Managed environments are defined by business requirements such as budget or filesystem, and are launched and scaled by the Batch platform. They can use Spot Instances, allocated based on a bid.
    • Unmanaged environments allow the customer to launch and manage their own resources, which must run the ECS agent.
  • Job Definitions allow templating jobs to reduce duplication of job properties. Their properties can be overridden at job submission time.
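
As a rough illustration of overriding job definition properties at submission time, the sketch below builds the keyword arguments for boto3's `submit_job` call without contacting AWS. The helper `build_submit_request` and the job, queue, definition, and bucket names are hypothetical placeholders, not part of these notes.

```python
def build_submit_request(job_name, queue, definition, command=None, env=None):
    """Build keyword arguments for a Batch submit_job call.

    Only properties passed here override the job definition; anything
    omitted falls back to the value stored in the definition.
    """
    request = {
        "jobName": job_name,
        "jobQueue": queue,
        "jobDefinition": definition,
    }
    overrides = {}
    if command is not None:
        overrides["command"] = list(command)
    if env:
        # Batch expects environment variables as a list of name/value pairs.
        overrides["environment"] = [
            {"name": name, "value": value} for name, value in sorted(env.items())
        ]
    if overrides:
        request["containerOverrides"] = overrides
    return request


# Hypothetical names, for illustration only.
request = build_submit_request(
    "transcode-clip-42",
    "media-queue",
    "transcode:3",
    command=["transcode", "--preset", "hd"],
    env={"OUTPUT_BUCKET": "example-output"},
)
```

With credentials configured, the request could then be sent with `boto3.client("batch").submit_job(**request)`. Keeping the payload construction separate from the call makes it easy to inspect or test without an AWS account.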

Array jobs

Array jobs allow scheduling multiple child jobs from a single job specification that includes an array size. Each child receives its own index, which it can use to select its input. Up to 10,000 child jobs may be submitted this way.
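
A minimal sketch of how an array job might be requested and how each child could pick its input. It assumes the child reads its index from the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable that Batch sets in each child container; the helper names and the input file list are hypothetical.

```python
import os

# Hypothetical inputs; in practice these might be object keys in a bucket.
INPUT_FILES = ["clip-000.mov", "clip-001.mov", "clip-002.mov"]


def array_job_request(job_name, queue, definition, size):
    """Build submit_job keyword arguments for an array job of `size` children."""
    if not 2 <= size <= 10_000:
        raise ValueError("array size must be between 2 and 10,000")
    return {
        "jobName": job_name,
        "jobQueue": queue,
        "jobDefinition": definition,
        "arrayProperties": {"size": size},
    }


def input_for_this_child(inputs, environ=os.environ):
    """Select this child's input using the zero-based array index."""
    index = int(environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    return inputs[index]


request = array_job_request(
    "transcode-all", "media-queue", "transcode:3", len(INPUT_FILES)
)
```

The first function would run wherever the job is submitted, while `input_for_this_child` would run inside the container, so every child executes the same code but processes a different element.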

Dependencies

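  • Straight dependsOn with jobId: the job waits for the named job to complete successfully.
  • 1:1 (dependency type N_TO_N), for array jobs: each child waits for the child with the same index in the job it depends on.
  • Sequential (dependency type SEQUENTIAL), for array jobs: the children of a single array job run one after another, in index order.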
  • End-to-end.
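
The dependency forms above map onto entries in the `dependsOn` list passed to `submit_job`. The sketch below only builds those entries; the `depends_on` helper and the job IDs are hypothetical, and reading "1:1" as the `N_TO_N` type is an assumption.

```python
VALID_TYPES = ("N_TO_N", "SEQUENTIAL")


def depends_on(job_id=None, dep_type=None):
    """Build one entry for the dependsOn list of a submit_job request."""
    entry = {}
    if job_id is not None:
        entry["jobId"] = job_id
    if dep_type is not None:
        if dep_type not in VALID_TYPES:
            raise ValueError(f"unknown dependency type: {dep_type}")
        entry["type"] = dep_type
    if not entry:
        raise ValueError("a dependency needs a job ID, a type, or both")
    return entry


# Plain dependency: wait for another job to succeed.
plain = [depends_on("job-a-id")]

# 1:1 between two array jobs: child i waits for child i of the other job.
one_to_one = [depends_on("array-job-a-id", "N_TO_N")]

# Sequential: children of this array job run in index order, so no job ID.
sequential = [depends_on(dep_type="SEQUENTIAL")]
```

Any of these lists could be supplied as the `dependsOn` argument alongside the usual job name, queue, and definition.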

Job states

  • SUBMITTED means accepted into the queue, but not yet evaluated for execution.
  • PENDING indicates the job is waiting for dependencies to complete.
  • RUNNABLE means the job has been evaluated by the scheduler and is ready to run.
  • STARTING means the job has been scheduled to a host and its container is being started.
  • RUNNING indicates that the job's execution is in progress.
  • SUCCEEDED indicates that the job completed successfully.
  • FAILED indicates the job experienced a problem at run-time.
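
One way to work with these states in client code is to model them as an enumeration and poll until a terminal state is reached. This is a sketch under assumptions: `fetch_status` is a hypothetical callable that returns the current status string, which in practice might wrap boto3's `describe_jobs` and read `["jobs"][0]["status"]`.

```python
import time
from enum import Enum


class JobState(Enum):
    SUBMITTED = "SUBMITTED"
    PENDING = "PENDING"
    RUNNABLE = "RUNNABLE"
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# States after which the job will not change again.
TERMINAL_STATES = frozenset({JobState.SUCCEEDED, JobState.FAILED})


def wait_for_job(fetch_status, poll_seconds=30.0, sleep=time.sleep):
    """Poll fetch_status() until the job reaches a terminal state."""
    while True:
        state = JobState(fetch_status())
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)


# Simulated status sequence standing in for repeated API calls.
statuses = iter(["SUBMITTED", "RUNNABLE", "STARTING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job(lambda: next(statuses), sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the loop testable without real delays; a production version would likely also want a timeout.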