Athena

Amazon Athena is an interactive query service that runs SQL directly against objects stored in S3. It doesn't require an ETL process: Athena applies a deserialiser to each input object as it reads it, parallelising the work across objects. Performance may be improved by storing the data in a columnar format such as Parquet or ORC.

Since the underlying implementation is serverless, there is nothing to provision: you are billed by the amount of data each query scans, and capacity scales with the workload without any tuning on your side. Athena is built on Presto.
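As a rough illustration of pay-per-use billing, the sketch below estimates the cost of a single query from the number of bytes it scans. The price per TB, the rounding to whole megabytes, the per-query minimum and the binary definition of a terabyte are all assumptions expressed as parameters or constants; check the current pricing page for real figures.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4  # assumption: binary terabyte


def estimate_query_cost(bytes_scanned, price_per_tb=5.0, min_bytes=10 * MB):
    """Rough cost of one query, in the same currency as price_per_tb."""
    # Assumption: scanned bytes are rounded up to the next whole megabyte,
    # and every query is billed for at least min_bytes.
    billed = max(math.ceil(bytes_scanned / MB) * MB, min_bytes)
    return billed / TB * price_per_tb
```

Under these assumptions a query that scans half a terabyte costs half the per-TB price, which is why reducing the data scanned is the main lever on cost.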

Concepts

  • A SerDe (Serialiser-Deserialiser) defines the composition of records within the source (the file format).
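  • Compressed objects (gzip or Snappy, for example) are decompressed transparently as they are read. There is no separate decompression fee; since billing is by bytes scanned, compression usually lowers the cost of a query.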
  • Columnar data will be deserialised more quickly, because Athena can read only the columns a query references rather than every full record.
  • Partitioning data by key cuts down the amount of data Athena must scan when seeking matching records.
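The effect of partitioning can be sketched in a few lines. This is not how Athena is implemented, only a toy model: it assumes Hive-style `name=value` segments in the object key, and the keys and sizes are made up for illustration.

```python
def parse_partitions(key):
    """Extract Hive-style name=value path segments from an S3 object key."""
    parts = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


def bytes_to_scan(objects, **filters):
    """Sum the sizes of objects whose partition values match every filter."""
    total = 0
    for key, size in objects:
        parts = parse_partitions(key)
        if all(parts.get(name) == value for name, value in filters.items()):
            total += size
    return total


# Hypothetical bucket listing: (object key, size in bytes).
objects = [
    ("logs/year=2023/month=01/a.json", 400),
    ("logs/year=2023/month=02/b.json", 600),
    ("logs/year=2024/month=01/c.json", 500),
]

full_scan = bytes_to_scan(objects)                # no filter: every object is read
pruned_scan = bytes_to_scan(objects, year="2024")  # only the year=2024 prefix is read
```

With no filter every object must be read; filtering on the partition key skips whole prefixes without opening the objects under them.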

Use cases

Declaring schema

Athena will apply the configured schema to records as it reads the data from the source bucket. Schemas can be managed in two ways:

  • Glue Data Catalog provides centralised management of schemas across AWS services, and can attempt to determine the schema from the source data using crawlers.
  • Athena provides its own DDL (CREATE EXTERNAL TABLE) syntax, though this is harder to automate.

To set up a table by hand:

  1. Create an S3 bucket and place an object in it.
  2. Create a metadata database.
  3. Declare a table schema for records, along with a deserialiser that can parse complex types from records in the object.
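Steps 2 and 3 might look like the following sketch, which only builds the DDL strings so they can be pasted into the Athena console or submitted through an API. The database, table, column and bucket names are hypothetical, the helper functions are not part of any library, and the OpenX JSON SerDe is just one deserialiser that handles nested types.

```python
def create_database_ddl(database):
    """DDL for the metadata database (step 2)."""
    return f"CREATE DATABASE IF NOT EXISTS {database}"


def create_table_ddl(database, table, columns, location, serde, partitions=None):
    """DDL for an external table over objects under an S3 prefix (step 3)."""
    cols = ",\n".join(f"  `{name}` {type_}" for name, type_ in columns)
    lines = [
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (",
        cols,
        ")",
    ]
    if partitions:
        keys = ", ".join(f"`{name}` {type_}" for name, type_ in partitions)
        lines.append(f"PARTITIONED BY ({keys})")
    lines.append(f"ROW FORMAT SERDE '{serde}'")
    lines.append(f"LOCATION '{location}'")
    return "\n".join(lines)


# Hypothetical names throughout; `client` is a nested (complex) column.
ddl = create_table_ddl(
    database="notes_demo",
    table="events",
    columns=[("id", "string"), ("client", "struct<ip:string,agent:string>")],
    location="s3://example-bucket/events/",
    serde="org.openx.data.jsonserde.JsonSerDe",
    partitions=[("year", "string")],
)
```

For a partitioned table like this one, the partitions still have to be registered (for example with MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION) before queries return any rows.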

Querying

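It's just SQL! Queries are written against the declared table as if it were an ordinary database table, and filtering on partition keys in the WHERE clause keeps the amount of data scanned down.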
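Outside the console, a query is submitted asynchronously and polled until it finishes. The sketch below assumes a boto3 Athena client is passed in (for example `boto3.client("athena")`); the function name, the database and the output location are hypothetical, and only the first page of results is read.

```python
import time


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query, wait for it to finish, and return rows as lists of strings."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"query {query_id} ended in state {state}")

    # First page only; a fuller version would follow NextToken.
    result = client.get_query_results(QueryExecutionId=query_id)
    rows = result["ResultSet"]["Rows"]
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]
```

A call might look like `run_query(client, "SELECT id FROM events WHERE year = '2024'", "notes_demo", "s3://example-bucket/results/")`. For SELECT statements the first returned row typically holds the column names rather than data.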

