Amazon Athena is an interactive query service that lets you query objects stored in S3 using SQL. No ETL step is required: Athena applies a deserialiser directly to the input objects as it reads them (in parallel), though performance may be improved by storing the data in a columnar format with fixed-length records.
Because the service is serverless there is no capacity to provision and you pay only for the queries you run, so for most practical purposes it can be treated as scaling on demand. Athena is built on Presto.
- A SerDe (Serialiser-Deserialiser) defines the composition of records within the source (the file format).
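- Compressed objects are decompressed as they are read. Standard pricing is based on the amount of data scanned, so compression generally reduces query cost rather than adding a fee.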
- Columnar data, where columns are of a fixed size, will be deserialised more quickly.
- Partitioning data by key cuts down the amount of data Athena must scan when seeking matching records.
- A typical use case is querying web server or CloudFront logs.
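The partitioning point above is easiest to see in how object keys are laid out. The following is a minimal sketch, assuming Hive-style `key=value` prefixes partitioned by date; the `logs` prefix, the file name, and both helper functions are hypothetical names chosen for illustration.

```python
from datetime import date


def partition_prefix(table_prefix: str, day: date) -> str:
    """Build a Hive-style partition prefix for one day of data."""
    return (
        f"{table_prefix.rstrip('/')}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


def object_key(table_prefix: str, day: date, filename: str) -> str:
    """Full S3 key for an object stored under its day's partition."""
    return partition_prefix(table_prefix, day) + filename


print(object_key("logs", date(2024, 1, 15), "part-0000.json.gz"))
# prints: logs/year=2024/month=01/day=15/part-0000.json.gz
```

With a layout like this, a query that filters on `year`, `month`, and `day` only needs to read objects under the matching prefixes, provided the table is declared as partitioned on those keys and the partitions have been registered.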
Athena will apply the configured schema to records as it reads the data from the source bucket. Schemas can be managed in two ways:
- Glue Data Catalog provides centralised management of schemas across AWS services, and can attempt to determine the schema from the source data using crawlers.
- Athena provides its own DDL (`CREATE EXTERNAL TABLE`) syntax, though this is harder to automate.
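One way to make the DDL route easier to automate is to generate the statement from a column list. The sketch below is illustrative only: `create_table_ddl` is a hypothetical helper, the database, table, and bucket names are placeholders, and it assumes JSON records read with the OpenX JSON SerDe.

```python
def create_table_ddl(database, table, columns, location, partitions=()):
    """Render a CREATE EXTERNAL TABLE statement from (name, type) pairs."""

    def render(fields):
        # Backticks guard against column names that clash with reserved words.
        return ", ".join(f"`{name}` {type_}" for name, type_ in fields)

    lines = [
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} ({render(columns)})"
    ]
    if partitions:
        lines.append(f"PARTITIONED BY ({render(partitions)})")
    # Assumption: the source objects contain one JSON record per line.
    lines.append("ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'")
    lines.append(f"LOCATION '{location}'")
    return "\n".join(lines)


print(
    create_table_ddl(
        "example_db",
        "requests",
        [("request_id", "string"), ("path", "string")],
        "s3://example-bucket/requests/",
        partitions=[("year", "string")],
    )
)
```

The helper does no validation or escaping, so it is only suitable for trusted, hand-written schema definitions.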
- Create an S3 bucket and place an object in it.
- Create a metadata database.
- Declare a table schema for records, along with a deserialiser that can parse complex types from records in the object.
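The steps above can be sketched with boto3's Athena client. This is a rough outline under several assumptions: the bucket and object from the first step already exist, `example_db`, `requests`, and the `s3://example-bucket/...` locations are placeholders, the records are JSON, and `run_statement` and `run_steps` are hypothetical helpers rather than part of any library.

```python
import time

STATEMENTS = [
    # Create the metadata database.
    "CREATE DATABASE IF NOT EXISTS example_db",
    # Declare the schema, including a complex (map) column, and the deserialiser.
    """CREATE EXTERNAL TABLE IF NOT EXISTS example_db.requests (
  request_id string,
  path string,
  headers map<string,string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-bucket/requests/'""",
    # Query the table with ordinary SQL.
    "SELECT path, headers['user-agent'] AS user_agent FROM example_db.requests LIMIT 10",
]


def run_statement(client, sql, output_location, poll_seconds=1.0):
    """Submit one statement and block until it reaches a terminal state."""
    query_id = client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"
        ]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Query {query_id} ended as {status['State']}: {reason}")
    return query_id


def run_steps(client, output_location):
    """Run the statements in order and return their query execution IDs."""
    return [run_statement(client, sql, output_location) for sql in STATEMENTS]


def main():
    import boto3  # imported here so the helpers above work without boto3 installed

    client = boto3.client("athena")
    print(run_steps(client, "s3://example-bucket/athena-results/"))
```

Calling `main()` requires AWS credentials and a results location that Athena can write to. Passing the client in as a parameter keeps the helpers easy to exercise with a stub.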
It's just SQL!