Streams
Kinesis Streams allow collection, processing, and analysis of streaming data in real-time. The service is designed to provide durable storage and be resilient to faults.
Some key features of Streams:
- Client ordering is preserved, allowing FIFO processing.
- Processing of data records can take place across multiple clients in parallel.
- Collection and processing of data records is decoupled, allowing for processing of a spiky workload on a fixed size processor deployment.
Concepts
- Producers create data records, and can be in or outside of AWS.
- A Streams application processes them.
- Comprised of Shards, each providing 1MB/s of read capacity and 2MB/s of write capacity.
- Consumers EC2 or Lambda functions.
- Data Records are the units that are processed, comprising a sequence number, partition key and a data blob up to 1MB.
Producers
There are two means of ingesting data records into a Kinesis Stream:
- KPL is a library for producer applications that can batch submissions to a stream to improve performance, and submits CloudWatch metrics to allow monitoring of performance. It's not suited to time-sensitive applications because of its buffering.
- The API offers a simple means of ingestion with just two methods:
PutRecord
andPutRecords
, the latter accepting up to 500 records per request.
Consumers
KCL is a Java (Private) library (and MultiLangDaemon, for non-Java clients) eases authoring consumers by managing:
- Data stream connections
- Shard enumeration
- Shard-worker leases for coordination
- Per-shard record processor instantiation
- Pulling records and pushing records to processors
- Checkpointing processed records
- Balancing shard-worker associations (leases)
Shard iterators
Shard iterators come in types which specify a position from which to start streaming messages:
AT_SEQUENCE_NUMBER
starts reading from the specifiedStartingSequenceNumber
.AFTER_SEQUENCE_NUMBER
starts reading immediately after the specifiedStartingSequenceNumber
.AT_TIMESTAMP
starts reading from the providedTimestamp
.TRIM_HORIZON
starts reading at the last (oldest) untrimmed record in the shard.LATEST
starts reading after the most recent.
Partition keys distribute across shards
Partition keys are hashed (MD5), and the resulting 128-bit key is used to identify the shard that'll process the data record.
Retention period
The retention period for data records is 24 hours by default, can be increased to up to 365 days.
Backlinks