Kinesis Streams allow collection, processing, and analysis of streaming data in real-time. The service is designed to provide durable storage and be resilient to faults.
Some key features of Streams:
- Client ordering is preserved, allowing FIFO processing.
- Processing of data records can take place across multiple clients in parallel.
- Collection and processing of data records is decoupled, allowing for processing of a spiky workload on a fixed size processor deployment.
- Producers create data records, and can be in or outside of AWS.
- A Streams application processes them.
- Comprised of Shards, each providing 1MB/s of read capacity and 2MB/s of write capacity.
- Consumers EC2 or Lambda functions.
- Data Records are the the units that are processed, comprising a sequence number, partition key and a data blob up to 1MB.
There are two means of ingesting data records into a Kinesis Stream:
- KPL is a library for producer applications that can batch submissions to a stream to improve performance, and submits CloudWatch metrics to allow monitoring of performance. It's not suited to time-sensitive applications because of its buffering.
- The API offers a simple means of ingestion with just two methods:
PutRecords, the latter accepting up to 500 records per request.
KCL is a Java (Private) library (and MultiLangDaemon, for non-Java clients) eases authoring consumers by managing:
- Data stream connections
- Shard enumeration
- Shard-worker leases for coordination
- Per-shard record processor instantiation
- Pulling records and pushing records to processors
- Checkpointing processed records
- Balancing shard-worker associations (leases)
Shard iterators come in types which specify a position from which to start streaming messages:
AT_SEQUENCE_NUMBERstarts reading from the specified
AFTER_SEQUENCE_NUMBERstarts reading immediately after the specified
AT_TIMESTAMPstarts reading from the provided
TRIM_HORIZONstarts reading at the last (oldest) untrimmed record in the shard.
LATESTstarts reading after the most recent.
Partition keys distribute across shards
Partition keys are hashed (MD5), and the resulting 128-bit key is used to identify the shard that'll process the data record.
The retention period for data records is 24 hours by default, can be increased to up to 365 days.