- Data record: Tweet
- Stream: all tweets send to kineses like queue name
- Partition Key: Twitter Topic. One topic can have multiple tweets
- Shard: data belong twitter topic that will grouped togather
- Sequence number: each data record gets one assigned when first ingested
- Worker: process the record of a sard in sequence number order
Shard Is unit Of Scale/Paralizem. It shard can ingest data up to 1MB/S and up to 1000 TPS Scla kinesis stream by spliting or merging Shards ex: 100MB/S. 100 Shards is the best
Put Data to Stream
If you have 10M for ingres, you need 10 Shards. 30% more shards is good.
2MB/s Consumer more hungry than data you want to put it
Putting Data Best practice
- Sending at Batch is Good. 50K in one time is better than 5 K multiple times.
- Use asyncrhronouse operation is preferable. Use ThreadBlokingQueue, etc.
Dealing with Provision ThroughputExceeded
You sending more data, exceding MB/S. or Exceeding TPS 1000TPS.
- If that’s happend: Retry
- keep track your partition_key
- keep track your shard-ids, based on shard-id you can check if you shard is full. and you can split full shard into two shards.
Kinesis Client Library
1 Shard – 1 Record Processor. It will balanced if your Shards scale up or scale down.
Transform -> Filter -> Buffer -> Emit
- Transform: Transform your event data to business specific needs. ie: Transform protobuff to POJO.
- Filter: some how you only need to consume specific event type and ignore some.
- Buffer: It will be good if collect event to some amount of size dan process it aka: Micro Batch processing.
- Emit: The actual processor. ie: Store record to data store.
Best Practice Processing App
- Don’t run app in single EC2 instane stand alone.
- Put it to Auto Scaling Group. When data maybe Spiky. When you scale up Shards, amazon can load balanced the application.
- When one of worker down, the kinesis will automaticly re-route the shard data to other worker
- Reshard only work on going forward. you can’t reshard after you have lot of data in one shard and you want to spead accros more shard it will not going to work.
Scalling and Pararel Processing
- You can spin up multiple app / worker for one stream. ex: document-changed-stream need to be consume by Reports Worker and Quota worker. Each app will maintain it’s own check point / cursor. Basicly the apps is independent each other, the really don’t know each other. the order and the timing when they are processing the data the completly independent each other
- Idealy your worker instances should not more than number of shards.
To scale up processing you should try this followoing combination
- Increase instance size L4 Large. it needs more Processor or Memory
- Increase number of instance up to numbers of shards.
- Increase number of shards (Wich increase level of paralelism)
Consumer Caught Up / Slow
- GetRecords.Byets: Track number of byets retrieved from the stream
- GetRecords.IteratorAge: The age of iterator used to request records for all shards in your system in Milliseconds
- GetRecords.Latency: Time taken per GetRecords operation in Milliseconds
Flush all your kinessis Data so S3
- Always good idea to to capture your data and keep it to S3 because maigth be legilimate needed to extact and reprocess it in case your down stream app down for more than 24 hours.
- Ensure data store / processed at the first time then do the check point.
- Resillient to duplicate. may be the producer send duplicate data, and the consumer should able to detect and react to this situation.