Streaming data for real time analysis
1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Streaming Data for Analysis
Brett Francis
Enterprise Solutions Architect
2. Talk Outline
• Streaming Big Data
• Analytics with Redshift
• Generalizing the Streaming for Analytics design pattern
• Cost Influences on Architecture
3. You’re likely already “streaming”
• Sensor networks analytics
• Ad network analytics
• Log shipping and centralization
• Click stream analysis
• Gaming status
• Hardware and software appliance metrics
• …more…
6. One common starting point is ingesting records for analysis
[Diagram labels: Elastic Beanstalk, foo-analysis.com, Global top-10]
7. Too big to handle on one box
[Diagram labels: Global top-10, Elastic Beanstalk, foo-analysis.com]
8. The solution needs record sorting and grouping
[Diagram labels: Local top-10 (×3), Global top-10, Elastic Beanstalk, foo-analysis.com]
9. The solution: streaming map/reduce
[Diagram labels: Global top-10, Elastic Beanstalk, foo-analysis.com, Local top-10 (×3)]
[Diagram inset: a shard holding data records with sequence numbers 14, 17, 18, 21, 23]
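The local top-10 / global top-10 split on this slide can be sketched in a few lines of Python. This is a minimal illustration and not code from the talk: the function names `local_top_n` and `global_top_n` and the sample data are hypothetical. It assumes every record for a given key lands on the same shard, so each key's local count is already its complete count.

```python
from collections import Counter

def local_top_n(shard_records, n=10):
    """Map step: one worker ranks the keys seen on its own shard."""
    counts = Counter(shard_records)
    # Sort by count descending, then by key, so ties resolve deterministically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def global_top_n(local_tops, n=10):
    """Reduce step: merge the per-shard rankings and rank the merged totals."""
    merged = Counter()
    for top in local_tops:
        for key, count in top:
            merged[key] += count
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Hypothetical sample: three shards, each holding all records for its keys.
shards = [
    ["a", "a", "a", "b"],
    ["c", "c", "d"],
    ["e", "e", "e", "e"],
]
local_tops = [local_top_n(records, n=2) for records in shards]
result = global_top_n(local_tops, n=2)
print(result)
```

If records for one key could be spread over several shards, truncating to a local top-N before merging could drop a key that matters globally, so the workers would have to forward full counts instead.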
10. When to use Stream Processing
• “real-time” starts coming onto the radar
• The answer can’t wait for batch processing times
• Instead of processing serially as A > B > C, it would be
better to have a fan-out pattern
• The records are just a means to an end; most records
can be immediately archived after an “answer” is
determined.
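The fan-out point above can be illustrated with a small Python sketch. It is not taken from the talk: `fan_out`, the consumer names, and the sample sensor records are all hypothetical. Each consumer sees every record and keeps its own results, instead of waiting for the previous stage in a serial chain.

```python
def fan_out(records, consumers):
    """Deliver every record to every consumer; collect each consumer's output separately."""
    results = {name: [] for name in consumers}
    for record in records:
        for name, handler in consumers.items():
            output = handler(record)
            if output is not None:  # a handler returns None to skip a record
                results[name].append(output)
    return results

# Hypothetical consumers: one archives everything, one keeps only hot readings.
consumers = {
    "archive": lambda record: record,
    "alerts": lambda record: record if record["temp"] > 100 else None,
}
records = [{"sensor": "s1", "temp": 72}, {"sensor": "s2", "temp": 130}]
results = fan_out(records, consumers)
print(results["alerts"])
```

In this toy version the consumers run in one loop; in a real stream each would be a separate application reading the same records at its own pace.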
11. How this relates to Kinesis
[Diagram labels: Global top-10, Elastic Beanstalk, foo-analysis.com, Kinesis, Kinesis Application]
12. Core streaming concepts
[Diagram labels: Global top-10, Elastic Beanstalk, foo-analysis.com, Data Record, Stream, Shard, Partition Key, Worker, My top-10]
[Diagram inset: a shard holding data records with sequence numbers 14, 17, 18, 21, 23]
13. Kinesis Managed Stream Processing
• Moved from batch to continuous processing
• Scale shards and time series elastically UP or DOWN
without losing sequencing
• Workers can replay records for up to 24 hours
• Scale up to GB/sec without losing durability
• Records stored across multiple availability zones
• Multiple parallel Kinesis apps output to anything…
• RDBMS, S3, in-house data warehouse, messaging, another stream,
Java SDK, Python SDK, etc.
15. Core Concepts Recapped
• Data Record ~ a single generated record
• Stream ~ all records (a.k.a. the fire hose)
• Partition Key ~ groups all records for a specific topic or sensor
• Shard ~ all data records belonging to a set of topics, grouped
together
• Sequence Number ~ generated and assigned to each data record
when ingested
• Worker ~ processes the records of a shard in sequence order
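How these concepts fit together can be sketched with a toy in-memory stream. This is an illustration, not the Kinesis API: `ToyStream`, `shard_for`, and the sensor names are hypothetical. Kinesis does hash partition keys with MD5 into a 128-bit space and maps hash ranges to shards; the even split of that space and the simple counter used for sequence numbers here are simplifying assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import count

HASH_SPACE = 2 ** 128  # an MD5 digest read as an integer is below 2**128

def shard_for(partition_key, num_shards):
    """Map a partition key to a shard index by splitting the hash space evenly (assumption)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * num_shards // HASH_SPACE

class ToyStream:
    """In-memory stand-in for a stream: shards hold (sequence number, key, data) tuples."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.shards = defaultdict(list)
        self._sequence = count(1)  # simplified: one increasing counter for the whole stream

    def put_record(self, partition_key, data):
        shard_id = shard_for(partition_key, self.num_shards)
        sequence_number = next(self._sequence)
        self.shards[shard_id].append((sequence_number, partition_key, data))
        return shard_id, sequence_number

    def read_shard(self, shard_id):
        """Worker view: the records of one shard, in sequence order."""
        return list(self.shards[shard_id])

stream = ToyStream(num_shards=4)
for i in range(6):
    stream.put_record("sensor-%d" % (i % 2), "reading %d" % i)
for shard_id in sorted(stream.shards):
    print(shard_id, stream.read_shard(shard_id))
```

Because the shard is derived only from the partition key, every record for one sensor ends up on the same shard, which is what lets a single worker process that sensor's records in order.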
17. Analysis using Redshift
• Compatible with existing SQL Business Intelligence tools
• Start small and grow massively
• Scalable from 160GB to Petabyte+
• Elastic data warehousing
• Queries automatically run against the old cluster while the new one is being
provisioned
• Run it when you need it
21. Example: Kinesis for Simple Metering & Billing
[Diagram labels: Billing auditors, Incremental bill computation, Metering record archive, Billing mgmt service]
22. Kinesis Poster Worker Demo
(aka. The Egg Finder)
• Published at AWSlabs
• https://github.com/awslabs/kinesis-poster-worker
• Poster ~ multi-threaded client that posts random characters into a stream
• Worker ~ a thread-per-shard client that gets batches of records looking for
the word ‘egg’
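The poster/worker idea can be simulated locally without any AWS calls. The sketch below is not the code from the awslabs repository: `post_records`, `batches`, and `find_word` are hypothetical names, the records live in a plain list rather than a stream, and the threading described on the slide is left out.

```python
import random
import string

def post_records(num_records, record_length, rng):
    """Poster: build records made of random lowercase characters."""
    return [
        "".join(rng.choice(string.ascii_lowercase) for _ in range(record_length))
        for _ in range(num_records)
    ]

def batches(records, batch_size):
    """Yield the records of one shard in fixed-size batches, preserving order."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

def find_word(records, word="egg", batch_size=100):
    """Worker: scan batch by batch and return (position, record) for every hit."""
    hits = []
    position = 0
    for batch in batches(records, batch_size):
        for record in batch:
            if word in record:
                hits.append((position, record))
            position += 1
    return hits

rng = random.Random(0)  # seeded so repeated runs post the same records
posted = post_records(num_records=5000, record_length=40, rng=rng)
hits = find_word(posted, word="egg", batch_size=100)
print(len(hits), "records contain 'egg'")
```

A worker that only checks each record on its own, as here, would miss an 'egg' split across two records; whether that matters depends on how the poster chunks its data.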