1. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
Part 1 of 3: The basics of real-time streaming analytics
Getting started with streaming analytics
Javier Ramirez
AWS Developer Advocate
@supercoco9
2.
Agenda
Why real-time analytics and data streaming?
Challenges of streaming analytics
Useful concepts to reason about streaming data
Components of a streaming analytics pipeline
Overview of popular Open Source components for streaming analytics: Apache Kafka, Apache Spark, Apache Flink, Apache Cassandra, Apache HBase, Elasticsearch
AWS toolbox for streaming analytics: Amazon MSK, Amazon EMR, Amazon Kinesis, Amazon Keyspaces, Amazon DynamoDB, Amazon Elasticsearch Service
3.
Why streaming analytics
• The number of “smart” devices is projected to be 200 billion by 2020 (over 100X increase in ten years)
• 90% of the data in the world was generated in the last 2 years
• There are 2.5 quintillion bytes of data created each day, and this pace is accelerating
Source: BI Intelligence Estimates
Source: Forbes – How much data do we produce
Data streaming technology enables a customer to ingest, process, and analyze high volumes of high-velocity data from a variety of sources
4.
The value of data diminishes over time
[Figure: value of data to decision-making plotted against time (real time, seconds, minutes, hours, days, months). The curve runs from preventive/predictive through actionable and reactive to historical, with time-critical decisions at the real-time end and traditional “batch” business intelligence at the far end. Caption: information half-life in decision-making]
Source: Perishable insights, Mike Gualtieri, Forrester
5.
Can’t I just use batch big data analytics tools?
https://aws.amazon.com/streaming-data/
6.
Can’t I just use batch big data analytics tools?
Data is never complete
You don’t know the volume of the data before you start
Low latency is expected
Data can arrive out of order
The system should remain available during upgrades
7.
A simple problem (until you know the details)
I want to calculate the total and average of several numbers
8.
A simple big data problem (until you know the details)
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory or on a single hard drive
9.
A simple streaming problem
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory or on a single hard drive
The dataset is not static: new numbers are coming in all the time
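A minimal sketch of why this is still tractable, assuming plain Python and made-up readings: the total and the average can be maintained incrementally, so no number has to be kept once it has been processed. The names `RunningStats` and `running_totals` are illustrative and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class RunningStats:
    """Keeps only a count and a sum, never the numbers themselves."""
    count: int = 0
    total: float = 0.0

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value

    @property
    def average(self) -> float:
        # The average of an empty stream is undefined; 0.0 is a convention chosen here.
        return self.total / self.count if self.count else 0.0


def running_totals(stream: Iterable[float]) -> Iterator[Tuple[float, float]]:
    """Yield (total, average) after every new number arrives."""
    stats = RunningStats()
    for value in stream:
        stats.add(value)
        yield stats.total, stats.average


# Any iterable works, including one that never ends; a short list keeps the example finite.
results = list(running_totals([10.0, 20.0, 60.0]))
```

The state is constant in size no matter how many numbers arrive, which is what makes the "more than you can store" constraint manageable for this particular aggregation.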
10.
A simplish streaming problem
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory or on a single hard drive
The dataset is not static: new numbers are coming in all the time
From different sensors, which are geo-distributed and moving. We will be adding and removing sensors all the time
11.
A quite standard streaming problem
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory or on a single hard drive
The dataset is not static: new numbers are coming in all the time
From different sensors, which are geo-distributed and moving. We will be adding and removing sensors all the time
And since they use 3G and batteries, some might go quiet for a while and then send a bunch of stale data
12.
An elastic and scalable streaming problem
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory or on a single hard drive
The dataset is not static: new numbers are coming in all the time
From different sensors, which are geo-distributed and moving. We will be adding and removing sensors all the time
And since they use 3G and batteries, some might go quiet for a while and then send a bunch of stale data
Flow will not be constant (from a few events per second to thousands)
13.
An almost real-life streaming analytics scenario
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory or on a single hard drive
The dataset is not static: new numbers are coming in all the time
From different sensors, which are geo-distributed and moving. We will be adding and removing sensors all the time
And since they use 3G and batteries, some might go quiet for a while and then send a bunch of stale data
Flow will not be constant (from a few events per second to thousands)
And I don’t want just the overall total and average, but the total per month, per week, per day, per hour, per minute…
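As a rough illustration of per-period totals, the sketch below keeps one (count, sum) pair per time bucket at several granularities. It is an assumption-laden toy: timestamps are epoch seconds in UTC, the bucket formats in `GRANULARITIES` are an arbitrary choice, and everything lives in one process's memory.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Bucket label format for each granularity (illustrative choice).
GRANULARITIES = {
    "minute": "%Y-%m-%d %H:%M",
    "hour": "%Y-%m-%d %H",
    "day": "%Y-%m-%d",
}

# (granularity, bucket label) -> [count, total]
aggregates = defaultdict(lambda: [0, 0.0])


def record(epoch_seconds: float, value: float) -> None:
    """Fold one reading into every bucket that contains its timestamp."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    for name, fmt in GRANULARITIES.items():
        bucket = aggregates[(name, ts.strftime(fmt))]
        bucket[0] += 1
        bucket[1] += value


def summary(name: str, bucket: str):
    """Return (total, average) for a bucket, or None if nothing landed in it."""
    entry = aggregates.get((name, bucket))
    if entry is None:
        return None
    count, total = entry
    return total, total / count


record(0, 10.0)    # 1970-01-01 00:00:00 UTC
record(30, 20.0)   # 1970-01-01 00:00:30 UTC
record(90, 30.0)   # 1970-01-01 00:01:30 UTC
```

Unlike a single running total, the state here grows with the number of buckets, which is one reason real pipelines need to decide when a bucket is finished and can be written out.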
14.
A real business use case for streaming
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory or on a single hard drive
The dataset is not static: new numbers are coming in all the time
From different sensors, which are geo-distributed and moving. We will be adding and removing sensors all the time
And since they use 3G and batteries, some might go quiet for a while and then send a bunch of stale data
Flow will not be constant (from a few events per second to thousands)
And I don’t want just the overall total and average, but the total per month, per week, per day, per hour, per minute…
We need pretty dashboards with current status, comparison with the past, trends, and anomaly detection
To run this reliably, we need advanced monitoring, alerts, and autoscaling
No, I am not hiring a whole new operations team to manage the system
15.
17. Probably less than you think
~20 lines of Java code (plus a few hundred with imports, POJOs, and boilerplate, because Java)
OR
a simple GROUP BY statement in SQL with streaming extensions (plus a few lines of boilerplate for schema definition)
18.
Streaming analytics concepts
19.
Streaming data pipeline overview
Ingest → Transform → Analyze → React → Persist
What are the key requirements?
• Durable
• Stateful
• Continuous
• Fast
• Correct
• Reactive
• Reliable
20.
Durability and reliability
Need to store intermediate data
You might want to be able to replay the stream
Self-healing architecture. If one component goes down while data is in flight, the system needs to rebalance and data needs to be reassigned seamlessly
Monitoring
21.
Stateful processing
Working on per-element streams is relatively easy (e.g. changing the format of each item, or filtering out records based on their own properties)
[Figure: elements arriving along a processing-time axis, 8:00 to 14:00]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
The real fun starts when you need to do transforms/aggregations over groups of elements: group by, count, max, average, joins, filtering based on properties from related records, or complex pattern detection
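The difference can be sketched in a few lines of Python. The record shape (`sensor`, `celsius`) and both function names are invented for illustration: the first operator looks at one element at a time, while the second has to remember something between elements.

```python
def to_fahrenheit(readings):
    """Stateless: each output depends only on the current element."""
    for r in readings:
        if r["celsius"] is None:  # filter on the record's own properties
            continue
        yield {"sensor": r["sensor"], "fahrenheit": r["celsius"] * 9 / 5 + 32}


def max_per_sensor(readings):
    """Stateful: a per-key maximum must survive from one element to the next."""
    maxima = {}
    for r in readings:
        key = r["sensor"]
        maxima[key] = max(maxima.get(key, r["fahrenheit"]), r["fahrenheit"])
        yield key, maxima[key]


raw = [
    {"sensor": "a", "celsius": 0.0},
    {"sensor": "b", "celsius": None},   # dropped by the stateless filter
    {"sensor": "a", "celsius": 100.0},
    {"sensor": "a", "celsius": 50.0},
]
converted = list(to_fahrenheit(raw))
maxima_seen = list(max_per_sensor(converted))
```

The `maxima` dictionary is the state. In a distributed engine that state has to be partitioned, checkpointed, and restored after failures, which is where most of the difficulty comes from.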
22.
Continuous and fast
Data can come in spikes, faster than we can process it. Need to account for reliable persistent storage while in flight
You will need to think about how to update a system that never stops receiving data
Since data is never complete, in the case of stateful computations, we need to decide when to output data (windowing)
23.
Processing-Time based windows
[Figure: elements grouped into windows along the processing-time axis, 8:00 to 14:00]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
24.
Event-Time Based Windows
[Figure: input and output shown against event-time and processing-time axes, 10:00 to 15:00]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
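To make the contrast with processing-time windows concrete, here is a small sketch that sums the same events twice, once by when they arrived and once by when they were produced. The one-hour window, the field names, and the sample timestamps (seconds from an arbitrary origin) are assumptions for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # one-hour tumbling windows (illustrative)


def window_start(timestamp: int) -> int:
    """Start of the tumbling window that contains `timestamp`."""
    return timestamp - (timestamp % WINDOW_SECONDS)


def sum_by_window(events, time_field):
    """Sum event values per window, using the chosen notion of time."""
    sums = defaultdict(int)
    for event in events:
        sums[window_start(event[time_field])] += event["value"]
    return dict(sums)


# Listed in arrival order. The last event was produced in the first hour
# but only arrived during the second hour.
events = [
    {"event_time": 600, "processing_time": 650, "value": 1},
    {"event_time": 3700, "processing_time": 3800, "value": 4},
    {"event_time": 3000, "processing_time": 4000, "value": 2},
]

by_processing_time = sum_by_window(events, "processing_time")
by_event_time = sum_by_window(events, "event_time")
```

With processing time the delayed event is counted in the hour it arrived; with event time it is counted in the hour it belongs to, at the cost of having to keep that window open long enough to receive it.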
25.
Session Windows
[Figure: input and output shown against event-time and processing-time axes, 10:00 to 15:00]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
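A session window has no fixed length: it stays open while events keep arriving and closes after a period of inactivity. The sketch below is a simplified batch version, assuming integer event times and an illustrative `gap` parameter; it sorts a bounded list, which a real streaming engine cannot do and instead handles by merging sessions as events arrive.

```python
def sessionize(timestamps, gap):
    """Group event times into sessions separated by more than `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current burst of activity
        else:
            sessions.append([ts])     # silence lasted longer than the gap: new session
    return sessions


sessions = sessionize([100, 130, 150, 900, 905, 2000], gap=300)
```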
26.
Correctness: Late-arriving data
Event-time vs Processing-time
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
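One common way to reason about late data is a watermark: an estimate of how far event time has progressed, after which a window is considered complete. The sketch below is a toy under stated assumptions, not how any particular engine implements it: the watermark simply trails the newest event time by a fixed `ALLOWED_LATENESS`, windows are 60-second tumbling windows, and late events are only collected in a list.

```python
from collections import defaultdict

WINDOW = 60            # tumbling window length in seconds (illustrative)
ALLOWED_LATENESS = 30  # how far the watermark trails the newest event time (illustrative)


class WatermarkedSum:
    def __init__(self):
        self.open_windows = defaultdict(int)  # window start -> running sum
        self.max_event_time = float("-inf")
        self.emitted = []                     # (window start, sum) in emission order
        self.late = []                        # events whose window is already closed

    def watermark(self):
        return self.max_event_time - ALLOWED_LATENESS

    def on_event(self, event_time, value):
        start = event_time - event_time % WINDOW
        if start + WINDOW <= self.watermark():
            # The watermark has already passed the end of this event's window.
            self.late.append((event_time, value))
            return
        self.open_windows[start] += value
        self.max_event_time = max(self.max_event_time, event_time)
        # Emit every open window that the watermark has now passed.
        for ready in sorted(s for s in self.open_windows if s + WINDOW <= self.watermark()):
            self.emitted.append((ready, self.open_windows.pop(ready)))


op = WatermarkedSum()
# (event time, value) in arrival order: (40, 8) is out of order but tolerated,
# (20, 32) shows up after its window was emitted.
for event_time, value in [(10, 1), (50, 2), (70, 4), (40, 8), (95, 16), (20, 32)]:
    op.on_event(event_time, value)
```

A larger lateness allowance makes results more complete but delays them; a smaller one does the opposite. What to do with events in `late` (drop them, or send them somewhere for correction) is a design decision this sketch leaves open.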
27.
Correctness: Delivery semantics
• Exactly once
• At least once
• At most once
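In short: at most once may lose messages, at least once may deliver duplicates, and exactly once does neither. A simulation can show why the choice matters for aggregations. Everything in it is assumed for illustration: the message ids, the `deliver_at_least_once` helper that fakes a retry, and the in-memory `seen` set used for deduplication.

```python
def deliver_at_least_once(messages, redeliver_ids):
    """Simulate a broker that retries: ids in `redeliver_ids` are delivered twice."""
    for msg_id, value in messages:
        yield msg_id, value
        if msg_id in redeliver_ids:
            yield msg_id, value  # duplicate, e.g. after a lost acknowledgement


def naive_total(deliveries):
    """Counts every delivery, so duplicates inflate the result."""
    return sum(value for _, value in deliveries)


def idempotent_total(deliveries):
    """Ignores ids already processed, so redelivery does not change the result."""
    seen = set()
    total = 0
    for msg_id, value in deliveries:
        if msg_id in seen:
            continue
        seen.add(msg_id)
        total += value
    return total


messages = [("m1", 10), ("m2", 20), ("m3", 30)]
naive = naive_total(deliver_at_least_once(messages, {"m2"}))
deduplicated = idempotent_total(deliver_at_least_once(messages, {"m2"}))
```

At-least-once delivery combined with idempotent processing is one common way to get results that look exactly-once to the consumer of the output.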
28.
Reactive
All the components need to be designed for low latency
[Figure repeated from slide 4: the value of data to decision-making diminishes over time, from real time to months]
Source: Perishable insights, Mike Gualtieri, Forrester
29.
Components of a streaming analytics pipeline
30.
Streaming analytics components
Devices and/or applications that produce real-time data at high velocity
Data from tens of thousands of data sources can be written to a single stream
Data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time
Records are read in the order they are produced, enabling real-time analytics or streaming ETL
Database (NoSQL most common), Message broker, Notification system, File Storage, or Data Lake
Analytics dashboard
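The stream storage behaviour described above (ordered, retained for a set duration, replayable during that time) can be modelled with a toy in-memory log. `RetainedLog` and its methods are hypothetical names for this sketch and do not mirror the API of Kafka, Kinesis, or any other product; time is passed in explicitly to keep the example deterministic.

```python
from collections import deque


class RetainedLog:
    """Toy stream: append-only, ordered, bounded retention, replayable."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = deque()   # (sequence number, arrival time, payload)
        self.next_sequence = 0

    def append(self, payload, now):
        """Store a record in arrival order and drop anything past retention."""
        self.records.append((self.next_sequence, now, payload))
        self.next_sequence += 1
        self._expire(now)

    def _expire(self, now):
        while self.records and now - self.records[0][1] > self.retention_seconds:
            self.records.popleft()

    def read_from(self, sequence, now):
        """Replay every retained record with sequence number >= `sequence`, in order."""
        self._expire(now)
        return [(seq, payload) for seq, _, payload in self.records if seq >= sequence]


log = RetainedLog(retention_seconds=100)
log.append("a", now=0)
log.append("b", now=50)
log.append("c", now=120)   # "a" is now older than the retention period

first_read = log.read_from(0, now=120)
replay = log.read_from(0, now=120)        # reading does not consume records
from_offset = log.read_from(2, now=120)   # a consumer resuming from sequence 2
later = log.read_from(0, now=200)         # "b" has expired by now
```

Because reads do not remove data, several independent consumers can process the same stream at their own pace, and a fixed consumer can reprocess history after a bug fix as long as the data is still within retention.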
31.
The (excellent) Open Source ecosystem
32.
Ingestion/in-stream storage: Apache Kafka
A distributed streaming platform
Concepts:
Producers
Topics
Brokers
Consumers
33.
Ingestion/in-stream storage: Apache Flume
Distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data
34.
Stream Processing: Apache Spark
Unified Analytics Engine for large-scale data processing
Concepts:
Driver/Workers
Data Source
Discretized Stream
Transforms
Streaming SQL
Outputs
36.
Stream Processing: Apache Flink
Stateful computation over Data Streams
Concepts:
Job Manager/Workers
Source
DataStream
Transforms/Operators
TableAPI/SQL
Sinks
39.
Stream Storage: Apache Cassandra
Manage massive amounts of data, fast, without losing sleep
https://cassandra.apache.org/
Concepts:
Nodes
Token Ring
Consistency Levels
Column Families
41.
Stream Storage: Apache HBase
The Hadoop database, a distributed, scalable, big data store
https://hbase.apache.org/book.html
First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.
Concepts: HBase Master, Regions, Region Servers, Data Nodes, Column Families
42.
Dashboard: Elasticsearch with Kibana
Elasticsearch is a distributed JSON-based search and analytics engine. Kibana gives shape to your data
https://www.elastic.co/kibana
Wikimedia has a live interactive dashboard powered by Kibana at https://wikimedia.biterg.io/
Concepts:
Master Node
Data Nodes
Shard
Index
43.
Dashboard: Grafana
Grafana allows you to query, visualize, alert on and understand your metrics no matter where they are stored.
https://grafana.com/grafana/
Wikimedia also has a live interactive metrics dashboard powered by Grafana at https://grafana.wikimedia.org/
Concepts:
Data Source
44.
Challenges of data streaming components
• Difficult to set up
• Tricky to scale
• Hard to achieve high availability
• Integration requires development
• Error-prone and complex to manage
• Expensive to maintain
45.
AWS services for streaming analytics
Both managed services and native services
46.
Streaming real-time data with AWS
* Some services scale up and down elastically, while others allow you to automate when to scale up/down
** It is possible to have a serverless data streaming pipeline, in which you pay only for what you use. In the case of managed non-serverless services, you can dynamically adapt to your traffic
47.
Services for Ingestion/in-stream storage
Amazon Managed Streaming for Apache Kafka
Fully managed version of Apache Kafka
Amazon Kinesis Data Streams
Massively scalable, elastic, and durable real-time data streaming
Amazon Kinesis Data Firehose
The easiest way to reliably load streaming data into data lakes, data stores, and analytics services
AWS Glue with serverless streaming
Simple, flexible, and cost-effective ETL
48.
Services for stream processing
Amazon Kinesis Data Analytics for Apache Flink
Fully managed, elastic version of Apache Flink
Amazon Kinesis Data Analytics for SQL Applications
Process and analyze streaming data using standard SQL
Amazon EMR
Easily run and scale Apache Spark and other big data frameworks. You can also run Apache Flink and Apache HBase on EMR
AWS Glue with serverless streaming
Simple, flexible, and cost-effective ETL. Supports Spark for serverless ETL
49.
Services for stream storage
Amazon Keyspaces for Apache Cassandra
Scalable, highly available, and managed Apache Cassandra-compatible database service
Amazon DynamoDB
Fast and flexible NoSQL database service for any scale (for example, in 2017 Samsung Cloud Service was serving 300M users with a total storage of 860TB)
Amazon EMR
Easily run and scale Apache HBase and other big data frameworks. You can also run Apache Flink and Apache Spark on EMR
50.
Services for analytics dashboards
Amazon Elasticsearch Service
Fully managed, scalable, and secure Elasticsearch service
Amazon QuickSight
Fast, cloud-powered business intelligence service that makes it easy to deliver insights to everyone in your organization.
51.
A serverless data stream (per element processing)
[Diagram: a data producer continuously streams data into Kinesis Data Streams. The Lambda service continuously polls for new data (1 poll per second) and automatically invokes your function(s), Lambda function A and Lambda function B, when data is found. Amazon SNS and DynamoDB also appear in the diagram.]
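A per-element Lambda function for this kind of pipeline might look roughly like the sketch below. It relies on the event shape Lambda uses for Kinesis batches, where each record carries its payload base64-encoded under `kinesis.data`. The payload schema (`sensor`, `celsius`), the alert threshold, and the `transform` helper are invented for illustration, and the function only returns its results rather than writing them anywhere.

```python
import base64
import json


def transform(reading):
    """Per-element logic; the payload fields and threshold are made-up examples."""
    return {
        "sensor": reading["sensor"],
        "celsius": reading["celsius"],
        "alert": reading["celsius"] > 30.0,
    }


def handler(event, context):
    """Entry point invoked by the Lambda service with a batch of Kinesis records."""
    results = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])  # data arrives base64-encoded
        results.append(transform(json.loads(payload)))
    # A real function would send `results` to a destination here
    # (for example a database table or a notification topic).
    return results


def fake_kinesis_event(readings):
    """Build a minimal event for local testing; real events carry more fields."""
    return {
        "Records": [
            {"kinesis": {"data": base64.b64encode(json.dumps(r).encode()).decode()}}
            for r in readings
        ]
    }


sample_event = fake_kinesis_event([
    {"sensor": "s1", "celsius": 21.5},
    {"sensor": "s2", "celsius": 35.0},
])
output = handler(sample_event, None)
```

Because each record is handled on its own, the function needs no state between invocations, which is what makes this pattern a good fit for the per-element case and a poor fit for windowed aggregations.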
52.
Fully managed stateful streaming analytics
53.
Getting Started
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
A great write-up on streaming analytics challenges
https://aws.amazon.com/streaming-data/
Streaming data
https://docs.aws.amazon.com/msk/latest/developerguide/what-is-msk.html
Getting started with Apache Kafka/Amazon MSK
https://aws.amazon.com/kinesis/
Amazon Kinesis Services for streaming data
https://aws.amazon.com/elasticsearch-service/
Amazon Elasticsearch Service
https://dl.acm.org/doi/10.1145/543613.543615
Research about Models and Issues in data stream systems
54.
Thanks
Javier Ramirez
AWS Developer Advocate
@supercoco9