© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
Part 1 of 3: The basics of real-time streaming analytics
Getting started with streaming analytics
Javier Ramirez
AWS Developer Advocate
@supercoco9
Agenda
Why real-time analytics and data streaming?
Challenges of streaming analytics
Useful concepts to reason about streaming data
Components of a streaming analytics pipeline
Overview of popular Open Source components for streaming analytics: Apache Kafka, Apache Spark, Apache Flink, Apache Cassandra, Apache HBase, Elasticsearch
AWS toolbox for streaming analytics: Amazon MSK, Amazon EMR, Amazon Kinesis, Amazon Keyspaces, Amazon DynamoDB, Amazon Elasticsearch Service
Why streaming analytics
• The number of “smart” devices is projected to be 200 billion by 2020 (over 100X increase in ten years)
• 90% of the data in the world was generated in the last 2 years
• There are 2.5 quintillion bytes of data created each day, and this pace is accelerating
Source: BI Intelligence Estimates
Source: Forbes – How much data do we produce
Data streaming technology enables a customer to ingest, process,
and analyze high volumes of high-velocity data from a variety of
sources
The value of data diminishes over time
Source: Perishable insights, Mike Gualtieri, Forrester
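[Figure: value of data to decision-making plotted against time (real time, seconds, minutes, hours, days, months). Value drops from preventive/predictive to actionable, reactive, and historical. Time-critical decisions sit at the real-time end; traditional “batch” business intelligence sits at the far end. The curve is labelled “information half-life in decision-making”.]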
Can't I just use batch big data analytics tools?
https://aws.amazon.com/streaming-data/
Data is never complete
You don’t know the volume of the data before you start
Low latency is expected
Data can come out of order
System should remain available during upgrades
A simple problem (until you know the details)
I want to calculate the total and average of several numbers
A simple big data problem (until you know the details)
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory,
or in a single hard drive
A simple streaming problem
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory, or in a
single hard drive
The dataset is not static, new numbers are coming all the time
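A minimal sketch of how a total and an average can be kept up to date when the numbers never stop arriving and do not fit in memory. The names `RunningStats` and `running_totals` are hypothetical, chosen for illustration; the only state kept is a count and a sum.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class RunningStats:
    """Constant-size state: a count and a sum, never the numbers themselves."""
    count: int = 0
    total: float = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.total += value

    @property
    def average(self) -> float:
        # An empty stream has no average; 0.0 is an arbitrary choice for this sketch.
        return self.total / self.count if self.count else 0.0


def running_totals(values: Iterable[float]) -> Iterator[Tuple[float, float]]:
    """Yield (total, average) after every new value, so a result is always available."""
    stats = RunningStats()
    for value in values:
        stats.update(value)
        yield stats.total, stats.average


if __name__ == "__main__":
    for total, average in running_totals([10, 20, 30, 40]):
        print(total, average)
```

Because `values` can be any iterable, including a generator that never ends, the same loop works for a bounded file and for an unbounded stream.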
A simplish streaming problem
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory, or in a
single hard drive
The dataset is not static, new numbers are coming all the time
From different sensors, which are geo distributed and moving. We
will be adding and removing sensors all the time
A quite standard streaming problem
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory, or in a
single hard drive
The dataset is not static, new numbers are coming all the time
From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time
And since they use 3G and batteries, some might go quiet for a
while and then send a bunch of stale data
An elastic and scalable streaming problem
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory, or in a
single hard drive
The dataset is not static, new numbers are coming all the time
From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time
And since they use 3G and batteries, some might go quiet for a while and then
send a bunch of stale data
Flow will not be constant (from a few events per second to
thousands)
An almost real-life streaming analytics scenario
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory, or in a
single hard drive
The dataset is not static, new numbers are coming all the time
From different sensors, which are geo distributed and moving. We will be
adding and removing sensors all the time
And since they use 3G and batteries, some might go quiet for a while and then
send a bunch of stale data
Flow will not be constant (from a few events per second to thousands)
And I don’t want just the total average, but total per month, per
week, per day, per hour, per minute…
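One possible way to keep totals at several granularities at once is to truncate each reading's timestamp to every granularity and update one bucket per level. This is a sketch under assumptions: the `GRANULARITIES` patterns, the function names, and the use of UTC epoch seconds are all hypothetical, and the weekly bucket is left out to keep the example short.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical strftime patterns used to truncate a timestamp to each granularity.
GRANULARITIES = {
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%dT%H",
    "minute": "%Y-%m-%dT%H:%M",
}


def new_rollups():
    """rollups[granularity][bucket] holds a [count, total] pair."""
    return {name: defaultdict(lambda: [0, 0.0]) for name in GRANULARITIES}


def add_reading(rollups, epoch_seconds, value):
    """Update one bucket per granularity for a single reading."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    for name, pattern in GRANULARITIES.items():
        bucket = rollups[name][ts.strftime(pattern)]
        bucket[0] += 1
        bucket[1] += value


def average(rollups, granularity, bucket):
    """Average of the readings that fell into one bucket (0.0 if it is empty)."""
    count, total = rollups[granularity][bucket]
    return total / count if count else 0.0
```

Each reading touches a fixed number of buckets, so the cost per event stays constant no matter how much data has already been seen.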
A real business use case for streaming
I want to calculate the total and average of several numbers
They might be MANY numbers, more than you can store in memory, or in a single hard drive
The dataset is not static, new numbers are coming all the time
From different sensors, which are geo distributed and moving. We will be adding and removing sensors all the time
And since they use 3G and batteries, some might go quiet for a while and then send a bunch of stale data
Flow will not be constant (from a few events per second to thousands)
And I don’t want just the total average, but total per month, per week, per day, per hour, per minute…
We need pretty dashboards with current status, comparison with the
past, trends, and anomaly detection
To run this reliably, we need advanced monitoring, alerts, and
autoscaling
No, I am not hiring a whole new operations team to manage the
system
http://gunshowcomic.com/648
Probably less than you think
~20 lines of Java code (plus a few hundred with imports, POJOs, and boilerplate, because Java)
OR
a simple GROUP BY statement in SQL with streaming extensions (plus a few lines of boilerplate for schema definition)
Streaming analytics concepts
Streaming data pipeline overview
Ingest → Transform → Analyze → React → Persist
What are the key requirements?
• Durable
• Stateful
• Continuous
• Fast
• Correct
• Reactive
• Reliable
Durability and reliability
Need to store intermediate data
You might want to be able to replay the stream
Self-healing architecture. If one component goes down
while data is in-flight, the system needs to re-balance and
data needs to be reassigned seamlessly
Monitoring
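The idea of storing intermediate data so that the stream can be replayed can be sketched as an append-only log in which each consumer tracks its own offset. This is an in-memory illustration only; `ReplayableLog` and `Consumer` are hypothetical names, not the API of any real broker.

```python
class ReplayableLog:
    """Append-only log: records keep their position, readers track their own offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the record just written

    def read(self, offset, max_records=100):
        """Return up to max_records starting at offset, plus the next offset to read."""
        batch = self._records[offset:offset + max_records]
        return batch, offset + len(batch)


class Consumer:
    def __init__(self, log):
        self.log = log
        self.committed = 0  # offset of the next record to process

    def poll(self, max_records=100):
        batch, next_offset = self.log.read(self.committed, max_records)
        # A real consumer would process the batch here, and commit only afterwards.
        self.committed = next_offset
        return batch

    def rewind(self, offset=0):
        """Replay: move the committed offset back; the log itself is untouched."""
        self.committed = offset
```

If a consumer crashes before committing, a replacement can resume from the last committed offset, which is the basis for the re-balancing described above.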
Stateful processing
Working on per-element streams is relatively easy (e.g. changing the format of each item, or filtering out records based on their own properties)
[Figure: elements arriving along a processing-time axis from 8:00 to 14:00]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
The real fun starts when you need to do transforms/aggregations over groups of elements:
group by, count, max, average, joins, filtering based on properties from related records, or
complex pattern detection
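The difference between per-element work and work over groups of elements can be illustrated with a small sketch. The record fields (`sensor`, `temp_f`, `temp_c`) and all function and class names are assumptions made up for this example.

```python
from collections import defaultdict


def to_celsius(reading):
    """Stateless, per-element transform: the output depends only on this record."""
    return {**reading, "temp_c": round((reading["temp_f"] - 32) * 5 / 9, 1)}


def is_valid(reading):
    """Stateless filter: a record is kept or dropped based on its own properties."""
    return reading["temp_f"] > -100


class MaxPerSensor:
    """Stateful aggregation: the answer depends on every record seen so far for a key."""

    def __init__(self):
        self.state = defaultdict(lambda: float("-inf"))

    def update(self, reading):
        key = reading["sensor"]
        self.state[key] = max(self.state[key], reading["temp_c"])
        return key, self.state[key]
```

The first two functions can run on any worker with no coordination. `MaxPerSensor` holds state that has to survive restarts and has to live wherever records for that key are routed, which is where most of the difficulty comes from.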
Continuous and fast
Data can come in spikes, faster than we can process it.
We need reliable persistent storage for data while it is in flight
You will need to think about how to update a system that never
stops receiving data
Since data is never complete, stateful computations need to
decide when to output results (windowing)
Processing-Time based windows
[Figure: processing-time based windows along a processing-time axis from 8:00 to 14:00]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
Event-Time Based Windows
[Figure: event-time based windows, with input and output panels plotting event time (10:00 to 15:00) against processing time (10:00 to 15:00)]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
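To make the contrast between processing-time and event-time windows concrete, the sketch below sums the same events twice, once by the time they arrived and once by the time they happened. The field names, the 60-second window, and the sample events are all made up for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window size


def window_start(timestamp, size=WINDOW_SECONDS):
    """Start of the fixed (tumbling) window that contains this timestamp."""
    return timestamp - (timestamp % size)


def sum_by_window(events, time_field):
    """Sum event values per window, using either 'event_time' or 'arrival_time'."""
    sums = defaultdict(int)
    for event in events:
        sums[window_start(event[time_field])] += event["value"]
    return dict(sums)


EVENTS = [
    {"event_time": 5, "arrival_time": 7, "value": 1},
    {"event_time": 50, "arrival_time": 65, "value": 2},   # crosses a window boundary in transit
    {"event_time": 70, "arrival_time": 72, "value": 4},
    {"event_time": 20, "arrival_time": 130, "value": 8},  # stale data from a sensor that went quiet
]

if __name__ == "__main__":
    print("processing time:", sum_by_window(EVENTS, "arrival_time"))
    print("event time:     ", sum_by_window(EVENTS, "event_time"))
```

Grouping by arrival time is simple because windows can be closed as the clock advances, but it attributes delayed values to the wrong window. Grouping by event time gives the right totals at the cost of deciding how long to wait for stragglers.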
Session Windows
[Figure: session windows, with input and output panels plotting event time (10:00 to 15:00) against processing time (10:00 to 15:00)]
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
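Session windows group events that are close together in time and start a new window after a period of inactivity. The following is a simplified sketch over a finite list of event times; the function name and the `gap` parameter are assumptions, and a real streaming engine would also have to merge sessions as out-of-order events arrive.

```python
def session_windows(timestamps, gap):
    """Group event times into sessions; a new session starts after `gap` of inactivity.

    Returns one (first_event, last_event, event_count) tuple per session.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the gap: extend the current session
        else:
            sessions.append([ts])     # silence longer than the gap: open a new session
    return [(s[0], s[-1], len(s)) for s in sessions]


if __name__ == "__main__":
    print(session_windows([1, 2, 3, 20, 21, 50], gap=5))
```

Unlike fixed windows, the window boundaries here are not known in advance; they depend on the data itself.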
Correctness: Late-arriving data
Event-time vs Processing-time
Graphics from The Beam Model. By Tyler Akidau and Frances Perry. https://beam.apache.org/community/presentation-materials/
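One common way to handle late-arriving data is a watermark: an estimate of how far event time has progressed, used to decide when a window can be closed. The sketch below uses a simple heuristic (newest event time seen minus a fixed delay). The class name, the `max_delay` parameter, and the choice to collect late events in a list are assumptions for illustration, not the behaviour of any specific engine.

```python
from collections import defaultdict


class EventTimeWindows:
    """Fixed event-time windows that close once the watermark passes their end."""

    def __init__(self, size, max_delay):
        self.size = size
        self.max_delay = max_delay      # how far behind the newest event we still wait
        self.watermark = float("-inf")
        self.open = defaultdict(int)    # window start -> running sum
        self.late = []                  # events that arrived after their window closed

    def add(self, event_time, value):
        """Process one event; return the (window_start, sum) results that just closed."""
        start = event_time - (event_time % self.size)
        if start + self.size <= self.watermark:
            self.late.append((event_time, value))  # too late: its window was already emitted
        else:
            self.open[start] += value
        self.watermark = max(self.watermark, event_time - self.max_delay)
        closed = sorted(s for s in self.open if s + self.size <= self.watermark)
        return [(s, self.open.pop(s)) for s in closed]
```

A larger `max_delay` tolerates more disorder but delays every result; a smaller one gives faster answers and pushes more events into the late list, where they need a separate policy (drop, or emit a correction).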
Correctness: Delivery semantics
• Exactly once
• At least once
• At most once
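The practical difference between these semantics shows up when a message is redelivered. The sketch below simulates at-least-once delivery and compares a naive handler with one that deduplicates by event id, a common way to get an effectively-once result. All names are hypothetical, and the in-memory `seen_ids` set stands in for what would normally be durable state.

```python
class NaiveTotal:
    """Counts every delivery, so a redelivered event is added twice."""

    def __init__(self):
        self.total = 0

    def handle(self, event_id, value):
        self.total += value


class IdempotentTotal:
    """Remembers processed event ids, so redeliveries have no effect."""

    def __init__(self):
        self.total = 0
        self.seen_ids = set()

    def handle(self, event_id, value):
        if event_id in self.seen_ids:
            return  # duplicate from a retry: ignore it
        self.seen_ids.add(event_id)
        self.total += value


def deliver_at_least_once(events, handler, redeliver_ids=()):
    """Deliver every event, and deliver the ones in redeliver_ids a second time."""
    for event_id, value in events:
        handler.handle(event_id, value)
        if event_id in redeliver_ids:
            handler.handle(event_id, value)  # simulated retry after a lost acknowledgement
```

At-most-once would be the opposite trade-off: never retry, and accept that a lost delivery is simply missing from the total.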
Reactive
All the components need to be designed for low-latency
Source: Perishable insights, Mike Gualtieri, Forrester
[Figure: value of data to decision-making falling over time, from real time to months, with time-critical decisions at the real-time end and traditional “batch” business intelligence at the far end]
Components of a streaming analytics pipeline
Streaming analytics components
• Devices and/or applications that produce real-time data at high velocity
• Data from tens of thousands of data sources can be written to a single stream
• Data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time
• Records are read in the order they are produced, enabling real-time analytics or streaming ETL
• Database (NoSQL most common), message broker, notification system, file storage, or data lake
• Analytics dashboard
The (excellent) Open Source ecosystem
Ingestion/in-stream storage: Apache Kafka
A distributed streaming platform
Concepts:
Producers
Topics
Brokers
Consumers
Ingestion/in-stream storage: Apache Flume
Distributed, reliable, and available service for collecting,
aggregating, and moving large amounts of log data
Stream Processing: Apache Spark
Unified Analytics Engine for large-scale data processing
Concepts:
Driver/Workers
Data Source
Discretized Stream
Transforms
Streaming SQL
Outputs
Stream Processing: Apache Flink
Stateful computation over Data Streams
Concepts:
Job Manager/Workers
Source
DataStream
Transforms/Operators
TableAPI/SQL
Sinks
Stream Storage: Apache Cassandra
Manage massive amounts of data, fast, without losing sleep
https://cassandra.apache.org/
Concepts:
Nodes
Token Ring
Consistency Levels
Column Families
Stream Storage: Apache HBase
The Hadoop database, a distributed, scalable, big data store
https://hbase.apache.org/book.html
“First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.”
Concepts: HBase Master, Regions, Region Servers, Data Nodes, Column Families
Dashboard: Elasticsearch with Kibana
Elasticsearch is a distributed JSON-based search and
analytics engine. Kibana gives shape to your data
https://www.elastic.co/kibana
Wikimedia has a live interactive dashboard powered by Kibana at https://wikimedia.biterg.io/
Concepts:
Master Node
Data Nodes
Shard
Index
Dashboard: Grafana
Grafana allows you to query, visualize, alert on and
understand your metrics no matter where they are stored.
https://grafana.com/grafana/
Wikimedia also has a live interactive metrics dashboard powered by Grafana at https://grafana.wikimedia.org/
Concepts:
Data Source
Challenges of data streaming components
• Difficult to set up
• Tricky to scale
• Hard to achieve high availability
• Integration requires development
• Error-prone and complex to manage
• Expensive to maintain
AWS services for streaming analytics
Both managed services and native services
Streaming real-time data with AWS
* Some services scale up and down elastically, while others allow you to automate when to scale up/down
** It is possible to have a serverless data streaming pipeline, in which you pay only for what you use. In the case of managed
non-serverless services, you can dynamically adapt to your traffic
Services for Ingestion/in-stream storage
Amazon Managed Streaming for Apache Kafka
Fully managed version of Apache Kafka
Amazon Kinesis Data Streams
Massively scalable, elastic, and durable real-time data streaming
Amazon Kinesis Data Firehose
Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data
into data lakes, data stores, and analytics services.
AWS Glue with serverless streaming
Simple, flexible, and cost-effective ETL
Services for stream processing
Amazon Kinesis Data Analytics for Apache Flink
Fully managed, elastic version of Apache Flink
Amazon Kinesis Data Analytics for SQL Applications
Process and analyze streaming data using standard SQL
Amazon EMR
Easily run and scale Apache Spark and other big data frameworks. You can also
run Apache Flink and Apache HBase on EMR
AWS Glue with serverless streaming
Simple, flexible, and cost-effective ETL. Supports Spark for serverless ETL
Services for stream storage
Amazon Keyspaces for Apache Cassandra
Scalable, highly available, and managed Apache Cassandra compatible db service
Amazon DynamoDB
Fast and flexible NoSQL database service for any scale (for example, in 2017 Samsung
Cloud Service was serving 300M users with a total storage of 860TB)
Amazon EMR
Easily run and scale Apache HBase and other big data frameworks. You can also run
Apache Flink and Apache Spark on EMR
Services for analytics dashboards
Amazon Elasticsearch Service
Fully managed, scalable, and secure Elasticsearch service
Amazon QuickSight
Fast, cloud-powered business intelligence service that makes it easy to deliver
insights to everyone in your organization.
A serverless data stream (per element processing)
Data producer → Kinesis Data Streams: continuously stream data
Lambda service: continuously polls for new data (1 poll per second) and automatically invokes your function(s) when data is found
Lambda function A, Lambda function B
Other components in the diagram: Amazon SNS, DynamoDB
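A per-element Lambda function reading from Kinesis Data Streams might look like the sketch below. Lambda passes Kinesis records in `event["Records"]`, with each payload base64-encoded under `record["kinesis"]["data"]`. The function name `handler`, the JSON payload with a `value` field, and the returned summary are assumptions made for this example.

```python
import base64
import json


def handler(event, context):
    """Entry point that Lambda invokes with a batch of Kinesis records."""
    readings = []
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded; this sketch assumes they contain JSON.
        payload = base64.b64decode(record["kinesis"]["data"])
        readings.append(json.loads(payload))

    # Hypothetical per-batch work: count the readings and add up their values.
    total = sum(reading["value"] for reading in readings)
    return {"count": len(readings), "total": total}
```

A real function would write its results somewhere, for example to a DynamoDB table or an SNS topic as in the diagram; that part is omitted here because it needs AWS credentials and resources.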
Fully managed stateful streaming analytics
Getting Started
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
A great write-up on streaming analytics challenges
https://aws.amazon.com/streaming-data/
Streaming data
https://docs.aws.amazon.com/msk/latest/developerguide/what-is-msk.html
Getting started with Apache Kafka/Amazon MSK
https://aws.amazon.com/kinesis/
Amazon Kinesis Services for streaming data
https://aws.amazon.com/elasticsearch-service/
Amazon Elasticsearch Service
https://dl.acm.org/doi/10.1145/543613.543615
Research about Models and Issues in data stream systems
Thanks
Javier Ramirez
AWS Developer Advocate
@supercoco9

 
¿Son las bases de datos de contabilidad interesantes, o son parte del hype al...
¿Son las bases de datos de contabilidad interesantes, o son parte del hype al...¿Son las bases de datos de contabilidad interesantes, o son parte del hype al...
¿Son las bases de datos de contabilidad interesantes, o son parte del hype al...
 
En un mundo hiperconectado, las bases de datos de grafos son tu arma secreta
En un mundo hiperconectado, las bases de datos de grafos son tu arma secretaEn un mundo hiperconectado, las bases de datos de grafos son tu arma secreta
En un mundo hiperconectado, las bases de datos de grafos son tu arma secreta
 
El futuro era esto: Reconocimiento facial sobre video en tiempo real sin serv...
El futuro era esto: Reconocimiento facial sobre video en tiempo real sin serv...El futuro era esto: Reconocimiento facial sobre video en tiempo real sin serv...
El futuro era esto: Reconocimiento facial sobre video en tiempo real sin serv...
 


Getting started with streaming analytics

  • 1. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark. Part 1 of 3: The basics of real-time streaming analytics. Getting started with streaming analytics. Javier Ramirez, AWS Developer Advocate, @supercoco9
  • 2. Agenda: Why real-time analytics and data streaming? Challenges of streaming analytics. Useful concepts to reason about streaming data. Components of a streaming analytics pipeline. Overview of popular open source components for streaming analytics: Apache Kafka, Apache Spark, Apache Flink, Apache Cassandra, Apache HBase, Elasticsearch. AWS toolbox for streaming analytics: Amazon MSK, Amazon EMR, Amazon Kinesis, Amazon Keyspaces, Amazon DynamoDB, Amazon Elasticsearch Service.
  • 3. Why streaming analytics: The number of “smart” devices is projected to be 200 billion by 2020 (over 100X increase in ten years). 90% of the data in the world was generated in the last 2 years. There are 2.5 quintillion bytes of data created each day, and this pace is accelerating. Sources: BI Intelligence Estimates; Forbes, "How much data do we produce". Data streaming technology enables a customer to ingest, process, and analyze high volumes of high-velocity data from a variety of sources.
  • 4. The value of data diminishes over time. Figure: value of data to decision-making plotted against time (real time, seconds, minutes, hours, days, months). Labels along the curve: preventive/predictive, actionable, reactive, historical. Time-critical decisions sit at the real-time end and traditional “batch” business intelligence at the other. The curve is labeled "Information half-life in decision-making". Source: Perishable insights, Mike Gualtieri, Forrester.
  • 5. Can't I just use batch big data analytics tools? https://aws.amazon.com/streaming-data/
  • 6. Can't I just use batch big data analytics tools? Data is never complete. You don’t know the volume of the data before you start. Low latency is expected. Data can arrive out of order. The system should remain available during upgrades.
  • 7. A simple problem (until you know the details): I want to calculate the total and average of several numbers.
  • 8. A simple big data problem (until you know the details). Adds to slide 7: They might be MANY numbers, more than you can store in memory or on a single hard drive.
  • 9. A simple streaming problem. Adds: The dataset is not static; new numbers are coming in all the time.
  • 10. A simplish streaming problem. Adds: The numbers come from different sensors, which are geo-distributed and moving. We will be adding and removing sensors all the time.
  • 11. A quite standard streaming problem. Adds: Since the sensors use 3G and batteries, some might go quiet for a while and then send a bunch of stale data.
  • 12. An elastic and scalable streaming problem. Adds: Flow will not be constant (from a few events per second to thousands).
  • 13. An almost real-life streaming analytics scenario. Adds: I don’t want just the overall total and average, but the total per month, per week, per day, per hour, per minute…
  • 14. A real business use case for streaming. Keeps every requirement from slides 7 to 13 and adds: We need pretty dashboards with current status, comparison with the past, trends, and anomaly detection. To run this reliably, we need advanced monitoring, alerts, and autoscaling. No, I am not hiring a whole new operations team to manage the system.
  • 17. Probably less than you think: ~20 lines of Java code (plus a few hundred for imports, POJOs, and boilerplate, because Java) OR a simple GROUP BY statement in SQL with streaming extensions (plus a few lines of boilerplate for schema definition).
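To make the "total and average" problem from slides 7 to 14 concrete, here is a minimal Python sketch (not the Java or SQL the slide mentions) of a grouped running aggregate. It assumes readings arrive as hypothetical `(sensor_id, value)` pairs; the names `RunningStats` and `aggregate` are invented for illustration. The point is that the state per group is only a count and a sum, so memory does not grow with the number of readings.

```python
from collections import defaultdict


class RunningStats:
    """Total and average kept in constant memory: only a count and a sum."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, value):
        self.count += 1
        self.total += value

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0


def aggregate(readings):
    """Consume (sensor_id, value) pairs one at a time, like GROUP BY sensor_id."""
    stats = defaultdict(RunningStats)
    for sensor_id, value in readings:
        stats[sensor_id].add(value)
    return stats


# Hypothetical readings; an iterator stands in for an unbounded stream.
stream = iter([("a", 10.0), ("b", 4.0), ("a", 20.0), ("b", 6.0), ("a", 30.0)])
result = aggregate(stream)
for sensor_id in sorted(result):
    s = result[sensor_id]
    print(sensor_id, s.total, s.average)
```

A real pipeline would also have to handle the later requirements (late data, time windows, scaling), which this sketch deliberately ignores.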
  • 18. Streaming analytics concepts
  • 19. Streaming data pipeline overview: Ingest, Transform, Analyze, React, Persist. What are the key requirements? Durable, Stateful, Continuous, Fast, Correct, Reactive, Reliable.
  • 20. Durability and reliability: You need to store intermediate data. You might want to be able to replay the stream. Self-healing architecture: if one component goes down while data is in flight, the system needs to rebalance and data needs to be reassigned seamlessly. Monitoring.
  • 21. Stateful processing: Working on per-element streams is relatively easy (e.g. changing the format of each item, or filtering out records based on their own properties). The real fun starts when you need to do transforms/aggregations over groups of elements: group by, count, max, average, joins, filtering based on properties from related records, or complex pattern detection. Graphics from The Beam Model, by Tyler Akidau and Frances Perry: https://beam.apache.org/community/presentation-materials/
  • 22. Continuous and fast: Data can come in spikes, faster than we can process it, so you need reliable persistent storage for data in flight. You will need to think about how to update a system that never stops receiving data. Since data is never complete, stateful computations need to decide when to output data (windowing).
  • 23. Processing-time based windows (figure axis: processing time). Graphics from The Beam Model, by Tyler Akidau and Frances Perry: https://beam.apache.org/community/presentation-materials/
  • 24. Event-time based windows (figure axes: event time and processing time; panels: input and output). Graphics from The Beam Model, by Tyler Akidau and Frances Perry: https://beam.apache.org/community/presentation-materials/
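As a rough illustration of event-time windows, the sketch below assigns each event to a fixed (tumbling) window using the timestamp carried by the event rather than its arrival order. It assumes events are hypothetical `(event_time_seconds, value)` pairs and a 60-second window; `window_start` and `tumbling_totals` are illustrative names, not part of any framework.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window size, chosen only for the example


def window_start(event_time, size=WINDOW_SECONDS):
    """Start of the fixed window that contains event_time."""
    return event_time - (event_time % size)


def tumbling_totals(events, size=WINDOW_SECONDS):
    """Group (event_time, value) pairs by event-time window.

    Returns {window_start: (count, total, average)}.
    """
    buckets = defaultdict(lambda: [0, 0.0])  # window_start -> [count, sum]
    for event_time, value in events:
        bucket = buckets[window_start(event_time, size)]
        bucket[0] += 1
        bucket[1] += value
    return {
        start: (count, total, total / count)
        for start, (count, total) in sorted(buckets.items())
    }


# Events listed in arrival order; note that event times are out of order.
events = [(5, 1.0), (65, 2.0), (59, 3.0), (120, 4.0), (61, 6.0)]
windows = tumbling_totals(events)
for start, (count, total, average) in windows.items():
    print(start, count, total, average)
```

Because the grouping key comes from the event itself, the event stamped 59 lands in the first window even though it arrived after the event stamped 65.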
  • 25. Session windows (figure axes: event time and processing time; panels: input and output). Graphics from The Beam Model, by Tyler Akidau and Frances Perry: https://beam.apache.org/community/presentation-materials/
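Session windows group events that are close together in time and start a new window after a period of inactivity. The sketch below is a simplified, batch-style version that assumes all timestamps for one key are already available and sorts them; a streaming engine would instead merge sessions incrementally as events arrive. The function name `sessionize` and the 30-second gap are assumptions for illustration.

```python
def sessionize(timestamps, gap):
    """Split event timestamps into sessions separated by more than `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # inactivity exceeded the gap: new session
    return sessions


# Hypothetical event times (seconds) for a single user, in arrival order.
sessions = sessionize([100, 10, 12, 15, 104, 300], gap=30)
print(sessions)
```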
  • 26. Correctness: late-arriving data. Event time vs processing time. Graphics from The Beam Model, by Tyler Akidau and Frances Perry: https://beam.apache.org/community/presentation-materials/
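One common way to handle late-arriving data is a watermark: an estimate of how far event time has progressed, after which a window is considered complete. The sketch below is a deliberately simplified heuristic, not how any particular engine implements it. It assumes the watermark is the largest event time seen minus a fixed `allowed_lateness`, and it simply sets aside events whose window has already closed. All names are hypothetical.

```python
def process(events, window_size, allowed_lateness):
    """Sum values per event-time window, closing windows behind a watermark.

    events: (event_time, value) pairs in arrival order.
    Returns (closed_windows, open_windows, late_events).
    """
    open_windows = {}
    closed = {}
    late = []
    watermark = float("-inf")
    for event_time, value in events:
        start = event_time - event_time % window_size
        if start + window_size <= watermark:
            late.append((event_time, value))  # its window was already emitted
            continue
        open_windows[start] = open_windows.get(start, 0) + value
        watermark = max(watermark, event_time - allowed_lateness)
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                closed[s] = open_windows.pop(s)  # window is complete: emit it
    return closed, open_windows, late


# Arrival order; the event stamped 3 shows up after its window has closed.
events = [(1, 1), (12, 1), (8, 1), (17, 1), (3, 1), (26, 1)]
closed, still_open, late = process(events, window_size=10, allowed_lateness=5)
print(closed, still_open, late)
```

The event stamped 8 is out of order but still counted, because the watermark had not yet passed the end of its window. A larger `allowed_lateness` trades latency for completeness.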
  • 27. Correctness: delivery semantics. Exactly once; at least once; at most once.
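A small sketch may help show why delivery semantics matter for a total. Under at-least-once delivery the same message can arrive more than once, so a naive sum over-counts. One common mitigation, assumed here, is an idempotent consumer that remembers message identifiers. The class name and the message IDs are hypothetical, and a real system would need to bound or persist the set of seen IDs.

```python
class IdempotentCounter:
    """Adds each message's amount once, even if the message is redelivered."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def handle(self, message_id, amount):
        if message_id in self.seen_ids:
            return False  # duplicate delivery: ignore it
        self.seen_ids.add(message_id)
        self.total += amount
        return True


# At-least-once delivery: m1 and m2 are each delivered twice.
deliveries = [("m1", 5), ("m2", 7), ("m1", 5), ("m3", 1), ("m2", 7)]

naive_total = sum(amount for _, amount in deliveries)

counter = IdempotentCounter()
for message_id, amount in deliveries:
    counter.handle(message_id, amount)

print(naive_total, counter.total)
```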
  • 28. Reactive: All the components need to be designed for low latency. (Figure repeated from slide 4: the value of data to decision-making over time. Source: Perishable insights, Mike Gualtieri, Forrester.)
  • 29. Components of a streaming analytics pipeline
  • 30. Streaming analytics components: Devices and/or applications that produce real-time data at high velocity. Data from tens of thousands of data sources can be written to a single stream. Data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time. Records are read in the order they are produced, enabling real-time analytics or streaming ETL. Database (NoSQL most common), message broker, notification system, file storage, or data lake. Analytics dashboard.
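The stream storage behavior described on slide 30 (records kept in arrival order for a set retention period, and replayable during that period) can be modeled with a toy in-memory log. This is only a sketch under stated assumptions: `StreamLog` and its methods are invented names, time is passed in explicitly as `now` so the example is deterministic, and nothing here reflects the actual storage format of Kafka or Kinesis.

```python
class StreamLog:
    """Toy append-only log with offsets, time-based retention, and replay."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = []  # (offset, arrival_time, payload), in arrival order
        self.next_offset = 0

    def append(self, payload, now):
        offset = self.next_offset
        self.records.append((offset, now, payload))
        self.next_offset += 1
        return offset

    def expire(self, now):
        """Drop records older than the retention period."""
        cutoff = now - self.retention_seconds
        self.records = [r for r in self.records if r[1] >= cutoff]

    def read_from(self, offset):
        """Replay every retained record at or after `offset`, in order."""
        return [(o, p) for o, _, p in self.records if o >= offset]


log = StreamLog(retention_seconds=100)
log.append("a", now=0)
log.append("b", now=50)
log.append("c", now=120)

first_read = log.read_from(0)   # a consumer reading from the beginning
replay = log.read_from(1)       # another consumer replaying from offset 1
log.expire(now=130)             # "a" is now older than the retention period
after_expiry = log.read_from(0)
print(first_read, replay, after_expiry)
```

Consumers only track an offset, so several of them can read the same data independently and re-read it for as long as it is retained.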
  • 31. The (excellent) Open Source ecosystem
  • 32. Ingestion/in-stream storage: Apache Kafka. A distributed streaming platform. Concepts: producers, topics, brokers, consumers.
  • 33. Ingestion/in-stream storage: Apache Flume. Distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data.
  • 34. Stream processing: Apache Spark. Unified analytics engine for large-scale data processing. Concepts: driver/workers, data source, discretized stream, transforms, streaming SQL, outputs.
  • 35. Stream processing: Apache Spark. Unified analytics engine for large-scale data processing.
  • 36. Stream processing: Apache Flink. Stateful computation over data streams. Concepts: job manager/workers, source, DataStream, transforms/operators, Table API/SQL, sinks.
  • 37. Stream processing: Apache Flink. Stateful computation over data streams.
  • 38. Stream processing: Apache Flink. Stateful computation over data streams.
  • 39. Stream storage: Apache Cassandra. Manage massive amounts of data, fast, without losing sleep (https://cassandra.apache.org/). Concepts: nodes, token ring, consistency levels, column families.
  • 40. Stream storage: Apache Cassandra. Manage massive amounts of data, fast, without losing sleep.
  • 41. Stream storage: Apache HBase. The Hadoop database, a distributed, scalable, big data store. From https://hbase.apache.org/book.html: "First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle." Concepts: HBase Master, regions, region servers, data nodes, column families.
  • 42. Dashboard: Elasticsearch with Kibana. Elasticsearch is a distributed JSON-based search and analytics engine. Kibana gives shape to your data (https://www.elastic.co/kibana). Wikimedia has a live interactive dashboard powered by Kibana at https://wikimedia.biterg.io/. Concepts: master node, data nodes, shard, index.
  • 43. Dashboard: Grafana. Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored (https://grafana.com/grafana/). Wikimedia also has a live interactive metrics dashboard powered by Grafana at https://grafana.wikimedia.org/. Concepts: data source.
  • 44. Challenges of data streaming components: Difficult to set up. Tricky to scale. Hard to achieve high availability. Integration requires development. Error-prone and complex to manage. Expensive to maintain.
  • 45. AWS services for streaming analytics. Both managed services and native services.
  • 46. Streaming real-time data with AWS. * Some services scale up and down elastically, while others allow you to automate when to scale up/down. ** It is possible to have a serverless data streaming pipeline, in which you pay only for what you use. In the case of managed non-serverless services, you can dynamically adapt to your traffic.
  • 47. Services for ingestion/in-stream storage. Amazon Managed Streaming for Apache Kafka: fully managed version of Apache Kafka. Amazon Kinesis Data Streams: massively scalable, elastic, and durable real-time data streaming. Amazon Kinesis Data Firehose: the easiest way to reliably load streaming data into data lakes, data stores, and analytics services. AWS Glue with serverless streaming: simple, flexible, and cost-effective ETL.
  • 48. Services for stream processing. Amazon Kinesis Data Analytics for Apache Flink: fully managed, elastic version of Apache Flink. Amazon Kinesis Data Analytics for SQL Applications: process and analyze streaming data using standard SQL. Amazon EMR: easily run and scale Apache Spark and other big data frameworks; you can also run Apache Flink and Apache HBase on EMR. AWS Glue with serverless streaming: simple, flexible, and cost-effective ETL; supports Spark for serverless ETL.
  • 49. Services for stream storage. Amazon Keyspaces (for Apache Cassandra): scalable, highly available, and managed Apache Cassandra-compatible database service. Amazon DynamoDB: fast and flexible NoSQL database service for any scale (for example, in 2017 Samsung Cloud Service was serving 300M users with a total storage of 860TB). Amazon EMR: easily run and scale Apache HBase and other big data frameworks; you can also run Apache Flink and Apache Spark on EMR.
  • 50. Services for analytics dashboards. Amazon Elasticsearch Service: fully managed, scalable, and secure Elasticsearch service. Amazon QuickSight: fast, cloud-powered business intelligence service that makes it easy to deliver insights to everyone in your organization.
  • 51. A serverless data stream (per-element processing). A data producer continuously streams data into Kinesis Data Streams. The Lambda service continuously polls for new data (1 poll per second) and automatically invokes your function(s), shown as Lambda function A and Lambda function B, when data is found. Destinations shown: Amazon SNS and DynamoDB.
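For the per-element flow on slide 51, a Lambda function receives a batch of Kinesis records whose payloads are base64-encoded under `Records[].kinesis.data`. The sketch below decodes each record and returns the parsed items. The JSON payload shape (`sensor_id`, `value`), the `make_record` helper, and the sample event are assumptions for illustration, and the write to DynamoDB or SNS is left out so the example runs locally without AWS credentials.

```python
import base64
import json


def handler(event, context):
    """Decode each Kinesis record in the batch and collect the readings."""
    items = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        reading = json.loads(payload)  # assumes producers send JSON
        items.append({
            "sensor_id": reading["sensor_id"],
            "value": reading["value"],
            "sequence_number": record["kinesis"]["sequenceNumber"],
        })
    # A deployed function would persist `items` here (for example to
    # DynamoDB); that step is intentionally omitted from this sketch.
    return {"processed": len(items), "items": items}


def make_record(reading, sequence_number):
    """Hypothetical helper that builds a minimal Kinesis-style record."""
    data = base64.b64encode(json.dumps(reading).encode("utf-8")).decode("ascii")
    return {
        "kinesis": {
            "data": data,
            "sequenceNumber": sequence_number,
            "partitionKey": reading["sensor_id"],
        }
    }


sample_event = {
    "Records": [
        make_record({"sensor_id": "s1", "value": 21.5}, "1"),
        make_record({"sensor_id": "s2", "value": 19.0}, "2"),
    ]
}
result = handler(sample_event, None)
print(result["processed"])
```

Each invocation sees only its own batch, which is why this pattern suits per-element transforms better than aggregations that need state across batches.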
  • 52. Fully managed stateful streaming analytics
  • 53. Getting started. A great write-up on streaming analytics challenges: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying. Streaming data: https://aws.amazon.com/streaming-data/. Getting started with Apache Kafka/Amazon MSK: https://docs.aws.amazon.com/msk/latest/developerguide/what-is-msk.html. Amazon Kinesis services for streaming data: https://aws.amazon.com/kinesis/. Amazon Elasticsearch Service: https://aws.amazon.com/elasticsearch-service/. Research about models and issues in data stream systems: https://dl.acm.org/doi/10.1145/543613.543615
  • 54. Thanks. Javier Ramirez, AWS Developer Advocate, @supercoco9