Real-time streaming data analysis is increasingly important. It allows organizations to stay competitive by uncovering insights that are relevant for decision making. AWS makes it simple to capture, store, and analyze streaming data in real time. In this session, we will walk you through some of the reference architectures for streaming data processing, using a combination of tools including Amazon Kinesis Streams, AWS Lambda, and Spark Streaming on Amazon EMR. We will then discuss common use cases and best practices for real-time data analysis on AWS.
https://aws.amazon.com/pt/big-data/
2. The Value of Data
The most recent data is the most valuable
If you can use it in real time
Capture the value of your data
3. Data Sources
[Diagram] Transactional data (database reads/writes from client apps and web servers) lands in databases such as Amazon RDS and DynamoDB. File data (access logs and system logs from servers) lands in cloud storage such as Amazon S3.
4. Data Sources
[Diagram] In addition to transactional data (Amazon RDS, DynamoDB) and log files stored in Amazon S3, streaming data (IoT telemetry, metrics, clickstreams) is ingested into stream storage such as Amazon Kinesis.
5. Streaming Data Sources
Mobile apps, web clickstreams, application logs, metering records, IoT sensors, smart cities, smart buildings.
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /www/htdocs/test
Streaming data: produced continuously
6. Batch Processing vs. Stream Processing
Batch processing:
• Hourly server logs
• Weekly or monthly bills
• Daily website clickstream
• Daily fraud reports
It comes down to frequency!
Stream processing:
• Real-time metrics
• Real-time spending alerts/limits
• Real-time clickstream analysis
• Real-time anomaly detection
10. Simple Pattern for Streaming Data
Data Producer: continuously creates data and writes it to a stream; can be almost anything.
Streaming Storage: durably stores data, provides a temporary buffer, and supports very high throughput.
Data Consumer: continuously processes data; cleans, prepares, and aggregates it, transforming data into information.
Example flow: Mobile client → Amazon Kinesis → Amazon Kinesis app
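The producer, streaming storage, and consumer roles above can be sketched with an in-process queue standing in for the streaming storage layer. This is an illustrative simplification, not Kinesis itself: the queue plays the "temporary buffer" role, and the consumer turns raw records into an aggregate.

```python
import queue
import threading

def producer(stream, events):
    # Continuously writes records to the stream as they are created.
    for event in events:
        stream.put(event)
    stream.put(None)  # sentinel: no more data

def consumer(stream, results):
    # Continuously reads records and aggregates them (data -> information).
    counts = {}
    while True:
        record = stream.get()
        if record is None:
            break
        counts[record] = counts.get(record, 0) + 1
    results.update(counts)

def run_pipeline(events):
    stream = queue.Queue(maxsize=100)  # the "temporary buffer"
    results = {}
    threads = [
        threading.Thread(target=producer, args=(stream, events)),
        threading.Thread(target=consumer, args=(stream, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In a real deployment the queue would be a Kinesis stream, the producer a fleet of clients, and the consumer a KCL application or Lambda function.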
14. Amazon Kinesis: Streaming Data Made Easy
Makes it easy to capture, deliver, and process data streams on AWS:
• Amazon Kinesis Streams
• Amazon Kinesis Analytics
• Amazon Kinesis Firehose
18. KCL: Amazon Kinesis Client Library
• Kinesis Client Library (KCL) for Java, Ruby, Python, or Node.js
• Deployed on EC2 instances
• Main components:
1. Record Processor Factory – creates the record processor
2. Record Processor – processing unit that processes data from a shard in Amazon Kinesis Streams
3. Worker – processing unit that maps to each application instance
20. KCL: State Management
• One record processor maps to one shard for processing
• One worker maps to one or more record processors
• Balances shard-worker associations when worker / instance counts change
• Balances shard-worker associations when shards added / removed
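The shard-to-worker balancing described above can be illustrated with a small sketch. This mimics the behavior, not the actual KCL algorithm: each worker (one per application instance) receives roughly an equal share of the shards, and re-running the function after adding or removing workers or shards yields a rebalanced assignment.

```python
def balance_shards(shard_ids, worker_ids):
    """Spread shards as evenly as possible across workers.

    Each shard maps to exactly one record processor, and each worker
    hosts one or more record processors (illustrative sketch only).
    """
    assignment = {w: [] for w in worker_ids}
    for i, shard in enumerate(sorted(shard_ids)):
        worker = worker_ids[i % len(worker_ids)]
        assignment[worker].append(shard)
    return assignment
```

For example, 10 shards across 2 workers gives each worker 5 record processors; the real KCL coordinates this through leases stored in DynamoDB.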
21. Amazon Kinesis Firehose
Capture and submit streaming data; Firehose loads it continuously into Amazon S3, Amazon Redshift, and Amazon Elasticsearch; analyze the data using your favorite BI tools.
• Scalable
• Easy to use
• Continuous transform
• Data store integration
24. Amazon Kinesis Streams vs. Amazon Kinesis Firehose
Amazon Kinesis Streams is for use cases that require custom processing of each incoming record, with sub-second processing latency and a choice of stream processing frameworks.
Amazon Kinesis Firehose is for use cases that require zero administration, the ability to use existing analytics tools based on Amazon S3, Amazon Redshift, and Amazon Elasticsearch, and a data latency of 60 seconds or higher.
25. What is your latency requirement?
Amazon Kinesis Streams vs. Amazon Kinesis Firehose
26. Amazon Kinesis Analytics
• Apply SQL on streams
• Build real-time stream processing applications
• Easy scalability
Connect to Kinesis streams or Firehose delivery streams and run standard SQL queries against the data. Kinesis Analytics can send processed data to analytics tools so you can create alerts and respond in real time.
27. Use SQL to build real-time applications
• Connect to a streaming source
• Easily write SQL code to process streaming data
• Continuously deliver SQL results
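A typical Kinesis Analytics query is a windowed aggregation, for example counting events per key over a one-minute tumbling window. The same logic can be illustrated in plain Python (a stand-in for the streaming SQL, with a hypothetical query shown in the docstring):

```python
from collections import defaultdict

def tumbling_window_counts(records, window_seconds=60):
    """Group (timestamp, key) records into tumbling windows and count per key.

    Mimics a streaming SQL query along the lines of:
      SELECT key, COUNT(*) FROM stream
      GROUP BY key, STEP(rowtime BY INTERVAL '60' SECOND)
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in records:
        window_start = int(ts // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}
```

Each emitted window result would be one row of "continuously delivered SQL results" sent to a downstream destination.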
28. AWS Lambda
• Function code runs in response to new events
• Simple processing of new records
• Serverless
Example: a social media stream is loaded into Kinesis in real time; Lambda runs code that generates hashtag trend data and stores it in DynamoDB; the trend data is immediately available for business users to query.
Flow: STREAMS → LAMBDA → DYNAMODB
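A minimal sketch of the Lambda side of this flow, assuming the standard Kinesis event shape delivered to Lambda (a batch of records with base64-encoded data). The DynamoDB write is replaced by a returned dict so the sketch stays self-contained and runnable without AWS:

```python
import base64
import json
from collections import Counter

def handler(event, context=None):
    """Count hashtags in a batch of Kinesis records.

    A real function would write the counts to DynamoDB; here we
    return them instead, to keep the sketch self-contained.
    """
    counts = Counter()
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        for tag in payload.get("hashtags", []):
            counts[tag] += 1
    return dict(counts)
```

The batch size configured on the event source mapping controls how many records arrive per invocation.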
29. Amazon Elastic MapReduce (EMR)
• Ingest streaming data with low latency and high throughput
• Clusters running up-to-date versions of open source frameworks
• More complex and elaborate processing
Ingest streaming data through Amazon Kinesis Streams, process it with your choice of engine (Spark Streaming or Apache Flink), and send the results to S3, HDFS, or a custom destination using an open source connector.
Flow: STREAMS → EMR → S3
31. Stream Processing: Use Cases and Characteristics
Streaming Ingest-Transform-Load:
• Ingest and store raw data at high volume
• Atomic transformations
• Simple data enrichment
Continuous Metric Generation:
• Windowed analytics (count X over 5 minutes)
• Event correlation, such as sessionization
• Visualization
Actionable Insights:
• Act on the data by triggering events or alerts
• Real-time feedback loops
• Machine learning
32. Scenario: VPC Flow Logs Analysis
• VPC Flow Logs contain information about all inbound and outbound IP traffic in your VPC
• Combining Kinesis Firehose and S3 enables low-cost storage of the logs and near real-time analysis of your traffic
[Diagram] VPC subnet → Amazon CloudWatch Logs (forward VPC flow logs with a subscription) → AWS Lambda (aggregation and transformation) → Amazon Kinesis Firehose → Amazon S3 bucket → Amazon Athena → Amazon QuickSight (near real-time queries and visualization)
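CloudWatch Logs delivers subscription data to Lambda as gzip-compressed, base64-encoded JSON. A minimal sketch of the aggregation/transformation step in this architecture; the field positions assume the default VPC Flow Logs record format, which you should verify against your own log configuration:

```python
import base64
import gzip
import json

def transform_flow_logs(event):
    """Decode a CloudWatch Logs subscription payload and emit one
    simplified JSON line per flow log record (srcaddr, dstaddr, bytes)."""
    raw = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(raw))
    lines = []
    for log_event in payload["logEvents"]:
        fields = log_event["message"].split()
        # Default flow log format: version account-id interface-id srcaddr
        # dstaddr srcport dstport protocol packets bytes start end action status
        lines.append(json.dumps({
            "srcaddr": fields[3],
            "dstaddr": fields[4],
            "bytes": int(fields[9]),
        }))
    return "\n".join(lines)
```

The returned newline-delimited JSON is the shape Firehose would deliver to S3 for Athena to query.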
33. Scenario: Time-Series Analytics
• IoT data arrives via MQTT
• Consolidate metrics every 10 seconds
• Store the time series in an RDS MySQL database for queries and BI
[Diagram] IoT sensors → AWS IoT → Amazon Kinesis Streams (ingest sensor data) → Amazon Kinesis Analytics (compute average temperature every 10 sec) → Amazon Kinesis Streams → AWS Lambda function (persist time-series analytics to the database) → RDS MySQL DB instance
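The "average temperature every 10 seconds" step, plus the Lambda-to-MySQL persistence, can be sketched as follows. The table name and SQL rendering are illustrative only; a real function would use parameterized queries through a MySQL client rather than string formatting:

```python
from collections import defaultdict

def average_every_10s(readings):
    """readings: iterable of (epoch_seconds, temperature).
    Returns {window_start: average_temperature} for 10-second tumbling windows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, temp in readings:
        window = int(ts // 10) * 10
        sums[window] += temp
        counts[window] += 1
    return {w: sums[w] / counts[w] for w in sorted(sums)}

def persist_rows(averages, table="sensor_avg"):
    """Render the INSERT statements the Lambda step would execute
    against RDS MySQL (hypothetical table name; sketch only)."""
    return [
        f"INSERT INTO {table} (window_start, avg_temp) VALUES ({w}, {avg:.2f});"
        for w, avg in sorted(averages.items())
    ]
```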
34. Scenario: Real-Time Anomaly Detection
• Website clickstream via API Gateway and Amazon Kinesis Streams
• Use Amazon Kinesis Analytics to score data points and identify behavioral trends
• Send notifications via Amazon SNS
[Diagram] Amazon API Gateway → Amazon Kinesis Streams (ingest clickstream data) → Amazon Kinesis Analytics (detect anomalies and take action) → Amazon Kinesis Streams → Lambda function → Amazon SNS topic → email and SMS notifications to users
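Kinesis Analytics provides a RANDOM_CUT_FOREST function for anomaly scoring; as a simpler stand-in, a rolling z-score sketch captures the same idea of flagging data points that deviate sharply from recent behavior:

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(values, window=20, threshold=3.0):
    """Return (index, value) pairs for points more than `threshold`
    standard deviations from the mean of the preceding `window` points."""
    history = deque(maxlen=window)
    anomalies = []
    for i, v in enumerate(values):
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(v - mu) / sigma > threshold:
                anomalies.append((i, v))
        history.append(v)
    return anomalies
```

In the architecture above, each flagged point would be written to the output stream, triggering the Lambda function that publishes to the SNS topic.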
36. "With AWS services we were able to build a highly scalable, high-performance real-time processing system."
Magazine Luiza is one of the leaders in the Brazilian retail market, with almost 60 years of history, 800+ stores, and millions of customers. With development teams spread across 5 cities, our motto is "experimentation and scale."
"Handling 300k+ requests per second with stable latency and low cost, Kinesis proved to be the ideal tool for processing our entire product catalog."
- Bruno Gouveia, Dev Specialist
37. The Challenge
• Support large swings in write volume without side effects on reads
• Ease of maintenance
• Predictable behavior, real-time monitoring, and alerts
• Lowest possible cost
40. Lots of customer examples
• Entertainment | 17 PB of game data per season
• Ads | 80 billion impressions/day
• Enterprise | 100 GB/day of clickstreams from 250+ sites
• Retail | 10 million events/day
• IoT | 1 billion events/week
41. AWS Blog: Hands-On Guides and Sample Code
https://aws.amazon.com/blogs/big-data/
46. Don't have the official AWS Summit São Paulo app yet?
http://amzn.to/2rOcsVy
Don't forget to rate the sessions in the app!
Editor's notes
What is the customer going to get out of this presentation?
You are going to learn:
Core streaming processing use cases
Overview of AWS services that can help you solve these use cases including all Amazon Kinesis Streams, Kinesis Analytics, Kinesis Firehose, AWS Lambda, and Amazon EMR
Deep dive into three core stream processing use cases: streaming ingest-transform-load, continuous metric generation, and responsive analysis
Finally, we’ll give some information on where to go next to start implementing your next use case
Perishable Insights (M. Gualtieri, Forrester)
Speed matters in business!
http://www.slideshare.net/AmazonWebServices/bdt207-use-streaming-analytics-to-exploit-perishable-insights-aws-reinvent-2014-41775034/34-2014_Forrester_Research_Inc_Reproduction/34
To capture “Perishable Insights”:
Insights that can provide incredible value but the value expires and evaporates once the moment is gone.
Mike Gualtieri, Principal Analyst at Forrester Research
Use the type of data your environment is collecting and the manner in which it is collected to determine what kind of ingestion solution is ideal for your needs.
Transactional data needs to be able to quickly store and retrieve small pieces of data. End-users need quick and straight-forward access to the data, making app and web servers the ideal ingestion methods. For the same reasons, databases, such as DynamoDB and Amazon RDS, are usually the best solution for these kinds of processes.
Data transmitted through individual files typically is ingested from connected devices. This kind of log data does not require fast storage and retrieval like transactional data does, and also transferal is one-way, so ingestion can come from a wide-variety of sources. This kind of data is typically stored in Amazon S3.
Stream data, such as click-stream logs, should be ingested through an appropriate solution such as a Kinesis-enabled application or fluentd. Initially, these logs are stored in stream storage solutions such as Kinesis, so they're available for real-time processing and analysis. Long-term storage of these logs would best be done in a low-cost solution such as Amazon S3.
Narrative: The reality is that most data is produced continuously and is coming at us at lightning speeds due to an explosive growth of real-time data sources.
TP: Machine data will make up 40% of our digital universe by 2020
TP: Over 60MM units will be deployed by retailers and bank branches by 2019
Narrative: Whether it is log data coming from mobile and web applications, purchase data from ecommerce sites, or sensor data from IoT devices, it all delivers information that can help companies learn about what their customers, organization, and business are doing right now.
TP: Customer Benefits
Improve operational efficiencies, improve customer experiences, new business models
Smart building: reduce energy costs, cut maintenance, increase safety and security
Smart textiles: monitor skin temperature, monitor stress
Key Points
Data producers write data in real-time to streaming storage like Amazon Kinesis Streams, Amazon Kinesis Firehose, or Apache Kafka
Lots of variety including: Mobile Apps, Web Clickstream, Application Logs, Metering Records, IoT Sensors, Smart Buildings
The key point is that this data is written to the stream as soon after it is created as possible, for low-latency processing
Data is durably ingested and stored for a temporary time period in streaming storage
Streaming service ingests the data durably and holds in a temporary storage for processing
Durability is important because you don't want to lose data, so high availability and persistence across multiple storage devices are important
High throughput is important because the sheer amount of data being created is enormous. See past point for examples
Options include managed services including Amazon Kinesis Streams and Amazon Kinesis Firehose, as well as self-managed solutions like running Apache Kafka on EC2
Other solutions that are migrating to streaming include a selection of queueing and messaging technologies like RabbitMQ. The programming model is different for streaming, and we believe it is a better and more modern approach to messaging.
Amazon SQS - Managed message queue service
Apache Kafka - High throughput distributed messaging system
Amazon Kinesis Streams - Managed stream storage + processing
Amazon Kinesis Firehose - Managed data delivery
Amazon DynamoDB - Managed NoSQL database, Tables can be stream-enabled
Data consumer apps process and send data downstream for later analysis.
Top use cases include 1) raw archival / transformation to long-term storage, 2) continuous metric generation, and 3) responsive analysis. We will cover these in great detail today.
Data is cleansed, prepared, and aggregated as it arrives on a continuous basis
Data is sometimes sent to multiple streams for various processing and analytics applications, creating a multi-stage streaming data pipeline. One of these streams may persist the data to a database for permanent storage, one may store it in S3, and another may react to the data in real time. Lots of different use cases.
In this example, a business's social media presence and brand engagement is analyzed so that business intelligence users can respond and adjust to a variety of circumstances:
Since they don't require any processing, clickstream logs are collected by Amazon Kinesis Firehose.
From there, they are stored directly into Amazon S3.
At scheduled intervals, old logs are archived into Amazon Glacier. They can be restored back to S3 later if the need for a long-term analysis arises.
Then, using a Kinesis Producer, relevant social media interactions are configured to be collected. The producer directs this stream of data into one Kinesis stream. The Kinesis stream sends user metadata from the social media interactions to a Kinesis Consumer app, which then processes, combines, and makes them conform to uniform files. This metadata is associated with users using unique ids.
These files are also compressed and stored in Amazon S3. Old data is regularly cycled out and archived to Glacier, just as with the logs.
Amazon Redshift pulls logs and metadata from Amazon S3, and performs analysis jobs to provide insight on the type of responses and attention they're getting on social media, from where and by whom.
This insight is then output in a meaningful, readable way to a custom dashboard (using visualization software or something created specifically for their purposes), where users can create and share reports that can be used to make informed decisions about how to adjust their company's branding and social media presence.
The raw stream of data is also sent by the Kinesis Stream to a third Kinesis Consumer application, this one tasked with stripping down, combining, and conforming data with the purpose of identifying possible trends as they are happening in near real-time.
This analysis can be performed by Amazon EMR, Amazon EC2, Amazon Redshift, or even a slightly more complex model using Amazon Machine Learning. In this circumstance, they're using Amazon EC2 to perform the analysis on the processed data.
This output is then also sent to the custom dashboard for BI users to view and respond to quickly, as trends are potentially occurring.
Amazon Kinesis is a streaming platform on AWS that offers three services so you can build the solution that matches your use case.
Amazon Kinesis Streams – The first cloud-based streaming service ever launched, in 2013. Allows technical developers to ingest and process large volumes of streaming data. Provides the flexibility for customers to use the framework of their choice to process data, such as an AWS managed service like AWS Lambda or Amazon EMR, or an open source library like the Amazon Kinesis Client Library (KCL).
Amazon Kinesis Firehose – Takes the number one use case for streaming data, accelerated ingest-transform-load, and makes it super simple. Service launched in 2015 and is easy to setup and dead simple to operate with virtually no management overhead. Developers of all types can use the service because it doesn’t require building an application.
Amazon Kinesis Analytics – Allows customers to easily build stream processing applications using standard SQL. Developers of all types can use the service including data scientists, systems engineers, and database administrators.
Key Points
You use the service to create streams which are written to by data producers and read from by data consumers in real time. In this example, we are sending clickstream data from web clients to a Kinesis stream.
Streams consist of shards, with each shard supporting 1 MB/sec write and 2 MB/sec read. To add scale, you add shards.
Kinesis Streams provides you the flexibility to use any data processing framework you would like.
Easy administration: Simply create a new stream, and set the desired level of capacity with shards. Scale to match your data throughput rate and volume.
Build real-time applications: Perform continual processing on streaming big data using Kinesis Client Library (KCL), Kinesis Analytics, Apache Spark/Storm, AWS Lambda, and more.
Low cost: Cost-efficient for workloads of any scale.
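The per-shard limits above (1 MB/sec write and 2 MB/sec read, plus the 1,000 records/sec per-shard write limit) translate directly into a sizing calculation for "scale to match your data throughput":

```python
import math

def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Minimum shard count for a Kinesis stream, given the per-shard
    limits: 1 MB/sec and 1,000 records/sec for writes, 2 MB/sec for reads."""
    return max(
        math.ceil(write_mb_per_sec / 1.0),
        math.ceil(records_per_sec / 1000.0),
        math.ceil(read_mb_per_sec / 2.0),
        1,  # a stream always has at least one shard
    )
```

For example, a workload writing 5 MB/sec in 2,000 records/sec and fanned out to readers consuming 12 MB/sec is read-bound and needs 6 shards.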
Key Points:
The KCL is one of the most popular stream data processing frameworks in the world, and is the most commonly used solution with Kinesis Streams. However, it is not a fully managed solution; it is open source software running on EC2.
Does a lot of the heavy lifting for you, such as load balancing across EC2 instances.
The KCL is used as part of the underlying architecture of almost all of the latter examples in this presentation.
Each EC2 instance / JVM gets a worker. Each worker will setup the record processors for every shard. For example, if you have 10 shards and two EC2 instances, each EC2 instance will get five shards. The KCL handles mapping shards and EC2 instances so if you add or subtract from either, it will detect and adjust.
Use the KCL to build consumers that read and process records from a stream.
Example
A consumer called SensorAlertApplication uses the KCL to read and process sensor readings from a stream. It raises an alert when temperature readings from a sensor exceed a specified threshold.
SensorAlertApplication creates a KCL worker that uses a record processor factory (SensorRecordProcessorFactory of type IRecordProcessorFactory) to instantiate record processors (SensorRecordProcessor of type IRecordProcessor).
The relationship between the consumer application, worker, record processor, and shard is as follows:
One instance of a consumer application has one KCL worker.
The KCL worker is associated with a KinesisClientLibConfiguration. The KinesisClientLibConfiguration can be used to configure the position in stream from which to start reading records and the metrics to be published by the KCL.
The KCL worker is also associated with an IRecordProcessorFactory object. The KCL worker uses the IRecordProcessorFactory object to instantiate record processors.
The KCL worker can have one or more record processors.
Each record processor is responsible for processing records in a single shard. The KCL worker instantiates a record processor for every shard that it manages. The record processor contains the application-specific business logic to process individual records. In the case of SensorAlertApplication, each instance of SensorRecordProcessor reads sensor data from a different shard. It checks if each sensor reading is above a specific threshold and notifies the SensorAlertApplication when the reading exceeds the threshold.
Each KCL worker, composed of several record processors, can thus process multiple shards.
Each consumer application must have a unique application name. A consumer application runs on an EC2 instance. You can run the consumer application on multiple EC2 instances at the same time. The application instances will load balance and split the handling of shards amongst themselves.
The KCL creates an Amazon DynamoDB table with the application name and uses the table to maintain state information (such as checkpoints and worker-shard mapping) for the application. Each application has its own DynamoDB table.
The KCL automatically creates a DynamoDB table for each Kinesis application to keep track and maintain state information.
The KCL uses the name of the Amazon Kinesis application as the name of the DynamoDB table. Each application name must be unique within a region.
Your account is charged for this DynamoDB usage.
Each row in the DynamoDB table represents a shard in the Kinesis stream.
Application State Data:
Shard ID: Hash key for the shard that is being processed by your application.
Checkpoint: The most recent checkpoint sequence number for the shard.
Worker ID: The ID of the worker that is processing the shard.
Lease/Heartbeat: A count that is periodically incremented by the record processor.
The read/write throughput for the DynamoDB table depends on the number of shards and how often your application checkpoints.
Once a shard is completely processed, the record processor that was processing it should receive a call to its shutdown method with the reason "TERMINATE". Clients should call checkpointer.checkpoint() when the shutdown reason is "TERMINATE" so that the end of the shard is checkpointed. If cleanupLeasesUponShardCompletion is set to true in your KinesisClientLibConfiguration, this will allow the shard sync task to clean up the lease for that shard. You can check your lease table in DynamoDB; it will be named the same as the application name provided to your workers. In that table you can confirm whether or not the leases for the closed shards have been checkpointed at SHARD_END.
If the Amazon DynamoDB table for your Amazon Kinesis application does not exist when the application starts up, one of the workers creates the table and calls the describeStream method to populate the table.
Throughput
If your Amazon Kinesis application receives provisioned-throughput exceptions, you should increase the provisioned throughput for the DynamoDB table. The KCL creates the table with a provisioned throughput of 10 reads per second and 10 writes per second, but this might not be sufficient for your application. For example, if your Amazon Kinesis application does frequent checkpointing or operates on a stream that is composed of many shards, you might need more throughput.
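A rough way to sanity-check the default 10 writes/sec provisioning against your workload. This sketch assumes each record processor performs one checkpoint write per checkpoint interval and one lease renewal write per renewal interval, which is a simplification of the KCL's actual write pattern:

```python
import math

def estimate_lease_table_writes(num_shards, checkpoint_interval_s,
                                lease_renew_interval_s=10.0):
    """Rough estimate of DynamoDB writes/sec from a KCL application.

    Simplified model: one checkpoint write plus one lease-renewal
    (heartbeat) write per shard per respective interval.
    """
    per_shard = 1.0 / checkpoint_interval_s + 1.0 / lease_renew_interval_s
    return math.ceil(num_shards * per_shard)
```

For a 100-shard stream checkpointing every 60 seconds, the estimate already exceeds the default 10 writes/sec, which is exactly the situation the paragraph above warns about.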
Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift, and other destinations without writing an application or managing infrastructure.
Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 secs using simple configurations.
Seamless elasticity: Seamlessly scales to match data throughput w/o intervention
Serverless ETL using AWS Lambda - Firehose can invoke your Lambda function to transform incoming source data.
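The batch-and-compress behavior described above can be illustrated as follows. The byte threshold here is a toy value; Firehose's real buffering hints are configured in the 1-128 MB / 60-900 second ranges:

```python
import gzip

def batch_and_compress(records, max_batch_bytes=1024):
    """Group records into batches no larger than max_batch_bytes of raw
    input, then gzip each batch: a sketch of Firehose-style buffering."""
    batches, current, size = [], [], 0
    for rec in records:
        data = rec.encode()
        if current and size + len(data) > max_batch_bytes:
            batches.append(gzip.compress(b"".join(current)))
            current, size = [], 0
        current.append(data)
        size += len(data)
    if current:
        batches.append(gzip.compress(b"".join(current)))
    return batches
```

Fewer, larger, compressed objects are what make the downstream S3 PUTs and Redshift COPYs cost-effective.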
Apply SQL on streams: Easily connect to a Kinesis Stream or Firehose Delivery Stream and apply SQL skills.
Build real-time applications: Perform continual processing on streaming big data with sub-second processing latencies.
Easy Scalability : Elastically scales to match data throughput and query complexity.
Connect to streaming source: select Kinesis Streams or Firehose delivery streams as input data for your SQL code
Easily write SQL code to process streaming data - Author applications with a single SQL statement, or build sophisticated applications with multi-step pipelines with advanced analytical functions
Continuously deliver SQL results - Configure one to many outputs for your processed data for real-time alerts, visualizations, and batch analytics
There are three major components of a streaming application: connecting to an input stream, the SQL code itself, and an output destination.
Inputs include Kinesis Streams and Firehose.
Standard SQL code consists of one or more SQL queries which make up an application. Most applications will have one to three SQL statements that perform ETL after initial schematization, and then several SQL queries to generate real-time analytics on the transformed data. This illustrates the point of the “streaming application” concept; you are chaining SQL statements together in serial and parallel through filters, joins, and merges to build a streaming graph within your application. We believe most customers will have between 5-10 SQL statements, but some customers will build sophisticated applications with 50-100 statements.
Outputs include Kinesis Streams and Firehose (S3, Redshift, and Elasticsearch). Custom destinations can be enabled by writing processed SQL results to a Kinesis Stream and then using Lambda to write to the destination of choice.
Key Points
With Lambda, you create a function that is invoked based on some trigger. For Kinesis, the trigger is records arriving in the stream which will then be passed to your function code.
The event your Lambda function receives is a collection of records AWS Lambda reads from your stream. When you configure event source mapping, the batch size you specify is the maximum number of records that you want your Lambda function to receive per invocation.
You can use AWS Lambda function in conjunction with Kinesis Streams (as a data consumer) and Kinesis Firehose (as a transformation step as part of delivery).
AWS Lambda functions are typically used to perform event-based processing. This means it is well suited for performing simple processing or enrichment. You can use AWS Lambda to perform data validation, filtering, sorting, or other transformations for every data change in a DynamoDB table and load the transformed data to another data store.
AWS Lambda is a compute service that runs your code in response to events and automatically manages the compute resources for you, making it easy to build applications that respond quickly to new information. AWS Lambda starts running your code within milliseconds of an event such as an image upload, in-app activity, website click, or output from a connected device. You can also use AWS Lambda to create new back-end services where compute resources are automatically triggered based on custom requests. With AWS Lambda you pay only for the requests served and the compute time required to run your code. Billing is metered in increments of 100 milliseconds, making it cost-effective and easy to scale automatically from a few requests per day to thousands per second.
Key Points
Key is flexibility over ease of use.
You have complete control over your cluster. You have root access to every instance, you can easily install additional applications, and you can customize every cluster. Amazon EMR also supports multiple Hadoop distributions and applications.
Consume and process real-time data from Amazon Kinesis, Apache Kafka, or other data streams with Spark Streaming on Amazon EMR. Perform streaming analytics in a fault-tolerant way and write results to Amazon S3 or on-cluster HDFS.
Amazon EMR running Spark Streaming or Apache Flink allows you to ingest data from anywhere, including Kinesis Streams, Apache Kafka, RabbitMQ, and more. That is the power of open source: there is a large number of connectors available.
You can spend less time tuning and monitoring your cluster. Amazon EMR has tuned Hadoop for the cloud; it also monitors your cluster —retrying failed tasks and automatically replacing poorly performing instances.
Key Points
Customers tend to start with mundane, unglamorous applications, such as system log ingestion and archival. However, as they progress through their streaming data journey, they add sophistication to their processing. More value comes from the data as it is processed closer to real time.
Initially they may simply process their data streams to produce simple metrics and reports, and perform simple actions in response, such as emitting alarms over metrics that have breached their warning thresholds.
Eventually they start to do more sophisticated forms of data analysis in order to obtain deeper understanding of their data, such as applying machine learning algorithms.
In the long run they can be expected to apply a multiplicity of complex stream and event processing algorithms to their data
The benefits of real-time processing apply to pretty much every scenario where new data is being generated on a continual basis. We see it pervasively across all the big data areas we’ve looked at, and in pretty much every industry segment whose customers we’ve spoken to.
We then looked at specific use cases
Alert when public sentiment changes
Customers want to respond quickly to social media feeds
Twitter Firehose: ~450 million events per day, or 10.1 MB/sec1
Respond to changes in the stock market
Customers want to compute value-at-risk or rebalance portfolios based on stock prices
NASDAQ Ultra Feed: 230 GB/day, or 8.1 MB/sec1
Aggregate data before loading in external data store
Customers have billions of click stream records arriving continuously from servers
Large, aggregated PUTs to S3 are cost effective
Internet marketing company: 20MB/sec
Key Points
This is the use case for this example: https://aws.amazon.com/blogs/big-data/analyzing-vpc-flow-logs-with-amazon-kinesis-firehose-amazon-athena-and-amazon-quicksight/
Read before presenting so you can make the most of this slide.
Key Points:
Amazon Kinesis Analytics provides the mechanisms to take raw, real-time data and output processed, meaningful information to a variety of destinations in real-time.
Time-series analytics are a natural fit for storage in a database. The delivery mechanism of using Streams/Lambda allows you to make simple updates to the database, like counts.
Writing directly to the database with hundreds to thousands of events per second of sustained throughput is less than ideal and often not practical for performance and storage reasons. Kinesis Analytics allows you to not only buffer the data but also aggregate it in a very meaningful way by calculating a key metric, and to do so very quickly.
30 seconds to introduce the company, quickly
The biggest challenges of the project (4 at most) that were solved by using the AWS cloud
Solution diagram; explain the solution, its advantages, etc.