© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Amazon Kinesis & Big Data
Meld real-time streaming with EMR (Hadoop), &
Redshift (Data Warehousing)
Adi Krishnan, AWS Product Management, @adityak
Daniel Mintz, Director of BI, Upworthy, @danielmintz
July 10, 2014
Amazon Kinesis & Big Data
• Motivations for Stream Processing
– Origins: Internal metering capability
– Expanding the big data processing landscape
• Customer view on streaming data
• Amazon Kinesis Overview
– Amazon Kinesis Architecture
– Kinesis concepts & Demo
• Amazon Elastic MapReduce and Kinesis
– EMR connector morphs Kinesis streamed data into Hadoop framework
– Applying Hadoop frameworks to streaming data
• Amazon Kinesis and Redshift
– Upworthy presents “Shrinking Redshift data load times from 24 hours to 10 minutes”
– Presented by Daniel Mintz, Director of Business Intelligence, Upworthy
The Motivation for Continuous Processing
Origins: Internal AWS Metering Capability
Workload
• 10s of millions of records/sec
• Multiple TB per hour
• 100,000s of sources
Pain points
• Doesn’t scale elastically
• Customers want real-time alerts
• Expensive to operate
• Relies on eventually consistent storage
Expanding the Big Data Processing Landscape
Traditional Data Warehousing
• Query engine approach
• Pre-computations such as indices and dimensional views improve performance
• Historical, structured data
Hadoop Style Processing
• Hive / SQL-on-Hadoop / MapReduce / Spark
• Batch programs, or other abstractions that break down into MapReduce-style computations
• Historical, semi-structured data
Stream Processing
• Custom computations of relatively simple complexity
• Continuous processing (filters, sliding windows, aggregates) on infinite data streams
• Semi-structured or structured data, generated continuously in real time
A Generalized Data Flow
Many different technologies, at different stages of evolution
Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting
Our Big Data Transition
Old Posture
• Capture huge amounts of data and process it in hourly or daily batches
New Requirements
• Make decisions faster, sometimes in real-time
• Scale entire system elastically
• Make it easy to “keep everything”
• Multiple applications can process data in parallel
Foundation for Data Streams Ingestion, Continuous Processing
Right Toolset for the Right Job
Real-time Ingest
• Highly Scalable
• Durable
• Elastic
• Replay-able Reads
Continuous Processing FX
• Load-balancing incoming streams
• Fault-tolerance, Checkpoint / Replay
• Elastic
• Enable multiple apps to process in parallel
Enable data movement into Stores/ Processing Engines
Managed Service
Low end-to-end latency
Continuous, real-time workloads
Customer View
Customer Scenarios across Industry Segments
Data Types: IT infrastructure, Application logs, Social media, Fin. Market data, Web Clickstreams, Sensors, Geo/Location data
Scenarios | 1. Accelerated Ingest-Transform-Load | 2. Continual Metrics/KPI Extraction | 3. Responsive Data Analysis
Software/Technology | IT server, App logs ingestion | IT operational metrics dashboards | Devices/Sensor Operational Intelligence
Digital Ad Tech./Marketing | Advertising data aggregation | Advertising metrics like coverage, yield, conversion | Analytics on user engagement with ads, optimized bid/buy engines
Financial Services | Market/financial transaction order data collection | Financial market data metrics | Fraud monitoring and Value-at-Risk assessment, auditing of market order data
Consumer Online/E-Commerce | Online customer engagement data aggregation | Consumer engagement metrics like page views, CTR | Customer clickstream analytics, recommendation engines
Big streaming data comes from the small
{
"payerId": "Joe",
"productCode": "AmazonS3",
"clientProductCode": "AmazonS3",
"usageType": "Bandwidth",
"operation": "PUT",
"value": "22490",
"timestamp": "1216674828"
}
Metering Record
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700]
"GET /apache_pb.gif HTTP/1.0" 200 2326
Common Log Entry
<165>1 2003-10-11T22:14:15.003Z
mymachine.example.com evntslog - ID47
[exampleSDID@32473 iut="3"
eventSource="Application"
eventID="1011"][examplePriority@32473
class="high"]
Syslog Entry
“SeattlePublicWater/Kinesis/123/Realtime” – 412309129140
MQTT Record
<R,AMZN ,T,G,R1>
NASDAQ OMX Record
What Biz. Problem needs to be solved?
Mobile/Social Gaming
Deliver continuous, real-time game insight data from 100s of game servers
Custom-built solutions operationally complex to manage, & not scalable
• Delay with critical business data delivery
• Developer burden in building reliable, scalable platform for real-time data ingestion/processing
• Slow-down of real-time customer insights
Accelerate time to market of elastic, real-time applications, while minimizing operational overhead
Digital Advertising Tech.
Generate real-time metrics, KPIs for online ad performance for advertisers/publishers
Store + Forward fleet of log servers, and Hadoop based processing pipeline
• Lost data with Store/Forward layer
• Operational burden in managing reliable, scalable platform for real-time data ingestion/processing
• Batch-driven real-time customer insights
Generate freshest analytics on advertiser performance to optimize marketing spend, and increase responsiveness to clients
Amazon Kinesis
Managed Service for streaming data ingestion, and processing
Amazon Kinesis Architecture
[Architecture diagram: Amazon Web Services, three availability zones (AZ)]
• Millions of sources producing 100s of terabytes per hour
• Front End: Authentication, Authorization
• Durable, highly consistent storage replicates data across three data centers (availability zones)
• Ordered stream of events supports multiple readers
– Real-time dashboards and alarms
– Machine learning algorithms or sliding window analytics
– Aggregate analysis in Hadoop or a data warehouse
– Aggregate and archive to S3
Inexpensive: $0.028 per million puts
Kinesis Stream:
Managed ability to capture and store data
• Streams are made of Shards
• Each Shard ingests data up to 1 MB/sec, and up to 1000 TPS
• Each Shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by splitting or merging Shards
• Replay data inside of the 24-hour window
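The per-shard limits above turn into a simple sizing rule: a stream needs enough shards to satisfy the ingest bandwidth, the ingest record rate, and the egress bandwidth at the same time. A minimal sketch of that arithmetic, assuming the limits stated on this slide (the function name and the workload figures in the usage note are hypothetical):

```python
import math

# Per-shard limits as stated on the slide.
INGEST_MB_PER_SEC = 1.0
INGEST_RECORDS_PER_SEC = 1000
EGRESS_MB_PER_SEC = 2.0


def shards_needed(ingest_mb_per_sec, records_per_sec, egress_mb_per_sec):
    """Return the smallest shard count that satisfies all three per-shard limits."""
    by_ingest = math.ceil(ingest_mb_per_sec / INGEST_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / INGEST_RECORDS_PER_SEC)
    by_egress = math.ceil(egress_mb_per_sec / EGRESS_MB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_ingest, by_records, by_egress)
```

For example, a workload of 0.2 MB/sec made of 2,500 small records per second is bound by the record rate rather than by bandwidth, so `shards_needed(0.2, 2500, 0.4)` gives 3.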
Putting Data into Kinesis
Simple Put interface to store data in Kinesis
• Producers use a PUT call to store data in a Stream
• PutRecord {Data, PartitionKey, StreamName}
• A Partition Key is supplied by the producer and used to distribute the PUTs across Shards
• Kinesis MD5-hashes the supplied partition key; the record goes to the Shard whose hash key range contains the hash
• A unique Sequence # is returned to the Producer upon a successful PUT call
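The partition key routing described above can be sketched in a few lines: hash the key with MD5, read the digest as a 128-bit integer, and pick the shard whose hash key range contains it. This is an illustration only. The equal-sized ranges and the helper names are assumptions, and real shard ranges depend on how the stream has been split and merged.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def hash_key(partition_key):
    """MD5-hash a partition key into a 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_shard_ranges(shard_count):
    """Split the hash key space into contiguous, equal-sized ranges (assumption)."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's hash."""
    key = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= key <= end:
            return index
    raise ValueError("hash key outside all shard ranges")
```

Because the mapping is deterministic, every record with the same partition key (for example the `payerId` of a metering record) lands on the same shard, which is what keeps per-key ordering intact.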
Building Kinesis Processing Apps: Kinesis Client Library
Open Source library for fault-tolerant, continuous processing apps
• Java client library, source available on GitHub
• Build app with KCL on your EC2 instance(s)
• KCL is the intermediary between your application & the stream
– Automatically starts a Kinesis Worker for each shard
– Simplifies reading by abstracting individual shards
– Increases / decreases Workers as # of shards changes
– Checkpoints to keep track of a Worker’s location in the stream, restarts Workers if they fail
• Deploy app on your EC2 instances
– Integrates with AutoScaling groups to redistribute workers to new instances
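The checkpoint-and-restart behavior listed above can be illustrated with a toy model. This is not the KCL API (the library itself is Java). `CheckpointStore`, `run_worker`, and the integer sequence numbers are hypothetical stand-ins that only show why a restarted worker resumes from its last checkpoint.

```python
class CheckpointStore:
    """In-memory stand-in for a durable checkpoint table (hypothetical)."""

    def __init__(self):
        self._positions = {}

    def save(self, shard_id, sequence_number):
        self._positions[shard_id] = sequence_number

    def load(self, shard_id):
        return self._positions.get(shard_id)


def run_worker(shard_id, records, store, process, checkpoint_every=2, fail_at=None):
    """Process (sequence_number, data) pairs that come after the last checkpoint.

    Raises RuntimeError on reaching sequence number `fail_at` to simulate a crash.
    """
    last = store.load(shard_id)
    since_checkpoint = 0
    for sequence_number, data in records:
        if last is not None and sequence_number <= last:
            continue  # already covered by a checkpoint
        if sequence_number == fail_at:
            raise RuntimeError("simulated worker failure")
        process(data)
        since_checkpoint += 1
        if since_checkpoint == checkpoint_every:
            store.save(shard_id, sequence_number)
            since_checkpoint = 0
```

In this model a record processed after the last checkpoint but before a crash is processed again on restart, so the processing function should tolerate duplicates.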
Amazon Kinesis Connector Library
Open Source code to Connect Kinesis with S3, Redshift, DynamoDB
S3
DynamoDB
Redshift
Kinesis
ITransformer
• Defines the transformation of records from the Amazon Kinesis stream in order to suit the user-defined data model
IFilter
• Excludes irrelevant records from the processing.
IBuffer
• Buffers the set of records to be processed by specifying size limit (# of records) & total byte count
IEmitter
• Makes client calls to other AWS services and persists the records stored in the buffer.
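The four roles above chain together as transform, filter, buffer, emit. The connector library itself is Java; the following is a Python analogue with hypothetical names, meant only to show how the stages hand records to each other and when the buffer is flushed.

```python
class Pipeline:
    """Toy transformer -> filter -> buffer -> emitter chain (names are hypothetical)."""

    def __init__(self, transform, keep, emit, max_records, max_bytes):
        self.transform = transform      # plays the ITransformer role
        self.keep = keep                # plays the IFilter role
        self.emit = emit                # plays the IEmitter role
        self.max_records = max_records  # buffer limit: number of records
        self.max_bytes = max_bytes      # buffer limit: total byte count
        self._buffer = []
        self._buffered_bytes = 0

    def handle(self, raw):
        """Take one raw record (bytes) from the stream."""
        item = self.transform(raw)
        if not self.keep(item):
            return
        self._buffer.append(item)
        self._buffered_bytes += len(raw)
        if len(self._buffer) >= self.max_records or self._buffered_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        """Emit whatever is buffered as one batch and reset the buffer."""
        if self._buffer:
            self.emit(list(self._buffer))
            self._buffer.clear()
            self._buffered_bytes = 0
```

In a real deployment the emitter would write each batch to S3, DynamoDB, or Redshift; here `emit` can be any callable, such as `list.append`, which makes the flow easy to inspect.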
Sending & Reading Data from Kinesis Streams
Sending: HTTP Post, AWS SDK, AWS Mobile SDK, LOG4J, Flume, Fluentd
Consuming: Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce
Amazon Kinesis & Elastic MapReduce
Amazon Elastic MapReduce (EMR)
Managed Service for Hadoop based data processing
• Managed service
• Easy to tune clusters and trim costs
• Support for multiple data stores
• Unique features that ensure customer success on AWS
Applying batch processing to streamed data
[Diagram: Client/Sensor → Recording Service → Aggregator/Sequencer → Continuous processor for dashboard → Storage → Analytics and Reporting]
[Diagram labels: Amazon Kinesis, Amazon EMR, Streaming Data Ingestion]
What would this look like?
[Diagram: Input (User, Dev) → My Website → Log4J Appender → push to Kinesis; EMR (Hive, Pig, Cascading, MapReduce) pulls from Kinesis for processing]
Features and Functionality
• Features offered starting EMR AMI 3.0.4
– Simply spin up the EMR cluster like normal
• Logical names
– Labels that define units of work (Job A vs Job B)
• Iterations
– Provide idempotency (pessimistic locking of the Logical name)
• Checkpoints
– Create input start and end points to allow batch processing
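Logical names, iterations, and checkpoints fit together roughly as follows: the first run of an iteration records fixed start and end sequence numbers under a logical name, and any rerun reads exactly the same slice of the stream. A toy sketch of that idea, assuming small integer sequence numbers and an in-memory metadata store (`IterationLog` and `read_iteration` are hypothetical names, not the EMR connector's API):

```python
class IterationLog:
    """In-memory stand-in for the checkpoint metadata store (hypothetical)."""

    def __init__(self):
        self._bounds = {}  # (logical_name, iteration) -> (start_seq, end_seq)

    def bounds_for(self, logical_name, iteration, latest_seq, trim_horizon_seq=0):
        """Return fixed (start, end) bounds, creating them on the first run only."""
        key = (logical_name, iteration)
        if key not in self._bounds:
            previous = self._bounds.get((logical_name, iteration - 1))
            start = previous[1] if previous else trim_horizon_seq
            self._bounds[key] = (start, latest_seq)
        return self._bounds[key]


def read_iteration(stream, bounds):
    """Select records with start < sequence number <= end."""
    start, end = bounds
    return [data for seq, data in stream if start < seq <= end]
```

Two logical names keep independent bounds, so Job A and Job B can batch over the same stream without affecting each other, and rerunning iteration 0 of Job A after more data has arrived still returns the original input.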
Iterations – the run of a Job
[Diagram: a stream timeline for one Logical Name, from the Trim Horizon seqID (-24 hours) to the Latest seqID (NOW), divided into Iteration 1 (1:00 – 7:00), Iteration 2 (7:00 – 13:00), Iteration 3 (13:00 – 19:00), and Iteration 4 (19:00 – 1:00)]
Logical Names & Checkpointing – allows efficient batching
[Diagram: one Kinesis stream, from the Trim Horizon seqID (-24 hours) to the Latest seqID (NOW), read under Logical Name A (Mappers 1–4) and Logical Name B (Mappers 1–4)]
• DynamoDB: Metadata Storage
Each Kinesis shard maps 1:1 to a Hadoop map task
[Diagram: for one Logical Name, Shard 1 and Shard 2 in Kinesis are read by Mapper 1 and Mapper 2 in Hadoop in each window (1:00 – 7:00, 7:00 – 13:00, 13:00 – 19:00, 19:00 – 1:00); each window is bounded by a Start seq ID and an End seq ID, between -24 hours and NOW (Latest seqID)]
Handling stream scaling events
[Diagram: the same timeline for one Logical Name, from the Trim Horizon seqID (-24 hours) to the Latest seqID (NOW), with Split and Merge events; the shards per window change from Shards 1–2 to Shards 1–3, and a Mapper 3 appears alongside Mappers 1 and 2]
Handling errors
• InputFormat handles service errors
– Throttling: 400
– Service unavailable errors: 503
– Internal server errors: 500
– HTTP client exceptions: socket connection timeout
• Hadoop handles retry of failed map tasks
• Iterations allow retries
– Fixed input boundaries on a stream (idempotency for reruns)
– Enable multiple queries on the same input boundaries
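The service errors listed above are the kind that a reader normally retries rather than fails on. The slides do not describe the connector's retry policy, so the following is only a generic sketch with exponential backoff. `ServiceError`, `call_with_retries`, and the decision to treat every 400 as throttling are assumptions made for illustration.

```python
import time

RETRYABLE_STATUS = {400, 500, 503}  # statuses named on the slide; treating all 400s as throttling is a simplification


class ServiceError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"service returned {status}")
        self.status = status


def call_with_retries(operation, max_retries=3, backoff_seconds=0.1, sleep=time.sleep):
    """Call `operation`, retrying listed errors with exponential backoff."""
    attempt = 0
    while True:
        try:
            return operation()
        except (ServiceError, TimeoutError) as error:
            retryable = isinstance(error, TimeoutError) or error.status in RETRYABLE_STATUS
            if not retryable or attempt == max_retries:
                raise
            sleep(backoff_seconds * (2 ** attempt))
            attempt += 1
```

Passing `sleep` as a parameter keeps the function easy to exercise without real delays; a map task that still fails after the last retry would then be retried by Hadoop itself, as the slide notes.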
Hadoop Ecosystem Implementation
Implementations
• Hadoop Input format
• Hive Storage Handler
• Pig Load Function
• Cascading Scheme and Tap
Use Cases
• Join multiple data sources for analysis
• Filter and preprocess streams
• Export and archive streaming data
Writing to Kinesis using Log4J
Option | Default | Description
log4j.appender.KINESIS.streamName | AccessLogStream | Stream name to which data is to be published.
log4j.appender.KINESIS.encoding | UTF-8 | Encoding used to convert log message strings into bytes before sending to Amazon Kinesis.
log4j.appender.KINESIS.maxRetries | 3 | Maximum number of retries when calling Kinesis APIs to publish a log message.
log4j.appender.KINESIS.backoffInterval | 100ms | Milliseconds to wait before a retry attempt.
log4j.appender.KINESIS.threadCount | 20 | Number of parallel threads for publishing logs to configured Kinesis stream.
log4j.appender.KINESIS.bufferSize | 2000 | Maximum number of outstanding log messages to keep in memory.
log4j.appender.KINESIS.shutdownTimeout | 30 | Seconds to send buffered messages before application JVM quits normally.
.error("Cannot find resource XYX… go do something about it!");
Run the Ad-hoc Hive Query
Amazon Kinesis & Redshift
24 Hours to 10 Minutes
How Upworthy’s Data Pipeline uses Kinesis
Daniel Mintz, Director Business Intelligence, Upworthy, @danielmintz
What’s Upworthy
• We’ve been called
– “Social media with a mission” by our About Page
– “The fastest growing media site of all time” by Fast Company
– “The Fastest Rising Startup” by The Crunchies
– “That thing that’s all over my newsfeed” by my annoyed friends
– “The most data-driven media company in history” by me, optimistically
What We Do
• We aim to drive massive amounts of attention to things that really matter.
• We do that by finding, packaging, and distributing great, meaningful content.
Our Use Case
When We Started
• Had built a data warehouse from scratch
• Hadoop-based batch workflow
• Nightly ETL cycle
• 2.5 Engineers
• Wanted to do all three:
– Comprehensive
– Ad Hoc
– Real-Time
The Decision
• Speed up our current system, rather than building a parallel one
• Had looked at alternative stream processors
– Cost
– Maintenance
• Comfortable with concept of application log stream
How It Works
• Log Drain receives, formats, batches and zips
• PUTs 50k GZIP batches on Kinesis stream
• Three types of Kinesis consumers:
1. Archiver – Batch and write permanent record
2. Stats – Filter, sample and count; Report to StatHat
3. Transformer – Filter, batch, validate; writes temporary BSVs to S3
• Database Importer handles manifest files.
• S3 handles garbage collection.
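The Log Drain step (format, batch, zip, then PUT) can be sketched as a batching function that keeps each compressed blob under a size cap. This is an illustration, not Upworthy's code. It assumes that "50k" means 50 KB per record, and `gzip_batches` is a hypothetical name.

```python
import gzip

MAX_BATCH_BYTES = 50 * 1024  # assumes "50k" on the slide means 50 KB per record


def gzip_batches(lines, max_bytes=MAX_BATCH_BYTES):
    """Group log lines into gzip blobs that each stay within max_bytes.

    Recompressing on every line is simple but slow; it keeps the sketch short.
    A single line that compresses to more than max_bytes is still emitted alone.
    """
    batches = []
    current = []
    for line in lines:
        candidate = gzip.compress("\n".join(current + [line]).encode("utf-8"))
        if len(candidate) > max_bytes and current:
            # Adding this line would overflow the cap: close the batch without it.
            batches.append(gzip.compress("\n".join(current).encode("utf-8")))
            current = [line]
        else:
            current.append(line)
    if current:
        batches.append(gzip.compress("\n".join(current).encode("utf-8")))
    return batches
```

Each returned blob would be the `Data` of one PUT; a consumer reverses the step with `gzip.decompress` and a split on newlines.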
Our system now
• Stats:
– Average: ~1085 events/second
– Peak: ~2500 events/second
• Data is available in Redshift < 10 min
• Kinesis has been cheap, stable, and gives us redundancy and resiliency.
• Computation model that’s easy to reason about
Resiliency
• When something goes wrong, you have 24 hours.
• Timestamp at outset. Track lag at each step.
• Bigger workers (more CPU, RAM, deeper queues) can catch us up very fast.
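"Timestamp at outset, track lag at each step" amounts to subtracting the event's original timestamp from the time each stage sees it, then comparing the result with the 24-hour retention window. A minimal sketch, with hypothetical stage names and function names:

```python
def stage_lags(event_timestamp, stage_times):
    """Return seconds elapsed since the event timestamp at each pipeline step."""
    return {stage: seen_at - event_timestamp for stage, seen_at in stage_times.items()}


def within_retention(lag_seconds, retention_hours=24):
    """True while a lagging consumer can still read the data from the stream."""
    return lag_seconds < retention_hours * 3600
```

For an event stamped at `1000.0` and seen by a consumer at `1030.0` and in the warehouse at `1500.0`, the lags are 30 and 500 seconds, both far inside the 24-hour window.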
What We’ve Learned
Some Lessons
• You can use one pipeline for everything.
• High-cardinality fact data belongs in Kinesis.
• EDN works well with Kinesis.
• We prefer explicit checkpointing. (Your mileage may vary.)
• Languages that run on the JVM can take advantage of AWS Client Libraries.
Kinesis Pricing
Simple, Pay-as-you-go, & no up-front costs
Pricing Dimension | Value
Hourly Shard Rate | $0.015
Per 1,000,000 PUT transactions | $0.028
• Customers specify throughput requirements in shards, that they control
• Each Shard delivers 1 MB/s on ingest, and 2MB/s on egress
• Inbound data transfer is free
• EC2 instance charges apply for Kinesis processing applications
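The two pricing dimensions combine into a simple estimate: shard-hours times the hourly rate, plus PUT transactions times the per-million rate. The sketch below uses the 2014 prices from this slide; the function name and the example workload are hypothetical, and EC2 charges for processing applications are left out.

```python
HOURLY_SHARD_RATE = 0.015     # USD per shard-hour, from the slide
PUT_RATE_PER_MILLION = 0.028  # USD per 1,000,000 PUT transactions, from the slide


def kinesis_cost(shards, puts_per_second, hours):
    """Estimate stream cost in USD; EC2 charges for consumers are not included."""
    shard_cost = shards * hours * HOURLY_SHARD_RATE
    put_cost = puts_per_second * 3600 * hours / 1_000_000 * PUT_RATE_PER_MILLION
    return shard_cost + put_cost
```

For example, 2 shards receiving 1,000 PUTs per second for 24 hours cost 2 × 24 × $0.015 = $0.72 in shard-hours plus 86.4 million PUTs × $0.028 per million = $2.4192, or about $3.14 per day.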
Canonical Data flows with Amazon Kinesis
Continuous Metric
Extraction
Incremental Stats
Computation
Record Archiving
Live Dashboard
Try out Amazon Kinesis
• Try out Amazon Kinesis
– http://aws.amazon.com/kinesis/
• Thumb through the Developer Guide
– http://aws.amazon.com/documentation/kinesis/
• Test drive the sample app
– https://github.com/awslabs/amazon-kinesis-data-visualization-sample
• Kinesis Connector Framework
– https://github.com/awslabs/amazon-kinesis-connectors
• Read EMR-Kinesis FAQs
– http://aws.amazon.com/elasticmapreduce/faqs/#kinesis-connector
• Visit, and Post on Kinesis Forum
– https://forums.aws.amazon.com/forum.jspa?forumID=169#
Thank You!
Adi Krishnan, Product Management, AWS

Más contenido relacionado

La actualidad más candente

Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?confluent
 
Practical FinOps in Practice
Practical FinOps in PracticePractical FinOps in Practice
Practical FinOps in PracticePetri Kallberg
 
Speed and Reliability at Any Scale: Amazon SQS and Database Services (SVC206)...
Speed and Reliability at Any Scale: Amazon SQS and Database Services (SVC206)...Speed and Reliability at Any Scale: Amazon SQS and Database Services (SVC206)...
Speed and Reliability at Any Scale: Amazon SQS and Database Services (SVC206)...Amazon Web Services
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)James Serra
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsAmazon Web Services
 
Introduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache KafkaIntroduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache Kafkaconfluent
 
Amazon Web Service Sales Role Play - Case Study
Amazon Web Service Sales Role Play - Case StudyAmazon Web Service Sales Role Play - Case Study
Amazon Web Service Sales Role Play - Case StudyVineet Sood
 
Building A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWSBuilding A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWSAmazon Web Services
 
A Modern C++ Kafka API | Kenneth Jia, Morgan Stanley
A Modern C++ Kafka API | Kenneth Jia, Morgan StanleyA Modern C++ Kafka API | Kenneth Jia, Morgan Stanley
A Modern C++ Kafka API | Kenneth Jia, Morgan StanleyHostedbyConfluent
 
Microservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka EcosystemMicroservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka Ecosystemconfluent
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Jay Patel
 
Azure Serverless with Functions, Logic Apps, and Event Grid
Azure Serverless with Functions, Logic Apps, and Event Grid  Azure Serverless with Functions, Logic Apps, and Event Grid
Azure Serverless with Functions, Logic Apps, and Event Grid WinWire Technologies Inc
 
Deploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and KubernetesDeploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and Kubernetesconfluent
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon RedshiftKel Graham
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep DiveAmazon Web Services
 

La actualidad más candente (20)

Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
 
Practical FinOps in Practice
Practical FinOps in PracticePractical FinOps in Practice
Practical FinOps in Practice
 
Speed and Reliability at Any Scale: Amazon SQS and Database Services (SVC206)...
Speed and Reliability at Any Scale: Amazon SQS and Database Services (SVC206)...Speed and Reliability at Any Scale: Amazon SQS and Database Services (SVC206)...
Speed and Reliability at Any Scale: Amazon SQS and Database Services (SVC206)...
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 
Introduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache KafkaIntroduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache Kafka
 
Databases on AWS Workshop.pdf
Databases on AWS Workshop.pdfDatabases on AWS Workshop.pdf
Databases on AWS Workshop.pdf
 
Amazon Web Service Sales Role Play - Case Study
Amazon Web Service Sales Role Play - Case StudyAmazon Web Service Sales Role Play - Case Study
Amazon Web Service Sales Role Play - Case Study
 
Building A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWSBuilding A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWS
 
A Modern C++ Kafka API | Kenneth Jia, Morgan Stanley
A Modern C++ Kafka API | Kenneth Jia, Morgan StanleyA Modern C++ Kafka API | Kenneth Jia, Morgan Stanley
A Modern C++ Kafka API | Kenneth Jia, Morgan Stanley
 
Microservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka EcosystemMicroservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka Ecosystem
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012
 
Azure Serverless with Functions, Logic Apps, and Event Grid
Azure Serverless with Functions, Logic Apps, and Event Grid  Azure Serverless with Functions, Logic Apps, and Event Grid
Azure Serverless with Functions, Logic Apps, and Event Grid
 
Deploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and KubernetesDeploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and Kubernetes
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
 
Cost optimization on AWS
Cost optimization on AWSCost optimization on AWS
Cost optimization on AWS
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon Redshift
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
 
AWS SQS SNS
AWS SQS SNSAWS SQS SNS
AWS SQS SNS
 

Destacado

Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...Amazon Web Services
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysisAmazon Web Services
 
Real-Time Streaming: Intro to Amazon Kinesis
Real-Time Streaming: Intro to Amazon KinesisReal-Time Streaming: Intro to Amazon Kinesis
Real-Time Streaming: Intro to Amazon KinesisAmazon Web Services
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Chris Fregly
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveYifeng Jiang
 
Introduction to Amazon Kinesis Analytics
Introduction to Amazon Kinesis AnalyticsIntroduction to Amazon Kinesis Analytics
Introduction to Amazon Kinesis AnalyticsAmazon Web Services
 
Collecting and analyzing sensor data with hadoop or other no sql databases
Collecting and analyzing sensor data with hadoop or other no sql databasesCollecting and analyzing sensor data with hadoop or other no sql databases
Collecting and analyzing sensor data with hadoop or other no sql databasesMatteo Redaelli
 
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...Amazon Web Services
 
Going Serverless with CQRS on AWS
Going Serverless with CQRS on AWSGoing Serverless with CQRS on AWS
Going Serverless with CQRS on AWSAnton Udovychenko
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Web Services
 
giasan.vn real-estate analytics: a Vietnam case study
giasan.vn real-estate analytics: a Vietnam case studygiasan.vn real-estate analytics: a Vietnam case study
giasan.vn real-estate analytics: a Vietnam case studyViet-Trung TRAN
 
Dimensional Modeling Basic Concept with Example
Dimensional Modeling Basic Concept with ExampleDimensional Modeling Basic Concept with Example
Dimensional Modeling Basic Concept with ExampleSajjad Zaheer
 
Workshop AWS IoT @ IoT World Paris
Workshop AWS IoT @ IoT World ParisWorkshop AWS IoT @ IoT World Paris
Workshop AWS IoT @ IoT World ParisJulien SIMON
 
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon KinesisDay 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon KinesisAmazon Web Services
 
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Alexey Kharlamov
 
Real-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS LambdaReal-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS LambdaAmazon Web Services
 
Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...
Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...
Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...Amazon Web Services
 
Overview of big data in cloud computing
Overview of big data in cloud computingOverview of big data in cloud computing
Overview of big data in cloud computingViet-Trung TRAN
 

Destacado (20)

Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysis
 
Real-Time Streaming: Intro to Amazon Kinesis
Real-Time Streaming: Intro to Amazon KinesisReal-Time Streaming: Intro to Amazon Kinesis
Real-Time Streaming: Intro to Amazon Kinesis
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-dive
 
Introduction to Amazon Kinesis Analytics
Introduction to Amazon Kinesis AnalyticsIntroduction to Amazon Kinesis Analytics
Introduction to Amazon Kinesis Analytics
 
Amazon EMR
Amazon EMRAmazon EMR
Amazon EMR
 
Collecting and analyzing sensor data with hadoop or other no sql databases
Collecting and analyzing sensor data with hadoop or other no sql databasesCollecting and analyzing sensor data with hadoop or other no sql databases
Collecting and analyzing sensor data with hadoop or other no sql databases
 
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
 
Going Serverless with CQRS on AWS
Going Serverless with CQRS on AWSGoing Serverless with CQRS on AWS
Going Serverless with CQRS on AWS
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
 
giasan.vn real-estate analytics: a Vietnam case study
giasan.vn real-estate analytics: a Vietnam case studygiasan.vn real-estate analytics: a Vietnam case study
giasan.vn real-estate analytics: a Vietnam case study
 
Dimensional Modeling Basic Concept with Example
Dimensional Modeling Basic Concept with ExampleDimensional Modeling Basic Concept with Example
Dimensional Modeling Basic Concept with Example
 
Workshop AWS IoT @ IoT World Paris
Workshop AWS IoT @ IoT World ParisWorkshop AWS IoT @ IoT World Paris
Workshop AWS IoT @ IoT World Paris
 
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon KinesisDay 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
 
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
 
Real-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS LambdaReal-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
 
Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...
Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...
Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...
 
Hadoop + GPU
Hadoop + GPUHadoop + GPU
Hadoop + GPU
 
Overview of big data in cloud computing
Overview of big data in cloud computingOverview of big data in cloud computing
Overview of big data in cloud computing
 

Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

  • 1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Amazon Kinesis & Big Data Meld real-time streaming with EMR (Hadoop), & Redshift (Data Warehousing) Adi Krishnan, AWS Product Management, @adityak Daniel Mintz, Director of BI, Upworthy, @danielmintz July 10, 2014
  • 2. Amazon Kinesis & Big Data
      o Motivations for Stream Processing
          ▪ Origins: Internal metering capability
          ▪ Expanding the big data processing landscape
      o Customer view on streaming data
      o Amazon Kinesis Overview
          ▪ Amazon Kinesis Architecture
          ▪ Kinesis concepts & Demo
      o Amazon Elastic MapReduce and Kinesis
          ▪ EMR connector morphs Kinesis streamed data into Hadoop framework
          ▪ Applying Hadoop frameworks to streaming data
      o Amazon Kinesis and Redshift
          ▪ Upworthy presents “Shrinking Redshift data load times from 24 hours to 10 minutes”
          ▪ Presented by Daniel Mintz, Director of Business Intelligence, Upworthy
  • 3. The Motivation for Continuous Processing
  • 4. Origins: Internal AWS Metering Capability
      o Workload
          ▪ 10s of millions of records/sec
          ▪ Multiple TB per hour
          ▪ 100,000s of sources
      o Pain points
          ▪ Doesn’t scale elastically
          ▪ Customers want real-time alerts
          ▪ Expensive to operate
          ▪ Relies on eventually consistent storage
  • 5. Expanding the Big Data Processing Landscape
      o Traditional Data Warehousing
          ▪ Query engine approach
          ▪ Pre-computations such as indices and dimensional views improve performance
          ▪ Historical, structured data
      o Hadoop Style Processing
          ▪ HIVE / SQL-on-Hadoop / M-R / Spark
          ▪ Batch programs, or other abstractions breaking down into MR-style computations
          ▪ Historical, semi-structured data
      o Stream Processing
          ▪ Custom computations of relatively simple complexity
          ▪ Continuous processing (filters, sliding windows, aggregates) on infinite data streams
          ▪ Semi-structured or structured data, generated continuously in real time
  • 6. A Generalized Data Flow: many different technologies, at different stages of evolution
      o Stages: Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting
  • 7. Our Big Data Transition
      o Old Posture
          ▪ Capture huge amounts of data and process it in hourly or daily batches
      o New Requirements
          ▪ Make decisions faster, sometimes in real-time
          ▪ Scale entire system elastically
          ▪ Make it easy to “keep everything”
          ▪ Multiple applications can process data in parallel
  • 8. Foundation for Data Streams Ingestion, Continuous Processing: Right Toolset for the Right Job
      o Real-time Ingest
          ▪ Highly Scalable
          ▪ Durable
          ▪ Elastic
          ▪ Replay-able Reads
      o Continuous Processing FX
          ▪ Load-balancing incoming streams
          ▪ Fault-tolerance, Checkpoint / Replay
          ▪ Elastic
          ▪ Enable multiple apps to process in parallel
      o Enable data movement into Stores / Processing Engines
      o Managed Service
      o Low end-to-end latency
      o Continuous, real-time workloads
  • 10. Customer Scenarios across Industry Segments
      o Scenarios: (1) Accelerated Ingest-Transform-Load; (2) Continual Metrics/KPI Extraction; (3) Responsive Data Analysis
      o Data Types: IT infrastructure, Application logs, Social media, Fin. Market data, Web Clickstreams, Sensors, Geo/Location data
      o Software/Technology
          ▪ (1) IT server, App logs ingestion
          ▪ (2) IT operational metrics dashboards
          ▪ (3) Devices/Sensor Operational Intelligence
      o Digital Ad Tech./Marketing
          ▪ (1) Advertising Data aggregation
          ▪ (2) Advertising metrics like coverage, yield, conversion
          ▪ (3) Analytics on User engagement with Ads, Optimized bid/buy engines
      o Financial Services
          ▪ (1) Market/Financial Transaction order data collection
          ▪ (2) Financial market data metrics
          ▪ (3) Fraud monitoring, and Value-at-Risk assessment, Auditing of market order data
      o Consumer Online/E-Commerce
          ▪ (1) Online customer engagement data aggregation
          ▪ (2) Consumer engagement metrics like page views, CTR
          ▪ (3) Customer clickstream analytics, Recommendation engines
  • 11. Big streaming data comes from the small
      o Metering Record: { "payerId": "Joe", "productCode": "AmazonS3", "clientProductCode": "AmazonS3", "usageType": "Bandwidth", "operation": "PUT", "value": "22490", "timestamp": "1216674828" }
      o Common Log Entry: 127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
      o Syslog Entry: <165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"][examplePriority@32473 class="high"]
      o MQTT Record: “SeattlePublicWater/Kinesis/123/Realtime” – 412309129140
      o NASDAQ OMX Record: <R,AMZN ,T,G,R1>
  • 12. What Biz. Problem needs to be solved?
      o Mobile/Social Gaming
          ▪ Deliver continuous/real-time delivery of game insight data by 100’s of game servers
          ▪ Custom-built solutions operationally complex to manage, & not scalable
          ▪ Delay with critical business data delivery
          ▪ Developer burden in building reliable, scalable platform for real-time data ingestion/processing
          ▪ Slow-down of real-time customer insights
          ▪ Accelerate time to market of elastic, real-time applications, while minimizing operational overhead
      o Digital Advertising Tech.
          ▪ Generate real-time metrics, KPIs for online ad performance for advertisers/publishers
          ▪ Store + Forward fleet of log servers, and Hadoop based processing pipeline
          ▪ Lost data with Store/Forward layer
          ▪ Operational burden in managing reliable, scalable platform for real-time data ingestion/processing
          ▪ Batch-driven real-time customer insights
          ▪ Generate freshest analytics on advertiser performance to optimize marketing spend, and increase responsiveness to clients
  • 13. Amazon Kinesis Managed Service for streaming data ingestion, and processing
  • 14. Amazon Kinesis Architecture Amazon Web Services AZ AZ AZ Durable, highly consistent storage replicates data across three data centers (availability zones) Aggregate and archive to S3 Millions of sources producing 100s of terabytes per hour Front End Authentication Authorization Ordered stream of events supports multiple readers Real-time dashboards and alarms Machine learning algorithms or sliding window analytics Aggregate analysis in Hadoop or a data warehouse Inexpensive: $0.028 per million puts
  • 15. Kinesis Stream: Managed ability to capture and store data • Streams are made of Shards • Each Shard ingests data up to 1MB/sec, and up to 1000 TPS • Each Shard emits up to 2 MB/sec • All data is stored for 24 hours • Scale Kinesis streams by splitting or merging Shards • Replay data inside of 24Hr. Window
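The per-shard limits above translate directly into a sizing calculation. The following is a minimal sketch, not an official sizing tool: the function name `shards_needed` and its parameters are hypothetical, and the only inputs taken from the slide are the three per-shard limits (1 MB/sec in, 1000 records/sec in, 2 MB/sec out).

```python
import math

# Per-shard limits as stated on the slide.
INGEST_MB_PER_SEC = 1.0
INGEST_RECORDS_PER_SEC = 1000
EGRESS_MB_PER_SEC = 2.0


def shards_needed(ingest_mb_per_sec, records_per_sec, egress_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits.

    Hypothetical helper: the stream needs enough shards for whichever
    dimension (ingest bytes, ingest records, egress bytes) is the tightest.
    """
    by_ingest = math.ceil(ingest_mb_per_sec / INGEST_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / INGEST_RECORDS_PER_SEC)
    by_egress = math.ceil(egress_mb_per_sec / EGRESS_MB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_ingest, by_records, by_egress)
```

A workload of many small records can be bound by the records-per-second limit rather than by bytes, which is why all three dimensions are checked.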
  • 16. Putting Data into Kinesis Simple Put interface to store data in Kinesis • Producers use a PUT call to store data in a Stream • PutRecord {Data, PartitionKey, StreamName} • A Partition Key is supplied by the producer and used to distribute the PUTs across Shards • Kinesis applies an MD5 hash to the supplied partition key and routes the record to the Shard whose hash key range contains the result • A unique Sequence # is returned to the Producer upon a successful PUT call
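To make the partition-key routing concrete, here is a small local simulation. It is a sketch under stated assumptions: the helper names `even_shard_ranges` and `shard_for_key` are hypothetical, and the shards are assumed to split the 128-bit MD5 space into equal contiguous ranges, which need not hold after splits and merges. It does not call the Kinesis service.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def even_shard_ranges(shard_count):
    """Split the hash key space into contiguous, equal-sized ranges (an assumption)."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last range absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose hash key range contains MD5(partition_key)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key outside all shard ranges")
```

Because the mapping depends only on the key, every record with the same partition key lands on the same shard, which is what keeps per-key ordering intact.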
  • 17. Building Kinesis Processing Apps: Kinesis Client Library Open Source library for fault-tolerant, continuous processing apps • Java client library, source available on GitHub • Build app with KCL on your EC2 instance(s) • KCL is the intermediary between your application & stream • Automatically starts a Kinesis Worker for each shard • Simplifies reading by abstracting individual shards • Increases / decreases Workers as # of shards changes • Checkpoints to keep track of a Worker’s location in the stream, and restarts Workers if they fail • Deploy app on your EC2 instances • Integrates with AutoScaling groups to redistribute workers to new instances
  • 18. Amazon Kinesis Connector Library Open Source code to Connect Kinesis with S3, Redshift, DynamoDB • ITransformer: Defines the transformation of records from the Amazon Kinesis stream in order to suit the user-defined data model • IFilter: Excludes irrelevant records from the processing • IBuffer: Buffers the set of records to be processed by specifying size limit (# of records) & total byte count • IEmitter: Makes client calls to other AWS services and persists the records stored in the buffer
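The four stages above form a simple pipeline: transform, filter, buffer, emit. The sketch below shows that flow in Python for illustration only. The actual Connector Library is Java, and the `Buffer` class and `run_pipeline` function here are hypothetical stand-ins, not its API.

```python
class Buffer:
    """Collects records until a record-count or byte-size limit is reached."""

    def __init__(self, max_records, max_bytes):
        self.max_records = max_records
        self.max_bytes = max_bytes
        self.records = []
        self.byte_count = 0

    def add(self, record, size):
        self.records.append(record)
        self.byte_count += size

    def should_flush(self):
        return len(self.records) >= self.max_records or self.byte_count >= self.max_bytes

    def drain(self):
        """Return the buffered records and reset the buffer."""
        drained, self.records, self.byte_count = self.records, [], 0
        return drained


def run_pipeline(raw_records, transform, keep, buffer, emit):
    """Transform, filter, and buffer records; emit each full buffer and the remainder."""
    for raw in raw_records:
        record = transform(raw)      # ITransformer role
        if not keep(record):         # IFilter role
            continue
        buffer.add(record, len(raw)) # IBuffer role
        if buffer.should_flush():
            emit(buffer.drain())     # IEmitter role
    if buffer.records:
        emit(buffer.drain())
```

Passing the stages in as plain callables keeps each one independently replaceable, which mirrors the interface-per-stage design the slide describes.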
  • 19. Sending & Reading Data from Kinesis Streams. Sending: HTTP POST, AWS SDK, AWS Mobile SDK, Log4J, Flume, Fluentd. Consuming: Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce.
  • 20. Amazon Kinesis & Elastic MapReduce
  • 21. Amazon Elastic MapReduce (EMR) Managed Service for Hadoop based data processing • Managed service • Easy to tune clusters and trim costs • Support for multiple data stores • Unique features that ensure customer success on AWS
  • 22. Applying batch processing to streamed data Client/ Sensor Recording Service Aggregator/ Sequencer Continuous processor for dashboard Storage Analytics and Reporting Amazon Kinesis Amazon EMR Streaming Data Ingestion
  • 23. What would this look like? [Diagram: Input: users and devs generate traffic on My Website, whose Kinesis Log4J Appender pushes to Kinesis. Processing: EMR (Hive, Pig, Cascading, MapReduce) pulls from Kinesis.]
  • 24. Features and Functionality • Features offered starting EMR AMI 3.0.4 – Simply spin up the EMR cluster like normal • Logical names – Labels that define units of work (Job A vs Job B) • Iterations – Provide idempotency (pessimistic locking of the Logical name) • Checkpoints – Create input start and end points to allow batch processing
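One way to picture how logical names, iterations, and checkpoints work together is a store that records the start and end sequence IDs of each iteration, so that a rerun reads exactly the same slice of the stream. This is a toy illustration, not the EMR connector's implementation: the `CheckpointStore` class is hypothetical, it keeps state in memory rather than in a metadata store, and sequence IDs are modeled as plain integers.

```python
class CheckpointStore:
    """In-memory stand-in for checkpoint metadata, keyed by logical name and iteration."""

    def __init__(self):
        self._bounds = {}  # (logical_name, iteration) -> (start_seq, end_seq)

    def bounds_for(self, logical_name, iteration, trim_horizon_seq, latest_seq):
        """Return the fixed input boundaries for one iteration of one logical name."""
        key = (logical_name, iteration)
        if key in self._bounds:
            # Rerun of a known iteration: reuse the recorded boundaries.
            return self._bounds[key]
        previous = self._bounds.get((logical_name, iteration - 1))
        # A new iteration starts where the previous one ended, or at the
        # oldest retained record if there is no previous iteration.
        start = previous[1] if previous else trim_horizon_seq
        self._bounds[key] = (start, latest_seq)
        return self._bounds[key]
```

Because boundaries are stored per logical name, two jobs can read the same stream at their own pace without interfering with each other.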
  • 25. Iterations – the run of a Job. [Diagram: for one Logical Name, Iterations 1 to 4 read consecutive windows of the stream (1:00 – 7:00, 7:00 – 13:00, 13:00 – 19:00, 19:00 – 1:00) between the Trim Horizon seqID (-24 hours) and the Latest seqID (NOW), followed by the Next iteration.]
  • 26. Logical Names & Checkpointing – allows efficient batching. [Diagram: a Logical Name positioned on the Kinesis stream between the Trim Horizon seqID (-24 hours) and the Latest seqID (NOW).]
  • 27. DynamoDB Metadata Storage. [Diagram: Logical Name A and Logical Name B each track Mappers 1 to 4.]
  • 28. Each Kinesis shard maps 1:1 to a Hadoop map task. [Diagram: in each window of the Logical Name (1:00 – 7:00, 7:00 – 13:00, 13:00 – 19:00, 19:00 – 1:00), Kinesis Shard 1 and Shard 2 are read by Hadoop Mapper 1 and Mapper 2, each between a start seq ID and an end seq ID, from -24 hours up to the Latest seqID (NOW) and on to the Next iteration.]
  • 29. Handling stream scaling events. [Diagram: across the windows 1:00 – 7:00, 7:00 – 13:00, 13:00 – 19:00, 19:00 – 1:00, from the Trim Horizon seqID (-24 hours) to the Latest seqID (NOW), shard splits and a merge change the set of Kinesis shards (Shards 1 and 2, then Shards 1 to 3), and the Hadoop mappers (Mappers 1 to 3) change to match.]
  • 30. Handling errors • InputFormat handles service errors – Throttling: 400 – Service unavailable: 503 – Internal server error: 500 – HTTP client exceptions: socket connection timeout • Hadoop handles retry of failed map tasks • Iterations allow retries – Fixed input boundaries on a stream (idempotency for reruns) – Enable multiple queries on the same input boundaries
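The error classes listed above are the kind a reader typically retries with backoff. The sketch below shows one such retry loop. It is an illustration, not the InputFormat's actual logic: `ServiceError` and `call_with_retries` are hypothetical names, exponential backoff is an assumed policy, and treating every HTTP 400 as throttling is a simplification (real code should inspect the error code before retrying).

```python
import socket
import time


class ServiceError(Exception):
    """Hypothetical error carrying the HTTP status code of a failed call."""

    def __init__(self, status):
        super().__init__(f"service returned HTTP {status}")
        self.status = status


# Status codes named on the slide: throttling, internal error, unavailable.
RETRYABLE_STATUSES = {400, 500, 503}


def call_with_retries(call, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run call(), retrying retryable failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ServiceError as err:
            # Give up on non-retryable statuses or when attempts are exhausted.
            if err.status not in RETRYABLE_STATUSES or attempt == max_attempts:
                raise
        except socket.timeout:
            if attempt == max_attempts:
                raise
        # Wait base_delay, then 2x, 4x, ... before the next attempt.
        sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the loop can be exercised without real delays.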
  • 31. Hadoop Ecosystem Implementation. Implementations: • Hadoop Input format • Hive Storage Handler • Pig Load Function • Cascading Scheme and Tap. Use Cases: • Join multiple data sources for analysis • Filter and preprocess streams • Export and archive streaming data
  • 32. Writing to Kinesis using Log4J. Options: • log4j.appender.KINESIS.streamName (default AccessLog Stream): Stream name to which data is to be published. • log4j.appender.KINESIS.encoding (default UTF-8): Encoding used to convert log message strings into bytes before sending to Amazon Kinesis. • log4j.appender.KINESIS.maxRetries (default 3): Maximum number of retries when calling Kinesis APIs to publish a log message. • log4j.appender.KINESIS.backoffInterval (default 100ms): Milliseconds to wait before a retry attempt. • log4j.appender.KINESIS.threadCount (default 20): Number of parallel threads for publishing logs to configured Kinesis stream. • log4j.appender.KINESIS.bufferSize (default 2000): Maximum number of outstanding log messages to keep in memory. • log4j.appender.KINESIS.shutdownTimeout (default 30): Seconds to send buffered messages before application JVM quits normally. Example log call: .error("Cannot find resource XYX… go do something about it!");
  • 33. Run the Ad-hoc Hive Query
  • 34. Run the Ad-hoc Hive Query
  • 35. Amazon Kinesis & Redshift
  • 36. 24 Hours to 10 Minutes How Upworthy’s Data Pipeline uses Kinesis Daniel Mintz, Director Business Intelligence, Upworthy, @danielmintz
  • 37. What’s Upworthy • We’ve been called – “Social media with a mission” by our About Page – “The fastest growing media site of all time” by Fast Company – “The Fastest Rising Startup” by The Crunchies – “That thing that’s all over my newsfeed” by my annoyed friends – “The most data-driven media company in history” by me, optimistically
  • 38. What We Do • We aim to drive massive amounts of attention to things that really matter. • We do that by finding, packaging, and distributing great, meaningful content.
  • 40. When We Started • Had built a data warehouse from scratch • Hadoop-based batch workflow • Nightly ETL cycle • 2.5 Engineers • Wanted to do all three: – Comprehensive – Ad Hoc – Real-Time
  • 41. The Decision • Speed up our current system, rather than building a parallel one • Had looked at alternative stream processors – Cost – Maintenance • Comfortable with concept of application log stream
  • 42. How It Works • Log Drain receives, formats, batches and zips • PUTs 50k GZIP batches on Kinesis stream • Three types of Kinesis consumers: 1. Archiver – Batch and write permanent record 2. Stats – Filter, sample and count; Report to StatHat 3. Transformer – Filter, batch, validate; writes temporary BSVs to S3 • Database Importer handles manifest files. • S3 handles garbage collection.
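The batching step in the Log Drain can be sketched as follows. This is not Upworthy's code: the function name `gzip_batches` is hypothetical, and "50k" is read here as a limit of 50 KB of uncompressed log lines per batch, which is an assumption since the slide does not say whether the size applies before or after compression.

```python
import gzip


def gzip_batches(lines, max_batch_bytes=50 * 1024):
    """Yield gzip-compressed batches, each built from at most max_batch_bytes of raw lines.

    A single line larger than the limit is still emitted, as a batch of its own.
    """
    batch, size = [], 0
    for line in lines:
        encoded = line.encode("utf-8") + b"\n"
        # Flush the current batch if adding this line would exceed the limit.
        if batch and size + len(encoded) > max_batch_bytes:
            yield gzip.compress(b"".join(batch))
            batch, size = [], 0
        batch.append(encoded)
        size += len(encoded)
    if batch:
        yield gzip.compress(b"".join(batch))
```

Each yielded value would be the payload of one PUT to the stream, and a consumer reverses the step with `gzip.decompress` followed by a split on newlines.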
  • 43. Our system now • Stats: – Average: ~1085 events/second – Peak: ~2500 events/second • Data is available in Redshift < 10 min • Kinesis has been cheap, stable, and gives us redundancy and resiliency. • Computation model that’s easy to reason about
  • 44. Resiliency • When something goes wrong, you have 24 hours. • Timestamp at outset. Track lag at each step. • Bigger workers (more CPU, RAM, deeper queues) can catch us up very fast.
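Tracking lag against the 24-hour window reduces to stamping each event once and subtracting at every later step. A minimal sketch, with hypothetical helper names and an assumed `created_at` field on each event:

```python
import time

RETENTION_SECONDS = 24 * 3600  # stream retention stated on the slides


def stamp(event, now=time.time):
    """Attach the creation time when an event first enters the pipeline."""
    return {**event, "created_at": now()}


def lag_seconds(event, now=time.time):
    """Seconds elapsed since the event was stamped; call this at each step."""
    return now() - event["created_at"]


def time_left_seconds(event, now=time.time):
    """Seconds remaining before the event ages out of the 24-hour window."""
    return RETENTION_SECONDS - lag_seconds(event, now)
```

Reporting `lag_seconds` from every consumer makes it visible which step is falling behind, and `time_left_seconds` shows how much of the recovery window remains.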
  • 46. Some Lessons • You can use one pipeline for everything. • High-cardinality fact data belongs in Kinesis. • EDN works well with Kinesis. • We prefer explicit checkpointing. (Your mileage may vary.) • Languages that run on the JVM can take advantage of AWS Client Libraries.
  • 47. Kinesis Pricing Simple, Pay-as-you-go, & no up-front costs. Pricing dimensions: Hourly Shard Rate: $0.015; Per 1,000,000 PUT transactions: $0.028 • Customers specify throughput requirements in shards, that they control • Each Shard delivers 1 MB/s on ingest, and 2 MB/s on egress • Inbound data transfer is free • EC2 instance charges apply for Kinesis processing applications
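The two pricing dimensions combine into a simple estimate. The sketch below uses only the 2014 rates shown on the slide. The function name `kinesis_cost_usd` is hypothetical, and EC2 charges for processing applications are deliberately left out.

```python
SHARD_HOUR_USD = 0.015       # hourly shard rate from the slide
PUT_PER_MILLION_USD = 0.028  # per 1,000,000 PUT transactions, from the slide


def kinesis_cost_usd(shards, hours, put_count):
    """Estimate the stream cost: shard-hours plus PUT transactions."""
    shard_cost = shards * hours * SHARD_HOUR_USD
    put_cost = put_count / 1_000_000 * PUT_PER_MILLION_USD
    return shard_cost + put_cost
```

For example, two shards running for 24 hours with 10 million PUTs come to $0.72 in shard-hours plus $0.28 in PUTs, or $1.00 in total.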
  • 48. Canonical Data flows with Amazon Kinesis Continuous Metric Extraction Incremental Stats Computation Record Archiving Live Dashboard
  • 49. Try out Amazon Kinesis • Try out Amazon Kinesis – http://aws.amazon.com/kinesis/ • Thumb through the Developer Guide – http://aws.amazon.com/documentation/kinesis/ • Test drive the sample app – https://github.com/awslabs/amazon-kinesis-data-visualization-sample • Kinesis Connector Framework – https://github.com/awslabs/amazon-kinesis-connectors • Read EMR-Kinesis FAQs – http://aws.amazon.com/elasticmapreduce/faqs/#kinesis-connector • Visit, and Post on Kinesis Forum – https://forums.aws.amazon.com/forum.jspa?forumID=169#
  • 50. Thank You! Adi Krishnan, Product Management, AWS