Big Data is everywhere these days. But what is it, and how can you use it to fuel your business? Data is as important to organizations as labour and capital. Organizations that can effectively capture, analyze, visualize and apply big data insights to their business goals can differentiate themselves from their competitors and outperform them in operational efficiency and the bottom line.
Join this session to understand the different AWS Big Data and Analytics services, such as Amazon Elastic MapReduce (Hadoop), Amazon Redshift (data warehousing) and Amazon Kinesis (streaming): when to use them and how they work together.
Reasons to attend:
- Learn how AWS can help you process and make better use of your data with meaningful insights.
- Learn about Amazon Elastic MapReduce, a managed Hadoop framework, and Amazon Redshift, a fully managed petabyte-scale data warehouse.
- Learn about real-time data processing with Amazon Kinesis.
10. JDBC/ODBC
[Diagram: a six-row table (ID, Name) accessed over JDBC/ODBC and distributed across compute nodes: rows 1 and 4 (John Smith, Pat Partridge) on one node, rows 2 and 5 (Jane Jones, Sarah Cyan) on another, rows 3 and 6 (Peter Black, Brian Snail) on a third.]
11. Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps

ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375

• With row storage you do unnecessary I/O
• To get average Amount by State, you have to read everything
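The row-versus-column point above can be sketched in a few lines. This is an illustrative toy, not Redshift internals: the same four rows stored row-wise and column-wise, and the fields each layout must touch to compute average Amount by State.

```python
# Row layout: every field of every row travels together.
rows = [
    {"id": 123, "age": 20, "state": "CA", "amount": 500},
    {"id": 345, "age": 25, "state": "WA", "amount": 250},
    {"id": 678, "age": 40, "state": "FL", "amount": 125},
    {"id": 957, "age": 37, "state": "WA", "amount": 375},
]

# Column layout: one list per column.
columns = {
    "id": [123, 345, 678, 957],
    "age": [20, 25, 40, 37],
    "state": ["CA", "WA", "FL", "WA"],
    "amount": [500, 250, 125, 375],
}

def avg_amount_by_state_row(rows):
    # Row storage: id and age are read even though the query never uses them.
    totals, counts = {}, {}
    for row in rows:
        totals[row["state"]] = totals.get(row["state"], 0) + row["amount"]
        counts[row["state"]] = counts.get(row["state"], 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

def avg_amount_by_state_col(columns):
    # Column storage: only the state and amount columns are read.
    totals, counts = {}, {}
    for state, amount in zip(columns["state"], columns["amount"]):
        totals[state] = totals.get(state, 0) + amount
        counts[state] = counts.get(state, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

assert avg_amount_by_state_row(rows) == avg_amount_by_state_col(columns)
```

Both functions return the same answer; the columnar version simply never touches the ID and Age data, which is the I/O saving the slide is describing.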
12. Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps

ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375

• With column storage, you only read the data you need
13. Dramatically reduces I/O

analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw

• Column storage
• Data compression
• Zone maps
• COPY compresses automatically
• You can analyze and override
• More performance, less cost
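To see why encodings like `delta` compress sorted or nearly-sequential columns (such as `listid` above) so well, here is a toy sketch of delta encoding. This is a deliberate simplification of Redshift's actual on-disk format: store the first value, then only each value's difference from its predecessor, so the stored numbers stay small.

```python
# Toy delta encoding: [first value, then successive differences].
def delta_encode(values):
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)   # nearly-sequential data => tiny deltas
    return out

def delta_decode(encoded):
    out, total = [], 0
    for d in encoded:
        total += d               # running sum reconstructs the originals
        out.append(total)
    return out

listids = [100, 101, 103, 106, 110]      # nearly-sequential IDs
encoded = delta_encode(listids)          # [100, 1, 2, 3, 4]
assert delta_decode(encoded) == listids
```

The deltas fit in far fewer bits than the raw values, which is the "more performance, less cost" trade the slide points at.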
14. Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps

[Diagram: a sorted column split into three blocks (10 … 324, 375 … 623, 637 … 959); the zone map records each block's minimum and maximum (10/324, 375/623, 637/959).]

• Track the minimum and maximum value for each block
• Skip over blocks that don't contain relevant data
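The zone-map idea above fits in a short sketch. This is illustrative rather than Redshift's implementation: each block keeps its min and max, and a range scan skips any block whose range cannot overlap the query.

```python
# Three blocks of a sorted column, matching the values on the slide.
blocks = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]
# Zone map: (min, max) per block.
zone_map = [(min(b), max(b)) for b in blocks]

def scan(blocks, zone_map, lo, hi):
    """Return values in [lo, hi], reading only blocks that might overlap."""
    hits, blocks_read = [], 0
    for block, (bmin, bmax) in zip(blocks, zone_map):
        if bmax < lo or bmin > hi:
            continue                     # skip: no relevant data in this block
        blocks_read += 1
        hits.extend(v for v in block if lo <= v <= hi)
    return hits, blocks_read

hits, blocks_read = scan(blocks, zone_map, 400, 600)
assert blocks_read == 1                  # only the middle block is read
assert hits == [417, 512, 549]
```

A query for values between 400 and 600 reads one block instead of three; on a real table that can eliminate the bulk of the I/O.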
15. Q3. What's good about it?
Performance, Scalability, Ease of Use, Cost
16. Performance Evaluation on 2B Rows

Traditional SQL Database | Amazon Redshift
Aggregate by month: 02:08:35 | 00:35:46 | 00:00:12
28. [Diagram: data flowing between S3 and an EMR cluster]
1. Put the data into S3
2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
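As one concrete way to do steps 2 and 3, the AWS CLI can launch a cluster in a single command. This is a sketch, not the deck's own example; the cluster name, instance sizing, release label, and S3 bucket below are placeholders to adjust for your account and region.

```shell
# Hypothetical EMR launch: pick the apps, node type/count, then create.
aws emr create-cluster \
  --name "demo-cluster" \
  --release-label emr-5.36.0 \
  --applications Name=Hive Name=Pig Name=HBase \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --log-uri s3://my-bucket/emr-logs/
```

The command returns the new cluster's ID, which the console, CLI, or SDK can then use to monitor and resize it.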
29. [Diagram: S3 feeding multiple EMR clusters]
• You can easily resize the cluster
• And launch parallel clusters using the same data
30. [Diagram: an EMR cluster with Spot nodes added]
• Use Spot nodes to save time and money
31. [Diagram: S3 and an EMR cluster]
• When processing is complete, you can terminate the cluster (and stop paying)
32. [Diagram: an EMR cluster]
• Or just store everything in HDFS (local disk)
33. Q3. What's good about it?
Scalability, Cost & Ease of Use
34. EMR with Spot Instances

Scenario #1 (without Spot), duration: 14 hours
• 4 instances * 14 hrs * $0.50 = $28.00

Scenario #2 (with Spot), duration: 7 hours
• 4 instances * 7 hrs * $0.50 = $14.00
• 5 instances * 7 hrs * $0.25 = $8.75
• Total = $22.75

Time savings: 50%
Cost savings: ~19%
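The arithmetic behind the two scenarios is worth spelling out, since it shows adding five extra Spot instances halves the runtime while still lowering the total bill. The rates are the ones on the slide.

```python
# Spot-vs-on-demand cost comparison, using the slide's hourly rates.
ON_DEMAND_RATE = 0.50   # $/instance-hour
SPOT_RATE = 0.25        # $/instance-hour (illustrative Spot price)

# Scenario #1: 4 on-demand instances for 14 hours.
cost_1 = 4 * 14 * ON_DEMAND_RATE                     # $28.00

# Scenario #2: same 4 on-demand instances plus 5 Spot instances,
# finishing in 7 hours instead of 14.
cost_2 = 4 * 7 * ON_DEMAND_RATE + 5 * 7 * SPOT_RATE  # $14.00 + $8.75 = $22.75

time_savings = 1 - 7 / 14        # 0.50 -> 50%
cost_savings = 1 - cost_2 / cost_1  # 0.1875 -> ~19%
assert cost_1 == 28.0 and cost_2 == 22.75
```

So the job finishes in half the time and costs $5.25 less, a saving of roughly 19%.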
35. [Diagram: an EMR cluster with a Master instance group, a Core instance group running HDFS, and a Task instance group (great for Spot Instances), with data in Amazon S3.]
38. Big Data Verticals

• Media/Advertising: targeted advertising, image and video processing
• Oil & Gas: seismic analysis
• Retail: recommendations, transaction analysis
• Life Sciences: genome analysis
• Financial Services: Monte Carlo simulations, risk analysis
• Security: anti-virus, fraud detection, image recognition
• Social Network/Gaming: user demographics, usage analysis, in-game metrics
40. Real-time Scenarios in Industry Segments

Segment | Log Ingest | Continual Metrics | Real-Time Data Analytics | Complex Stream Processing
Software/Technology | IT server logs ingestion | IT operational metrics dashboards | Devices/sensor operational intelligence |
Digital Ad Tech/Marketing | Advertising data aggregation | Advertising metrics like coverage, yield, conversion | Analytics on user engagement with ads | Optimized bid/buy engines
Financial Services | Market/financial transaction order data collection | Financial market data metrics | Fraud monitoring and Value-at-Risk assessment | Auditing of market order data
E-Commerce | Online customer engagement data aggregation | Consumer engagement metrics like page views, CTR | Customer clickstream analytics | Recommendation engines
45. [Diagram: a Kinesis stream spanning three Availability Zones; many data sources put records into the stream, and consumers read them for logging, metrics, analysis, and machine learning, feeding S3, DynamoDB, Redshift, and EMR.]
46. Putting data into Kinesis

• Each shard supports:
  • 1,000 transactions per second
  • 1 MB per second
  • 50 KB payload per transaction
• Messages are kept for 24 hours
• Simple PUT interface to store data in Kinesis
• A partition key is used to distribute the PUTs across shards
• A unique sequence number is created
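The partition-key routing above can be sketched as follows. This is an illustrative model, not the Kinesis service itself: Kinesis hashes the partition key with MD5 onto a 128-bit hash-key range that is split across the stream's shards, so all records with the same key land on the same shard.

```python
import hashlib

# Sketch of Kinesis-style shard routing: MD5(partition key) mapped onto the
# 128-bit hash-key range, which is split evenly across the shards.
NUM_SHARDS = 4                            # assumed stream size for this sketch
SHARD_SPAN = 2**128 // NUM_SHARDS

def shard_for(partition_key: str) -> int:
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return min(h // SHARD_SPAN, NUM_SHARDS - 1)

# Records with the same partition key always route to the same shard,
# which preserves per-key ordering.
assert shard_for("user-42") == shard_for("user-42")
assert 0 <= shard_for("user-7") < NUM_SHARDS
```

A good partition key (e.g. a user or device ID) spreads PUTs evenly across shards; a key with few distinct values would concentrate traffic on a handful of shards.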
47. Getting data out of Kinesis

Kinesis Client Library (KCL):
• Abstracts your code from individual shards
• Starts a Kinesis worker for each shard
• Increases and decreases the number of workers
• Tracks a worker's location in the stream
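What the KCL does can be modelled in miniature. This is not the real KCL API, just a sketch of its two core jobs under simplified assumptions: spreading shards across workers (lease balancing) and checkpointing each worker's position in its shard.

```python
from collections import defaultdict

def assign_shards(shard_ids, num_workers):
    """Round-robin shards across workers, roughly how KCL balances leases."""
    assignment = defaultdict(list)
    for i, shard in enumerate(shard_ids):
        assignment[i % num_workers].append(shard)
    return dict(assignment)

shards = ["shard-0", "shard-1", "shard-2", "shard-3"]

# Two workers each take two shards; four workers take one each.
assert assign_shards(shards, 2) == {0: ["shard-0", "shard-2"],
                                    1: ["shard-1", "shard-3"]}
assert assign_shards(shards, 4) == {0: ["shard-0"], 1: ["shard-1"],
                                    2: ["shard-2"], 3: ["shard-3"]}

# Checkpointing: remember the last processed sequence number per shard,
# so a restarted worker resumes where the previous one stopped.
checkpoints = {}
def checkpoint(shard_id, sequence_number):
    checkpoints[shard_id] = sequence_number

checkpoint("shard-0", "49590338271490256")
assert checkpoints["shard-0"] == "49590338271490256"
```

When shards are split or merged, or workers come and go, the real KCL rebalances these leases automatically; the sketch shows why per-shard ownership plus checkpoints is enough to resume cleanly.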
51. Online Labs | Training

Online Labs & Training:
Gain confidence and hands-on experience with AWS. Watch free instructional videos and explore self-paced labs.

Instructor-Led Classes:
Learn how to design, deploy and operate highly available, cost-effective and secure applications on AWS in courses led by qualified AWS instructors.

AWS Certification:
Validate your technical expertise with AWS and use practice exams to help you prepare for AWS Certification.

http://aws.amazon.com/training