Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013

Scaling Your Analytics
with Amazon Elastic MapReduce
Peter Sirota, General Manager - Amazon Elastic MapReduce
November 14, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Agenda
• Amazon EMR: Hadoop in the cloud
• Hadoop Ecosystem on Amazon EMR
• Customer Use Cases

Hadoop is the right system for Big Data
• Scalable and fault tolerant
• Flexibility for multiple languages
and data formats
• Open source
• Ecosystem of tools
• Batch and real-time analytics

Challenges with Hadoop

On premises
• Manage HDFS, upgrades, and system administration
• Pay for expensive support contracts
• Select hardware in advance and stick with predictions

On Amazon EC2
• Difficult to integrate with AWS storage services
• Independently manage and monitor clusters

Amazon EMR is the easiest way to run Hadoop in the cloud

Why Amazon EMR?
• Managed services
• Easy to tune clusters and trim costs
• Support for multiple data stores
• Unique features and ecosystem support

[Architecture diagram, built up over several slides. Labels: Input data (S3, DynamoDB, Redshift); Code; Elastic MapReduce; Name node; Elastic cluster; S3/HDFS; Queries + BI (via JDBC, Pig, Hive); Output]

Elastic clusters
Customize size and type to reduce costs

Choose your instance types
Try out different configurations to find your
optimal architecture

• CPU: c1.xlarge, cc1.4xlarge, cc2.8xlarge
• Memory: m1.large, m2.2xlarge, m2.4xlarge
• Disk: hs1.8xlarge

Long running or transient clusters
Easy to run Hadoop clusters short-term or 24/7, and
only pay for what you need

Resizable clusters
Easy to add and remove compute capacity on your cluster

[Figure sequence, labels only: 10 hours; 6 hours; Peak capacity; Matched compute demands with cluster sizing (10 hours)]
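
The resizing slides give only the labels "10 hours" and "6 hours". A minimal sketch of the arithmetic behind that kind of trade-off, assuming the job scales linearly with node count and ignoring billing granularity; the workload size, node counts, and price are hypothetical, not figures from the talk:

```python
# Hypothetical illustration: none of these numbers come from the presentation.

def job_hours(work_node_hours: float, nodes: int) -> float:
    """Wall-clock hours for a job, assuming perfectly linear scaling."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return work_node_hours / nodes


def job_cost(work_node_hours: float, nodes: int, price_per_node_hour: float) -> float:
    """Total cost = nodes x hours x hourly price (billing rounding ignored)."""
    return nodes * job_hours(work_node_hours, nodes) * price_per_node_hour


WORK = 60.0    # hypothetical workload, in node-hours
PRICE = 0.5    # hypothetical price per node-hour

for nodes in (6, 10):
    print(nodes, "nodes:", job_hours(WORK, nodes), "hours,",
          "cost", job_cost(WORK, nodes, PRICE))
```

Under these assumptions the larger cluster finishes sooner for the same total spend, because the number of node-hours consumed does not change. Real jobs rarely scale perfectly, so this is an upper bound on the benefit.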

Use Spot and Reserved Instances
Minimize costs by supplementing on-demand pricing

Easy to use Spot Instances
Name-your-price supercomputing to minimize costs

• Spot for task nodes: up to 90% off Amazon EC2 on-demand pricing
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity

24/7 clusters on Reserved Instances
Minimize cost for consistent capacity
• Reserved Instances for long running clusters
• Up to 65% off on-demand pricing
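
The pricing slides quote upper bounds only ("up to 90%" for Spot, "up to 65%" for Reserved Instances). A minimal sketch of how a blended hourly cost could be estimated when core nodes run on-demand and task nodes run at a discount; the node counts, price, and discount below are hypothetical, and actual Spot prices vary over time:

```python
# Hypothetical illustration: prices and discounts are assumptions,
# not figures from the presentation or from AWS price lists.

def cluster_hourly_cost(core_nodes: int, task_nodes: int,
                        on_demand_price: float, task_discount: float) -> float:
    """Hourly cost with core nodes at full price and task nodes discounted.

    task_discount is a fraction between 0.0 (no discount) and 1.0.
    """
    if not 0.0 <= task_discount <= 1.0:
        raise ValueError("task_discount must be between 0 and 1")
    core_cost = core_nodes * on_demand_price
    task_cost = task_nodes * on_demand_price * (1.0 - task_discount)
    return core_cost + task_cost


# 4 core nodes and 16 task nodes at a hypothetical 1.0 per node-hour.
all_on_demand = cluster_hourly_cost(4, 16, 1.0, 0.0)
with_discount = cluster_hourly_cost(4, 16, 1.0, 0.5)  # assumed 50% discount
print(all_on_demand, with_discount)
```

The same function can approximate a Reserved Instance scenario by applying the discount to the long running part of the cluster instead.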

Your data, your choice
Easy to integrate Amazon EMR with your data stores

Using Amazon S3 and HDFS
• Data Sources
• Data aggregated and stored in Amazon S3
• Ad-hoc Query: long running EMR cluster holding data in HDFS for Hive interactive queries
• Weekly Report: transient EMR cluster for batch map/reduce jobs for daily reports

Use Amazon EMR with Amazon Redshift and Amazon S3
• Data Sources
• Daily data aggregated in Amazon S3
• Amazon EMR cluster used to process data
• Processed data loaded into Amazon Redshift data warehouse

Use the Hadoop Ecosystem on Amazon EMR
Leverage a diverse set of tools to get the most out of your data

Hadoop 2.x
• Databases
• Machine learning
• Metadata stores
• Exchange formats
• Diverse query languages
and much more...

Use Hive on Amazon EMR to interact with your data in HDFS and Amazon S3
• Data warehouse for Hadoop
• Integration with Amazon S3 for better performance reading and writing to Amazon S3
• SQL-like query language to make iterative queries easier
• Easy to scale in HDFS on a persistent Amazon EMR cluster

Use HBase on a persistent Amazon EMR cluster as a column-oriented scalable data store
• Billions of rows and millions of columns
• Backup to and restore from Amazon S3
• Flexible datatypes
• Modulate your HBase tables when adding new data to your system

Use ad-hoc queries on your cluster to
drive insights in real-time
Spark / Shark
• In-memory MapReduce
for faster queries
• Use HiveQL to interact
with your data

Impala (coming soon!)
• Parallel database
engine for Hadoop
• Use SQL to query data
in HDFS on your cluster
in real-time

“Hadoop-as-a-Service [Amazon EMR] offers a
better price-performance ratio [than bare-metal Hadoop].”

1. Elastic clusters and cost optimization
2. Rapid, tuned provisioning
3. Agility for experimentation
4. Easy integration with diverse datastores

Diverse set of partners to build on Amazon EMR

BI / Visualization

Hadoop Distribution

Monitoring

Business Intelligence

Data Transformation

Data Transfer

Performance Tuning

Available on AWS Marketplace

BI / Visualization

ETL Tool

Graphical IDE

Available as a distribution in Amazon Elastic MapReduce

BI / Visualization

Encryption

Graphical IDE

How Netflix scales Big Data Platform on
Amazon EMR
Eva Tse, Director of Big Data Platform, Netflix
November 14, 2013


Hadoop ecosystem as
our Data Analytics platform
in the cloud

Separate compute and storage layers

[Diagram, built up over several slides. Labels: S3 (S3mper-enabled) as source of truth; Ad hoc, SLA, and Bonus clusters; zones x, y, and z]

Unified and global big data collection pipeline

[Diagram labels: Events Pipeline (cloud apps, Suro, Ursula); Dimension Pipeline (Aegisthus); S3 as source of truth; SLA, Bonus, and Adhoc clusters]

Innovate – services and tools

Putting into perspective …
• Billions of viewing hours of data
• ~3000 nodes clusters
• Hundred billion events / day
• Few petabytes DW on Amazon S3
• Thousands of jobs / day

Analytics and statistical modeling

What works for us?
• Scalability
• Hadoop integration on Amazon EC2 / AWS
• Let us focus on innovation and build a solution
• Tight engagement with Amazon EMR & Amazon EC2 teams for tactical issues and strategic roadmap

Next Steps …
• Heterogeneous node cluster
• Auto expand / shrink
• Richer monitoring infrastructure

We strive to build the best of class
big data platform in the cloud

Big Data at Channel 4
Amazon Elastic MapReduce for Competitive Advantage
Bob Harris – Channel 4 Television
14th November 2013


Channel 4 – Background
• Channel 4 is a public service, commercially funded, not-for-profit broadcaster.
• We have a remit to deliver innovative, experimental, distinctive, and diverse content across television, film, and digital media.
• We are funded predominantly by television advertising, competing with the other established UK commercial broadcasters, and increasingly with emerging Internet-based providers.
• Our content is available across our portfolio of around 10 core and time-shift channels, and our on-demand service 4oD is accessible across multiple devices and platforms.

Business Intelligence at C4
• Well established Business Intelligence capability
• Based on industry standard proprietary products
• Real-time data warehousing
• Comprehensive business reporting
• Excellent internal skills
• Good external skills availability

Big Data Technology at C4
• 2011 – Embarked on Big Data initiative
– Ran in-house and cloud-based PoCs
– Selected Amazon EMR
• 2012 – Ran Amazon EMR in parallel with conventional BI
– Hive deployed to Data Analysts
– Amazon EMR workflows deployed to production
• 2013 – Amazon EMR confirmed as primary Big Data platform
– Amazon EMR usage growing, focus on automation
– Experimenting with Mahout for Machine Learning

What problems are we solving?

Single view of the viewer: recognising them across devices and serving relevant content

Personalising the viewer experience

How are we doing this?
• Principal tasks…
– Audience segmentation
– Personalisation
– Recommendations

• What data do we process…
– Website clickstream logs
– 4oD activity and viewing history
– Over 9m registered users
– Majority of activity now from “logged-in” users

High-Level Architecture
• Amazon EMR and existing BI technology are complementary
• Process billions of data rows in Amazon EMR, store millions of result rows in RDBMS
• No need to “rip and replace”, existing technology investment is protected
• Amazon EMR will continue to underpin major growth in data volumes and processing complexity
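
The pattern of processing many raw rows and storing far fewer result rows in a relational database can be sketched on a small scale. This is an illustration only: the in-memory aggregation stands in for an Amazon EMR job, SQLite stands in for the RDBMS, and the event data, table name, and column names are hypothetical, not Channel 4's actual schema or workflow:

```python
import sqlite3
from collections import Counter

# Hypothetical raw rows: one (user_id, programme) pair per viewing event.
raw_events = [
    ("u1", "news"),
    ("u2", "film"),
    ("u1", "news"),
    ("u1", "film"),
]


def summarise(events):
    """Reduce many raw rows to one count per (user_id, programme) pair."""
    return Counter(events)


def store(summary, conn):
    """Write the much smaller summary into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS views "
        "(user_id TEXT, programme TEXT, n INTEGER)"
    )
    conn.executemany(
        "INSERT INTO views VALUES (?, ?, ?)",
        [(user, programme, n) for (user, programme), n in summary.items()],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the existing RDBMS
store(summarise(raw_events), conn)
rows = conn.execute(
    "SELECT user_id, programme, n FROM views ORDER BY user_id, programme"
).fetchall()
print(rows)
```

The existing reporting tools then query only the summary table, which is how the two technologies can remain complementary.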

Where Next?
• Continued growth in usage of Amazon EMR
• Migrate to Hadoop 2.x
• Adopt Amazon Redshift
• Improved integration between C4 and AWS
• Shift toward “near real-time” processing

