© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jonathan Fritz
Senior Product Manager, Amazon EMR
May 20, 2015
Getting Started with
Amazon EMR
Easy, fast, secure, and cost-effective Hadoop on AWS.
Agenda
• Is Hadoop the answer?
• Amazon EMR 101
• Integration with AWS storage and database services
• Common Amazon EMR design patterns
• Q+A
When leveraging your data to derive new insights,
Big Data problems are everywhere
• Data lacks structure
• Analyzing streams of information
• Processing large datasets
• Warehousing large datasets
• Flexibility for undefined ad hoc analysis
• Speed of queries on large data sets
Hadoop is the right system for Big Data
• Massively parallel
• Scalable and fault tolerant
• Flexibility for multiple languages
and data formats
• Open source
• Ecosystem of tools
• Batch and real-time analytics
Hadoop 2 stack: Applications (Pig, Hive, Cascading, Mahout, Giraph, HBase, Presto, Impala) run on batch (MapReduce), interactive (Tez), and in-memory (Spark) frameworks, with YARN providing cluster resource management over storage in S3 and HDFS.
Hadoop 1 stack: Applications run directly on batch MapReduce over storage in S3 and HDFS.
Customers across many verticals
Amazon Elastic MapReduce (EMR) is the
easiest way to run Hadoop in the cloud.
Why Amazon EMR?
Easy to Use
Launch a cluster in minutes
Low Cost
Pay an hourly rate
Elastic
Easily add or remove capacity
Reliable
Spend less time monitoring
Secure
Manage firewalls
Flexible
Customize the cluster
Easy to Use
Launch a cluster in minutes
Easy to deploy
AWS Management Console or AWS Command Line Interface
You can also use the Amazon EMR API with your favorite SDK
or use AWS Data Pipeline to start your clusters.
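As a minimal sketch, a 2015-era AWS CLI launch might look like the following (the cluster name, key pair, bucket, and AMI version are placeholders; adjust to your account):

aws emr create-cluster \
  --name "MyFirstCluster" \
  --ami-version 3.8.0 \
  --applications Name=Hive Name=Pig \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles \
  --log-uri s3://my-bucket/emr-logs/

The command returns a cluster ID (j-XXXXXXXXXXXXX) that later CLI commands reference.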
Choose your instance types
Try different configurations to find your optimal architecture:
• CPU: c3 family, cc1.4xlarge, cc2.8xlarge
• Memory: m2 family, r3 family
• Disk/IO: d2 family, i2 family
• General: m1 family, m3 family
Typical workloads: batch processing, machine learning, Spark and interactive analysis, and large HDFS.
Low Cost
Pay an hourly rate
Mix on-demand and EC2 Spot capacity for low costs: use on-demand capacity for core nodes (standard Amazon EC2 on-demand pricing) and Spot Instances for task nodes (up to 90% off Amazon EC2 on-demand pricing). Meet your SLA at predictable cost with on-demand capacity, and exceed it at lower cost with Spot.
Use multiple EMR instance groups
Master Node
r3.2xlarge
Example Amazon
EMR Cluster
Slave Group - Core
c3.2xlarge
Slave Group – Task
m3.xlarge (EC2 Spot)
Slave Group – Task
m3.2xlarge (EC2 Spot)
Core nodes run HDFS
(DataNode). Task nodes do
not run HDFS. Core and
Task nodes each run YARN
(NodeManager).
Master node runs
NameNode (HDFS),
ResourceManager (YARN),
and serves as a gateway.
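As a sketch, a cluster like the one above could be requested with the --instance-groups option; the instance counts, types, and Spot bid price below are illustrative:

aws emr create-cluster \
  --name "InstanceGroupsExample" \
  --ami-version 3.8.0 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.2xlarge \
    InstanceGroupType=CORE,InstanceCount=4,InstanceType=c3.2xlarge \
    InstanceGroupType=TASK,InstanceCount=6,InstanceType=m3.xlarge,BidPrice=0.08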
Elastic
Easily add or remove capacity
Resizable clusters
Easily add and remove compute capacity in your cluster from the console, CLI, or API, and match compute demands with cluster sizing.
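For example, resizing from the CLI looks roughly like this (the cluster and instance group IDs are placeholders):

# Find the instance group ID of the group you want to resize
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX
# Grow (or shrink) that group to 10 instances
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXXX,InstanceCount=10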
Use S3 instead of HDFS for your data layer to decouple
your compute capacity and storage
Amazon S3
Amazon EMR
Shut down your EMR
clusters when you
are not processing
data, and stop paying
for them!
Reliable
Spend less time monitoring
Easy to monitor and debug
Monitor your cluster, nodes, and IO with Amazon CloudWatch or Ganglia.
EMR logging to S3 makes logs easily available for debugging.
Secure
Integrates with AWS
security features
Use Identity and Access Management (IAM) roles with
your Amazon EMR cluster
• IAM roles give you fine-grained control over the permissions Amazon EMR delegates to AWS services and its access to AWS resources
• EMR uses two IAM roles:
• EMR service role is for the Amazon EMR
control plane
• EC2 instance profile is for the actual
instances in the Amazon EMR cluster
• Default IAM roles can be easily created and
used from the AWS Console and AWS CLI
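A minimal sketch of setting up and referencing the default roles from the AWS CLI (the role names are the standard defaults; the other option values are placeholders):

# One-time setup: creates EMR_DefaultRole and EMR_EC2_DefaultRole
aws emr create-default-roles
# Reference them explicitly, or simply pass --use-default-roles
aws emr create-cluster --name "SecureCluster" --ami-version 3.8.0 \
  --instance-type m3.xlarge --instance-count 3 \
  --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName=my-key-pair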
EMR Security Groups: default and custom
A security group is a virtual firewall which controls
access to the EC2 instances in your Amazon EMR
cluster
• There is a single default master and default
slave security group across all of your clusters
• The master security group has port 22 access
for SSHing to your cluster
You can add additional security groups to the master
and slave groups on a cluster to separate them from
the default master and slave security groups, and
further limit ingress and egress policies.
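As a sketch, additional security groups can be attached at launch through --ec2-attributes (the security group IDs below are placeholders):

aws emr create-cluster --name "CustomSGCluster" --ami-version 3.8.0 \
  --use-default-roles --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair,AdditionalMasterSecurityGroups=sg-11111111,AdditionalSlaveSecurityGroups=sg-22222222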
Other Amazon EMR security features
EMRFS encryption options
• S3 server-side encryption
• S3 client-side encryption (use AWS Key Management Service keys or custom keys)
CloudTrail integration
• Track Amazon EMR API calls for auditing
Launch your Amazon EMR clusters in a VPC
• Logically isolated portion of the cloud (“Virtual Private Cloud”)
• Enhanced networking on certain instance types
Flexible
Customize the cluster
Hadoop applications available in EMR
Use Hive on EMR to interact with your data in HDFS
and Amazon S3
• Batch or ad hoc workloads
• Integration with EMRFS for better
performance reading and writing
to S3
• SQL-like query language to make
iterative queries easier
• Schema-on-read to query data
without needing pre-processing
• Use Tez engine for faster queries
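For example, a HiveQL script stored in S3 can be submitted as a step to a running cluster; the cluster ID, script path, and variables below are placeholders:

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=HIVE,Name=DailyAggregation,ActionOnFailure=CONTINUE,Args=[-f,s3://my-bucket/scripts/daily-agg.q,-d,INPUT=s3://my-bucket/input/,-d,OUTPUT=s3://my-bucket/output/]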
Use Pig to easily create ETL workflows
• Uses high-level “Pig Latin” language to
easily script data transformations in
Hadoop
• Strong optimizer for workloads
• Options to create robust user defined
functions
Use HBase on a persistent EMR cluster as a scalable NoSQL database
• Billions of rows and millions
of columns
• Backup to and restore from
Amazon S3
• Flexible datatypes
• Modify your HBase tables when adding new data to your system
Impala: a fast SQL query engine for EMR Clusters
• Low-latency SQL query engine for Hadoop
• Perfect for fast ad hoc, interactive queries on structured or unstructured data
• Can be easily installed on an EMR cluster, and queried from the CLI or a third-party BI tool
• Perfect for memory optimized instances
• Currently uses HDFS as data layer
Hadoop User Experience (Hue)
Query Editor
Hue
Job Browser
Hue
File Browser: Amazon S3 and the Hadoop Distributed File System (HDFS)
To install anything else, use Bootstrap Actions
https://github.com/awslabs/emr-bootstrap-actions
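A minimal sketch of attaching your own bootstrap action at cluster launch (the script path, name, and arguments are placeholders):

aws emr create-cluster --name "BootstrapExample" --ami-version 3.8.0 \
  --use-default-roles --ec2-attributes KeyName=my-key-pair \
  --instance-type m3.xlarge --instance-count 3 \
  --bootstrap-actions Path=s3://my-bucket/bootstrap/install-my-tool.sh,Name=InstallMyTool,Args=["--version","1.2"]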
Spark: an alternative engine to Hadoop with its
own ecosystem of applications
• Does not use the MapReduce framework
• In-memory for fast queries
• Great for machine learning or other
iterative queries
• Use Spark SQL to create a low-latency
data warehouse
• Spark Streaming for real-time
workloads
Also use Bootstrap Actions to configure your
applications
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--keyword-config-file (Merge values in new config to existing)
--keyword-key-value (Override values provided)

Configuration file name    Configuration file keyword    File name shortcut    Key-value pair shortcut
core-site.xml              core                          C                     c
hdfs-site.xml              hdfs                          H                     h
mapred-site.xml            mapred                        M                     m
yarn-site.xml              yarn                          Y                     y
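Putting the shortcuts together, a create-cluster call might override a mapred-site and a yarn-site property like this (the property names and values are illustrative):

aws emr create-cluster --name "TunedCluster" --ami-version 3.8.0 \
  --use-default-roles --ec2-attributes KeyName=my-key-pair \
  --instance-type m3.xlarge --instance-count 3 \
  --bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Args=["-m","mapreduce.map.memory.mb=2048","-y","yarn.nodemanager.resource.memory-mb=11520"]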
EMR Step API
• An EMR step can be a MapReduce job, Hive program, Pig script, or even an arbitrary script
• Easily submit Step from
console, CLI, or API
• Submit multiple steps to use
EMR as a sequential workflow
engine
Submit work via the EMR Step API or SSH to the
EMR master node
Connect to Master Node
• Connect to HUE, interact with
application CLIs, or submit
work directly to the Hadoop
APIs
• View the Hadoop UI
• Useful for long-running clusters
and interactive use cases
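For example, connecting to the master node from the AWS CLI (the cluster ID, key file, and DNS name are placeholders):

# Opens an SSH session as the hadoop user on the master node
aws emr ssh --cluster-id j-XXXXXXXXXXXXX --key-pair-file ~/my-key-pair.pem
# Or connect directly once you know the master public DNS name
ssh -i ~/my-key-pair.pem hadoop@<master-public-dns-name>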
Let’s see it!
Quick tour of the EMR Console and HUE on an EMR
cluster
Diverse set of partners to use with Amazon EMR
Partner categories include BI/visualization and business intelligence, Hadoop distributions, data transfer, data transformation, ETL tools, monitoring, performance tuning, and graphical IDEs. Partners are available on AWS Marketplace or as a distribution in Amazon EMR.
Integration with AWS storage
and database services
Choose your data stores
Amazon S3 as your persistent data store
Amazon S3
• Designed for 99.999999999% durability
• Separate compute and storage
Resize and shut down Amazon EMR clusters
with no data loss
Point multiple Amazon EMR clusters at same
data in Amazon S3 using the EMR File
System (EMRFS)
EMRFS makes it easier to leverage Amazon S3
Better performance and error handling options
Transparent to applications – just read/write to “s3://”
Consistent view
• For consistent list and read-after-write for new puts
Support for Amazon S3 server-side and client-side encryption
Faster listing using EMRFS metadata
Consistent view and fast listing using the optional EMRFS metadata
EMRFS metadata is stored in Amazon DynamoDB and provides:
• List and read-after-write consistency
• Faster list operations

Listing performance by number of objects*:
Number of objects    Without consistent view    With consistent view
1,000,000            147.72                     29.70
100,000              12.70                      3.69
*Tested using a single-node cluster with an m3.xlarge instance.
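A sketch of turning on consistent view at launch with the --emrfs option (the retry settings shown are illustrative):

aws emr create-cluster --name "ConsistentViewCluster" --ami-version 3.8.0 \
  --use-default-roles --ec2-attributes KeyName=my-key-pair \
  --instance-type m3.xlarge --instance-count 3 \
  --emrfs Consistent=true,RetryCount=5,RetryPeriod=30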
EMRFS support for Amazon S3 client-side encryption
EMRFS, enabled for Amazon S3 client-side encryption, reads and writes client-side encrypted objects in Amazon S3 using the Amazon S3 encryption clients, with keys supplied by a key vendor (AWS KMS or your custom key vendor).
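A sketch of enabling EMRFS client-side encryption with an AWS KMS key at launch (the key ARN is a placeholder):

aws emr create-cluster --name "EncryptedEMRFSCluster" --ami-version 3.8.0 \
  --use-default-roles --ec2-attributes KeyName=my-key-pair \
  --instance-type m3.xlarge --instance-count 3 \
  --emrfs Encryption=ClientSide,ProviderType=KMS,KMSKeyId=arn:aws:kms:us-east-1:111122223333:key/your-key-id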
Read data directly into Hive,
Apache Pig, and Hadoop
Streaming and Cascading from
Amazon Kinesis streams
No intermediate data
persistence required
Simple way to introduce real-time sources into
batch-oriented systems
Multi-application support and automatic
checkpointing
Amazon EMR Integration with Amazon Kinesis
Use Hive with EMR to query data in DynamoDB
• Export data stored in DynamoDB to
Amazon S3
• Import data in Amazon S3 to
DynamoDB
• Query live DynamoDB data using SQL-
like statements (HiveQL)
• Join data stored in DynamoDB and
export it or query against the joined data
• Load DynamoDB data into HDFS and
use it in your EMR job
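A minimal sketch, run from the master node, of mapping a hypothetical DynamoDB table into Hive and querying it (the table and column names are placeholders; the storage handler and TBLPROPERTIES keys follow the EMR DynamoDB connector):

hive -e "
CREATE EXTERNAL TABLE ddb_orders (order_id string, customer string, total double)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ('dynamodb.table.name' = 'Orders',
               'dynamodb.column.mapping' = 'order_id:OrderId,customer:Customer,total:Total');
SELECT customer, SUM(total) FROM ddb_orders GROUP BY customer;
"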
Use AWS Data Pipeline and EMR to transform
data and load into Amazon Redshift
Unstructured Data Processed Data
Pipeline orchestrated and scheduled by AWS Data Pipeline
Amazon EMR design patterns
Amazon EMR example #1: Batch processing
GBs of logs pushed
to Amazon S3 hourly
Daily Amazon EMR
cluster using Hive to
process data
Input and output
stored in Amazon S3
250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
Using Amazon S3 and HDFS
Data Sources
Transient EMR cluster
for batch map/reduce jobs
for daily reports
Long running EMR cluster
holding data in HDFS for
Hive interactive queries
Weekly Report
Ad-hoc Query
Data aggregated
and stored in
Amazon S3
Multiple EMR workflows using the same S3
dataset
Workflow: an input Amazon S3 bucket feeds data via S3DistCp into an intermediate Amazon S3 bucket, where Cascalog computations over LZO-compressed data write results to multiple final Amazon S3 buckets.
Crashlytics (part of Twitter) uses EMR to
analyze data in S3 to power dashboards
on its Answers platform.
Amazon EMR example #2: Long-running cluster
Data is pushed to Amazon S3. A daily Amazon EMR cluster Extracts, Transforms, and Loads (ETL) the data into a database, and a 24/7 Amazon EMR cluster running HBase holds the last 2 years’ worth of data. A front-end service uses the HBase cluster to power a dashboard with high concurrency.
Amazon EMR example #3: Interactive query
TBs of logs sent daily
Logs stored in
Amazon S3
Amazon EMR cluster using Presto for ad hoc
analysis of entire log set
Interactive query using Presto on multipetabyte warehouse
http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
EMR example #4: EMR for ETL and query engine for
investigations which require all raw data
TBs of logs sent
daily
Logs stored in S3
Hourly EMR cluster
using Spark for ETL
Load subset into
Redshift DW
Transient EMR cluster using Spark for ad hoc
analysis of entire log set
EMR Example #5: Streaming Data
A typical streaming pipeline: client/sensor → recording service → aggregator/sequencer → continuous processor → data warehouse → analytics and reporting.
With common tools, Kafka serves as the streaming data repository between the aggregator/sequencer and the continuous processor. On AWS, Amazon Kinesis takes the role of the streaming data repository.
Amazon Kinesis + Amazon EMR = Fewer Moving Parts
In this pipeline, clients log with Log4J, Amazon Kinesis serves as the streaming data repository, and Amazon EMR provides the data processing for the continuous processor that powers the dashboard, the data warehouse, and analytics and reporting.
Real-time processing with Spark Streaming and batch workloads on Kinesis streams with the Hadoop stack: the customer application pushes input with Log4J into Amazon Kinesis; Hive, Pig, Cascading, and Spark on Amazon EMR pull from the stream; processed output feeds real-time and batch workflows (for example, Amazon DynamoDB).
AWS Summit – Chicago: An exciting, free cloud conference designed to educate and inform new
customers about the AWS platform, best practices and new cloud services.
Details
• July 1, 2015
• Chicago, Illinois
• @ McCormick Place
Featuring
• New product launches
• 36+ sessions, labs, and bootcamps
• Executive and partner networking
Registration is now open
• Come and see what AWS and the cloud can do for you.
CTA Script
- If you are interested in learning more about how to navigate the cloud to grow your business, attend the AWS Summit Chicago on July 1st.
- Register today to learn from technical sessions led by AWS engineers, hear best
practices from AWS customers and partners, and participate in some of the 30+
paid sessions and labs.
- Simply go to https://aws.amazon.com/summits/chicago/?trkcampaign=summit_chicago_bootcamps&trk=Webinar_slide to register today.
- Registration is FREE.
TRACKING CODE:
- Listed above.
Thank you!
www.aws.amazon.com/elasticmapreduce
www.blogs.aws.amazon.com/bigdata
jonfritz@amazon.com