Masterclass
Elastic MapReduce
Ian Massingham – Technical Evangelist
@IanMmmm
A technical deep dive beyond the basics
Help educate you on how to get the best from AWS technologies
Show you how things work and how to get things done
Broaden your knowledge in ~45 mins
Masterclass
A key tool in the toolbox to help with ‘Big Data’ challenges
Makes analytics processes feasible that previously were not
Cost-effective when combined with the EC2 Spot market
Broad ecosystem of tools to handle specific use cases
Amazon Elastic MapReduce
What is EMR?
Hadoop-as-a-service
Massively parallel MapReduce engine
Cost-effective AWS wrapper
Integrated with AWS services and tools
A framework
Splits data into pieces
Lets processing occur
Gathers the results
Very large click log (e.g. TBs)
Lots of actions by John Smith
Split the log into many small pieces
Process in an EMR cluster
Aggregate the results from all the nodes
What John Smith did
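The split/process/aggregate flow above can be sketched in miniature with plain Python (a toy illustration only; EMR runs the same pattern in parallel across many Hadoop nodes):

```python
from collections import Counter

# Toy click log: one user action per line
log = [
    "john_smith click /home",
    "jane_doe click /cart",
    "john_smith click /checkout",
    "john_smith click /home",
]

def split(records, n_pieces):
    # 1. Split the log into many small pieces
    return [records[i::n_pieces] for i in range(n_pieces)]

def process(piece):
    # 2. Each node counts actions per user in its piece (the "map")
    return Counter(line.split()[0] for line in piece)

def aggregate(partials):
    # 3. Gather and merge the per-node results (the "reduce")
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

counts = aggregate(process(p) for p in split(log, 2))
print(counts["john_smith"])  # what John Smith did: 3 actions
```

The insight comes from merging many small partial results rather than scanning one huge file on a single machine.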
Insight in a fraction of the time
Masterclass Webinar - Amazon Elastic MapReduce (EMR)
HDFS: reliable storage
MapReduce: data analysis
Big Data != Hadoop
Volume, Velocity, Variety
When you need to innovate to collect, store, analyze, and manage your data
How do you get your data into AWS?
AWS Direct Connect: dedicated low-latency bandwidth
Amazon Kinesis: highly scalable stream processing for event/record buffering
AWS Storage Gateway: sync local storage to the cloud
AWS Import/Export: physical media shipping

Where do you put your data once it’s in AWS?
Amazon Relational Database Service (RDS): fully managed database (PostgreSQL, MySQL, Oracle, SQL Server)
Amazon DynamoDB: NoSQL, schemaless, provisioned-throughput database
Amazon S3: object datastore, up to 5 TB per object, 99.999999999% durability
Elastic MapReduce
Getting started, tools & operating modes
EMR cluster
Start an EMR cluster using the console or CLI tools
Master instance group created that controls the cluster (runs MySQL)
Core instance group created for the life of the cluster
Core instances run DataNode and TaskTracker daemons
Optional task instances can be added or removed to perform work
S3 can be used as the underlying ‘file system’ for input/output data
The master node coordinates distribution of work and manages cluster state
Core and task instances read/write to S3
Diagram: on each EC2 instance, map tasks read an input file and reduce tasks write an output file; the same map/reduce pattern runs in parallel across all EC2 instances in the cluster.
EMR (HDFS) sits at the core
Analytics languages/engines: Pig
Data management: Kinesis, S3, DynamoDB, RDS, Redshift, AWS Data Pipeline
EMR
Create Cluster → Run Job Step 1 → Run Job Step 2 → … → Run Job Step n → End Job Flow

Cluster created: Master, Core & Task instance groups
Processing step: execute a script, such as a Java .jar, Hive script, Python, etc.
Chained step: execute a subsequent step using data from the previous step; add steps dynamically or predefine them as a pipeline
Cluster terminated: cluster instances are terminated at the end of the job flow

Hadoop breaks a data set into multiple sets if the data set is too large to process quickly on a single cluster node.
A step can be a compiled Java .jar or a script executed by Hadoop Streaming.
Data is typically passed between steps using the HDFS file system on Core instance group nodes, or S3.
Output data is generally placed in S3 for subsequent use.
Hadoop Streaming
A utility that comes with the Hadoop distribution
Allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer
The mapper reads input from standard input, and the reducer emits output through standard output
By default, each line of input/output represents a record with a tab-separated key/value pair
Streaming: word count python script (stdin > stdout)
Mapper script location (S3)

#!/usr/bin/python
import sys
import re

def main(argv):
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    # Read words from stdin line by line
    for line in sys.stdin:
        for word in pattern.findall(line):
            # Emit a tab-delimited record to stdout,
            # e.g. "LongValueSum:abacus<tab>1"
            print "LongValueSum:" + word.lower() + "\t" + "1"

if __name__ == "__main__":
    main(sys.argv)
Input source
Output location
Reducer script or function
Aggregate: sorts inputs and adds up the totals, e.g. “abacus 1” and “abacus 1” become “abacus 2”
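The built-in aggregate reducer sums the values for each LongValueSum key in the sorted mapper output. A minimal Python 3 equivalent of the whole pipeline, with the sort standing in for Hadoop's shuffle, is useful for testing locally (a sketch, not the actual EMR aggregate implementation):

```python
import re
from itertools import groupby

PATTERN = re.compile("[a-zA-Z][a-zA-Z0-9]*")

def mapper(lines):
    # Same logic as the wordSplitter mapper: one "LongValueSum:word<tab>1"
    # record per word
    for line in lines:
        for word in PATTERN.findall(line):
            yield "LongValueSum:" + word.lower() + "\t" + "1"

def reducer(records):
    # Hadoop delivers records to the reducer sorted by key; 'aggregate'
    # then sums the values for each key
    pairs = (record.split("\t", 1) for record in records)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield "%s\t%d" % (key[len("LongValueSum:"):], total)

# Local stand-in for map -> shuffle/sort -> reduce
output = list(reducer(sorted(mapper(["abacus abacus", "Abacus zebra"]))))
print(output)  # ['abacus\t3', 'zebra\t1']
```

The same mapper and reducer scripts can then be submitted unchanged as a streaming step, since both sides speak plain tab-separated lines on stdin/stdout.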
Instance groups

Master instance group
Manages the job flow: coordinating the distribution of the MapReduce executable and subsets of the raw data to the core and task instance groups.
It also tracks the status of each task performed and monitors the health of the instance groups.
To monitor the progress of the job flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the user interface that Hadoop provides.

Core instance group
Contains all of the core nodes of a job flow.
A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS).
The EC2 instances you assign as core nodes are capacity that must be allotted for the entire job flow run.
Core nodes run both the DataNode and TaskTracker Hadoop daemons.

Task instance group
Contains all of the task nodes in a job flow.
The task instance group is optional. You can add it when you start the job flow, or add a task instance group to a job flow in progress.
You can increase and decrease the number of task nodes. Because they don't store data and can be added and removed from a job flow, you can use task nodes to manage the EC2 instance capacity your job flow uses.
SSH onto master
node with this
keypair
Keep cluster
running so you
can continue to
work on it
…
aakar 3
abdal 3
abdelaziz 18
abolish 3
aboriginal 12
abraham 6
abseron 9
abstentions 15
accession 73
accord 90
achabar 3
achievements 3
acic 4
acknowledged 6
acquis 9
acquisitions 3
…
./elastic-mapreduce --create --stream
  --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py
  --input s3://elasticmapreduce/samples/wordcount/input
  --output s3n://myawsbucket/output
  --reducer aggregate

Installation guide for EMR CLI tools: docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-install.html
This launches a job flow to run on a single-node cluster using an Amazon EC2 m1.small instance.
When your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes.
You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters.
elastic-mapreduce
  --create
  --key-pair micro
  --region eu-west-1
  --name MyJobFlow
  --num-instances 5            # instance count
  --instance-type m2.4xlarge   # instance type
  --alive                      # don't terminate the cluster; keep it running so steps can be added
  --log-uri s3n://mybucket/EMR/log
Adding Hive, Pig and HBase to the job flow:

elastic-mapreduce
  --create
  --key-pair micro
  --region eu-west-1
  --name MyJobFlow
  --num-instances 5
  --instance-type m2.4xlarge
  --alive
  --pig-interactive --pig-versions latest
  --hive-interactive --hive-versions latest
  --hbase
  --log-uri s3n://mybucket/EMR/log
Using a bootstrap action to configure the cluster:

elastic-mapreduce
  --create
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-s,dfs.block.size=1048576"
  --key-pair micro
  --region eu-west-1
  --name MyJobFlow
  --num-instances 5
  --instance-type m2.4xlarge
  --alive
  --pig-interactive --pig-versions latest
  --hive-interactive --hive-versions latest
  --hbase
  --log-uri s3n://mybucket/EMR/log
You can specify up to 16 bootstrap actions per job flow by providing multiple --bootstrap-action parameters.

s3://elasticmapreduce/bootstrap-actions/configure-daemons
Configure JVM properties:
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19

s3://elasticmapreduce/bootstrap-actions/configure-hadoop
Configure cluster-wide Hadoop settings:
./elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--site-config-file,s3://myawsbucket/config.xml,-s,mapred.tasktracker.map.tasks.maximum=2"

s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
Configure cluster-wide memory settings:
./elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
s3://elasticmapreduce/bootstrap-actions/run-if
Conditionally run a command:
./elastic-mapreduce --create --alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \
  --args "instance.isMaster=true,echo running on master node"

Shutdown actions: run scripts on shutdown
A bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. Each script must run and complete within 60 seconds.

Custom actions: run a custom script
--bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"
Integrated tools
Selecting the right one
Hadoop streaming
Unstructured data with a wide variation of formats; high flexibility on language
Broad language support (Python, Perl, Bash…)
Mapper, reducer, input and output all supplied by the job flow
Scale from 1 to an unlimited number of processes

But raw MapReduce has drawbacks:
Doing analytics in Eclipse is wrong
The 2-stage dataflow is rigid
Joins are lengthy and error-prone
Prototyping/exploration requires compile & deploy
Hive
Structured data which is lower value or where higher latency is tolerable

Pig
Semi-structured data which can be mined using set operations

HBase
Near-real-time key/value store for structured data
Hive integration
Schema on read
SQL interface for working with data; a simple way to use Hadoop
The CREATE TABLE statement references the data location on S3
The language, HiveQL, is similar to SQL; an example query:
SELECT COUNT(1) FROM sometable;
Requires you to set up a mapping to the input data
Uses SerDes to make it possible to query different input formats
Powerful data types (Array, Map, …)
                        SQL                      HiveQL
Updates                 UPDATE, INSERT, DELETE   INSERT OVERWRITE TABLE
Transactions            Supported                Not supported
Indexes                 Supported                Not supported
Latency                 Sub-second               Minutes
Functions               Hundreds                 Dozens
Multi-table inserts     Not supported            Supported
Create table as select  Not valid SQL-92         Supported
JDBC listener started by default on EMR
Port 10003 for Hive 0.8.1; ports 9100-9101 for Hadoop
Download hive-jdbc-0.8.1.jar
Access by: opening the port on the ElasticMapReduce-master security group, or an SSH tunnel
HiveQL to execute:

./elastic-mapreduce --create
  --name "Hive job flow"
  --hive-script
  --args s3://myawsbucket/myquery.q
  --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output
Interactive Hive session:

./elastic-mapreduce
  --create
  --alive
  --name "Hive job flow"
  --num-instances 5 --instance-type m1.large \
  --hive-interactive
CREATE EXTERNAL TABLE impressions (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string
)
PARTITIONED BY (dt string)
ROW FORMAT
  serde 'com.amazon.elasticmapreduce.JsonSerde'
  with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://mybucketsource/tables/impressions';

Table structure to create (happens fast, as it is just a mapping to the source data in S3)

hive> select * from impressions limit 5;

Selecting from the source data directly via Hadoop
                   Streaming   Hive        Pig        DynamoDB        Redshift
Unstructured Data  ✓                       ✓
Structured Data                ✓           ✓          ✓               ✓
Language Support   Any*        HQL         Pig Latin  Client          SQL
SQL                            ✓                                      ✓
Volume             Unlimited   Unlimited   Unlimited  Relatively Low  1.6 PB
Latency            Medium      Medium      Medium     Ultra Low       Low
Cost considerations
Making the most of EC2 resources
1 instance for 100 hours = 100 instances for 1 hour (small instance: $8)
1 instance for 1,000 hours = 1,000 instances for 1 hour (small instance: $80)
Achieving economies of scale
Chart: utilization (up to 100%) over time, layered as reserved capacity at the base, on-demand capacity above it, and spot capacity absorbing the peaks.
EMR with spot instances

Scenario #1: job flow duration 14 hours
Cost without Spot: 4 instances * 14 hrs * $0.50 = $28

Scenario #2: job flow duration 7 hours
Cost with Spot: 4 instances * 7 hrs * $0.50 = $14, plus 5 spot instances * 7 hrs * $0.25 = $8.75
Total = $22.75

Time savings: 50%
Cost savings: ~19%
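The scenario arithmetic can be checked in a couple of lines (the prices are the deck's illustrative figures, not current rates):

```python
# Scenario #1: 4 on-demand instances for 14 hours at $0.50/hr
cost_without_spot = 4 * 14 * 0.50

# Scenario #2: the same 4 on-demand instances, plus 5 spot instances
# at $0.25/hr, halve the runtime to 7 hours
cost_with_spot = 4 * 7 * 0.50 + 5 * 7 * 0.25

print(cost_without_spot)  # 28.0
print(cost_with_spot)     # 22.75
print(1 - cost_with_spot / cost_without_spot)  # saving as a fraction
```

On these figures the saving works out to 18.75%, i.e. roughly a fifth off the bill for half the wall-clock time.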
Some other things…
Hints and tips
Managing big data
Dynamic Creation of Clusters for Processing Jobs
Bulk of Data Stored in S3
Use Reduced Redundancy Storage for Lower Value / Re-creatable Data, and Lower Cost by 30%
DynamoDB for High Performance Data Access
Migrate data to Glacier for Archive and Periodic Access
VPC & IAM for Network Security
Shark can return results up to 30 times faster than the same queries run on Hive.

elastic-mapreduce --create --alive --name "Spark/Shark Cluster"
  --bootstrap-action s3://elasticmapreduce/samples/spark/install-spark-shark.sh
  --bootstrap-name "install Mesos/Spark/Shark"
  --hive-interactive --hive-versions 0.7.1.3
  --ami-version 2.0
  --instance-type m1.xlarge --instance-count 3
  --key-pair ec2-keypair
aws.amazon.com/blogs/aws/
Summary
What is EMR?
Hadoop-as-a-service
Massively parallel MapReduce engine
Cost-effective AWS wrapper
Integrated with AWS services and tools
Try the tutorial: Contextual Advertising using Apache Hive and Amazon EMR
aws.amazon.com/articles/2855
Find out more:
aws.amazon.com/big-data
http://aws.amazon.com/elasticmapreduce/getting-started/
Ian Massingham – Technical Evangelist
@IanMmmm
@AWS_UKI for local AWS events & news
@AWScloud for Global AWS News and Announcements
©Amazon.com, Inc. and its affiliates. All rights reserved.
AWS Training & Certification
Certification
aws.amazon.com/certification
Demonstrate your skills,
knowledge, and expertise
with the AWS platform
Self-Paced Labs
aws.amazon.com/training/
self-paced-labs
Try products, gain new
skills, and get hands-on
practice working with
AWS technologies
Training
aws.amazon.com/training
Skill up and gain
confidence to design,
develop, deploy and
manage your applications
on AWS
We typically see customers start by trying our
services
Get started now at : aws.amazon.com/getting-started
Design your application for the AWS Cloud
More details on the AWS Architecture Center at : aws.amazon.com/architecture
Hadoop bigdata overview
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introduction
 
Escalando Aplicaciones Web
Escalando Aplicaciones WebEscalando Aplicaciones Web
Escalando Aplicaciones Web
 
How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integrated
 
HADOOP
HADOOPHADOOP
HADOOP
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 

Más de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Más de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Último

UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 

Último (20)

UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 

Masterclass Webinar - Amazon Elastic MapReduce (EMR)

  • 1. Masterclass Elastic MapReduce Ian Massingham – Technical Evangelist @IanMmmm
  • 2. A technical deep dive beyond the basics Help educate you on how to get the best from AWS technologies Show you how things work and how to get things done Broaden your knowledge in ~45 mins Masterclass
  • 3. A key tool in the toolbox to help with ‘Big Data’ challenges Makes possible analytics processes previously not feasible Cost effective when leveraged with EC2 spot market Broad ecosystem of tools to handle specific use cases Amazon Elastic MapReduce
  • 4. What is EMR? Map-Reduce engine Integrated with tools Hadoop-as-a-service Massively parallel Cost effective AWS wrapper Integrated with AWS services
  • 5. A framework Splits data into pieces Lets processing occur Gathers the results
  • 7. Very large click log (e.g. TBs) Lots of actions by John Smith
  • 8. Very large click log (e.g. TBs) Lots of actions by John Smith Split the log into many small pieces
  • 9. Very large click log (e.g. TBs) Lots of actions by John Smith Split the log into many small pieces Process in an EMR cluster
  • 10. Very large click log (e.g. TBs) Lots of actions by John Smith Split the log into many small pieces Process in an EMR cluster Aggregate the results from all the nodes
  • 11. Very large click log (e.g. TBs) What John Smith did Lots of actions by John Smith Split the log into many small pieces Process in an EMR cluster Aggregate the results from all the nodes
  • 12. What John Smith did Very large click log (e.g. TBs) Insight in a fraction of the time
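The split / map / reduce flow in the slides above can be simulated locally. A toy Python sketch (purely illustrative — the function names and the tiny in-memory "click log" are invented for this example, and a real EMR job would operate on files in S3/HDFS, not a Python list):

```python
from collections import Counter
from itertools import chain

def split(log_lines, pieces):
    """Split the log into roughly equal chunks (the 'many small pieces')."""
    return [log_lines[i::pieces] for i in range(pieces)]

def map_chunk(chunk, user):
    """Map step: each 'node' emits one record per action by the user of interest."""
    return [line for line in chunk if line.startswith(user + "\t")]

def reduce_results(mapped_chunks):
    """Reduce step: aggregate the results from all the nodes."""
    return Counter(chain.from_iterable(mapped_chunks))

log = ["john\tclick", "jane\tview", "john\tclick", "john\tbuy"]
chunks = split(log, 2)
what_john_did = reduce_results(map_chunk(c, "john") for c in chunks)
# Counter({'john\tclick': 2, 'john\tbuy': 1})
```

Each chunk is processed independently, which is what lets the real cluster spread the work across many instances before a single aggregation pass.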
  • 15. Big Data != Hadoop
  • 16. Big Data != Hadoop Velocity Variety Volume
  • 17. Big Data != Hadoop When you need to innovate to collect, store, analyze, and manage your data
  • 18. Compute Storage Big Data How do you get your data into AWS? AWS Direct Connect Dedicated low latency bandwidth Amazon Kinesis Highly scalable stream processing for event/record buffering Amazon Storage Gateway Sync local storage to the cloud AWS Import/Export Physical media shipping
  • 19. Relational Database Service Fully managed database (PostgreSQL, MySQL, Oracle, MSSQL) DynamoDB NoSQL, Schemaless, Provisioned throughput database S3 Object datastore up to 5TB per object 99.999999999% durability Compute Storage Big Data Where do you put your data once it’s in AWS?
  • 20. Elastic MapReduce Getting started, tools & operating modes
  • 22. EMR cluster Start an EMR cluster using the console or CLI tools
  • 23. Master instance group EMR cluster Master instance group created that controls the cluster (runs MySQL)
  • 24. Master instance group EMR cluster Core instance group Core instance group created for life of cluster
  • 25. Master instance group EMR cluster Core instance group HDFS HDFS Core instances run DataNode and TaskTracker daemons
  • 26. Master instance group EMR cluster Task instance group Core instance group HDFS HDFS Optional task instances can be added or subtracted to perform work
  • 27. Master instance group EMR cluster Task instance group Core instance group HDFS HDFS Amazon S3 S3 can be used as the underlying ‘file system’ for input/output data
  • 28. Master instance group EMR cluster Task instance group Core instance group HDFS HDFS Amazon S3 Master node coordinates distribution of work and manages cluster state
  • 29. Master instance group EMR cluster Task instance group Core instance group HDFS HDFS Amazon S3 Core and Task instances read-write to S3
  • 31. map Input file reduce Output file map Input file reduce Output file map Input file reduce Output file EC2 instance EC2 instance EC2 instance
  • 37. HDFS EMR Pig Analytics languages/engines Data management Redshift AWS Data Pipeline Kinesis S3 DynamoDB RDS
  • 38. EMR Create Cluster Run Job Step 1 Run Job Step 2 Run Job Step n End Job Flow Cluster created Master, Core & Task instance groups Processing step Execute a script such as Java, a Hive script, Python etc. Chained step Execute subsequent step using data from previous step Chained step Add steps dynamically or predefine them as a pipeline Cluster terminated Cluster instances terminated on end of job flow
  • 39. EMR Create Cluster Run Job Step 1 Run Job Step 2 Run Job Step n End Job Flow Hadoop breaks a data set into multiple sets if the data set is too large to process quickly on a single cluster node. A step can be a compiled Java .jar or a script executed by the Hadoop Streamer. Data is typically passed between steps using the HDFS file system on Core Instance Group nodes or S3. Output data is generally placed in S3 for subsequent use. Cluster created Master, Core & Task instance groups Processing step Execute a script such as Java, a Hive script, Python etc. Chained step Execute subsequent step using data from previous step Chained step Add steps dynamically or predefine them as a pipeline Cluster terminated Cluster instances terminated on end of job flow
  • 42. Utility that comes with the Hadoop distribution Allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer The mapper reads its input from standard input and the reducer writes its output to standard output By default, each line of input/output represents a record with tab-separated key/value Hadoop Streaming
  • 45. #!/usr/bin/python import sys import re def main(argv): pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*") for line in sys.stdin: for word in pattern.findall(line): print "LongValueSum:" + word.lower() + "\t" + "1" if __name__ == "__main__": main(sys.argv) Streaming: word count python script Stdin > Stdout
  • 46. #!/usr/bin/python import sys import re def main(argv): pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*") for line in sys.stdin: for word in pattern.findall(line): print "LongValueSum:" + word.lower() + "\t" + "1" if __name__ == "__main__": main(sys.argv) Streaming: word count python script Stdin > Stdout Read words from StdIn line by line
  • 47. #!/usr/bin/python import sys import re def main(argv): pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*") for line in sys.stdin: for word in pattern.findall(line): print "LongValueSum:" + word.lower() + "\t" + "1" if __name__ == "__main__": main(sys.argv) Streaming: word count python script Stdin > Stdout Output to StdOut tab delimited record e.g. “LongValueSum:abacus 1”
  • 51. Aggregate: Sort inputs and add up totals e.g “Abacus 1” “Abacus 1” becomes “Abacus 2”
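The aggregate step above sums the tab-delimited records the mapper emits. A minimal standalone Python sketch of that summing step — an illustration of the idea, not the actual Hadoop Aggregate library, and the function name is invented here:

```python
#!/usr/bin/env python
import sys
from collections import defaultdict

def reduce_counts(lines):
    """Sum the counts in 'LongValueSum:<word>\t<count>' records."""
    totals = defaultdict(int)
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        totals[key.replace("LongValueSum:", "")] += int(value)
    return dict(totals)

if __name__ == "__main__" and not sys.stdin.isatty():
    # As a streaming reducer: read mapper output from stdin,
    # write one "word<TAB>total" record per line to stdout.
    for word, total in sorted(reduce_counts(sys.stdin).items()):
        sys.stdout.write("%s\t%d\n" % (word, total))
```

So two `abacus 1` records become `abacus 2`, exactly as on the slide; Hadoop guarantees that all records sharing a key reach the same reducer, which is what makes this per-key summing correct.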
  • 53. Manages the job flow: coordinating the distribution of the MapReduce executable and subsets of the raw data to the core and task instance groups It also tracks the status of each task performed, and monitors the health of the instance groups. To monitor the progress of the job flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the web user interface that Hadoop serves on the master node Master instance group Instance groups
  • 54. Manages the job flow: coordinating the distribution of the MapReduce executable and subsets of the raw data to the core and task instance groups It also tracks the status of each task performed, and monitors the health of the instance groups. To monitor the progress of the job flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the web user interface that Hadoop serves on the master node Master instance group Contains all of the core nodes of a job flow. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). The EC2 instances you assign as core nodes are capacity that must be allotted for the entire job flow run. Core nodes run both the DataNode and TaskTracker Hadoop daemons. Core instance group Instance groups
  • 55. Manages the job flow: coordinating the distribution of the MapReduce executable and subsets of the raw data to the core and task instance groups It also tracks the status of each task performed, and monitors the health of the instance groups. To monitor the progress of the job flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the web user interface that Hadoop serves on the master node Master instance group Contains all of the core nodes of a job flow. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). The EC2 instances you assign as core nodes are capacity that must be allotted for the entire job flow run. Core nodes run both the DataNode and TaskTracker Hadoop daemons. Core instance group Contains all of the task nodes in a job flow. The task instance group is optional. You can add it when you start the job flow or add a task instance group to a job flow in progress. You can increase and decrease the number of task nodes. Because they don't store data and can be added and removed from a job flow, you can use task nodes to manage the EC2 instance capacity your job flow uses Task instance group Instance groups
  • 57. SSH onto master node with this keypair
  • 58. Keep cluster running so you can continue to work on it
  • 59. … aakar 3 abdal 3 abdelaziz 18 abolish 3 aboriginal 12 abraham 6 abseron 9 abstentions 15 accession 73 accord 90 achabar 3 achievements 3 acic 4 acknowledged 6 acquis 9 acquisitions 3 …
  • 65. Launches a job flow to run on a single-node cluster using an Amazon EC2 m1.small instance. When your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters
  • 66. elastic-mapreduce --create --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --log-uri s3n://mybucket/EMR/log
  • 67. elastic-mapreduce --create --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --log-uri s3n://mybucket/EMR/log Instance type/count
  • 68. Don’t terminate cluster – keep it running so steps can be added elastic-mapreduce --create --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --log-uri s3n://mybucket/EMR/log
  • 69. Adding Hive, Pig and HBase to the job flow elastic-mapreduce --create --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --pig-interactive --pig-versions latest --hive-interactive --hive-versions latest --hbase --log-uri s3n://mybucket/EMR/log
  • 70. Using a bootstrapping action to configure the cluster elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-s,dfs.block.size=1048576" --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --pig-interactive --pig-versions latest --hive-interactive --hive-versions latest --hbase --log-uri s3n://mybucket/EMR/log
  • 71. You can specify up to 16 bootstrap actions per job flow by providing multiple bootstrap-action parameters s3://elasticmapreduce/bootstrap-actions/configure-daemons Configure JVM properties --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19 s3://elasticmapreduce/bootstrap-actions/configure-hadoop Configure cluster wide Hadoop settings ./elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--site-config-file,s3://myawsbucket/config.xml,-s,mapred.tasktracker.map.tasks.maximum=2" s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive Configure cluster wide memory settings ./elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
  • 72. You can specify up to 16 bootstrap actions per job flow by providing multiple bootstrap-action parameters s3://elasticmapreduce/bootstrap-actions/run-if Conditionally run a command ./elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,echo running on master node" Shutdown actions Run scripts on shutdown A bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. Each script must run and complete within 60 seconds. Custom actions Run a custom script --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"
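The legacy elastic-mapreduce CLI used in these slides has since been superseded by the AWS SDKs and the `aws emr` CLI. As a hedged sketch, the cluster from the examples above (5 × m2.4xlarge, kept alive, with a configure-hadoop bootstrap action) could be expressed as a request for boto3's `run_job_flow`; the helper name and bucket are placeholders, and submitting for real requires AWS credentials:

```python
def build_cluster_request(log_bucket="mybucket"):
    """Build keyword arguments for boto3's emr.run_job_flow().

    Mirrors the legacy CLI example: 5 m2.4xlarge instances, kept
    alive after steps finish (--alive), one bootstrap action.
    """
    return {
        "Name": "MyJobFlow",
        "LogUri": "s3://%s/EMR/log" % log_bucket,
        "Instances": {
            "MasterInstanceType": "m2.4xlarge",
            "SlaveInstanceType": "m2.4xlarge",
            "InstanceCount": 5,
            "Ec2KeyName": "micro",
            "KeepJobFlowAliveWhenNoSteps": True,  # the --alive flag
        },
        "BootstrapActions": [
            {
                "Name": "configure-hadoop",
                "ScriptBootstrapAction": {
                    "Path": "s3://elasticmapreduce/bootstrap-actions/configure-hadoop",
                    "Args": ["-s", "dfs.block.size=1048576"],
                },
            }
        ],
    }

# Actually submitting the request (needs boto3 and credentials):
#   import boto3
#   emr = boto3.client("emr", region_name="eu-west-1")
#   emr.run_job_flow(**build_cluster_request())
```

Keeping the request as a plain dict makes the cluster definition easy to version-control and reuse, which is harder with long one-off CLI invocations.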
  • 76. Hadoop streaming: Unstructured data with wide variation of formats, high flexibility on language. Broad language support (Python, Perl, Bash…). Mapper, reducer, input and output all supplied by the job flow. Scales from 1 to an unlimited number of processes
  • 77. Hadoop streaming: Unstructured data with wide variation of formats, high flexibility on language. Broad language support (Python, Perl, Bash…). Mapper, reducer, input and output all supplied by the job flow. Scales from 1 to an unlimited number of processes. But: doing analytics in Eclipse is wrong; the 2-stage dataflow is rigid; joins are lengthy and error-prone; prototyping/exploration requires compile & deploy
  • 78. Hadoop streaming: Unstructured data with wide variation of formats, high flexibility on language. Broad language support (Python, Perl, Bash…). Mapper, reducer, input and output all supplied by the job flow. Scales from 1 to an unlimited number of processes. Hive: Structured data which is lower value or where higher latency is tolerable. Pig: Semi-structured data which can be mined using set operations. HBase: Near-real-time K/V store for structured data
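A minimal Hadoop streaming job can be sketched in Python (a word count; the sample input here is invented for illustration). The mapper emits tab-separated key/value pairs; Hadoop's shuffle sorts them by key, and the reducer aggregates each group. Both phases are written over plain iterables so the sketch runs anywhere:

```python
from itertools import groupby


def mapper(lines):
    """Map phase: emit one tab-separated 'word 1' pair per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()


def reducer(lines):
    """Reduce phase: sum counts per word; input arrives sorted by key."""
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


# In a real job flow each phase reads sys.stdin and prints to stdout;
# the sorted() call below stands in for Hadoop's shuffle-and-sort.
sample = ["the quick brown fox", "the lazy dog"]
counts = dict(line.split("\t") for line in reducer(sorted(mapper(sample))))
```

On EMR the same two scripts would typically be supplied to a streaming step as the mapper and reducer, with input and output locations on S3.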
  • 80. SQL interface for working with data. A simple way to use Hadoop. The CREATE TABLE statement references the data location on S3. The language, called HiveQL, is similar to SQL. An example query: SELECT COUNT(1) FROM sometable; Requires setting up a mapping to the input data. Uses SerDes to make it possible to query different input formats. Powerful data types (Array, Map, …)
  • 81. SQL vs HiveQL: Updates: UPDATE, INSERT, DELETE vs INSERT OVERWRITE TABLE. Transactions: Supported vs Not supported. Indexes: Supported vs Not supported. Latency: Sub-second vs Minutes. Functions: Hundreds vs Dozens. Multi-table inserts: Not supported vs Supported. Create table as select: Not valid SQL-92 vs Supported
  • 82. JDBC listener started by default on EMR. Port 10003 for Hive 0.8.1, ports 9100-9101 for Hadoop. Download hive-jdbc-0.8.1.jar. Access by opening the port on the ElasticMapReduce-master security group, or via an SSH tunnel
  • 83. ./elastic-mapreduce --create --name "Hive job flow" --hive-script --args s3://myawsbucket/myquery.q --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output HiveQL to execute
  • 84. ./elastic-mapreduce --create --alive --name "Hive job flow" --num-instances 5 --instance-type m1.large --hive-interactive Interactive Hive session
  • 85. CREATE EXTERNAL TABLE impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string ) PARTITIONED BY (dt string) ROW FORMAT serde 'com.amazon.elasticmapreduce.JsonSerde' with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' ) LOCATION 's3://mybucketsource/tables/impressions' ;
  • 86. CREATE EXTERNAL TABLE impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string ) PARTITIONED BY (dt string) ROW FORMAT serde 'com.amazon.elasticmapreduce.JsonSerde' with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' ) LOCATION 's3://mybucketsource/tables/impressions' ; Table structure to create (happens fast as it is just a mapping to the source)
  • 87. CREATE EXTERNAL TABLE impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string ) PARTITIONED BY (dt string) ROW FORMAT serde 'com.amazon.elasticmapreduce.JsonSerde' with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' ) LOCATION 's3://mybucketsource/tables/impressions' ; Source data in S3
  • 88. hive> select * from impressions limit 5; Selecting from source data directly via Hadoop
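What the JsonSerde `paths` mapping in the CREATE TABLE above does can be illustrated in plain Python. This is a sketch, not Hive's implementation: `project_row` is a hypothetical helper and the sample record is invented. Each configured path is pulled out of a JSON impression record, with missing fields surfacing as NULL, much as Hive presents them:

```python
import json

# The same field paths configured in the table's serdeproperties.
PATHS = ["requestBeginTime", "adId", "impressionId", "referrer",
         "userAgent", "userCookie", "ip"]


def project_row(json_line, paths=PATHS):
    """Return one row: the value for each configured path, None if absent."""
    record = json.loads(json_line)
    return [record.get(path) for path in paths]


# A hypothetical impression record, shaped like the table above.
sample = '{"requestBeginTime": "1239672000000", "adId": "m9nwdhn", "ip": "10.0.0.1"}'
row = project_row(sample)
```

Because the table is EXTERNAL and only maps field paths onto files in S3, creating it is fast: no data is moved or converted until a query actually reads it.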
  • 89. Comparing the options (columns: Streaming, Hive, Pig, DynamoDB, Redshift). Unstructured data: Streaming ✓, Pig ✓. Structured data: Hive ✓, Pig ✓, DynamoDB ✓, Redshift ✓. Language support: Any*, HQL, Pig Latin, client SDK, SQL. Volume: Unlimited, Unlimited, Unlimited, Relatively low, 1.6 PB. Latency: Medium, Medium, Medium, Ultra low, Low
  • 90. Cost considerations Making the most of EC2 resources
  • 91. 1 instance for 100 hours = 100 instances for 1 hour
  • 93. 1 instance for 1,000 hours = 1,000 instances for 1 hour
  • 100. Job Flow 14 Hours Duration: Scenario #1 EMR with spot instances #1: Cost without Spot 4 instances *14 hrs * $0.50 = $28
  • 101. Job Flow 14 Hours Duration: Scenario #1 Duration: Job Flow 7 Hours Scenario #2 EMR with spot instances #1: Cost without Spot 4 instances *14 hrs * $0.50 = $28
  • 102. Job Flow 14 Hours Duration: Scenario #1 Duration: Job Flow 7 Hours Scenario #2 EMR with spot instances #1: Cost without Spot 4 instances *14 hrs * $0.50 = $28 #2: Cost with Spot 4 instances *7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75 Total = $22.75
  • 103. Job Flow 14 Hours Duration: Scenario #1 Duration: Job Flow 7 Hours Scenario #2 EMR with spot instances #1: Cost without Spot 4 instances * 14 hrs * $0.50 = $28 #2: Cost with Spot 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75 Total = $22.75 Time Savings: 50% Cost Savings: ~19%
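The spot-pricing arithmetic in these scenarios can be checked with a short calculation. The $0.50 on-demand and $0.25 spot hourly rates are the illustrative figures from the example, not current EC2 prices:

```python
# Sketch of the job-flow cost comparison from the example above.
ON_DEMAND_RATE = 0.50   # $/instance-hour (illustrative)
SPOT_RATE = 0.25        # $/instance-hour (illustrative)


def job_flow_cost(on_demand_nodes, spot_nodes, hours):
    """Cost of a job flow mixing on-demand core nodes and spot task nodes."""
    return (on_demand_nodes * ON_DEMAND_RATE + spot_nodes * SPOT_RATE) * hours


# Scenario 1: 4 on-demand instances for 14 hours.
baseline = job_flow_cost(4, 0, 14)       # $28.00
# Scenario 2: the same 4 instances plus 5 spot instances; done in 7 hours.
with_spot = job_flow_cost(4, 5, 7)       # $22.75
savings = 1 - with_spot / baseline       # 0.1875, i.e. roughly 19% cheaper
```

Adding spot-market task nodes halves the run time here while still cutting the bill, which is why spot is a natural fit for EMR task nodes: losing one mid-job costs some rework, not data.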
  • 105. Managing big data Dynamic Creation of Clusters for Processing Jobs Bulk of Data Stored in S3 Use Reduced Redundancy Storage for Lower Value / Re-creatable Data, and Lower Cost by 30% DynamoDB for High Performance Data Access Migrate data to Glacier for Archive and Periodic Access VPC & IAM for Network Security
  • 106. Shark can return results up to 30 times faster than the same queries run on Hive. elastic-mapreduce --create --alive --name "Spark/Shark Cluster" --bootstrap-action s3://elasticmapreduce/samples/spark/install-spark-shark.sh --bootstrap-name "install Mesos/Spark/Shark" --hive-interactive --hive-versions 0.7.1.3 --ami-version 2.0 --instance-type m1.xlarge --instance-count 3 --key-pair ec2-keypair aws.amazon.com/blogs/aws/
  • 108. What is EMR? Map-Reduce engine Integrated with tools Hadoop-as-a-service Massively parallel Cost effective AWS wrapper Integrated to AWS services
  • 109. Try the tutorial: Contextual Advertising using Apache Hive and Amazon EMR aws.amazon.com/articles/2855 Find out more: aws.amazon.com/big-data http://aws.amazon.com/elasticmapreduce/getting-started/
  • 110. Ian Massingham – Technical Evangelist @IanMmmm @AWS_UKI for local AWS events & news @AWScloud for Global AWS News and Announcements ©Amazon.com, Inc. and its affiliates. All rights reserved.
  • 111. AWS Training & Certification Certification aws.amazon.com/certification Demonstrate your skills, knowledge, and expertise with the AWS platform Self-Paced Labs aws.amazon.com/training/ self-paced-labs Try products, gain new skills, and get hands-on practice working with AWS technologies aws.amazon.com/training Training Skill up and gain confidence to design, develop, deploy and manage your applications on AWS
  • 112. We typically see customers start by trying our services Get started now at : aws.amazon.com/getting-started
  • 113. Design your application for the AWS Cloud More details on the AWS Architecture Center at : aws.amazon.com/architecture

Notas del editor

  1. 45 minutes is not a lot of time to cover EMR, so if there’s anything we don’t cover that you would like to know more about, then ask a question and we will get back to you with the answer, or point you to some resources that can help you learn more.
  2. Horizontal scaling on commodity hardware. Perfect for Hadoop.
  8. Speaker Notes: We have several programs available to help you deepen your knowledge and proficiency with AWS. We encourage you to check out the following resources: Get hands-on experience testing products and gaining practical experience working with AWS technologies by taking an AWS Self-Paced Lab at run.qwiklab.com. Available anywhere, anytime, you have freedom to take self-paced labs on-demand and learn at your own pace. AWS self-paced labs were designed by AWS subject matter experts and provide an opportunity to use the AWS console in a variety of pre-designed scenarios and common use cases, giving you hands-on practice in a live AWS environment to help you gain confidence working with AWS. You have flexibility to choose from topics and products about which you want to learn more. Take an instructor-led AWS Training course. We have a variety of role-based courses to meet the requirements of your job role and business need, whether you’re a Solutions Architect, Developer, SysOps Administrator, or just interested in learning AWS fundamentals. AWS Certification validates your skills, knowledge an expertise in working with AWS services. Earning certification enables you to gain visibility and credibility for your skills.
  9. 750 hours for EC2 per month, for Linux and Windows, 5Gb for S3, and 30Gb for EBS
  10. If you already have a project in mind then you may just want to start by designing the application.