Hadoop in the cloud with AWS' EMR
1. Hadoop in the Cloud: AWS Elastic MapReduce
• What is EMR?
• How does EMR compare to Hadoop?
• Use cases
2. EMR is an AWS Service
• A review of AWS basics helps in understanding EMR
• Infiniteskills offers a course!
– http://bit.ly/learn-aws
• AWS is constantly changing and evolving
– http://aws.amazon.com/documentation/elasticmapreduce/
3. EMR Overview
• Abstracts out cluster setup & management
– Integrated provisioning, tooling, debug, monitoring
– AWS constantly tuning and optimizing
– Failed nodes automatically re-provisioned by AWS
• Reduced costs
– Clusters shut down automatically by default
– Excellent for sporadic MapReduce needs
• Integration with AWS
– Leverage cost-effective EC2 instances for processing, S3 for storage
– Monitoring done via CloudWatch
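The automatic shutdown and the EC2 + S3 split described above can be sketched with boto3's EMR client. This is a minimal sketch, not a definitive setup: the bucket name, release label, instance types, and role names are assumptions chosen for illustration, and `build_transient_cluster_request` and `launch` are hypothetical helper names. Only `launch` talks to AWS.

```python
def build_transient_cluster_request(name, log_uri, steps):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All concrete values below are illustrative assumptions.
    """
    return {
        "Name": name,
        "LogUri": log_uri,               # S3 location for cluster logs
        "ReleaseLabel": "emr-6.15.0",    # assumed release; use one available to you
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance type
            "SlaveInstanceType": "m5.xlarge",   # assumed instance type
            "InstanceCount": 3,                 # 1 master + 2 core nodes
            # False: the cluster terminates once all steps have finished
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": steps,
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request, region="us-east-1"):
    """Submit the request to EMR; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    request = build_transient_cluster_request(
        "sporadic-job", "s3://example-bucket/logs/", steps=[]
    )
    print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

Keeping the request builder separate from the AWS call makes it possible to inspect or unit-test the configuration without starting (and paying for) a cluster.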
4. EMR Architecture
[Diagram: Master Instance Group (EC2); Core Instance Group (EC2 instances, each with HDFS); Task Instance Group (EC2 instances); S3]
• Master group controls cluster
• Core group runs DataNode & TaskTracker daemons
• Task group runs tasks
– Can be added & removed
• S3 can be used for data input / output
• Master group coordinates core + task activities and manages cluster state
• Core + task instances read / write to / from S3
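The three instance groups above map onto the `InstanceGroups` structure accepted by boto3's `run_job_flow`, and a task group can later be resized with `modify_instance_groups`. The following is a rough sketch under assumptions: the instance type, the group names, and the IDs in the usage lines are placeholders, and the two helper functions are hypothetical names introduced for illustration.

```python
def build_instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Describe master, core, and (optional) task groups for run_job_flow."""
    groups = [
        # The master group controls the cluster; a single instance here
        {"Name": "master", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core instances hold HDFS data and also run tasks
        {"Name": "core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task instances only run tasks, so they can be added and removed
        groups.append({"Name": "task", "InstanceRole": "TASK",
                       "InstanceType": instance_type,
                       "InstanceCount": task_count})
    return groups


def build_task_resize(cluster_id, instance_group_id, new_count):
    """Build keyword arguments for boto3's EMR modify_instance_groups call."""
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count}
        ],
    }


if __name__ == "__main__":
    for group in build_instance_groups(core_count=2, task_count=4):
        print(group["InstanceRole"], group["InstanceCount"])
    # Placeholder IDs; real ones come from the EMR API or console
    print(build_task_resize("j-EXAMPLE", "ig-EXAMPLE", 8))
```

With a real client, the first result would be passed as `Instances={"InstanceGroups": ...}` to `run_job_flow`, and the second as `emr.modify_instance_groups(**kwargs)`.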
5. EMR AWS Integration
• Datastore pull / push to
– RDS
– DynamoDB
– S3
• Derived data can be stored in Redshift
– Via AWS Data Pipeline
– Further post-processing
• Data can be pre-processed with Kinesis
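As one illustration of the S3 pull / push listed above, a Hadoop streaming step can read its input from S3 and write its output back to S3. This sketch assumes a release that ships `command-runner.jar` (EMR 4.x or later) and mapper / reducer scripts already uploaded to S3; the bucket paths and the `build_streaming_step` helper are hypothetical.

```python
import posixpath


def build_streaming_step(name, mapper_uri, reducer_uri, input_uri, output_uri):
    """Build one entry for the Steps list of boto3's EMR run_job_flow
    (or add_job_flow_steps) call."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                # Ship the scripts from S3 to every node
                "-files", f"{mapper_uri},{reducer_uri}",
                # Scripts are referenced by file name once shipped
                "-mapper", posixpath.basename(mapper_uri),
                "-reducer", posixpath.basename(reducer_uri),
                "-input", input_uri,    # data pulled from S3
                "-output", output_uri,  # results pushed to S3
            ],
        },
    }


if __name__ == "__main__":
    step = build_streaming_step(
        "word-count",
        "s3://example-bucket/code/mapper.py",
        "s3://example-bucket/code/reducer.py",
        "s3://example-bucket/input/",
        "s3://example-bucket/output/",
    )
    print(step["HadoopJarStep"]["Args"])
```

The output location must not already exist when the step runs, since Hadoop refuses to overwrite an existing output directory.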
6. What you give up with EMR
• Control
– Always 2-3 months behind Hadoop releases
– Cannot use CDH or HDP releases (although MapR is supported)
• Speed (if you’re not already an AWS customer, your data first has to be moved into AWS)
• Vendor lock-in
7. EMR Use Cases
• Already AWS customer
– Lots of data in S3 / DynamoDB / RDS
• Sporadic MapReduce needs
• Hadoop proofs of concept
• Ease of use
– Seamless, near-infinite scale
– Simple administration
8. Hadoop in the Cloud: AWS Elastic MapReduce
• What is EMR?
• How does EMR compare to Hadoop?
• Benefits & downsides
• Use cases