Amazon EMR is a managed Hadoop service that makes it easy for customers to use big data frameworks and applications like Hadoop, Spark, and Presto to analyze data stored in HDFS or in Amazon S3, Amazon’s highly scalable object storage service. In this webinar, we will introduce the latest release of Amazon EMR. With Amazon EMR release 5.0, customers can now launch the latest versions of popular open source frameworks, including Apache Spark 2.0, Hive 2.1, Presto 0.151, Tez 0.8.4, and Apache Hadoop 2.7.2. We will walk through a demo that shows how to deploy a Hadoop environment within minutes, and cover common use cases and best practices for lowering costs by using Amazon S3 as your data store and Amazon EC2 Spot Instances, which let you bid on spare Amazon EC2 computing capacity.
Learning Objectives:
• Describe the new features and updated frameworks in Amazon EMR 5.0
• Learn best practices and real-world applications for Amazon EMR
• Understand how to use EC2 Spot pricing to save costs
• Explain the advantages of decoupling storage and compute, with Amazon S3 as the storage layer for EMR workloads
2. Agenda
• Quick Introduction to Amazon EMR
• What’s New in Amazon EMR release 5.0
• Interactive Query Demo
• Use Cases
• Best Practices
3. Why Amazon EMR?
• Easy to Use: launch a cluster in minutes
• Low Cost: pay an hourly rate
• Elastic: easily add or remove capacity
• Reliable: spend less time monitoring
• Secure: manage firewalls
• Flexible: customize the cluster
5. Options to submit jobs – Off Cluster
• Amazon EMR Step API: submit a Spark application
• AWS Data Pipeline, or Airflow, Luigi, and other schedulers on EC2: create a pipeline to schedule job submission or create complex workflows
• AWS Lambda: use AWS Lambda to submit applications to the EMR Step API or directly to Spark on your cluster
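A minimal sketch of the Step API path using boto3; the cluster ID, bucket, and script path are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Submit a Spark application as an EMR step via command-runner.jar
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder: your cluster ID
    Steps=[{
        "Name": "Run Spark application",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/my_app.py"],  # placeholder script
        },
    }],
)
print(response["StepIds"])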
6. Options to submit jobs – On Cluster
• Web UIs: Hue SQL editor, Zeppelin notebooks, R Studio, and more!
• Connect with ODBC / JDBC using HiveServer2 or the Spark Thrift Server (start it with start-thriftserver.sh)
• Use Spark Actions in your Apache Oozie workflow to create DAGs of jobs
• Or, use the native APIs and CLIs for each application
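For illustration, a minimal sketch of that connection path using the third-party PyHive library; the host, port, and table name are assumptions (HiveServer2 listens on port 10000 by default, and the Spark Thrift Server on EMR typically uses 10001):

from pyhive import hive

# Connect to the Spark Thrift Server (or HiveServer2) on the master node
conn = hive.connect(host="localhost", port=10001)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM weblogs")  # hypothetical table
print(cursor.fetchall())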
7. Many storage layers to choose from
• Amazon S3, via the EMR File System (EMRFS)
• Amazon DynamoDB
• Amazon RDS
• Amazon Kinesis
• Amazon Redshift
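On an EMR cluster, the s3:// URI scheme resolves to EMRFS, so Spark reads S3 like any other filesystem; the bucket and column names below are placeholders:

# assumes a SparkSession named `spark`, e.g. from the pyspark shell on EMR
df = spark.read.parquet("s3://my-bucket/data/")  # read from S3 via EMRFS
df.filter(df["year"] == 2016).show()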
12. Spark 2.0 – Performance Enhancements
• Second generation Tungsten engine
• Whole-stage code generation to create optimized
bytecode at runtime
• Improvements to Catalyst optimizer for query
performance
• New vectorized Parquet decoder to increase throughput
13. Datasets and DataFrames (Spark 2.0)
• Datasets
• Distributed collection of data
• Strong typing, with the ability to use lambda functions
• Object-oriented operations (similar to RDD API)
• Optimized encoders which increase performance and
minimize serialization/deserialization overhead
• Compile-time type safety for more robust applications
• DataFrames
• Dataset organized into named columns
• Represented as a Dataset of rows
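In Python only the untyped DataFrame API is exposed (typed Datasets are Scala/Java only), but the idea of a Dataset of rows is easy to see; a small sketch with placeholder data:

from pyspark.sql import Row

# A DataFrame is a Dataset organized into named columns (a Dataset of Rows)
df = spark.createDataFrame([Row(name="alice", age=30),
                            Row(name="bob", age=25)])
df.select("name").where(df["age"] > 26).show()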
14. Spark SQL (Spark 2.0)
• SparkSession – replaces the old SQLContext and
HiveContext
• Seamlessly mix SQL with Spark programs
• ANSI SQL Parser and subquery support
• HiveQL compatibility and can directly use tables in
Hive metastore
• Connect through JDBC / ODBC using the Spark Thrift
server
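A minimal sketch of the new entry point; the table name is hypothetical and assumes a Hive metastore is configured:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sql-demo")
         .enableHiveSupport()   # use tables in the Hive metastore directly
         .getOrCreate())

# mix SQL with Spark programs; `weblogs` is a hypothetical metastore table
spark.sql("SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page").show()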
15. Spark 2.0 – ML Updates
• Additional distributed algorithms in SparkR, including K-Means, Generalized Linear Models, and Naive Bayes
• ML pipeline persistence is now supported across all
languages
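For example, a pipeline fitted in one language can be saved and reloaded in another; a sketch in PySpark, where training_df (a DataFrame with text and label columns) and the S3 path are placeholders:

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(training_df)           # training_df: labeled DataFrame
model.save("s3://my-bucket/models/lr")      # persist the fitted pipeline
same_model = PipelineModel.load("s3://my-bucket/models/lr")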
16. Spark 2.0 – Structured Streaming
• Structured Streaming API is an extension to the
DataFrame/Dataset API (instead of DStream)
• SparkSession is the new entry point for streaming
• Unifies processing of static and streaming datasets, abstracting away the velocity of the data
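A minimal Structured Streaming sketch in PySpark; the schema, input path, and console sink are assumptions, and a SparkSession named spark is assumed to exist:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("user", StringType()),
                     StructField("action", StringType())])

# treat new files landing in S3 as an unbounded table
events = spark.readStream.schema(schema).json("s3://my-bucket/events/")
counts = events.groupBy("action").count()

query = (counts.writeStream
         .outputMode("complete")   # emit the full aggregate each trigger
         .format("console")
         .start())
query.awaitTermination()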
17. Configuring Executors – Dynamic Allocation
• Optimal resource utilization
• YARN dynamically creates and shuts down executors
based on the resource needs of the Spark application
• Spark uses the executor memory and executor cores
settings in the configuration for each executor
• Amazon EMR uses dynamic allocation by default, and
calculates the default executor size to use based on the
instance family of your Core Group
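If you want to override these defaults, executor settings can be supplied at cluster launch through an EMR configuration classification; a sketch, where the values are illustrative only:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "true",
      "spark.executor.memory": "4g",
      "spark.executor.cores": "2"
    }
  }
]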
20. Hive 2.1 – New Features
• Improvements to Hive’s cost-based optimizer
• LLAP (beta) for faster processing (coming soon)
• Predicate pushdown for Parquet file format
• HPL/SQL for procedural SQL
• Similar to Oracle’s PL/SQL and Teradata’s stored
procedures
• Hive-On-Spark improvements
• Apache HBase as Hive Metastore (alpha)
• CLI mode in Beeline (Hive CLI deprecation)
21. Use RDS for an external Hive metastore
• Host the Hive metastore in Amazon Aurora (or another Amazon RDS engine), with the schema for tables whose data lives in Amazon S3
• Set the metastore location in hive-site, so clusters share a metastore that outlives any individual cluster
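A sketch of the hive-site classification for pointing a cluster at an external metastore; the endpoint and credentials are placeholders:

{
  "Classification": "hive-site",
  "Properties": {
    "javax.jdo.option.ConnectionURL":
      "jdbc:mysql://<rds-endpoint>:3306/hive?createDatabaseIfNotExist=true",
    "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "<username>",
    "javax.jdo.option.ConnectionPassword": "<password>"
  }
}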
23. Presto: in-memory distributed query engine
• Supports standard ANSI SQL
• Supports rich analytical functions
• Supports a wide range of data sources
• Combines data from multiple sources in a single query
• Response times range from seconds to minutes
24. High Performance
• E.g., Netflix runs 3,500+ Presto queries per day on a 25+ PB dataset in S3, with 350 active platform users
Extensibility
• Pluggable backends: Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, and more
• JDBC and ODBC for commercial BI tools and dashboards
• Client protocol: HTTP+JSON, with support for various languages (Python, Ruby, PHP, Node.js, Java (JDBC), C#, …)
ANSI SQL
• Complex queries, joins, aggregations, and various functions (e.g., window functions)
26. High Level Architecture
Components: a coordinator and multiple workers. Queries are submitted by a client, such as the Presto CLI, to the coordinator. The coordinator parses, analyzes, and plans the query execution, then distributes the processing to the workers.
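For example, on the master node you can submit a query from the Presto CLI; the table name is hypothetical:

presto-cli --catalog hive --schema default \
  --execute "SELECT COUNT(*) FROM weblogs"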
28. Hue 3.10 – New Features
• Fully redesigned SQL editor with table assist, better
autocomplete of values and nested types, more
charts, and search and replace functionality
• New notebook UI with graphical widgets
• Dry-run Oozie jobs to test options before execution
• Email action on Oozie job failure
• Improved security features including TLS certificate
chain support, passwords in file scripts, and inactive
user timeouts
30. Zeppelin 0.6.1 – New Features
• Shiro Authentication
• Notebook Authorization
Save your notebook in S3 by setting zeppelin-env:
export ZEPPELIN_NOTEBOOK_S3_BUCKET=bucket_name
export ZEPPELIN_NOTEBOOK_S3_USER=username
# optional
export ZEPPELIN_NOTEBOOK_S3_KMS_KEY_ID=kms-key-id
39. Decouple compute and storage by using S3 as your data layer
• Amazon S3 is designed for 11 9’s of durability and is massively scalable
• With HDFS, data lives on the cluster’s EC2 instances (memory and local disk); with S3 as the data layer, storage persists independently of the cluster
• Intermediate results are still stored on local disk or HDFS
40. Partitions, compression, and file formats
• Avoid key names in lexicographical order
• Improve throughput and S3 list performance
• Use hashing/random prefixes or reverse the date-time
• Compress data sets to minimize bandwidth from S3 to EC2
• Make sure you use splittable compression or have each file
be the optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased
performance on reads
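A hedged PySpark sketch combining these ideas: write data as columnar Parquet, partitioned by a date column; the DataFrame, path, and column name are placeholders:

# writes splittable, columnar Parquet files, partitioned by date
(df.write
   .partitionBy("dt")
   .parquet("s3://my-bucket/output/weblogs/"))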
41. Use EC2 Spot Instances to save money
• Use the Spot Bid Advisor to help find the optimal bid
price
• Resize your cluster with EMR task groups to add
capacity to YARN without adding HDFS data nodes
• Store data in S3 so the cluster can be recreated if Spot capacity is reclaimed
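A sketch of adding a Spot task group to a running cluster with boto3; the cluster ID, instance type, and bid price are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",    # placeholder: your cluster ID
    InstanceGroups=[{
        "Name": "spot-task-group",
        "InstanceRole": "TASK",      # task nodes hold no HDFS data
        "InstanceType": "m3.xlarge",
        "InstanceCount": 4,
        "Market": "SPOT",
        "BidPrice": "0.10",          # your bid, in USD per hour
    }],
)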
42. Configuring VPC private subnets
• Use Amazon S3 Endpoints for
connectivity to S3
• Use Managed NAT for connectivity to
other services or the Internet
• Control the traffic using Security Groups
• ElasticMapReduce-Master-Private
• ElasticMapReduce-Slave-Private
• ElasticMapReduce-ServiceAccess
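A sketch of creating the S3 endpoint with boto3; the VPC ID, route table, and region are placeholders:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.create_vpc_endpoint(
    VpcId="vpc-xxxxxxxx",                      # placeholder: your VPC
    ServiceName="com.amazonaws.us-east-1.s3",  # S3 in your region
    RouteTableIds=["rtb-xxxxxxxx"],            # private subnet route table
)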