SlideShare una empresa de Scribd logo
1 de 43
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jonathan Fritz, Amazon EMR
August 23, 2016
Introducing Amazon EMR
Release 5.0
Faster, Easier Hadoop, Spark, and Presto
Agenda
• Quick Introduction to Amazon EMR
• What’s New in Amazon EMR release 5.0
• Interactive Query Demo
• Use Cases
• Best Practices
Why Amazon EMR?
Easy to Use
Launch a cluster in minutes
Low Cost
Pay an hourly rate
Elastic
Easily add or remove capacity
Reliable
Spend less time monitoring
Secure
Manage firewalls
Flexible
Customize the cluster
Storage
S3 (EMRFS), HDFS
YARN
Cluster Resource Management
Batch
MapReduce
Interactive
Tez
In Memory
Spark
Applications
Hive, Pig, Spark SQL/Streaming/ML, Mahout, Sqoop
HBase/Phoenix
Presto
Hue (SQL Interface/Metastore Management)
Zeppelin (Interactive Notebook)
Ganglia (Monitoring)
HiveServer2/Spark Thriftserver (JDBC/ODBC)
Amazon EMR service
Options to submit jobs – Off Cluster
Amazon EMR
Step API
Submit a Spark
application
Amazon EMR
AWS Data Pipeline
Airflow, Luigi, or other
schedulers on EC2
Create a pipeline
to schedule job
submission or create
complex workflows
AWS Lambda
Use AWS Lambda to
submit applications to
EMR Step API or directly
to Spark on your cluster
Options to submit jobs – On Cluster
Web UIs: Hue SQL editor,
Zeppelin notebooks,
R Studio, and more!
Connect with ODBC / JDBC using
HiveServer2/Spark Thriftserver
Use Spark Actions in your Apache Oozie
workflow to create DAGs of jobs.
(start using
start-thriftserver.sh)
Or, use the native APIs and CLIs for
each application
Many storage layers to choose from
Amazon DynamoDB
Amazon RDS Amazon Kinesis
Amazon Redshift
EMR File System
(EMRFS)
Amazon S3
Amazon EMR
EMR 5.0 - Applications
Quick introduction to Spark
join
filter
groupBy
Stage 3
Stage 1
Stage 2
A: B:
C: D: E:
F:
= cached partition= RDD
map
• Massively parallel
• Uses DAGs instead of map-
reduce for execution
• Minimizes I/O by storing data
in DataFrames in memory
• Partitioning-aware to avoid
network-intensive shuffle
Spark components to match your use case
Spark 2.0 – Performance Enhancements
• Second generation Tungsten engine
• Whole-stage code generation to create optimized
bytecode at runtime
• Improvements to Catalyst optimizer for query
performance
• New vectorized Parquet decoder to increase throughput
Datasets and DataFrames (Spark 2.0)
• Datasets
• Distributed collection of data
• Strong typing, ability to use Lambda functions
• Object-oriented operations (similar to RDD API)
• Optimized encoders which increase performance and
minimize serialization/deserialization overhead
• Compile-time type safety for more robust applications
• DataFrames
• Dataset organized into named columns
• Represented as a Dataset of rows
Spark SQL (Spark 2.0)
• SparkSession – replaces the old SQLContext and
HiveContext
• Seamlessly mix SQL with Spark programs
• ANSI SQL Parser and subquery support
• HiveQL compatibility and can directly use tables in
Hive metastore
• Connect through JDBC / ODBC using the Spark Thrift
server
Spark 2.0 – ML Updates
• Additional distributed algorithms in SparkR, including K-
Means, Generalized Linear Models, and Naive Bayes
• ML pipeline persistence is now supported across all
languages
Spark 2.0 – Structured Streaming
• Structured Streaming API is an extension to the
DataFrame/Dataset API (instead of DStream)
• SparkSession is the new entry point for streaming
• Better merges processing on static and streaming
datasets, abstracting the velocity of the data
Configuring Executors – Dynamic Allocation
• Optimal resource utilization
• YARN dynamically creates and shuts down executors
based on the resource needs of the Spark application
• Spark uses the executor memory and executor cores
settings in the configuration for each executor
• Amazon EMR uses dynamic allocation by default, and
calculates the default executor size to use based on the
instance family of your Core Group
Apache Hive 2.1.0 with
Apache Tez 0.8.4
Tez is the default engine for Hive and Pig
Hive 2.1 – New Features
• Improvements to Hive’s cost-based optimizer
• LLAP (beta) for faster processing (coming soon)
• Predicate pushdown for Parquet file format
• HPL/SQL for procedural SQL
• Similar to Oracle’s PL/SQL and Teradata’s stored
procedures
• Hive-On-Spark improvements
• Apache HBase as Hive Metastore (alpha)
• CLI mode in Beeline (Hive CLI deprecation)
Use RDS for an external Hive metastore
Amazon Aurora
Hive Metastore with
schema for tables in S3
Amazon S3Set metastore
location in hive-site
Presto 0.150
In-memory distributed query engine
Support standard ANSI-SQL
Support rich analytical functions
Support wide range of data sources
Combine data from multiple sources in single
query
Response time ranges from seconds to
minutes
High Performance
• E.g. Netflix: runs 3500+ Presto queries / day on 25+
PB dataset in S3 with 350 active platform users
Extensibility
• Pluggable backends: Hive, Cassandra, JMX, Kafka,
MySQL, PostgreSQL, MySQL, and more
• JDBC, ODBC for commercial BI tools or dashboards
• Client Protocol: HTTP+JSON, support various
languages (Python, Ruby, PHP, Node.js, Java(JDBC),
C#,…)
ANSI SQL
• complex queries, joins, aggregations, various functions
(Window functions)
Presto: In-memory processing and pipelining
High Level Architecture
Components: a coordinator and multiple workers.
Queries are submitted by a client such as the Presto CLI to the coordinator.
The coordinator parses, analyzes and plans the query execution, then
distributes the processing to the workers.
Hue 3.10
Hue 3.10 – New Features
• Fully redesigned SQL editor with table assist, better
autocomplete of values and nested types, more
charts, and search and replace functionality
• New notebook UI with graphical widgets
• Dry-run Oozie jobs to test options before execution
• Email action on Oozie job failure
• Improved security features including TLS certificate
chain support, passwords in file scripts, and inactive
user timeouts
Apache Zeppelin 0.6.1
Zeppelin 0.6.1 – New Features
• Shiro Authentication
• Notebook Authorization
Save your notebook in S3 by setting zeppelin-env:
export ZEPPELIN_NOTEBOOK_S3_BUCKET =
bucket_name
export ZEPPELIN_NOTEBOOK_S3_USER = username
(optional) export
ZEPPELIN_NOTEBOOK_S3_KMS_KEY_ID = kms-key-
id
Build your own Apache Bigtop
Application
Creating an Apache Bigtop application
https://blogs.aws.amazon.com/bigdata/post/TxNJ6YS4X6S59U/Building-and-Deploying-Custom-Applications-
Demo
Customer Use Cases
Twitter (Answers) uses EMR as the batch layer
in their Lambda architecture
FINRA saves money with comparable
performance with Hive on Tez with S3
A Few Tips
Decouple compute and storage by using S3
as your data layer
HDFS
S3 is designed for 11
9’s of durability and is
massively scalable
EC2 Instance
Memory
Amazon S3
Amazon EMR
Amazon EMR
Intermediates
stored on local
disk or HDFS
Local
Partitions, compression, and file formats
• Avoid key names in lexicographical order
• Improve throughput and S3 list performance
• Use hashing/random prefixes or reverse the date-time
• Compress data set to minimize bandwidth from S3 to
EC2
• Make sure you use splittable compression or have each file
be the optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased
performance on reads
Use EC2 Spot Instances to save money
• Use the Spot Bid Advisor to help find the optimal bid
price
• Resize your cluster with EMR task groups to add
capacity to YARN without adding HDFS data nodes
• Store data in S3 so cluster can be recreated if Spot
reclaims nodes
Configuring VPC private subnets
• Use Amazon S3 Endpoints for
connectivity to S3
• Use Managed NAT for connectivity to
other services or the Internet
• Control the traffic using Security Groups
• ElasticMapReduce-Master-Private
• ElasticMapReduce-Slave-Private
• ElasticMapReduce-ServiceAccess
Thank you!
Jonathan Fritz - jonfritz@amazon.com
aws.amazon.com/emr
blogs.aws.amazon.com/bigdata

Más contenido relacionado

La actualidad más candente

Getting Started with Managed Database Services on AWS
Getting Started with Managed Database Services on AWSGetting Started with Managed Database Services on AWS
Getting Started with Managed Database Services on AWS
Amazon Web Services
 

La actualidad más candente (20)

Getting Started with Amazon Aurora
 Getting Started with Amazon Aurora Getting Started with Amazon Aurora
Getting Started with Amazon Aurora
 
AWS re:Invent 2016: Event Handling at Scale: Designing an Auditable Ingestion...
AWS re:Invent 2016: Event Handling at Scale: Designing an Auditable Ingestion...AWS re:Invent 2016: Event Handling at Scale: Designing an Auditable Ingestion...
AWS re:Invent 2016: Event Handling at Scale: Designing an Auditable Ingestion...
 
Amazon Redshift
Amazon Redshift Amazon Redshift
Amazon Redshift
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Getting Started with Amazon Aurora
Getting Started with Amazon AuroraGetting Started with Amazon Aurora
Getting Started with Amazon Aurora
 
Log Analytics with Amazon Elasticsearch Service & Kibana
Log Analytics with Amazon Elasticsearch Service & KibanaLog Analytics with Amazon Elasticsearch Service & Kibana
Log Analytics with Amazon Elasticsearch Service & Kibana
 
Deep Dive on Amazon Aurora
Deep Dive on Amazon AuroraDeep Dive on Amazon Aurora
Deep Dive on Amazon Aurora
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Getting Started with Managed Database Services on AWS
Getting Started with Managed Database Services on AWSGetting Started with Managed Database Services on AWS
Getting Started with Managed Database Services on AWS
 
Getting Started with Amazon Aurora
Getting Started with Amazon AuroraGetting Started with Amazon Aurora
Getting Started with Amazon Aurora
 
Getting Started with AWS Security
Getting Started with AWS SecurityGetting Started with AWS Security
Getting Started with AWS Security
 
Deep Dive: Amazon RDS
Deep Dive: Amazon RDSDeep Dive: Amazon RDS
Deep Dive: Amazon RDS
 
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
 
AWS re:Invent 2016: Infrastructure Continuous Delivery Using AWS CloudFormati...
AWS re:Invent 2016: Infrastructure Continuous Delivery Using AWS CloudFormati...AWS re:Invent 2016: Infrastructure Continuous Delivery Using AWS CloudFormati...
AWS re:Invent 2016: Infrastructure Continuous Delivery Using AWS CloudFormati...
 
Getting Started with Amazon Aurora
Getting Started with Amazon AuroraGetting Started with Amazon Aurora
Getting Started with Amazon Aurora
 
AWS re:Invent 2016: Real-Time Data Exploration and Analytics with Amazon Elas...
AWS re:Invent 2016: Real-Time Data Exploration and Analytics with Amazon Elas...AWS re:Invent 2016: Real-Time Data Exploration and Analytics with Amazon Elas...
AWS re:Invent 2016: Real-Time Data Exploration and Analytics with Amazon Elas...
 
Amazon RDS: Deep Dive - SRV310 - Chicago AWS Summit
Amazon RDS: Deep Dive - SRV310 - Chicago AWS SummitAmazon RDS: Deep Dive - SRV310 - Chicago AWS Summit
Amazon RDS: Deep Dive - SRV310 - Chicago AWS Summit
 
Deep Dive on MySQL Databases on AWS - AWS Online Tech Talks
Deep Dive on MySQL Databases on AWS - AWS Online Tech TalksDeep Dive on MySQL Databases on AWS - AWS Online Tech Talks
Deep Dive on MySQL Databases on AWS - AWS Online Tech Talks
 
Selecting the Right AWS Database Solution - AWS 2017 Online Tech Talks
Selecting the Right AWS Database Solution - AWS 2017 Online Tech TalksSelecting the Right AWS Database Solution - AWS 2017 Online Tech Talks
Selecting the Right AWS Database Solution - AWS 2017 Online Tech Talks
 
Building Your First Big Data Application on AWS
Building Your First Big Data Application on AWSBuilding Your First Big Data Application on AWS
Building Your First Big Data Application on AWS
 

Destacado

Run Spark on EMRってどんな仕組みになってるの?
Run Spark on EMRってどんな仕組みになってるの?Run Spark on EMRってどんな仕組みになってるの?
Run Spark on EMRってどんな仕組みになってるの?
Satoshi Noto
 

Destacado (20)

(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
AWS re:Invent 2016: Securing Enterprise Big Data Workloads on AWS (SEC308)
AWS re:Invent 2016: Securing Enterprise Big Data Workloads on AWS (SEC308)AWS re:Invent 2016: Securing Enterprise Big Data Workloads on AWS (SEC308)
AWS re:Invent 2016: Securing Enterprise Big Data Workloads on AWS (SEC308)
 
Domain name system
Domain name systemDomain name system
Domain name system
 
Run Spark on EMRってどんな仕組みになってるの?
Run Spark on EMRってどんな仕組みになってるの?Run Spark on EMRってどんな仕組みになってるの?
Run Spark on EMRってどんな仕組みになってるの?
 
AWS re:Invent 2016: Global Traffic Management with Amazon Route 53 Traffic Fl...
AWS re:Invent 2016: Global Traffic Management with Amazon Route 53 Traffic Fl...AWS re:Invent 2016: Global Traffic Management with Amazon Route 53 Traffic Fl...
AWS re:Invent 2016: Global Traffic Management with Amazon Route 53 Traffic Fl...
 
AWS March 2016 Webinar Series - Managed Database Services on Amazon Web Services
AWS March 2016 Webinar Series - Managed Database Services on Amazon Web ServicesAWS March 2016 Webinar Series - Managed Database Services on Amazon Web Services
AWS March 2016 Webinar Series - Managed Database Services on Amazon Web Services
 
Security Innovations in the Cloud
Security Innovations in the CloudSecurity Innovations in the Cloud
Security Innovations in the Cloud
 
LDAP, SAML and Hue
LDAP, SAML and HueLDAP, SAML and Hue
LDAP, SAML and Hue
 
Introduction to Three AWS Security Services - November 2016 Webinar Series
Introduction to Three AWS Security Services - November 2016 Webinar SeriesIntroduction to Three AWS Security Services - November 2016 Webinar Series
Introduction to Three AWS Security Services - November 2016 Webinar Series
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Real-time Visibility at Scale with Sumo Logic
Real-time Visibility at Scale with Sumo LogicReal-time Visibility at Scale with Sumo Logic
Real-time Visibility at Scale with Sumo Logic
 
AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...
AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...
AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...
 
AWS re:Invent 2016: DNS Demystified: Getting Started with Amazon Route 53, fe...
AWS re:Invent 2016: DNS Demystified: Getting Started with Amazon Route 53, fe...AWS re:Invent 2016: DNS Demystified: Getting Started with Amazon Route 53, fe...
AWS re:Invent 2016: DNS Demystified: Getting Started with Amazon Route 53, fe...
 
Streamline Identity Management & Administration on AWS
Streamline Identity Management & Administration on AWSStreamline Identity Management & Administration on AWS
Streamline Identity Management & Administration on AWS
 
Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...
Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...
Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...
 
Getting Started with Serverless Architectures - August 2016 Monthly Webinar S...
Getting Started with Serverless Architectures - August 2016 Monthly Webinar S...Getting Started with Serverless Architectures - August 2016 Monthly Webinar S...
Getting Started with Serverless Architectures - August 2016 Monthly Webinar S...
 
データ活用をもっともっと円滑に! ~データ処理・分析基盤編を少しだけ~
データ活用をもっともっと円滑に!~データ処理・分析基盤編を少しだけ~データ活用をもっともっと円滑に!~データ処理・分析基盤編を少しだけ~
データ活用をもっともっと円滑に! ~データ処理・分析基盤編を少しだけ~
 
Hiveを高速化するLLAP
Hiveを高速化するLLAPHiveを高速化するLLAP
Hiveを高速化するLLAP
 
Fine-Grained Security for Spark and Hive
Fine-Grained Security for Spark and HiveFine-Grained Security for Spark and Hive
Fine-Grained Security for Spark and Hive
 
Domain name system
Domain name systemDomain name system
Domain name system
 

Similar a Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMR
Amazon Web Services
 
Databases in the Cloud - DevDay Austin 2017 Day 2
Databases in the Cloud - DevDay Austin 2017 Day 2Databases in the Cloud - DevDay Austin 2017 Day 2
Databases in the Cloud - DevDay Austin 2017 Day 2
Amazon Web Services
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Amazon Web Services
 
What's New in AWS Serverless and Containers
What's New in AWS Serverless and ContainersWhat's New in AWS Serverless and Containers
What's New in AWS Serverless and Containers
Amazon Web Services
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
Eva Tse
 

Similar a Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series (20)

Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
Deep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
Deep Dive on AWS Lambda - January 2017 AWS Online Tech TalksDeep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
Deep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMR
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Best of re:Invent
Best of re:InventBest of re:Invent
Best of re:Invent
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
The Best of re:invent 2016
The Best of re:invent 2016The Best of re:invent 2016
The Best of re:invent 2016
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Databases in the Cloud - DevDay Austin 2017 Day 2
Databases in the Cloud - DevDay Austin 2017 Day 2Databases in the Cloud - DevDay Austin 2017 Day 2
Databases in the Cloud - DevDay Austin 2017 Day 2
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
What's New in AWS Serverless and Containers
What's New in AWS Serverless and ContainersWhat's New in AWS Serverless and Containers
What's New in AWS Serverless and Containers
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 

Más de Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

Más de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Último

Último (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jonathan Fritz, Amazon EMR August 23, 2016 Introducing Amazon EMR Release 5.0 Faster, Easier Hadoop, Spark, and Presto
  • 2. Agenda • Quick Introduction to Amazon EMR • What’s New in Amazon EMR release 5.0 • Interactive Query Demo • Use Cases • Best Practices
  • 3. Why Amazon EMR? Easy to Use Launch a cluster in minutes Low Cost Pay an hourly rate Elastic Easily add or remove capacity Reliable Spend less time monitoring Secure Manage firewalls Flexible Customize the cluster
  • 4. Storage S3 (EMRFS), HDFS YARN Cluster Resource Management Batch MapReduce Interactive Tez In Memory Spark Applications Hive, Pig, Spark SQL/Streaming/ML, Mahout, Sqoop HBase/Phoenix Presto Hue (SQL Interface/Metastore Management) Zeppelin (Interactive Notebook) Ganglia (Monitoring) HiveServer2/Spark Thriftserver (JDBC/ODBC) Amazon EMR service
  • 5. Options to submit jobs – Off Cluster Amazon EMR Step API Submit a Spark application Amazon EMR AWS Data Pipeline Airflow, Luigi, or other schedulers on EC2 Create a pipeline to schedule job submission or create complex workflows AWS Lambda Use AWS Lambda to submit applications to EMR Step API or directly to Spark on your cluster
  • 6. Options to submit jobs – On Cluster Web UIs: Hue SQL editor, Zeppelin notebooks, R Studio, and more! Connect with ODBC / JDBC using HiveServer2/Spark Thriftserver Use Spark Actions in your Apache Oozie workflow to create DAGs of jobs. (start using start-thriftserver.sh) Or, use the native APIs and CLIs for each application
  • 7. Many storage layers to choose from Amazon DynamoDB Amazon RDS Amazon Kinesis Amazon Redshift EMR File System (EMRFS) Amazon S3 Amazon EMR
  • 8. EMR 5.0 - Applications
  • 9.
  • 10. Quick introduction to Spark join filter groupBy Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: = cached partition= RDD map • Massively parallel • Uses DAGs instead of map- reduce for execution • Minimizes I/O by storing data in DataFrames in memory • Partitioning-aware to avoid network-intensive shuffle
  • 11. Spark components to match your use case
  • 12. Spark 2.0 – Performance Enhancements • Second generation Tungsten engine • Whole-stage code generation to create optimized bytecode at runtime • Improvements to Catalyst optimizer for query performance • New vectorized Parquet decoder to increase throughput
  • 13. Datasets and DataFrames (Spark 2.0) • Datasets • Distributed collection of data • Strong typing, ability to use Lambda functions • Object-oriented operations (similar to RDD API) • Optimized encoders which increase performance and minimize serialization/deserialization overhead • Compile-time type safety for more robust applications • DataFrames • Dataset organized into named columns • Represented as a Dataset of rows
  • 14. Spark SQL (Spark 2.0) • SparkSession – replaces the old SQLContext and HiveContext • Seamlessly mix SQL with Spark programs • ANSI SQL Parser and subquery support • HiveQL compatibility and can directly use tables in Hive metastore • Connect through JDBC / ODBC using the Spark Thrift server
  • 15. Spark 2.0 – ML Updates • Additional distributed algorithms in SparkR, including K- Means, Generalized Linear Models, and Naive Bayes • ML pipeline persistence is now supported across all languages
  • 16. Spark 2.0 – Structured Streaming • Structured Streaming API is an extension to the DataFrame/Dataset API (instead of DStream) • SparkSession is the new entry point for streaming • Better merges processing on static and streaming datasets, abstracting the velocity of the data
  • 17. Configuring Executors – Dynamic Allocation • Optimal resource utilization • YARN dynamically creates and shuts down executors based on the resource needs of the Spark application • Spark uses the executor memory and executor cores settings in the configuration for each executor • Amazon EMR uses dynamic allocation by default, and calculates the default executor size to use based on the instance family of your Core Group
  • 18. Apache Hive 2.1.0 with Apache Tez 0.8.4
  • 19. Tez is the default engine for Hive and Pig
  • 20. Hive 2.1 – New Features • Improvements to Hive’s cost-based optimizer • LLAP (beta) for faster processing (coming soon) • Predicate pushdown for Parquet file format • HPL/SQL for procedural SQL • Similar to Oracle’s PL/SQL and Teradata’s stored procedures • Hive-On-Spark improvements • Apache HBase as Hive Metastore (alpha) • CLI mode in Beeline (Hive CLI deprecation)
  • 21. Use RDS for an external Hive metastore Amazon Aurora Hive Metastore with schema for tables in S3 Amazon S3Set metastore location in hive-site
  • 23. In-memory distributed query engine Support standard ANSI-SQL Support rich analytical functions Support wide range of data sources Combine data from multiple sources in single query Response time ranges from seconds to minutes
  • 24. High Performance • E.g. Netflix: runs 3500+ Presto queries / day on 25+ PB dataset in S3 with 350 active platform users Extensibility • Pluggable backends: Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, MySQL, and more • JDBC, ODBC for commercial BI tools or dashboards • Client Protocol: HTTP+JSON, support various languages (Python, Ruby, PHP, Node.js, Java(JDBC), C#,…) ANSI SQL • complex queries, joins, aggregations, various functions (Window functions)
  • 26. High Level Architecture Components: a coordinator and multiple workers. Queries are submitted by a client such as the Presto CLI to the coordinator. The coordinator parses, analyzes and plans the query execution, then distributes the processing to the workers.
  • 28. Hue 3.10 – New Features • Fully redesigned SQL editor with table assist, better autocomplete of values and nested types, more charts, and search and replace functionality • New notebook UI with graphical widgets • Dry-run Oozie jobs to test options before execution • Email action on Oozie job failure • Improved security features including TLS certificate chain support, passwords in file scripts, and inactive user timeouts
  • 30. Zeppelin 0.6.1 – New Features • Shiro Authentication • Notebook Authorization Save your notebook in S3 by setting zeppelin-env: export ZEPPELIN_NOTEBOOK_S3_BUCKET = bucket_name export ZEPPELIN_NOTEBOOK_S3_USER = username (optional) export ZEPPELIN_NOTEBOOK_S3_KMS_KEY_ID = kms-key- id
  • 31. Build your own Apache Bigtop Application
  • 32. Creating an Apache Bigtop application https://blogs.aws.amazon.com/bigdata/post/TxNJ6YS4X6S59U/Building-and-Deploying-Custom-Applications-
  • 33. Demo
  • 35. Twitter (Answers) uses EMR as the batch layer in their Lambda architecture
  • 36.
  • 37. FINRA saves money with comparable performance with Hive on Tez with S3
  • 39. Decouple compute and storage by using S3 as your data layer HDFS S3 is designed for 11 9’s of durability and is massively scalable EC2 Instance Memory Amazon S3 Amazon EMR Amazon EMR Intermediates stored on local disk or HDFS Local
  • 40. Partitions, compression, and file formats • Avoid key names in lexicographical order • Improve throughput and S3 list performance • Use hashing/random prefixes or reverse the date-time • Compress data set to minimize bandwidth from S3 to EC2 • Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster • Columnar file formats like Parquet can give increased performance on reads
  • 41. Use EC2 Spot Instances to save money • Use the Spot Bid Advisor to help find the optimal bid price • Resize your cluster with EMR task groups to add capacity to YARN without adding HDFS data nodes • Store data in S3 so cluster can be recreated if Spot reclaims nodes
  • 42. Configuring VPC private subnets • Use Amazon S3 Endpoints for connectivity to S3 • Use Managed NAT for connectivity to other services or the Internet • Control the traffic using Security Groups • ElasticMapReduce-Master-Private • ElasticMapReduce-Slave-Private • ElasticMapReduce-ServiceAccess
  • 43. Thank you! Jonathan Fritz - jonfritz@amazon.com aws.amazon.com/emr blogs.aws.amazon.com/bigdata