1© Cloudera, Inc. All rights reserved.
Yarns about YARN:
Migrating to MapReduce v2
Kathleen Ting, kate@cloudera.com
Miklos Christine, mwc@cloudera.com
Hadoop Summit San Jose, 10 June 2015
$ whoami
Kathleen Ting
• Joined Cloudera in 2011
• Former customer operations engineer
• Technical account manager
• Apache Sqoop Cookbook co-author
Miklos Christine
• Joined Cloudera in 2013
• Former customer operations engineer
• Systems engineer
• Apache Spark expert
Cloudera and Apache Hadoop
• Apache Hadoop is an open source software framework for distributed
storage and distributed processing of Big Data on clusters of commodity
hardware.
• Cloudera is revolutionizing enterprise data management by offering the
first unified Platform for Big Data, an enterprise data hub built on
Apache Hadoop.
• Distributes CDH, a Hadoop distribution.
• Teaches, consults, and supports customers building applications on the Hadoop
stack.
• The world-wide Cloudera Customer Operations Engineering team has closed tens of
thousands of support incidents over six years.
Outline
• YARN motivation
• Upgrading MR1 to MR2
• YARN upgrade pitfalls
• YARN applications
YARN motivation
Yet Another Resource Negotiator
One platform, many workloads:
batch, interactive, real-time
An Apache YARN timeline
• January 2008: Yahoo started work on YARN
• June 2012: CDH 4.0.0 included YARN
• August 2012: YARN promoted to Apache Hadoop sub-project
• October 2013: Hadoop 2.0 GA
• April 2014: YARN/MR2 is default in CDH 5
MapReduce v1 (diagram): a single JobTracker schedules, manages, and stores jobs; each TaskTracker runs map tasks and reduce tasks.
MapReduce v2 (diagram): the ResourceManager schedules jobs, the App Master manages jobs, and the Job History Server stores jobs; NodeManagers host the containers that run the App Master, map tasks, and reduce tasks.
YARN motivation: MR1 vs. MR2
• Scalability
MR1: JobTracker tracks all jobs and tasks; maxes out at 4k nodes, 40k tasks
MR2: Tracking is split between ResourceManager and ApplicationMaster; scales up to 10k nodes, 100k tasks
• Availability
MR1: JT HA
MR2: RM HA, plus HA on a per-application basis
• Utilization
MR1: Fixed-size slots for map and reduce
MR2: Allocates only as many resources as needed; allows cluster utilization > 70%
• Multi-tenancy
MR1: N/A
MR2: Cluster resource management system; data locality and lowered operational costs from sharing resources between frameworks
Upgrading MR1 to MR2
MR1 to MR2 functionality mapping
• Completely revamped architecture in MR2 on YARN
• Most configurations translate, but some don’t
MR2 on YARN
Container memory must exceed the heap to account for overhead:
• Memory per container: mapreduce.[map|reduce].memory.mb (1.5GB)
• Map/Reduce task maximum heap size: mapreduce.[map|reduce].java.opts.max.heap (1GB)
• CPU per container: mapreduce.[map|reduce].cpu.vcores
Applications on YARN
Memory/CPU thresholds:
• Container memory minimum: yarn.scheduler.minimum-allocation-mb
• Container memory maximum: yarn.scheduler.maximum-allocation-mb
• Container virtual CPU cores minimum: yarn.scheduler.minimum-allocation-vcores
• Container virtual CPU cores maximum: yarn.scheduler.maximum-allocation-vcores
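The relationship between heap, container memory, and the scheduler thresholds can be sketched as a small sanity check. This is only an illustration: the function name is hypothetical, and the 1024 MB and 8192 MB thresholds in the example call are assumed values, not figures from the slides.

```python
def check_container_request(memory_mb, heap_mb, min_alloc_mb, max_alloc_mb):
    """Return a list of problems with one map or reduce container request.

    memory_mb    -> mapreduce.[map|reduce].memory.mb
    heap_mb      -> task maximum heap size
    min_alloc_mb -> yarn.scheduler.minimum-allocation-mb
    max_alloc_mb -> yarn.scheduler.maximum-allocation-mb
    """
    problems = []
    # The container must be larger than the heap to leave room for overhead.
    if heap_mb >= memory_mb:
        problems.append("heap must be smaller than container memory")
    if memory_mb > max_alloc_mb:
        problems.append("container memory exceeds the scheduler maximum")
    if memory_mb < min_alloc_mb:
        problems.append("container memory is below the scheduler minimum")
    return problems


# The slide's example: a 1.5GB container with a 1GB heap.
# The thresholds (1024 and 8192) are assumptions for illustration.
print(check_container_request(1536, 1024, 1024, 8192))
```

An empty list means the request is consistent with the thresholds supplied; the check does not model how a particular scheduler rounds requests.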
Migration path and binary support
• MR1 (CDH4) to MR2 (CDH5): ✔
• MR1 (CDH4) to MR1 (CDH5): ✔
• MR2 (CDH4) to MR1/MR2 (CDH5): ✖
YARN compatibility
CDH has complete binary/source compatibility for almost all programs. Virtually every job compiled against MR1 in CDH 4 will be able to run without any modifications on an MR2 cluster.
References
• Migrating to MR2 on YARN
• For Operators: http://blog.cloudera.com/blog/2013/11/migrating-to-mapreduce-2-on-yarn-for-operators/
• For Users: http://blog.cloudera.com/blog/2013/11/migrating-to-mapreduce-2-on-yarn-for-users/
• Getting MR2 Up to Speed
• http://blog.cloudera.com/blog/2014/02/getting-mapreduce-2-up-to-speed/
• Avoiding YARN Gotchas
• http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
YARN upgrade pitfalls
MapReduce v2 (diagram, repeated): ResourceManager, App Master, and Job History Server as before, with NodeManagers hosting the containers; logs are stored in HDFS.
General log-related configuration properties
• yarn.nodemanager.log-dirs: Determines where the container logs are stored on the node while the containers are running. Default is ${yarn.log.dir}/userlogs. For MapReduce applications, each container directory will contain the files stderr, stdout, and syslog generated by that container.
• yarn.log-aggregation-enable: Whether to enable log aggregation. If disabled, NMs keep the logs locally and do not aggregate them.
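As a rough illustration of where a running container's logs end up on a node, the sketch below builds the expected local paths. It assumes the usual layout of one directory per application and per container under yarn.nodemanager.log-dirs, and that a MapReduce container writes stderr, stdout, and syslog; the directory and IDs in the example are hypothetical.

```python
import posixpath


def container_log_paths(log_dir, app_id, container_id):
    """Return the expected local log file paths for one container.

    log_dir is one entry of yarn.nodemanager.log-dirs. The layout
    <log_dir>/<application id>/<container id>/<file> is an assumption
    stated in the text above.
    """
    base = posixpath.join(log_dir, app_id, container_id)
    return [posixpath.join(base, name) for name in ("stderr", "stdout", "syslog")]


# Hypothetical directory and IDs, for illustration only.
for path in container_log_paths(
    "/var/log/hadoop-yarn/userlogs",
    "application_1400000000000_0001",
    "container_1400000000000_0001_01_000002",
):
    print(path)
```

These local files are the ones that log aggregation copies off the node when yarn.log-aggregation-enable is set to true.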
YARN applications
Llama, Slider, Spark
YARN applications
• Llama (Low Latency Application MAster)
• Reserves memory in YARN for short-lived processes (e.g. Impala)
• Registers one long-lived AM per YARN pool
• Caches resources allocated by YARN for a short time, so that they can be
quickly re-allocated to Impala queries
• The long-term solution is to run Impala on YARN; for now, the recommendation is to set up admission control
YARN applications
• Apache Slider (incubating) née Hoya
• Runs long-lived persistent services on YARN (e.g. HBase)
• Not currently recommended as it doesn’t provide IO isolation
Spark on YARN
Spark Overview
• An application corresponds to an instance of the SparkContext class
• Executors are long-lived processes
• Applications take up resources until the app completes
Why Spark on YARN?
• Built-in scheduler for resource management (isolation, prioritization)
• Sharing resources within a cluster (MapReduce, Spark)
• YARN is the only cluster manager for Spark that supports security (Kerberized Hadoop)
Configuring YARN for Spark
• Designed for interactive queries and iterative algorithms
• In-memory caching, DAG engine, and APIs
• Set yarn.scheduler.maximum-allocation-mb as high as 64GB on a machine with 192GB of memory
• Won’t run with small (< 1 GB) containers due to overhead
Reference:
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Deploying Spark Jobs
Spark – YARN-CLIENT
Spark – YARN-CLUSTER
Dynamic Allocation
- Allows a SparkContext to grow and shrink during its lifetime
- Meant for multiple applications sharing resources in a shared cluster
- Idle resources get returned to the pool of resources for other apps
Knobs to control dynamic allocation:
spark.dynamicAllocation.initialExecutors
spark.dynamicAllocation.minExecutors
spark.dynamicAllocation.maxExecutors
spark.dynamicAllocation.executorIdleTimeout (how quickly executors get removed)
spark.dynamicAllocation.schedulerBacklogTimeout
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (how quickly executors get added)
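A minimal sketch of how these knobs might be passed to spark-submit as --conf options. The helper name and all numeric values are hypothetical; the sketch also sets spark.dynamicAllocation.enabled and spark.shuffle.service.enabled, which are not listed on the slide but are assumed here to be needed for dynamic allocation on YARN. Timeouts are given in seconds.

```python
def dynamic_allocation_args(min_executors, initial_executors, max_executors,
                            idle_timeout_s, backlog_timeout_s,
                            sustained_backlog_timeout_s):
    """Build spark-submit --conf arguments for dynamic allocation."""
    if not min_executors <= initial_executors <= max_executors:
        raise ValueError("expected min <= initial <= max executors")
    conf = {
        # Assumed prerequisites, not listed on the slide.
        "spark.dynamicAllocation.enabled": "true",
        "spark.shuffle.service.enabled": "true",
        # Knobs from the slide.
        "spark.dynamicAllocation.initialExecutors": initial_executors,
        "spark.dynamicAllocation.minExecutors": min_executors,
        "spark.dynamicAllocation.maxExecutors": max_executors,
        "spark.dynamicAllocation.executorIdleTimeout": idle_timeout_s,
        "spark.dynamicAllocation.schedulerBacklogTimeout": backlog_timeout_s,
        "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout":
            sustained_backlog_timeout_s,
    }
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


# Hypothetical values, chosen only for illustration.
args = dynamic_allocation_args(1, 2, 20, 60, 5, 5)
print(" ".join(["spark-submit", "--master", "yarn-client"] + args + ["pi.py", "1000"]))
```

A shorter idle timeout returns executors to the pool sooner; shorter backlog timeouts request new executors sooner when tasks are queued.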
Spark Scheduling
• Fair scheduler for resource sharing
• spark.yarn.queue
• Standalone cluster mode currently only supports a simple FIFO scheduler across applications
Spark Not Running On YARN?
• Symptom:
• spark-submit runs a Python job, but resources are only being used on one machine.
• Cause: options placed after the application file are passed to the application, not to spark-submit
•$ spark-submit pi.py --master yarn-client
• Fix: put the options before the application file
•$ spark-submit --master yarn-client pi.py 1000
• Usage:
spark-submit [options] <app jar | python file> [app options]
• Many improvements to spark-submit were made in Spark 1.2 (SPARK-1652)
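The effect of option position can be illustrated with a simplified splitter that follows the usage line `spark-submit [options] <app jar | python file> [app options]`. This is not spark-submit's real parser: the function is hypothetical and assumes every option takes exactly one value, which is not true of all spark-submit flags.

```python
def split_submit_args(argv):
    """Split arguments into (options, application file, application arguments).

    Simplifying assumption: every leading '--name' option takes one value.
    Everything after the application file belongs to the application.
    """
    options = {}
    i = 0
    while i < len(argv) and argv[i].startswith("--"):
        if i + 1 >= len(argv):
            raise ValueError(f"option {argv[i]} is missing a value")
        options[argv[i]] = argv[i + 1]
        i += 2
    app = argv[i] if i < len(argv) else None
    return options, app, argv[i + 1:]


# Options before the application file reach spark-submit.
print(split_submit_args(["--master", "yarn-client", "pi.py", "1000"]))
# Options after the application file become application arguments.
print(split_submit_args(["pi.py", "--master", "yarn-client"]))
```

In the second call no master is captured as an option, which matches the symptom of the job using resources on only one machine.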
PySpark on YARN Limitation
• Symptom:
$ spark-submit --master yarn-cluster pi.py 1000
Error: Cluster deploy mode is currently not supported for python applications.
Run with --help for usage help or --verbose for debug output
• Workaround:
$ spark-submit --master yarn-client pi.py 1000
…
Pi is roughly 3.132290
15/02/11 09:41:34 INFO SparkUI: Stopped Spark web UI at http://sparktest-1.ent.cloudera.com:4040
15/02/11 09:41:34 INFO DAGScheduler: Stopping DAGScheduler
• Future work: SPARK-5162 / SPARK-5173 (Fixed in Spark 1.3)
Lost Spark Executors
• Symptom:
• Spark driver WARN messages
14/12/08 17:11:08 WARN scheduler.TaskSetManager: Lost task 205.0 in stage 2.0 (TID 352, test-1.cloudera.com): ExecutorLostFailure (executor lost)
• NodeManager logs
2014-12-08 17:10:32,860 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=26842,containerID=container_1418059756626_0010_01_000093_01] is running beyond physical memory limits. Current usage: 26.2 GB of 26 GB physical memory used; 27.1 GB of 54.6 GB virtual memory used. Killing container.
• Workaround:
• Increase spark.yarn.[executor|driver].memoryOverhead
• Test for your specific use case, from 1GB to 4GB
• The Spark default is 384MB
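The arithmetic behind this workaround can be sketched as follows, assuming the container Spark requests from YARN is the executor memory plus the memory overhead. The function name and the 4096 MB executor size are hypothetical; 384 MB is the default quoted on the slide.

```python
def executor_container_mb(executor_memory_mb, memory_overhead_mb=384):
    """Memory requested from YARN for one executor container, in MB.

    Assumes request = executor memory + memory overhead; the scheduler
    may round the result up further.
    """
    return executor_memory_mb + memory_overhead_mb


# Hypothetical 4GB executor with the default overhead, then with 2GB overhead.
print(executor_container_mb(4096))
print(executor_container_mb(4096, 2048))
```

Raising the overhead leaves the executor heap unchanged but gives off-heap usage more room before the NodeManager kills the container for exceeding its physical memory limit.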
Spark Shuffle
• A lot of changes to improve shuffle logic
• Add sort-based shuffle: SPARK-2045
• Add MR-style merge sort: SPARK-2926
• Improve SparkSQL shuffle: SPARK-3056
• Bugs are still being uncovered
• Multiple concurrent attempts for fetch failures: SPARK-7308
Spark Improvements
• PySpark in YARN cluster mode
• Spark action executor
• Spark Streaming WAL on HDFS, preventing any data loss on driver failure
• Spark external shuffle service
• Improvements in the collection of task metrics
• Kafka connector for Spark Streaming to avoid the need for the HDFS WAL
Conclusion
YARN performance
• Improved cluster utilization
• Can run more jobs in smaller clusters
• Run in uber mode for smaller jobs (reduces AM overhead)
• Dynamic resource sharing between frameworks
• One framework can use the entire cluster
• Tom White’s Hadoop: The Definitive Guide, 4th Ed.
• Chapter 4 is on YARN
Questions?
@kate_ting
@miklos_c
Spark Tuning Parameters
• spark.shuffle.consolidateFiles=true
• spark.yarn.executor.memoryOverhead
• spark.yarn.driver.memoryOverhead
• spark.shuffle.manager=SORT
• spark.rdd.compress=true
• spark.serializer=org.apache.spark.serializer.KryoSerializer
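One way to apply these parameters is through spark-defaults.conf, where each line holds a property name and its value separated by whitespace. The sketch below renders such a file. The helper name is hypothetical, and the 1024 MB overhead values are assumptions for illustration, since the slide gives no values for the two memoryOverhead properties.

```python
TUNING = {
    "spark.shuffle.consolidateFiles": "true",
    "spark.yarn.executor.memoryOverhead": "1024",  # assumed value, in MB
    "spark.yarn.driver.memoryOverhead": "1024",    # assumed value, in MB
    "spark.shuffle.manager": "SORT",
    "spark.rdd.compress": "true",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def render_spark_defaults(props):
    """Render properties as 'key  value' lines, aligned for readability."""
    width = max(len(key) for key in props)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in props.items())


print(render_spark_defaults(TUNING))
```

The same properties can instead be passed per job as --conf key=value options to spark-submit.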
YARN vs Mesos: Resource Manager’s role
• YARN: asks for resources; evolved into a resource manager; written in Java; locality aware
• Mesos: offers resources; evolved into managing Hadoop; written in C++; more customizable

Más contenido relacionado

La actualidad más candente

An Introduction to Apache Hadoop Yarn
An Introduction to Apache Hadoop YarnAn Introduction to Apache Hadoop Yarn
An Introduction to Apache Hadoop YarnMike Frampton
 
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARNAdam Kawa
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceUwe Printz
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark Summit
 
Apache Hadoop YARN 2015: Present and Future
Apache Hadoop YARN 2015: Present and FutureApache Hadoop YARN 2015: Present and Future
Apache Hadoop YARN 2015: Present and FutureDataWorks Summit
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceUwe Printz
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Hortonworks
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesDataWorks Summit
 
Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureDataWorks Summit
 
Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureDataWorks Summit
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARNDataWorks Summit
 
NextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduceNextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduceHortonworks
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopVigen Sahakyan
 

La actualidad más candente (20)

Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
An Introduction to Apache Hadoop Yarn
An Introduction to Apache Hadoop YarnAn Introduction to Apache Hadoop Yarn
An Introduction to Apache Hadoop Yarn
 
Yarn
YarnYarn
Yarn
 
Hadoop YARN
Hadoop YARN Hadoop YARN
Hadoop YARN
 
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARN
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduce
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
 
Apache Hadoop YARN 2015: Present and Future
Apache Hadoop YARN 2015: Present and FutureApache Hadoop YARN 2015: Present and Future
Apache Hadoop YARN 2015: Present and Future
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduce
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
 
Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and Future
 
Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and Future
 
Empower Hive with Spark
Empower Hive with SparkEmpower Hive with Spark
Empower Hive with Spark
 
Hive Now Sparks
Hive Now SparksHive Now Sparks
Hive Now Sparks
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
NextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduceNextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduce
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
December 2013 HUG: Spark at Yahoo!
December 2013 HUG: Spark at Yahoo!December 2013 HUG: Spark at Yahoo!
December 2013 HUG: Spark at Yahoo!
 

Similar a Yarns about YARN: Migrating to MapReduce v2

Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionCloudera, Inc.
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkJeremy Beard
 
Cloudera DataTalks 2019 Bangalore - YuniKorn A next generation scheduler for ...
Cloudera DataTalks 2019 Bangalore - YuniKorn A next generation scheduler for ...Cloudera DataTalks 2019 Bangalore - YuniKorn A next generation scheduler for ...
Cloudera DataTalks 2019 Bangalore - YuniKorn A next generation scheduler for ...Sunil Govindan
 
Getting Apache Spark Customers to Production
Getting Apache Spark Customers to ProductionGetting Apache Spark Customers to Production
Getting Apache Spark Customers to ProductionCloudera, Inc.
 
Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...
Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...
Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...DataWorks Summit
 
Running Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache HadoopRunning Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache Hadoophitesh1892
 
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele Hakka Labs
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARNHadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARNDataWorks Summit
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudDataWorks Summit
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduJeremy Beard
 
Apache Spark Operations
Apache Spark OperationsApache Spark Operations
Apache Spark OperationsCloudera, Inc.
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applicationsJoey Echeverria
 
YARN: a resource manager for analytic platform
YARN: a resource manager for analytic platformYARN: a resource manager for analytic platform
YARN: a resource manager for analytic platformTsuyoshi OZAWA
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaDataWorks Summit
 

Similar a Yarns about YARN: Migrating to MapReduce v2 (20)

Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
 
YARN
YARNYARN
YARN
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache Spark
 
Cloudera DataTalks 2019 Bangalore - YuniKorn A next generation scheduler for ...
Cloudera DataTalks 2019 Bangalore - YuniKorn A next generation scheduler for ...Cloudera DataTalks 2019 Bangalore - YuniKorn A next generation scheduler for ...
Cloudera DataTalks 2019 Bangalore - YuniKorn A next generation scheduler for ...
 
Getting Apache Spark Customers to Production
Getting Apache Spark Customers to ProductionGetting Apache Spark Customers to Production
Getting Apache Spark Customers to Production
 
Chicago spark meetup-april2017-public
Chicago spark meetup-april2017-publicChicago spark meetup-april2017-public
Chicago spark meetup-april2017-public
 
Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...
Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...
Next Generation Scheduling for YARN and K8s: For Hybrid Cloud/On-prem Environ...
 
Effective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant ClustersEffective Spark on Multi-Tenant Clusters
Effective Spark on Multi-Tenant Clusters
 
Running Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache HadoopRunning Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache Hadoop
 
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
Developing Applications with Hadoop 2.0 and YARN by Abhijit Lele
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARNHadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARN
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And CloudYARN Containerized Services: Fading The Lines Between On-Prem And Cloud
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
Apache Spark Operations
Apache Spark OperationsApache Spark Operations
Apache Spark Operations
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applications
 
YARN: a resource manager for analytic platform
YARN: a resource manager for analytic platformYARN: a resource manager for analytic platform
YARN: a resource manager for analytic platform
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
 

Más de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Yarns about YARN: Migrating to MapReduce v2

  • 1. 1© Cloudera, Inc. All rights reserved. Yarns about YARN: Migrating to MapReduce v2 Kathleen Ting, kate@cloudera.com Miklos Christine, mwc@cloudera.com Hadoop Summit San Jose, 10 June 2015
  • 2. $ whoami
      • Kathleen Ting: joined Cloudera in 2011; former customer operations engineer; technical account manager; Apache Sqoop Cookbook co-author
      • Miklos Christine: joined Cloudera in 2013; former customer operations engineer; systems engineer; Apache Spark expert
  • 3. Cloudera and Apache Hadoop
      • Apache Hadoop is an open source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware.
      • Cloudera is revolutionizing enterprise data management by offering the first unified Platform for Big Data, an enterprise data hub built on Apache Hadoop.
      • Distributes CDH, a Hadoop distribution.
      • Teaches, consults, and supports customers building applications on the Hadoop stack.
      • The world-wide Cloudera Customer Operations Engineering team has closed tens of thousands of support incidents over six years.
  • 4. Outline • YARN motivation • Upgrading MR1 to MR2 • YARN upgrade pitfalls • YARN applications
  • 5. YARN motivation: Yet Another Resource Negotiator
  • 6. One platform, many workloads: batch, interactive, real-time
  • 8. An Apache YARN timeline
      • January 2008: Yahoo started work on YARN
      • June 2012: CDH 4.0.0 included YARN
      • August 2012: YARN promoted to Apache Hadoop sub-project
      • October 2013: Hadoop 2.0 GA
      • April 2014: YARN/MR2 is default in CDH 5
  • 9. MapReduce v1 (diagram): the JobTracker schedules, manages, and stores jobs; each TaskTracker runs map tasks and reduce tasks.
  • 10. MapReduce v2 (diagram): the Resource Manager schedules jobs, the App Master manages jobs, and the Job History Server stores jobs; NodeManagers host the containers that run the App Master, map tasks, and reduce tasks.
  • 11. YARN motivation: MR1 vs. MR2
      • Scalability. MR1: the JobTracker tracks all jobs and tasks; maxes out at 4k nodes, 40k tasks. MR2: tracking is split between the ResourceManager and the ApplicationMaster; scales up to 10k nodes, 100k tasks.
      • Availability. MR1: JT HA. MR2: RM HA, and HA on a per-application basis.
      • Utilization. MR1: fixed-size slots for map and reduce. MR2: allocates only as many resources as needed, which allows cluster utilization > 70%.
      • Multi-tenancy. MR1: N/A. MR2: cluster resource management system; data locality and lowered operational costs from sharing resources between frameworks.
  • 12. Upgrading MR1 to MR2
  • 13. MR1 to MR2 functionality mapping
      • Completely revamped architecture in MR2 on YARN
      • Most configurations translate, but some don’t
  • 14. MR2 on YARN; Applications on YARN
      • Memory > heap, to account for overhead:
          • Memory per Container: mapreduce.[map|reduce].memory.mb (1.5GB)
          • Map/Reduce Task Maximum Heap Size: mapreduce.[map|reduce].java.opts.max.heap (1GB)
      • CPU per Container: mapreduce.[map|reduce].cpu.vcores
      • Mem/CPU thresholds:
          • Container Memory Minimum: yarn.scheduler.minimum-allocation-mb
          • Container Memory Maximum: yarn.scheduler.maximum-allocation-mb
          • Container Virtual CPU Cores Minimum: yarn.scheduler.minimum-allocation-vcores
          • Container Virtual CPU Cores Maximum: yarn.scheduler.maximum-allocation-vcores
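The relationship between the settings on slide 14 can be sketched as a small sanity check. This is a minimal sketch, not part of the original deck: the function name `check_task_settings` and the threshold values in the example call are hypothetical, and the messages simplify what the scheduler actually does with out-of-range requests.

```python
def check_task_settings(container_mb, heap_mb, min_alloc_mb, max_alloc_mb):
    """Return a list of problems with a map or reduce task's memory settings.

    container_mb  -> mapreduce.[map|reduce].memory.mb
    heap_mb       -> the task's maximum JVM heap
    min/max_alloc -> yarn.scheduler.{minimum,maximum}-allocation-mb
    """
    problems = []
    # The container must be larger than the heap so that non-heap JVM
    # memory (stacks, native buffers, and so on) also fits inside it.
    if heap_mb >= container_mb:
        problems.append("heap must be smaller than the container")
    if container_mb < min_alloc_mb:
        problems.append("container is below the scheduler minimum")
    if container_mb > max_alloc_mb:
        problems.append("container exceeds the scheduler maximum")
    return problems


# Slide values: 1.5 GB container with a 1 GB heap.
# The 1 GB / 8 GB thresholds are assumptions chosen for illustration.
print(check_task_settings(1536, 1024, min_alloc_mb=1024, max_alloc_mb=8192))
```

An empty list means the heap leaves headroom inside the container and the container size sits within the scheduler thresholds.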
  • 15. Migration path
      • Binary support:
          • MR1 (CDH4) to MR2 (CDH5): ✔
          • MR1 (CDH4) to MR1 (CDH5): ✔
          • MR2 (CDH4) to MR1/MR2 (CDH5): ✖
      • YARN compatibility: CDH has complete binary/source compatibility for almost all programs. Virtually every job compiled against MR1 in CDH 4 will be able to run without any modifications on an MR2 cluster.
      • Migrating to MR2 on YARN
          • For Operators: http://blog.cloudera.com/blog/2013/11/migrating-to-mapreduce-2-on-yarn-for-operators/
          • For Users: http://blog.cloudera.com/blog/2013/11/migrating-to-mapreduce-2-on-yarn-for-users/
      • Getting MR2 Up to Speed
          • http://blog.cloudera.com/blog/2014/02/getting-mapreduce-2-up-to-speed/
      • Avoiding YARN Gotchas
          • http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
  • 16. YARN upgrade pitfalls
  • 17. MapReduce v2 (diagram): the Resource Manager schedules jobs, the App Master manages jobs, and the Job History Server stores jobs; NodeManagers host the containers that run the App Master, map tasks, and reduce tasks.
  • 18. MapReduce v2 (diagram, with logs): the Resource Manager schedules jobs, the App Master manages jobs, and the Job History Server stores jobs; NodeManagers host the containers that run the App Master, map tasks, and reduce tasks; logs go to HDFS.
  • 19. General log related configuration properties
      • yarn.nodemanager.log-dirs: determines where the container logs are stored on the node while the containers are running. Default is ${yarn.log.dir}/userlogs. For MapReduce applications, each container directory will contain the files stderr, stdout, and syslog generated by that container.
      • yarn.log-aggregation-enable: whether to enable log aggregation. If disabled, NMs keep the logs locally and do not aggregate them.
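As a rough illustration of what yarn.log-aggregation-enable changes for an operator, the sketch below picks where to look for an application's logs. It assumes the standard `yarn logs -applicationId` command for aggregated logs; the helper name `log_lookup`, the returned strings, and the example application ID (derived from the container ID shown on slide 36) are illustrative assumptions, not part of the deck.

```python
def log_lookup(application_id, aggregation_enabled):
    """Describe where to find the logs of a finished YARN application."""
    if aggregation_enabled:
        # With aggregation on, the NodeManagers upload container logs to
        # HDFS, and the yarn CLI can fetch them by application ID.
        return ["yarn", "logs", "-applicationId", application_id]
    # With aggregation off, logs stay on each node under the directories
    # listed in yarn.nodemanager.log-dirs.
    return ["look under yarn.nodemanager.log-dirs on each NodeManager"]


# Hypothetical application ID, used only as an example.
print(" ".join(log_lookup("application_1418059756626_0010", True)))
```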
  • 20. YARN applications: Llama, Slider, Spark
  • 21. YARN applications
      • Llama (Low Latency Application MAster)
          • Reserves memory in YARN for short-lived processes (e.g. Impala)
          • Registers one long-lived AM per YARN pool
          • Caches resources allocated by YARN for a short time, so that they can be quickly re-allocated to Impala queries
          • The long-term solution is to run Impala on YARN; for now, setting up admission control is recommended
  • 22. YARN applications
      • Apache Slider (incubating), née Hoya
          • Runs long-lived persistent services on YARN (e.g. HBase)
          • Not currently recommended as it doesn’t provide IO isolation
  • 23. Spark on YARN
  • 24. Spark Overview
      • An application corresponds to an instance of the SparkContext class
      • Executors are long-lived processes
      • Applications hold their resources until the app completes
  • 25. Why Spark on YARN?
      • Built-in scheduler for resource management (isolation, prioritization)
      • Sharing resources within a cluster (MapReduce, Spark)
      • YARN is the only cluster manager for Spark that supports security (Kerberized Hadoop)
  • 26. Configuring YARN for Spark
      • Designed for interactive queries and iterative algorithms
      • In-memory caching, DAG engine, and APIs
      • Set yarn.scheduler.maximum-allocation-mb as high as 64GB on a machine with 192GB of memory
      • Won’t run with small (< 1 GB) containers due to overhead
      • Reference: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
  • 27. Deploying Spark Jobs
  • 28. Spark – YARN-CLIENT
  • 29. Spark – YARN-CLUSTER
  • 30. Dynamic Allocation
      • Allows SparkContext to grow and shrink during its lifetime
      • Meant for multiple applications sharing resources in a shared cluster
      • Idle resources are returned to the pool of resources for other apps
      • Knobs to control dynamic allocation:
          • spark.dynamicAllocation.initialExecutors
          • spark.dynamicAllocation.minExecutors
          • spark.dynamicAllocation.maxExecutors
          • spark.dynamicAllocation.executorIdleTimeout (how quickly executors get removed)
          • spark.dynamicAllocation.schedulerBacklogTimeout and spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (how quickly executors get added)
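One way to pass the dynamic allocation knobs from slide 30 is as `--conf key=value` pairs on the spark-submit command line. The sketch below only assembles those arguments. The helper name `dynamic_allocation_args` is hypothetical, and the two enabling properties (spark.dynamicAllocation.enabled and spark.shuffle.service.enabled) are not on the slide; they are included on the assumption, taken from the Spark configuration documentation, that dynamic allocation on YARN needs both.

```python
def dynamic_allocation_args(min_executors, max_executors,
                            initial_executors, idle_timeout_s):
    """Build spark-submit --conf arguments for dynamic allocation."""
    if not (min_executors <= initial_executors <= max_executors):
        raise ValueError("expected min <= initial <= max executors")
    conf = {
        # Assumed prerequisites, not listed on the slide.
        "spark.dynamicAllocation.enabled": "true",
        "spark.shuffle.service.enabled": "true",
        # Knobs named on the slide.
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        "spark.dynamicAllocation.initialExecutors": str(initial_executors),
        "spark.dynamicAllocation.executorIdleTimeout": str(idle_timeout_s),
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Example values are arbitrary; tune them for the actual workload.
print(dynamic_allocation_args(1, 10, 2, 60))
```

The returned list is meant to be placed before the application file on the spark-submit command line.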
  • 31. Spark Scheduling
      • Fair scheduler for resource sharing
          • spark.yarn.queue
      • Standalone cluster mode currently only supports a simple FIFO scheduler across applications
  • 32. Spark Not Running On YARN?
      • Symptom: spark-submit is used to run a Python job, but resources are being used on only one machine.
  • 33. Spark Not Running On YARN?
      • Cause: $ spark-submit pi.py --master yarn-client
      • Workaround: ensure that the options are in the right position
      • Fix: $ spark-submit --master yarn-client pi.py 1000
      • Usage: spark-submit [options] <app jar | python file> [app options]
      • Many improvements were made to spark-submit in Spark 1.2 (SPARK-1652)
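The ordering rule in the usage line on slide 33 (spark-submit options first, then the application file, then the application's own arguments) can be captured in a small helper so the mistake cannot recur in scripts. This is a minimal sketch; the function name `spark_submit_argv` is hypothetical.

```python
def spark_submit_argv(app_file, options=None, app_args=None):
    """Assemble a spark-submit command line in the required order.

    Anything after the application file is handed to the application,
    so spark-submit options must come before it.
    """
    argv = ["spark-submit"]
    for flag, value in (options or {}).items():
        argv += [flag, value]
    argv.append(app_file)
    argv += list(app_args or [])
    return argv


print(" ".join(spark_submit_argv("pi.py", {"--master": "yarn-client"}, ["1000"])))
# spark-submit --master yarn-client pi.py 1000
```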
  • 34. PySpark on YARN Limitation
      • Symptom:
          $ spark-submit --master yarn-cluster pi.py 1000
          Error: Cluster deploy mode is currently not supported for python applications.
          Run with --help for usage help or --verbose for debug output
  • 35. PySpark on YARN Limitation
      • Workaround:
          $ spark-submit --master yarn-client pi.py 1000
          …
          Pi is roughly 3.132290
          15/02/11 09:41:34 INFO SparkUI: Stopped Spark web UI at http://sparktest-1.ent.cloudera.com:4040
          15/02/11 09:41:34 INFO DAGScheduler: Stopping DAGScheduler
      • Future work: SPARK-5162 / SPARK-5173 (fixed in Spark 1.3)
  • 36. Lost Spark Executors
      • Symptom:
          • Spark Driver WARN messages:
              14/12/08 17:11:08 WARN scheduler.TaskSetManager: Lost task 205.0 in stage 2.0 (TID 352, test-1.cloudera.com: ExecutorLostFailure (executor lost)
          • NodeManager logs:
              2014-12-08 17:10:32,860 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=26842,containerID=container_1418059756626_0010_01_000093_01] is running beyond physical memory limits. Current usage: 26.2 GB of 26 GB physical memory used; 27.1 GB of 54.6 GB virtual memory used. Killing container.
  • 37. Lost Spark Executors
      • Workaround:
          • Increase spark.yarn.[executor|driver].memoryOverhead
          • Test values from 1GB to 4GB for your specific use case
          • The Spark default is 384M
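The reason raising the overhead helps can be sketched with simple arithmetic: the container Spark asks YARN for is the executor memory plus spark.yarn.executor.memoryOverhead, and YARN kills the container once physical usage passes that size. The function name and the 24 GB + 2 GB split below are hypothetical, chosen only so the total matches the 26 GB limit in the NodeManager log on slide 36; the deck does not state the actual settings.

```python
def executor_container_mb(executor_memory_mb, overhead_mb=384):
    """Approximate size of the YARN container requested for one executor.

    executor_memory_mb -> spark.executor.memory
    overhead_mb        -> spark.yarn.executor.memoryOverhead
                          (384 MB is the default quoted on the slide)
    YARN may round the request up further; that is ignored here.
    """
    return executor_memory_mb + overhead_mb


# Hypothetical split that adds up to the 26 GB limit seen in the log.
limit_mb = executor_container_mb(24 * 1024, overhead_mb=2 * 1024)
print(limit_mb / 1024)  # 26.0

# Raising the overhead to 4 GB gives the same executor more headroom.
print(executor_container_mb(24 * 1024, overhead_mb=4 * 1024) / 1024)  # 28.0
```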
  • 38. Spark Shuffle
      • A lot of changes to improve shuffle logic
          • Add sort-based shuffle: SPARK-2045
          • Add MR-style merge sort: SPARK-2926
          • Improve SparkSQL shuffle: SPARK-3056
      • Bugs are still being uncovered
          • Multiple concurrent attempts for fetch failures: SPARK-7308
  • 39. Spark Improvements
      • PySpark in YARN cluster mode
      • Spark action executor
      • Spark Streaming WAL on HDFS, preventing any data loss on driver failure
      • Spark external shuffle service
      • Improvements in the collection of task metrics
      • Kafka connector for Spark Streaming to avoid the need for the HDFS WAL
  • 40. Conclusion
  • 41. YARN performance
      • Improved cluster utilization
          • Can run more jobs in smaller clusters
          • Run in uber mode for smaller jobs (reduces AM overhead)
      • Dynamic resource sharing between frameworks
          • One framework can use the entire cluster
      • Tom White’s Hadoop: The Definitive Guide, 4th Ed.
          • Chapter 4 is on YARN
  • 42. Questions? @kate_ting @miklos_c
  • 43. Spark Tuning Parameters
      • spark.shuffle.consolidateFiles=true
      • spark.yarn.executor.memoryOverhead
      • spark.yarn.driver.memoryOverhead
      • spark.shuffle.manager=SORT
      • spark.rdd.compress=true
      • spark.serializer=org.apache.spark.serializer.KryoSerializer
  • 44. YARN vs Mesos: Resource Manager’s role
      • YARN: asks for resources; evolved into a resource manager; written in Java; locality aware
      • Mesos: offers resources; evolved into managing Hadoop; written in C++; more customizable