SlideShare una empresa de Scribd logo
1 de 31
Descargar para leer sin conexión
Apache Hadoop YARN
and Tez
Future of Data Processing with Hadoop
Page 1
Chris Harris
charris@hortonworks.com
Agenda
Page 2
• Overview
• Status Quo
• Architecture
• Improvements and Updates
• Tez
Hadoop MapReduce Classic
•  JobTracker
–  Manages cluster resources and job scheduling
•  TaskTracker
–  Per-node agent
–  Manage tasks
Current Limitations
•  Scalability
–  Maximum Cluster size – 4,000 nodes
–  Maximum concurrent tasks – 40,000
–  Coarse synchronization in JobTracker
•  Single point of failure
–  Failure kills all queued and running jobs
–  Jobs need to be re-submitted by users
•  Restart is very tricky due to complex state
4
Current Limitations
•  Hard partition of resources into map and reduce slots
–  Low resource utilization
•  Lacks support for alternate paradigms
–  Iterative applications implemented using MapReduce are
10x slower
–  Hacks for the likes of MPI/Graph Processing
•  Lack of wire-compatible protocols
–  Client and cluster must be of same version
–  Applications and workflows cannot migrate to different
clusters
5
Requirements
•  Reliability
•  Availability
•  Utilization
•  Wire Compatibility
•  Agility & Evolution – Ability for customers to control
upgrades to the grid software stack.
•  Scalability - Clusters of 6,000-10,000 machines
–  Each machine with 16 cores, 48G/96G RAM, 24TB/36TB
disks
–  100,000+ concurrent tasks
–  10,000 concurrent jobs
6
Design Centre
•  Split up the two major functions of JobTracker
–  Cluster resource management
–  Application life-cycle management
•  MapReduce becomes user-land library
7
Concepts
•  Application
–  Application is a job submitted to the framework
–  Example – Map Reduce Job
•  Container
–  Basic unit of allocation
–  Example – container A = 2GB, 1CPU
–  Fine-grained resource allocation
–  Replaces the fixed map/reduce slots
8
Architecture
•  Resource Manager
–  Global resource scheduler
–  Hierarchical queues
•  Node Manager
–  Per-machine agent
–  Manages the life-cycle of container
–  Container resource monitoring
•  Application Master
–  Per-application
–  Manages application scheduling and task execution
–  E.g. MapReduce Application Master
9
Architecture
Resource
Manager
MapReduce Status
Job Submission
Client
Node
Manager
Node
Manager
Container
Node
Manager
App Mstr
Node Status
Resource Request
Resource
Manager
Client
MapReduce Status
Job Submission
Client
Node
Manager
Node
Manager
App Mstr Container
Node
Manager
App Mstr
Node Status
Resource Request
Resource
Manager
Client
MapReduce Status
Job Submission
Client
Node
Manager
Container Container
Node
Manager
App Mstr Container
Node
Manager
Container App Mstr
Node Status
Resource Request
Improvements vis-à-vis classic MapReduce
11
•  Utilization
–  Generic resource model
•  Data Locality (node, rack etc.)
•  Memory
•  CPU
•  Disk b/q
•  Network b/w
–  Remove fixed partition of map and reduce slot
•  Scalability
–  Application life-cycle management is very expensive
–  Partition resource management and application life-cycle management
–  Application management is distributed
–  Hardware trends - Currently run clusters of 4,000 machines
•  6,000 2012 machines > 12,000 2009 machines
•  <16+ cores, 48/96G, 24TB> v/s <8 cores, 16G, 4TB>
•  Fault Tolerance and Availability
–  Resource Manager
•  No single point of failure – state saved in ZooKeeper (coming
soon)
•  Application Masters are restarted automatically on RM restart
–  Application Master
•  Optional failover via application-specific checkpoint
•  MapReduce applications pick up where they left off via state saved
in HDFS
•  Wire Compatibility
–  Protocols are wire-compatible
–  Old clients can talk to new servers
–  Rolling upgrades
12
Improvements vis-à-vis classic MapReduce
•  Innovation and Agility
–  MapReduce now becomes a user-land library
–  Multiple versions of MapReduce can run in the same cluster
(a la Apache Pig)
•  Faster deployment cycles for improvements
–  Customers upgrade MapReduce versions on their schedule
–  Users can customize MapReduce
13
Improvements vis-à-vis classic MapReduce
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
To Talk to Tez..You Have To First Talk Yarn
•  New Processing layer in Hadoop 2.0 that decouples Resource
management from application management
•  Created to manage resource needs across all uses
•  Ensures predictable performance & QoS for all apps
•  Enables apps to run “IN” Hadoop rather than “ON”
– Key to leveraging all other common services of the Hadoop platform:
security, data lifecycle management, etc.
Page 14
Applica'ons	
  Run	
  Na'vely	
  IN	
  Hadoop	
  
HDFS2	
  (Redundant,	
  Reliable	
  Storage)	
  
YARN	
  (Cluster	
  Resource	
  Management)	
  	
  	
  
BATCH	
  
(MapReduce)	
  
INTERACTIVE	
  
(Tez)	
  
STREAMING	
  
(Storm,	
  S4,…)	
  
GRAPH	
  
(Giraph)	
  
IN-­‐MEMORY	
  
(Spark)	
  
HPC	
  MPI	
  
(OpenMPI)	
  
ONLINE	
  
(HBase)	
  
OTHER	
  
(Search)	
  
(Weave…)	
  
© Hortonworks Inc. 2013
Resource
Manager
Client
MapReduce Status
Job Submission
Client
Node
Manager
Container Container
Node
Manager
App Mstr Container
Node
Manager
Container App Mstr
Node Status
Resource Request
Tez: High Throughput and Low Latency
Tez runs in
YARN
Accelerate
High Throughput
AND Low
Latency
Processing
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
Tez - Core Idea
Task with pluggable Input, Processor & Output
Page 16
YARN ApplicationMaster to run DAG of Tez Tasks
Input Processor
Task
Output
Tez Task - <Input, Processor, Output>
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
Tez Hive Performance
• Low level data-processing, execution engine on YARN
• Base for MapReduce, Hive, Pig, Cascading, etc.
• Re-usable data processing primitives (ex: sort, merge,
intermediate data management)
• All Hive SQL, can be expressed as single job
– Jobs are no longer interrupted (efficient pipeline)
– Avoid writing intermediate output to HDFS when performance
outweights job re-start (speed and network/disk usage savings)
– Break MR contract to turn MRMRMR to MRRR (flexible DAG)
• Removes task and job overhead (10s savings is huge
for a 2s query!)
Page 17
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
Tez Service
• MR Query Startup Expensive
– Job launch & task-launch latencies are fatal for short queries (in
order of 5s to 30s)
• Tez Service
– An always-on pool of Application Masters
– Hive (and soon others like Pig) jobs are executed on an
Application Masterin the pool instead of starting a new Application
Master(saving precious seconds)
– Container Preallocation/Ware Containers
• In Summary…
– Removes job-launch overhead (Application Master)
– Removes task-launch overhead (Pre-warmed Containers)
Page 18
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
Stinger Initiative: Making Apache Hive 100X Faster
Page 19
© Hortonworks Inc. 2013
Page 19DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & SENSITIVE INFORMATION
Hadoop
Base	
  Op'miza'ons	
  
	
  
Generate	
  simplified	
  DAGs	
  
In-­‐memory	
  Hash	
  Joins	
  
Current
3 – 9 Months
9 – 18 Months
Deep	
  Analy'cs	
  
	
  
SQL	
  CompaEble	
  Types	
  
SQL	
  CompaEble	
  Windowing	
  
More	
  SQL	
  Subqueries	
  
Hive
YARN	
  
	
  
Next-­‐gen	
  Hadoop	
  data	
  
processing	
  framework	
  
Tez	
  
	
  
Express	
  data	
  processing	
  
tasks	
  more	
  simply	
  
Eliminate	
  disk	
  writes	
  
Hive	
  Tez	
  Service	
  
	
  
Pre-­‐warmed	
  Containers	
  
Low-­‐latency	
  dispatch	
  
ORCFile	
  
	
  
Column	
  Store	
  
High	
  Compression	
  
Predicate	
  /	
  Filter	
  Pushdowns	
  
Buffer	
  Caching	
  
	
  
Cache	
  accessed	
  data	
  
OpEmized	
  for	
  vector	
  engine	
  
Query	
  Planner	
  
	
  
Intelligent	
  Cost-­‐Based	
  
OpEmizer	
  
Vector	
  Query	
  Engine	
  
	
  
OpEmized	
  for	
  modern	
  
processor	
  architectures	
  
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
Demo Benchmark Spec
• The TPC-DS benchmark data+query set
• Query 27
– big table(store_sales) joins lots of small tables
– A.K.A Star Schema Join
• What does Query 27 do?
For all items sold in stores located in specified states during a given
year, find the average quantity, average list price, average list sales
price, average coupon amount for a given gender, marital status,
education and customer demographic..
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
SELECT	
  col5,	
  avg(col6)	
  
FROM	
  store_sales_fact	
  ssf	
  
	
  join	
  item_dim	
  on	
  (ssf.col1	
  =	
  item_dim	
  .col1)	
  	
  	
  
	
  join	
  date_dim	
  on	
  (ssf.col2	
  =	
  date_dim.col2	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
	
  join	
  custdmgrphcs_dim	
  on	
  (ssf.col3	
  =custdmgrphcs_dim.col3)	
  	
  	
  
	
  join	
  store_dim	
  on	
  (ssf.col4	
  =	
  store_dim.col4)	
  	
  
GROUP	
  BY	
  col5	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
ORDER	
  BY	
  col5	
  
LIMIT	
  100;	
  
Query 27 - Star Schema Join
• Derived from TPC-DS Query 27
Page 21
41 GB
58 MB
11MB
80MB
106 KB
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
Benchmark Cluster Specs
• 6 Node HDP 1.3 Cluster
– 2 Master Nodes
– 4 Data/Slave Nodes
• Slave Node Specs
– Dual Processors
– 14 GB
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
Query27 Execution Before Hive 11-Text Format
Query spawned 5 MR Jobs
The intermediate output of
each job is written to HDFS
Job 1 of 5 – Mappers stream Fact table and first
dimension table sorted by join key. Reducers do the
join operation between the two tables. Job 2 of 5: Mappers and Reducers join the
output of job 1 and second dimension table.
Last job performs the order
by and group operation
Query Response Time
149 total mappers got executed
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
Query27 Execution With Hive 11-Text Format
Query spawned of 1 job
with Hive 11 compared to 5
MR Jobs with Hive 10
Job 1 of 1 – Each Mapper loads into memory the 4
small dimension tables and streams parts of the
large fact table. Joins then occur in Mapper hence
the name MapJoin
Increase in performance with Hive
11 as query time went down from
21 minutes to about 4 minutes
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
Query27 Execution With Hive 11- RC Format
Conversion from Text to
RC file format decreased
size of dimension data set
from 38 GB to 8.21 GB
Smaller file equates to less
IO causing the query time
to decrease from 246
seconds to 136 seconds
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
Query27 Execution With Hive 11- ORC Format
ORC File type consolidates
data more tighly than
RCFile as the size of
dataset decreased from
8.21 GB to 2.83 GB
Smaller file equates to less
IO causing the query time
to decrease from 136
seconds to 104 seconds
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
Query27 Execution With Hive 11- RC Format/
Partitioned, Bucketed and Sorted
Partitioned the data
decreases the file input
size to 1.73 GB
Smaller file equates to less
IO causing the query time
to decrease from 136
seconds to 104 seconds
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
Query27 Execution With Hive 11- ORCFormat/
Partitioned, Bucketed and Sorted
Partitioned Data with ORC
file format produces the
smallest input size of 687
MB a decrease from 43 GB
Smallest Input Size allows
us the fastest response
time for the query: 79
seconds
© Hortonworks Inc. 2013© Hortonworks Inc. 2013
Summary of Results
File	
  Type	
   Number	
  of	
  
MR	
  Jobs	
  
Input	
  Size	
   Mappers	
   Time	
  
Text/Hive	
  10	
   5	
   43.1	
  GB	
   179	
   21	
  minutes	
  
Text/Hive	
  11	
   1	
   38	
  GB	
   151	
   246	
  seconds	
  
RC/Hive	
  11	
   1	
   8.21	
  GB	
   76	
   136	
  seconds	
  
ORC/Hive	
  11	
   1	
   2.83	
  GB	
   38	
   104	
  seconds	
  
RC/Hive	
  11/
Partitioned/
Bucketed	
  
1	
   1.73	
  GB	
   19	
   104	
  seconds	
  
ORC/Hive	
  11/
Partitioned/
Bucketed	
  
1	
   687	
  MB	
   27	
   79.62	
  
Resources
Page 30
hadoop-2.0.3-alpha:
http://hadoop.apache.org/common/releases.html
Release Documentation:
http://hadoop.apache.org/common/docs/r2.0.3-alpha/
Blogs:
http://hortonworks.com/blog/category/apache-hadoop/yarn/
http://hortonworks.com/blog/introducing-apache-hadoop-yarn/
http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-
overview/
http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-
overview/
© Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
Thank You!
Questions & Answers
Page 31

Más contenido relacionado

La actualidad más candente

Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureDataWorks Summit
 
Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache TezGetInData
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Hortonworks
 
Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataDataWorks Summit
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
Hadoop Summit Europe 2015 - YARN Present and Future
Hadoop Summit Europe 2015 - YARN Present and FutureHadoop Summit Europe 2015 - YARN Present and Future
Hadoop Summit Europe 2015 - YARN Present and FutureVinod Kumar Vavilapalli
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesDataWorks Summit
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureHadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureVinod Kumar Vavilapalli
 
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...StampedeCon
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hiverxu
 
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARNAdam Kawa
 
YARN - Next Generation Compute Platform fo Hadoop
YARN - Next Generation Compute Platform fo HadoopYARN - Next Generation Compute Platform fo Hadoop
YARN - Next Generation Compute Platform fo HadoopHortonworks
 
NextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduceNextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduceHortonworks
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNDataWorks Summit
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaData Con LA
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing enginebigdatagurus_meetup
 

La actualidad más candente (20)

Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and Future
 
Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache Tez
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012
 
Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big Data
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Hadoop Summit Europe 2015 - YARN Present and Future
Hadoop Summit Europe 2015 - YARN Present and FutureHadoop Summit Europe 2015 - YARN Present and Future
Hadoop Summit Europe 2015 - YARN Present and Future
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureHadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
 
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hive
 
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARN
 
YARN - Next Generation Compute Platform fo Hadoop
YARN - Next Generation Compute Platform fo HadoopYARN - Next Generation Compute Platform fo Hadoop
YARN - Next Generation Compute Platform fo Hadoop
 
NextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduceNextGen Apache Hadoop MapReduce
NextGen Apache Hadoop MapReduce
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARN
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Hive Now Sparks
Hive Now SparksHive Now Sparks
Hive Now Sparks
 

Similar a Apache Hadoop YARN - The Future of Data Processing with Hadoop

Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingTeddy Choi
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingHortonworks
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopHortonworks
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
 
Apache Hadoop MapReduce: What's Next
Apache Hadoop MapReduce: What's NextApache Hadoop MapReduce: What's Next
Apache Hadoop MapReduce: What's NextDataWorks Summit
 
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...Caserta
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesYahoo Developer Network
 
Next Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduceNext Generation of Hadoop MapReduce
Next Generation of Hadoop MapReducehuguk
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnhdhappy001
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformBikas Saha
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopHortonworks
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
 
Hadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next GenHadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next GenHortonworks
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Modern Data Stack France
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and FutureDataWorks Summit
 
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...Cloudera, Inc.
 

Similar a Apache Hadoop YARN - The Future of Data Processing with Hadoop (20)

Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache Hadoop
 
MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
Apache Hadoop MapReduce: What's Next
Apache Hadoop MapReduce: What's NextApache Hadoop MapReduce: What's Next
Apache Hadoop MapReduce: What's Next
 
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and Insides
 
Next Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduceNext Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduce
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute Platform
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Hadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next GenHadoop World 2011, Apache Hadoop MapReduce Next Gen
Hadoop World 2011, Apache Hadoop MapReduce Next Gen
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
 
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H...
 

Más de Hortonworks

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyHortonworks
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakHortonworks
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsHortonworks
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysHortonworks
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's NewHortonworks
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerHortonworks
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsHortonworks
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeHortonworks
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidHortonworks
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleHortonworks
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATAHortonworks
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Hortonworks
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseHortonworks
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseHortonworks
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationHortonworks
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementHortonworks
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHortonworks
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCHortonworks
 

Más de Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Último

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 

Último (20)

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 

Apache Hadoop YARN - The Future of Data Processing with Hadoop

  • 1. Apache Hadoop YARN and Tez Future of Data Processing with Hadoop Page 1 Chris Harris charris@hortonworks.com
  • 3. Hadoop MapReduce Classic •  JobTracker –  Manages cluster resources and job scheduling •  TaskTracker –  Per-node agent –  Manage tasks
  • 4. Current Limitations •  Scalability –  Maximum Cluster size – 4,000 nodes –  Maximum concurrent tasks – 40,000 –  Coarse synchronization in JobTracker •  Single point of failure –  Failure kills all queued and running jobs –  Jobs need to be re-submitted by users •  Restart is very tricky due to complex state 4
  • 5. Current Limitations •  Hard partition of resources into map and reduce slots –  Low resource utilization •  Lacks support for alternate paradigms –  Iterative applications implemented using MapReduce are 10x slower –  Hacks for the likes of MPI/Graph Processing •  Lack of wire-compatible protocols –  Client and cluster must be of same version –  Applications and workflows cannot migrate to different clusters 5
  • 6. Requirements •  Reliability •  Availability •  Utilization •  Wire Compatibility •  Agility & Evolution – Ability for customers to control upgrades to the grid software stack. •  Scalability - Clusters of 6,000-10,000 machines –  Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks –  100,000+ concurrent tasks –  10,000 concurrent jobs 6
  • 7. Design Centre •  Split up the two major functions of JobTracker –  Cluster resource management –  Application life-cycle management •  MapReduce becomes user-land library 7
  • 8. Concepts •  Application –  Application is a job submitted to the framework –  Example – Map Reduce Job •  Container –  Basic unit of allocation –  Example – container A = 2GB, 1CPU –  Fine-grained resource allocation –  Replaces the fixed map/reduce slots 8
  • 9. Architecture •  Resource Manager –  Global resource scheduler –  Hierarchical queues •  Node Manager –  Per-machine agent –  Manages the life-cycle of container –  Container resource monitoring •  Application Master –  Per-application –  Manages application scheduling and task execution –  E.g. MapReduce Application Master 9
  • 10. Architecture Resource Manager MapReduce Status Job Submission Client Node Manager Node Manager Container Node Manager App Mstr Node Status Resource Request Resource Manager Client MapReduce Status Job Submission Client Node Manager Node Manager App Mstr Container Node Manager App Mstr Node Status Resource Request Resource Manager Client MapReduce Status Job Submission Client Node Manager Container Container Node Manager App Mstr Container Node Manager Container App Mstr Node Status Resource Request
  • 11. Improvements vis-à-vis classic MapReduce 11 •  Utilization –  Generic resource model •  Data Locality (node, rack etc.) •  Memory •  CPU •  Disk b/q •  Network b/w –  Remove fixed partition of map and reduce slot •  Scalability –  Application life-cycle management is very expensive –  Partition resource management and application life-cycle management –  Application management is distributed –  Hardware trends - Currently run clusters of 4,000 machines •  6,000 2012 machines > 12,000 2009 machines •  <16+ cores, 48/96G, 24TB> v/s <8 cores, 16G, 4TB>
  • 12. •  Fault Tolerance and Availability –  Resource Manager •  No single point of failure – state saved in ZooKeeper (coming soon) •  Application Masters are restarted automatically on RM restart –  Application Master •  Optional failover via application-specific checkpoint •  MapReduce applications pick up where they left off via state saved in HDFS •  Wire Compatibility –  Protocols are wire-compatible –  Old clients can talk to new servers –  Rolling upgrades 12 Improvements vis-à-vis classic MapReduce
  • 13. •  Innovation and Agility –  MapReduce now becomes a user-land library –  Multiple versions of MapReduce can run in the same cluster (a la Apache Pig) •  Faster deployment cycles for improvements –  Customers upgrade MapReduce versions on their schedule –  Users can customize MapReduce 13 Improvements vis-à-vis classic MapReduce
  • 14. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 To Talk to Tez..You Have To First Talk Yarn •  New Processing layer in Hadoop 2.0 that decouples Resource management from application management •  Created to manage resource needs across all uses •  Ensures predictable performance & QoS for all apps •  Enables apps to run “IN” Hadoop rather than “ON” – Key to leveraging all other common services of the Hadoop platform: security, data lifecycle management, etc. Page 14 Applica'ons  Run  Na'vely  IN  Hadoop   HDFS2  (Redundant,  Reliable  Storage)   YARN  (Cluster  Resource  Management)       BATCH   (MapReduce)   INTERACTIVE   (Tez)   STREAMING   (Storm,  S4,…)   GRAPH   (Giraph)   IN-­‐MEMORY   (Spark)   HPC  MPI   (OpenMPI)   ONLINE   (HBase)   OTHER   (Search)   (Weave…)  
  • 15. © Hortonworks Inc. 2013 Resource Manager Client MapReduce Status Job Submission Client Node Manager Container Container Node Manager App Mstr Container Node Manager Container App Mstr Node Status Resource Request Tez: High Throughput and Low Latency Tez runs in YARN Accelerate High Throughput AND Low Latency Processing
  • 16. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 Tez - Core Idea Task with pluggable Input, Processor & Output Page 16 YARN ApplicationMaster to run DAG of Tez Tasks Input Processor Task Output Tez Task - <Input, Processor, Output>
  • 17. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 Tez Hive Performance • Low level data-processing, execution engine on YARN • Base for MapReduce, Hive, Pig, Cascading, etc. • Re-usable data processing primitives (ex: sort, merge, intermediate data management) • All Hive SQL, can be expressed as single job – Jobs are no longer interrupted (efficient pipeline) – Avoid writing intermediate output to HDFS when performance outweights job re-start (speed and network/disk usage savings) – Break MR contract to turn MRMRMR to MRRR (flexible DAG) • Removes task and job overhead (10s savings is huge for a 2s query!) Page 17
  • 18. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 Tez Service • MR Query Startup Expensive – Job launch & task-launch latencies are fatal for short queries (in order of 5s to 30s) • Tez Service – An always-on pool of Application Masters – Hive (and soon others like Pig) jobs are executed on an Application Masterin the pool instead of starting a new Application Master(saving precious seconds) – Container Preallocation/Ware Containers • In Summary… – Removes job-launch overhead (Application Master) – Removes task-launch overhead (Pre-warmed Containers) Page 18
  • 19. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 Stinger Initiative: Making Apache Hive 100X Faster Page 19 © Hortonworks Inc. 2013 Page 19DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & SENSITIVE INFORMATION Hadoop Base  Op'miza'ons     Generate  simplified  DAGs   In-­‐memory  Hash  Joins   Current 3 – 9 Months 9 – 18 Months Deep  Analy'cs     SQL  CompaEble  Types   SQL  CompaEble  Windowing   More  SQL  Subqueries   Hive YARN     Next-­‐gen  Hadoop  data   processing  framework   Tez     Express  data  processing   tasks  more  simply   Eliminate  disk  writes   Hive  Tez  Service     Pre-­‐warmed  Containers   Low-­‐latency  dispatch   ORCFile     Column  Store   High  Compression   Predicate  /  Filter  Pushdowns   Buffer  Caching     Cache  accessed  data   OpEmized  for  vector  engine   Query  Planner     Intelligent  Cost-­‐Based   OpEmizer   Vector  Query  Engine     OpEmized  for  modern   processor  architectures  
  • 20. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 Demo Benchmark Spec • The TPC-DS benchmark data+query set • Query 27 – big table(store_sales) joins lots of small tables – A.K.A Star Schema Join • What does Query 27 do? For all items sold in stores located in specified states during a given year, find the average quantity, average list price, average list sales price, average coupon amount for a given gender, marital status, education and customer demographic..
  • 21. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 SELECT  col5,  avg(col6)   FROM  store_sales_fact  ssf    join  item_dim  on  (ssf.col1  =  item_dim  .col1)        join  date_dim  on  (ssf.col2  =  date_dim.col2                                    join  custdmgrphcs_dim  on  (ssf.col3  =custdmgrphcs_dim.col3)        join  store_dim  on  (ssf.col4  =  store_dim.col4)     GROUP  BY  col5                                                                                                         ORDER  BY  col5   LIMIT  100;   Query 27 - Star Schema Join • Derived from TPC-DS Query 27 Page 21 41 GB 58 MB 11MB 80MB 106 KB
  • 22. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 Benchmark Cluster Specs • 6 Node HDP 1.3 Cluster – 2 Master Nodes – 4 Data/Slave Nodes • Slave Node Specs – Dual Processors – 14 GB
  • 23. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 Query27 Execution Before Hive 11-Text Format Query spawned 5 MR Jobs The intermediate output of each job is written to HDFS Job 1 of 5 – Mappers stream Fact table and first dimension table sorted by join key. Reducers do the join operation between the two tables. Job 2 of 5: Mappers and Reducers join the output of job 1 and second dimension table. Last job performs the order by and group operation Query Response Time 149 total mappers got executed
  • 24. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 Query27 Execution With Hive 11-Text Format Query spawned of 1 job with Hive 11 compared to 5 MR Jobs with Hive 10 Job 1 of 1 – Each Mapper loads into memory the 4 small dimension tables and streams parts of the large fact table. Joins then occur in Mapper hence the name MapJoin Increase in performance with Hive 11 as query time went down from 21 minutes to about 4 minutes
  • 25. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 Query27 Execution With Hive 11- RC Format Conversion from Text to RC file format decreased size of dimension data set from 38 GB to 8.21 GB Smaller file equates to less IO causing the query time to decrease from 246 seconds to 136 seconds
  • 26. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 Query27 Execution With Hive 11- ORC Format ORC File type consolidates data more tighly than RCFile as the size of dataset decreased from 8.21 GB to 2.83 GB Smaller file equates to less IO causing the query time to decrease from 136 seconds to 104 seconds
  • 27. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 Query27 Execution With Hive 11- RC Format/ Partitioned, Bucketed and Sorted Partitioned the data decreases the file input size to 1.73 GB Smaller file equates to less IO causing the query time to decrease from 136 seconds to 104 seconds
  • 28. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 Query27 Execution With Hive 11- ORCFormat/ Partitioned, Bucketed and Sorted Partitioned Data with ORC file format produces the smallest input size of 687 MB a decrease from 43 GB Smallest Input Size allows us the fastest response time for the query: 79 seconds
  • 29. © Hortonworks Inc. 2013© Hortonworks Inc. 2013 Summary of Results File  Type   Number  of   MR  Jobs   Input  Size   Mappers   Time   Text/Hive  10   5   43.1  GB   179   21  minutes   Text/Hive  11   1   38  GB   151   246  seconds   RC/Hive  11   1   8.21  GB   76   136  seconds   ORC/Hive  11   1   2.83  GB   38   104  seconds   RC/Hive  11/ Partitioned/ Bucketed   1   1.73  GB   19   104  seconds   ORC/Hive  11/ Partitioned/ Bucketed   1   687  MB   27   79.62  
  • 31. © Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION Thank You! Questions & Answers Page 31