February 2014 HUG : Tez Details and Insides
- 1. Apache Tez: Accelerating Hadoop Query Processing
Bikas Saha
@bikassaha
© Hortonworks Inc. 2013
- 2. Tez – Introduction
• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph.
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN – the resource management framework for Hadoop.
• Open source Apache Incubator project, Apache licensed.
- 3. Tez – Design Themes
• Empowering End Users
• Execution Performance
- 4. Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying deployment
- 5. Tez – Empowering End Users
• Expressive dataflow definition APIs
[Figure: Distributed Sort as a dataflow graph. Vertices: Preprocessor Stage (Task-1, Task-2), Sampler, Partition Stage (Task-1, Task-2), Aggregate Stage (Task-1, Task-2). Edge labels: Samples, Ranges.]
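A minimal Python sketch of what a dataflow definition could look like. This is not the Tez API: the `DAG` class and its methods are hypothetical, and the edge wiring is an assumption, since the figure survives only as labels. It shows vertices being declared, connected, and ordered so that every producer runs before its consumers.

```python
from collections import defaultdict, deque


class DAG:
    """A tiny dataflow graph: named vertices connected by directed edges."""

    def __init__(self, name):
        self.name = name
        self.vertices = []
        self.edges = defaultdict(list)  # producer -> list of consumers

    def add_vertex(self, vertex):
        self.vertices.append(vertex)
        return self

    def add_edge(self, producer, consumer):
        self.edges[producer].append(consumer)
        return self

    def execution_order(self):
        """Order vertices so each producer precedes its consumers (Kahn's algorithm)."""
        indegree = {v: 0 for v in self.vertices}
        for consumers in self.edges.values():
            for consumer in consumers:
                indegree[consumer] += 1
        ready = deque(v for v in self.vertices if indegree[v] == 0)
        order = []
        while ready:
            vertex = ready.popleft()
            order.append(vertex)
            for consumer in self.edges[vertex]:
                indegree[consumer] -= 1
                if indegree[consumer] == 0:
                    ready.append(consumer)
        if len(order) != len(self.vertices):
            raise ValueError("graph has a cycle")
        return order


# Assumed wiring: the preprocessor feeds both the sampler and the partitioner,
# and the sampler sends key ranges to the partitioner.
sort_dag = (
    DAG("distributed-sort")
    .add_vertex("preprocessor")
    .add_vertex("sampler")
    .add_vertex("partition")
    .add_vertex("aggregate")
    .add_edge("preprocessor", "sampler")    # samples
    .add_edge("sampler", "partition")       # ranges
    .add_edge("preprocessor", "partition")  # data to be partitioned
    .add_edge("partition", "aggregate")
)
print(sort_dag.execution_order())
```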
- 6. Tez – Empowering End Users
• Flexible Input-Processor-Output runtime model
– Construct physical runtime executors dynamically by connecting
different inputs, processors and outputs.
– End goal is to have a library of inputs, outputs and processors that can
be programmatically composed to generate useful tasks.
[Figure: three example tasks composed from inputs, processors and outputs. Mapper: HDFSInput, MapProcessor, FileSortedOutput. FinalReduce: ShuffleInput, ReduceProcessor, HDFSOutput. IntermediateJoiner: Input1 and Input2, JoinProcessor, FileSortedOutput.]
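The composition idea can be sketched in plain Python. Everything here is a hypothetical stand-in (`ListInput`, `SortedOutput`, `Task` and the two processor functions are not Tez classes): a task is simply whatever inputs, processor and output are plugged together, so the same building blocks yield a mapper-like task and a reducer-like task.

```python
from itertools import groupby


class ListInput:
    """Stand-in for an input such as HDFSInput or ShuffleInput."""

    def __init__(self, records):
        self.records = list(records)

    def read(self):
        return iter(self.records)


class ListOutput:
    """Stand-in for an output that keeps records in arrival order."""

    def __init__(self):
        self.records = []

    def write(self, record):
        self.records.append(record)

    def close(self):
        pass


class SortedOutput(ListOutput):
    """Stand-in for a sorted output: records are sorted when the task finishes."""

    def close(self):
        self.records.sort()


def map_processor(inputs, output):
    # Emit (word, 1) for every word in every input line.
    for line in inputs[0].read():
        for word in line.split():
            output.write((word, 1))


def reduce_processor(inputs, output):
    # Relies on the input being sorted by key, as a sorted output guarantees.
    for word, group in groupby(inputs[0].read(), key=lambda kv: kv[0]):
        output.write((word, sum(count for _, count in group)))


class Task:
    """A runtime executor assembled from inputs, a processor and an output."""

    def __init__(self, name, inputs, processor, output):
        self.name = name
        self.inputs = inputs
        self.processor = processor
        self.output = output

    def run(self):
        self.processor(self.inputs, self.output)
        self.output.close()
        return self.output.records


mapped = Task("Mapper", [ListInput(["b a", "a c a"])], map_processor, SortedOutput()).run()
counts = Task("FinalReduce", [ListInput(mapped)], reduce_processor, ListOutput()).run()
print(counts)
```

Swapping `SortedOutput` for `ListOutput` in the first task would break the reducer's grouping, which illustrates why the choice of output is part of the task definition.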
- 7. Tez – Empowering End Users
• Data type agnostic
– Tez is only concerned with the movement of data, as files and streams of bytes.
– Clean separation between the logical application layer and the physical framework layer. This design is important for Tez to serve as a platform for a variety of applications.
[Figure: a Tez Task moves Bytes to and from a File or a Stream; the User Code inside the task works with Key Value pairs or Tuples.]
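A small Python sketch of this separation, under stated assumptions: `framework_transfer` stands in for the framework layer, and the length-prefixed record format in `encode_pairs` is an invented example of what user code might choose. The framework function never looks inside the payload.

```python
import struct


def framework_transfer(payload):
    """Stand-in for the framework layer: it moves opaque bytes and nothing else."""
    if not isinstance(payload, bytes):
        raise TypeError("the framework layer only moves bytes")
    return payload


def encode_pairs(pairs):
    """User-code serializer (hypothetical format): 4-byte key length, UTF-8 key, 8-byte signed value."""
    out = bytearray()
    for key, value in pairs:
        raw = key.encode("utf-8")
        out += struct.pack(">I", len(raw)) + raw + struct.pack(">q", value)
    return bytes(out)


def decode_pairs(data):
    """User-code deserializer matching encode_pairs."""
    pairs, offset = [], 0
    while offset < len(data):
        (key_len,) = struct.unpack_from(">I", data, offset)
        offset += 4
        key = data[offset:offset + key_len].decode("utf-8")
        offset += key_len
        (value,) = struct.unpack_from(">q", data, offset)
        offset += 8
        pairs.append((key, value))
    return pairs


pairs = [("tez", 1), ("yarn", 2)]
received = decode_pairs(framework_transfer(encode_pairs(pairs)))
print(received)
```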
- 8. Tez – Empowering End Users
• Simplifying deployment
– Tez is a completely client-side application.
– No cluster-side deployment is needed. Upload the Tez libraries to any accessible FileSystem and point the local Tez configuration at that location.
– Enables running different versions concurrently. Easy to test new
functionality while keeping stable versions for production.
– Leverages YARN local resources.
[Figure: two Client Machines, each running a TezClient, use different Tez libraries (Tez Lib 1, Tez Lib 2) stored on HDFS; Node Managers run the corresponding TezTasks.]
- 9. Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying usage
With powerful APIs come great responsibilities
Tez is a framework on which end user applications can be built
- 10. Tez – Execution Performance
• Performance gains over Map Reduce
• Optimal resource management
• Plan reconfiguration at runtime
• Dynamic physical data flow decisions
- 11. Tez – Execution Performance
• Performance gains over Map Reduce
– Eliminate replicated write barrier between successive computations.
– Eliminate job launch overhead of workflow jobs.
– Eliminate extra stage of map reads in every workflow job.
– Eliminate queue and resource contention suffered by workflow jobs that
are started after a predecessor job completes.
[Figure: the same Pig/Hive plan shown on MR and on Tez.]
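The four bullets above can be restated as simple counters. The sketch below is a rough model, not a measurement: it assumes a plan with a given number of shuffle stages runs as one Map Reduce job per stage in a workflow, and as a single job on Tez. The function and key names are hypothetical.

```python
def mapreduce_workflow(num_shuffle_stages):
    """Counters for a plan executed as one Map Reduce job per shuffle stage."""
    jobs = num_shuffle_stages
    return {
        "job_launches": jobs,
        # Every job except the last writes its output to replicated storage
        # only so that the next job can read it back.
        "replicated_intermediate_writes": jobs - 1,
        # Every job except the first needs a map stage just to re-read that output.
        "extra_map_read_stages": jobs - 1,
    }


def tez_dag(num_shuffle_stages):
    """Counters for the same plan executed as a single dataflow graph."""
    return {
        "job_launches": 1,
        "replicated_intermediate_writes": 0,
        "extra_map_read_stages": 0,
    }


print(mapreduce_workflow(3))
print(tez_dag(3))
```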
- 12. Tez – Execution Performance
• Plan reconfiguration at runtime
– Dynamic runtime concurrency control based on data size, user operator
resources, available cluster resources and locality.
– Advanced changes in dataflow graph structure.
– Progressive graph construction in concert with user optimizer.
[Figure: Stage 1 reads HDFS blocks with 50 maps and produces 100 partitions; Stage 2 is planned with 100 reducers. At runtime, with only 10 GB of data, Stage 2 is reconfigured from 100 reducers to 10, using YARN resources accordingly.]
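A minimal sketch of the data-size part of this decision, reproducing the 100-to-10 reducer example. The function name and the 1 GB-per-reducer target are assumptions for illustration; the other factors the slide lists (operator resources, cluster resources, locality) are left out.

```python
import math

GB = 1024 ** 3


def choose_reducer_count(total_bytes, configured_reducers, target_bytes_per_reducer):
    """Shrink the planned reducer count when the observed data is small.

    Never returns more reducers than were planned, and never fewer than one.
    """
    needed = math.ceil(total_bytes / target_bytes_per_reducer)
    return max(1, min(needed, configured_reducers))


# 10 GB of observed data, 100 planned reducers, assumed target of 1 GB each.
print(choose_reducer_count(10 * GB, 100, 1 * GB))
```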
- 13. Tez – Execution Performance
• Optimal resource management
– Reuse YARN containers to launch new tasks.
– Reuse YARN containers to enable shared objects across tasks.
– TezSession to encapsulate all this for the user
[Figure: the Tez Application Master sends Start Task to a YARN Container, receives Task Done, and sends Start Task again. Inside the YARN Container, a TezTask Host runs TezTask1 and TezTask2 with Shared Objects available to both.]
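Container reuse and shared objects can be sketched as follows. This is a single-threaded toy model with hypothetical names (`Container`, `AppMaster`, `lookup_task`), not how Tez or YARN are implemented: a finished task returns its container to an idle pool, and a later task in that container finds objects left by an earlier one.

```python
class Container:
    """Stand-in for a YARN container that hosts tasks one after another."""

    def __init__(self, container_id):
        self.container_id = container_id
        self.shared_objects = {}  # survives across tasks run in this container
        self.tasks_run = []

    def run(self, task_name, task_fn):
        self.tasks_run.append(task_name)
        return task_fn(self.shared_objects)


class AppMaster:
    """Stand-in for the application master: reuses idle containers before launching new ones."""

    def __init__(self):
        self.idle = []
        self.launched = 0

    def _launch(self):
        self.launched += 1
        return Container(self.launched)

    def run_task(self, task_name, task_fn):
        container = self.idle.pop() if self.idle else self._launch()
        try:
            return container.run(task_name, task_fn)
        finally:
            # Task done: keep the container around for the next task.
            self.idle.append(container)


loads = {"count": 0}


def lookup_task(shared):
    # Load an expensive object once per container, then reuse it.
    if "dimension_table" not in shared:
        loads["count"] += 1
        shared["dimension_table"] = {"k1": "v1"}
    return shared["dimension_table"]["k1"]


am = AppMaster()
am.run_task("TezTask1", lookup_task)
am.run_task("TezTask2", lookup_task)
print(am.launched, loads["count"])
```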
- 14. Tez – Execution Performance
• Dynamic physical data flow decisions
– Decide the type of physical byte movement and storage on the fly.
– Store intermediate data on distributed store, local store or in-memory.
– Transfer bytes via blocking files or streaming and the spectrum in
between.
[Figure: at runtime, one Producer passes data to its Consumer through a Local File, while a small-size Producer passes data to its Consumer In-Memory.]
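A sketch of deciding the storage medium at runtime from the size of the produced data. The threshold value and the function names are assumptions; a real framework would also weigh the distributed-store and streaming options the slide mentions.

```python
import os
import tempfile

IN_MEMORY_LIMIT_BYTES = 1024  # hypothetical threshold


def store_intermediate(data, limit=IN_MEMORY_LIMIT_BYTES):
    """Keep small outputs in memory; spill larger ones to a local file."""
    if len(data) <= limit:
        return ("in-memory", data)
    fd, path = tempfile.mkstemp(prefix="intermediate-")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return ("local-file", path)


def read_intermediate(stored):
    """Consumer side: read the bytes back regardless of where they were kept."""
    kind, ref = stored
    if kind == "in-memory":
        return ref
    with open(ref, "rb") as handle:
        return handle.read()


small = store_intermediate(b"x" * 10)
large = store_intermediate(b"y" * 5000)
print(small[0], large[0])
os.remove(large[1])  # clean up the spilled file
```

The consumer calls `read_intermediate` without knowing which medium was chosen, which is what lets the decision be deferred until the data size is known.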
- 15. Tez – Automatic Reduce Parallelism
• Event Model: map tasks send data statistics events to the Reduce Vertex Manager.
• Vertex Manager: pluggable user logic that understands the data statistics and can formulate the correct parallelism. Advises the vertex controller on parallelism.
[Figure: the App Master hosts the Vertex Manager and the Vertex State Machine, shown with a Map Vertex and a Reduce Vertex; a Cancel Task action is labeled.]
- 16. Tez – Automatic Reduce Parallelism
[Figure, next build of slide 15: adds Data Size Statistics sent from the Map Vertex to the Vertex Manager.]
- 17. Tez – Automatic Reduce Parallelism
[Figure, final build of slide 15: adds Set Parallelism and Re-Route actions alongside Data Size Statistics and Cancel Task.]
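The event flow on slides 15 to 17 can be sketched as a small event-driven class. All names, the 50% reporting threshold, and the contiguous re-routing scheme are assumptions for illustration, not the Tez vertex manager API. The manager collects size events from map tasks, extrapolates the total, and then advises the reduce vertex to shrink and re-route its partitions.

```python
class ReduceVertex:
    """Stand-in for the reduce vertex controlled by the vertex state machine."""

    def __init__(self, parallelism):
        self.parallelism = parallelism
        self.routing = {p: p for p in range(parallelism)}  # partition -> reducer task
        self.cancelled_tasks = []

    def set_parallelism(self, new_parallelism, routing):
        # Tasks beyond the new parallelism are no longer needed.
        self.cancelled_tasks = list(range(new_parallelism, self.parallelism))
        self.parallelism = new_parallelism
        self.routing = routing


class SizeBasedVertexManager:
    """Hypothetical pluggable logic that turns size statistics into a parallelism decision."""

    def __init__(self, vertex, num_map_tasks, target_bytes_per_reducer, min_fraction=0.5):
        self.vertex = vertex
        self.num_map_tasks = num_map_tasks
        self.target = target_bytes_per_reducer
        self.min_fraction = min_fraction
        self.reported = {}
        self.decided = False

    def on_data_size_event(self, map_task_id, output_bytes):
        self.reported[map_task_id] = output_bytes
        if self.decided or len(self.reported) < self.min_fraction * self.num_map_tasks:
            return
        # Extrapolate total output from the map tasks seen so far.
        estimated_total = sum(self.reported.values()) * self.num_map_tasks // len(self.reported)
        wanted = max(1, (estimated_total + self.target - 1) // self.target)
        if wanted < self.vertex.parallelism:
            partitions = self.vertex.parallelism
            # Re-route: contiguous ranges of partitions go to the same reducer.
            routing = {p: p * wanted // partitions for p in range(partitions)}
            self.vertex.set_parallelism(wanted, routing)
        self.decided = True


MB = 1024 ** 2
vertex = ReduceVertex(parallelism=100)
manager = SizeBasedVertexManager(vertex, num_map_tasks=50, target_bytes_per_reducer=1000 * MB)
for task_id in range(25):  # half of the map tasks report 200 MB each
    manager.on_data_size_event(task_id, 200 * MB)
print(vertex.parallelism, len(vertex.cancelled_tasks))
```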
- 18. Tez – Now and Next
- 19. Tez – Bridge the Data Spectrum
• Typical pattern in a TPC-DS query
• Broadcast join for small data sets
• Based on data size, the query optimizer can run either plan as a single Tez job
[Figure: Fact Table and Dimension Table 1 feed a Broadcast Join producing Result Table 1; Result Table 1 and Dimension Table 2 feed a Broadcast Join producing Result Table 2; Result Table 2 and Dimension Table 3 feed a Shuffle Join producing Result Table 3.]
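A sketch of a size-based choice between the two join strategies. The function names, the byte threshold and the row format are hypothetical; the point is only that both strategies produce the same rows, so the choice can be made from data size alone.

```python
def choose_join(small_side_bytes, broadcast_limit_bytes):
    """Pick a strategy from the size of the smaller table."""
    return "broadcast" if small_side_bytes <= broadcast_limit_bytes else "shuffle"


def hash_join(fact_rows, dim_rows, key):
    """Join two row lists by building a lookup table on the dimension side."""
    lookup = {row[key]: row for row in dim_rows}
    return [{**fact, **lookup[fact[key]]} for fact in fact_rows if fact[key] in lookup]


def broadcast_join(fact_rows, dim_rows, key):
    """Every task receives the whole small table and probes it locally."""
    return hash_join(fact_rows, dim_rows, key)


def shuffle_join(fact_rows, dim_rows, key, num_partitions=4):
    """Both sides are partitioned by join key; matching partitions are joined."""
    joined = []
    for p in range(num_partitions):
        fact_part = [r for r in fact_rows if hash(r[key]) % num_partitions == p]
        dim_part = [r for r in dim_rows if hash(r[key]) % num_partitions == p]
        joined.extend(hash_join(fact_part, dim_part, key))
    return joined


def run_join(fact_rows, dim_rows, key, dim_bytes, broadcast_limit_bytes):
    strategy = choose_join(dim_bytes, broadcast_limit_bytes)
    join = broadcast_join if strategy == "broadcast" else shuffle_join
    return strategy, join(fact_rows, dim_rows, key)


fact = [{"item_id": 1, "qty": 2}, {"item_id": 2, "qty": 5}, {"item_id": 1, "qty": 1}]
dim = [{"item_id": 1, "name": "a"}, {"item_id": 2, "name": "b"}]
strategy, rows = run_join(fact, dim, "item_id", dim_bytes=64, broadcast_limit_bytes=1024)
print(strategy, len(rows))
```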
- 20. Tez – Current status
• Apache Incubator Project
– Rapid development. Over 800 JIRAs opened, over 600 resolved.
– Growing community of contributors and users
• Focus on stability
– Testing and quality are highest priority.
– Code ready and deployed on multi-node environments.
• Support for a wide variety of DAG topologies
– Already functionally equivalent to Map Reduce. Existing Map Reduce
jobs can be executed on Tez with few or no changes.
– Hive retargeted to use Tez for execution of queries (HIVE-4660).
– Pig to use Tez for execution of scripts (PIG-3446).
- 21. Tez – Roadmap
• Richer DAG support
– Support for co-scheduling
– Efficient iterations
• Performance optimizations
– More efficiencies in transfer of data
– Improve session performance
• Usability
– Stability and testability
– Recovery and history
– Tools for performance analysis and debugging
- 22. Tez – Community
• Early adopters and code contributors welcome
– Adopters to drive more scenarios. Contributors to make them happen.
– Hive and Pig communities are on-board and making great progress - HIVE-4660
and PIG-3446
• Tez meetup for developers and users
– http://www.meetup.com/Apache-Tez-User-Group
• Technical blog series
– http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-dataprocessing/ (will soon be available on the Apache Wiki)
• Useful links
– Work tracking: https://issues.apache.org/jira/browse/TEZ
– Code: https://github.com/apache/incubator-tez
– Developer list: dev@tez.incubator.apache.org
– User list: user@tez.incubator.apache.org
– Issues list: issues@tez.incubator.apache.org
- 23. Tez – Takeaways
• Distributed execution framework that works on computations
represented as dataflow graphs
• Naturally maps to execution plans produced by query
optimizers
• Customizable execution architecture designed to enable
dynamic performance optimizations at runtime
• Works out of the box with the platform figuring out the hard
stuff
• Spans the spectrum from interactive latency to batch
• Open source Apache project – your use-cases and code are
welcome
• It works and is already being used by Hive and Pig
- 24. Tez
Thanks for your time and attention!
Deep dive on Tez video at
http://www.infoq.com/presentations/apache-tez
Questions?
@bikassaha