1. Taming latency: case studies in MapReduce data analytics
Simon Tao
EMC Labs China
Office of the CTO
© Copyright 2013 EMC Corporation. All rights reserved.
2. Roadmap Information Disclaimer
EMC makes no representation and undertakes no obligations with
regard to product planning information, anticipated product
characteristics, performance specifications, or anticipated release
dates (collectively, “Roadmap Information”).
Roadmap Information is provided by EMC as an accommodation to the
recipient solely for purposes of discussion and without intending to be
bound thereby.
Roadmap Information is EMC Restricted Confidential and is provided
under the terms, conditions and restrictions defined in the EMC Non-Disclosure Agreement in place with your organization.
3. Agenda
Motivation
Combating latency in general
Latency-reducing approaches for MR
Case studies of MR systems with a focus on low latency
Summary
4. Introduction
What this presentation is about
– Approaches that improve performance by enhancing the
existing MapReduce platform
– Focus on per-job latency in wall-clock time, among other
performance metrics
– With case studies from both academia and industry
What it is not about
– Performance improvement by manipulating MapReduce
framework tuning knobs
5. Low latency: Motivations
Faster decision-making
– Fraud detection, system monitoring, trending topic
identification
Interactivity
– Targeted advertising, personalized news feeds, online
recommendations
“Pay-as-you-go” services
– Economic advantage in “pay-as-you-go” billing model
6. Sources of latency
Computer science is a thousand layers of abstraction
Latency is everywhere
– Hardware Infrastructure: Processors,
Memory, Storage I/O, Network I/O
– Software Infrastructure: OS Kernel,
JVM, Server software
– Architectural design and system
implementation
– Communication protocol: DNS, TCP
“I see latency… they’re everywhere.”
— one colleague of Shimon Schocken
7. Combating latency by extrapolation
Approach to minimize latency for systems in general
– Address every latency bottleneck in the system
– Minimize its latency contribution
Apply latency minimizing approach to MapReduce
– What are the layers in MapReduce data processing stack?
– How can the latency contributions from them be mitigated?
8. MapReduce Recap: logical view
Map(k1,v1) → list(k2,v2)
– User defined Map function that processes a key/value pair
to generate a set of intermediate key/value pairs
Reduce(k2, list (v2)) → list(v2)
– Reduce function merges all values associated with the
same intermediate key
Simple, yet expressive
– Real-world applications: Word Count, Distributed Grep,
Count of URL Access Frequency, Inverted Index, etc
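The two signatures above can be sketched framework-free in Python (a toy word count; `map_fn`, `reduce_fn` and `run_job` are illustrative names, not a real MapReduce API):

```python
from collections import defaultdict

def map_fn(k1, v1):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in the line
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    # Reduce(k2, list(v2)) -> list(v2): merge all counts for the same word
    return [sum(values)]

def run_job(records, map_fn, reduce_fn):
    # "Shuffle": group intermediate values by key, then reduce each group
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, vs) for k2, vs in groups.items()}

counts = run_job([(0, "to be or not to be")], map_fn, reduce_fn)
# counts["to"] == [2], counts["be"] == [2], counts["or"] == [1]
```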
10. MapReduce Recap: system view
Embarrassingly parallel
– Partitioned parallelism in both Map and Reduce phases
Distributed and scalable
– Computations distributed across large cluster of
commodity machines
– Master schedules tasks to workers
Fault tolerant
– Reschedule task in case of failure
– Materialize task output to disk
Performance Optimized
– Combiner function
– Locality-aware scheduling
– Redundant execution
11. MR latency mitigation: a systematic way
Latency improvement opportunities across the
whole MR processing stack
– Architectural design
▪ HOP
– Programming model
▪ S4
– Resource scheduling
▪ Delay scheduling
– Dataflow: processing and
transmission
▪ Spark, Tenzing, Bolt MR, etc
– Data persistence
▪ Stinger
12. Trade-offs
Every good quality is noxious if unmixed
— Ralph Waldo Emerson
Latency, sometimes at odds with throughput
– Speculative execution
▪ Backup executions of “straggler” tasks decrease per-job
latency at the expense of cluster throughput
Trade-off between latency and fault tolerance
– Naïve pipelining
▪ Direct output transmission from Mapper to Reducer
alleviates latency bottleneck, but hurts fault tolerance
Need to preserve other critical system
characteristics
– Throughput, fault tolerance, scalability…
13. Case Studies
Approach to mitigate latency,
from HOP, Tenzing, S4, Spark,
Stinger and LUMOS
14. HOP: Hadoop Online Prototype
A pipelining version of Hadoop from UC Berkeley
– “MapReduce Online”, NSDI'10 paper
– Open sourced under the Apache License 2.0
In HOP’s modified MapReduce architecture,
intermediate data is pipelined between operators
HOP preserves the programming interfaces and fault
tolerance models of previous MapReduce
frameworks
15. Stock Hadoop: a blocking architecture
Intermediate data produced
by each Mapper is pulled by
Reducer in its entirety
– Simplified fault tolerance
▪ Data outputs are materialized before consumption
– Underutilized resources
▪ Completely decoupled execution between Mapper and Reducer
16. HOP: from blocking to pipelining
HOP offers a modified MapReduce architecture that
allows data to be pipelined between operators
– Improved system utilization and reduced completion times
with increased parallelism
– Extends programming model beyond batch processing
▪ Online aggregation
— Allows users to see “early returns” from a job as it is being computed
▪ Continuous queries
— Enable applications such as event monitoring and stream processing
17. Latency decreasing in HOP
Challenge: Latency backfire
– Increased job response time resulting from eager pipelining
▪ Eager pipelining prevents use of “combiner” optimization
▪ Reducer may be overloaded by sorting work shifted from Mappers
Solution: Adaptive load moving
1. Buffer the output in the Mapper, up to a threshold size
2. On filled buffer, apply combiner function, sort and spill
output to disk
3. Spill files are pipelined to reduce tasks adaptively
▪ Accumulated spill files may be further merged
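The three steps above can be sketched as follows (a minimal, single-process illustration; `BufferingMapper` and `sum_combiner` are hypothetical names, not HOP's actual classes):

```python
from collections import defaultdict

def sum_combiner(pairs):
    # Combiner: pre-aggregate counts before the data leaves the Mapper
    totals = defaultdict(int)
    for k, v in pairs:
        totals[k] += v
    return list(totals.items())

class BufferingMapper:
    # HOP-style adaptive load moving: buffer map output, and on a full
    # buffer apply the combiner, sort, and spill a run for pipelining.
    def __init__(self, combiner, threshold):
        self.combiner = combiner
        self.threshold = threshold   # buffer size before spilling
        self.buffer = []
        self.spills = []             # sorted runs, pipelined to reduce tasks

    def emit(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.threshold:
            self.spill()

    def spill(self):
        # On a filled buffer: combine, sort, then "spill" the run
        self.spills.append(sorted(self.combiner(self.buffer)))
        self.buffer = []

m = BufferingMapper(sum_combiner, threshold=3)
for word in ["a", "b", "a", "c"]:
    m.emit(word, 1)
# the first three records were combined, sorted and spilled; "c" is still buffered
```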
18. Preserving fault tolerance in HOP
Challenges:
– Reducer failure
▪ Makes fault tolerance difficult in a purely pipelined architecture
– Mapper failure
▪ Limits the Reducer’s ability to merge spill files
Solution:
– Materialization
▪ The intermediate data are materialized, retaining fault tolerance in Hadoop
– Checkpointing
▪ The offset reached in the Mapper’s input split is bookkept
▪ Only Mapper output produced before the offset is merged by Reducer
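The checkpointing rule can be illustrated with a toy sketch (a simplified model; `MapperCheckpoint` is an invented name, not HOP code):

```python
class MapperCheckpoint:
    # Sketch of HOP's checkpoint rule: the mapper records how far into
    # its input split it has progressed; a reducer merges only output
    # produced at or before the committed offset, so a failed mapper can
    # be restarted from that offset without corrupting the merge.
    def __init__(self):
        self.committed_offset = 0
        self.output = []             # (input_offset, key, value)

    def emit(self, offset, key, value):
        self.output.append((offset, key, value))

    def commit(self, offset):
        self.committed_offset = offset   # bookkept offset in the split

    def safe_to_merge(self):
        # Only output produced before the checkpoint may be merged
        return [(k, v) for off, k, v in self.output
                if off <= self.committed_offset]

m = MapperCheckpoint()
m.emit(1, "a", 1)
m.emit(2, "b", 1)
m.commit(1)
m.emit(3, "c", 1)
# only ("a", 1) is safe for the reducer to merge so far
```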
19. Performance evaluation from HOP
Initial performance results show that pipelining can
reduce job completion times by up to 25% in some scenarios
– Word-count on 10GB input data, 20 map tasks and 20 reduce tasks
– CDF of Map and Reduce task completion times for Blocking and
Pipelining, respectively
– Pipelining reduces total job runtimes by 19.7%
20. Tenzing: Hive the Google way
SQL query engine on top of MapReduce for ad hoc
data analysis from Google
– “Tenzing A SQL Implementation On The MapReduce
Framework”, VLDB'11 paper
– Featured by:
▪ Strong SQL support
▪ Low latency, comparable with parallel databases
▪ Highly scalable and reliable, atop MapReduce
▪ Support for heterogeneous backend storage
21. Low latency approaches in Tenzing
MR execution enhancement
– Process pool
▪ Master pool
▪ Worker pool
– Streaming and In-memory Chaining
– Sort Avoidance for certain hash-based operators
▪ Block Shuffle
– Local Execution
SQL Query enhancement
– Metadata-aware query plan optimization
– Projection and Filtering, Aggregation, Joins, etc
Experimental Query Engine optimization
– LLVM query engine
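Sort avoidance can be illustrated by contrasting hash-based and sort-based GROUP BY aggregation (a generic sketch, not Tenzing's implementation):

```python
from collections import defaultdict

def hash_aggregate(rows, key_fn, value_fn):
    # Hash-based GROUP BY: a single pass, no sort of the input needed
    totals = defaultdict(int)
    for row in rows:
        totals[key_fn(row)] += value_fn(row)
    return dict(totals)

def sort_aggregate(rows, key_fn, value_fn):
    # Sort-based GROUP BY, the default MapReduce path: pays for a full sort
    out = {}
    for row in sorted(rows, key=key_fn):
        out[key_fn(row)] = out.get(key_fn(row), 0) + value_fn(row)
    return out

rows = [("us", 3), ("cn", 5), ("us", 2)]
# Both paths produce the same aggregates; the hash path skips the sort
assert hash_aggregate(rows, lambda r: r[0], lambda r: r[1]) == \
       sort_aggregate(rows, lambda r: r[0], lambda r: r[1])
```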
22. Tenzing performance
“Using this approach, we were able to bring down
the latency of the execution of a Tenzing query itself
to around 7 seconds.”
“There are other bottlenecks in the system however,
such as computation of map splits, updating the
metadata service, …, etc. which means the typical
latency varies between 10 and 20 seconds currently.”
23. S4: Simple Scalable Streaming System
A research project for stream processing at Yahoo!
– Open sourced in Sep 2009; entered Apache Incubation in Oct 2011
– A general-purpose stream processing engine
▪ With a simple programming interface
▪ Distributed and scalable
▪ Partially fault-tolerant
– Designed for use cases different from the batch processing model
▪ Infinite data streams
▪ Streams of events that flow into the system at varying data rates
▪ Real-time processing with low latency expected
24. S4 overview
Data abstraction
– Data are streams of key-value events,
dispatched and processed by
Processing Elements
Design inspired by
– Actors model
– MapReduce model
▪ key-value based data dispatching
[Diagram: TopK stream-processing example]
25. Low latency design in S4
Simple programming paradigm that operates on
data streams in real-time
Minimize latency by using local memory to avoid
disk I/O bottlenecks
– Lossy failover: Partially fault tolerant
Pluggable architecture to select network protocol for
data communication
– The communication layer allows data to be sent without a
delivery guarantee, in trade for performance
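The key-value dispatching model can be sketched as follows (illustrative classes only; the real S4 API differs):

```python
class WordCountPE:
    # An S4-style Processing Element: one instance per key, state kept
    # in local memory (lost on failover, i.e. lossy fault tolerance)
    def __init__(self, key):
        self.key = key
        self.count = 0

    def process_event(self, event):
        self.count += event["count"]

class Dispatcher:
    # Routes each keyed event to the PE instance owning that key,
    # mirroring MapReduce's key-value based data dispatching
    def __init__(self, pe_cls):
        self.pe_cls = pe_cls
        self.pes = {}

    def dispatch(self, key, event):
        pe = self.pes.setdefault(key, self.pe_cls(key))
        pe.process_event(event)

d = Dispatcher(WordCountPE)
for word in ["spark", "hadoop", "spark"]:
    d.dispatch(word, {"count": 1})
# d.pes["spark"].count == 2, d.pes["hadoop"].count == 1
```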
26. Spark
Research project at UC Berkeley on big data analytics
– “Spark: Cluster Computing with Working Sets”, HotCloud'10
A parallel cluster computing framework
– Supports applications with working sets
▪ Iterative algorithm
▪ Interactive data analysis
– Retaining the scalability and fault tolerance of MapReduce
Allows interactive analysis of large datasets on clusters
efficiently, with a general-purpose programming language
27. Latency decreasing in Spark
In Spark, data can be cached in memory
explicitly
– The core data abstraction in Spark is the RDD: a read-only, partitioned collection of objects
Keeping working set of data in memory can
improve performance by an order of magnitude
– Outperforms Hadoop by 20× for iterative jobs
– Can be used interactively to search a 1 TB dataset
with latencies of 5–7 seconds
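The caching idea can be illustrated without Spark itself (the `time.sleep` stands in for HDFS read cost; the numbers are arbitrary):

```python
import time

def load_from_disk():
    time.sleep(0.01)           # stand-in for HDFS read + deserialization
    return list(range(1000))

def iterate_uncached(iters):
    # Hadoop-style: each iteration re-reads the input from stable storage
    total = 0
    for _ in range(iters):
        total += sum(load_from_disk())
    return total

def iterate_cached(iters):
    # Spark-style: cache the working set in memory once (conceptually
    # rdd.cache()) and reuse it in every iteration
    data = load_from_disk()
    return sum(sum(data) for _ in range(iters))

# Same result; the cached version pays the load cost only once
assert iterate_uncached(5) == iterate_cached(5)
```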
28. Fault tolerance in Spark
Lineage
– Lost partitions are recovered by
‘replaying’ the series of
transformations used to build the
RDD
Checkpointing
– To avoid time-consuming recovery,
checkpointing to stable storage is
helpful for applications with
▪ Long lineage graph
▪ Lineage composed of
wide dependencies
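Lineage-based recovery can be sketched in a few lines (a toy `RDD` class, not Spark's):

```python
class RDD:
    # Toy read-only, partitioned collection that records its lineage:
    # the transformations applied since the source data
    def __init__(self, partitions, lineage=()):
        self.partitions = partitions     # list of lists (None = lost)
        self.lineage = lineage

    def map(self, fn):
        return RDD([[fn(x) for x in p] for p in self.partitions],
                   self.lineage + (("map", fn),))

    def recover(self, source_partition, index):
        # 'Replay' the recorded transformations over the source partition
        data = source_partition
        for _op, fn in self.lineage:
            data = [fn(x) for x in data]
        self.partitions[index] = data

source = [[1, 2], [3, 4]]
rdd = RDD([p[:] for p in source]).map(lambda x: x * 10)
rdd.partitions[1] = None                 # simulate a lost partition
rdd.recover(source[1], 1)                # rebuild it from lineage
# rdd.partitions == [[10, 20], [30, 40]]
```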
29. Stinger Initiative
Enhance Hive with more SQL and improved
performance to allow human-time use cases
– Announced in Feb 2013, led by Hortonworks
– Effort from community collaboration, with resources from
SAP, Microsoft, Facebook and Hortonworks
30. Making Apache Hive 100 Times Faster
Stinger’s improvements on Hive
– More SQL
▪ Analytics features, standard SQL alignment, etc
– Optimized query execution plans
▪ 45X performance increase for Hive in some early results
– Support of new columnar file format
▪ ORCFile: more efficient and higher performance
– New runtime framework, Apache Tez
31. Accelerating data processing by Tez
In traditional MapReduce, one SQL query
often results in multiple jobs, which
eventually impacts performance
– Latency introduced by launching multiple jobs
– Extra overhead in materializing intermediate
job outputs to the file system
Performance improvements from Tez
– With a generalized computing paradigm for
DAG execution, Tez can express any SQL query as a
single job
– Tez AM, running atop YARN, supports container
reuse
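The contrast between a chain of MR jobs and a single DAG job can be sketched abstractly (toy functions, not the Tez API):

```python
def run_as_mr_chain(data, stages, storage):
    # Traditional MapReduce: every intermediate job output is
    # materialized to the file system before the next job starts
    for i, stage in enumerate(stages):
        data = stage(data)
        storage[f"job{i}_output"] = list(data)
    return data

def run_as_dag(data, stages):
    # DAG runtime: one job, stage outputs flow straight to the next
    # stage with no intermediate materialization
    for stage in stages:
        data = stage(data)
    return data

stages = [lambda d: [x + 1 for x in d],   # stage 1: increment
          lambda d: [x * 2 for x in d]]   # stage 2: double
fs = {}
assert run_as_mr_chain([1, 2], stages, fs) == run_as_dag([1, 2], stages)
assert len(fs) == 2   # the MR chain wrote two intermediate outputs
```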
32. LUMOS Project
A real-time, interactive, self-service
data cloud platform for big data
analytics, from EMC Labs China
LUMOS – guide the data scientists to
the big value of big data
Goal: Develop key building blocks for
the big data cloud platform
33. Design principles
Real-time analytics
– Low latency MapReduce data processing
Interactive analytics
– SQL query interface and visualization
Deep analytics
– Advanced statistical analysis and data mining
– Predictive analytics
Self-service analytics
– Analytics as a service
34. Building Blocks in LUMOS
Data Process
– BoltMR: Flexible and High Performance MapReduce
execution engine
Data Access
– SQL2MR: Declarative query interface and optimizer for
MapReduce
Data Service
– DMaaS: Data mining analytics service and tools
35. Bolt MR
A flexible, low-latency and high
performance MapReduce
implementation
– Improve the overall performance
– Reduce latency
– Support for alternative workload
types
▪ Iterative
▪ Incremental
▪ Online Aggregation and Continuous Query
Flickr credit: http://www.flickr.com/photos/blahflowers/4656725185/
36. Bolt MR – latency enhancement
Batch mode MapReduce
with enhancement on
Hadoop:
– Enhanced task resource
allocation
– Master/Worker Pool
– Flexible data
processing/transmission
options
37. Bolt MR – Performance evaluation
Measurements of job execution time (s) on Container Reuse and the Worker Pool:
• Lower latency is observed in all the conducted micro-benchmarks
• For jobs with small input, a substantial improvement ratio is observed
[Charts: per-task TaskInitializingTime and TaskProcessingTime for the Normal, Container Reuse, Worker Pool, and Reuse + Pool configurations; Job3 execution time: Normal 242 s, Container Reuse 209 s, Worker Pool 63 s, Reuse + Pool 32 s]
38. SQL2MR
Problems
– Poor programmability and metadata management in MapReduce
▪ MapReduce application is hard to program
▪ Need to publish data in well-known schemas
– Poor performance of existing MapReduce query translation
systems (e.g., Hive, Pig)
▪ Inefficiency (latency on the order of minutes) of sub-optimal MR jobs due
to limited query optimization ability
▪ Poor SQL compatibility and limited language expressive power
Our solution
– An extensible and powerful SQL-like query language for
complex analytics
– Cost-based query execution plan optimization for MR
39. Query Optimization for MapReduce-Based Big Data Analytics
A novel cost-based optimization framework that
– Learns from the wide spectrum of DB query optimization (>40 years!)
– Exploits usage & design properties of MapReduce frameworks
Optimizer pipeline (from the architecture diagram):
– Query Parsing: the SQL Query Processor parses the input SQL query
– Plan Space Exploration: enumerate the alternative physical plans (i.e., MR jobs) for the input query
– Cost Estimation: estimate the execution costs of the physical plans and select the cheapest one
– Schema Info & Statistics Maintenance: store and derive the logical and physical properties of both input and intermediate data
Result: efficient MapReduce jobs running non-invasively on existing and future Hadoop stacks
40. Optimizations from other
research/engineering efforts
Delay Scheduling
– A scheduler that takes into account both fairness and data locality
Longest Approximate Time to End, LATE
– Speculatively execute task based on finish time estimation
– Launch speculative task on a fast node
Direct I/O
– Read data from local disk if applicable, avoiding inter-process communication costs
from HDFS
Low level optimizations
– OS level: Efficient data transfer with sendfile system call
– Instruction level: Increased HDFS read/write efficiency via CRC32 support from
SSE4.2 instruction extensions in Intel Nehalem processor
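The LATE heuristic above can be sketched as follows (hypothetical task records; the real scheduler also caps speculation and launches the backup copy on a fast node):

```python
def time_left(task, now):
    # Estimate time to finish from the observed progress rate
    elapsed = now - task["start"]
    rate = task["progress"] / elapsed        # progress per second
    return (1.0 - task["progress"]) / rate

def pick_speculative(tasks, now):
    # Speculatively re-execute the running task with the Longest
    # Approximate Time to End
    running = [t for t in tasks if t["progress"] < 1.0]
    return max(running, key=lambda t: time_left(t, now))

tasks = [
    {"id": "t1", "start": 0, "progress": 0.9},   # fast task
    {"id": "t2", "start": 0, "progress": 0.2},   # straggler
]
straggler = pick_speculative(tasks, now=10)
# straggler["id"] == "t2": at 0.02 progress/s it needs ~40 more seconds,
# versus ~1.1 s for t1
```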
41. Quick summary
• Latency improvement – optimization across all layers of the MapReduce system
– Query engine
– SQL query optimization (Tenzing, Stinger, SQL2MR)
– Code generation (Tenzing)
– Architectural design
– Pipelining (HOP)
– Programming model
– Streaming (S4)
– Resource scheduling
– Scheduling algorithm optimization (Delay Scheduling, LATE)
– Data processing and transmission
– In-Memory (S4, Spark), Process Pool (Tenzing, Bolt MR), Sort Avoidance
(Tenzing), more efficient system call, etc
– Data persistence
– Columnar storage (Stinger), Direct I/O
42. 3 Ways to Cope with Latency Lags
Bandwidth
“3 Ways to Cope with Latency Lags Bandwidth”, from David
Patterson
– Caching
▪ Processor caches, file cache, disk cache
– Replication
▪ Multiple requests to multiple copies and
just use the quickest reply
– Prediction
▪ Branches + Prefetching
Corresponding latency decreasing approach in MapReduce
– In-memory cache in Spark
– Speculative execution in MapReduce
– Pipelining in HOP
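Patterson's "replication" strategy — issue redundant requests and take the quickest reply, which is what speculative execution does at task granularity — can be sketched with threads (simulated replicas; the delays are arbitrary):

```python
import concurrent.futures
import time

def query_replica(replica_id, delay):
    # Stand-in for a request to one replica with a given response time
    time.sleep(delay)
    return replica_id

def fastest_reply(replicas):
    # Send the same request to every replica, use the quickest reply
    with concurrent.futures.ThreadPoolExecutor(len(replicas)) as pool:
        futures = [pool.submit(query_replica, rid, d) for rid, d in replicas]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()

winner = fastest_reply([("slow", 0.2), ("fast", 0.01)])
# winner == "fast": the slow replica's reply is simply ignored
```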
43. Are We There Yet?
Identifying performance bottlenecks is
an iterative process
– Performance impact mitigation on one
bottleneck can be followed by the
discovery of the next one
– “These 3 already fully deployed, so must
find next set of tricks to cope; hard!”
- David Patterson