SlideShare una empresa de Scribd logo
1 de 5
Descargar para leer sin conexión
1 Mochi:        Visual Log-Analysis Based Tools for Debugging Hadoop
                      Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan
                                     Electrical & Computer Engineering Department
                                 Carnegie Mellon University, Pittsburgh, PA 15213-3890
                                {jiaqit,xinghaop,spertet,rgandhi,priyan}@andrew.cmu.edu

Abstract                                                                   had multiple categories) in Table 1. These posts fo-
                                                                           cused on MR-specific aspects of Hadoop program be-
 Mochi, a new visual, log-analysis based debugging tool                    havior. The primary response to these posts involved
correlates Hadoop’s behavior in space, time and vol-                       suggestions to use Java profilers, which do not capture
ume, and extracts a causal, unified control- and data-                      dynamic MR-specific behavior, such as relationships in
flow model of Hadoop across the nodes of a cluster.                         time (e.g., orders of execution), space (which tasks ran
Mochi’s analysis produces visualizations of Hadoop’s                       on which nodes), and the volumes of data in various pro-
behavior using which users can reason about and debug                      gram stages. This motivated us to extract and analyze
performance issues. We provide examples of Mochi’s                         time-, space- and volume-related Hadoop behavior.
value in revealing a Hadoop job’s structure, in optimiz-                      The MR framework affects program performance at
ing real-world workloads, and in identifying anomalous                     the macro-scale through task scheduling and data distri-
Hadoop behavior, on the Yahoo! M45 Hadoop cluster.                         bution. This macro behavior is hard to infer from low-
1    Introduction                                                          level language views because of the glut of detail, and
                                                                           because this behavior results from the framework out-
MapReduce (MR) [7] is a programming paradigm and                           side of user code. For effective debugging, tools must
framework introduced by Google for data-intensive                          expose MR-specific abstractions. This motivated us to
cloud computing on commodity clusters. Hadoop [9],                         capture Hadoop distributed data- and execution-related
an open-source Java implementation of MapReduce, is                        behavior that impacts MR performance. Finally, given
used by Yahoo! and Facebook, and is available on Ama-                      the scale (number of nodes, tasks, interactions, dura-
zon’s pay-as-you-use EC2 cloud computing infrastruc-                       tions) of Hadoop programs, there is also a need to visual-
ture. Debugging the performance of Hadoop programs                         ize a program’s distributed execution to support debug-
is difficult because of their scale and distributed na-                     ging and to make it easier for users to detect deviations
ture. Hadoop can be debugged by examining the local                        from expected program behavior/performance. To the
(node-specific) logs of its execution. These logs can                       best of our knowledge, Mochi is the first debugging tool
be large, and must be manually stitched across nodes                       for Hadoop to extract (from Hadoop’s own logs) both
to debug system-wide problems. Current Java debug-                         control- and data-flow views, and to then analyze and vi-
ging/profiling tools (jstack, hprof) target program-                        sualize these views in a distributed, causal manner. We
ming abstractions to help debug local code-level errors                    provide concrete examples where Mochi has assisted us
rather than distributed problems across multiple nodes                     and other Hadoop users in understanding Hadoop’s be-
[16]. In addition, these tools do not provide insights at                  havior and unearthing problems.
the higher level of abstraction (e.g. Maps and Reduces)
that is more natural to MR users and programmers. Sim-                     2    Problem Statement
ilarly, path-tracing tools [8] for distributed systems pro-
                                                                           Our previously developed log-analysis tool, SALSA
duce fine-grained views at the language rather than at the
                                                                           [18], extracted various statistics (e.g., durations of Map
MR abstraction.
                                                                           and Reduce tasks) of system behavior from Hadoop’s
   Our survey of the Hadoop users’ mailing-list indi-
                                                                           logs on individual nodes. Mochi aims to go beyond
cates that the most frequent performance-related ques-
                                                                           SALSA, to (i) correlate Hadoop’s behavior in space,
tions are indeed at the level of MR abstractions. We
                                                                           time and volume, and (ii) extract causal, end-to-end, dis-
examined the 3400 posts on this mailing list over a 6-
                                                                           tributed Hadoop behavior that factors in both computa-
month period (10/2008 to 4/2009), and classified the
                                                                           tion and data across the nodes of the cluster. From our
30-odd explicit performance-related posts2 (some posts
                                                                           interactions with real Hadoop users (of the Yahoo! M45
    1 This research was partially funded by the Defence Science &          [11] cluster), a third need has emerged: to provide help-
Technology Agency, Singapore, via the DSTA Overseas Scholarship,           ful visualizations of Hadoop’s behavior so that users can
and sponsored in part by the National Science Foundation, via CA-          reason about and debug performance issues themselves.
REER grant CCR-0238381 and grant CNS-0326453.
    2 As expected of mailing-lists, most of the 3400 posts were from       Goals. Mochi’s goals are:
users learning about and configuring Hadoop (note that misconfigu-
rations can also lead to performance problems). We filtered out, and        focused on, the 30-odd posts that explicitly raised a performance issue.


                                                                       1
Category           Question                                                                                                                   Fraction
    Configuration       How many Maps/Reduces are efficient? Did I set a wrong number of Reduces?                                                   50%
    Data behavior      My Maps have lots of output, are they beating up nodes in the shuffle?                                                      30%
    Runtime behavior   Must all mappers complete before reducers can run? What is the performance impact of setting X?                            50%
                       What are the execution times of program parts?
                                       Table 1: Common queries on users’ mailing list

To expose MapReduce-specific behavior that results
                                                                                                   Detailed Swimlanes: Sort Workload (4 nodes)
from the MR framework’s automatic execution, that




                                                                                             200
affects program performance but is neither visible
nor exposed to user Map/Reduce code, e.g. when




                                                                                             150
Maps/Reduces are executed and on which nodes,
from/to where data inputs/outputs flow and from which




                                                                                  Per-task

                                                                                             100
Maps/Reduces. Existing Java profilers do not capture
such information.




                                                                                             50
                                                                                                                              MapTask
To expose aggregate and dynamic behavior that can                                                                             ReduceCopyReceive
                                                                                                                              ReduceMergeCopy
provide different insights. For instance, Hadoop sys-                                                                         ReduceTask




                                                                                             0
tem views in time can be instantaneous or aggregated                                                 0      200     400       600     800

across an entire job; views in space can be of individual                                                            Time/s

Maps/Reduces or aggregated at nodes.
                                                                        Figure 2: Swimlanes: detailed states: Sort workload
To be transparent so that Mochi does not require any
modifications to Hadoop, or to the way that Hadoop
users write/compile/load their programs today. This also              Mochi then correlates these views across nodes, and
makes Mochi amenable to deployment in production                      also between HDFS and the execution layer, to con-
Hadoop clusters, as is our objective.                                 struct a unique end-to-end representation that we call a
                                                                      Job-Centric Data-flow (JCDF), which is a distributed,
 Non-goals. Our focus is on exposing MR-specific as-
                                                                      causal, conjoined control- and data-flow.
pects of programs rather than behavior within each Map                   Mochi parses Hadoop’s logs3 based on SALSA’s [18]
or Reduce. Thus, the execution specifics or correctness                state-machine abstraction of Hadoop’s execution. In its
of code within a Map/Reduce is outside our scope. Also,               log analysis, Mochi extracts (i) a time-stamped, cross-
Mochi does not discover the root-cause of performance                 node, control-flow model by seeking string-tokens that
problems, but aids in the process through useful visual-              identify TaskTracker-log messages signaling the starts
izations and analysis that Hadoop users can exploit.                  and ends of activities (e.g., Map, Reduce), and (ii) a
3     Mochi’s Approach                                                time-stamped, cross-node, data-flow model by seeking
                                                                      string-tokens that identify DataNode-log messages sig-
MapReduce programs, or jobs, consist of Map tasks fol-                naling the movement/access of data blocks, and by cor-
lowed by Reduce tasks; multiple identical but distinct in-            relating these accesses with Maps and Reduces running
stances of tasks operate on distinct data segments in par-            at the same time. Mochi assumes that clocks are syn-
allel across nodes in a cluster. The framework has a sin-             chronized across nodes using NTP, as is common in pro-
gle master node (running the NameNode and JobTracker                  duction Hadoop clusters.
daemons) that schedules Maps and Reduces on multiple                     Mochi then correlates the execution of the TaskTrack-
slave nodes. The framework also manages the inputs                    ers and DataNodes in time (e.g. co-occurring Maps and
and outputs of Maps, Reduces, and Shuffles (moving                     block reads in HDFS) to identify when data was read
of Map outputs to Reduces). Hadoop provides a dis-                    from or written to HDFS. This completes the causal
tributed filesystem (HDFS) that implements the Google                  path of the data being read from HDFS, processed in
Filesystem [10]. Each slave node runs a TaskTracker                   the Hadoop framework, and written to HDFS, creating
(execution) and a DataNode (HDFS) daemon. Hadoop                      a JCDF, which is a directed graph with vertices repre-
programs read and write data from HDFS. Each Hadoop                   senting processing stages and data items, and edges an-
node generates logs that record the local execution of                notated with durations and volumes (Figure 1). Finally
tasks and HDFS data accesses.                                         we extract all Realized Execution Paths (REPs) from the
3.1    Mochi’s Log Analysis                                           JCDF graph–unique paths from a parent node to a leaf
                                                                      node– using a depth-first search. Each REP is a distinct
Mochi constructs cluster-wide views of the execution
                                                                      end-to-end, causal flow in the system.
of MR programs from Hadoop-generated system logs.
Mochi builds on our log-analysis capabilities to ex-                     3 Mochi uses SALSA to parse TaskTracker and DataNode logs. We

tract local (node-centric) Hadoop execution views [18].               can extend this to parse NameNode and JobTracker logs as well.


                                                                  2
Reduce Input Size/
                 Read Size/                      Mapout Size/                 Shuffle Size/               Wait Time                                       Reduce Input Size/                           Write Size/
                 Read Time                        Map Time                    Shuffle Time               for Shuffles                  ReduceCopy-          Reduce Time                                Write Time
 Data                                 Map                            Data                      Data                                                                                   Reduce                         Data
                                                                                                                                         Receive



                   Figure 1: Single instance of a Realized Execution Path (REP) showing vertices and edge annotations

                                     Map inputs, outputs                                                                                                        Sort (4 nodes)
                                                                                                                                  0.000000 MB      2433.000000 MB    4866.000000 MB       7299.000000 MB
                 Input
                  Map




                                                                                    1.1e+10
                                                                                    1.0e+10
                                                                                    9.0e+09
                                                                                    8.0e+09
                                                                                    7.0e+09
                 Output




                                                                                    6.0e+09
                  Map




                                                                                                                               16853 paths




                                                                                                                           5
                                                                                    5.0e+09

                             node4      node3     node1     node2

                                                                                                                               62116 paths




                                                                                                                           4
                                            Shuffles
        (Towards Reducers)




                                                                                                                Clusters
                                                                      node3         1.4e+09                                                                                                     MapReadBlockVol
            Dest Host




                                                                                    1.3e+09                                     5102 paths                                                      MapVol




                                                                                                                           3
                                                                      node4         1.2e+09
                                                                                    1.1e+09                                                                                                     ShuffleVol
                                                                      node2         1.0e+09                                                                                                     ReduceInputVol
                                                                      node1         9.0e+08                                                                                                     ReduceOutputVol
                                                                                                                                                                                                ReduceWriteBlockVol
                                                                                                                               12532 paths




                                                                                                                           2
                             node4      node3      node1     node2                                                                                                                              MapReadTime
                                     Src Host (From Mappers)                                                                                                                                    MapTime
                                                                                                                                                                                                ShuffleTime
                                 Reduce inputs, outputs                                                                                                                                         ReduceCopyReceiveTime
                                                                                                                               63936 paths                                                      ReduceTime




                                                                                                                           1
                 Reduce
                  Input




                                                                                                                                                                                                ReduceWriteTime
                                                                                    1.15e+10
                                                                                    1.10e+10
                                                                                    1.05e+10                                             0   100   200   300   400   500   600      700   800    900    1000 1100 1200
                                                                                    1.00e+10
                 Reduce




                                                                                                                                                                 Duration/seconds
                 Output




                                                                                    9.50e+09
                                                                                    9.00e+09

                             node1      node2     node4     node3                                                                    Figure 4: REP plot for Sort workload

 Figure 3: MIROS: Sort workload; (volumes in bytes)                                                         duces on each node, and between Maps and Reduces on
                                                                                                            nodes. These volumes are aggregated over the program’s
   Thus, Mochi automatically generates, and then cor-                                                       run and over nodes. MIROS is useful in highlighting
relates, the cross-node data- and control-flow models of                                                     skewed data flows that can result in bottlenecks.
Hadoop’s behavior, resulting in a unified, causal, cluster-                                                  REP: Volume-duration correlations. For each REP
wide execution+data-flow model.                                                                              flow, we show the time taken for a causal flow, and the
3.2     Mochi’s Visualization                                                                               volume of inputs and outputs, along that flow (Figure 4).
                                                                                                            Each REP is broken down into time spent and volume
Mochi’s distributed data- and control-flows capture MR                                                       processed in each state. We use the k-means clustering
programs in three dimensions: space (nodes), time (du-                                                      algorithm to group similar paths for scalable visualiza-
rations, times, sequences of execution), and volume (of                                                     tion. For each group, the top bar shows volumes, and the
data processed). We use Mochi’s analysis to drive visu-                                                     bottom bar durations. This visualization is useful in (i)
alizations that combine these dimensions at various ag-                                                     checking that states that process larger volumes should
gregation levels. In this section, we describe the form                                                     take longer, and (ii) in tracing problems back to any pre-
of these visualizations, without discussing actual exper-                                                   vious stage or data.
imental data or drawing any conclusions from the visu-
alizations (although the visualizations are based on real                                                   4              Examples of Mochi’s Value
experimental data). We describe the actual workloads                                                        We demonstrate the use of Mochi’s visualizations (using
and case studies in §4.                                                                                     mainly Swimlanes due to space constraints). All data
“Swimlanes”: Task progress in time and space. In                                                            is derived from log traces from the Yahoo! M45 [11]
such a visualization, the x-axis represents wall-clock                                                      production cluster. The examples in § 4.1, § 4.2 involve
time, and each horizontal line corresponds to an exe-                                                       4-slave and 49-slave clusters, and the example in § 4.3
cution state (e.g., Map, Reduce) running in the marked                                                      is from a 24-slave cluster.
time interval. Figure 2 shows a sample detailed view
with all states across all nodes. Figure 6 shows a sample                                                   4.1                Understanding Hadoop Job Structure
summarized view (Maps and Reduces only) collapsed                                                           Figure 6 shows the Swimlanes plots from the Sort
across nodes, while Figure 7 shows summarized views                                                         and RandomWriter benchmark workloads (part of the
with tasks grouped by nodes. Figure 5 shows Swimlanes                                                       Hadoop distribution), respectively.     RandomWriter
for a 49-slave node cluster. Swimlanes are useful in cap-                                                   writes random key/value pairs to HDFS and has only
turing dynamic Hadoop execution, showing where the                                                          Maps, while Sort reads key/value pairs in Maps, and
job and nodes spend their time.                                                                             aggregates, sorts, and outputs them in Reduces. From
“MIROS” plots: Data-flows in space. MIROS (Map                                                               these visualizations, we see that RandomWriter has
Inputs, Reduce Outputs, Shuffles, Figure 3) visualiza-                                                       only Maps, while the Reduces in Sort take significantly
tions show data volumes into all Maps and out of all Re-                                                    longer than the Maps, showing most of the work occurs


                                                                                                       3
Task durations (Matrix-Vector Multiplication): Per-node                                                       RandomWriter Workload (4 nodes)




          200




                                                                                                           40
                     JT_Map                                                                                            JT_Map
                     JT_Reduce



          150




                                                                                                           30
      Per-task




                                                                                                       Per-task
       100




                                                                                                         20
          50




                                                                                                           10
          0




                 0                        50                    100                     150




                                                                                                           0
                                                     Time/s
                       Task durations (Matrix-Vector Multiplication: 49 hosts): All nodes                          0                100                  200             300         400
                                                                                                                                                          Time/s
          200




                     JT_Map
                     JT_Reduce
                                                                                                                                              Sort Workload (4 nodes)
                                                                                                                       JT_Map
                                                                                                                       JT_Reduce
          150




                                                                                                           150
      Per-task
        100




                                                                                                             100
                                                                                                       Per-task
          50




                                                                                                           50
          0




                 0                        50                    100                     150
                                                     Time/s




                                                                                                           0
Figure 5: Swimlanes plot for 49-node job for the Matrix-                                                           0               200             400
                                                                                                                                                          Time/s
                                                                                                                                                                   600         800


Vector Multiplication; top plot: tasks sorted by node;
                                                                                                  Figure 6: Summarized Swimlanes plot for Ran-
bottom plot: tasks sorted by time.
                                                                                                  domWriter (top) and Sort (bottom)
in the Reduces. The REP plot in Figure 4 shows that a
                        2                                                                         tem views from existing instrumentation (Hadoop sys-
significant fraction (≈ 3 ) of the time along the critical
paths (Cluster 5) is spent waiting for Map outputs to be                                          tem logs) to build views at a higher level of abstrac-
shuffled to the Reduces, suggesting this is a bottleneck.                                          tion for MR. Other uses of distributed execution trac-
                                                                                                  ing [2, 5, 13] for diagnosis operate at the language level,
4.2     Finding Opportunities for Optimization
                                                                                                  rather than at a higher-level of abstraction (e.g. MR)
Figure 7 shows the Swimlanes from the Matrix-Vector                                               since they are designed for general systems.
Multiplication job of the HADI [12] graph-mining ap-
                                                                                                  Diagnosis for MR. [14] collected trace events
plication for Hadoop. This workload contains two MR
                                                                                                  in Hadoop’s execution generated by custom
programs, as seen from the two batches of Maps and
                                                                                                  instrumentation– these are akin to language-level
Reduces. Before optimization, the second node and first
                                                                                                  views; their abstractions do not account for the volume
node do not run any Reduce in the first and second jobs
                                                                                                  dimension which we provide (§3.2), and they do not
respectively. The number of Reduces was then increased
                                                                                                  correlate data with the MR level of abstraction. [19]
to twice the number of slave nodes, after which every
                                                                                                  only showed how outlier events can be identified
node ran two Reduces (the maximum concurrent permit-
                                                                                                  in DataNode logs; we utilize information from the
ted), and the job completed 13.5% faster.
                                                                                                  TaskTracker logs as well, and we build a complete
4.3     Debugging: Delayed Java Socket Creation                                                   abstraction of all execution events. [17] diagnosed
We ran a no-op (“Sleep”) Hadoop job, with 2400 idle                                               anomalous nodes in MR clusters by identifying nodes
Maps and Reduces which sleep for 100ms, to character-                                             with OS-level performance counters that deviated
ize idle Hadoop behavior, and found tasks with unusu-                                             from other nodes. [18] demonstrated how to extract
ally long durations. On inspection of the Swimlanes, we                                           state-machine views of Haodop’s execution from its
found delayed tasks ran for 3 minutes (Figure 8). We                                              logs; we have expanded on the per-node state-machine
traced this problem to a delayed socket call in Hadoop,                                           views in [18] by building the JCDF and extracting REPs
and found a fix described at [1]. We resolved this issue                                           (§3.1) to generate novel system views.
by forcing Java to use IPv4 through a JVM option, and                                             Visualization tools. Artemis [6] provides a pluggable
Sleep ran in 270, instead of 520, seconds.                                                        framework for distributed log collection, data analysis,
                                                                                                  and visualization. We have presented specific MR ab-
5     Related Work                                                                                stractions and ways to build them, and our techniques
Distributed tracing and failure diagnosis. Recent                                                 can be implemented as Artemis plugins. The “machine
tools for tracing distributed program execution have fo-                                          usage data” plots in [6] resemble Swimlanes; REP shows
cused on building instrumentation to trace causal paths                                           both data and computational dependencies, while the
[3], infer causality across components [15] and networks                                          critical path analysis in [6] considers only computation.
[8]. They produce fine-grained views at the language but                                           [4], visualized web server access patterns and the out-
not MR level of abstraction. Our work correlates sys-                                             put of anomaly detection algorithms, while we showed


                                                                                              4
Matrix-Vector Multiplication Workload, (4 nodes)                                                      Sleep Workload with socket error (24 nodes)




                                                                                                              2500
                     JT_Map                                                                                                JT_Map
                     JT_Reduce                                                                                             JT_Reduce



         60




                                                                                                              2000
         50




                                                                                                                1500
            40
     Per-task




                                                                                                         Per-task
     30




                                                                                                       1000
         20




                                                                                                              500
         10
         0




                                                                                                              0
                 0                    200               400             600           800                              0               100         200           300         400        500
                                                        Time/s                                                                                             Time/s
                                  Matrix-Vector Multiplication Workload, (4 nodes)                                                     Sleep Workload without socket error (24 nodes)




                                                                                                              2500
                     JT_Map                                                                                                JT_Map
                     JT_Reduce                                                                                             JT_Reduce




                                                                                                              2000
         60




                                                                                                                1500
     Per-task




                                                                                                         Per-task
         40




                                                                                                       1000
         20




                                                                                                              500
         0




                                                                                                              0
                 0          100         200       300         400     500       600    700                             0               50           100             150       200         250
                                                        Time/s                                                                                             Time/s


Figure 7: Matrix-vector Multiplication before optimiza-                                          Figure 8: SleepJob with delayed socket creation (above),
tion (above), and after optimization (below)                                                     and without (below)

system execution patterns.                                                                        [5] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pin-
                                                                                                      point: Problem Determination in Large, Dynamic Internet Ser-
6    Conclusion and Future Work                                                                       vices. In DSN, Jun 2002.
                                                                                                  [6] G. Cretu-Ciocarlie, M. Budiu, and M. Goldszmidt. Hunting for
Mochi extracts and visualizes information about MR                                                    problems with artemis. In Workshop on Analysis of System Logs,
programs at the MR-level abstraction, based on                                                        Dec 2008.
Hadoop’s system logs. We show how Mochi’s analysis                                                [7] J. Dean and S. Ghemawat. MapReduce: Simplified Data Pro-
                                                                                                      cessing on Large Clusters. In OSDI, Dec 2004.
produces a distributed, causal, control+data-flow model
                                                                                                  [8] R. Fonseca, G. Porter, R. Katz, S. Shenker, and I. Stoica. X-
of Hadoop’s behavior, and then show the use of the re-                                                Trace: A Pervasive Network Tracing Framework. In NSDI, Apr
sulting visualizations for understanding and debugging                                                2007.
the performance of Hadoop jobs in the Yahoo! M45 pro-                                             [9] The Apache Software Foundation.                 Hadoop, 2007.
duction cluster. We intend to implement our (currently)                                               http://hadoop.apache.org/core.
offline Mochi analysis and visualization to run online, to                                        [10] S. Ghemawat, H. Gobioff, and S. Leung. The Google Filesystem.
                                                                                                      In SOSP, Oct 2003.
evaluate the resulting performance overheads and ben-
                                                                                                 [11] Yahoo!           Inc.        Yahoo!          reaches for the
efits. We also intend to support the regression testing                                                stars    with      M45    supercomputing       project,   2007.
of Hadoop programs against new Hadoop versions, and                                                   http://research.yahoo.com/node/1884.
debugging of more problems, e.g. misconfigurations.                                               [12] U. Kang, C. Tsourakakis, A.P. Appel, C. Faloutsos, and
                                                                                                      J. Leskovec. HADi: Fast Diameter Estimation and Mining in
Acknowledgements                                                                                      Massive Graphs with Hadoop. CMU ML Tech Report CMU-ML-
                                                                                                      08-117, 2008.
The authors would like to thank Christos Faloutsos and                                           [13] E. Kiciman and A. Fox. Detecting Application-level Failures
U Kang for discussions on the HADI Hadoop workload                                                    in Component-based Internet Services. IEEE Trans. on Neural
and for providing log data.                                                                           Networks, 16(5):1027– 1041, Sep 2005.
                                                                                                 [14] A. Konwinski, M. Zaharia, R. Katz, and I. Stoica. X-tracing
References                                                                                            Hadoop. Hadoop Summit, Mar 2008.
                                                                                                 [15] Eric Koskinen and John Jannotti. Borderpatrol: Isolating Events
 [1] Creating socket in java takes 3 minutes,                    2004.
                                                                                                      for Black-box Tracing. In Eurosys 2008, Apr 2008.
     http://tinyurl.com/d5p3qr.
                                                                                                 [16] Arun Murthy. Hadoop MapReduce - Tuning and Debugging,
 [2] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and
                                                                                                      2008. http://tinyurl.com/c9eau2.
     A. Muthitacharoen. Performance Debugging for Distributed Sys-
                                                                                                 [17] X. Pan, J. Tan, S. Kavulya, R. Gandhi, and P. Narasimhan. Gane-
     tem of Black Boxes. In SOSP, Oct 2003.
                                                                                                      sha: Black-Box Diagnosis of MapReduce Systems. In HotMet-
 [3] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Mag-                                    rics, Seattle, WA, Jun 2009.
     pie for Request Extraction and Workload Modelling. In OSDI,
                                                                                                 [18] J. Tan, X. Pan, S. Kavulya, R. Gandhi, and P. Narasimhan.
     Dec 2004.
                                                                                                      SALSA: Analyzing Logs as StAte Machines. In Workshop on
 [4] P. Bodik, G. Friedman, L. Biewald, H. Levine, G. Candea, K. Pa-                                  Analysis of System Logs, Dec 2008.
     tel, G. Tolle, J. Hui, A. Fox, M. Jordan, and D. Patterson. Com-                            [19] W. Xu, L. Huang, A. Fox, D. Patterson, and M. Jordan. Min-
     bining Visualization and Statistical Analysis to Improve Opera-                                  ing Console Logs for Large-scale System Problem Detection. In
     tor Confidence and Efficiency for Failure Detection and Local-                                     SysML, Dec 2008.
     ization In ICAC, Jun 2005.


                                                                                             5

Más contenido relacionado

La actualidad más candente

A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkeldariof
 
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...IRJET Journal
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...Kyong-Ha Lee
 
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...Big Data Spain
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Soumee Maschatak
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.Kyong-Ha Lee
 
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATIONMAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATIONijdms
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Hadoop with Lustre WhitePaper
Hadoop with Lustre WhitePaperHadoop with Lustre WhitePaper
Hadoop with Lustre WhitePaperDavid Luan
 
An experimental evaluation of performance
An experimental evaluation of performanceAn experimental evaluation of performance
An experimental evaluation of performanceijcsa
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabVijay Srinivas Agneeswaran, Ph.D
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceMahantesh Angadi
 
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEPERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEijdpsjournal
 
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...eSAT Journals
 

La actualidad más candente (20)

A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce framework
 
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
 
Hadoop
HadoopHadoop
Hadoop
 
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.
 
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATIONMAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Hadoop with Lustre WhitePaper
Hadoop with Lustre WhitePaperHadoop with Lustre WhitePaper
Hadoop with Lustre WhitePaper
 
An experimental evaluation of performance
An experimental evaluation of performanceAn experimental evaluation of performance
An experimental evaluation of performance
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
 
An Intro to Hadoop
An Intro to HadoopAn Intro to Hadoop
An Intro to Hadoop
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
 
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCEPERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE
 
MapReduce in Cloud Computing
MapReduce in Cloud ComputingMapReduce in Cloud Computing
MapReduce in Cloud Computing
 
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ...
 

Similar a Visualizing Hadoop Jobs for Performance Debugging

Characterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningCharacterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningJoão Gabriel Lima
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?IJCSIS Research Publications
 
Understanding hadoop
Understanding hadoopUnderstanding hadoop
Understanding hadoopRexRamos9
 
High level languages for Big Data Analytics (Report)
High level languages for Big Data Analytics (Report)High level languages for Big Data Analytics (Report)
High level languages for Big Data Analytics (Report)Jose Luis Lopez Pino
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniquesijsrd.com
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methodspaperpublications3
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTijwscjournal
 
Spark cluster computing with working sets
Spark cluster computing with working setsSpark cluster computing with working sets
Spark cluster computing with working setsJinxinTang
 
Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...redpel dot com
 

Similar a Visualizing Hadoop Jobs for Performance Debugging (20)

Characterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningCharacterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learning
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
 
Understanding hadoop
Understanding hadoopUnderstanding hadoop
Understanding hadoop
 
B04 06 0918
B04 06 0918B04 06 0918
B04 06 0918
 
D04501036040
D04501036040D04501036040
D04501036040
 
Yarn
YarnYarn
Yarn
 
High level languages for Big Data Analytics (Report)
High level languages for Big Data Analytics (Report)High level languages for Big Data Analytics (Report)
High level languages for Big Data Analytics (Report)
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
Hadoop at JavaZone 2010
Hadoop at JavaZone 2010Hadoop at JavaZone 2010
Hadoop at JavaZone 2010
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
 
B04 06 0918
B04 06 0918B04 06 0918
B04 06 0918
 
Spark cluster computing with working sets
Spark cluster computing with working setsSpark cluster computing with working sets
Spark cluster computing with working sets
 
Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...Performance evaluation and estimation model using regression method for hadoo...
Performance evaluation and estimation model using regression method for hadoo...
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 

Más de George Ang

Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...George Ang
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarizationGeorge Ang
 
Huffman coding
Huffman codingHuffman coding
Huffman codingGeorge Ang
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textGeorge Ang
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿George Ang
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势George Ang
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程George Ang
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qqGeorge Ang
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道George Ang
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化George Ang
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间George Ang
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨George Ang
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站George Ang
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程George Ang
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagementGeorge Ang
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享George Ang
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享George Ang
 

Más de George Ang (20)

Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarization
 
Huffman coding
Huffman codingHuffman coding
Huffman coding
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar text
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qq
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享
 

Último

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Último (20)

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Visualizing Hadoop Jobs for Performance Debugging

  • 1. 1 Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan Electrical & Computer Engineering Department Carnegie Mellon University, Pittsburgh, PA 15213-3890 {jiaqit,xinghaop,spertet,rgandhi,priyan}@andrew.cmu.edu Abstract had multiple categories) in Table 1. These posts fo- cused on MR-specific aspects of Hadoop program be- Mochi, a new visual, log-analysis based debugging tool havior. The primary response to these posts involved correlates Hadoop’s behavior in space, time and vol- suggestions to use Java profilers, which do not capture ume, and extracts a causal, unified control- and data- dynamic MR-specific behavior, such as relationships in flow model of Hadoop across the nodes of a cluster. time (e.g., orders of execution), space (which tasks ran Mochi’s analysis produces visualizations of Hadoop’s on which nodes), and the volumes of data in various pro- behavior using which users can reason about and debug gram stages. This motivated us to extract and analyze performance issues. We provide examples of Mochi’s time-, space- and volume-related Hadoop behavior. value in revealing a Hadoop job’s structure, in optimiz- The MR framework affects program performance at ing real-world workloads, and in identifying anomalous the macro-scale through task scheduling and data distri- Hadoop behavior, on the Yahoo! M45 Hadoop cluster. bution. This macro behavior is hard to infer from low- 1 Introduction level language views because of the glut of detail, and because this behavior results from the framework out- MapReduce (MR) [7] is a programming paradigm and side of user code. For effective debugging, tools must framework introduced by Google for data-intensive expose MR-specific abstractions. This motivated us to cloud computing on commodity clusters. Hadoop [9], capture Hadoop distributed data- and execution-related an open-source Java implementation of MapReduce, is behavior that impacts MR performance. Finally, given used by Yahoo! and Facebook, and is available on Ama- the scale (number of nodes, tasks, interactions, dura- zon’s pay-as-you-use EC2 cloud computing infrastruc- tions) of Hadoop programs, there is also a need to visual- ture. Debugging the performance of Hadoop programs ize a program’s distributed execution to support debug- is difficult because of their scale and distributed na- ging and to make it easier for users to detect deviations ture. Hadoop can be debugged by examining the local from expected program behavior/performance. To the (node-specific) logs of its execution. These logs can best of our knowledge, Mochi is the first debugging tool be large, and must be manually stitched across nodes for Hadoop to extract (from Hadoop’s own logs) both to debug system-wide problems. Current Java debug- control- and data-flow views, and to then analyze and vi- ging/profiling tools (jstack, hprof) target program- sualize these views in a distributed, causal manner. We ming abstractions to help debug local code-level errors provide concrete examples where Mochi has assisted us rather than distributed problems across multiple nodes and other Hadoop users in understanding Hadoop’s be- [16]. In addition, these tools do not provide insights at havior and unearthing problems. the higher level of abstraction (e.g. Maps and Reduces) that is more natural to MR users and programmers. Sim- 2 Problem Statement ilarly, path-tracing tools [8] for distributed systems pro- Our previously developed log-analysis tool, SALSA duce fine-grained views at the language rather than at the [18], extracted various statistics (e.g., durations of Map MR abstraction. and Reduce tasks) of system behavior from Hadoop’s Our survey of the Hadoop users’ mailing-list indi- logs on individual nodes. Mochi aims to go beyond cates that the most frequent performance-related ques- SALSA, to (i) correlate Hadoop’s behavior in space, tions are indeed at the level of MR abstractions. We time and volume, and (ii) extract causal, end-to-end, dis- examined the 3400 posts on this mailing list over a 6- tributed Hadoop behavior that factors in both computa- month period (10/2008 to 4/2009), and classified the tion and data across the nodes of the cluster. From our 30-odd explicit performance-related posts2 (some posts interactions with real Hadoop users (of the Yahoo! M45 1 This research was partially funded by the Defence Science & [11] cluster), a third need has emerged: to provide help- Technology Agency, Singapore, via the DSTA Overseas Scholarship, ful visualizations of Hadoop’s behavior so that users can and sponsored in part by the National Science Foundation, via CA- reason about and debug performance issues themselves. REER grant CCR-0238381 and grant CNS-0326453. 2 As expected of mailing-lists, most of the 3400 posts were from Goals. Mochi’s goals are: users learning about and configuring Hadoop (note that misconfigu- rations can also lead to performance problems). We filtered out, and focused on, the 30-odd posts that explicitly raised a performance issue. 1
  • 2. Category Question Fraction Configuration How many Maps/Reduces are efficient? Did I set a wrong number of Reduces? 50% Data behavior My Maps have lots of output, are they beating up nodes in the shuffle? 30% Runtime behavior Must all mappers complete before reducers can run? What is the performance impact of setting X? 50% What are the execution times of program parts? Table 1: Common queries on users’ mailing list To expose MapReduce-specific behavior that results Detailed Swimlanes: Sort Workload (4 nodes) from the MR framework’s automatic execution, that 200 affects program performance but is neither visible nor exposed to user Map/Reduce code, e.g. when 150 Maps/Reduces are executed and on which nodes, from/to where data inputs/outputs flow and from which Per-task 100 Maps/Reduces. Existing Java profilers do not capture such information. 50 MapTask To expose aggregate and dynamic behavior that can ReduceCopyReceive ReduceMergeCopy provide different insights. For instance, Hadoop sys- ReduceTask 0 tem views in time can be instantaneous or aggregated 0 200 400 600 800 across an entire job; views in space can be of individual Time/s Maps/Reduces or aggregated at nodes. Figure 2: Swimlanes: detailed states: Sort workload To be transparent so that Mochi does not require any modifications to Hadoop, or to the way that Hadoop users write/compile/load their programs today. This also Mochi then correlates these views across nodes, and makes Mochi amenable to deployment in production also between HDFS and the execution layer, to con- Hadoop clusters, as is our objective. struct a unique end-to-end representation that we call a Job-Centric Data-flow (JCDF), which is a distributed, Non-goals. Our focus is on exposing MR-specific as- causal, conjoined control- and data-flow. pects of programs rather than behavior within each Map Mochi parses Hadoop’s logs3 based on SALSA’s [18] or Reduce. Thus, the execution specifics or correctness state-machine abstraction of Hadoop’s execution. In its of code within a Map/Reduce is outside our scope. Also, log analysis, Mochi extracts (i) a time-stamped, cross- Mochi does not discover the root-cause of performance node, control-flow model by seeking string-tokens that problems, but aids in the process through useful visual- identify TaskTracker-log messages signaling the starts izations and analysis that Hadoop users can exploit. and ends of activities (e.g., Map, Reduce), and (ii) a 3 Mochi’s Approach time-stamped, cross-node, data-flow model by seeking string-tokens that identify DataNode-log messages sig- MapReduce programs, or jobs, consist of Map tasks fol- naling the movement/access of data blocks, and by cor- lowed by Reduce tasks; multiple identical but distinct in- relating these accesses with Maps and Reduces running stances of tasks operate on distinct data segments in par- at the same time. Mochi assumes that clocks are syn- allel across nodes in a cluster. The framework has a sin- chronized across nodes using NTP, as is common in pro- gle master node (running the NameNode and JobTracker duction Hadoop clusters. daemons) that schedules Maps and Reduces on multiple Mochi then correlates the execution of the TaskTrack- slave nodes. The framework also manages the inputs ers and DataNodes in time (e.g. co-occurring Maps and and outputs of Maps, Reduces, and Shuffles (moving block reads in HDFS) to identify when data was read of Map outputs to Reduces). Hadoop provides a dis- from or written to HDFS. This completes the causal tributed filesystem (HDFS) that implements the Google path of the data being read from HDFS, processed in Filesystem [10]. Each slave node runs a TaskTracker the Hadoop framework, and written to HDFS, creating (execution) and a DataNode (HDFS) daemon. Hadoop a JCDF, which is a directed graph with vertices repre- programs read and write data from HDFS. Each Hadoop senting processing stages and data items, and edges an- node generates logs that record the local execution of notated with durations and volumes (Figure 1). Finally tasks and HDFS data accesses. we extract all Realized Execution Paths (REPs) from the 3.1 Mochi’s Log Analysis JCDF graph–unique paths from a parent node to a leaf node– using a depth-first search. Each REP is a distinct Mochi constructs cluster-wide views of the execution end-to-end, causal flow in the system. of MR programs from Hadoop-generated system logs. Mochi builds on our log-analysis capabilities to ex- 3 Mochi uses SALSA to parse TaskTracker and DataNode logs. We tract local (node-centric) Hadoop execution views [18]. can extend this to parse NameNode and JobTracker logs as well. 2
  • 3. Reduce Input Size/ Read Size/ Mapout Size/ Shuffle Size/ Wait Time Reduce Input Size/ Write Size/ Read Time Map Time Shuffle Time for Shuffles ReduceCopy- Reduce Time Write Time Data Map Data Data Reduce Data Receive Figure 1: Single instance of a Realized Execution Path (REP) showing vertices and edge annotations Map inputs, outputs Sort (4 nodes) 0.000000 MB 2433.000000 MB 4866.000000 MB 7299.000000 MB Input Map 1.1e+10 1.0e+10 9.0e+09 8.0e+09 7.0e+09 Output 6.0e+09 Map 16853 paths 5 5.0e+09 node4 node3 node1 node2 62116 paths 4 Shuffles (Towards Reducers) Clusters node3 1.4e+09 MapReadBlockVol Dest Host 1.3e+09 5102 paths MapVol 3 node4 1.2e+09 1.1e+09 ShuffleVol node2 1.0e+09 ReduceInputVol node1 9.0e+08 ReduceOutputVol ReduceWriteBlockVol 12532 paths 2 node4 node3 node1 node2 MapReadTime Src Host (From Mappers) MapTime ShuffleTime Reduce inputs, outputs ReduceCopyReceiveTime 63936 paths ReduceTime 1 Reduce Input ReduceWriteTime 1.15e+10 1.10e+10 1.05e+10 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1.00e+10 Reduce Duration/seconds Output 9.50e+09 9.00e+09 node1 node2 node4 node3 Figure 4: REP plot for Sort workload Figure 3: MIROS: Sort workload; (volumes in bytes) duces on each node, and between Maps and Reduces on nodes. These volumes are aggregated over the program’s Thus, Mochi automatically generates, and then cor- run and over nodes. MIROS is useful in highlighting relates, the cross-node data- and control-flow models of skewed data flows that can result in bottlenecks. Hadoop’s behavior, resulting in a unified, causal, cluster- REP: Volume-duration correlations. For each REP wide execution+data-flow model. flow, we show the time taken for a causal flow, and the 3.2 Mochi’s Visualization volume of inputs and outputs, along that flow (Figure 4). Each REP is broken down into time spent and volume Mochi’s distributed data- and control-flows capture MR processed in each state. We use the k-means clustering programs in three dimensions: space (nodes), time (du- algorithm to group similar paths for scalable visualiza- rations, times, sequences of execution), and volume (of tion. For each group, the top bar shows volumes, and the data processed). We use Mochi’s analysis to drive visu- bottom bar durations. This visualization is useful in (i) alizations that combine these dimensions at various ag- checking that states that process larger volumes should gregation levels. In this section, we describe the form take longer, and (ii) in tracing problems back to any pre- of these visualizations, without discussing actual exper- vious stage or data. imental data or drawing any conclusions from the visu- alizations (although the visualizations are based on real 4 Examples of Mochi’s Value experimental data). We describe the actual workloads We demonstrate the use of Mochi’s visualizations (using and case studies in §4. mainly Swimlanes due to space constraints). All data “Swimlanes”: Task progress in time and space. In is derived from log traces from the Yahoo! M45 [11] such a visualization, the x-axis represents wall-clock production cluster. The examples in § 4.1, § 4.2 involve time, and each horizontal line corresponds to an exe- 4-slave and 49-slave clusters, and the example in § 4.3 cution state (e.g., Map, Reduce) running in the marked is from a 24-slave cluster. time interval. Figure 2 shows a sample detailed view with all states across all nodes. Figure 6 shows a sample 4.1 Understanding Hadoop Job Structure summarized view (Maps and Reduces only) collapsed Figure 6 shows the Swimlanes plots from the Sort across nodes, while Figure 7 shows summarized views and RandomWriter benchmark workloads (part of the with tasks grouped by nodes. Figure 5 shows Swimlanes Hadoop distribution), respectively. RandomWriter for a 49-slave node cluster. Swimlanes are useful in cap- writes random key/value pairs to HDFS and has only turing dynamic Hadoop execution, showing where the Maps, while Sort reads key/value pairs in Maps, and job and nodes spend their time. aggregates, sorts, and outputs them in Reduces. From “MIROS” plots: Data-flows in space. MIROS (Map these visualizations, we see that RandomWriter has Inputs, Reduce Outputs, Shuffles, Figure 3) visualiza- only Maps, while the Reduces in Sort take significantly tions show data volumes into all Maps and out of all Re- longer than the Maps, showing most of the work occurs 3
  • 4. Task durations (Matrix-Vector Multiplication): Per-node RandomWriter Workload (4 nodes) 200 40 JT_Map JT_Map JT_Reduce 150 30 Per-task Per-task 100 20 50 10 0 0 50 100 150 0 Time/s Task durations (Matrix-Vector Multiplication: 49 hosts): All nodes 0 100 200 300 400 Time/s 200 JT_Map JT_Reduce Sort Workload (4 nodes) JT_Map JT_Reduce 150 150 Per-task 100 100 Per-task 50 50 0 0 50 100 150 Time/s 0 Figure 5: Swimlanes plot for 49-node job for the Matrix- 0 200 400 Time/s 600 800 Vector Multiplication; top plot: tasks sorted by node; Figure 6: Summarized Swimlanes plot for Ran- bottom plot: tasks sorted by time. domWriter (top) and Sort (bottom) in the Reduces. The REP plot in Figure 4 shows that a 2 tem views from existing instrumentation (Hadoop sys- significant fraction (≈ 3 ) of the time along the critical paths (Cluster 5) is spent waiting for Map outputs to be tem logs) to build views at a higher level of abstrac- shuffled to the Reduces, suggesting this is a bottleneck. tion for MR. Other uses of distributed execution trac- ing [2, 5, 13] for diagnosis operate at the language level, 4.2 Finding Opportunities for Optimization rather than at a higher-level of abstraction (e.g. MR) Figure 7 shows the Swimlanes from the Matrix-Vector since they are designed for general systems. Multiplication job of the HADI [12] graph-mining ap- Diagnosis for MR. [14] collected trace events plication for Hadoop. This workload contains two MR in Hadoop’s execution generated by custom programs, as seen from the two batches of Maps and instrumentation– these are akin to language-level Reduces. Before optimization, the second node and first views; their abstractions do not account for the volume node do not run any Reduce in the first and second jobs dimension which we provide (§3.2), and they do not respectively. The number of Reduces was then increased correlate data with the MR level of abstraction. [19] to twice the number of slave nodes, after which every only showed how outlier events can be identified node ran two Reduces (the maximum concurrent permit- in DataNode logs; we utilize information from the ted), and the job completed 13.5% faster. TaskTracker logs as well, and we build a complete 4.3 Debugging: Delayed Java Socket Creation abstraction of all execution events. [17] diagnosed We ran a no-op (“Sleep”) Hadoop job, with 2400 idle anomalous nodes in MR clusters by identifying nodes Maps and Reduces which sleep for 100ms, to character- with OS-level performance counters that deviated ize idle Hadoop behavior, and found tasks with unusu- from other nodes. [18] demonstrated how to extract ally long durations. On inspection of the Swimlanes, we state-machine views of Haodop’s execution from its found delayed tasks ran for 3 minutes (Figure 8). We logs; we have expanded on the per-node state-machine traced this problem to a delayed socket call in Hadoop, views in [18] by building the JCDF and extracting REPs and found a fix described at [1]. We resolved this issue (§3.1) to generate novel system views. by forcing Java to use IPv4 through a JVM option, and Visualization tools. Artemis [6] provides a pluggable Sleep ran in 270, instead of 520, seconds. framework for distributed log collection, data analysis, and visualization. We have presented specific MR ab- 5 Related Work stractions and ways to build them, and our techniques Distributed tracing and failure diagnosis. Recent can be implemented as Artemis plugins. The “machine tools for tracing distributed program execution have fo- usage data” plots in [6] resemble Swimlanes; REP shows cused on building instrumentation to trace causal paths both data and computational dependencies, while the [3], infer causality across components [15] and networks critical path analysis in [6] considers only computation. [8]. They produce fine-grained views at the language but [4], visualized web server access patterns and the out- not MR level of abstraction. Our work correlates sys- put of anomaly detection algorithms, while we showed 4
  • 5. Matrix-Vector Multiplication Workload, (4 nodes) Sleep Workload with socket error (24 nodes) 2500 JT_Map JT_Map JT_Reduce JT_Reduce 60 2000 50 1500 40 Per-task Per-task 30 1000 20 500 10 0 0 0 200 400 600 800 0 100 200 300 400 500 Time/s Time/s Matrix-Vector Multiplication Workload, (4 nodes) Sleep Workload without socket error (24 nodes) 2500 JT_Map JT_Map JT_Reduce JT_Reduce 2000 60 1500 Per-task Per-task 40 1000 20 500 0 0 0 100 200 300 400 500 600 700 0 50 100 150 200 250 Time/s Time/s Figure 7: Matrix-vector Multiplication before optimiza- Figure 8: SleepJob with delayed socket creation (above), tion (above), and after optimization (below) and without (below) system execution patterns. [5] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pin- point: Problem Determination in Large, Dynamic Internet Ser- 6 Conclusion and Future Work vices. In DSN, Jun 2002. [6] G. Cretu-Ciocarlie, M. Budiu, and M. Goldszmidt. Hunting for Mochi extracts and visualizes information about MR problems with artemis. In Workshop on Analysis of System Logs, programs at the MR-level abstraction, based on Dec 2008. Hadoop’s system logs. We show how Mochi’s analysis [7] J. Dean and S. Ghemawat. MapReduce: Simplified Data Pro- cessing on Large Clusters. In OSDI, Dec 2004. produces a distributed, causal, control+data-flow model [8] R. Fonseca, G. Porter, R. Katz, S. Shenker, and I. Stoica. X- of Hadoop’s behavior, and then show the use of the re- Trace: A Pervasive Network Tracing Framework. In NSDI, Apr sulting visualizations for understanding and debugging 2007. the performance of Hadoop jobs in the Yahoo! M45 pro- [9] The Apache Software Foundation. Hadoop, 2007. duction cluster. We intend to implement our (currently) http://hadoop.apache.org/core. offline Mochi analysis and visualization to run online, to [10] S. Ghemawat, H. Gobioff, and S. Leung. The Google Filesystem. In SOSP, Oct 2003. evaluate the resulting performance overheads and ben- [11] Yahoo! Inc. Yahoo! reaches for the efits. We also intend to support the regression testing stars with M45 supercomputing project, 2007. of Hadoop programs against new Hadoop versions, and http://research.yahoo.com/node/1884. debugging of more problems, e.g. misconfigurations. [12] U. Kang, C. Tsourakakis, A.P. Appel, C. Faloutsos, and J. Leskovec. HADi: Fast Diameter Estimation and Mining in Acknowledgements Massive Graphs with Hadoop. CMU ML Tech Report CMU-ML- 08-117, 2008. The authors would like to thank Christos Faloutsos and [13] E. Kiciman and A. Fox. Detecting Application-level Failures U Kang for discussions on the HADI Hadoop workload in Component-based Internet Services. IEEE Trans. on Neural and for providing log data. Networks, 16(5):1027– 1041, Sep 2005. [14] A. Konwinski, M. Zaharia, R. Katz, and I. Stoica. X-tracing References Hadoop. Hadoop Summit, Mar 2008. [15] Eric Koskinen and John Jannotti. Borderpatrol: Isolating Events [1] Creating socket in java takes 3 minutes, 2004. for Black-box Tracing. In Eurosys 2008, Apr 2008. http://tinyurl.com/d5p3qr. [16] Arun Murthy. Hadoop MapReduce - Tuning and Debugging, [2] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and 2008. http://tinyurl.com/c9eau2. A. Muthitacharoen. Performance Debugging for Distributed Sys- [17] X. Pan, J. Tan, S. Kavulya, R. Gandhi, and P. Narasimhan. Gane- tem of Black Boxes. In SOSP, Oct 2003. sha: Black-Box Diagnosis of MapReduce Systems. In HotMet- [3] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Mag- rics, Seattle, WA, Jun 2009. pie for Request Extraction and Workload Modelling. In OSDI, [18] J. Tan, X. Pan, S. Kavulya, R. Gandhi, and P. Narasimhan. Dec 2004. SALSA: Analyzing Logs as StAte Machines. In Workshop on [4] P. Bodik, G. Friedman, L. Biewald, H. Levine, G. Candea, K. Pa- Analysis of System Logs, Dec 2008. tel, G. Tolle, J. Hui, A. Fox, M. Jordan, and D. Patterson. Com- [19] W. Xu, L. Huang, A. Fox, D. Patterson, and M. Jordan. Min- bining Visualization and Statistical Analysis to Improve Opera- ing Console Logs for Large-scale System Problem Detection. In tor Confidence and Efficiency for Failure Detection and Local- SysML, Dec 2008. ization In ICAC, Jun 2005. 5