M3AT: Monitoring Agents Assignment Model for the Data-Intensive Applications

Vladislav Kashansky, Dragi Kimovski, Radu Prodan, Prateek Agrawal, Fabrizio Marozzo,
Iuhasz Gabriel, Marek Justyna and Javier Garcia-Blas
ASPIDE Collaboration
vladislav@itec.aau.at
28th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing

Presentation Outline
▪ Background Information;
▪ Monitoring Tools and Techniques;
▪ Challenges at Large Scale;
▪ M3AT Architecture & Formal Model;
▪ Practical Approach Using the SCIP Optimization Suite;
▪ Evaluation;
▪ Q&A.

ASPIDE Project’s Architecture

Example of the Technical Challenge
▪ The behavior of an application on an HPC/Cloud cluster must be traced or profiled over the next
T seconds, with low transition latencies afterwards. It is not known in advance how to select a proper
sampling rate or where to place monitoring agents in the current system:
1. A specific application running on the cluster represents the workload, but it does not run
in an isolated environment and is affected by the current network congestion state;
2. The operator must see all the required information about the processing to identify hotspots.
Moreover, the data must be stored for long-term analysis;
3. Run-time data collection and transport enable analysis while applications are running
and while the system is experiencing conditions of interest. Post-processing analysis
does not solve problems as they occur in practice [LDMS, SC’2014].

Goal and Objectives
▪ Project’s Goal: A scalable monitoring system for large-scale data-intensive systems
▪ Our Goal: A mathematical model for low-latency monitoring data collection subject to
given I/O policies
▪ Objectives:
1. Analyze existing state-of-the-art monitoring approaches and frameworks, as well as mathematical
methods in the field of combinatorial and discrete optimization;
2. Propose a mathematical model for efficient monitoring data collection;
3. Design an architecture that enables practical evaluation of the proposed model.

Contemporary Linux Performance Measurement Tools

Contemporary Model-Specific Performance Measurement Tools
▪ Cube GUI by Jülich Supercomputing Centre
▪ Scalasca Trace Tools by Jülich Supercomputing Centre
▪ Vampir by Technische Universität Dresden
▪ Periscope by Technische Universität München
▪ TAU by University of Oregon
▪ Extra-P by Technische Universität Darmstadt
Challenge lies in massively parallel data and meta-data requests which overcharge distributed parallel file systems. This is a fundamental problem
on highest-scale HPC machines today.
- Knüpfer, Andreas, et al. "Score-P: A joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU, and Vampir."

Curse of Dimensionality
▪ It is impractical, and often impossible, to globally measure the performance metrics
of large-scale applications and systems while respecting constraints such as I/O limitation
policies.
▪ Thus, it is critical to identify:
• The parameters to monitor and the granularity level (e.g., dynamic tracing, profiling, per-node aggregated
statistics, per-cluster I/O heatmaps);
• The measurement interval and the communication patterns in relation to these intervals;
• The aggregation and pre-processing of performance metrics at monitor granularity for further analysis.

Transition to the larger-scale architecture
Agelastos, Anthony, et al. "The lightweight distributed metric service: a scalable infrastructure for continuous monitoring of large
scale computing systems and applications." SC14: International Conference for High Performance Computing, Networking, Storage
and Analysis. IEEE, 2014.

Graph Model
▪ Exchange Protocol, Access Algorithms
▪ Monitoring Data Vectors
▪ The scalable cluster with a unified data space provided by the DFS

Large-scale Monitoring Architecture
▪ The M3AT component controls the assignment of monitoring agents and
aggregators;
▪ The aggregation and event detection component
(AEDC) aggregates monitoring data from
the agents and detects possible events. This
component is decentralized and runs an instance
on every aggregation node;
▪ The main analysis component (AC) is centralized and
provides a set of analytic tools, including
smart monitoring of application performance and
bottleneck detection.

Mathematical model – M3AT Components
▪ The M3AT model aims to identify an optimal assignment of monitoring agents and aggregation points, where the
monitoring data needs to be pre-processed for further analysis. Initially, the monitoring agents are selected from
the partitions for a given application. Thereafter, we assign a subset of the required aggregators to the
monitoring agents, guaranteeing a low response time and keeping the amount of monitoring traffic within the given
upper limits.

Model’s Limitations
▪ A priori information about the running application is already available and is delivered by the runtime system
(e.g., SLURM, Hadoop, Borg);
▪ The number and location of monitoring and aggregating agents are decided by the runtime and data management
systems;
▪ The relevant application performance metrics have already been selected and are treated as the data volume
accumulated within the given push interval;
▪ The optimal control criterion (objective function) and the set of constraints do not change while the
optimization problem is being solved.

Formal Mathematical Model
▪ Convex Polyhedron
▪ Knapsack constraints:
• Upper limit on the total bandwidth (resource
constraints);
▪ Assignment constraints:
• Each monitor is assigned to exactly one aggregator;
▪ 0-1 Formulation:
• Admissible values of the decision variables are restricted
to the set {0, 1}.
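
The three constraint groups above match the standard generalized assignment problem (GAP). A minimal sketch in assumed notation follows; the symbols are illustrative and need not match the authors' own: $x_{ij} = 1$ if monitor $i \in M$ is assigned to aggregator $j \in A$, $c_{ij}$ is an assignment cost such as latency, $d_i$ is the data volume monitor $i$ accumulates within one push interval, and $b_j$ is the bandwidth limit of aggregator $j$.

```latex
\begin{align}
\min_{x} \quad & \sum_{i \in M} \sum_{j \in A} c_{ij}\, x_{ij} \\
\text{s.t.} \quad & \sum_{i \in M} d_{i}\, x_{ij} \le b_j, && \forall j \in A \quad \text{(knapsack)} \\
& \sum_{j \in A} x_{ij} = 1, && \forall i \in M \quad \text{(assignment)} \\
& x_{ij} \in \{0, 1\}, && \forall i \in M,\ j \in A
\end{align}
```

Relaxing the last line to $0 \le x_{ij} \le 1$ leaves only linear constraints, so the feasible region of the LP relaxation is a convex polyhedron.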

Formal Mathematical Model. Matrix and LP Format
▪ Matrix Format:
▪ LP Format:
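
To illustrate how the two formats relate, the sketch below turns a small instance given as matrices into text in the CPLEX LP file format, which SCIP can read. The function name `gap_to_lp`, the variable naming scheme `x_i_j`, and the example numbers are assumptions made for illustration; coefficients are assumed to be non-negative.

```python
def gap_to_lp(cost, demand, limit):
    """Return a CPLEX-LP-format string for a small assignment instance.

    cost[i][j]: hypothetical cost of assigning monitor i to aggregator j
    demand[i]:  hypothetical traffic produced by monitor i
    limit[j]:   hypothetical bandwidth limit of aggregator j
    """
    n_mon, n_agg = len(cost), len(limit)

    def var(i, j):
        return f"x_{i}_{j}"

    lines = ["Minimize"]
    objective = " + ".join(f"{cost[i][j]} {var(i, j)}"
                           for i in range(n_mon) for j in range(n_agg))
    lines.append(f" obj: {objective}")

    lines.append("Subject To")
    # Assignment constraints: each monitor goes to exactly one aggregator.
    for i in range(n_mon):
        row = " + ".join(var(i, j) for j in range(n_agg))
        lines.append(f" assign_{i}: {row} = 1")
    # Knapsack constraints: traffic per aggregator stays within its limit.
    for j in range(n_agg):
        row = " + ".join(f"{demand[i]} {var(i, j)}" for i in range(n_mon))
        lines.append(f" cap_{j}: {row} <= {limit[j]}")

    lines.append("Binary")
    lines.extend(f" {var(i, j)}" for i in range(n_mon) for j in range(n_agg))
    lines.append("End")
    return "\n".join(lines) + "\n"


lp_text = gap_to_lp(cost=[[4, 2], [3, 5]], demand=[1, 2], limit=[2, 3])
print(lp_text)
```

Writing `lp_text` to a file with the `.lp` extension gives a model that LP-format readers can load directly.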

Data Generation and Limitation Policies
▪ We tested the model on a set of 50 assignment
problems, sampled from a uniform distribution using
random seeds ranging from 100 to 150 and
generated with the MT19937 generator of the GNU
Scientific Library 2.0.1.
▪ We identified the constants for the uniform distribution
based on the use-case ecosystem requirements.
▪ Each complexity class provides possible I/O limitation scenarios
imposed on the given set of aggregators.
▪ For example, complexity classes B† and C† model
an environment with bandwidth saturation within a
given HPC cluster partition by setting limitation
policies inversely proportional to the current
number of aggregators and the possible amount of
traffic in circulation.
▪ The complexity class A† is also derived from the
application use cases and allows probabilistic
variability in the bandwidth saturation for a given
aggregator.
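
The generation procedure can be sketched as follows. This is not the authors' generator: it uses Python's `random` module (also MT19937-based, but seeded differently from GSL, so the streams will not match), and every range and the 1.5x slack factor are placeholder assumptions rather than the constants used in the evaluation. Seeds 100 to 149 are assumed, which gives 50 instances.

```python
import random


def make_instance(seed, n_monitors=6, n_aggregators=3, saturated=True):
    """Sample one assignment instance from uniform distributions.

    All numeric ranges are illustrative placeholders.
    """
    rng = random.Random(seed)  # CPython's generator is MT19937-based
    cost = [[rng.uniform(1.0, 10.0) for _ in range(n_aggregators)]
            for _ in range(n_monitors)]
    demand = [rng.uniform(1.0, 5.0) for _ in range(n_monitors)]
    total = sum(demand)
    if saturated:
        # Saturation scenario: each limit is inversely proportional to the
        # number of aggregators and scales with the traffic in circulation.
        limit = [1.5 * total / n_aggregators] * n_aggregators
    else:
        # Variability scenario: each aggregator gets its own random limit.
        limit = [rng.uniform(0.5, 1.5) * total for _ in range(n_aggregators)]
    return {"seed": seed, "cost": cost, "demand": demand, "limit": limit}


instances = [make_instance(seed) for seed in range(100, 150)]
```

Fixing the seed per instance makes every problem reproducible, so solver configurations can be compared on identical inputs.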

SCIP Optimization Suite
▪ Provides a fast open-source IP, MIP and MINLP solver;
▪ Incorporates:
• MIP features (cutting planes, LP relaxation);
• MINLP features;
• CP features (domain propagation);
• SAT-solving features (conflict analysis, restarts);
• A branch-cut-and-price framework;
▪ Has a modular structure via plugins;
▪ Free for academic purposes;
▪ Branch-and-bound based methods can be parallelized
in a distributed or shared-memory
computing environment.
Achterberg, Tobias. "SCIP: solving constraint integer programs." Mathematical
Programming Computation 1.1 (2009): 1-41.
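
To make the branch-and-bound idea concrete, here is a toy depth-first branch-and-bound for a tiny assignment instance. It is a pure-Python illustration, not SCIP's algorithm: it has no LP relaxation, cutting planes, or propagation, and the function name and example data are assumptions.

```python
def solve_gap(cost, demand, limit):
    """Toy depth-first branch-and-bound for a small assignment instance.

    Returns (best_cost, assignment), where assignment[i] is the aggregator
    chosen for monitor i, or (inf, None) if no feasible assignment exists.
    """
    n_mon, n_agg = len(cost), len(limit)
    # Cheapest cost of monitors i..end when capacities are ignored; this
    # relaxation yields a valid lower bound used for pruning.
    suffix_bound = [0.0] * (n_mon + 1)
    for i in range(n_mon - 1, -1, -1):
        suffix_bound[i] = suffix_bound[i + 1] + min(cost[i])

    best = {"cost": float("inf"), "assign": None}
    free = list(limit)   # remaining capacity per aggregator
    current = []         # partial assignment along the current branch

    def branch(i, cost_so_far):
        if cost_so_far + suffix_bound[i] >= best["cost"]:
            return  # bound: this subtree cannot beat the incumbent
        if i == n_mon:
            best["cost"], best["assign"] = cost_so_far, list(current)
            return
        # Try cheaper aggregators first to find good incumbents early.
        for j in sorted(range(n_agg), key=lambda j: cost[i][j]):
            if demand[i] <= free[j]:
                free[j] -= demand[i]
                current.append(j)
                branch(i + 1, cost_so_far + cost[i][j])
                current.pop()
                free[j] += demand[i]

    branch(0, 0.0)
    return best["cost"], best["assign"]


cost = [[4, 2], [3, 5], [6, 1]]
demand = [2, 2, 2]
limit = [4, 2]
best_cost, assignment = solve_gap(cost, demand, limit)
print(best_cost, assignment)
```

The subtrees explored from different branches are independent apart from the shared incumbent, which is what makes this family of methods amenable to parallel execution.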

SCIP Optimization Suite – Solution Output
▪ SCIP solution for the 130x80 dimensionality:
[Plots: axes labelled Time (sec), SCIP Relative GAP (%) and Amount (n)]
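
For reference, the relative gap reported by SCIP compares the best known solution (primal bound) with the best proven bound (dual bound). The sketch below follows the convention given in SCIP's description of its gap limit, |primal - dual| / min(|primal|, |dual|); the function name and the example bounds are assumptions.

```python
import math


def relative_gap(primal_bound, dual_bound):
    """Relative gap: |primal - dual| / min(|primal|, |dual|)."""
    if primal_bound == dual_bound:
        return 0.0
    if primal_bound * dual_bound <= 0:
        # One bound is zero or the signs differ: the gap is unbounded.
        return math.inf
    return abs(primal_bound - dual_bound) / min(abs(primal_bound),
                                                abs(dual_bound))


gap_percent = 100.0 * relative_gap(primal_bound=105.0, dual_bound=100.0)
print(f"{gap_percent:.2f}%")
```

A looser gap limit lets the solver stop earlier with a solution that is provably within that percentage of the optimum, which trades precision for solving time.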

Conclusion
▪ To solve this problem, we applied the ILP formalism and reduced the problem to a generalized assignment problem (GAP) formulation;
▪ We identified the requirements and the parameters for the given model and its solving techniques
based on several data-intensive applications, their corresponding ecosystems, and the current
state-of-the-art monitoring and profiling techniques;
▪ We evaluated the scalability of our model on several data sets of varying complexity in
relation to the specific SCIP precision configuration;
▪ The approach scales well while the number of agents stays within the evaluated boundaries, and it is
highly sensitive to the problem scale and the input data.

Open question on SCIP optimization suite and auto-tuning
▪ Parameters affect almost every part of SCIP's solving process.
• SCIP currently features more than 1600 parameters: boolean, integer, and real-valued;
• Automated parameter tuning could improve SCIP's performance;
• It is currently an active research area for many major OR and AI ecosystems (e.g., CPLEX,
Gurobi, GAMS, COIN-OR, TensorFlow);
▪ Where to start?
• The default settings yield good performance on a heterogeneous MIP benchmark set;
• An application-specific subset of parameters;
• Only the centralized approach is considered; the distributed case requires further analysis.
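
One simple starting point for tuning an application-specific parameter subset is random search. The sketch below is only an illustration of that idea: the candidate values are hypothetical, and `fake_solving_time` is a stand-in for actually running the solver on a set of training instances and measuring the time.

```python
import random


def random_search(evaluate, space, trials=20, seed=0):
    """Random search over a small parameter subset.

    evaluate: callable mapping a parameter dict to a cost to minimize
              (for example, mean solving time over training instances)
    space:    dict mapping a parameter name to its candidate values
    """
    rng = random.Random(seed)
    best_params, best_cost = None, float("inf")
    for _ in range(trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        cost = evaluate(params)
        if cost < best_cost:
            best_params, best_cost = params, cost
    return best_params, best_cost


# Hypothetical subset of solver parameters with candidate values.
space = {
    "separating/maxrounds": [-1, 0, 5],
    "presolving/maxrounds": [-1, 0, 10],
}


def fake_solving_time(params):
    """Stand-in for a real solver run; replace with measured solving time."""
    return (abs(params["separating/maxrounds"] - 5)
            + abs(params["presolving/maxrounds"]))


best_params, best_cost = random_search(fake_solving_time, space, trials=50)
print(best_params, best_cost)
```

Because each trial is independent, the evaluations could be distributed across nodes, although that case is outside what is considered here.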
