Getting the maximum performance in distributed clusters
Intel Cluster Studio XE
Werner Krotz-Vogel
Development Products Division
Software and Services Group
May 2014
Intel® Software Conference 2014
Agenda
Performance Tuning Methodology Overview
Quick overview of Intel® Trace Analyzer and Collector
What’s new in 2015 beta
Quick overview of Intel® VTune™ Amplifier XE
What’s new in 2015 beta
Performance Tuning Methodology using ITAC and VTune™ Amplifier XE
Demonstrated on Poisson Example
MPI 3.0 Support with Intel® MPI
Summary
Copyright© 2014, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Performance Tuning Methodology
using ITAC and VTune™ Amplifier XE
Step 1
•Cluster Level Analysis &
Algorithmic Tuning
Step 2
•Run-time Analysis & Tuning
Step 3
•Intra-Node and Single Node
Level Analysis
Intel® Trace Analyzer and Collector 8.1
What’s new?
Intel® Trace Analyzer and Collector 8.1 Update 3
What’s New
Fresh look-and-feel to the Intel® Trace
Analyzer Graphical Interface
 New toolbars, icons, and dialogs for more
streamlined analysis flow
 Addition of Welcome Page and easy access to
past projects
Support of Dynamic Profiling Tool Command
 MPI_Pcontrol supported
Support for MPI 2.x Standard
New GUI-based installer on Linux*
Intel® MPI Library 5.0 Beta
Key Features
Initial MPI-3.0 Support
 Non-blocking and Sparse Collectives
 Fast Remote Memory Access (RMA)
 Large buffer support (e.g. > 2GB) via mpi_count derived type
ABI compatibility with existing Intel® MPI Library and other MPICH*-based
applications
What’s New in Intel MPI Library 5.0 Beta
Support for the latest MPI-3.0 features
Use non-blocking collectives for a
complete comm/comp overlap
More efficient one-sided
communication via new Fast
Remote Memory Access
functionality
Example (C)
MPI_Request req;
int done = 0;
// Start synchronization
MPI_Ibarrier(comm, &req);
// Do extra computation
…
// Complete synchronization: MPI_Test only polls, so repeat it
// (or call MPI_Wait) until the barrier has completed
while (!done) MPI_Test(&req, &done, MPI_STATUS_IGNORE);
Intel® Trace Analyzer and Collector 9.0 Beta
Key Features
What’s New in Intel Trace Analyzer and Collector 9.0 Beta
Initial MPI-3.0 Support
Automatic Performance
Assistant
 Detect common MPI
performance issues
 Automated tips on potential
solutions
Intel® Trace Analyzer and Collector
Optimize MPI Communications (part of Intel® Cluster Studio XE)
 Visually understand parallel application
behavior
 Communications Patterns
 Hotspots
 Load Balance
 MPI Checking
 Detect Deadlocks
 Data Corruption
 Errors in Parameters, Data Types, etc
[Chart: Intel® Trace Analyzer and Collector (processes) by year, 2010 to 2012; vertical axis from 0 to 7000 processes]
ITAC 9.0: What’s New
• Collection
 Full MPI-3 support
 New mpirun options to customize collection
 Experimental TIME-WINDOWS support
 System calls profiling
• Analysis
 New Performance Assistant
 Visual appearance enhancement
 New Summary Page
• New tutorials
New mpirun data collection keys
Reduce a trace file size or a number of Message Checker reports (supported only at
runtime with Hydra process manager):
• -trace-collectives: collect info only about Collective operations
• -trace-pt2pt: collect info only about Point-to-Point operations
Example:
$ [mpirun|mpiexec] -trace-pt2pt -n 4 ./myApp
System calls profiling (1|2)
Linux* only. Capability to trace the following system calls:
access clearerr close creat
dup dup2 fclose fdopen
feof ferror fflush fgetc
fgetpos fgets fileno fopen
fprintf fputc fputs fread
freopen fseek fsetpos ftell
fwrite getc getchar gets
lseek lseek64 mkfifo perror
pipe poll printf putc
putchar puts read readv
remove rename rewind setbuf
setvbuf sync tmpfile tmpnam
umask ungetc vfprintf vprintf
write writev
System calls profiling (2|2)
To turn on system call collection, add any of the following lines to the ITC configuration file:
• To collect all system calls:
ACTIVITY SYSTEM on
• To collect an exact function:
STATE SYSTEM:<func_name> ON
View system calls using ITA (new Group SYSTEM, can be expanded in an ordinary way):
New Summary Page
At-a-glance view on MPI activity and hints on how to start the analysis of the application:
New Performance Assistant
Automatic highlights of performance issues, both in GUI and CLI.
Currently 4 types of issues are supported, see screenshots:
Intel® VTune™ Amplifier XE 2013
Key Features
Intel® VTune™ Amplifier XE
Tune Applications for Scalable Multicore Performance
Fast, Accurate Performance Profiles
 Hotspot (Statistical call tree)
 Call counts (Statistical)
 Hardware-Event Sampling
Thread Profiling
 Visualize thread interactions on timeline
 Balance workloads
Easy set-up
 Pre-defined performance profiles
 Use a normal production build
Find Answers Fast
 Filter extraneous data
 View results on the source / assembly
Compatible
 Microsoft, GCC, Intel compilers
 C/C++, Fortran, Assembly, .NET, Java
 Latest Intel® processors and compatible processors¹
Windows or Linux
 Visual Studio Integration (Windows)
 Standalone user i/f and command line
 32 and 64-bit
1 IA32 and Intel® 64 architectures.
Many features work with compatible processors.
Event based sampling requires a genuine Intel® Processor.
What’s New in 2013 SP1?
Intel® VTune™ Amplifier XE
More Profiling Data
 Intel® Xeon Phi™ – memory and vectorization profiling
 Gen graphics tuning – GT event counting, offload, OpenCL*, …
Better Data Mining – Find Answers Faster
 Search added to all grids
 Timeline sorting, band height, time scale configuration
 Loop hierarchy, overhead and spin time metrics
 OpenMP* 4.0 – affinity controls, tasking and scalability analysis
Easier to Use
 Attach to a running Java process
 Contextual help for hardware events and performance metrics
 Easier generation of command line options from the user i/f
New OS & Processor Support
 Intel® Xeon Phi™, Haswell – Windows* & Linux*
 Windows 8 desktop and Visual Studio* 2012
 Collection on Windows UI and Windows Blue
 Latest Linux distributions
New since the first 2013 release. Some features released in earlier updates.
Intel® VTune™ Amplifier XE 2015 Beta
Key Features
 GPU analysis
 TSX analysis
 Remote collection in the GUI via ssh
 Mac OS* GUI data viewer (no collection)
 CSV import and custom collector support
 Timeline grouping
What’s New in Intel VTune™ Amplifier XE 2015 Beta
Performance Tuning Methodology
using ITAC and VTune™ Amplifier XE
Performance Tuning Methodology
using ITAC and VTune™ Amplifier XE
Step 1
•Cluster Level Analysis &
Algorithmic Tuning
Global Analysis of the whole application gives first
indications of performance issues
• Run time and scaling analysis
• Message passing performance analysis on an inter/intra node level,
including finding of MPI hotspots
• Network Idealization that yields an imbalance diagram, providing
guidance on how to proceed
• Algorithmic/source code changes can be implemented for better
message passing practices or improving the load balance of the
application by:
• Fixing imbalances in communication patterns of MPI and non-
MPI routines.
• E.g: slow sequential I/O often causes imbalances.
• Removing unnecessary synchronization.
• E.g: message passing patterns using blocking send
and receive may cause a send/receive order that
increases wait times.
• This may be resolved by using non-blocking
MPI_Isend/MPI_Irecv pairs.
Performance Tuning Methodology
using ITAC and VTune™ Amplifier XE
Step 2
•Run-time Analysis & Tuning
Intel MPI can be tuned without changing the
source code using:
• Environment variables for tuning of collective
operations, e.g., I_MPI_ADJUST_ALLREDUCE
• Environment variables for changing the message
passing characteristics, e.g.,
I_MPI_DAPL_DIRECT_COPY_THRESHOLD
• It is also possible to change the MPI
process/rank to node mapping for a better
inter/intra node communication balance
Performance Tuning Methodology
using ITAC and VTune™ Amplifier XE
Step 3
•Single Node Level Analysis
Single node tuning is necessary for serial and
parallel performance optimizations.
Single node tuning is important for improving overall
application scalability and reducing load imbalance.
Bandwidth analysis on the node is important for an
understanding of deficiencies in cluster level scaling.
Example: conduct a hotspot analysis for each rank, or for the
critical ranks identified in Steps 1 and 2.
The call stack information for a specific MPI routine may
also help refine the analysis in Step 1.
Simple Scaling analysis
A first step may be to simply run the program for various numbers of processes [p] and
record timings: T[p]
Speedup S is defined as: S[p] = T[1]/T[p]
Efficiency E is defined as: E[p] = S[p]/p
An ideal parallel program will show:
S[p] = p and E[p] = 1
Example: Poisson solver on a square 3200x3200 computational
grid analysis
Poisson solver: a simple implementation of a Poisson solver, e.g., for the heat equation
• The 3200x3200 grid points can be distributed to MPI ranks using
• 2D process grid, e.g., in the case of 4 ranks, one can use 2 rows x 2 columns of processes
• 1D distribution with 4 rows x 1 column or 1 row x 4 columns
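[Figure: 1D distribution with ranks 0 1 2 3 side by side: 3200x800 local grid points per MPI rank]
[Figure: 2D process grid with ranks 2 3 above ranks 0 1: 1600x1600 local grid points per MPI rank]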
Benchmark Environment
Intel® Xeon® E5 v2 processors (Ivy Town) with 12 cores. Frequency:
2 processors per node (24 cores per node)
Mellanox QDR Infiniband
Operating system: RedHat EL 6.1
Intel® MPI 4.1
Example: Poisson solver on a square 3200x3200 computational
grid analysis
Analysis of the application with different
numbers of processes (p)
Speed-up: S[p] = T[1]/T[p]
Parallel Efficiency: E[p] = S[p]/p
The speedup curves for the 2D quadratic
and 1D process (1D 1xN and 1D Nx1)
grids show some differences in scaling
Step 1- Cluster Level Analysis
using Intel Trace Analyzer and Collector
The MPI communication and
compute performance breakdown
of total run time
T[p] = T_comp[p] + T_mpi[p]
can be accessed through the trace
analyzer’s Function Profile (Intel®
Trace Analyzer displays the
Functions Profile Chart when
opening a trace file).
The trace file can be generated by
adding the flag “-trace” to the
mpirun or mpiexec.hydra command
Notes on the Intel Trace Analyzer and Collector Flat Profile:
• The trace analyzer API was used to time just 100 of 1653 iterations. VT_API is the paused time.
• Timing is accumulated over ranks. The Application time is T_comp.
• The additional column is the average time per process. It can be added via right click and Function Profile Settings.
Example - measuring MPI times with ITAC
Function Profile
MPI Breakdown of a
real Application
(VASP).
All MPI functions
are listed and may
be sorted by a click
on the top of each
column
Step 1- Cluster Level Analysis
using Intel Trace Analyzer and Collector
Parallel efficiency can be calculated and
plotted for the compute time of the
application separately.
• Parallel Efficiency: E[p] = S[p]/p
• One can see that MPI Time is insignificant up
to 48 cores (the equivalent of two nodes).
• Above 96 ranks (4 nodes), pure computational
application performance also yields super
linear scaling.
• However, at the same data point, at around
96 ranks, MPI time becomes the main reason
for low efficiency
Step 1- Cluster Level Analysis
using Intel Trace Analyzer and Collector
Message Passing Profile – 2D case (48x32)
• Message Passing Profile is a display of various
characteristics of message passing in a
sender/receiver matrix that can be obtained
through Charts-> Message Profile Chart.
• Dealing with 1536 ranks generates a huge matrix, so
we may fuse all ranks of each node: Advanced->
Process Aggregation-> All Nodes.
• The diagonal now shows the intra-node
performance characteristics while the off
diagonals show the inter-node statistics. Without
Process Aggregation the diagonal will only be
filled if we send messages from rank n to the
same rank n, which is usually not a good idea.
Step 1- Cluster Level Analysis
using Intel Trace Analyzer and Collector
Message Passing Profile – 1D case
(1535x1)
• There are far fewer messages in the 1D case,
but they are much larger, which leads to a much
higher average transfer rate.
• In the 1D case we also transfer a much
larger amount of data.
Algorithm and Network evaluation
ITAC shows timing of all MPI routines used by a program
The timing of these routines may be due to network transfer times caused by
bandwidth limitations
The other possibility is waiting time caused by the algorithm: load imbalance or
dependencies
A simple Network Model
The simplest network model is defined by latency and bandwidth
• Latency L = transfer time for a 0 byte message
• Bandwidth BW = transfer rate for (asymptotically) large messages.
• The transfer time is (V = Message Volume)
T_trans[V] = L + (1/BW)*V
Ideal Network Simulator
It is extremely complicated to simulate a realistic network!
An extreme case – the ideal network – may be simulated by setting all transfer times
to 0. This would mean L = 0 and BW = ∞ for the simple model
ITAC offers an ideal network simulation with transfer times set to 0. Compute times
(non MPI) will stay the same
Step 1- Cluster Level Analysis
using Intel Trace Analyzer and Collector
Idealization of network and the Load
Imbalance Diagram
• Employing the ideal network simulator (invoked
through the Advanced->Idealization menu)
allows us to separate network stack performance
impact on total MPI performance from algorithmic
inefficiencies like imbalance and dependencies.
• A simple network model for the transfer time as
a function of message volume V is
T_trans[V] = L + (1/BW)*V
• L is latency, defined as the time needed to transfer a 0 byte
message. BW is the transfer rate for asymptotically large
messages
Step 2- Runtime Analysis
It is possible to improve MPI performance without changing the source code
This can be done by using Intel MPI environment variables or by changing the
process mapping of ranks to compute nodes.
Process to node mapping can be altered by advanced methodologies like machine- or
configuration files or by reordering the ranks inside of a communicator
One option is to start the tuning by concentrating on global operations
Set the environment variable I_MPI_DEBUG = 5.
 Prints valuable information about used variables, network fabrics and process placement.
 Setting I_MPI_DEBUG to 6 will further reveal the default algorithms for collective operations
Step 2- Runtime Analysis
Intel MPI reference guide reveals 8
different algorithms for MPI_Allreduce
• The algorithm can easily be changed
by setting the environment variable
I_MPI_ADJUST_ALLREDUCE to an
integer value in the range of 1-8
Step 3 - Analyzing Intra-Node performance
using Intel® VTune™ Amplifier XE
Intel Trace Analyzer and Collector is not
sufficient for hybrid applications due to its
primary focus on MPI performance
• Hybrid codes combine parallel MPI
processes with threading for a more
efficient exploitation of computing
resources
$> mpirun -n N amplxe-cl -result-dir hotspots_N -collect hotspots -- poisson.x
ITAC analysis showed us that MPI functions are the hotspots.
But which MPI function?
The MPI_Waitall function actually has the largest contribution to the application run time.
Step 3 - Analyzing Intra-Node performance
using Intel® VTune™ Amplifier XE
Hotspot
Functions
Step 3 - Analyzing Intra-Node performance
using Intel® VTune™ Amplifier XE
Hotspot MPI/System
functions
Step 3 - Analyzing Intra-Node performance
using Intel® VTune™ Amplifier XE
Callstack #3
Callstack #2
Callstack #1
Step 3 - Analyzing Intra-Node performance
using Intel® VTune™ Amplifier XE – Bandwidth Analysis
# Select one analysis type; a later export overrides an earlier one
export VTUNE_COLLECT=snb-bandwidth
export VTUNE_COLLECT=hotspots
export VTUNE_FLAGS=-start-paused
mpiexec.hydra $MPI_FLAGS -n 1 amplxe-cl $VTUNE_FLAGS --result-dir ${VTUNE_COLLECT}_$1 --collect $VTUNE_COLLECT
Step 3 - Analyzing Intra-Node performance
using Intel® VTune™ Amplifier XE - Bandwidth Analysis
Step 3 - Analyzing Intra-Node performance
using Intel® VTune™ Amplifier XE
Bandwidth vs. Parallel Efficiency on a first node
[Chart: bandwidth on one node in GB/s (axis 0 to 100) and parallel efficiency (axis 0 to 1.4) versus number of ranks]
Number of ranks | 1 | 6 | 12 | 24 | 48 | 72 | 96
Bandwidth, GB/s | 8.908 | 61.836 | 83.548 | 86.569 | 80.717 | 30.392 | 15.839
MPI 3.0 Support with Intel® MPI
Intel® MPI Library 5.0 and Intel® Trace Analyzer and Collector 9.0
How do you spell MPI?
A de facto standard for communicating
between processes of a parallel program on
a distributed memory system
 Standardized
 Supported on almost all platforms
 Portable
 No need to modify your code when
porting
 Performance opportunities
 Vendor MPIs can exploit native
hardware features
 Functionality
 Over 125 routines defined by a
committee
Example (C)
#include <stdio.h>
#include "mpi.h"                      /* MPI include file */
int main(int argc, char *argv[]) {
    int numtasks, rank;
    MPI_Init(&argc, &argv);           /* Initialize MPI environment */
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Number of tasks= %d My rank= %d\n", numtasks, rank);
    /******* do some work and make MPI calls *******/
    MPI_Finalize();                   /* Terminate MPI environment */
    return 0;
}
What is in MPI-3?
Topic | Motivation | Main Result
Collective Operations | Collective performance | Non-Blocking & Sparse Collectives
Remote Memory Access | Cache coherence, PGAS support | Fast RMA
Backward Compatibility | Buffers > 2 GB | Large buffer support, const buffers
Fortran Bindings | Fortran 2008 | Fortran 2008 bindings, removed C++ bindings
Tools Support | PMPI Limitations | MPI_T Interface
Hybrid Programming | Core count growth | MPI_Mprobe, shared memory windows
Fault Tolerance | Node count growth | None. Next time?
I want a complete comm/comp overlap
Problem
 Computation/communication overlap is
not possible with the blocking collective
operations
Solution: Non-blocking Collectives
 Add non-blocking equivalents for
existing blocking collectives
 Do not mix non-blocking and blocking
collectives on different ranks in the
same operation
Example (C)
MPI_Request req;
int done = 0;
// Start synchronization
MPI_Ibarrier(comm, &req);
// Do extra computation
…
// Complete synchronization: MPI_Test only polls, so repeat it
// (or call MPI_Wait) until the barrier has completed
while (!done) MPI_Test(&req, &done, MPI_STATUS_IGNORE);
I have a sparse communication network
Problem
 Neighbor exchanges are poorly served
by the current collective operations
(memory and performance losses)
Solution: Sparse Collectives
 Add blocking and non-blocking
Allgather* and Alltoall*
collectives based on neighborhoods
Example (FORTRAN)
call MPI_NEIGHBOR_ALLGATHER(&
& sendbuf, sendcount, sendtype,&
& recvbuf, recvcount, recvtype,&
& graph_comm, ierror)
I want to use one-sided calls to reduce sync overhead
Problem
 MPI-2 one-sided operations are too
general to work efficiently on cache
coherent systems and compete with
PGAS languages
Solution: Fast Remote Memory Access
 Eliminate unnecessary overheads by
adding a ‘unified’ memory model
 Simplify usage model by supporting the
MPI_Request non-blocking call, extra
synchronization calls, relaxed
restrictions, shared memory, and much
more
Example (FORTRAN)
call MPI_WIN_GET_ATTR(win, MPI_WIN_MODEL, &
memory_model, flag, ierror)
if (memory_model .eq. MPI_WIN_UNIFIED) then
! private and public copies coincide
end if
I’m sending *very* large messages
Problem
 Original MPI counts are limited to 2
Gigaunits, while applications want to
send much more
Solution: Large Buffer Support
 “Hide” the long counts inside the
derived MPI datatypes
 Add new datatype query calls to
manipulate long counts
Example (C)
// MPI_Count may be, e.g.,
// 64-bit long
MPI_Count mpi_count;
MPI_Get_elements_x(&status,
datatype, &mpi_count);
None of these apply to me. What else you got?
I have a hybrid application
 Create a communicator inside a shared memory domain (intranode, via
MPI_Comm_split_type)
 Use the new MPI_Mprobe calls
I need to know what architecture I’m running on
 Predefined info object MPI_INFO_ENV allows for environment query
I’m using the C++ bindings
 Tough luck. C++ bindings have been removed from the standard.
Tell me more about this Intel® MPI Library
Optimized MPI application performance
 Application-specific tuning
 Automatic tuning
Lower Latency and Multi-vendor interoperability
 Optimized support for latest OFED* features
Faster MPI communication
 Optimized collectives
Sustainable scalability beyond 120K cores
 Native InfiniBand* interface allows for reduced
memory load and higher bandwidth
Simplify and Accelerate Clusters
 Intel® Cluster Ready compliance
Intel® MPI Library 5.0 & Intel® Trace Analyzer and Collector 9.0
Beta Nov 2013
Intel® MPI Library
Initial MPI-3.0 Support
 Non-blocking Collectives
 Fast RMA
 Large Counts
ABI compatibility with existing Intel® MPI
Library applications
Intel® Trace Analyzer and Collector
Initial MPI-3.0 Support
Automatic Performance Assistant
 Detect common MPI performance issues
 Automated tips on potential solutions
What can I do?
Register for the Beta program (it’s free)
Start playing around with MPI-3.0
Come talk to us about it:
 Visit the Intel® Clusters and HPC Technology forums
 Check out the Intel® MPI Library product page (LEARN tab) for articles, examples, etc.
bit.ly/impi50-beta
software.intel.com/en-us/forums/intel-clusters-and-hpc-technology
www.intel.com/go/mpi
Summary
Intel Technical Computing
Compute enables a New Scientific Method*
Technical computing and R&D workflow innovation
[Diagram: two iterative workflows built from the stages Hypothesis, Prediction, Physical Prototyping, Analysis, Conclusion, and Refinement; the compute-enabled workflow adds Modeling & Simulation and Experiment Refinement. Labels: Accelerates the Method, Iterate]
* Satava, Richard M. “The Scientific Method Is Dead-Long Live the (New) Scientific Method.” Journal of Surgical Innovation (June 2005).
Intel® Technical Computing
Millions of Applications… Plus Yours
Delivering performance across generations and platforms
[Diagram labels: Today; Development Tools; Performance/Optimizations; Standards]
Intel Architecture ecosystem: Increasing the return and longevity of your
application investment
Tablet
Desktop
Intel® Xeon ®
Workstation
Local Cluster
Computation Large Clusters
Common underlying architecture and software tools
scales investments across technical computing platforms
Intel® Technical Computing
The Right Tool for the Job: A Continuum of Computing
How do you get breakthroughs for your investment
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO
ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND
INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT,
COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software,
operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of
that product when combined with other products.
Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of
Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not
unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use
with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.
Notice revision #20110804

Último (20)

Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 

Maximizing cluster performance with Intel Cluster Studio XE

  • 1. Getting the maximum performance in distributed clusters: Intel Cluster Studio XE. Werner Krotz-Vogel, Development Products Division, Software and Services Group. May 2014.
  • 2. Intel® Software Conference 2014. Agenda: Performance Tuning Methodology Overview; Quick overview of Intel® Trace Analyzer and Collector (what’s new in 2015 beta); Quick overview of Intel® VTune™ Amplifier XE (what’s new in 2015 beta); Performance Tuning Methodology using ITAC and VTune™ Amplifier XE, demonstrated on the Poisson example; MPI 3.0 Support with Intel® MPI; Summary.
  • 3. Copyright© 2014, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Performance Tuning Methodology using ITAC and VTune™ Amplifier XE. Step 1: Cluster Level Analysis & Algorithmic Tuning. Step 2: Run-time Analysis & Tuning. Step 3: Intra-Node and Single Node Level Analysis.
  • 4. Intel® Trace Analyzer and Collector 8.1: What’s new?
  • 5. Intel® Trace Analyzer and Collector 8.1 Update 3: What’s New. Fresh look-and-feel for the Intel® Trace Analyzer graphical interface: new toolbars, icons, and dialogs for a more streamlined analysis flow; addition of a Welcome Page and easy access to past projects. Support for the dynamic profiling tool command: MPI_Pcontrol supported. Support for the MPI 2.x standard. New GUI-based installer on Linux*.
  • 6. Intel® MPI Library 5.0 Beta: Key Features
  • 7. What’s New in Intel MPI Library 5.0 Beta. Initial MPI-3.0 support: non-blocking and sparse collectives; Fast Remote Memory Access (RMA); large buffer support (e.g. > 2 GB) via mpi_count derived type. ABI compatibility with existing Intel® MPI Library and other MPICH*-based applications.
  • 8. Support for the latest MPI-3.0 features. Use non-blocking collectives for a complete comm/comp overlap. More efficient one-sided communication via new Fast Remote Memory Access functionality. Example (C):
      // Start synchronization
      MPI_Ibarrier(comm, &req);
      // Do extra computation
      …
      // Complete synchronization
      MPI_Test(&req, …);
  • 9. Intel® Trace Analyzer and Collector 9.0 Beta: Key Features
  • 10. What’s New in Intel Trace Analyzer and Collector 9.0 Beta. Initial MPI-3.0 support. Automatic Performance Assistant: detect common MPI performance issues; automated tips on potential solutions.
  • 11. Intel® Trace Analyzer and Collector: Optimize MPI Communications (part of Intel® Cluster Studio XE). Visually understand parallel application behavior: communication patterns; hotspots; load balance. MPI checking: detect deadlocks; data corruption; errors in parameters, data types, etc. [Chart: Intel® Trace Analyzer and Collector (processes) by year, 2010 to 2012; vertical axis 0 to 7000 processes.]
  • 12. ITAC 9.0: What’s New. Collection: full MPI-3 support; new mpirun options to customize collection; experimental TIME-WINDOWS support; system calls profiling. Analysis: new Performance Assistant; visual appearance enhancement; new Summary Page. New tutorials.
  • 13. New mpirun data collection keys. Reduce the trace file size or the number of Message Checker reports (supported only at runtime with the Hydra process manager). -trace-collectives: collect info only about collective operations. -trace-pt2pt: collect info only about point-to-point operations. Example: $ [mpirun|mpiexec] -trace-pt2pt -n 4 ./myApp
  • 14. System calls profiling (1|2). Linux* only. Capability to trace the following system calls: access clearerr close creat dup dup2 fclose fdopen feof ferror fflush fgetc fgetpos fgets fileno fopen fprintf fputc fputs fread freopen fseek fsetpos ftell fwrite getc getchar gets lseek lseek64 mkfifo perror pipe poll printf putc putchar puts read readv remove rename rewind setbuf setvbuf sync tmpfile tmpnam umask ungetc vfprintf vprintf write writev
  • 15. System calls profiling (2|2). To turn on system calls collection, add any of the following lines to the ITC configuration file. To collect all system calls: ACTIVITY SYSTEM on. To collect one specific function: STATE SYSTEM:<func_name> ON. View system calls using ITA (new group SYSTEM, which can be expanded in the ordinary way).
  • 16. New Summary Page. At-a-glance view of MPI activity and hints on how to start the analysis of the application.
  • 17. New Performance Assistant. Automatic highlights of performance issues, both in GUI and CLI. Currently 4 types of issues are supported, see screenshots.
  • 18. Intel® VTune™ Amplifier XE 2013: Key Features
  • 19. Intel® VTune™ Amplifier XE: Tune Applications for Scalable Multicore Performance. Fast, accurate performance profiles: hotspot (statistical call tree); call counts (statistical); hardware-event sampling. Thread profiling: visualize thread interactions on timeline; balance workloads. Easy set-up: pre-defined performance profiles; use a normal production build. Find answers fast: filter extraneous data; view results on the source / assembly. Compatible: Microsoft, GCC, Intel compilers; C/C++, Fortran, Assembly, .NET, Java; latest Intel® processors and compatible processors (1). Windows or Linux: Visual Studio integration (Windows); standalone user i/f and command line; 32 and 64-bit. (1) IA32 and Intel® 64 architectures. Many features work with compatible processors. Event based sampling requires a genuine Intel® Processor.
  • 20. What’s New in 2013 SP1? Intel® VTune™ Amplifier XE. More profiling data: Intel® Xeon Phi™ memory and vectorization profiling; Gen graphics tuning (GT event counting, offload, OpenCL*, …). Better data mining, find answers faster: search added to all grids; timeline sorting, band height, time scale configuration; loop hierarchy, overhead and spin time metrics; OpenMP* 4.0 affinity controls, tasking and scalability analysis. Easier to use: attach to a running Java process; contextual help for hardware events and performance metrics; easier generation of command line options from the user i/f. New OS & processor support: Intel® Xeon Phi™, Haswell on Windows* & Linux*; Windows 8 desktop and Visual Studio* 2012; collection on Windows UI and Windows Blue; latest Linux distributions. New since the first 2013 release. Some features released in earlier updates.
  • 21. Intel® VTune™ Amplifier XE 2015 Beta: Key Features
  • 22. What’s New in Intel VTune™ Amplifier XE 2015 Beta: GPU analysis; TSX analysis; remote collection in the GUI via ssh; Mac OS* GUI data viewer (no collection); CSV import and custom collector support; timeline grouping.
  • 23. Performance Tuning Methodology using ITAC and VTune™ Amplifier XE
  • 24. Performance Tuning Methodology using ITAC and VTune™ Amplifier XE. Step 1: Cluster Level Analysis & Algorithmic Tuning (Step 2: Run-time Analysis & Tuning; Step 3: Single Node Level Analysis). Global analysis of the whole application gives first indications of performance issues: run time and scaling analysis; message passing performance analysis on an inter/intra-node level, including finding MPI hotspots; network idealization that yields an imbalance diagram, providing guidance on how to proceed. Algorithmic/source code changes can then be implemented for better message passing practices or for improving the load balance of the application. Fix imbalances in communication patterns of MPI and non-MPI routines; e.g., slow sequential I/O often causes imbalances. Remove unnecessary synchronization; e.g., message passing patterns using blocking send and receive may cause a send/receive order that increases wait times, which may be resolved by using non-blocking MPI_Isend/MPI_Irecv pairs.
  • 25. Performance Tuning Methodology using ITAC and VTune™ Amplifier XE. Step 2: Run-time Analysis & Tuning (Step 1: Cluster Level Analysis & Algorithmic Tuning; Step 3: Single Node Level Analysis). Intel MPI can be tuned without changing the source code using: environment variables for tuning collective operations, e.g., I_MPI_ADJUST_ALLREDUCE; environment variables for changing the message passing characteristics, e.g., I_MPI_DAPL_DIRECT_COPY_THRESHOLD. It is also possible to change the MPI process/rank to node mapping for a better inter/intra-node communication balance.
  • 26. Performance Tuning Methodology using ITAC and VTune™ Amplifier XE. Step 3: Single Node Level Analysis (Step 1: Cluster Level Analysis & Algorithmic Tuning; Step 2: Run-time Analysis & Tuning). Single node tuning is necessary for serial and parallel performance optimizations. It is important for improving overall application scalability and reducing load imbalance. Bandwidth analysis on the node is important for understanding deficiencies in cluster level scaling. Example: conduct a hotspot analysis for each rank, or for the critical ranks identified in Steps 1 and 2. The call stack information for a specific MPI routine may also help refine the analysis in Step 1.
  • 27. Simple Scaling analysis. A first step may be to just run the program for various numbers of processes [p] and record the timings T[p]. Speedup S is defined as: S[p] = T[1]/T[p]. Efficiency E is defined as: E[p] = S[p]/p. An ideal parallel program will show: S[p] = p and E[p] = 1.
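
The two definitions on this slide can be sketched in C as follows. This is a minimal illustration, not code from the deck: the function names `speedup` and `efficiency` are hypothetical, and the timings would come from your own runs.

```c
#include <assert.h>

/* Speedup S[p] = T[1] / T[p].
 * t1: run time with one process, tp: run time with p processes
 * (both in the same unit, e.g. seconds). */
double speedup(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

/* Parallel efficiency E[p] = S[p] / p. */
double efficiency(double t1, double tp, int p)
{
    assert(p > 0);
    return speedup(t1, tp) / (double)p;
}
```

For example, with made-up timings T[1] = 100 s and T[4] = 25 s, `speedup(100.0, 25.0)` gives 4 and `efficiency(100.0, 25.0, 4)` gives 1, which is the ideal case described on the slide.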
  • 28. Example: Poisson solver on a square 3200x3200 computational grid. Poisson solver: a simple implementation of a Poisson solver (e.g. heat equation). The 3200x3200 grid points can be distributed to MPI ranks using a 2D process grid (e.g., in the case of 4 ranks, 2 rows x 2 columns of processes) or a 1D distribution with 4 rows x 1 column or 1 row x 4 columns. [Diagram: 1D distribution, ranks 0 1 2 3 side by side, 3200x800 local grid points per MPI rank; 2D distribution, ranks 2 3 above ranks 0 1, 1600x1600 local grid points per MPI rank.] Benchmark environment: Intel® Xeon® E5 v2 processors (Ivy Town) with 12 cores; Frequency: (value missing); 2 processors per node (24 cores per node); Mellanox QDR Infiniband; operating system: RedHat EL 6.1; Intel® MPI 4.1.
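
The local grid sizes quoted on this slide follow from dividing each grid dimension by the number of processes along it. A minimal C sketch, assuming a plain block distribution: the helper name `local_extent` is hypothetical, and the remainder handling is an assumption added for completeness, since the slide only shows evenly divisible cases.

```c
#include <assert.h>

/* Number of grid points along one dimension owned by process `index`
 * when `n` points are split over `parts` processes.
 * Assumption (not from the slides): a remainder is handed out one extra
 * point at a time to the lowest-numbered processes. */
int local_extent(int n, int parts, int index)
{
    assert(n >= 0 && parts > 0 && index >= 0 && index < parts);
    int base = n / parts;   /* points every process gets */
    int extra = n % parts;  /* leftover points to distribute */
    return base + (index < extra ? 1 : 0);
}
```

With the slide's numbers, a 4x1 process grid gives `local_extent(3200, 4, r)` = 800 along the split dimension and 3200 along the other, and a 2x2 grid gives `local_extent(3200, 2, r)` = 1600 in both dimensions.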
  • 29. Example: Poisson solver on a square 3200x3200 computational grid. Analysis of the application with different numbers of processes (p). Speed-up: S[p] = T[1]/T[p]. Parallel Efficiency: E[p] = S[p]/p. The speedup curves for the 2D square process grid and the 1D process grids (1D 1xN and 1D Nx1) show some differences in scaling.
  • 30. Step 1 - Cluster Level Analysis using Intel Trace Analyzer and Collector. The breakdown of total run time into compute and MPI communication, T[p] = T_comp[p] + T_mpi[p], can be accessed through the trace analyzer’s Function Profile (Intel® Trace Analyzer displays the Function Profile chart when opening a trace file). The trace file can be generated by adding the flag “-trace” to the mpirun or mpiexec.hydra command. [Screenshot: Intel Trace Analyzer and Collector Flat Profile.] Annotations: the trace analyzer API was used to time just 100 of 1653 iterations; VT_API is paused time. Timing is accumulated over ranks. Application time is T_comp. One column shows the average time per process; it can be added by right click and Function Profile Settings.
  • 31. Example - measuring MPI times with ITAC Function Profile. MPI breakdown of a real application (VASP). All MPI functions are listed and may be sorted by a click on the top of each column.
  • 32. Step 1 - Cluster Level Analysis using Intel Trace Analyzer and Collector. Parallel efficiency can be calculated and plotted separately for the compute time of the application. Parallel Efficiency: E[p] = S[p]/p. One can see that MPI time is insignificant up to 48 cores (the equivalent of two nodes). Above 96 ranks (4 nodes), pure computational application performance also yields superlinear scaling. However, at the same data point, at around 96 ranks, MPI time becomes the main reason for low efficiency.
  • 33. Step 1 - Cluster Level Analysis using Intel Trace Analyzer and Collector. Message Passing Profile, 2D case (48x32). The Message Passing Profile displays various characteristics of message passing in a sender/receiver matrix; it can be obtained through Charts -> Message Profile Chart. Because 1536 ranks generate a huge matrix, we may fuse all ranks on each node: Advanced -> Process Aggregation -> All Nodes. The diagonal now shows the intra-node performance characteristics, while the off-diagonals show the inter-node statistics. Without Process Aggregation the diagonal is only filled if we send messages from rank n to the same rank n, which is usually not a good idea.
  • 34. Step 1 - Cluster Level Analysis using Intel Trace Analyzer and Collector. Message Passing Profile, 1D case (1536x1). In the 1D case we have far fewer messages, and they are much larger, which leads to a much higher average transfer rate. In the 1D case we also transfer a much larger amount of data.
  • 35. Algorithm and Network evaluation. ITAC shows the timing of all MPI routines used by a program. The timing of these routines may be due to network transfer times caused by bandwidth limitations. The other possibility is waiting time caused by the algorithm: load imbalance or dependencies.
  • 36. A simple Network Model. The simplest network model defines: latency L = transfer time for a 0-byte message; bandwidth BW = transfer rate for (asymptotically) large messages. The transfer time is T_trans[V] = L + (1/BW)*V (V = message volume).
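
The model on this slide is a one-line formula, sketched here in C. The function name `transfer_time` and the choice of units (seconds, bytes per second, bytes) are assumptions made for the illustration.

```c
#include <assert.h>

/* Simple network model: T_trans[V] = L + (1/BW) * V.
 * latency_s:     L, time to transfer a 0-byte message, in seconds
 * bandwidth_bps: BW, asymptotic transfer rate, in bytes per second
 * volume_b:      V, message volume, in bytes */
double transfer_time(double latency_s, double bandwidth_bps, double volume_b)
{
    assert(bandwidth_bps > 0.0 && volume_b >= 0.0);
    return latency_s + volume_b / bandwidth_bps;
}
```

A 0-byte message costs exactly the latency in this model, and for large messages the `volume_b / bandwidth_bps` term dominates.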
  • 37. Ideal Network Simulator. It is extremely complicated to simulate a realistic network! An extreme case, the ideal network, may be simulated by setting all transfer times to 0. This would mean L = 0 and BW = ∞ for the simple model. ITAC offers an ideal network simulation with transfer times set to 0. Compute times (non-MPI) will stay the same.
  • 38. Step 1 - Cluster Level Analysis using Intel Trace Analyzer and Collector. Idealization of network and the Load Imbalance Diagram. Employing the ideal network simulator (invoked through the Advanced -> Idealization menu) allows us to separate the network stack's impact on total MPI performance from algorithmic inefficiencies like imbalance and dependencies. A simple network model for the transfer time as a function of message volume V is T_trans[V] = L + (1/BW)*V. L is latency, defined as the time needed to transfer a 0-byte message. BW is the transfer rate for asymptotically large messages.
  • 39. Step 2 - Runtime Analysis. It is possible to improve MPI performance without changing the source code. This can be done by using Intel MPI environment variables or by changing the process mapping of ranks to compute nodes. Process to node mapping can be altered by advanced methods like machine files or configuration files, or by reordering the ranks inside a communicator. One option is to start the tuning by concentrating on global operations. Set the environment variable I_MPI_DEBUG to 5: this prints valuable information about the variables used, network fabrics, and process placement. Setting I_MPI_DEBUG to 6 will further reveal the default algorithms for collective operations.
  • 40. Step 2 - Runtime Analysis. The Intel MPI reference guide reveals 8 different algorithms for MPI_Allreduce. The algorithm can easily be changed by setting the environment variable I_MPI_ADJUST_ALLREDUCE to an integer value in the range of 1-8.
  • 41. Step 3 - Analyzing Intra-Node performance using Intel® VTune™ Amplifier XE. Intel Trace Analyzer and Collector is not sufficient for hybrid applications due to its primary focus on MPI performance. Hybrid codes combine parallel MPI processes with threading for a more efficient exploitation of computing resources. $> mpirun -n N amplxe-cl -result-dir hotspots_N -collect hotspots -- poisson.x ITAC analysis showed us which MPI functions are the hotspots. But which MPI function? The MPI_Waitall function actually has the largest contribution to the application run time.
  • 42. Step 3 - Analyzing Intra-Node performance using Intel® VTune™ Amplifier XE. Hotspot Functions
  • 43. Step 3 - Analyzing Intra-Node performance using Intel® VTune™ Amplifier XE. Hotspot MPI/System functions
  • 44. Step 3 - Analyzing Intra-Node performance using Intel® VTune™ Amplifier XE. [Screenshot labels: Callstack #3, Callstack #2, Callstack #1.]
  • 45. Step 3 - Analyzing Intra-Node performance using Intel® VTune™ Amplifier XE - Bandwidth Analysis:
      export VTUNE_COLLECT=snb-bandwidth
      export VTUNE_COLLECT=hotspots
      export VTUNE_FLAGS=-start-paused
      mpiexec.hydra $MPI_FLAGS -n 1 amplxe-cl $VTUNE_FLAGS --result-dir ${VTUNE_COLLECT}_$1 --collect $VTUNE_COLLECT
  • 46. Step 3 - Analyzing Intra-Node performance using Intel® VTune™ Amplifier XE - Bandwidth Analysis
  • 47. Step 3 - Analyzing Intra-Node performance using Intel® VTune™ Amplifier XE. [Chart: Bandwidth vs. Parallel Efficiency on a first node. Horizontal axis: number of ranks (1, 6, 12, 24, 48, 72, 96). Vertical axes: bandwidth on one node, GB/s (0 to 100) and parallel efficiency (0 to 1.4). Data labels: 8.908, 61.836, 83.548, 86.569, 80.717, 30.392, 15.839.]
  • 48. MPI 3.0 Support with Intel® MPI Intel® MPI Library 5.0 and Intel® Trace Analyzer and Collector 9.0
  • 49. How do you spell MPI? A de facto standard for communicating between processes of a parallel program on a distributed memory system. Standardized: supported on almost all platforms. Portable: no need to modify your code when porting. Performance opportunities: vendor MPIs can exploit native hardware features. Functionality: over 125 routines defined by a committee. Example (C):
      #include <stdio.h>
      #include "mpi.h"                        /* MPI include file */
      int main(int argc, char **argv) {
          int numtasks, rank;
          MPI_Init(&argc, &argv);             /* Initialize MPI environment */
          MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          printf("Number of tasks= %d My rank= %d\n", numtasks, rank);
          /******* do some work and make MPI calls *******/
          MPI_Finalize();                     /* Terminate MPI environment */
          return 0;
      }
  • 50. What is in MPI-3?
      Topic | Motivation | Main Result
      Collective Operations | Collective performance | Non-Blocking & Sparse Collectives
      Remote Memory Access | Cache coherence, PGAS support | Fast RMA
      Backward Compatibility | Buffers > 2 GB | Large buffer support, const buffers
      Fortran Bindings | Fortran 2008 | Fortran 2008 bindings; removed C++ bindings
      Tools Support | PMPI Limitations | MPIT Interface
      Hybrid Programming | Core count growth | MPI_Mprobe, shared memory windows
      Fault Tolerance | Node count growth | None. Next time?
  • 51. I want a complete comm/comp overlap. Problem: computation/communication overlap is not possible with the blocking collective operations. Solution: Non-blocking Collectives. Add non-blocking equivalents for existing blocking collectives. Do not mix non-blocking and blocking collectives on different ranks in the same operation. Example (C):
      // Start synchronization
      MPI_Ibarrier(comm, &req);
      // Do extra computation
      …
      // Complete synchronization
      MPI_Test(&req, …);
  • 52. I have a sparse communication network. Problem: neighbor exchanges are poorly served by the current collective operations (memory and performance losses). Solution: Sparse Collectives. Add blocking and non-blocking Allgather* and Alltoall* collectives based on neighborhoods. Example (FORTRAN):
      call MPI_NEIGHBOR_ALLGATHER(&
      &   sendbuf, sendcount, sendtype,&
      &   recvbuf, recvcount, recvtype,&
      &   graph_comm, ierror)
  • 53. I want to use one-sided calls to reduce sync overhead. Problem: MPI-2 one-sided operations are too general to work efficiently on cache coherent systems and compete with PGAS languages. Solution: Fast Remote Memory Access. Eliminate unnecessary overheads by adding a ‘unified’ memory model. Simplify the usage model by supporting the MPI_Request non-blocking call, extra synchronization calls, relaxed restrictions, shared memory, and much more. Example (FORTRAN):
      call MPI_WIN_GET_ATTR(win, MPI_WIN_MODEL, &
           memory_model, flag, ierror)
      if (memory_model .eq. MPI_WIN_UNIFIED) then
      ! private and public copies coincide
  • 54. I’m sending *very* large messages
      Problem
      - Original MPI counts are C int values, limited to 2 Gigaunits (2^31 - 1 elements), while applications want to send much more.
      Solution: Large Buffer Support
      - “Hide” the long counts inside the derived MPI datatypes.
      - Add new datatype query calls to manipulate long counts.
      Example (C):
          // mpi_count may be, e.g.,
          // 64-bit long
          MPI_Get_elements_x(&status, datatype, &mpi_count);
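The arithmetic behind "hiding" a long count can be sketched without any MPI calls. The snippet below is an assumption-laden illustration: `count_split` and `split_count` are hypothetical names, and the idea is simply to express a 64-bit element count as a number of fixed-size chunks plus a remainder, each of which fits into an `int`.

```c
#include <assert.h>
#include <stdint.h>

/* Result of splitting a 64-bit element count into pieces that each fit
 * into the int count argument of the classic MPI calls. */
typedef struct {
    int64_t full_chunks; /* number of chunks of exactly `chunk` elements */
    int     remainder;   /* leftover elements, 0 <= remainder < chunk    */
} count_split;

/* Splits `total` elements into chunks of `chunk` elements each. */
count_split split_count(int64_t total, int chunk)
{
    assert(total >= 0 && chunk > 0);

    count_split s;
    s.full_chunks = total / chunk;
    s.remainder   = (int)(total % chunk);
    return s;
}
```

For example, 5,000,000,000 elements with a chunk of 2^30 elements give 4 full chunks and a remainder of 705,032,704. One chunk could then be described once with a derived datatype such as one built by `MPI_Type_contiguous`, so that the count passed to the send call is the small chunk number rather than the raw element count.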
  • 55. None of these apply to me. What else you got?
      I have a hybrid application
      - Create a communicator inside a shared memory domain (intranode, via MPI_Comm_split_type).
      - Use the new MPI_Mprobe calls.
      I need to know what architecture I’m running on
      - Predefined info object MPI_INFO_ENV allows for environment query.
      I’m using the C++ bindings
      - Tough luck. C++ bindings have been removed from the standard.
  • 56. Tell me more about this Intel® MPI Library
      Optimized MPI application performance
      - Application-specific tuning
      - Automatic tuning
      Lower Latency and Multi-vendor interoperability
      - Optimized support for latest OFED* features
      Faster MPI communication
      - Optimized collectives
      Sustainable scalability beyond 120K cores
      - Native InfiniBand* interface allows for reduced memory load and higher bandwidth
      Simplify and Accelerate Clusters
      - Intel® Cluster Ready compliance
  • 58. Intel® MPI Library 5.0 & Intel® Trace Analyzer and Collector 9.0 Beta (Nov 2013)
      Intel® MPI Library
      - Initial MPI-3.0 Support: Non-blocking Collectives, Fast RMA, Large Counts
      - ABI compatibility with existing Intel® MPI Library applications
      Intel® Trace Analyzer and Collector
      - Initial MPI-3.0 Support
      - Automatic Performance Assistant: detects common MPI performance issues and gives automated tips on potential solutions
  • 59. What can I do?
      - Register for the Beta program (it’s free): bit.ly/impi50-beta
      - Start playing around with MPI-3.0.
      - Come talk to us about it:
          - Visit the Intel® Clusters and HPC Technology forums: software.intel.com/en-us/forums/intel-clusters-and-hpc-technology
          - Check out the Intel® MPI Library product page (LEARN tab) for articles, examples, etc.: www.intel.com/go/mpi
  • 61. Intel Technical Computing: Compute enables a New Scientific Method*
      Technical computing and R&D workflow innovation
      [Figure: scientific-method workflow with the stages Hypothesis, Prediction, Modeling & Simulation, Experiment, Physical Prototyping, Analysis, Conclusion, and Refinement; labels "Accelerates the Method" and "Iterate"]
      1. Satava, Richard M. “The Scientific Method Is Dead-Long Live the (New) Scientific Method.” Journal of Surgical Innovation (June 2005).
  • 62. Intel® Technical Computing: Millions of Applications… Plus Yours
      Delivering performance across generations and platforms
      [Figure labels: Today; Development Tools; Performance/Optimizations; Standards]
      Intel Architecture ecosystem: increasing the return and longevity of your application investment
  • 63. Intel® Technical Computing. The Right Tool for the Job: A Continuum of Computing
      How do you get breakthroughs for your investment?
      [Figure labels: Tablet; Desktop; Intel® Xeon® Workstation; Local Cluster Computation; Large Clusters]
      A common underlying architecture and common software tools scale investments across technical computing platforms.
  • 65. Legal Disclaimer & Optimization Notice
      INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
      Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
      Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
      Optimization Notice
      Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
      Notice revision #20110804