Optimizing Hortonworks Apache Spark machine learning workloads for contemporary Open Platforms
Raj Krishnamurthy, Indrajit Poddar (I.P), IBM Systems
Animesh Trivedi, Bernard Metzler, IBM Research
© International Business Machines (IBM) 2017
Please Note:
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s
sole discretion.
• Information regarding potential future products is intended to outline our general product direction and it should not be
relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver
any material, code or functionality. Information about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products remains at our sole
discretion.
• Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The
actual throughput or performance that any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage
configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results
similar to those stated here.
2
Agenda
Spark, Machine Learning and Deep Learning Overview
Why OpenPOWER?
Deep Learning with OpenPOWER GPUs
Spark Machine Learning performance tuning with OpenPOWER CPUs
I/O Optimization for the Spark TeraSort benchmark
3
What is Apache Spark?
• Unified Analytics Platform
– Combines streaming, graph, machine learning and SQL analytics on a single platform
– Simplified, multi-language
programming model
– Interactive and Batch
• In-Memory Design
– Pipelines multiple iterations on single
copy of data in memory
– Superior Performance
– Natural Successor to MapReduce
Fast and general engine for
large-scale data processing
Spark Core API
R Scala SQL Python Java
Spark SQL Streaming MLlib GraphX
4
Machine Learning and Deep Learning (ML/DL)
What you and I (our brains) do without even thinking about it… we recognize a bicycle
Apr 7, 2017
5
6
Now machines are learning the way we learn…
From "Texture of the Nervous
System of Man and the Vertebrates"
by Santiago Ramón y Cajal.
Artificial Neural Networks
But training needs a lot of computational resources
Training can take hours, days or weeks
Input data and model sizes are becoming larger than ever (e.g. video input, billions of features, etc.)
Easy scale-out with:
But model training is not easy to distribute
Moore’s law is dying
Resulting in need for:
• whole system optimization
• offloaded computation
• accelerators, and
• higher memory bandwidth systems
Real-time analytics with:
7
Today’s challenges demand whole system innovation
[Figure: data growth from 2010 to 2020 reaches 44 zettabytes, dominated by unstructured data; data holds competitive value. Moore’s Law price/performance flattens for processor technology alone; full system and stack open innovation is required across firmware/OS, accelerators, software, storage, and network]
8
9
OpenPOWER: open hardware for high performance
Systems designed for
big data analytics
and superior cloud economics
Up to:
12 cores per CPU
96 hardware threads per CPU
1 TB RAM
7.6 Tb/s combined I/O bandwidth
GPUs and FPGAs coming…
OpenPOWER
Traditional
Intel x86
http://www.softlayer.com/POWER-SERVERS
https://mc.jarvice.com/
10
OpenPOWER Ecosystem – Members
Memory
Interface
Control
Memory
IBM & Partner
Devices
CAPI/PCI
DMI
Cores
• 12 cores / 8 threads per core
• TDP: 130W and 190W
• 64K data cache, 32K instruction cache
Accelerators
• Crypto & memory expansion
• Transactional Memory
Caches
• 512 KB SRAM L2 / core
• 96 MB eDRAM shared L3
Memory Subsystem
• Memory buffers with 128MB Cache
• ~70ns latency to memory
Bus Interfaces
• Durable Memory attach Interface (DMI)
• Integrated PCIe Gen3
• SMP Interconnect for up to 4 sockets
Virtual Addressing
•Accelerator can work with same memory
addresses that the processors use
•Pointers de-referenced same as the host
application
•Removes OS & device driver overhead
Hardware Managed Cache Coherence
•Enables the accelerator to participate in “Locks” as
a normal thread
•Lowers Latency over IO communication model
6 Hardware Partners developing with CAPI
Over 20 CAPI Solutions
• All listed here http://ibm.biz/powercapi
Examples of Available CAPI Solutions
• IBM Data Engine for NoSQL
• DRC Graphfind analytics
• Erasure Code Acceleration for Hadoop
Coherent Accelerator Processor Interface
(CAPI)
22 nm SOI, eDRAM, 15 metal layers, 650 mm²
SMP
http://openpowerfoundation.org/wp-content/uploads/2016/04/HardwareRevealFlyerFinal.pdf
Newly Announced OpenPOWER systems and solutions:
POWER8 Processor - Design
11
Introducing Minsky S822LC OpenPOWER system for HPC
first custom-built GPU accelerator server with NVLink
12
2.5x Faster CPU-GPU Data
Communication via NVLink
NVLink
80 GB/s
GPU
P8
GPU GPU
P8
GPU
PCIe
32 GB/s
GPU
x86
GPU GPU
x86
GPU
No NVLink between CPU &
GPU for x86 Servers: PCIe
Bottleneck
NVIDIA P100 Pascal GPU
POWER8 NVLink Server x86 Servers with PCIe
• Custom-built GPU Accelerator Server
• High-Speed NVLink Connections between
CPUs & GPUs and among GPUs
• Features novel NVIDIA P100 Pascal GPU
accelerator
M.Gschwind, Bringing the Deep Learning Revolution into the Enterprise
Deep Learning on OpenPOWER with GPUs
Transparent acceleration without code changes
13
Introducing PowerAI: Get started fast with Deep Learning
14
Enabled by High Performance Computing Infrastructure
• Package of pre-compiled major deep learning frameworks
• Easy to install & get started with deep learning, with enterprise-class support
• Optimized for performance to take advantage of NVLink
https://www.ibm.com/ms-en/marketplace/deep-learning-platform
Machine and Deep Learning analytics on OpenPOWER
no code changes needed!
15
ATLAS (Automatically Tuned Linear Algebra Software)
https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d-
44a4f27eba32/entry/DeepLearning4J_Deep_Learning_with_Java_Spark_and_Power?lang=en
OpenPOWER: GPU support
16
Credit: Kevin Klues, Mesosphere
Mesos supports GPU scheduling
Huge speed-ups with GPUs and OpenPOWER!
Enabling Accelerators/GPUs in the cloud stack
17
Deep Learning Training + Inference
Containers
and images
Accelerators
Clustering frameworks
TensorFlow on Tesla P100: PowerAI is 30% faster
18
IBM S822LC 20-core 2.86 GHz 512GB memory / 4 NVIDIA Tesla P100 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / TensorFlow 0.12.0 / Inception v3 benchmark (64-image minibatch)
Intel Broadwell E5-2640v4 20-core 2.6 GHz 512GB memory / 4 NVIDIA Tesla P100 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / TensorFlow 0.12.0 / Inception v3 benchmark (64-image minibatch)
Larger value is better
PowerAI vs DGX-1: 1.6x TensorFlow throughput per dollar
19
▪ TensorFlow 0.12 on the IBM PowerAI platform takes advantage
of the full capabilities of NVLink
▪ For image classification and analysis this means a 1.6X price
performance advantage relative to the NVIDIA DGX-1
System | Images / second | List price | $ / image / second
NVIDIA DGX-1 (8 P100 GPU, 512 GB mem) | 330 | $129,000 | $390
PowerAI (4 P100 GPU, 512 GB mem) | 273 | $67,000 | $241
Lower cost is better
NVLink and P100 advantage
20
• NVLink reduces communication time and overhead
• Incorporating the fastest GPU for deep learning
• Data gets from GPU-GPU, Memory-GPU faster, for shorter training times
ImageNet / AlexNet, minibatch size = 128:
• x86-based GPU system: 170 ms
• POWER8 + Tesla P100 + NVLink: 78 ms
IBM advantage: data communication and GPU performance
Spark Machine Learning performance tuning on OpenPOWER
What knobs can you tweak?
21
Spark on OpenPOWER
• Streaming and SQL benefit from High Thread Density and Concurrency
• Processing multiple packets of a stream and different stages of a message stream pipeline
• Processing multiple rows from a query
22
• Machine Learning benefits from Large Caches and Memory Bandwidth
• Iterative Algorithms on the same data
• Fewer core pipeline stalls and overall higher throughput
23
Spark on OpenPOWER
• Graph algorithms also benefit from Large Caches, Memory Bandwidth and Higher
Thread Strength
• Flexibility to go from 8 SMT threads per core to 4 or 2
• Manage Balance between thread performance and throughput
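The SMT flexibility above feeds directly into Spark sizing: with one executor per node, spark.executor.cores tracks hardware threads, i.e. physical cores × SMT mode. A toy helper (hypothetical, not from the deck) to enumerate sweep candidates:

```python
# Hypothetical helper: enumerate spark.executor.cores candidates for one
# executor per node, from the physical core count and POWER8's SMT modes.
def executor_core_candidates(physical_cores, smt_modes=(8, 4, 2)):
    """Return hardware-thread counts available per node at each SMT setting."""
    return [physical_cores * smt for smt in smt_modes]

# A 10-core socket yields these candidate executor-core settings:
print(executor_core_candidates(10))  # [80, 40, 20]
# A 12-core POWER8 CPU at SMT8 gives the 96 hardware threads cited earlier:
print(executor_core_candidates(12))  # [96, 48, 24]
```

The 80 / 40 / 24 executor-core values swept later in the deck sit in exactly this range.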
24
Spark on OpenPOWER
• Headroom
• Balanced resource utilization, more efficient scale-out
• Multi-tenant deployments
25
Spark on OpenPOWER
Roofline Spark Performance Model
26
Spark Tunables
Spark Performance
“Roofline” Performance
Navigation uses system-resource workload characterization and analysis to look for fundamental inefficiencies
“Roofline”
Good Enough
“Out of Box”
FOR 1 … MAX WORKERS
  FOR 1 … MAX CPUs PER NODE
    FOR 1 … MAX THREADS PER CPU
      FOR 1 … MAX PARTITIONS
Unwieldy & complicated (some respite in ML workloads from data sampling)
Performance Navigation Automation Script
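The nested FOR loops above can be flattened into a single sweep list; a minimal sketch (an assumed structure, not the actual navigation script) using itertools:

```python
import itertools

# Sketch of the "performance navigation" sweep: enumerate the nested loops
# over cluster knobs as a flat list of candidate configurations.
def sweep(max_workers, max_cpus_per_node, max_threads_per_cpu, partitions):
    grid = itertools.product(
        range(1, max_workers + 1),         # FOR 1 … MAX WORKERS
        range(1, max_cpus_per_node + 1),   # FOR 1 … MAX CPUs PER NODE
        range(1, max_threads_per_cpu + 1), # FOR 1 … MAX THREADS PER CPU
        partitions,                        # FOR 1 … MAX PARTITIONS (sampled)
    )
    return [
        {"workers": w, "cpus": c, "threads": t, "partitions": p}
        for w, c, t, p in grid
    ]

configs = sweep(2, 2, 2, [800, 1000, 1200])
print(len(configs))  # 2 * 2 * 2 * 3 = 24 configurations
```

Even this tiny grid shows why the full space is unwieldy: the count multiplies across every dimension, which is why the deck samples partitions rather than enumerating them.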
Performance Tuning Tips for a Machine Learning Workload
27
Top Down Approach
Methodology:
Alternating Least Squares Based
Matrix Factorization application
Optimization Process:
Spark executor Instances
Spark executor cores
Spark executor memory
Spark shuffle location and manager
RDD persistence storage level
Application
Large number of Spark tunables –
Spark executors and Spark cores ……
Default configurations
Out-of-box performance
Bottom-Up Approach
System Hardware
Characterizing the Workload
Through Resource monitoring
Custom Spark tunables from configuration sweeps
Roofline Performance
WorkFlow
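The five knobs in the optimization process map onto standard Spark configuration properties. A sketch with illustrative values only (these mirror one point from the sweep later in the deck, not a general recommendation):

```python
# Illustrative Spark properties covering the tuning knobs listed above.
# Values are examples, not recommendations; the right settings come out of
# the configuration sweep for a given workload and cluster.
mf_conf = {
    "spark.executor.instances": "6",         # one executor per worker node
    "spark.executor.cores": "24",
    "spark.executor.memory": "480g",
    "spark.shuffle.manager": "tungsten-sort",
    # RDD persistence level is set in application code, e.g.
    # rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
}
for key, value in sorted(mf_conf.items()):
    print(f"{key}={value}")
```

These would normally be passed via spark-submit --conf flags or spark-defaults.conf.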
28
• Matrix Factorization from spark-bench - https://github.com/SparkTC/spark-bench
• Training
• Validation
• Prediction
With permission - Raj Krishnamurthy STRATA NYC 2016
Parameters used for data generation in MF application
Matrix Factorization with Alternating Least Squares
29
Data generation parameter | Value
Rows in data matrix | 62000
Columns in data matrix | 62000
Data set size | 100 GB

Spark parameter | Value for MF
Master nodes | 1
Worker nodes | 6
Executors per node | 1
Executor cores | 80 / 40 / 24
Executor memory | 480 GB
Shuffle location | HDDs
Input storage | HDFS
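The alternating least squares idea behind this workload fits in a few lines of pure Python. This rank-1 toy is illustrative only (Spark MLlib's ALS is a regularized, blocked, distributed implementation of the same alternation):

```python
# Toy rank-1 ALS in pure Python: factor R ~ u * v^T by alternately solving
# each factor in closed form while the other is held fixed.
def als_rank1(R, iters=10):
    m, n = len(R), len(R[0])
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iters):
        # Fix v, solve the least-squares update for each u_i.
        vv = sum(x * x for x in v)
        u = [sum(R[i][j] * v[j] for j in range(n)) / vv for i in range(m)]
        # Fix u, solve for each v_j.
        uu = sum(x * x for x in u)
        v = [sum(R[i][j] * u[i] for i in range(m)) / uu for j in range(n)]
    return u, v

R = [[1, 2], [2, 4], [3, 6]]          # an exactly rank-1 "ratings" matrix
u, v = als_rank1(R)
err = max(abs(R[i][j] - u[i] * v[j]) for i in range(3) for j in range(2))
print(err < 1e-9)  # True: the factors reproduce R
```

Each Spark job in the stage table below corresponds to iterations of exactly this alternation (computeFactors), plus counting and prediction passes.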
Job | Function | Description / API called
7 | Mean at MFApp.java | AbstractJavaRDDLike.map, MatrixFactorizationModel.predict, JavaDoubleRDD.mean
6 | Aggregate at MFModel.scala | MatrixFactorizationModel.predict, MatrixFactorizationModel.countApproxDistinctUserProduct
5 | First at MFModel.scala | ml.recommendation.ALS.computeFactors
4 | First at MFModel.scala | ml.recommendation.ALS.computeFactors
3 | Count at ALS.scala | ALS.train and ALS.initialize
2 | Count at ALS.scala | ALS.train
1 | Count at ALS.scala | ALS.train
0 | Count at ALS.scala | ALS.train
Analyzing the Spark Configuration Sweep
30
Config | Executor cores | GC options | RDD compression | Storage level | Partitions | Shuffle manager | Run time (min)
1 | 80 | Default | TRUE | memory_and_disk | 1000 | Sort-based | 40
2 | 80 | Default | FALSE | memory_only | 1000 | Sort-based | 34
3 | 40 | Default | FALSE | memory_only | 1000 | Sort-based | 26
4 | 40 | ParallelGCThreads=40 | FALSE | memory_only | 1000 | Sort-based | 24
5 | 40 | ParallelGCThreads=40 | TRUE | memory_and_disk_ser | 1000 | Sort-based | 20
6 | 40 | ParallelGCThreads=40 | TRUE | memory_only_ser | 1000 | Sort-based | 25
7 | 40 | ParallelGCThreads=40 | FALSE | memory_only | 800 | Sort-based | 26
8 | 40 | ParallelGCThreads=40 | FALSE | memory_only | 1200 | Sort-based | 27
9 | 24 | ParallelGCThreads=24 | FALSE | memory_and_disk_ser | 1000 | Sort-based | 21
10 | 24 | ParallelGCThreads=24 | FALSE | memory_and_disk_ser | 1000 | Tungsten-sort | 19
11 | 24 | Default | FALSE | memory_and_disk_ser | 1000 | Tungsten-sort | 18
Various configurations tried in optimizing MF application on Spark
GC and Memory Foot print
31
Configuration | Run time of last stage | GC time of last stage
1 | 12 min | 4.4 min
4 | 4.4 min | 1.8 min
9 | 3.5 min | 1.6 min
11 | 47 s | 16 s
Run time and GC time of Stage 68 for different configurations
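Quantified from the stage-68 numbers above (simple arithmetic, no assumptions):

```python
# Last-stage (stage 68) gains between the untuned config 1 and tuned config 11.
runtime_s = {"config1": 12 * 60, "config11": 47}   # stage run time, seconds
gc_s      = {"config1": 4.4 * 60, "config11": 16}  # stage GC time, seconds

stage_speedup = runtime_s["config1"] / runtime_s["config11"]
gc_speedup = gc_s["config1"] / gc_s["config11"]
print(f"stage 68 runtime speedup: {stage_speedup:.1f}x")  # ~15.3x
print(f"stage 68 GC-time speedup: {gc_speedup:.1f}x")     # ~16.5x
```

Most of the end-to-end improvement (40 min down to 18 min) comes from this one GC-dominated stage.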
Last Stage Analysis
32
Characterizing Configuration #1
33
CPU utilization on
a worker node
(configuration 1 )
Memory utilization on a worker node (configuration 1)
Characterizing Configuration #1 and Configuration #11
34
Memory footprint of configuration 11
Summary - How to Optimize Closer to Roofline Performance Faster?
• Classify workload into CPU, memory, IO or mixed (CPU, memory, IO) intensive
• Characterize “out-of-the-box” workload to understand CPU, Memory, IO and Network performance
characteristics
• Floorplan cluster resources
• Tune the “out-of-the-box” workload to navigate the “Roofline” performance space in the dimensions named above
– If the workload is memory/IO/network bound, tune Spark to increase operational intensity (operations/byte) as much as possible to make it CPU bound
• Divide search space into regions and perform exhaustive search
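A hypothetical first-pass helper for the classification step above (the threshold and labels are placeholders, not from the deck):

```python
# Classify a workload as CPU-, memory-, or IO-bound from coarse utilization
# averages gathered by resource monitoring. Purely illustrative heuristic.
def classify(cpu_util, mem_util, io_wait, threshold=0.7):
    """Each argument is a 0..1 average utilization from monitoring."""
    labels = []
    if cpu_util >= threshold:
        labels.append("cpu")
    if mem_util >= threshold:
        labels.append("memory")
    if io_wait >= threshold:
        labels.append("io")
    if len(labels) > 1:
        return "mixed"
    return labels[0] if labels else "underutilized"

print(classify(cpu_util=0.9, mem_util=0.5, io_wait=0.1))  # cpu
print(classify(cpu_util=0.2, mem_util=0.3, io_wait=0.8))  # io
```

The label then picks which Spark dimensions to sweep first: partition counts and serialization for memory/IO-bound runs, executor cores and SMT mode for CPU-bound ones.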
35
IO Optimizations
How to take advantage of faster networks?
36
THE GAP – HIGH-PERFORMANCE NETWORKS
Networks compared: 1, 10, and 40 Gbps
[Chart: runtime (secs) by network speed]
37
THE PERFORMANCE LOSS IN THE BIG-DATA STACK
High-Performance
I/O devices
• Data copies
• Context switches
• Cache pollution
• Deep call-stacks
• Legacy I/O interfaces
38
The Crail Architecture WWW.CRAIL.IO
• A high-performance data fabric for the Apache Data Processing Stack
• Relies on the principles of user-level IO
• Separation between control path and data path
• User-space direct-access I/O architecture/layer cut-through
• Builds on a distributed, shared data store
• No changes to overall data processing framework
• Is optimized to serve short-lived data sharing and staging
spark / flink / storm …
HDFS
Crail Store
High Performance
RDMA Network
zerocopy
spark specific
shuffle broadcast
39
EVALUATION - TERASORT
[Chart: runtime (seconds) of a 12.8 TB TeraSort, map and reduce phases, Spark vs Spark/Crail]
128 nodes OpenPOWER cluster
• 2 x IBM POWER8 10-core @ 2.9 Ghz
• DRAM: 512GB DDR4
• 4 x 1.2 TB NVMe SSD
• 100GbE Mellanox ConnectX-4 EN (RoCE)
• Ubuntu 16.04 (kernel 4.4.0-31)
• Spark 2.0.2
Performance gain: 6x
• Most gain from reduce phase:
• Crail shuffler much faster than Spark’s built-in shuffler
• Dramatically reduced CPU involvement
• Dramatically improved network usage
• Map phase: all activity local
• Still faster than vanilla Spark
40
EVALUATION – TERASORT: NETWORK IO
• Vanilla Spark runs on 100GbE
• Spark/Crail runs on 100Gb RoCE/RDMA
• Vanilla Spark peaks at ~10Gb/s
• Spark/Crail shuffle delivers ~70Gb/s per node
41
EVALUATION – TERASORT CPU EFFICIENCY
• Spark/Crail completes much faster despite comparable CPU load
• Spark/Crail CPU efficiency is close to 2016 sorting benchmark winner: 3.13
vs. 4.4 GB/min/core
• 2016 winner runs native C code!
Metric | Spark + Crail | Spark 2.0.2 | Winner 2014 | Winner 2016
Size (TB) | 12.8 | 12.8 | 100 | 100
Time (sec) | 98 | 527 | 1406 | 98.6
Cores | 2560 | 2560 | 6592 | 10240
Nodes | 128 | 128 | 206 | 512
NW (Gb/s) | 100 | 100 | 10 | 100
Rate (TB/min) | 7.8 | 1.4 | 4.27 | 44.78
Rate/core (GB/min) | 3.13 | 0.58 | 0.66 | 4.4
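The per-core rates can be sanity-checked against size, time, and core count for the two Spark rows (small rounding differences aside):

```python
# Cross-check: rate/core (GB/min) = size_TB * 1000 * 60 / time_s / cores,
# using the two Spark rows of the table above.
rows = {
    "Spark + Crail": {"size_tb": 12.8, "time_s": 98,  "cores": 2560,
                      "rate_per_core": 3.13},
    "Spark 2.0.2":   {"size_tb": 12.8, "time_s": 527, "cores": 2560,
                      "rate_per_core": 0.58},
}
for name, r in rows.items():
    gb_per_min = r["size_tb"] * 1000 * 60 / r["time_s"]
    per_core = gb_per_min / r["cores"]
    # Within a few percent of the published figures.
    assert abs(per_core - r["rate_per_core"]) / r["rate_per_core"] < 0.05
    print(f"{name}: {per_core:.2f} GB/min/core")
```

(The two sorting-benchmark winners report rates measured under the official benchmark rules, so their columns are not recomputed here.)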
42
CRAIL WITH THE HORTONWORKS STACK
scalable, fault-tolerant,
cost-efficient storage
resource manager
compute
frameworks
user
interfaces
broadcast
HDFS
plugin
RPCs
shuffle
caching
key-value
store
...
High-performance Crail fabric
43
Roadmap
Where is OpenPOWER headed?
44
Accelerator Technology
2015: POWER8 (OpenPOWER CAPI interface); Connect-IB, FDR InfiniBand, PCIe Gen3; Kepler GPU, PCIe Gen3
2016: POWER8 with NVLink; ConnectX-4, EDR InfiniBand, CAPI over PCIe Gen3; Pascal GPU, NVLink
2017: POWER9 (enhanced CAPI & NVLink); ConnectX-5, next-gen InfiniBand, enhanced CAPI over PCIe Gen4; Volta GPU, enhanced NVLink
45
NOTICES AND DISCLAIMERS
46
Copyright © 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial
publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS"
WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT
LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the
agreements under which they are provided.
IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have been previously installed. Regardless, our warranty terms apply.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used
IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries in which IBM
operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions
are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.
It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.
© 2016 International Business Machines Corporation
NOTICES AND DISCLAIMERS CONT’D.
47
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those
products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of
non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business
Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®,
OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®,
Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and
System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or
other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
48
Q & A
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 

How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors

  • 4. What is Apache Spark • Unified Analytics Platform – Combine streaming, graph, machine learning and SQL analytics on a single platform – Simplified, multi-language programming model – Interactive and Batch • In-Memory Design – Pipelines multiple iterations on a single copy of data in memory – Superior Performance – Natural Successor to MapReduce Fast and general engine for large-scale data processing Spark Core API R Scala SQL Python Java Spark SQL Streaming MLlib GraphX 4 © International Business Machines (IBM) 2017
  • 5. Machine Learning and Deep Learning (ML/DL) What you and I (our brains) do without even thinking about it… we recognize a bicycle Apr 7, 2017 (c) International Business Machines (IBM) 2017 5
  • 6. 6 Now machines are learning the way we learn…. From "Texture of the Nervous System of Man and the Vertebrates" by Santiago Ramón y Cajal. Artificial Neural Networks Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 7. But training needs a lot of computational resources Easy scale-out with: … but model training is not easy to distribute Training can take hours, days or weeks Input data and model sizes are becoming larger than ever (e.g. video input, billions of features etc.) Moore’s law is dying Resulting in a need for real-time analytics with: • whole system optimization • offloaded computation • accelerators, and • higher memory bandwidth systems Apr 7, 2017(c) International Business Machines (IBM) 2017 7
  • 8. Today’s challenges demand whole system innovation You are here 44 zettabytes unstructured data 2010 2020 structured data Data holds competitive value. Full system and stack open innovation required Data Growth Price/Performance Moore’s Law Processor Technology 2000 2020 Firmware / OS Accelerators Software Storage Network 8 © International Business Machines (IBM) 2017
  • 9. 9 OpenPOWER: open hardware for high performance Systems designed for big data analytics and superior cloud economics Up to: 12 cores per CPU 96 hardware threads per CPU 1 TB RAM 7.6Tb/s combined I/O Bandwidth GPUs and FPGAs coming… OpenPOWER Traditional Intel x86 http://www.softlayer.com/POWER-SERVERS https://mc.jarvice.com/ Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 10. 10 OpenPower Ecosystem – Members (c) International Business Machines (IBM) 2017 Apr 7, 2017
  • 11. Memory Interface Control Memory IBM & Partner Devices CAPI/PCI DMI Cores • 12 cores / 8 threads per core • TDP: 130W and 190W • 64K data cache, 32K instruction cache Accelerators • Crypto & memory expansion • Transactional Memory Caches • 512 KB SRAM L2 / core • 96 MB eDRAM shared L3 Memory Subsystem • Memory buffers with 128MB Cache • ~70ns latency to memory Bus Interfaces • Durable Memory attach Interface (DMI) • Integrated PCIe Gen3 • SMP Interconnect for up to 4 sockets Virtual Addressing • Accelerator can work with same memory addresses that the processors use • Pointers de-referenced same as the host application • Removes OS & device driver overhead Hardware Managed Cache Coherence • Enables the accelerator to participate in “Locks” as a normal thread • Lowers Latency over IO communication model 6 Hardware Partners developing with CAPI Over 20 CAPI Solutions • All listed here http://ibm.biz/powercapi Examples of Available CAPI Solutions • IBM Data Engine for NoSQL • DRC Graphfind analytics • Erasure Code Acceleration for Hadoop Coherent Accelerator Processor Interface (CAPI) 22nm SOI, eDRAM, 15 metal layers, 650mm2 SMP http://openpowerfoundation.org/wp-content/uploads/2016/04/HardwareRevealFlyerFinal.pdf Newly Announced OpenPOWER systems and solutions: POWER8 Processor - Design 11 © International Business Machines (IBM) 2017
  • 12. Introducing Minsky S822LC OpenPOWER system for HPC first custom-built GPU accelerator server with NVLink | 12 2.5x Faster CPU-GPU Data Communication via NVLink NVLink 80 GB/s GPU P8 GPU GPU P8 GPU PCIe 32 GB/s GPU x86 GPU GPU x86 GPU No NVLink between CPU & GPU for x86 Servers: PCIe Bottleneck NVIDIA P100 Pascal GPU POWER8 NVLink Server x86 Servers with PCIe • Custom-built GPU Accelerator Server • High-Speed NVLink Connections between CPUs & GPUs and among GPUs • Features novel NVIDIA P100 Pascal GPU accelerator M.Gschwind, Bringing the Deep Learning Revolution into the Enterprise
  • 13. Deep Learning on OpenPOWER with GPUs Transparent acceleration without code changes | 13 Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 14. Introducing PowerAI: Get started fast with Deep Learning 14 Enabled by High Performance Computing Infrastructure Package of Pre-Compiled Major Deep Learning Frameworks Easy to install & get started with Deep Learning with Enterprise-Class Support for Performance To Take Advantage of NVLink https://www.ibm.com/ms-en/marketplace/deep-learning-platform
  • 15. Machine and Deep Learning analytics on OpenPOWER no code changes needed!! 15 ATLAS (Automatically Tuned Linear Algebra Software) https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d- 44a4f27eba32/entry/DeepLearning4J_Deep_Learning_with_Java_Spark_and_Power?lang=en
  • 16. OpenPOWER: GPU support 16 Credit: Kevin Klaues, Mesosphere Mesos supports GPU scheduling Huge speed-ups with GPUs and OpenPOWER!
  • 17. Enabling Accelerators/GPUs in the cloud stack 17 Deep Learning Training + Inference Containers and images Accelerators Clustering frameworks
  • 18. TensorFlow on Tesla P100: PowerAI is 30% faster 18 IBM S822LC 20-cores 2.86GHz 512GB memory / 4 NVIDIA Tesla P100 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / TensorFlow 0.12.0 / Inception v3 Benchmark (64 image minibatch) Intel Broadwell E5-2640v4 20-core 2.6 GHz 512GB memory / 4 NVIDIA Tesla P100 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / TensorFlow 0.12.0 / Inception v3 Benchmark (64 image minibatch) Larger value is better
  • 19. PowerAI vs DGX-1: 1.6x Tensorflow throughput / dollar 19 ▪ TensorFlow 0.12 on the IBM PowerAI platform takes advantage of the full capabilities of NVLink ▪ For image classification and analysis this means a 1.6X price performance advantage relative to the NVIDIA DGX-1 System Images / Second List Price $ / Image / Second NVIDIA DGX-1 (8 P100 GPU, 512GB Mem) 330 $129,000 $390 PowerAI (4 P100 GPU, 512 GB Mem) 273 $67,000 $241 Lower cost is better
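The 1.6x price/performance claim on the slide above follows directly from the table's raw throughput and list-price figures. The per-image-per-second costs computed below land slightly off the slide's rounded $390/$241, but the advantage ratio still rounds to 1.6 (a quick check using only numbers printed on the slide):

```python
# Price/performance check from the slide's table:
# list price divided by images/second = dollars per unit of throughput.
def dollars_per_image_per_sec(list_price_usd, images_per_sec):
    return list_price_usd / images_per_sec

dgx1 = dollars_per_image_per_sec(129_000, 330)    # NVIDIA DGX-1 row
powerai = dollars_per_image_per_sec(67_000, 273)  # PowerAI row

advantage = dgx1 / powerai
print(f"DGX-1 ${dgx1:.0f} vs PowerAI ${powerai:.0f}: {advantage:.1f}x")
```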
  • 20. NVLink and P100 advantage | 20 • NVLink reduces communication time and overhead • Incorporating the fastest GPU for deep learning • Data gets from GPU-GPU, Memory-GPU faster, for shorter training times x86 based GPU system POWER8 + Tesla P100+NVLink ImageNet / Alexnet: Minibatch size = 128 170 ms 78 ms IBM advantage: data communication and GPU performance
  • 21. Spark Machine Learning performance tuning on OpenPOWER What knobs can you tweak? | 21 Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 22. Spark on OpenPower • Streaming and SQL benefit from High Thread Density and Concurrency • Processing multiple packets of a stream and different stages of a message stream pipeline • Processing multiple rows from a query 22 © International Business Machines (IBM) 2017
  • 23. • Machine Learning benefits from Large Caches and Memory Bandwidth • Iterative Algorithms on the same data • Fewer core pipeline stalls and overall higher throughput 23 Spark on OpenPower © International Business Machines (IBM) 2017
  • 24. • Graph algorithms also benefit from Large Caches, Memory Bandwidth and Higher Thread Strength • Flexibility to go from 8 SMT threads per core to 4 or 2 • Manage Balance between thread performance and throughput 24 Spark on OpenPower © International Business Machines (IBM) 2017
  • 25. • Headroom • Balanced resource utilization, more efficient scale-out • Multi-tenant deployments 25 Spark on OpenPower © International Business Machines (IBM) 2017
  • 26. Roofline SPARK Performance Model 26 Spark Tunables Spark Performance “Roofline” Performance Navigation uses system resource workload characterization and analysis to look for fundamental inefficiencies “Roofline” Good Enough “Out of Box” FOR 1 … MAX WORKERS FOR 1 … MAX CPU PER NODE FOR 1 … MAX THREADS PER CPU FOR 1 … MAX PARTITIONS Unwieldy & Complicated (some respite in ML workloads from data sampling) Performance Navigation Automation Script © International Business Machines (IBM) 2017
  • 27. Performance Tuning Tips for a Machine Learning Workload 27 Top-Down Approach Methodology: Alternating Least Squares Based Matrix Factorization application Optimization Process: Spark executor instances Spark executor cores Spark executor memory Spark shuffle location and manager RDD persistence storage level Application Large number of Spark tunables – Spark executors and Spark cores …… Default Configurations Out of Box Performance Bottom-Up Approach System Hardware Characterizing the Workload Through Resource monitoring Custom SPARK Tunables from Configuration Sweeps Roofline Performance
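The knobs named in the optimization process above all surface as spark-submit options. A hypothetical starting invocation for a six-worker cluster of this shape (flag values, paths, and the class name are illustrative only; spark.shuffle.manager is the Spark 1.x-era knob that the configuration sweep on a later slide varies):

```shell
# Illustrative only: one large executor per worker node, as in the
# MF experiments; jar, class name and local dirs are hypothetical.
spark-submit \
  --num-executors 6 \
  --executor-cores 40 \
  --executor-memory 480g \
  --conf spark.rdd.compress=false \
  --conf spark.shuffle.manager=tungsten-sort \
  --conf spark.local.dir=/disk1,/disk2 \
  --class MFApp mf-app.jar
# The RDD persistence storage level is chosen in application code, e.g.
#   rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
```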
  • 28. Workflow 28 • Matrix Factorization from SPARKBENCH - https://github.com/SparkTC/spark-bench • Training • Validation • Prediction With permission - Raj Krishnamurthy STRATA NYC 2016 © International Business Machines (IBM) 2017
  • 29. Parameters used for data generation in MF application Matrix Factorization with Alternating Least Squares 29 Data generation parameters Value Rows in data matrix 62000 Columns in data matrix 62000 Data set size 100 GB Spark parameter Value for MF Master node 1 Worker nodes 6 Executors per Node 1 Executor cores 80 / 40 /24 Executor Memory 480 GB Shuffle Location HDDs Input Storage HDFS Job Function Description / API called 7 Mean at MFApp.java AbstractJavaRDDLike.map MatrixFactorizationModel.predict JavaDoubleRDD.mean 6 Aggregate at MFModel.scala MatrixFactorizationModel.predict MatrixFactorizationModel.countApproxDistinctUserProduct 5 First at MFModel.scala ml.recommendation.ALS.computeFactors 4 First at MFModel.scala ml.recommendation.ALS.computeFactors 3 Count at ALS.scala ALS.train and ALS.intialize 2 Count at ALS.scala ALS.train 1 Count at ALS.scala ALS.train 0 Count at ALS.scala ALS.train © International Business Machines (IBM) 2017
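ALS alternates exact least-squares solves between the two factor matrices, which is what makes each iteration cheap and easy to distribute. A rank-1, pure-Python toy (synthetic 3x3 ratings, nothing like MLlib's blocked, regularized, distributed implementation) shows the core update rule:

```python
# Toy alternating least squares, rank 1: factor R ~ u * v^T.
# Each half-step is the exact least-squares solution with the other
# factor held fixed; for rank 1 it reduces to u_i = (R v)_i / (v . v).
R = [[1.0, 2.0, 4.0],
     [2.0, 4.0, 8.0],
     [3.0, 6.0, 12.0]]          # rank-1 by construction: r_ij = a_i * b_j

u = [1.0] * len(R)
v = [1.0] * len(R[0])

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

for _ in range(10):             # a few sweeps; converges fast on clean data
    u = [dot(row, v) / dot(v, v) for row in R]
    cols = list(zip(*R))
    v = [dot(col, u) / dot(u, u) for col in cols]

err = max(abs(R[i][j] - u[i] * v[j])
          for i in range(3) for j in range(3))
print(f"max reconstruction error: {err:.2e}")
```

On this noiseless rank-1 matrix a single sweep already recovers the factors; real ratings data needs the regularization term (lambda) that MLlib's ALS adds and this sketch omits.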
  • 30. Analyzing SPARK Configuration Sweep 30 Configuration 1 2 3 4 5 6 7 8 9 10 11 Spark executor cores 80 80 40 40 40 40 40 40 24 24 24 GC options Default Default Default ParallelGCthreads=40 ParallelGCthreads=40 ParallelGCthreads=40 ParallelGCthreads=40 ParallelGCthreads=40 ParallelGCthreads=24 ParallelGCthreads=24 Default RDD compression TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE Storage level memory_and_disk memory_only memory_only memory_only memory_and_disk_ser memory_only_ser memory_only memory_only memory_and_disk_ser memory_and_disk_ser memory_and_disk_ser Partition numbers 1000 1000 1000 1000 1000 1000 800 1200 1000 1000 1000 Shuffle Manager Sort based Sort based Sort based Sort based Sort based Sort based Sort based Sort based Sort based Tungsten-sort Tungsten-sort Run-time (minutes) 40 34 26 24 20 25 26 27 21 19 18 Various configurations tried in optimizing MF application on Spark © International Business Machines (IBM) 2017
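Sweeps like the one tabulated above can be generated mechanically. A hypothetical enumerator over the dimensions the table varies also shows why the roofline slide calls the exhaustive search unwieldy: even this small grid multiplies out quickly.

```python
import itertools

# Hypothetical sweep over the tunables varied in the table above.
executor_cores = [80, 40, 24]
storage_levels = ["memory_only", "memory_and_disk_ser"]
shuffle_managers = ["sort", "tungsten-sort"]
partitions = [800, 1000, 1200]

configs = list(itertools.product(
    executor_cores, storage_levels, shuffle_managers, partitions))
print(len(configs))  # 3 * 2 * 2 * 3 = 36 runs for even this small grid

for cores, level, shuffle, parts in configs[:1]:
    print(f"--executor-cores {cores} "
          f"--conf spark.shuffle.manager={shuffle} "
          f"--conf spark.default.parallelism={parts}  # persist: {level}")
```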
  • 31. GC and Memory Foot print 31 Configuration Run time of last stage GC time of last stage 1 12 min 4.4 min 4 4.4 min 1.8 min 9 3.5 min 1.6 min 11 47s 16s Run time and GC time of Stage 68 for different configurations © International Business Machines (IBM) 2017
  • 32. Last Stage Analysis 32© International Business Machines (IBM) 2017
  • 33. Characterizing Configuration #1 33 CPU utilization on a worker node (configuration 1 ) Memory utilization on a worker node ( configuration 1) © International Business Machines (IBM) 2017
  • 34. Characterizing Configuration #1 and Configuration #11 34 Memory footprint of configuration 11 © International Business Machines (IBM) 2017
  • 35. Summary - How to Optimize Closer to Roofline Performance Faster? • Classify workload into CPU, memory, IO or mixed (CPU, memory, IO) intensive • Characterize “out-of-the-box” workload to understand CPU, Memory, IO and Network performance characteristics • Floorplan cluster resources • Tune “out-of-the-box” workload to navigate “Roofline” performance space in the above named dimensions – If workload is memory/IO/Network bound then tune SPARK to increase operational intensity operations/byte as much as possible to make it CPU bound • Divide search space into regions and perform exhaustive search 35© International Business Machines (IBM) 2017
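The classification step in this summary can be made concrete with the standard roofline formula, attainable performance = min(peak compute, operational intensity x memory bandwidth). The machine numbers below are made up for illustration, not POWER8 specifications:

```python
# Roofline classifier: a workload is memory-bound when its operational
# intensity (ops per byte moved) sits below the machine's ridge point.
def roofline(peak_gflops, mem_bw_gbs, ops_per_byte):
    attainable = min(peak_gflops, ops_per_byte * mem_bw_gbs)
    ridge = peak_gflops / mem_bw_gbs      # ops/byte where the roof flattens
    bound = "memory-bound" if ops_per_byte < ridge else "compute-bound"
    return attainable, bound

# Illustrative numbers only: 500 GFLOP/s peak, 200 GB/s memory bandwidth.
perf, bound = roofline(peak_gflops=500.0, mem_bw_gbs=200.0, ops_per_byte=0.5)
print(perf, bound)  # 100.0 memory-bound; raising ops/byte past 2.5 flips it
```

Tuning Spark to raise operational intensity (as the slide suggests) moves a workload rightward along this roof until it becomes CPU-bound.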
  • 36. IO Optimizations How to take advantage of faster networks? 36Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 37. THE GAP – HIGH-PERFORMANCE NETWORKS [Chart: workload runtime (secs) over 1, 10, and 40 Gbps networks] 37Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 38. THE PERFORMANCE LOSS IN THE BIG-DATA STACK High-Performance I/O devices • Data copies • Context switches • Cache pollution • Deep call-stacks • Legacy I/O interfaces 38Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 39. The Crail Architecture WWW.CRAIL.IO  A high-performance data fabric for the Apache Data Processing Stack  Relies on the principles of user level IO  Separation between control path and data path  User-space direct-access I/O architecture/layer cut-through  Builds on a distributed, shared data store  No changes to overall data processing framework  Is optimized to serve short-lived data sharing and staging spark / flink / storm … HDFS Crail Store High Performance RDMA Network zerocopy spark specific shuffle broadcast 39Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 40. EVALUATION - TERASORT 0 100 200 300 400 500 600 Spark Spark/Crail Runtime(seconds) 12.8 TB data set, TeraSort reduce map 128 nodes OpenPOWER cluster • 2 x IBM POWER8 10-core @ 2.9 Ghz • DRAM: 512GB DDR4 • 4 x 1.2 TB NVMe SSD • 100GbE Mellanox ConnectX-4 EN (RoCE) • Ubuntu 16.04 (kernel 4.4.0-31) • Spark 2.0.2 Performance gain: 6x • Most gain from reduce phase: • Crail shuffler much faster than Spark build-in • Dramatically reduced CPU involvement • Dramatically improved network usage • Map phase: all activity local • Still faster than vanilla Spark 40Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 41. EVALUATION – TERASORT: NETWORK IO • Vanilla Spark runs on 100GbE • Spark/Crail runs on 100Gb RoCE/RDMA • Vanilla Spark peaks at ~10Gb/s • Spark/Crail shuffle delivers ~70Gb/s per node 41Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 42. EVALUATION – TERASORT CPU EFFICIENCY • Spark/Crail completes much faster despite comparable CPU load • Spark/Crail CPU efficiency is close to 2016 sorting benchmark winner: 3.13 vs. 4.4 GB/min/core • 2016 winner runs native C code! Spark + Crail Spark 2.0.2 Winner 2014 Winner 2016 Size TB 12.8 100 Time sec 98 527 1406 98.6 Cores 2560 6592 10240 Nodes 128 206 512 NW Gb/s 100 10 100 Rate TB/min 7.8 1.4 4.27 44.78 Rate/core GB/min 3.13 0.58 0.66 4.4 42Apr 7, 2017(c) International Business Machines (IBM) 2017
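The per-core rates in the table can be reproduced from the raw size, runtime, and core counts, assuming binary units (1 TB = 1024 GB) and reading the flattened table as both Spark rows running on the same 128-node, 2560-core cluster (a hedged reading; the scrape collapsed the column alignment):

```python
# Sort rate per core: (data size in GB) / (runtime in minutes) / cores.
def rate_per_core(size_tb, seconds, cores, gb_per_tb=1024):
    gb_per_min = size_tb * gb_per_tb / (seconds / 60.0)
    return gb_per_min / cores

crail = rate_per_core(12.8, 98, 2560)     # Spark + Crail row -> ~3.13
vanilla = rate_per_core(12.8, 527, 2560)  # Spark 2.0.2 row   -> ~0.58
print(f"{crail:.2f} vs {vanilla:.2f} GB/min/core")
```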
  • 43. CRAIL WITH THE HORTONWORKS STACK scalable, fault-tolerant, cost-efficient storage resource manager compute frameworks user interfaces broadcast HDFS plugin RPCs shuffle caching key-value store ... High-performance Crail fabric 43Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 44. Roadmap Where is OpenPOWER headed? | 44 Apr 7, 2017(c) International Business Machines (IBM) 2017
  • 45. Accelerator Technology 2015 2016 2017 POWER8 POWER8 with NVLink POWER9 OpenPower CAPI Interface Enhanced CAPI & NVLink Connect-IB FDR Infiniband PCIe Gen3 ConnectX-4 EDR Infiniband CAPI over PCIe Gen3 ConnectX-5 Next-Gen Infiniband Enhanced CAPI over PCIe Gen4 IBM CPUs Kepler PCIe Gen3 Volta Enhanced NVLink Pascal NVLink 45© International Business Machines (IBM) 2017
  • 46. NOTICES AND DISCLAIMERS 46 Copyright © 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM. Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided. IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have been previously installed. Regardless, our warranty terms apply. Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice. Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary. References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation. It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law. © 2016 International Business Machines Corporation
  • 47. NOTICES AND DISCLAIMERS CON’T. 47 Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right. IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business Services®, Global Technology Services®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.