1. Team 6:
Sourabh Ketkale : 010470785
Sahil Kaw : 010725104
Siddhi Pai : 010702458
Goutham Nekkalapu : 010815233
Prince Jacob Chandy : 010807225
2.
3.
4. Comparison to an optimized BLAS package: For higher-order (larger) matrices, the BLAS package achieved a greater speedup over the baseline CPU implementation.
Comparison to an optimized GPU implementation: Without batching, the GPU attained a 2.8x speedup over the baseline CPU.
5. Linear Quantization: We use 8-bit linear quantization to convert activations to unsigned 8-bit integers (unsigned char) and weights to signed 8-bit integers (signed char), while biases are encoded as 32-bit integers.
Intel SSSE3: A roughly 3x speedup is achieved because the instruction set provides pmaddubsw, which multiplies unsigned 8-bit activations by signed 8-bit weights and accumulates adjacent pairs into 16-bit results.
Intel SSE4: This instruction set provides optimized conversion of the 16-bit intermediate results into 32-bit accumulators, yielding about a 9% relative speed improvement over the SSSE3 benchmark (a sketch of the quantized inner loop follows below).
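A minimal sketch of the quantized inner product described above, assuming activations are uint8, weights are int8, and the vector length is a multiple of 16. It illustrates the pmaddubsw/pmaddwd pattern, not the paper's implementation; note that pmaddubsw saturates its 16-bit sums, which production code must account for.

#include <cstdint>
#include <emmintrin.h>   // SSE2: _mm_madd_epi16, _mm_add_epi32
#include <tmmintrin.h>   // SSSE3: _mm_maddubs_epi16 (pmaddubsw)

int32_t quantized_dot(const uint8_t* act, const int8_t* wt, int n) {
    __m128i acc = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi16(1);
    for (int i = 0; i < n; i += 16) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(act + i));
        __m128i w = _mm_loadu_si128(reinterpret_cast<const __m128i*>(wt + i));
        // pmaddubsw: u8 x s8 products, adjacent pairs summed into 16-bit lanes
        __m128i p16 = _mm_maddubs_epi16(a, w);
        // pmaddwd with 1s widens the 16-bit sums into four 32-bit lanes
        __m128i p32 = _mm_madd_epi16(p16, ones);
        acc = _mm_add_epi32(acc, p32);
    }
    // Horizontal sum of the four 32-bit lanes
    int32_t lane[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lane), acc);
    return lane[0] + lane[1] + lane[2] + lane[3];
}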
6. BATCHING: By propagating a batch of input frames through the network together, the CPU can overtake the GPU, since both weights and activations stay resident in the CPU caches and matrix-vector products become cache-friendly matrix-matrix products.
LAZY EVALUATION: A neural network only needs to compute a fraction of its output states at any given frame, so with a Gaussian-selection technique we can reduce the number of parameters that must be visited at each step, and thereby the number of arithmetic and memory operations.
BATCHED LAZY EVALUATION: Applying lazy evaluation to small batches of frames during speech evaluation further improves CPU performance relative to the GPU; a sketch follows below.
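A minimal sketch of the batched lazy evaluation idea, with illustrative names and shapes: each required weight row is loaded once and reused across the whole batch, and output units that are not needed are skipped entirely.

#include <vector>

void batched_lazy_layer(const std::vector<float>& W,      // out_dim x in_dim weights, row-major
                        const std::vector<float>& X,      // batch x in_dim input activations
                        const std::vector<bool>& needed,  // lazy mask: which outputs to compute
                        std::vector<float>& Y,            // batch x out_dim outputs
                        int in_dim, int out_dim, int batch) {
    for (int o = 0; o < out_dim; ++o) {
        if (!needed[o]) continue;              // lazy evaluation: skip unused output units
        const float* w_row = &W[o * in_dim];   // weight row stays cached across the batch
        for (int b = 0; b < batch; ++b) {
            const float* x = &X[b * in_dim];
            float sum = 0.0f;
            for (int i = 0; i < in_dim; ++i)
                sum += w_row[i] * x[i];
            Y[b * out_dim + o] = sum;
        }
    }
}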
7. An autoencoder is an artificial neural network used for learning efficient codings.
A stacked autoencoder is a deep learning model consisting of multiple autoencoders layered one on top of another (a small sketch follows below).
Xeon Phi is effectively a small cluster of 60 cores, each with 4 hardware threads. It has 8 GB of memory, its own file system and Linux operating system, and a clock speed of about 1 GHz. Each core has a 32 KB L1 data cache and a 512 KB L2 cache.
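A minimal, illustrative sketch of a stacked autoencoder's encoding pass (not the paper's code): each layer's encoder output becomes the input of the next layer. Dimensions, names, and the sigmoid activation are assumptions for the example.

#include <cmath>
#include <vector>

struct Layer {
    int in_dim, out_dim;
    std::vector<float> W;  // out_dim x in_dim weights, row-major
    std::vector<float> b;  // out_dim biases
};

std::vector<float> encode(const Layer& l, const std::vector<float>& x) {
    std::vector<float> h(l.out_dim);
    for (int o = 0; o < l.out_dim; ++o) {
        float z = l.b[o];
        for (int i = 0; i < l.in_dim; ++i)
            z += l.W[o * l.in_dim + i] * x[i];
        h[o] = 1.0f / (1.0f + std::exp(-z));   // sigmoid activation
    }
    return h;
}

// Stacked autoencoder: feed the code of one autoencoder into the next.
std::vector<float> stacked_encode(const std::vector<Layer>& stack,
                                  std::vector<float> x) {
    for (const Layer& l : stack) x = encode(l, x);
    return x;
}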
8. Thread oversubscription means that the number of threads running in parallel exceeds the number of hardware threads the Xeon Phi supports.
It greatly degrades Xeon Phi performance because it causes context switching, which is very expensive on a many-core processor.
Solution:
A MapReduce-style method can effectively determine the number of threads required by each MKL (Math Kernel Library) function.
MKL itself can also determine the number of threads a call requires, but this is not well suited to model parallelism and asynchronous training (a small configuration sketch follows below).
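A minimal sketch of one way to cap thread counts so that MKL calls nested inside application threads do not oversubscribe the Xeon Phi's hardware threads; the 4-way split shown here is only an example.

#include <mkl.h>
#include <omp.h>

void configure_threads(int hw_threads) {      // e.g. 240 = 60 cores x 4 threads
    int workers = 4;                           // application-level worker threads (assumed)
    int per_call = hw_threads / workers;       // threads each MKL call may use
    omp_set_num_threads(workers);              // cap OpenMP task parallelism
    mkl_set_num_threads(per_call);             // cap MKL's internal parallelism
    mkl_set_dynamic(0);                        // stop MKL from adjusting the count on its own
}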
9. Basic Design on Xeon Phi:
Training datasets for neural networks are very large, so a lot of I/O takes place between host RAM and the coprocessor memory, and this transfer time also needs to be taken into account.
To address this, all parameters and temporary variables are kept resident in the Xeon Phi's global memory, and only the training dataset is streamed over as needed.
Parallel Design:
Data Parallelism: Achieved through the Vector Processing Unit (VPU), which performs the element-wise data operations within each model replica.
Task Parallelism: Achieved through the multiple hardware threads of the Xeon Phi.
Affinity Mode: Affinity sets up the mapping between threads and cores; a small sketch of combining these follows below.
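A minimal sketch of combining task and data parallelism with OpenMP on the Xeon Phi: the parallel-for supplies task parallelism across hardware threads, and the simd pragma asks the compiler to vectorize the inner loop on the VPU. Thread-to-core affinity would be set externally, e.g. via the KMP_AFFINITY environment variable. Function and array names are illustrative.

#include <omp.h>

void scaled_add(const float* x, const float* y, float* out, float alpha, int n) {
    #pragma omp parallel for                  // task parallelism: one chunk per thread
    for (int start = 0; start < n; start += 1024) {
        int end = (start + 1024 < n) ? start + 1024 : n;
        #pragma omp simd                      // data parallelism: vectorized on the VPU
        for (int i = start; i < end; ++i)
            out[i] = alpha * x[i] + y[i];
    }
}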
12. To achieve this kind of computing, one cannot depend on a single system; large-scale distributed systems are needed.
13. There are multiple model replicas, each consisting of multiple machines, that train on different subsets of the data and publish their updates to the global parameter server.
Model Parallelism
Data Parallelism
14. Whole-system co-design
Model partitioning – the working set of the model is kept in the L3 cache
Local weight computation at the parameter server
Exploiting asynchrony (weight updates are commutative and associative)
Multi-threaded weight updates without locks (see the sketch below)
Asynchronous batch updates – aggregate weight updates and send them to the parameter server only once the aggregation is large enough
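A minimal sketch of the lock-free, multi-threaded weight update idea (not the paper's implementation): because updates are commutative and associative, each thread applies its gradient to the shared weights with atomic adds instead of taking a lock; small ordering races are tolerated by SGD.

#include <cstddef>
#include <omp.h>
#include <vector>

void lock_free_update(std::vector<float>& weights,
                      const std::vector<std::vector<float>>& per_thread_grads,
                      float lr) {
    #pragma omp parallel num_threads((int)per_thread_grads.size())
    {
        // Each thread applies its own gradient directly to the shared weights.
        const std::vector<float>& g = per_thread_grads[omp_get_thread_num()];
        for (std::size_t i = 0; i < g.size(); ++i) {
            #pragma omp atomic                 // lock-free element-wise update
            weights[i] -= lr * g[i];
        }
    }
}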
15. To achieve this, GeePS needs to overcome the challenges of limited GPU memory, inter-machine communication (data-movement overhead), and GPU stalls.
A parameter server works by separating the problem of processing data from the problem of communicating and synchronizing parameters between machines.
GeePS is a parameter server supporting data-parallel model training.
16. The authors tried using an existing state-of-the-art parameter server system (IterStore) with GPU-based ML…
To enable a parameter server to support parallel ML applications running on distributed GPUs, the authors make three important changes:
Explicit use of GPU memory for the parameter cache
Batch-based parameter access methods (see the sketch after this list)
Parameter server management of GPU memory on behalf of the application
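A hypothetical interface (not GeePS's actual API) sketching what batch-based parameter access means: an entire batch of parameter rows for a layer is read into a GPU-resident buffer in one call and updated in one call, instead of one key at a time. All class and member names here are assumptions for illustration.

#include <cstddef>
#include <cstdint>
#include <vector>

struct ParamBatch {
    std::vector<std::uint64_t> keys;  // parameter row ids for one layer
    float* gpu_buffer;                // staging buffer resident in GPU memory
    std::size_t row_len;              // floats per row
};

class BatchedParamStore {             // illustrative, assumed interface
public:
    // Read every row in the batch into the GPU-resident buffer in one operation.
    virtual void read_batch(ParamBatch& batch) = 0;
    // Apply all accumulated updates from the GPU buffer in one operation.
    virtual void update_batch(const ParamBatch& batch) = 0;
    virtual ~BatchedParamStore() = default;
};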
17. GPUs using a CPU-based parameter server
GPU-based parameter server
18.
19. Two ways to achieve parallelism:
• By distributing the deep computation across a Hadoop cluster or cloud of computing nodes
• By using field-programmable gate array (FPGA) hardware acceleration to speed up computationally intensive deep learning kernels
20.
21. Performance bottlenecks in deep learning of CNNs
Design of distributed Hadoop clusters with separation of the kernels processed on standard versus FPGA-accelerated nodes
Design and synthesis of the reconfigurable architecture to support kernel acceleration on the FPGA nodes
Design of an interface library to achieve compatibility between FPGA nodes and general-purpose nodes
22. Kernel Identification
Approach to the Distributed Algorithm with FPGA-Based Nodes
Design and Implementation of a Reconfigurable Architecture for Deep Learning Kernels
Seamless Integration of the Distributed Algorithm with the Accelerated Kernels
23.
24. To capitalize on the ability to achieve fine-grained parallelism with reconfigurable hardware, which cannot be done with GPUs.
The performance-per-watt ratio is better with FPGAs, which can deliver computational power at lower energy consumption in power-sensitive environments such as mobile devices and data centers.
Support with all the open-source frameworks for the
25.
26.
27. The set of programming languages, models, and tools that support the Intel x86 architecture can also be used on the Intel Xeon Phi coprocessor with little change. As a result, instead of redesigning algorithms or models for the GPU in CUDA or OpenCL, vector-intensive algorithms can directly take advantage of the architecture mentioned above.
28.
29.
30. OpenMP and the Intel MKL (Math Kernel Library) packages are used to parallelize the computations.
The many matrix multiplications are handled by the Intel MKL packages (a call sketch follows below).
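A minimal sketch of handing a layer's matrix multiplication to Intel MKL: cblas_sgemm computes C = alpha*A*B + beta*C and is internally multithreaded. Matrix names and sizes are illustrative.

#include <mkl.h>
#include <vector>

void layer_forward(const std::vector<float>& W,  // M x K weights, row-major
                   const std::vector<float>& X,  // K x N batch of activations
                   std::vector<float>& Y,        // M x N outputs
                   int M, int K, int N) {
    // Y = 1.0 * W * X + 0.0 * Y, parallelized internally by MKL
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, W.data(), K,
                      X.data(), N,
                0.0f, Y.data(), N);
}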
31. Achieves a 302-fold speedup compared with the unoptimized sequential algorithm.
33. Thread parallelism
Controlled Hogwild
Arbitrary Order of Synchronization
Vectorization
34. Speedup of the algorithm compared to one thread on the Xeon Phi and to the sequential version executed on the Xeon E5.
Execution times for all thread counts and CNN architecture sizes on the Xeon Phi, and for the sequential version on the Xeon E5.
35.
36.
37.
38. Implements deep learning on low-cost platforms.
The low-cost device adopts a task-flexible architecture and multiple forms of parallelism to cover the functions of a CDBN (convolutional deep belief network).
39. Complex functions
An additional stage
Random number generation
Additional tradeoffs:
Arithmetic precision
Hardware parallelism
Memory / I/O bandwidth
Random number generator
40. By implementing 3 key features:
Deep network learning engine with a dual-threaded, 4-stage task-level pipeline.
Deep network inference engine with a dynamically reconfigurable systolic PE array.
True random number generator.
41. High computational throughput and memory bandwidth
Implementing and optimizing the 1D, 2D, and multi-channel 2D convolution operations on the GPU and Intel MIC.
Hence, we go for a many-core architecture.
42.
43. For 1D and 2D convolution: register tiling.
For multi-channel 2D convolution: local-memory tiling (a register-tiling sketch follows below).
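A minimal, illustrative sketch of register tiling for 1D convolution (plain C++, not the paper's GPU/MIC kernel): each outer iteration produces TILE outputs at once, keeping the filter tap and the partial sums in registers so every loaded input element is reused across several outputs. Assumes out_len is a multiple of TILE and the input holds out_len + flen - 1 elements.

constexpr int TILE = 4;  // outputs computed per outer iteration (assumed)

void conv1d_register_tiled(const float* in, const float* filt, float* out,
                           int out_len, int flen) {
    for (int o = 0; o < out_len; o += TILE) {
        float acc[TILE] = {};                  // partial sums held in registers
        for (int k = 0; k < flen; ++k) {
            float w = filt[k];                 // filter tap reused TILE times
            for (int t = 0; t < TILE; ++t)
                acc[t] += w * in[o + t + k];
        }
        for (int t = 0; t < TILE; ++t)
            out[o + t] = acc[t];
    }
}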
44. On Intel MIC, our solution gets up to 25% of the theoretical
peak performance.
45. Deep learning algorithms are compute-intensive, so the choice of framework and hardware depends on the use-case scenario.
GPU:
Pro: Provides huge computational power.
Can be used as a cluster of GPUs.
Con: Huge power consumption, and algorithms have to be redesigned and reimplemented in CUDA/OpenCL.
FPGAs:
Pro: Low power consumption compared to GPUs.
Con: Designing algorithms for FPGAs can be time-consuming.
A potential speedup of 12.6 times and an energy reduction of 87.5% on a 6-node FPGA-accelerated Hadoop cluster.
46. Xeon Phi co-processor:
Pro: Offers a considerable amount of computational power and is very easy to migrate to from a normal CPU. Performance can be improved further by combining it with the Hadoop MapReduce method.
Con: To run huge datasets, a higher-end processor should be used.
x86 CPU: Performance can be improved through fixed-point implementation, batching, and lazy evaluation.