Deep Learning Algorithm Acceleration on Hardware Platforms
Team 6:
Sourabh Ketkale : 010470785
Sahil Kaw : 010725104
Siddhi Pai : 010702458
Goutham Nekkalapu : 010815233
Prince Jacob Chandy : 010807225
 Comparison to an optimized BLAS package: for higher-order matrices, the optimized BLAS routines achieved a larger speedup over the baseline CPU implementation.
 Comparison to an optimized GPU implementation: without batching, the GPU attained a 2.8x speedup over the baseline CPU.
 Linear quantization: an 8-bit quantization scheme is used, storing activations as unsigned 8-bit integers (unsigned char) and weights as signed 8-bit integers (signed char), while biases are kept as 32-bit values.
 Intel SSSE3: roughly a 3x speedup is achieved because the instruction set provides pmaddubsw (packed multiply-add of unsigned and signed bytes); a sketch follows this list.
 Intel SSE4: this instruction set optimizes the conversion of the 16-bit intermediate results to 32 bits, giving a further 9% relative speed improvement over the SSSE3 benchmark.
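A minimal sketch of the quantized kernel (our own illustration, not code from the referenced work): activations are unsigned 8-bit, weights are signed 8-bit, pmaddubsw (`_mm_maddubs_epi16`) does the byte multiply-accumulate, and pmaddwd (`_mm_madd_epi16`) widens the 16-bit partial sums to 32 bits. The 32-bit bias would be added by the caller.

```cpp
// Sketch of an 8-bit quantized dot product using SSE intrinsics (illustrative).
#include <emmintrin.h>   // SSE2: _mm_madd_epi16 (pmaddwd) and friends
#include <tmmintrin.h>   // SSSE3: _mm_maddubs_epi16 (pmaddubsw)
#include <cstdint>

// n is assumed to be a multiple of 16. Note that pmaddubsw saturates its
// 16-bit intermediate sums, so real implementations keep the value ranges
// small enough to avoid overflow.
int32_t quantized_dot(const uint8_t* act, const int8_t* wgt, int n) {
    __m128i acc = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi16(1);
    for (int i = 0; i < n; i += 16) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(act + i));
        __m128i w = _mm_loadu_si128(reinterpret_cast<const __m128i*>(wgt + i));
        // pmaddubsw: 16 unsigned*signed byte products, pairwise summed to 16 bits
        __m128i p16 = _mm_maddubs_epi16(a, w);
        // pmaddwd: widen adjacent 16-bit pairs into four 32-bit partial sums
        __m128i p32 = _mm_madd_epi16(p16, ones);
        acc = _mm_add_epi32(acc, p32);
    }
    // horizontal sum of the four 32-bit lanes
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return _mm_cvtsi128_si32(acc);
}
```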
BATCHING: by scoring inputs in batches rather than one at a time, the CPU can reuse cached weights and activations across the whole batch, further closing the gap with (and even overtaking) the GPU.
LAZY EVALUATION: only a fraction of the network's outputs is actually needed at any point, so the number of parameters visited, and with them the arithmetic and memory operations, can be reduced using the Gaussian selection technique.
BATCHED LAZY EVALUATION: applying lazy evaluation to small batches in the speech evaluation task readily improves CPU performance relative to the GPU; a combined sketch follows.
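A rough sketch of how batching and lazy evaluation combine (our illustration; the function and variable names are not from the referenced work): only the output units selected by Gaussian selection are evaluated, and each selected weight row is reused across every frame in the batch while it is hot in cache.

```cpp
// Sketch of batched lazy evaluation. `active_out` holds the indices of the
// output units actually needed; each weight row is loaded once and reused
// for every frame in the batch.
#include <cstddef>
#include <vector>

void batched_lazy_layer(const std::vector<float>& weights,   // [n_out x n_in], row-major
                        const std::vector<float>& bias,      // [n_out]
                        const std::vector<float>& frames,    // [batch x n_in], row-major
                        const std::vector<int>&   active_out,
                        std::size_t n_in, std::size_t batch,
                        std::vector<float>& out)              // [batch x active_out.size()]
{
    out.assign(batch * active_out.size(), 0.0f);
    for (std::size_t j = 0; j < active_out.size(); ++j) {
        const std::size_t o = static_cast<std::size_t>(active_out[j]);
        const float* w_row = &weights[o * n_in];              // stays cached across the batch
        for (std::size_t b = 0; b < batch; ++b) {
            const float* x = &frames[b * n_in];
            float acc = bias[o];
            for (std::size_t i = 0; i < n_in; ++i)
                acc += w_row[i] * x[i];
            out[b * active_out.size() + j] = acc;
        }
    }
}
```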
 An autoencoder is an artificial neural network used for learning efficient codings.
 A stacked autoencoder is a deep learning model consisting of multiple autoencoders (see the sketch after this list).
 The Xeon Phi is effectively a small cluster of 60 cores, each with 4 hardware threads. It has 8 GB of memory, its own file system, runs a Linux operating system, and is clocked at about 1 GHz. Each core has a 32 KB L1 data cache and a 512 KB L2 cache.
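As a rough illustration (our own sketch; the layer dimensions and the sigmoid non-linearity are arbitrary choices), a stacked autoencoder is simply the composition of each layer's encoder:

```cpp
// Sketch of a stacked autoencoder's forward (encoding) pass.
#include <cmath>
#include <vector>

struct AutoencoderLayer {
    std::vector<std::vector<float>> W;   // [hidden][visible] encoder weights
    std::vector<float> b;                // [hidden] encoder bias

    std::vector<float> encode(const std::vector<float>& x) const {
        std::vector<float> h(W.size());
        for (std::size_t j = 0; j < W.size(); ++j) {
            float z = b[j];
            for (std::size_t i = 0; i < x.size(); ++i) z += W[j][i] * x[i];
            h[j] = 1.0f / (1.0f + std::exp(-z));   // sigmoid activation
        }
        return h;
    }
};

// A stacked autoencoder is the composition of its layers' encoders.
std::vector<float> encode_stack(const std::vector<AutoencoderLayer>& stack,
                                std::vector<float> x) {
    for (const auto& layer : stack) x = layer.encode(x);
    return x;
}
```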
 Thread oversubscription means that the number of threads running in parallel exceeds the number of hardware threads the Xeon Phi supports.
 It greatly degrades Xeon Phi performance because it forces context switching, which is very expensive on a many-core processor.
Solution:
 The MapReduce method can effectively determine the number of threads required by each MKL (Math Kernel Library) function; a sketch of the thread-capping idea follows this list.
 MKL can also determine the number of threads it needs by itself, but that approach is not well suited to model parallelism and asynchronous training.
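A minimal sketch of the thread-capping idea, assuming the 240 hardware threads (60 cores x 4) mentioned above; the even-split policy and the function name are our own illustration, while the OpenMP/MKL calls are standard:

```cpp
// Sketch: cap the OpenMP/MKL thread count per model replica so that several
// replicas training on one Xeon Phi do not oversubscribe its hardware threads.
#include <mkl.h>
#include <omp.h>

void configure_replica_threads(int num_replicas) {
    const int hw_threads = 240;                       // 60 cores * 4 hardware threads
    int per_replica = hw_threads / num_replicas;      // even split, no oversubscription
    if (per_replica < 1) per_replica = 1;
    omp_set_num_threads(per_replica);                 // threads for our own OpenMP regions
    mkl_set_num_threads(per_replica);                 // threads used inside MKL calls
}
```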
Basic Design of Xeon Phi:
Training datasets for neural networks are very large, so a lot of I/O takes place between host RAM and the coprocessor's memory, and this transfer time also needs to be taken into account.
To address this, all parameters and temporary variables are kept resident in the Xeon Phi's global memory, and only the training dataset is transferred repeatedly.
Parallel Design:
 Data parallelism: achieved by the Vector Processing Unit, which performs the element-wise operations within each model replica.
 Task parallelism: achieved by running multiple threads on the Xeon Phi.
 Affinity mode: affinity sets up the mapping between threads and cores. A sketch combining both levels of parallelism follows.
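The sketch below (ours, not code from the referenced work) shows the two levels together: OpenMP threads provide the task parallelism, and `#pragma omp simd` exposes the per-element work to the 512-bit vector processing unit; thread-to-core affinity would be set externally, e.g. via the Intel runtime's KMP_AFFINITY environment variable.

```cpp
// Sketch of task parallelism (threads) plus data parallelism (the VPU).
#include <omp.h>

void scale_rows(float* data, const float* scale, int rows, int cols) {
    #pragma omp parallel for          // task parallelism: rows are split across threads
    for (int r = 0; r < rows; ++r) {
        const float s = scale[r];
        #pragma omp simd              // data parallelism: the row is vectorized on the VPU
        for (int c = 0; c < cols; ++c)
            data[r * cols + c] *= s;
    }
}
```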
What is really holding us back with 'deep learning'?
To achieve this kind of computing, one cannot depend on a single system; you need 'large-scale distributed systems'.
You have multiple model replicas, each consisting of multiple machines, that train on different subsets of the data and publish their updates to the global model parameter server (a simplified sketch follows below).
Model Parallelism
Data Parallelism
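To make the data-parallel picture concrete, here is a deliberately simplified, single-process sketch (all names are ours): each replica computes a gradient on its own data subset and publishes it to a shared parameter store.

```cpp
// Simplified sketch of data parallelism with a parameter server. In the real
// systems the replicas and the server run on different machines; here the
// "publish updates to the global model" step is just a local method call.
#include <vector>

struct ParameterServer {
    std::vector<float> weights;
    void publish(const std::vector<float>& grad, float lr) {
        for (std::size_t i = 0; i < weights.size(); ++i)
            weights[i] -= lr * grad[i];               // apply the replica's update
    }
};

void replica_step(ParameterServer& server,
                  const std::vector<float>& gradient_from_my_data_subset) {
    server.publish(gradient_from_my_data_subset, /*lr=*/0.01f);
}
```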
Whole-system co-design
 Model partitioning – the working set of the model is kept in the L3 cache
 Local weight computation at the parameter server
Exploiting asynchrony (weight updates are commutative and associative):
 Multi-threaded weight updates without locks (a Hogwild-style sketch follows this list)
 Asynchronous batch updates – weight updates are aggregated locally and pushed to the parameter server only once the aggregation is large enough
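A minimal Hogwild-style sketch of the lock-free updates (our illustration, not the papers' code): because the updates are commutative and associative, worker threads write into the shared weight vector without taking locks, accepting the resulting races in exchange for avoiding lock contention.

```cpp
// Sketch of lock-free ("Hogwild"-style) multi-threaded weight updates.
// Formally the unsynchronized writes to `weights` are a data race; that is
// the deliberate trade-off of the technique.
#include <omp.h>
#include <vector>

void hogwild_updates(std::vector<float>& weights,
                     const std::vector<std::vector<float>>& worker_grads,
                     float lr) {
    #pragma omp parallel for   // one worker's update per thread, no locking
    for (int w = 0; w < static_cast<int>(worker_grads.size()); ++w) {
        const std::vector<float>& g = worker_grads[w];   // same size as weights
        for (std::size_t i = 0; i < g.size(); ++i)
            weights[i] -= lr * g[i];                     // unsynchronized read-modify-write
    }
}
```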
 To achieve this, GeePS needs to overcome the challenges of limited GPU memory, inter-machine communication (data-movement overheads), and GPU stalls
 A parameter server works by separating the problem of processing data from the problem of communicating and synchronizing parameters between different machines
 GeePS is a parameter server supporting data-parallel model training
The authors first tried using an existing state-of-the-art parameter server system (IterStore) with GPU-based ML…
To enable a parameter server to support parallel ML applications running on distributed GPUs, the authors make three important changes:
 Explicit use of GPU memory for the parameter cache
 Batch-based parameter access methods (illustrated by the sketch after this list)
 Parameter server management of GPU memory on behalf of the application
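The sketch below illustrates what batch-based parameter access means; the class and method names are hypothetical and are not the GeePS API. Instead of touching parameters one row at a time, the application reads and updates a whole batch of rows per call, so the data can be staged and moved in bulk.

```cpp
// Hypothetical batch-access parameter cache (illustrative only).
#include <cstdint>
#include <unordered_map>
#include <vector>

class BatchParameterCache {
public:
    explicit BatchParameterCache(std::size_t row_len) : row_len_(row_len) {}

    // Gather all requested rows into one contiguous buffer (one bulk transfer).
    std::vector<float> read_batch(const std::vector<int64_t>& row_ids) {
        std::vector<float> out;
        out.reserve(row_ids.size() * row_len_);
        for (int64_t id : row_ids) {
            auto& row = table_[id];
            row.resize(row_len_, 0.0f);
            out.insert(out.end(), row.begin(), row.end());
        }
        return out;
    }

    // Apply all updates for the batch in one call. `deltas` holds
    // row_ids.size() * row_len_ values; updates are additive.
    void update_batch(const std::vector<int64_t>& row_ids,
                      const std::vector<float>& deltas) {
        for (std::size_t r = 0; r < row_ids.size(); ++r) {
            auto& row = table_[row_ids[r]];
            row.resize(row_len_, 0.0f);
            for (std::size_t i = 0; i < row_len_; ++i)
                row[i] += deltas[r * row_len_ + i];
        }
    }

private:
    std::size_t row_len_;
    std::unordered_map<int64_t, std::vector<float>> table_;
};
```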
GPUs using a CPU-based parameter server
GPU-based parameter server
Two ways to achieve parallelism:
• By distributing the deep computation across a Hadoop cluster or a cloud of computing nodes
• By using field-programmable gate array (FPGA) hardware acceleration to speed up computationally intensive deep learning kernels
 Performance bottlenecks in deep learning for CNNs
 Design of distributed Hadoop clusters in which kernel processing is split between standard nodes and accelerated FPGA-based nodes
 Design and synthesis of a reconfigurable architecture to support kernel acceleration
 Design of an interface library to achieve compatibility between FPGA nodes and general-purpose nodes
Kernel Identification
 Approach to Distributed Algorithm with FPGA-Based Nodes
Design and Implementation of Reconfigurable Architecture for Deep Learning Kernels
Seamless Integration of the Distributed Algorithm with the Accelerated Kernels
 To cash in on the ability of reconfigurable hardware to achieve fine-grained parallelism, which cannot be done in the case of GPUs
 The performance-per-watt ratio is better with FPGAs, which can deliver computational power with lower energy consumption in power-sensitive environments such as mobile devices and data centers
 Support with all the open source framework for the
 A set of programming languages, models, and tools supporting the Intel x86 architecture can also be used on the Intel Xeon Phi coprocessor with little change.
 As a result, instead of redesigning algorithms or models for the GPU in CUDA or OpenCL, vector-intensive algorithms can take advantage of this architecture directly.
 OpenMP and the Intel MKL (Math Kernel Library) packages are used to parallelize the algorithms.
 The many matrix multiplications are handled by the Intel MKL routines; a GEMM sketch follows.
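For example, a layer's matrix multiplication can be handed to MKL's single-precision GEMM, which is itself threaded with OpenMP (a generic sketch; the row-major layout and wrapper function are our assumptions):

```cpp
// Sketch: C = A * B via MKL's cblas_sgemm, with row-major storage.
#include <mkl.h>
#include <vector>

void layer_matmul(const std::vector<float>& A,   // [m x k] activations
                  const std::vector<float>& B,   // [k x n] weights
                  std::vector<float>& C,         // [m x n] outputs
                  int m, int n, int k) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0f, A.data(), k,
                      B.data(), n,
                0.0f, C.data(), n);
}
```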
Achieves a 302-fold speedup compared with the unoptimized sequential algorithm.
 Thread parallelism
 Controlled Hogwild
 Arbitrary Order of Synchronization
 Vectorization
 Speedup of the algorithm compared with a single thread on the Xeon Phi and with the sequential version executed on a Xeon E5
 Execution times for all thread counts and CNN architecture sizes on the Xeon Phi, and for the sequential version on the Xeon E5
Implements deep learning on low-cost platforms.
The low-cost platform device adopts a task-flexible architecture and multiple forms of parallelism to cover the functions of a CDBN.
 Complex functions
 An additional stage
 Random number generation
Additional tradeoffs:
 Arithmetic precision
 Hardware parallelism
 Memory input/output bandwidth
 Random number generator
 By implementing three key features:
 A deep-network learning engine with a dual-threaded, 4-stage task-level pipeline.
 A deep-network inference engine with a dynamically reconfigurable systolic PE array.
 A true random number generator.
 High computational throughput and memory bandwidth
 Implementing and optimizing 1D, 2D, and multi-channel 2D convolution operations on GPU and Intel MIC
 Hence, we go for a many-core architecture.
 For 1D and 2D convolution: register tiling (see the sketch after this list).
 For multi-channel 2D convolution: local-memory tiling.
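A rough sketch of register tiling for the 2D case (our illustration; the tile size and the assumption that the output dimensions divide evenly are ours): each iteration keeps a small block of output partial sums in registers while the filter taps are streamed over it.

```cpp
// Sketch of 2D convolution (cross-correlation) with register tiling.
#include <vector>

constexpr int TILE = 4;   // out_h and out_w are assumed to be multiples of TILE

void conv2d_tiled(const std::vector<float>& in, int in_h, int in_w,
                  const std::vector<float>& k, int k_h, int k_w,
                  std::vector<float>& out) {
    const int out_h = in_h - k_h + 1;
    const int out_w = in_w - k_w + 1;
    out.assign(out_h * out_w, 0.0f);
    for (int oy = 0; oy < out_h; oy += TILE) {
        for (int ox = 0; ox < out_w; ox += TILE) {
            float acc[TILE][TILE] = {};                 // register-resident output tile
            for (int ky = 0; ky < k_h; ++ky)
                for (int kx = 0; kx < k_w; ++kx) {
                    const float kv = k[ky * k_w + kx];  // each tap is reused TILE*TILE times
                    for (int ty = 0; ty < TILE; ++ty)
                        for (int tx = 0; tx < TILE; ++tx)
                            acc[ty][tx] += kv * in[(oy + ty + ky) * in_w + (ox + tx + kx)];
                }
            for (int ty = 0; ty < TILE; ++ty)           // write the finished tile back
                for (int tx = 0; tx < TILE; ++tx)
                    out[(oy + ty) * out_w + (ox + tx)] = acc[ty][tx];
        }
    }
}
```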
On Intel MIC, our solution gets up to 25% of the theoretical
peak performance.
 Deep learning algorithms are compute-intensive, so the choice of framework and hardware depends on the use-case scenario
 GPU:
 Pro: provides huge computational power
 Can be used as a cluster of GPUs
 But: high power consumption, and algorithms have to be redesigned and re-implemented in CUDA/OpenCL
 FPGAs:
 Pro: low power consumption compared with GPUs
 But: designing algorithms for FPGAs can be time-consuming
 A potential speedup of 12.6x and an energy reduction of 87.5% on a 6-node FPGA-accelerated Hadoop cluster
 Xeon Phi co-processor:
 Pro: offers a considerable amount of computational power, and it is very easy to migrate to this platform from a normal CPU; performance can be improved even further by combining it with the Hadoop MapReduce method
 But: to run huge datasets, a higher-end processor should be used
 x86 CPU: performance can be improved by fixed-point implementation, batching, and lazy evaluation.