CUDA and Caffe for deep learning 
Amgad Muhammad 
Mohamed Ghoneim
Outline 
• GPU Computing 
• What is CUDA? 
• Why use CUDA? 
• When to use CUDA? 
• CUDA - Machine Specs 
• CUDA - Matrix Multiplication 
• CUDA - Closest Pair in 2D 
• Convolutional Neural Networks 
• Auto Encoder
GPU Computing 
• Moore’s law has slowed down. 
• Performance gains now come from parallelism rather than from faster individual processing units. 
• A CPU has a small number of processing units, each with very high processing power. 
• A GPU has a large number of processing units, each with moderate processing power.
What is CUDA? 
• Compute Unified Device Architecture 
• Introduced by NVIDIA in 2006. 
• Refers to two different concepts: 
1. CUDA Architecture: the massively parallel architecture of modern GPUs, with hundreds of cores. 
2. CUDA Programming Model: the model used to program these GPUs.
[Bryan Catanzaro]
Why use CUDA? 
• Efficiently processes thousands of small, repeated tasks in parallel. 
• Provides a way for these tasks to communicate and cooperate efficiently. 
• Offers a scalable and intuitive mechanism for expressing parallelism.
When to use CUDA? 
• Lots of computations and lots of data. 
• Parallel algorithms. 
• Neural Networks. 
• Physical Simulations 
• Distributed Computing 
• Accelerated Encryption, Decryption and Compression
CUDA - Machine Specs 
Machine specs for this experiment: 
- Processor: Dual-core AMD Opteron(™) processor 2216, 2.4 GHz (2 processors) 
- RAM: 32.0 GB 
- OS: 64-bit Windows 7 
- Graphics Card: Quadro FX 4600 
- CUDA Driver: 5.5 
- CUDA Compute Capability: 1.0 
- # of Cores: 96 
- Core Clock: 500 MHz 
- Memory: 768 MB 
- Memory Clock: 1400 MHz
CUDA - Matrix Multiplication 
Comparing different implementations: 
All the times below are in milliseconds. 
[Figure: Matrix Multiplication. x-axis: Matrix Side, 100 to 1000; y-axis: Time in MS, 0 to 25000; series: CPU, GPU.]
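The slides do not show the GPU code behind this comparison. As a rough illustration only, a straightforward CUDA matrix multiplication could look like the sketch below, where one thread computes one output element. The names matMulNaive and matMulGpu, the 16 x 16 block shape, and the row-major square layout are assumptions made for this example, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread computes one element of C = A * B (all n x n, row-major).
__global__ void matMulNaive(const float* A, const float* B, float* C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;  // threads past the matrix edge do nothing

    float sum = 0.0f;
    for (int k = 0; k < n; ++k)
        sum += A[row * n + k] * B[k * n + col];  // every operand is read from global memory
    C[row * n + col] = sum;
}

// Hypothetical host wrapper: copies the inputs to the GPU, launches, copies the result back.
std::vector<float> matMulGpu(const std::vector<float>& A, const std::vector<float>& B, int n)
{
    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // assumed block shape
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    matMulNaive<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> C(size_t(n) * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);  // blocking copy, runs after the kernel
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Each thread here reads 2n values from global memory, which is the cost that shared-memory tiling aims to reduce.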
CUDA - Closest Pair in 2D 
This is a well-known problem in which the algorithm tries to find the two points that are closest to each other. There are several ways to solve it: 
1. Brute force, with complexity O( n^2 ) 
2. Divide and conquer, with complexity O( n log(n) ) 
For completeness, there is another implementation using KD-trees, with complexity similar to divide and conquer.
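To make the brute-force option concrete, here is a minimal sketch of an unoptimized GPU version: each thread owns one point and compares it against every other point, reading directly from global memory. This is not the authors' code. The names closestBruteForce and closestPairDistance and the choice of 128 threads per block are assumptions, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cfloat>
#include <cmath>
#include <vector>

// Each thread scans all points for the nearest neighbour of its own point: O(n^2) work overall.
__global__ void closestBruteForce(const float2* points, float* best, int count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= count) return;

    float2 p = points[idx];
    float minSq = FLT_MAX;
    for (int j = 0; j < count; ++j) {
        if (j == idx) continue;  // skip the point itself
        float dx = p.x - points[j].x;
        float dy = p.y - points[j].y;
        minSq = fminf(minSq, dx * dx + dy * dy);
    }
    best[idx] = minSq;  // squared distance to the nearest other point
}

// Hypothetical host helper: returns the distance between the closest pair.
float closestPairDistance(const std::vector<float2>& pts)
{
    int n = (int)pts.size();
    if (n < 2) return FLT_MAX;  // no pair exists

    float2* dPts;
    float* dBest;
    cudaMalloc(&dPts, n * sizeof(float2));
    cudaMalloc(&dBest, n * sizeof(float));
    cudaMemcpy(dPts, pts.data(), n * sizeof(float2), cudaMemcpyHostToDevice);

    int threads = 128;  // assumed block size
    closestBruteForce<<<(n + threads - 1) / threads, threads>>>(dPts, dBest, n);

    std::vector<float> best(n);
    cudaMemcpy(best.data(), dBest, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dPts);
    cudaFree(dBest);

    // The final minimum and the square root are taken on the host.
    return std::sqrt(*std::min_element(best.begin(), best.end()));
}
```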
CUDA - Closest Pair in 2D (cont.) 
Comparing different implementations: 
All the times below are in milliseconds. 
[Figure: Closest Pair in 2D. x-axis: Number of Points, 100 to 100000; y-axis: Time in MS, 0 to 250000; series: Brute Force CPU, BF GPU, BF GPU Optimized.]
CUDA - Closest Pair in 2D (cont.) 
Comparing different implementations: 
All the times below are in milliseconds. 
[Figure: Closest Pair in 2D. x-axis: Number of Points, 100 to 40000; y-axis: Time in MS, 0 to 1000; series: BF GPU Optimized, Divide and Conquer CPU.]
CUDA - Closest Pair in 2D (cont.) 
To explain how the optimized GPU version works, we first need to review the thread hierarchy of the GPU:
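In the CUDA programming model, threads are grouped into blocks and blocks are grouped into a grid. A small sketch can make this visible: each thread derives its global index from blockIdx, blockDim and threadIdx, then records which block it belongs to. The names recordBlockOfThread and launchIndexDemo are made up for this illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Every thread computes its global index and stores the index of the block it runs in.
__global__ void recordBlockOfThread(int* blockOf, int count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // block offset + offset inside the block
    if (idx < count) blockOf[idx] = blockIdx.x;       // the last block may be only partly used
}

// Hypothetical host helper: launches enough blocks to cover `count` elements.
std::vector<int> launchIndexDemo(int count, int threadsPerBlock)
{
    int* dBlockOf;
    cudaMalloc(&dBlockOf, count * sizeof(int));

    int blocks = (count + threadsPerBlock - 1) / threadsPerBlock;  // round up
    recordBlockOfThread<<<blocks, threadsPerBlock>>>(dBlockOf, count);

    std::vector<int> blockOf(count);
    cudaMemcpy(blockOf.data(), dBlockOf, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dBlockOf);
    return blockOf;
}
```

With 10 elements and 4 threads per block, three blocks are launched and the last one has two idle threads, which is why the bounds check is needed.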
CUDA - Closest Pair in 2D (cont.) 
To explain how the optimized GPU version works, we first need to review the memory hierarchy of the GPU:
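The levels that matter most in practice are per-thread registers, per-block shared memory, and global memory visible to all threads. The sketch below, a block-wise sum, touches all three: each thread reads one value from global memory, the block combines values in shared memory, and one thread writes the block total back to global memory. The names blockSum, sumOnGpu and kThreads are assumptions for this example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two for this reduction

// Each block sums up to kThreads input values and writes one total.
__global__ void blockSum(const float* in, float* blockTotals, int count)
{
    __shared__ float tile[kThreads];  // shared memory: visible to all threads of this block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (idx < count) ? in[idx] : 0.0f;  // one global-memory read per thread
    __syncthreads();

    // Tree reduction performed entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) blockTotals[blockIdx.x] = tile[0];  // one global-memory write per block
}

// Hypothetical host helper: sums the per-block totals on the CPU.
float sumOnGpu(const std::vector<float>& data)
{
    int n = (int)data.size();
    if (n == 0) return 0.0f;
    int blocks = (n + kThreads - 1) / kThreads;

    float *dIn, *dTotals;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dTotals, blocks * sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<blocks, kThreads>>>(dIn, dTotals, n);

    std::vector<float> totals(blocks);
    cudaMemcpy(totals.data(), dTotals, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dTotals);

    float sum = 0.0f;
    for (float t : totals) sum += t;
    return sum;
}
```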
CUDA - Back to Matrix Multiplication 
The matrix multiplication optimization was explained on the board.
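The board explanation is not part of the slides. A common optimization for GPU matrix multiplication is shared-memory tiling, and the sketch below shows that technique under the assumption that it is close to what was presented. The names matMulTiled, matMulTiledGpu and TILE are made up for this example, matrices are square and row-major, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // assumed tile width; each block has TILE x TILE threads

// C = A * B for n x n row-major matrices, staging TILE x TILE sub-blocks in shared memory.
__global__ void matMulTiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Each thread loads one element of each tile; out-of-range elements become 0.
        tileA[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // both tiles fully loaded

        for (int k = 0; k < TILE; ++k)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];  // operands come from shared memory
        __syncthreads();  // finish using the tiles before they are overwritten
    }
    if (row < n && col < n) C[row * n + col] = sum;
}

// Hypothetical host wrapper.
std::vector<float> matMulTiledGpu(const std::vector<float>& A, const std::vector<float>& B, int n)
{
    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matMulTiled<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> C(size_t(n) * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

With tiling, each value loaded from global memory is reused by TILE threads of the block instead of being fetched again by every thread that needs it.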
CUDA - Closest Pair in 2D (cont.) 
Explaining the optimized code on board 
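#include <cfloat>  // FLT_MAX

// blockSize and reasonableINF are not defined on this slide. They are assumed to be
// constants defined elsewhere: blockSize is the number of threads per block (it must
// equal blockDim.x) and reasonableINF is a large sentinel coordinate.
__global__ void FindClosestGPU2(float2* points, float* vals, int count) 
{ 
    // Tile of points shared by all threads of the block.
    __shared__ float2 sharedPoints[blockSize]; 
    if(count <= 1) return; 
    int idx = threadIdx.x + blockIdx.x * blockDim.x; 
    float2 thisPoint; 
    float distanceToClosest = FLT_MAX; 
    if(idx < count) thisPoint = points[idx]; 
    // Walk over the input one tile (blockSize points) at a time.
    for(int currentBlockOfPoints = 0; currentBlockOfPoints < gridDim.x; currentBlockOfPoints++) { 
        // Each thread loads one point of the tile; slots past the end get a sentinel.
        if(threadIdx.x + currentBlockOfPoints * blockSize < count) 
            sharedPoints[threadIdx.x] = points[threadIdx.x + currentBlockOfPoints * blockSize]; 
        else 
            sharedPoints[threadIdx.x].x = reasonableINF, sharedPoints[threadIdx.x].y = reasonableINF; 
        __syncthreads();  // wait until the whole tile is loaded
        if(idx < count) { 
            float *ptr = &sharedPoints[0].x; 
            for(int i = 0; i < blockSize; i++) { 
                // Squared distance from thisPoint to point i of the tile.
                float dist = (thisPoint.x - ptr[0]) * (thisPoint.x - ptr[0]) + 
                             (thisPoint.y - ptr[1]) * (thisPoint.y - ptr[1]); 
                ptr += 2; 
                // Ignore padding slots and the point itself.
                if(dist < distanceToClosest && (i + currentBlockOfPoints * blockSize < count) 
                   && (i + currentBlockOfPoints * blockSize != idx)) 
                    distanceToClosest = dist; 
            } 
        } 
        __syncthreads();  // wait before the tile is overwritten
    } 
    if(idx < count) 
        vals[idx] = distanceToClosest;  // squared distance to the nearest other point
}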
CNN
Convolution, the first operation to optimize
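The slide itself shows only a figure. As a hedged illustration of why convolution parallelizes well, the sketch below computes a single-channel "valid" 2D convolution (in the CNN sense, without flipping the filter) with one thread per output pixel. It is not Caffe's implementation. The names conv2dValid and conv2dGpu and the 16 x 16 block shape are assumptions, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread computes one output pixel of a k x k filter applied without padding.
__global__ void conv2dValid(const float* in, const float* filter, float* out,
                            int height, int width, int k)
{
    int outH = height - k + 1;
    int outW = width - k + 1;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= outH || x >= outW) return;

    float sum = 0.0f;
    for (int i = 0; i < k; ++i)
        for (int j = 0; j < k; ++j)
            sum += in[(y + i) * width + (x + j)] * filter[i * k + j];
    out[y * outW + x] = sum;
}

// Hypothetical host wrapper; the input is row-major with `height` rows and `width` columns.
std::vector<float> conv2dGpu(const std::vector<float>& in, int height, int width,
                             const std::vector<float>& filter, int k)
{
    int outH = height - k + 1;
    int outW = width - k + 1;
    size_t outCount = size_t(outH) * outW;

    float *dIn, *dFilter, *dOut;
    cudaMalloc(&dIn, in.size() * sizeof(float));
    cudaMalloc(&dFilter, filter.size() * sizeof(float));
    cudaMalloc(&dOut, outCount * sizeof(float));
    cudaMemcpy(dIn, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dFilter, filter.data(), filter.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // assumed block shape
    dim3 grid((outW + block.x - 1) / block.x, (outH + block.y - 1) / block.y);
    conv2dValid<<<grid, block>>>(dIn, dFilter, dOut, height, width, k);

    std::vector<float> out(outCount);
    cudaMemcpy(out.data(), dOut, outCount * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dFilter);
    cudaFree(dOut);
    return out;
}
```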
Pooling, the second operation to optimize
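Pooling is also independent per output element, so it maps naturally onto one thread per output. The sketch below shows 2 x 2 max pooling with stride 2 on a single channel. The window size, the stride, and the names maxPool2x2 and maxPoolGpu are assumptions for illustration and are not taken from the slides or from Caffe's code. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread writes the maximum of one non-overlapping 2 x 2 window.
__global__ void maxPool2x2(const float* in, float* out, int height, int width)
{
    int outH = height / 2;
    int outW = width / 2;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= outH || x >= outW) return;

    const float* p = in + (2 * y) * width + 2 * x;  // top-left corner of the window
    out[y * outW + x] = fmaxf(fmaxf(p[0], p[1]), fmaxf(p[width], p[width + 1]));
}

// Hypothetical host wrapper; a trailing odd row or column of the input is ignored.
std::vector<float> maxPoolGpu(const std::vector<float>& in, int height, int width)
{
    int outH = height / 2;
    int outW = width / 2;
    size_t outCount = size_t(outH) * outW;

    float *dIn, *dOut;
    cudaMalloc(&dIn, in.size() * sizeof(float));
    cudaMalloc(&dOut, outCount * sizeof(float));
    cudaMemcpy(dIn, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // assumed block shape
    dim3 grid((outW + block.x - 1) / block.x, (outH + block.y - 1) / block.y);
    maxPool2x2<<<grid, block>>>(dIn, dOut, height, width);

    std::vector<float> out(outCount);
    cudaMemcpy(out.data(), dOut, outCount * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```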
Results
LeNet Results 
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. We used OpenBLAS for parallelization on the CPU. 
Because the data set is small, the speedup did not compensate for the overhead. 
[Figure: CNN. x-axis: 1 to 4 CPU Cores; y-axis: Time in Seconds, 0 to 800; series: with GPU, without GPU.]
AutoEncoder
AutoEncoders Results 
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The main operation here is the inner product. 
[Figure: Auto Encoder. x-axis: 1 to 3 CPU Cores; y-axis: Time in Seconds, 0 to 800; series: with GPU, without GPU.]
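For a fully connected layer, the inner product amounts to a matrix-vector product plus a bias, y = W x + b. The sketch below assigns one thread to each output value. It only illustrates the operation and is not how Caffe implements it. The names innerProduct and innerProductGpu, the row-major weight layout, and the block size of 128 are assumptions, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread computes one output: y[row] = b[row] + dot(W[row, :], x).
__global__ void innerProduct(const float* W, const float* x, const float* b,
                             float* y, int outDim, int inDim)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= outDim) return;

    float sum = b[row];
    for (int k = 0; k < inDim; ++k)
        sum += W[row * inDim + k] * x[k];
    y[row] = sum;
}

// Hypothetical host wrapper; W is row-major with b.size() rows and x.size() columns.
std::vector<float> innerProductGpu(const std::vector<float>& W, const std::vector<float>& x,
                                   const std::vector<float>& b)
{
    int outDim = (int)b.size();
    int inDim = (int)x.size();

    float *dW, *dX, *dB, *dY;
    cudaMalloc(&dW, W.size() * sizeof(float));
    cudaMalloc(&dX, x.size() * sizeof(float));
    cudaMalloc(&dB, b.size() * sizeof(float));
    cudaMalloc(&dY, outDim * sizeof(float));
    cudaMemcpy(dW, W.data(), W.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dX, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), b.size() * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;  // assumed block size
    innerProduct<<<(outDim + threads - 1) / threads, threads>>>(dW, dX, dB, dY, outDim, inDim);

    std::vector<float> y(outDim);
    cudaMemcpy(y.data(), dY, outDim * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dW);
    cudaFree(dX);
    cudaFree(dB);
    cudaFree(dY);
    return y;
}
```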
Thank You! 
Questions?

9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 

CUDA and Caffe for deep learning

  • 1. CUDA and Caffe for deep learning. Amgad Muhammad, Mohamed Ghoneim
  • 2. Outline • GPU Computing • What is CUDA? • Why use CUDA? • When use CUDA? • CUDA - Machine Specs • CUDA - Matrix Multiplication • CUDA - Closest Pair in 2D • Convolutional Neural Networks • Auto Encoder
  • 3. GPU Computing • Moore’s law has slowed down. • Performance gains now come from parallelism rather than from faster individual processing units. • A CPU has a small number of processing units, each with very high processing power. • A GPU has a large number of processing units, each with moderate processing power.
  • 4. What is CUDA? • Compute Unified Device Architecture • Introduced by NVIDIA in 2006. • Refers to 2 different concepts: 1. CUDA Architecture: the massively parallel architecture of modern GPUs with hundreds of cores. 2. CUDA Programming Model: the model used to program these GPUs.
  • 6. Why use CUDA? • Efficiently processing thousands of small/repeated tasks in parallel. • It provides a methodology for these tasks to communicate and cooperate efficiently. • Scalable and intuitive mechanism to express parallelism.
  • 7. When use CUDA? • Lots of computations and lots of data. • Parallel algorithms. • Neural Networks. • Physical Simulations • Distributed Computing • Accelerated Encryption, Decryption and Compression
  • 8. CUDA – Machine Specs. Machine specs for this experiment: - Processor: Dual-core AMD Opteron(™) processor 2216, 2.4 GHz (2 processors) - RAM: 32.0 GB - OS: 64-bit Windows 7 - Graphics Card: Quadro FX 4600 - CUDA Driver: 5.5 - CUDA Compatibility: 1.0 - # of Cores: 96 - Core Clock: 500 MHz - Memory: 768 MB - Memory Clock: 1400 MHz
  • 9. CUDA - Matrix Multiplication Comparing different implementations (all times in milliseconds). [Chart: Matrix Multiplication; x-axis: matrix side, 100 to 1000; y-axis: time in ms, 0 to 25000; series: CPU, GPU]
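
The slide does not show the code behind the CPU series. As a rough illustration only, a straightforward CPU baseline for square matrices might look like the sketch below; the function name `matmul_naive` and the row-major `std::vector<float>` layout are assumptions, not the authors' implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive O(n^3) product of two n x n row-major matrices: c = a * b.
// Hypothetical baseline; the deck does not show its CPU code.
std::vector<float> matmul_naive(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n)
{
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            // The innermost loop walks rows of b and c contiguously.
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}
```

Each of the n^2 output elements is independent of the others, which is why this problem maps naturally onto one GPU thread per output element.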
  • 10. CUDA - Closest Pair in 2D This is a well-known problem in which the algorithm finds the 2 points that are closest to each other. There are several ways to solve it: 1. Brute force, with complexity O( n^2 ). 2. Divide and conquer (D&C), with complexity O( n log(n) ). For completeness, there is another implementation using KD-trees with complexity similar to D&C.
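
The two CPU approaches named on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the `Point` struct and the function names `closest_brute` and `closest_dc` are assumptions, and both functions return the squared distance to avoid a square root.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Point { double x, y; };  // hypothetical point type

static double dist2(const Point& a, const Point& b)
{
    const double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

static bool by_x(const Point& a, const Point& b) { return a.x < b.x; }
static bool by_y(const Point& a, const Point& b) { return a.y < b.y; }

// Brute force: compare every pair, O(n^2). Returns the squared distance.
double closest_brute(const std::vector<Point>& p)
{
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < p.size(); ++i)
        for (std::size_t j = i + 1; j < p.size(); ++j)
            best = std::min(best, dist2(p[i], p[j]));
    return best;
}

// Recursive step on p[lo, hi), sorted by x on entry and by y on return.
static double closest_rec(std::vector<Point>& p, std::size_t lo, std::size_t hi)
{
    if (hi - lo <= 3) {  // small range: brute force, then sort by y
        double best = std::numeric_limits<double>::infinity();
        for (std::size_t i = lo; i < hi; ++i)
            for (std::size_t j = i + 1; j < hi; ++j)
                best = std::min(best, dist2(p[i], p[j]));
        std::sort(p.begin() + lo, p.begin() + hi, by_y);
        return best;
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    const double midx = p[mid].x;  // x of the dividing line
    double best = std::min(closest_rec(p, lo, mid), closest_rec(p, mid, hi));
    std::inplace_merge(p.begin() + lo, p.begin() + mid, p.begin() + hi, by_y);

    // Only points close to the dividing line can form a better pair.
    std::vector<Point> strip;
    for (std::size_t i = lo; i < hi; ++i) {
        const double dx = p[i].x - midx;
        if (dx * dx >= best) continue;
        // strip is sorted by y, so scan backwards until the y gap is too big.
        for (std::size_t j = strip.size(); j-- > 0; ) {
            const double dy = p[i].y - strip[j].y;
            if (dy * dy >= best) break;
            best = std::min(best, dist2(p[i], strip[j]));
        }
        strip.push_back(p[i]);
    }
    return best;
}

// Divide and conquer, O(n log n). Returns the squared distance.
double closest_dc(std::vector<Point> p)
{
    if (p.size() < 2) return std::numeric_limits<double>::infinity();
    std::sort(p.begin(), p.end(), by_x);
    return closest_rec(p, 0, p.size());
}
```

The brute-force version is the one that parallelizes trivially (every point scans all others independently), which is presumably why it is the variant ported to the GPU in the following slides.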
  • 11. CUDA - Closest Pair in 2D (cont.) Comparing different implementations (all times in milliseconds). [Chart: Closest Pair in 2D; x-axis: number of points, 100 to 100000; y-axis: time in ms, 0 to 250000; series: Brute Force CPU, BF GPU, BF GPU Optimized]
  • 12. CUDA - Closest Pair in 2D (cont.) Comparing different implementations (all times in milliseconds). [Chart: Closest Pair in 2D; x-axis: number of points, 100 to 40000; y-axis: time in ms, 0 to 1000; series: BF GPU Optimized, Divide and Conquer CPU]
  • 13. CUDA - Closest Pair in 2D (cont.) To explain how the optimized GPU version works, we first need to review the thread hierarchy of the GPU.
  • 14. CUDA - Closest Pair in 2D (cont.) To explain how the optimized GPU version works, we also need to review the memory hierarchy of the GPU.
  • 15. CUDA – back to Matrix Multiplication Explaining the matrix multiplication optimization on the board.
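
The diagram from this slide is not part of the transcript. As a rough aid, the sketch below emulates a one-dimensional CUDA launch on the CPU to show how a grid of blocks of threads maps onto array elements. The function name `emulate_launch` is hypothetical; the index formula is the same `threadIdx.x + blockIdx.x * blockDim.x` used in the kernel on slide 16.

```cpp
#include <cassert>
#include <vector>

// CPU emulation of a 1-D launch: gridDim blocks, each with blockDim threads.
// Every emulated thread computes its global index and, if that index is in
// range, touches exactly one element. Returns how often each element was hit.
std::vector<int> emulate_launch(int count, int blockDim)
{
    assert(count >= 0 && blockDim > 0);
    // Round up so that the last, partially filled block is still launched.
    const int gridDim = (count + blockDim - 1) / blockDim;
    std::vector<int> visits(count, 0);
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx) {
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            const int idx = threadIdx + blockIdx * blockDim;
            if (idx < count) {  // bounds guard for the last block
                ++visits[idx];
            }
        }
    }
    return visits;
}
```

On a real GPU the two loops do not exist: all blocks and threads run concurrently, and only threads within the same block can synchronize with each other and share on-chip memory.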
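  • 16. CUDA - Closest Pair in 2D (cont.) Explaining the optimized code on the board:

      #include <cfloat>  // FLT_MAX

      // blockSize and reasonableINF are constants defined elsewhere in the
      // authors' code (not shown on the slide). The kernel assumes it is
      // launched with blockDim.x == blockSize.
      // vals[idx] receives the squared distance from point idx to its nearest neighbour.
      __global__ void FindClosestGPU2(float2* points, float* vals, int count)
      {
          __shared__ float2 sharedPoints[blockSize];

          if (count <= 1) return;

          int idx = threadIdx.x + blockIdx.x * blockDim.x;
          float2 thisPoint;
          float distanceToClosest = FLT_MAX;

          if (idx < count) thisPoint = points[idx];

          // Walk over the input one tile of blockSize points at a time.
          for (int currentBlockOfPoints = 0; currentBlockOfPoints < gridDim.x; currentBlockOfPoints++)
          {
              // Each thread loads one point of the tile into shared memory;
              // slots past the end of the input are padded.
              if (threadIdx.x + currentBlockOfPoints * blockSize < count)
                  sharedPoints[threadIdx.x] = points[threadIdx.x + currentBlockOfPoints * blockSize];
              else
                  sharedPoints[threadIdx.x].x = reasonableINF, sharedPoints[threadIdx.x].y = reasonableINF;

              __syncthreads();  // the whole tile is loaded before anyone reads it

              if (idx < count)
              {
                  float *ptr = &sharedPoints[0].x;
                  for (int i = 0; i < blockSize; i++)
                  {
                      float dist = (thisPoint.x - ptr[0]) * (thisPoint.x - ptr[0]) +
                                   (thisPoint.y - ptr[1]) * (thisPoint.y - ptr[1]);
                      ptr += 2;
                      // Skip padded slots and the point itself.
                      if (dist < distanceToClosest &&
                          (i + currentBlockOfPoints * blockSize < count) &&
                          (i + currentBlockOfPoints * blockSize != idx))
                          distanceToClosest = dist;
                  }
              }

              __syncthreads();  // everyone is done before the tile is overwritten
          }

          if (idx < count) vals[idx] = distanceToClosest;
      }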
  • 17. CNN
  • 18. Convolution: the first operation to optimize
  • 19. Pooling: the second operation to optimize
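
The slide only names the operation. As a point of reference, a direct single-channel 2D convolution can be sketched as below. The function name `conv2d_valid`, the row-major layout, the absence of padding and stride, and the unflipped kernel (cross-correlation, as is common in CNN implementations) are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct "valid" 2-D convolution of an h x w image with a kh x kw kernel.
// No padding, stride 1, kernel not flipped. Output is (h-kh+1) x (w-kw+1).
std::vector<float> conv2d_valid(const std::vector<float>& in, int h, int w,
                                const std::vector<float>& k, int kh, int kw)
{
    assert(h >= kh && w >= kw && kh > 0 && kw > 0);
    assert(in.size() == static_cast<std::size_t>(h) * w);
    assert(k.size() == static_cast<std::size_t>(kh) * kw);
    const int oh = h - kh + 1;
    const int ow = w - kw + 1;
    std::vector<float> out(static_cast<std::size_t>(oh) * ow, 0.0f);
    for (int y = 0; y < oh; ++y) {
        for (int x = 0; x < ow; ++x) {
            float acc = 0.0f;
            // Weighted sum over the kernel window anchored at (y, x).
            for (int i = 0; i < kh; ++i)
                for (int j = 0; j < kw; ++j)
                    acc += in[(y + i) * w + (x + j)] * k[i * kw + j];
            out[y * ow + x] = acc;
        }
    }
    return out;
}
```

Every output pixel is an independent weighted sum, so the two outer loops are the natural candidates for GPU threads.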
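
The slide does not say which pooling variant is used. Assuming max pooling over square windows, a minimal single-channel sketch looks like this; the function name `max_pool` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Max pooling over size x size windows moved by `stride` in both directions.
// Input is an h x w row-major image; windows that would overrun are dropped.
std::vector<float> max_pool(const std::vector<float>& in, int h, int w,
                            int size, int stride)
{
    assert(size > 0 && stride > 0 && h >= size && w >= size);
    assert(in.size() == static_cast<std::size_t>(h) * w);
    const int oh = (h - size) / stride + 1;
    const int ow = (w - size) / stride + 1;
    std::vector<float> out(static_cast<std::size_t>(oh) * ow);
    for (int y = 0; y < oh; ++y) {
        for (int x = 0; x < ow; ++x) {
            // Start from the window's top-left element, then scan the window.
            float best = in[(y * stride) * w + (x * stride)];
            for (int i = 0; i < size; ++i)
                for (int j = 0; j < size; ++j)
                    best = std::max(best, in[(y * stride + i) * w + (x * stride + j)]);
            out[y * ow + x] = best;
        }
    }
    return out;
}
```

As with convolution, each output element depends only on its own window, so the work splits cleanly across threads.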
  • 21. LeNet Results The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. We used OpenBLAS for parallelization on the CPU. Because the data set is small, the overhead was not compensated by the speedup. [Chart: CNN; x-axis: 1 to 4 CPU cores; y-axis: time in seconds, 0 to 800; series: with GPU, without GPU]
  • 23. AutoEncoders Results The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The main operation here is the inner product. [Chart: Auto Encoder; x-axis: 1 to 3 CPU cores; y-axis: time in seconds, 0 to 800; series: with GPU, without GPU]
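
For readers unfamiliar with the term, the inner product operation in a fully connected layer amounts to y = W x + b. The sketch below is an illustration under assumptions (function name `inner_product`, row-major weights, no activation function), not the code that was benchmarked.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fully connected (inner product) layer: y = W x + b.
// W is rows x cols in row-major order, x has cols entries, b has rows entries.
std::vector<float> inner_product(const std::vector<float>& W,
                                 const std::vector<float>& x,
                                 const std::vector<float>& b)
{
    const std::size_t rows = b.size();
    const std::size_t cols = x.size();
    assert(W.size() == rows * cols);
    std::vector<float> y(rows);
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = b[r];
        // Dot product of row r of W with the input vector.
        for (std::size_t c = 0; c < cols; ++c)
            acc += W[r * cols + c] * x[c];
        y[r] = acc;
    }
    return y;
}
```

When a whole batch of inputs is processed at once, this becomes a matrix-matrix product, which ties the auto encoder workload back to the matrix multiplication benchmark earlier in the deck.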