Alma Mater Studiorum - University of Bologna
Master's Degree in Biomedical Engineering

Parallelization of the Algorithm WHAM with NVIDIA CUDA
Supervisor: Prof. Stefano Severi
Co-Supervisor: Ing. Simone Furini

NVIDIA Research

Presented by Nicolò Savioli
Academic year 2012/2013
Free Energy:

The aim of this thesis is to port the WHAM algorithm, originally implemented for the CPU, to GPU graphics cards. WHAM is an algorithm for estimating free-energy profiles from Molecular Dynamics simulations.

Molecular Dynamics equations of motion:
$F_i = m_i a_i, \quad i = 1, \dots, N$
$F_i = -\nabla_i V(r_1, \dots, r_N)$

Free-energy estimates can be used to identify the affinity between molecules (pharmacological research).

The difference in free energy between two configurations, 0 and 1, with probabilities $P_0$ and $P_1$, can be expressed as:
$\Delta A = A_1 - A_0 = -k_B T \log(P_1 / P_0)$

© 2008 NVIDIA Corporation
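As a worked example of the free-energy difference formula, with illustrative numbers that are not taken from the thesis: at T = 300 K, $k_B T \approx 2.494$ kJ/mol (per mole, using the gas constant R = 8.314 J mol⁻¹ K⁻¹), and the logarithm is the natural logarithm. If configuration 1 is assumed to be ten times more probable than configuration 0:

```latex
\Delta A = -k_B T \ln\frac{P_1}{P_0}
         = -(2.494\ \mathrm{kJ/mol}) \cdot \ln 10
         \approx -(2.494\ \mathrm{kJ/mol}) \cdot 2.303
         \approx -5.74\ \mathrm{kJ/mol}
```

The negative sign indicates that the more probable configuration has the lower free energy.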
Umbrella Sampling
(Torrie and Valleau, 1977)
Molecular Dynamics trajectories are limited in time, so the system can remain trapped in local energy minima.
A biasing potential can be used to force the system to explore new configurations.
In Umbrella Sampling, several simulations with different biasing potentials are run to explore the configuration space.

[Figure: an ion inside an ion channel, with the coordinate $\xi(r^{3N})$]

Biasing potential of simulation i, centered at $\xi_i^0$:
$W_i(\xi) = \frac{k}{2} (\xi - \xi_i^0)^2$

Biased Hamiltonian = Unbiased Hamiltonian + Biasing Potential:
$H_i(\Gamma) = H_0 + W_i(\xi)$
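The harmonic biasing potential $W_i(\xi) = \frac{k}{2}(\xi - \xi_i^0)^2$ can be tabulated on a histogram grid with one GPU thread per (simulation, bin) pair. The following is a minimal sketch and not the thesis code: the names `harmonic_bias_kernel` and `tabulate_bias`, the row-major layout `W[i * numBins + h]`, and the choice of bin centers `xiMin + (h + 0.5) * delta` are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: W[i * numBins + h] = k[i] / 2 * (xi_h - center[i])^2,
// where xi_h is the center of bin h. One thread per (simulation i, bin h) pair.
__global__ void harmonic_bias_kernel(const float* k, const float* center,
                                     float xiMin, float delta,
                                     int numWindows, int numBins, float* W)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx >= numWindows * numBins) return;
    int i = idx / numBins;                   // simulation (window) index
    int h = idx % numBins;                   // bin index
    float xi = xiMin + (h + 0.5f) * delta;   // assumed bin center
    float d = xi - center[i];
    W[idx] = 0.5f * k[i] * d * d;
}

// Hypothetical host wrapper: copies the inputs to the device, launches the
// kernel, and returns the tabulated bias as a host vector.
std::vector<float> tabulate_bias(const std::vector<float>& k,
                                 const std::vector<float>& center,
                                 float xiMin, float delta, int numBins)
{
    int numWindows = static_cast<int>(k.size());
    int total = numWindows * numBins;
    std::vector<float> W(total);
    if (total == 0) return W;

    float *dK = nullptr, *dCenter = nullptr, *dW = nullptr;
    cudaMalloc(&dK, numWindows * sizeof(float));
    cudaMalloc(&dCenter, numWindows * sizeof(float));
    cudaMalloc(&dW, total * sizeof(float));
    cudaMemcpy(dK, k.data(), numWindows * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dCenter, center.data(), numWindows * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    harmonic_bias_kernel<<<blocks, threads>>>(dK, dCenter, xiMin, delta,
                                              numWindows, numBins, dW);

    // cudaMemcpy waits for the kernel to finish before copying the result.
    cudaMemcpy(W.data(), dW, total * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dK);
    cudaFree(dCenter);
    cudaFree(dW);
    return W;
}
```

For a single simulation with k = 2 centered at 1, and two bins of width 1 starting at 0, `tabulate_bias({2.0f}, {1.0f}, 0.0f, 1.0f, 2)` evaluates the bias at the bin centers 0.5 and 1.5.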
Weighted Histogram Analysis Method
(WHAM)
Our aim is to calculate the properties of the original (unbiased) system using the trajectories of the biased simulations.
In the WHAM algorithm, the probability of the unbiased system is calculated as a linear combination of R estimates obtained from R independent trajectories.
Minimizing the variance of the unbiased probability gives the following set of equations:

$P^u(\xi_h) = \sum_{i=1}^{R} \frac{n_i / (2\tau_i(\xi_h))}{\sum_{j=1}^{R} \left( n_j / (2\tau_j(\xi_h)) \right) e^{-\beta (W_j(\xi_h) - f_j)}} \, P_i^b(\xi_h)$

$f_i = -\frac{1}{\beta} \log\left( \sum_h P^u(\xi_h) \, e^{-\beta W_i(\xi_h)} \right)$

Here $P_i^b(\xi_h)$ is the biased probability of bin h in simulation i.
Slide annotations: number of samples inside bin h; integrated autocorrelation time ($\tau$).

The algorithm is iterative:
a) It starts with an arbitrary set of $f_i$.
b) It uses the first equation to calculate $P^u(\xi_h)$.
c) It uses the second equation to update $f_i$.
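The iteration a) to c) can be sketched with two small kernels, one per WHAM equation. This is an illustrative sketch under stated assumptions, not the thesis implementation: the names `update_probability`, `update_free_energy`, `to_device`, and `wham_sweeps` are hypothetical, the arrays are assumed row-major (`[i * numBins + h]`), `g` is assumed to hold $2\tau_i(\xi_h)$, and each thread uses a serial loop where the thesis code uses a parallel sum reduction.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Step b): one thread per bin h evaluates the first WHAM equation.
__global__ void update_probability(const float* n, const float* g, const float* Pb,
                                   const float* W, const float* f, float beta,
                                   int R, int numBins, float* Pu)
{
    int h = blockDim.x * blockIdx.x + threadIdx.x;
    if (h >= numBins) return;
    float num = 0.0f, den = 0.0f;
    for (int i = 0; i < R; ++i) {
        float w = n[i] / g[i * numBins + h];   // n_i / (2 tau_i(xi_h))
        num += w * Pb[i * numBins + h];
        den += w * expf(-beta * (W[i * numBins + h] - f[i]));
    }
    Pu[h] = num / den;
}

// Step c): one thread per simulation i evaluates the second WHAM equation.
__global__ void update_free_energy(const float* Pu, const float* W, float beta,
                                   int R, int numBins, float* f)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= R) return;
    float sum = 0.0f;
    for (int h = 0; h < numBins; ++h)
        sum += Pu[h] * expf(-beta * W[i * numBins + h]);
    f[i] = -logf(sum) / beta;
}

// Hypothetical helper: allocates device memory and copies a host vector into it.
static float* to_device(const std::vector<float>& v)
{
    float* d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(float));
    cudaMemcpy(d, v.data(), v.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Step a) starts from f_i = 0, then runs a fixed number of sweeps and returns P^u.
std::vector<float> wham_sweeps(const std::vector<float>& n, const std::vector<float>& g,
                               const std::vector<float>& Pb, const std::vector<float>& W,
                               float beta, int numBins, int iterations)
{
    int R = static_cast<int>(n.size());
    std::vector<float> f(R, 0.0f), Pu(numBins, 0.0f);
    float *dN = to_device(n), *dG = to_device(g), *dPb = to_device(Pb);
    float *dW = to_device(W), *dF = to_device(f), *dPu = to_device(Pu);

    const int threads = 128;
    for (int it = 0; it < iterations; ++it) {
        // Kernels launched on the same stream run in order, so step c) sees step b).
        update_probability<<<(numBins + threads - 1) / threads, threads>>>(
            dN, dG, dPb, dW, dF, beta, R, numBins, dPu);
        update_free_energy<<<(R + threads - 1) / threads, threads>>>(
            dPu, dW, beta, R, numBins, dF);
    }
    cudaMemcpy(Pu.data(), dPu, numBins * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dN); cudaFree(dG); cudaFree(dPb);
    cudaFree(dW); cudaFree(dF); cudaFree(dPu);
    return Pu;
}
```

A simple sanity check is a single simulation with zero bias: the unbiased probability should then equal the biased one, and $f_1$ should stay at zero.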
Why GPU?
In recent years, new parallel architectures have increased the available computing power, making numerical simulations faster and more efficient.
One strategy for parallelizing mathematical models is GPGPU (General-Purpose Computing on Graphics Processing Units).
GPUs were originally developed for image processing and are now also used in scientific simulations.
In recent years the computational capability of these architectures has grown exponentially compared with CPUs, and since 2007 NVIDIA has made it possible to program GPUs with a dedicated programming model called CUDA.
GPU Architecture:
The execution model of NVIDIA GPUs is SIMD (Single Instruction, Multiple Data): a single control unit executes one instruction at a time and drives several ALUs that work synchronously.

A GPU consists of a number of multiprocessors (SM).
Each multiprocessor contains 8 or 16 stream processors (SP: floating-point and integer logic units), registers, execution pipelines, and caches.
The GPU is connected to the host through PCI-Express.
Global memory: from 256 MB to 6 GB, with a bandwidth of 150 GB/s.
Shared memory: small (32 KB) but fast.
Texture memory: used to map a 2D texture onto a polygonal model.
Example Code

#include <cstdlib>
#include <cuda_runtime.h>

// Device code: each thread adds one pair of elements
__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
    // i) Global index of this thread across all blocks
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);
    // a) Allocate input vectors h_A and h_B and output vector h_C in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);
    // Initialize input vectors
    ...
    // b) Allocate vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);
    // c) Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // d) Threads are grouped into blocks, and blocks into a grid:
    //    choose the number of threads per block and of blocks per grid
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    // e) Invoke kernel
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // f) Copy result from device memory to host memory;
    //    h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // g) Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    // h) Free host memory
    ...
}
CUDA WHAM Considerations:
The code consists of 11 files, whose functions are invoked as external functions, and a main file that initializes the variables and executes the iterative algorithm.
The C++ function clock() was used to time the code.
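Kernel launches are asynchronous with respect to the host, so a host timer such as clock() only captures the duration of a kernel if the host waits for the device (for example with cudaDeviceSynchronize()) before reading the timer. An alternative is to time on the device with CUDA events. The sketch below illustrates that alternative; it is not how the thesis measured time, and `scale_kernel` and `time_scale_kernel_ms` are hypothetical names used only to have something to time.

```cuda
#include <cuda_runtime.h>

// Hypothetical workload, used only as something to time.
__global__ void scale_kernel(float* data, int n, float factor)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the elapsed GPU time, in milliseconds, of one launch over n elements.
float time_scale_kernel_ms(int n)
{
    float* dData = nullptr;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMemset(dData, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);                    // enqueued before the kernel
    scale_kernel<<<blocks, threads>>>(dData, n, 2.0f);
    cudaEventRecord(stop);                     // enqueued after the kernel
    cudaEventSynchronize(stop);                // wait until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dData);
    return ms;
}
```

Calling `time_scale_kernel_ms(1 << 20)` returns the device-side duration of a single launch; the first launch in a process typically includes one-time initialization costs, so repeated measurements are usually averaged.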
The following optimizations were made:
Constant memory was used to store the most frequently used variables.

Sums were optimized with a CUDA technique called sum reduction: the threads of a block exchange partial sums through shared memory and are synchronized with __syncthreads(), so that each block produces a single result.
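A block-level sum reduction of this kind can be sketched as follows. This is a generic textbook version, not the kernel used in the thesis; `block_sum`, `gpu_sum`, and `THREADS_PER_BLOCK` are hypothetical names, and the per-block partial sums are added on the host for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS_PER_BLOCK 256   // must be a power of two for the tree reduction below

// Each block reduces up to THREADS_PER_BLOCK input values to one partial sum.
__global__ void block_sum(const float* in, int n, float* blockSums)
{
    __shared__ float cache[THREADS_PER_BLOCK];
    int tid = threadIdx.x;
    int i = blockDim.x * blockIdx.x + tid;

    cache[tid] = (i < n) ? in[i] : 0.0f;   // pad the last block with zeros
    __syncthreads();                       // all loads are visible before reducing

    // Tree reduction: halve the number of active threads at every step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                   // finish this level before the next one
    }
    if (tid == 0)
        blockSums[blockIdx.x] = cache[0];  // one result per block
}

// Hypothetical host wrapper: reduces on the GPU, then adds the partial sums on the CPU.
float gpu_sum(const std::vector<float>& values)
{
    int n = static_cast<int>(values.size());
    if (n == 0) return 0.0f;
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    float *dIn = nullptr, *dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, values.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, THREADS_PER_BLOCK>>>(dIn, n, dPartial);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

The __syncthreads() call inside the loop sits outside the `if`, so every thread of the block reaches it; placing it inside the conditional would leave part of the block waiting at a barrier the other threads never reach.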

Organization of the code:
The main file invokes one external CUDA function per step of the algorithm. The slide shows the following equations alongside the calls:

Bias: $W_i(\xi) = \frac{k}{2} (\xi - \xi_i^0)^2$
NewProbabilities: $P^u(\xi_h) = \sum_{i=1}^{R} \frac{n_i / (2\tau_i(\xi_h))}{\sum_{j=1}^{R} \left( n_j / (2\tau_j(\xi_h)) \right) e^{-\beta (W_j(\xi_h) - f_j)}} \, P_i^b(\xi_h)$
NewConstants: $f_i = -\frac{1}{\beta} \log\left( \sum_h P^u(\xi_h) \, e^{-\beta W_i(\xi_h)} \right)$
NormFactor: $NF = \sum_h P^u(\xi_h)$
NormProbabilities: $P^u(\xi_h) = P^u(\xi_h) / NF$
NormCoefficient: $f_i = f_i + \log(NF)$
CheckConvergence: $\mathrm{Conv} = \sum_h \left( P^u_n[h] - P^u_o[h] \right)^2$ (n = new, o = old)
ComputeEnergy: $A(\xi) = -k_B T \log(P(\xi))$

// Invocation of the external CUDA function for calculating the bias
Bias(HIST.numhist, HIST.numwin, HIST.numdim, dev_numhist, dev_numdim, dev_histmin,
     dev_center, dev_harmrest, dev_delta, dev_step, dev_numbin, dev_U, dev_numwham);

while ((it < numit) && (!converged)) {
    // Invocation of the external CUDA function for calculating P (new probability)
    NewProbabilities(cpu_numhist[0], cpu_numwin[0], dev_numhist, dev_numwin,
                     dev_numbinwin, dev_g, dev_numwham, dev_U, dev_F, dev_denwham,
                     dev_Punnorm_result);
    // Invocation of the external CUDA functions for calculating the new sum
    summationP(cpu_numhist[0], cpu_numwin[0],
               dev_numhist, dev_numwin, dev_U, dev_UU, dev_numwham);
    NewSum(dev_numhist, cpu_numwin[0], dev_sumP, dev_UU, dev_Punnorm_result,
           dev_numwham);
    // Invocation of the external CUDA function for calculating the new constants F
    NewConstants(cpu_numhist[0], cpu_numwin[0], dev_U, dev_Punnorm_result,
                 dev_sumP, dev_F, dev_numwham);
    // Invocation of the external CUDA function for calculating the normalization constant
    NormFactor(cpu_numhist[0], dev_Punnorm_result,
               sum_normfactor_for_normprob_and_normcoef, dev_numwham);
    // Invocation of the external CUDA function for the normalization of P
    NormProbabilities(cpu_numhist[0], dev_sum_normfactor_for_normprob_and_normcoef,
                      Punnorm_result, dev_P, dev_numwham);
    // Invocation of the external CUDA function for the normalization of F
    NormCoefficient(cpu_numwin[0], dev_sum_normfactor_for_normprob_and_normcoef,
                    dev_F, dev_sumP);
    // Invocation of the external CUDA function for checking the convergence of the model
    CheckConvergence(cpu_numhist[0], dev_P, dev_P_old, HIST.numgood,
                     rmsd_result, dev_numwham);
    // Invocation of the external CUDA function for calculating the free energy
    ComputeEnergy(cpu_numhist[0], dev_P, dev_kT, dev_A_result, dev_P_old, dev_denwham);
    cudaMemcpy(cpu_rmsd_result, dev_rmsd_result, sizeof(float), cudaMemcpyDeviceToHost);
    if (cpu_rmsd_result[0] < tol)
        converged = true; // Has it converged?
    it++;
}
Architectures used:
GPU WHAM was tested on different GPU architectures and compared with the corresponding CPU WHAM.
GT 9500 with Compute Capability of 1.1 (32 CUDA cores)
GT 320M with Compute Capability of 1.0 (24 CUDA cores)
Athlon X2 64 Dual Core
Intel i5 3400 Quad Core

Analysis of Convergence

[Figure: convergence plots for the GT 9500 (32 CUDA cores) and the GT 320M (24 CUDA cores); axes: kJ/mol, Time [s]]

Both GPUs reach the same point of convergence.
Performance:
Performance almost doubles from compute capability 1.0 to compute capability 1.1.

[Figure: Time [s] versus Number of Iterations for the GT 9500 (32 CUDA cores) and the GT 320M (24 CUDA cores)]
Ratio with variable grid:

[Figure: GPU/CPU Time [s] versus Number of Dim Grid]

The ratio stays constant as the grid size increases: there are no memory traffic problems.
Conclusions:

For the first time, the WHAM algorithm has been implemented on a GPU.
The execution speed of the GPU-WHAM algorithm increases with the speed of the graphics card used.
The GPU/CPU speed ratio stays constant as the grid size changes.
GPU-WHAM can run in parallel with CPU calculations, increasing the overall execution speed.

Thank you for your attention!

More Related Content

What's hot

GPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingGPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingJun Young Park
 
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager ExecutionTensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager ExecutionTaegyun Jeon
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Preferred Networks
 
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...Preferred Networks
 
IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」Preferred Networks
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosAMD Developer Central
 
第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)RCCSRENKEI
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsIgor Sfiligoi
 
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed ObjectsFCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed ObjectsKusano Hitoshi
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaRob Gillen
 
Gpu based-image-quality-assessment-using-structural-similarity-(ssim)-index
Gpu based-image-quality-assessment-using-structural-similarity-(ssim)-indexGpu based-image-quality-assessment-using-structural-similarity-(ssim)-index
Gpu based-image-quality-assessment-using-structural-similarity-(ssim)-indexMahesh Khadatare
 
Enhanced Human Computer Interaction using hand gesture analysis on GPU
Enhanced Human Computer Interaction using hand gesture analysis on GPUEnhanced Human Computer Interaction using hand gesture analysis on GPU
Enhanced Human Computer Interaction using hand gesture analysis on GPUMahesh Khadatare
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDARaymond Tay
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCIgor Sfiligoi
 
PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017Yu-Hsun (lymanblue) Lin
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...NVIDIA Taiwan
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 

What's hot (20)

GPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingGPU-Accelerated Parallel Computing
GPU-Accelerated Parallel Computing
 
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager ExecutionTensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
TensorFlow Dev Summit 2018 Extended: TensorFlow Eager Execution
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018
 
Chainer v3
Chainer v3Chainer v3
Chainer v3
 
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
 
IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)
 
Exploring Gpgpu Workloads
Exploring Gpgpu WorkloadsExploring Gpgpu Workloads
Exploring Gpgpu Workloads
 
CUDA
CUDACUDA
CUDA
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUs
 
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed ObjectsFCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
Gpu based-image-quality-assessment-using-structural-similarity-(ssim)-index
Gpu based-image-quality-assessment-using-structural-similarity-(ssim)-indexGpu based-image-quality-assessment-using-structural-similarity-(ssim)-index
Gpu based-image-quality-assessment-using-structural-similarity-(ssim)-index
 
Enhanced Human Computer Interaction using hand gesture analysis on GPU
Enhanced Human Computer Interaction using hand gesture analysis on GPUEnhanced Human Computer Interaction using hand gesture analysis on GPU
Enhanced Human Computer Interaction using hand gesture analysis on GPU
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDA
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACC
 
PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 

Similar to Slide tesi

Conflux: gpgpu for .net (en)
Conflux: gpgpu for .net (en)Conflux: gpgpu for .net (en)
Conflux: gpgpu for .net (en)Andrei Varanovich
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaFerdinand Jamitzky
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA Japan
 
Threaded Programming
Threaded ProgrammingThreaded Programming
Threaded ProgrammingSri Prasanna
 
Intro2 Cuda Moayad
Intro2 Cuda MoayadIntro2 Cuda Moayad
Intro2 Cuda MoayadMoayadhn
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...David Walker
 
CUDA Tutorial 01 : Say Hello to CUDA : Notes
CUDA Tutorial 01 : Say Hello to CUDA : NotesCUDA Tutorial 01 : Say Hello to CUDA : Notes
CUDA Tutorial 01 : Say Hello to CUDA : NotesSubhajit Sahu
 
GPU Accelerated Domain Decomposition
GPU Accelerated Domain DecompositionGPU Accelerated Domain Decomposition
GPU Accelerated Domain DecompositionRichard Southern
 
Cuda introduction
Cuda introductionCuda introduction
Cuda introductionHanibei
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...AMD Developer Central
 
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...mouhouioui
 
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...NECST Lab @ Politecnico di Milano
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine ParallelismSri Prasanna
 
introduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely usedintroduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely usedHimanshu577858
 
Landset 8 的雲層去除技巧實作
Landset 8 的雲層去除技巧實作Landset 8 的雲層去除技巧實作
Landset 8 的雲層去除技巧實作鈵斯 倪
 

Similar to Slide tesi (20)

Conflux:gpgpu for .net (en)
Conflux:gpgpu for .net (en)Conflux:gpgpu for .net (en)
Conflux:gpgpu for .net (en)
 
Conflux: gpgpu for .net (en)
Conflux: gpgpu for .net (en)Conflux: gpgpu for .net (en)
Conflux: gpgpu for .net (en)
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cuda
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
 
Threaded Programming
Threaded ProgrammingThreaded Programming
Threaded Programming
 
Intro2 Cuda Moayad
Intro2 Cuda MoayadIntro2 Cuda Moayad
Intro2 Cuda Moayad
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
 
CUDA Tutorial 01 : Say Hello to CUDA : Notes
CUDA Tutorial 01 : Say Hello to CUDA : NotesCUDA Tutorial 01 : Say Hello to CUDA : Notes
CUDA Tutorial 01 : Say Hello to CUDA : Notes
 
GPU Accelerated Domain Decomposition
GPU Accelerated Domain DecompositionGPU Accelerated Domain Decomposition
GPU Accelerated Domain Decomposition
 
Cuda introduction
Cuda introductionCuda introduction
Cuda introduction
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
 
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
 
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
6. Implementation
6. Implementation6. Implementation
6. Implementation
 
introduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely usedintroduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely used
 
Landset 8 的雲層去除技巧實作
Landset 8 的雲層去除技巧實作Landset 8 的雲層去除技巧實作
Landset 8 的雲層去除技巧實作
 

Recently uploaded

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 

Recently uploaded (20)

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...

Thesis slides

  • 1. Alma Mater Studiorum - University of Bologna Master's Degree in Biomedical Engineering Parallelization of the Algorithm WHAM with NVIDIA CUDA Supervisor: Prof. Stefano Severi Co-Supervisor: Ing. Simone Furini NVIDIA Research Presented by Nicolò Savioli Academic year 2012/2013
  • 2. Free Energy: The aim of this thesis is to implement the WHAM algorithm, originally implemented on the CPU, for execution on GPU graphics cards. WHAM is an algorithm for estimating free-energy profiles from Molecular Dynamics simulations. Free-energy estimates can be used to identify the affinity between molecules (pharmacological research). Molecular Dynamics integrates Newton's equations of motion, $\vec{F}_i = m_i \vec{a}_i$, $i = 1, \dots, N$, with forces $\vec{F}_i = -\nabla_i V(\vec{r}_1, \dots, \vec{r}_N)$. The difference in free energy between two configurations, 0 and 1, can be expressed as $\Delta A = A_1 - A_0 = -k_B T \log(P_1 / P_0)$.
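As a worked example of the relation $\Delta A = -k_B T \log(P_1/P_0)$, with $\log$ the natural logarithm, assume purely for illustration a temperature of 300 K (so $k_B T \approx 2.494$ kJ/mol) and a configuration 1 that is ten times more probable than configuration 0. Neither number comes from the slides.

```latex
\Delta A = -k_B T \,\log\!\left(\frac{P_1}{P_0}\right)
         \approx -2.494\ \mathrm{kJ/mol} \times \log(10)
         \approx -2.494\ \mathrm{kJ/mol} \times 2.3026
         \approx -5.74\ \mathrm{kJ/mol}
```

The negative sign reflects that the more probable configuration has the lower free energy.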
  • 3. Umbrella Sampling (Torrie and Valleau, 1977): Molecular Dynamics trajectories are limited in time, so the system can remain trapped in local energy minima. A biasing potential can be used to force the system to explore new configurations. In Umbrella Sampling, several simulations with different biasing potentials are run to explore the configuration space along a coordinate $\xi(\vec{r}^{\,3N})$. Simulation $i$ uses the harmonic biasing potential $W_i(\xi) = \frac{k}{2}(\xi - \xi_i^0)^2$, so that the biased Hamiltonian is the unbiased Hamiltonian plus the biasing potential: $H_i(\Gamma) = H_0 + W_i(\xi)$. [Figure: an ion inside an ion channel.]
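The harmonic biasing potential can be tabulated once per window and per histogram bin before the iterations start. The following host-only C++ sketch shows one way to do that. The function name `biasTable`, its parameters, and the choice of evaluating the potential at bin centers are assumptions made for illustration, not the thesis code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tabulate W_i(xi_h) = (k / 2) * (xi_h - center_i)^2 for every window i and bin h.
// Assumption: bin h covers [ximin + h*delta, ximin + (h+1)*delta) and the
// potential is evaluated at the bin center.
std::vector<std::vector<double>> biasTable(const std::vector<double>& centers,
                                           double k, double ximin,
                                           double delta, std::size_t numBins) {
    assert(delta > 0.0 && numBins > 0);
    std::vector<std::vector<double>> W(centers.size(),
                                       std::vector<double>(numBins, 0.0));
    for (std::size_t i = 0; i < centers.size(); ++i) {
        for (std::size_t h = 0; h < numBins; ++h) {
            const double xi = ximin + (static_cast<double>(h) + 0.5) * delta;
            const double d = xi - centers[i];
            W[i][h] = 0.5 * k * d * d;
        }
    }
    return W;
}
```

Each entry depends only on its own window and bin, which is why this step maps naturally onto one GPU thread per table element.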
  • 4. Weighted Histogram Analysis Method (WHAM): Our aim is to calculate the properties of the original (unbiased) system using the trajectories of the biased simulations. In the WHAM algorithm, the probability of the unbiased system is calculated as a linear combination of R estimates obtained from R independent trajectories. Minimizing the variance of the unbiased probability gives the following set of equations:
      $P^u(\xi_h) = \sum_{i=1}^{R} \frac{n_i / 2\tau_i(\xi_h)}{\sum_{j=1}^{R} \left( n_j / 2\tau_j(\xi_h) \right) e^{-\beta (W_j(\xi_h) - f_j)}} \, P_i^b(\xi_h)$
      $f_i = -\frac{1}{\beta} \log\left( \sum_h P^u(\xi_h) \, e^{-\beta W_i(\xi_h)} \right)$
      Slide annotations: number of samples inside bin h; $\tau$ is the integrated autocorrelation time.
      The algorithm is iterative: a) It starts with an arbitrary set of $f_i$. b) It uses the first equation to calculate $P^u(\xi_h)$. c) It uses the second equation to update $f_i$.
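Steps a) to c) can be sketched as a small CPU reference implementation in host-only C++. This is not the thesis code. It assumes the autocorrelation factor $2\tau$ is the same for every window and bin, so that it cancels. It takes raw histogram counts, `counts[i][h]` $= n_i P_i^b(\xi_h)$, and it normalizes $P^u$ before updating $f_i$. The names `whamReference`, `WhamResult`, and `Matrix` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // indexed [window i][bin h]

struct WhamResult {
    std::vector<double> P;  // normalized unbiased probability P^u(xi_h)
    std::vector<double> f;  // free-energy constants f_i, one per window
    int iterations;         // number of iterations performed
};

// counts[i][h]: samples of window i in bin h; W[i][h]: bias W_i(xi_h).
// Assumes at least one sample overall and a constant autocorrelation time.
WhamResult whamReference(const Matrix& counts, const Matrix& W,
                         double beta, double tol, int maxIter) {
    const std::size_t R = counts.size();
    assert(R > 0 && W.size() == R);
    const std::size_t H = counts[0].size();
    assert(H > 0);

    std::vector<double> n(R, 0.0);  // n_i: total samples in window i
    for (std::size_t i = 0; i < R; ++i) {
        assert(counts[i].size() == H && W[i].size() == H);
        for (double c : counts[i]) n[i] += c;
    }

    std::vector<double> f(R, 0.0), P(H, 0.0), Pold(H, 0.0);  // a) arbitrary f_i
    int it = 0;
    bool converged = false;
    while (it < maxIter && !converged) {
        // b) first equation: unnormalized P^u, plus its normalization factor NF
        double NF = 0.0;
        for (std::size_t h = 0; h < H; ++h) {
            double num = 0.0, den = 0.0;
            for (std::size_t i = 0; i < R; ++i) {
                num += counts[i][h];
                den += n[i] * std::exp(-beta * (W[i][h] - f[i]));
            }
            P[h] = num / den;
            NF += P[h];
        }
        for (double& p : P) p /= NF;

        // c) second equation: update f_i from the normalized P^u
        for (std::size_t i = 0; i < R; ++i) {
            double s = 0.0;
            for (std::size_t h = 0; h < H; ++h) s += P[h] * std::exp(-beta * W[i][h]);
            f[i] = -std::log(s) / beta;
        }

        // stop when P^u no longer changes (root-mean-square deviation)
        double sq = 0.0;
        for (std::size_t h = 0; h < H; ++h) {
            const double d = P[h] - Pold[h];
            sq += d * d;
        }
        converged = std::sqrt(sq / static_cast<double>(H)) < tol;
        Pold = P;
        ++it;
    }
    return WhamResult{P, f, it};
}
```

A reference like this is mainly useful for checking a GPU port: both versions should converge to the same $P^u$ within the chosen tolerance.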
  • 5. Why GPU? In recent years, new parallel architectures have increased the available computing power, making numerical simulations faster and more efficient. One strategy for parallelizing mathematical models is GPGPU (General-Purpose computing on Graphics Processing Units). It was originally developed for image processing and is now also used in scientific simulations. The computational capability of these architectures has been growing much faster than that of CPUs, and since 2007 NVIDIA has made it possible to program GPUs with a dedicated language called CUDA.
  • 6. GPU Architecture: NVIDIA GPUs follow a SIMD (Single Instruction, Multiple Data) model: a single control unit executes one instruction at a time and drives several ALUs that work synchronously. The GPU consists of a number of Multiprocessors (SM) and is connected to the host through PCI Express. Components: 8 or 16 Stream Processors (SP; floating-point and integer logic units), registers, execution pipelines, and caches. Texture memory: used to map a 2D texture onto a polygonal model. Global memory: from 256 MB to 6 GB, with a bandwidth of 150 GB/s. Shared memory: small (32 KB) but fast.
  • 7. Example Code:
      // Device code
      __global__ void VecAdd(float* A, float* B, float* C, int N)
      {
          // i) Global index of this thread: block offset plus index within the block
          int i = blockDim.x * blockIdx.x + threadIdx.x;
          if (i < N)  // the last block may contain threads past the end of the vectors
              C[i] = A[i] + B[i];
      }

      // Host code
      int main()
      {
          int N = ...;
          size_t size = N * sizeof(float);

          // a) Allocate input vectors h_A and h_B and output vector h_C in host memory
          float* h_A = (float*)malloc(size);
          float* h_B = (float*)malloc(size);
          float* h_C = (float*)malloc(size);

          // Initialize input vectors
          ...

          // b) Allocate vectors in device memory
          float* d_A; cudaMalloc(&d_A, size);
          float* d_B; cudaMalloc(&d_B, size);
          float* d_C; cudaMalloc(&d_C, size);

          // c) Copy vectors from host memory to device memory
          cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
          cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

          // d) Threads are grouped into blocks, and blocks into a grid:
          //    choose the number of threads per block and of blocks per grid
          int threadsPerBlock = 256;
          int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

          // e) Invoke kernel
          VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

          // f) Copy result from device memory to host memory;
          //    h_C contains the result in host memory
          cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

          // g) Free device memory
          cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);

          // h) Free host memory
          ...
      }
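The expression `(N + threadsPerBlock - 1) / threadsPerBlock` in step d) is an integer ceiling division: it rounds the block count up so that every element gets a thread. A minimal host-only C++ sketch of that arithmetic follows. The helper name `blocksForElements` is hypothetical.

```cpp
#include <cassert>

// Smallest number of blocks such that blocks * threadsPerBlock >= n.
int blocksForElements(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

For example, 1000 elements with 256 threads per block need 4 blocks, which launches 1024 threads. The 24 surplus threads are the reason the kernel guards its work with `if (i < N)`.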
  • 8. CUDA WHAM Considerations: The code consists of 11 files, invoked as external functions, and a main file that initializes the variables and executes the iterative algorithm. The C++ function clock() was used to time the code. Optimizations: constant memory was used to store the most frequently used variables. To optimize the summations, we used a CUDA technique called sum reduction: the threads of a block are synchronized with __syncthreads() and exchange partial results through shared memory, so that each block produces a single result.
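The access pattern of a shared-memory sum reduction can be illustrated on the host. The C++ sketch below emulates a single thread block sequentially. It is not the thesis kernel. The name `blockReduceSum` is hypothetical, and a power-of-two block size is assumed, as is common for this pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a tree reduction inside ONE thread block.
// 'shared' plays the role of the block's __shared__ array (taken by value,
// so the caller's data is left untouched).
double blockReduceSum(std::vector<double> shared) {
    const std::size_t blockSize = shared.size();
    assert(blockSize > 0 && (blockSize & (blockSize - 1)) == 0);  // power of two

    for (std::size_t stride = blockSize / 2; stride > 0; stride /= 2) {
        // On the GPU, every thread with tid < stride runs this body in parallel.
        for (std::size_t tid = 0; tid < stride; ++tid) {
            shared[tid] += shared[tid + stride];
        }
        // On the GPU, __syncthreads() goes here so that all partial sums of
        // this step are written before the next step reads them.
    }
    return shared[0];  // the block's single result
}
```

With 8 values the loop runs 3 steps (strides 4, 2, 1) instead of 7 sequential additions, and in general a block of n threads needs log2(n) synchronized steps.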
  • 9. Organization of the code: Each step of the algorithm is an external CUDA function. The equation computed by each function is:
      Bias: $W_i(\xi) = \frac{k}{2}(\xi - \xi_i^0)^2$
      NewProbabilities: $P^u(\xi_h) = \sum_{i=1}^{R} \frac{n_i / 2\tau_i(\xi_h)}{\sum_{j=1}^{R} \left( n_j / 2\tau_j(\xi_h) \right) e^{-\beta (W_j(\xi_h) - f_j)}} \, P_i^b(\xi_h)$
      NewConstants: $f_i = -\frac{1}{\beta} \log\left( \sum_h P^u(\xi_h) \, e^{-\beta W_i(\xi_h)} \right)$
      NormFactor: $NF = \sum_h P^u(\xi_h)$
      NormProbabilities: $P^u(\xi_h) = P^u(\xi_h) / NF$
      NormCoefficient: $f_i = f_i + \log(NF)$
      CheckConvergence: $Conv = (P^u_n[i] - P^u_o[i])^2$
      ComputeEnergy: $A(\xi) = -k_B T \log(P(\xi))$

      // Invocation of the external CUDA function for calculating the bias
      Bias(HIST.numhist, HIST.numwin, HIST.numdim, dev_numhist, dev_numdim, dev_histmin, dev_center,
           dev_harmrest, dev_delta, dev_step, dev_numbin, dev_U, dev_numwham);

      while ((it < numit) && (!converged)) {
          // Invocation of the external CUDA function for calculating P (new probability)
          NewProbabilities(cpu_numhist[0], cpu_numwin[0], dev_numhist, dev_numwin,
                           dev_numbinwin, dev_g, dev_numwham, dev_U, dev_F, dev_denwham, dev_Punnorm_result);

          // Invocation of the external CUDA functions for calculating the new sum
          summationP(cpu_numhist[0], cpu_numwin[0], dev_numhist, dev_numwin, dev_U, dev_UU, dev_numwham);
          NewSum(dev_numhist, cpu_numwin[0], dev_sumP, dev_UU, dev_Punnorm_result, dev_numwham);

          // Invocation of the external CUDA function for calculating the new constants F
          NewConstants(cpu_numhist[0], cpu_numwin[0], dev_U, dev_Punnorm_result,
                       dev_sumP, dev_F, dev_numwham);

          // Invocation of the external CUDA function for calculating the normalization constant
          NormFactor(cpu_numhist[0], dev_Punnorm_result,
                     sum_normfactor_for_normprob_and_normcoef, dev_numwham);

          // Invocation of the external CUDA function for the normalization of P
          NormProbabilities(cpu_numhist[0], dev_sum_normfactor_for_normprob_and_normcoef,
                            Punnorm_result, dev_P, dev_numwham);

          // Invocation of the external CUDA function for the normalization of F
          NormCoefficient(cpu_numwin[0], dev_sum_normfactor_for_normprob_and_normcoef, dev_F, dev_sumP);

          // Invocation of the external CUDA function for checking the convergence of the model
          CheckConvergence(cpu_numhist[0], dev_P, dev_P_old, HIST.numgood, rmsd_result, dev_numwham);

          // Invocation of the external CUDA function for calculating the free energy
          ComputeEnergy(cpu_numhist[0], dev_P, dev_kT, dev_A_result, dev_P_old, dev_denwham);

          // Copy the convergence measure back to the host and test it against the tolerance
          cudaMemcpy(cpu_rmsd_result, dev_rmsd_result, sizeof(float), cudaMemcpyDeviceToHost);
          if (cpu_rmsd_result[0] < tol) converged = true;  // Has it converged?
          it++;
      }
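The loop stops when the change in the probability between two successive iterations falls below `tol`. A host-only C++ sketch of such a check follows. It assumes that the "rmsd" in the variable names means root-mean-square deviation over the bins. The exact normalization used in the thesis code is not shown on the slide, and the function name `rmsd` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Root-mean-square deviation between the new and the old probability estimate.
// Assumption: the squared differences are averaged over all bins.
double rmsd(const std::vector<double>& pNew, const std::vector<double>& pOld) {
    assert(!pNew.empty() && pNew.size() == pOld.size());
    double sq = 0.0;
    for (std::size_t h = 0; h < pNew.size(); ++h) {
        const double d = pNew[h] - pOld[h];
        sq += d * d;
    }
    return std::sqrt(sq / static_cast<double>(pNew.size()));
}
```

On the GPU, the sum of squared differences is itself a reduction, and only the single resulting value has to be copied back to the host in each iteration, as the `cudaMemcpy` of one `float` in the loop suggests.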
  • 10. Architectures used: GPU WHAM was tested on different GPU architectures and compared with the corresponding CPU WHAM. GPUs: GT 9500, compute capability 1.1 (32 CUDA cores); GT 320M, compute capability 1.0 (24 CUDA cores). CPUs: Athlon X2 64 dual core; Intel i5 3400 quad core.
  • 11. Analysis of Convergence: [Plot: free energy (kJ/mol) versus time (s) for the GT 9500 (32 CUDA cores) and the GT 320M (24 CUDA cores).] Both cards reach the same point of convergence.
  • 12. Performance: Performance almost doubles from compute capability 1.0 to compute capability 1.1. [Plot: time (s) versus number of iterations for the GT 9500 (32 CUDA cores) and the GT 320M (24 CUDA cores).]
  • 13. Ratio with variable grid: [Plot: GPU/CPU time (s) versus grid size.] The ratio stays constant as the grid size increases: there are no memory traffic problems.
  • 14. Conclusions: For the first time, the WHAM algorithm has been implemented on the GPU. The execution speed of GPU-WHAM increases with the speed of the graphics card used. The GPU/CPU speed ratio is constant when the grid size changes. GPU-WHAM can run in parallel with CPU calculations, increasing the overall speed of execution.
  • 15. Thank you for your attention!