Webinar 20130404

Molecular Simulation with
GROMACS on CUDA GPUs
Erik Lindahl
GROMACS is used on a
wide range of resources

We’re comfortably
on the single-μs
scale today

Larger machines often
mean larger systems,
not necessarily longer
simulations
Why use GPUs?
Throughput
• Sampling
• Free energy
• Cost efficiency
• Power efficiency
• Desktop simulation
• Upgrade old machines
• Low-end clusters

Performance
• Longer simulations
• Parallel GPU simulation using Infiniband
• High-end efficiency by using fewer nodes
• Reach timescales not possible with CPUs
Many GPU programs today
Caveat emptor:
It is much easier to get a reference problem/algorithm to scale,
i.e., you see much better relative scaling before introducing
any optimizations on the CPU side.
When comparing programs:
What matters is absolute performance
(ns/day), not the relative speedup!
Gromacs-4.5 with OpenMM
Previous version - what was the limitation?
Gromacs running
entirely on CPU as
a fancy interface
Actual simulation running
entirely on GPU
using OpenMM kernels
Only a few select algorithms worked
Multi-CPU sometimes beat GPU performance...
Why don’t we use the CPU too?

CPU:
• 0.5-1 TFLOP
• Random memory access OK (not great)
• Great for complex latency-sensitive stuff (domain decomposition, etc.)

GPU:
• ~2 TFLOP
• Random memory access won’t work
• Great for throughput
Gromacs-4.6 next-generation GPU implementation:
Programming model
Domain decomposition with dynamic load balancing:
• 1 MPI rank per domain (one rank can be dedicated to PME)
• N OpenMP threads per rank on the CPU
• 1 GPU context per rank
• Load balancing both between MPI ranks and between CPU and GPU
Heterogeneous CPU-GPU acceleration in GROMACS-4.6

Wallclock time for an MD step:
~0.5 ms if we want to simulate 1μs/day
We cannot afford to lose all previous acceleration tricks!
CPU trick 1: all-bond constraints
• Δt is limited by the fastest motions (bond vibrations): ~1 fs
• Remove bond vibrations with constraints:
• SHAKE (iterative, slow): 2 fs, but problematic in parallel (won’t work)
• Compromise: constrain h-bonds only: 1.4 fs

GROMACS (LINCS):
• LINear Constraint Solver
• Approximate matrix inversion expansion
• Fast & stable - much better than SHAKE
• Non-iterative
• Enables 2-3 fs timesteps
• Parallel: P-LINCS (since Gromacs 4.0)

LINCS steps (from t=1 to t=2):
A) Move without constraints (t=2’)
B) Project out the motion along the bonds (t=2’’)
C) Correct for the rotational extension of the bond (t=2)
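The projection idea can be illustrated for a single bond. A toy sketch in Python, not the actual LINCS algorithm (which solves many coupled constraints at once via the approximate matrix-inversion expansion); equal masses are assumed, so the correction is split symmetrically:

```python
import math

def constrain_bond(r1, r2, d0):
    """After an unconstrained step, move two atoms back along their
    bond vector so that |r1 - r2| = d0 again (toy single-constraint
    version of what a constraint solver does each MD step)."""
    bond = [a - b for a, b in zip(r1, r2)]
    dist = math.sqrt(sum(c * c for c in bond))
    # Fraction of the current bond vector each atom must move;
    # split 50/50 because equal masses are assumed here.
    corr = 0.5 * (dist - d0) / dist
    r1_new = [a - corr * c for a, c in zip(r1, bond)]
    r2_new = [b + corr * c for b, c in zip(r2, bond)]
    return r1_new, r2_new
```

With coupled constraints (e.g. three bonds sharing an atom) these corrections interact, which is exactly the coupling LINCS resolves non-iteratively.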
CPU trick 2: Virtual sites
• The next-fastest motions are H-angle vibrations and rotations of
CH3/NH2 groups
• Try to remove them:
• Ideal H positions are constructed from the heavy atoms;
CH3/NH2 groups are made rigid
• Calculate forces, then project them back onto the heavy atoms
• Integrate only the heavy-atom positions, reconstruct the H’s
• Enables 5 fs timesteps!

(figure: virtual-site construction types 2, 3, 3fd, 3fad, 3out and 4fd,
with geometric parameters a, b, |d| and θ; fewer degrees of freedom,
same interactions)
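As a sketch of the idea (not the actual GROMACS vsite code): for the simplest linear "2" construction, the virtual site sits at a fixed fraction a along the vector between two constructing atoms, and the force acting on it is spread back with the same weights, so total force and torque are preserved:

```python
def vsite2_position(ri, rj, a):
    """Linear 2-atom virtual site: rv = (1-a)*ri + a*rj."""
    return [(1.0 - a) * x + a * y for x, y in zip(ri, rj)]

def vsite2_spread_force(fv, a):
    """Distribute the force on the virtual site back onto the two
    constructing atoms with the construction weights; the virtual
    site itself carries no mass and is never integrated."""
    fi = [(1.0 - a) * c for c in fv]
    fj = [a * c for c in fv]
    return fi, fj
```

The other construction types (3fd, 3out, 4fd, ...) use the same pattern with more constructing atoms and out-of-plane terms.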
(figures from the GROMACS domain-decomposition paper: Fig. 2 shows the
decomposition cells 1-7 that communicate coordinates to cell 0, with the
zones to be communicated dashed and rc the cut-off radius; Fig. 3 shows
the communicated zones, including the extra volumes B’ and C’ and the
“8th-sphere” communication that ensures every bonded interaction between
charge groups can be assigned to a processor. Bonded interactions are
distributed over the processors by finding the smallest x, y and z
coordinates of the charge groups involved.)
CPU trick 3: Non-rectangular cells & decomposition
• Load balancing works for arbitrary triclinic cells, at the cost of
more extensive book-keeping
• Example: Lysozyme, 25k atoms in a rhombic dodecahedron
(36k atoms in the equivalent cubic cell)

All these “tricks” now work fine with GPUs in GROMACS-4.6!
From neighborlists to cluster pair lists in GROMACS-4.6
• x,y gridding, then z sort and z binning: full x,y,z gridding
• Organize atoms into small clusters and build a cluster pair list,
where each tile computes all-vs-all interactions between two clusters

(diagram: the gridding steps and an 8x8 tile of pairwise
cluster-cluster interactions)
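A minimal sketch of the cluster-pair idea in pure Python (illustrative only: GROMACS builds spatially compact 4-8 atom clusters via x,y gridding and z sorting; here atoms are simply sorted along one coordinate):

```python
def build_cluster_pairlist(coords, cluster_size, r_list):
    """Sort 1D atom coordinates, chop them into fixed-size clusters,
    and keep the cluster pairs whose extents come within r_list.
    Each kept pair corresponds to one all-vs-all tile on the GPU."""
    order = sorted(range(len(coords)), key=lambda i: coords[i])
    clusters = [order[i:i + cluster_size]
                for i in range(0, len(order), cluster_size)]
    pairs = []
    for ci in range(len(clusters)):
        for cj in range(ci, len(clusters)):
            # Conservative gap between the two clusters' extents
            lo_i = min(coords[k] for k in clusters[ci])
            hi_i = max(coords[k] for k in clusters[ci])
            lo_j = min(coords[k] for k in clusters[cj])
            hi_j = max(coords[k] for k in clusters[cj])
            gap = max(lo_j - hi_i, lo_i - hi_j, 0.0)
            if gap <= r_list:
                pairs.append((ci, cj))
    return clusters, pairs
```

The payoff is regularity: the GPU always processes dense cluster-cluster tiles, at the price of computing some pairs outside the cutoff (see the work-efficiency numbers below).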
Tiling circles is difficult
Need a lot of cubes
to cover a sphere
Interactions outside
cutoff should be 0.0
(figure: pair interactions included by the group cutoff vs. the
buffered Verlet cutoff)

• GROMACS-4.6 calculates a “large enough” buffer zone so no
interactions are missed
• Optimize nstlist for performance - no need to worry about missing
any interactions with Verlet!
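The buffer logic can be sketched with a simple worst-case bound (a simplification; GROMACS estimates the buffer from particle velocities and an acceptable energy-drift tolerance rather than this hard bound): if no atom moves faster than v_max and the list is rebuilt every nstlist steps, padding the cutoff by twice the maximum displacement guarantees no pair inside rc is ever missed.

```python
def verlet_list_radius(rc, v_max, dt, nstlist):
    """Worst-case buffered list cutoff: between list rebuilds two
    atoms can approach each other by at most 2 * v_max * dt * nstlist,
    so pad the interaction cutoff rc by that distance."""
    buffer = 2.0 * v_max * dt * nstlist
    return rc + buffer
```

Raising nstlist grows the buffer (more pairs per list) but amortizes the list build over more steps, which is why it becomes a pure performance knob.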
Tixel algorithm work-efficiency
8x8x8 tixels compared to a non-performance-optimized Verlet scheme
(fraction of computed pair interactions that fall within the cutoff):

                   Verlet   Tixel pruned   Tixel non-pruned
rc=1.5, rl=1.6      0.82        0.58            0.36
rc=1.2, rl=1.3      0.75        0.52            0.29
rc=0.9, rl=1.0      0.73        0.42            0.21

Highly memory-efficient algorithm:
Can handle 20-40 million atoms with 2-3 GB memory
Even cheap consumer cards will get you a long way
PME weak scaling
Xeon X5650 3T + C2075 / process

(plot: iteration time per 1000 atoms in ms/step vs. system size per GPU,
from 1,500 to 3 million atoms; series for the 1xC2075 CUDA force kernel
and the total CPU time with 1x, 2x and 4x C2075. The complete time step
includes kernel, h2d, d2h, CPU constraints, CPU PME, CPU integration,
OpenMP & MPI. Marked points: 480 μs/step at 1,500 atoms and 700 μs/step
at 6,000 atoms.)
Example performance: systems with ~24,000 atoms, 2 fs time steps, NPT

(bar chart, ns/day from 0 to 300)

Amber, DHFR:
• CPU, 96 CPU cores
• GPU, 1xGTX680
• GPU, 4xGTX680

Gromacs, RNAse:
• CPU, 6 cores
• CPU, 2*8 cores
• 6 CPU cores + 1xK20c GPU
• 6 CPU cores + 1xGTX680 GPU
• dodec+vsites (5 fs), 6 CPU cores
• dodec+vsites (5 fs), 2*8 CPU cores
• dodec+vsites (5 fs), 6 cores + 1xK20c
• dodec+vsites (5 fs), 6 cores + 1xGTX680
The Villin headpiece
~8,000 atoms, 5 fs steps, explicit solvent, triclinic box, PME electrostatics

(bar chart, ns/day from 0 to 1200: i7 3930K (GMX 4.5), i7 3930K (GMX 4.6),
i7 3930K + GTX680, E5-2690 + GTX Titan)

2,546 FPS (beat that, Battlefield 4)
GLIC: Ion channel
Membrane protein, 150,000 atoms, running on a simple desktop!

(bar chart, ns/day from 0 to 40: i7 3930K (GMX 4.5), i7 3930K (GMX 4.6),
i7 3930K + GTX680, E5-2690 + GTX Titan)
Strong scaling of Reaction-Field and PME
1.5M-atom waterbox; RF cutoff = 0.9 nm, PME auto-tuned cutoff

(log-log plot: performance in ns/day vs. number of processes/GPUs,
showing RF and PME curves against their linear-scaling references)

Challenge: GROMACS has very short iteration times, which puts hard
requirements on latency/bandwidth.
Small systems often work best using only a single GPU!
GROMACS 4.6 extreme scaling
Scaling to 130 atoms/core: ADH protein, 134k atoms, PME, rc >= 0.9

(log-log plot: ns/day vs. number of sockets (CPU or CPU+GPU) from 1 to 64;
series for XK6/X2090, XK7/K20X, XK6 CPU only and XE6 CPU only)
Using GROMACS
with GPUs in practice
Compiling GROMACS with CUDA
• Make sure the CUDA driver is installed
• Make sure the CUDA SDK is in /usr/local/cuda
• Use the default GROMACS distribution
• Just run ‘cmake’ and we will detect CUDA automatically and use it
• gcc-4.7 works great as a compiler
• On Macs, you want to use icc (commercial)

Longer Mac story: Clang does not support OpenMP, which gcc does.
However, the current gcc versions for Macs do not support AVX on the
CPU. icc supports both!
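A typical out-of-source build might look like the following (a sketch, not from the slides; the tarball name, CUDA path and job count are illustrative, and `GMX_GPU` is the CMake switch that enables the CUDA build):

```shell
# Unpack the source and build in a separate directory
tar xzf gromacs-4.6.tar.gz
cd gromacs-4.6
mkdir build && cd build

# CUDA is detected automatically; the flags below just make the
# GPU build and toolkit location explicit
cmake .. -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
make -j 8
```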
Using GPUs in practice
In your mdp file:

cutoff-scheme  = Verlet
nstlist        = 10          ; likely 10-50
coulombtype    = pme         ; or reaction-field
vdw-type       = cut-off
nstcalcenergy  = -1          ; only when writing edr

• The Verlet cutoff-scheme is more accurate
• Necessary for GPUs in GROMACS
• Use the -testverlet mdrun option to force it with old tpr files
• Slower on a single CPU, but scales well on CPUs too!

The shift modifier is applied to both coulomb and VdW by default on
GPUs - change with coulomb-modifier / vdw-modifier.
Load balancing

rcoulomb       = 1.0
fourierspacing = 0.12

• If we increase/decrease the coulomb direct-space cutoff and the
reciprocal-space PME grid spacing by the same factor, we maintain
accuracy...
• ... but we move work between CPU & GPU!
• By default, GROMACS-4.6 does this automatically at the start of each
run - you will see diagnostic output

GROMACS excels when you combine a fairly fast CPU and GPU.
Currently, this means Intel CPUs.
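The invariant behind the automatic tuning can be written down directly (illustrative only; mdrun measures actual step timings when deciding the scale factor): scaling rcoulomb and fourierspacing together preserves the balance of the Ewald splitting error while trading direct-space (GPU) work against reciprocal-space (CPU) work.

```python
def shift_pme_work(rcoulomb, fourierspacing, scale):
    """Scale the direct-space cutoff and the PME grid spacing by the
    same factor. scale > 1 moves work to the GPU (bigger cutoff,
    coarser grid); scale < 1 moves it back to the CPU. Accuracy is
    maintained because both resolutions change in step."""
    return rcoulomb * scale, fourierspacing * scale
```

For example, with the settings above a factor of 1.25 gives rcoulomb = 1.25 nm and fourierspacing = 0.15 nm: more pairs for the GPU, a smaller FFT grid for the CPU.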
Demo
Acknowledgments
• GROMACS: Berk Hess, David van der Spoel, Per Larsson, Mark Abraham
• Gromacs-GPU: Szilard Pall, Berk Hess, Rossen Apostolov
• Multi-threaded PME: Roland Schulz, Berk Hess
• Nvidia: Mark Berger, Scott LeGrand, Duncan Poole, and others!
Test Drive K20
GPUs!

Questions?

Run GROMACS on Tesla K20
GPU today

Devang Sachdev - NVIDIA
dsachdev@nvidia.com
@DevangSachdev

Contact us

Experience The Acceleration

Sign up for FREE GPU Test Drive
on remotely hosted clusters
www.nvidia.com/GPUTestDrive

GROMACS questions
Check www.gromacs.org
gmx-users@gromacs.org mailing list
Stream other webinars from GTC
Express:
http://www.gputechconf.com/
page/gtc-express-webinar.html
Register for the Next GTC
Express Webinar
Molecular Shape Searching on GPUs
Paul Hawkins, Applications Science Group Leader, OpenEye
Wednesday, May 22, 2013, 9:00 AM PDT
Register at www.gputechconf.com/gtcexpress
