Se ha denunciado esta presentación.
Utilizamos tu perfil de LinkedIn y tus datos de actividad para personalizar los anuncios y mostrarte publicidad más relevante. Puedes cambiar tus preferencias de publicidad en cualquier momento.

Cmsb realtime

Toward Real-Time Simulation of Cardiac Dynamics

Computational Methods in Systems Biology (CMSB) September 23, 2011. Paris, France.

  • Sé el primero en comentar

  • Sé el primero en recomendar esto

Cmsb realtime

  1. 1. Toward Real-Time Simulation <br />of Cardiac Dynamics<br />EzioBartocci<br />Stony Brook University<br />Computational Methods in Systems Biology (CMSB)<br />September 23, 2011. Paris, France. <br />Joint work with E. Cherry, J. Glimm, R. Grosu, S. A. Smolka, F. Fenton <br />aspartof<br />NSF’s Computational Modeling and Analysis of Complex Systems (CMACS)<br /><br />
  2. 2. Outline<br /><ul><li>Biological Background
  3. 3. Cardiac Models Analysis
  4. 4. CUDA Programming Model
  5. 5. Reaction Diffusion in CUDA
  6. 6. Case Studies
  7. 7. Work in Progress</li></li></ul><li>Biological Background<br /><ul><li>Cardiac Arrhythmia is a disruption of the heart’s normal rhythm
  8. 8. It is the leading cause of death in industrialized countries
  9. 9. Atrial and Ventricular Fibrillation are the most common cardiac arrhythmias</li></ul>Normal heart <br />rhythm <br />ventricular <br />tachycardia <br />ventricular <br />fibrillation<br />
  10. 10. CardiacModelsasReactionDiffusionSystems<br />Schematic Action Potential<br />Membrane’s AP depends on: <br /><ul><li>Stimulus (voltage or current):
  11. 11. External / Neighboring cells
  12. 12. Cell’s state </li></ul>voltage<br />AP has nonlinear behavior!<br /><ul><li>Reaction diffusion system:</li></ul>Threshold<br />Stimulus<br />failed initiation<br />Resting potential<br />time<br />Behavior<br />In time<br />Reaction<br />Diffusion<br />
  13. 13. Cardiac Dynamics AnalysisbySimulation<br /><ul><li>Spiral Formation
  14. 14. Spiral Breakup
  15. 15. Tip Tracking
  16. 16. Front Wave Tracking
  17. 17. Curvature Analysis
  18. 18. Conduction Velocity</li></ul>Simulation-Based Analysis <br />
  19. 19. Cardiac Dynamics AnalysisbySimulation<br /><ul><li>Spiral Formation
  20. 20. Spiral Breakup
  21. 21. Tip Tracking
  22. 22. Front Wave Tracking
  23. 23. Curvature Analysis
  24. 24. Conduction Velocity</li></ul>Simulation-Based Analysis <br />E. Bartocci, R. Singh, F. von Stein, A. Amedome, Alan-Joseph Caceres, J. Castillo, E. Closser, G. Deards, A. Goltsev, R. Sta. Ines, C. Isbilir, J. Marc, D. Moore, D. Pardi, S. Sadhu, S. Sanchez, P. Sharma, A. Singh, J. Rogers, A. Wolinetz, T. Grosso-Applewhite, K. Zhao, A. Filipski, R. Gilmour Jr, R. Grosu, J. Glimm, S. A. Smolka, E. Cherry, E. Clarke, N. Griffeth, and F. Fenton. Teaching cardiac electrophysiology modeling to undergraduate students; using Java Applets and GPU programs to study arrhythmias and spiral wave dynamics.   Journal of Advances in Physiology of Education, To appear <br />
  25. 25. Cardiac Dynamics AnalysisbySimulation<br /><ul><li>Spiral Formation
  26. 26. Spiral Breakup
  27. 27. Tip Tracking
  28. 28. Front Wave Tracking
  29. 29. Curvature Analysis
  30. 30. Conduction Velocity</li></ul>Simulation-Based Analysis <br />
  31. 31. Cardiac Dynamics AnalysisbySimulation<br /><ul><li>Spiral Formation
  32. 32. Spiral Breakup
  33. 33. Tip Tracking
  34. 34. Front Wave Tracking
  35. 35. Curvature Analysis
  36. 36. Conduction Velocity</li></ul>Simulation-Based Analysis <br />
  37. 37. CardiacModels<br /><ul><li>Karma Model (2 v) Generic
  38. 38. Minimal Model (Flavio-Cherry) (4 v) Human
  39. 39. Beeler-Reuter (8 v) Canine
  40. 40. Ten TussherPanfilov (19 v) Human
  41. 41. Iyer (67 v) Human</li></li></ul><li>Available Technologies<br />CPU based<br />GPU based<br />The GPU devotes more transistors to data processing <br />This image is from CUDA programming guide <br />
  42. 42. GPU vs CPU<br />This image is from CUDA programming guide <br />
  43. 43. GPU Architecture<br />MULTIPROCESSORS<br />Each GPU consists of a <br />Set of multiprocessors.<br />Registers<br />Registers<br />Registers<br />Registers<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Constant Cache<br />Texture Cache<br />Global Memory<br />
  44. 44. GPU Architecture<br />MULTIPROCESSORS<br />Each Multiprocesssor can have 8/32 Stream Processors (SP) (called by NVIDIA also cores) which share access to local memory.<br />Registers<br />Registers<br />Registers<br />Registers<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Constant Cache<br />Texture Cache<br />Global Memory<br />
  45. 45. GPU Architecture<br />MULTIPROCESSORS<br />Registers<br />Registers<br />Registers<br />Registers<br />Each Stream Processor<br />(core) contains a fused <br />multiply-adder capable <br />of single precision <br />arithmetic. It is capable <br />of completing 3 floating <br />point operations per <br />cycle - a fused MADD <br />and a MUL. <br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Constant Cache<br />Texture Cache<br />Global Memory<br />
  46. 46. GPU Architecture<br />MULTIPROCESSORS<br />Registers<br />Registers<br />Registers<br />Registers<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />Each multiprocessor can<br />contain one or more 64-bit fused multiple adder for double precision operations.<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Constant Cache<br />Texture Cache<br />Global Memory<br />
  47. 47. Memory Hierarchy<br />MULTIPROCESSORS<br />The fastest available <br />Memory for GPU <br />computation is device <br />registers. Each <br />multiprocessor contains <br />16/32KB of registers. <br />The registers are <br />partitioned among the<br /> MP-resident threads<br />Registers<br />Registers<br />Registers<br />Registers<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />…<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Constant Cache<br />Texture Cache<br />Global Memory<br />
  48. 48. Memory Hierarchy<br />MULTIPROCESSORS<br />Registers<br />Registers<br />Registers<br />Registers<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />…<br />Shared memory (16KB) is <br />primarily intended <br />as a means to provide<br />fast communication <br />between threads of the<br />executed by the same multiprocessor, although, due to its speed, it can also be used as a programmer controlled memory cache.<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Constant Cache<br />Texture Cache<br />Global Memory<br />
  49. 49. Memory Hierarchy<br />MULTIPROCESSORS<br />Registers<br />Registers<br />Registers<br />Registers<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />…<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Constant Cache<br />Texture Cache<br />GPUs have also DRAM <br />The latency is 150x is slower<br />then registers and <br />shared memory<br />Global Memory<br />
  50. 50. Memory Hierarchy<br />MULTIPROCESSORS<br />Registers<br />Registers<br />Registers<br />Registers<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />…<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Constant memory, as the <br />name implies, is a <br />read-only region which <br />also has a small cache.<br />Constant Cache<br />Texture Cache<br />Global Memory<br />
  51. 51. Memory Hierarchy<br />MULTIPROCESSORS<br />Registers<br />Registers<br />Registers<br />Registers<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />…<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Texture memory is read-only <br />with a small cache optimized <br />for manipulation of textures.<br />It also provides built-in linear interpolation of the data.<br />also provides built-in linear interpolation of the data.<br />Constant Cache<br />Texture Cache<br />Global Memory<br />
  52. 52. Memory Hierarchy<br />MULTIPROCESSORS<br />Registers<br />Registers<br />Registers<br />Registers<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />SP<br />…<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Shared Memory<br />Constant Cache<br />Texture Cache<br />Global memory is available <br />to all threads and is persistent <br />between GPU calls.<br />Global Memory<br />
  53. 53. CUDA Programming Model<br />Single Instruction Multiple Threads (SIMT)<br /> similar to <br />Single Instruction Multiple Data (SIMD)<br />CPU Host<br />GPU Device<br />Grid k<br />THREAD ID 0<br />THREAD ID 1<br />THREAD ID 2<br />THREAD ID 3<br />Serial Code i<br />A[ID]=ID<br />A[ID]=ID<br />A[ID]=ID<br />A[ID]=ID<br />Block (0,1)<br />Block (0,0)<br />if (ID%2)<br />else<br />endif<br />if (ID%2)<br />if (ID%2)<br />if (ID%2)<br />else<br />else<br />else<br />endif<br />endif<br />endif<br />Block (1,1)<br />Block (1,0)<br />A[ID]+=2;<br />A[ID]*=2;<br />A[ID]=0;<br />A[ID]+=2;<br />A[ID]+=2;<br />A[ID]+=2;<br />A[ID]+=2;<br />A[ID]*=2;<br />A[ID]*=2;<br />A[ID]*=2;<br />A[ID]=0;<br />A[ID]=0;<br />A[ID]=0;<br />A[ID]+=2;<br />A[ID]+=2;<br />A[ID]+=2;<br />Kernel <br />Invocation<br />Block (2,1)<br />Block (2,0)<br />Block (3,1)<br />Block (3,0)<br />Block (4,1)<br />Block (4,0)<br />0<br />1<br />2<br />3<br />2<br />0<br />4<br />0<br />4<br />0<br />8<br />0<br />6<br />2<br />10<br />2<br />0<br />1<br />2<br />3<br />4<br />0<br />8<br />0<br />Serial Code j<br />Block (5,1)<br />Block (5,0)<br />A vector<br />When branches occur in the code (e.g. due to if statements) the divergent <br />threads will become inactive until the conforming threads complete their <br />separate execution.<br />When execution merges, the threads can continue to operate in parallel.<br />
  54. 54. CUDA Programming Model<br />Different threads are multiplexed<br />and executed by the same core <br />in order to reduce the latency of <br />memory access.<br />GRID<br />BLOCK<br />Each Thread block is executed <br />by a multiprocessors<br />The max number of threads<br />for a thread block is 512 <br />and it depends on the<br />amount of registers <br />that each thread may need.<br />GPU DEVICE<br />THREAD BLOCK<br />REGISTERS<br />Global Memory<br />SHARED MEMORY<br />
  55. 55. Tesla C1060 Tesla C2070 Tesla C2090<br />This image is from<br /><br />14 Multiprocessors<br />448 Cores<br />Processor core clock: 1.15 GHz<br />1030 Gigaflops (Single precision)<br />515 Gigaflops (Double precision)<br />Max Bandwidth (144 GBytes/sec)<br />6 GB of DRAM<br />16 Multiprocessors<br />512 Cores<br />1330 Gigaflops (Single precision)<br />655Gigaflops (Double precision)<br />Max Bandwidth (177.6 GBytes/sec)<br />6 GB of DRAM<br />30 Multiprocessors<br />240 Cores<br />Processor core clock: 1.296 GHz<br />933 Gigaflops (Single precision)<br />78 Gigaflops (Double Precision)<br />Max Bandwidth(102 Gigabytes/sec)<br />4 GB of DRAM<br />
  56. 56. Reaction-Diffusion in CUDA<br />For each time step, a set of (ODEs) and Partial<br />Differential Equations (PDEs) must be solved.<br />for (timestep=1; timestep < nend; i++){<br />solveODEs <<grid, block>> (….);<br />calcLaplacian <<grid, block>> (….); <br />}<br />Solving ODEs using different method <br />depending on the model:<br />-Runge-Kutta 4th order<br />-Forward Euclid <br />-Backward Euclid <br />-Semi-Implicit<br />……<br />Solving PDEs (Calc the Laplacian)<br />
  57. 57. Optimize the Reaction Term in CUDA<br />Heaviside Simplification<br /><ul><li>In order to avoid divergent branches among threads we substitute if then else condition of Heaviside functions in this way:</li></ul>Precomputing Lookup Tables using Texture<br /><ul><li>We build tables where we precompute nonlinear part of the ODEs, we bind these table to textures and we exploit the built-in linear interpolation feature of the texture.</li></ul>If (x > a){<br />y = b;<br />}else{<br />y = c; <br />}<br />y =b + (x > a)*(b_c)<br />a, b, b_c (b-c) are constant<br />
  58. 58. Optimize the Reaction Term in CUDA<br />Kernel splitting:<br /><ul><li>For complex models, we need to split the ODEs solving in many Kernels in order to have enough registers for thread to perform our calculation.</li></ul>Use –use_fast_math compiler option to substitute <br /><ul><li>In the models that use log, exp, sqrt, functions we substitute them with the GPU built-in functions.</li></ul>Using Costantmemory for common parameters (dt, dx, ..)<br />
  59. 59. Optimize the Diffusion Term in CUDA<br />Solving PDEs (Calc the Laplacian)<br />Each location is a float<br />(4 bytes) The global <br />memory latency is <br />very slow. The memory is<br />accessed in multiples of <br />64 bytes <br />Using texture we can reduce the latency<br />In texture the data is cached (optimize for 2D Locality)<br />Drawback: It supports only single precision<br />
  60. 60. Optimize the Diffusion Term in CUDA<br />Another Technique is using SHARED MEMORY<br />The yellow and red threads read<br />the location from the global memory<br />into the shared memory.<br />The red threads calculates the <br />laplacian using the values <br />in the shared memory. <br />SYNCH<br />THREAD BLOCK<br />REGISTERS<br />SHARED MEMORY<br />Step 1<br />Step 2<br />Step 3<br />Drawback: The number of threads is greater than the number of elements<br />This technique supports single and double precision<br />
  61. 61. An Example of Perfomances<br />CPU Computation<br />GPU Computation<br />
  62. 62. Results (1)<br />KarmaModel (2V)<br />Minimal Model (4V)<br />
  63. 63. Results (2)<br />Beeler-Reuter Model (8V)<br />Ten TusscherPanfilovModel (19V)<br />
  64. 64. Results (3)<br />Iyer Model (67 V)<br />Simulation time normalized to real time <br />as a function of the number of model variables. <br />
  65. 65. Double vs Single<br />Spiral waves generated with the<br />TP model using single precision (left) and double precision (right) after 10 mins.<br />Time evolution of the intracellular K+ (left) and Na+ (right) concentrations observed at a representative grid point for the TP model with single and double precision. <br />
  66. 66. Work in Progress 3D Models<br />3D Model of a Pig Heart<br />3D Model of a Rabbit Heart<br />
  67. 67. Work in ProgressWebGL<br />
  68. 68. Thank you<br />