A Scalable Tridiagonal Solver for GPUs. Team: WenMin Xiao & ChaoQun Li, Institute of Information Science and Technology, Hunan University
Outline
What is a tridiagonal system?
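For reference, a tridiagonal system of n unknowns is conventionally written as below. The symbols a_i, b_i, c_i (sub-, main and super-diagonal) and d_i (right-hand side) are notation assumed here, not taken from the slides.

```latex
a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i, \qquad i = 1, \dots, n, \qquad a_1 = c_n = 0

\begin{pmatrix}
b_1 & c_1 &        &         \\
a_2 & b_2 & c_2    &         \\
    & \ddots & \ddots & \ddots \\
    &     & a_n    & b_n
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_n \end{pmatrix}
```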
What is it used for?
Two Applications on GPU
Depth of field blur (Michael Kass et al., 2006): OpenGL and shader language, cyclic reduction
Shallow water simulation (2007): CUDA, cyclic reduction
A Classic Serial Algorithm. Phase 1: Forward Reduction. Phase 2: Backward Substitution. Elimination steps? 2(n-1)+1 = 2n-1. Complexity? O(n).
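The two phases can be sketched in Python as below. This is a minimal illustration of the standard Thomas algorithm, not the authors' code. The function name `thomas_solve` and the array names `a`, `b`, `c`, `d` are assumptions, and the sketch assumes no pivoting is needed (for example, a diagonally dominant matrix).

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system serially (Thomas algorithm).

    a: sub-diagonal (a[0] is ignored), b: main diagonal,
    c: super-diagonal (c[-1] is ignored), d: right-hand side.
    Assumption: no pivoting is required.
    """
    n = len(b)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]

    # Phase 1: forward reduction, n-1 elimination steps
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # The last unknown is now known directly: 1 step
    x = [0.0] * n
    x[n - 1] = dp[n - 1]

    # Phase 2: backward substitution, n-1 steps
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The two loops run n-1 times each and the last unknown takes one more step, which gives the 2(n-1)+1 = 2n-1 count. Each iteration depends on the previous one, so the algorithm is inherently sequential within a single system.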
Parallel Algorithms. A set of equations mapped to one thread. A single equation mapped to one thread.
Cyclic Reduction (2-4 threads working)
Forward Reduction: 8-unknown system, 4-unknown system, 2-unknown system
Backward Substitution: solve 2 unknowns, solve the remaining 2 unknowns, solve the remaining 4 unknowns
2*log2(8) - 1 = 2*3 - 1 = 5 steps
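A serial Python sketch of cyclic reduction follows, written to mirror the 8 → 4 → 2 unknowns picture on the slide. It is an illustration under stated assumptions, not the authors' GPU kernel: the name `cyclic_reduction` is hypothetical, the system size is assumed to be a power of two (at least 2), and no pivoting is performed. On a GPU, each iteration of the inner `for` loops would be handled by its own thread.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by cyclic reduction.

    Returns (x, steps), where steps counts the parallel steps.
    Assumptions: len(b) is a power of two >= 2, no pivoting needed.
    """
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[0] = 0.0
    c[n - 1] = 0.0
    steps = 0

    # Forward reduction: halve the system until 2 unknowns remain.
    s = 1
    while 2 * s < n:
        for i in range(2 * s - 1, n, 2 * s):  # independent updates
            lo, hi = i - s, i + s
            k1 = a[i] / b[lo]
            k2 = c[i] / b[hi] if hi < n else 0.0
            new_a = -a[lo] * k1
            new_b = b[i] - c[lo] * k1 - (a[hi] * k2 if hi < n else 0.0)
            new_c = -c[hi] * k2 if hi < n else 0.0
            new_d = d[i] - d[lo] * k1 - (d[hi] * k2 if hi < n else 0.0)
            a[i], b[i], c[i], d[i] = new_a, new_b, new_c, new_d
        s *= 2
        steps += 1

    # Solve the remaining 2-unknown system directly.
    x = [0.0] * n
    i1, i2 = s - 1, n - 1
    det = b[i1] * b[i2] - c[i1] * a[i2]
    x[i1] = (d[i1] * b[i2] - c[i1] * d[i2]) / det
    x[i2] = (b[i1] * d[i2] - a[i2] * d[i1]) / det
    steps += 1

    # Backward substitution: recover the unknowns eliminated earlier.
    s //= 2
    while s >= 1:
        for i in range(s - 1, n, 2 * s):  # independent updates
            left = x[i - s] if i - s >= 0 else 0.0
            x[i] = (d[i] - a[i] * left - c[i] * x[i + s]) / b[i]
        s //= 2
        steps += 1
    return x, steps
```

For 8 unknowns this performs two forward steps, one direct solve and two backward steps, matching the 2*log2(8) - 1 = 5 count. The number of active equations halves at every forward step, which is why only a few threads have work near the middle of the algorithm.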
Parallel Cyclic Reduction (PCR) (4 threads working)
Forward Reduction, no Backward Substitution: one 8-unknown system, two 4-unknown systems, four 2-unknown systems, solve all unknowns
log2(8) = 3 steps
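The PCR variant can be sketched in the same style. Again this is a serial illustration, not the authors' implementation: `pcr_step` and `pcr_solve` are hypothetical names, the size is assumed to be a power of two (at least 2), and no pivoting is done. Unlike cyclic reduction, every equation is reduced at every step, so each step writes into fresh arrays.

```python
def pcr_step(a, b, c, d, s):
    """One PCR step: every equation i eliminates neighbours i-s and i+s."""
    n = len(b)
    na, nb, nc, nd = [0.0] * n, list(b), [0.0] * n, list(d)
    for i in range(n):  # on a GPU, one thread per equation
        lo, hi = i - s, i + s
        if lo >= 0:
            k1 = a[i] / b[lo]
            na[i] = -a[lo] * k1
            nb[i] -= c[lo] * k1
            nd[i] -= d[lo] * k1
        if hi < n:
            k2 = c[i] / b[hi]
            nc[i] = -c[hi] * k2
            nb[i] -= a[hi] * k2
            nd[i] -= d[hi] * k2
    return na, nb, nc, nd


def pcr_solve(a, b, c, d):
    """Solve a tridiagonal system by PCR; returns (x, steps).

    Assumptions: len(b) is a power of two >= 2, no pivoting needed.
    """
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[0] = 0.0
    c[n - 1] = 0.0
    s, steps = 1, 0
    # Each step splits every system into two independent halves.
    while 2 * s < n:
        a, b, c, d = pcr_step(a, b, c, d, s)
        s *= 2
        steps += 1
    # n/2 independent 2-unknown systems remain: pairs (i, i + s).
    x = [0.0] * n
    for i in range(s):
        j = i + s
        det = b[i] * b[j] - c[i] * a[j]
        x[i] = (d[i] * b[j] - c[i] * d[j]) / det
        x[j] = (b[i] * d[j] - a[j] * d[i]) / det
    steps += 1
    return x, steps
```

For 8 unknowns this takes two reduction steps plus one step that solves the four 2-unknown systems, that is log2(8) = 3 steps with no backward substitution. The price is more arithmetic per step, since all equations stay active throughout.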
Advantages of Previous Algorithms
Hybrid Algorithm. One 8-unknown system. One PCR step. Parallel Thomas.
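The hybrid idea (a few PCR steps to split the system, then one Thomas solve per resulting subsystem) can be sketched as follows. This is an illustrative serial version under assumptions, not the authors' kernel: the names `pcr_step`, `thomas_solve` and `hybrid_solve` and the `pcr_steps` parameter are hypothetical, the system is assumed to have at least 2**pcr_steps unknowns, and no pivoting is done.

```python
def pcr_step(a, b, c, d, s):
    """One PCR step with stride s (each equation eliminates i-s and i+s)."""
    n = len(b)
    na, nb, nc, nd = [0.0] * n, list(b), [0.0] * n, list(d)
    for i in range(n):
        lo, hi = i - s, i + s
        if lo >= 0:
            k1 = a[i] / b[lo]
            na[i] = -a[lo] * k1
            nb[i] -= c[lo] * k1
            nd[i] -= d[lo] * k1
        if hi < n:
            k2 = c[i] / b[hi]
            nc[i] = -c[hi] * k2
            nb[i] -= a[hi] * k2
            nd[i] -= d[hi] * k2
    return na, nb, nc, nd


def thomas_solve(a, b, c, d):
    """Serial Thomas algorithm (a[0] and c[-1] are ignored)."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def hybrid_solve(a, b, c, d, pcr_steps=1):
    """Split with PCR, then solve each interleaved subsystem with Thomas."""
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[0] = 0.0
    c[n - 1] = 0.0
    s = 1
    for _ in range(pcr_steps):
        a, b, c, d = pcr_step(a, b, c, d, s)
        s *= 2
    x = [0.0] * n
    # After the PCR steps, equations j, j+s, j+2s, ... form an
    # independent tridiagonal system; on a GPU, one thread per subsystem.
    for j in range(s):
        x[j::s] = thomas_solve(a[j::s], b[j::s], c[j::s], d[j::s])
    return x
```

With `pcr_steps=1` on an 8-unknown system, the PCR step yields two independent 4-unknown systems (even and odd indices), each of which is then solved by the sequential Thomas algorithm. More PCR steps expose more parallelism at the cost of extra arithmetic.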
GPU Implementation
Tiled PCR. Redundancy of naive tiling of PCR.
Dependency & Parallelism: How to Reduce Redundancy? Solution 1: redundancy still exists!
Dependency & Parallelism (cont.): Fine-grained tiling. Solution 2: without redundancy, sequential computation.
Cache Design: Buffered Sliding Window. Illustration of the buffered sliding window:
1. Intermediate results are cached
2. Tiles are processed in parallel
3. Each tile has multiple sub-tiles
4. Sub-tiles are processed sequentially using the cache
Components of Buffered Sliding Window
Example
Advantages of TPCR
Thread-level Parallel Thomas Algorithm. 64B aligned segment. 128B aligned segment.
Performance Evaluation: Test Platform
Performance Results. Parameters M and N: number of systems and system size. 8.3x and 49x speedups. 5x and 30x speedups.
Performance Analysis
Summary
Questions? Thanks!
