MUDA
MUltiple Data Accelerator language

        Project Overview
          Feb 24, 2008
            Syoyo FUJITA
?
Nikkei 225 index
?
GPU slumps
CPU soars
[Chart, 2007 vs. Feb/2008]
• Geforce 9800 GX2 rumor: 1 TFlops? (3x of G80), 500 GFlops? (+50% of G80)
• PS3: 179.2 GFlops
• Mac Pro octa: 204 GFlops
• Other chart labels: "No update!", "+800 %", "?"
Nikkei 225 index
• Subprime shock!
• Credit boom ends!
• US economy declines!
• Green IT!
Future of GPU trend
[Diagram: two routes to accelerated computing: CPU via many-core, and GPU via GPGPU]
NO!
GPGPU was dead!!
GPU will be dead soon!!
Why GPU -> GPGPU is BAD
• Larger latency: host <-> PCI-ex
• Internal architecture is a black box
 • Only the GPU maker knows it
• Larger cost of branching
• Debugger?
• Program only runs on a specific GPU maker’s GPU
 • Not portable.
Why CPU -> Accelerated computing is GOOD

• Easy to program
• CPU makers provide good internal spec documentation
• Fast execution of branching
• gdb :-)
• Portable & Versatile
[Diagram: CPU -> many-core -> accelerated computing, with MUDA placed next to the CPU]
MUDA’s goal

• Extract the CPU’s maximum floating point performance for large data
 • SIMD
 • Cache optimized computation
MUDA example
MUDA code
vec sqrtmu(vec x)
{
    vec y0, y0x, y0xhalf;
    vec oneish = bit(0x3f800001);

    y0 = rsqrt(x);
    y0x = y0 * x;
    y0xhalf = 0.5 * y0x;

    return ((oneish - y0 * y0x) * y0xhalf + y0x);
}
x86/SSE output
/* Compiling this output needs <emmintrin.h> (SSE2) for _mm_set1_epi32. */
#include <emmintrin.h>

__m128 sqrtmu (const __m128 * x)
{
  __m128 y0 ;

    __m128 y0x ;

    __m128 y0xhalf ;

    const __m128 t_vec4 = (__m128)_mm_set1_epi32( 1065353217) ;
    __m128 oneish = t_vec4 ;

    const __m128 t_vec6 = (*x) ;
    const __m128 t_vec5 = _mm_rsqrt_ps( t_vec6) ;
    y0 = t_vec5 ;

    const __m128 t_vec8 = y0 ;
    const __m128 t_vec9 = (*x) ;
    const __m128 t_vec7 = _mm_mul_ps( t_vec8 , t_vec9 ) ;
    y0x = t_vec7 ;

    const float t_float13 = 0.5 ;
    const float t_float12 = t_float13 ;
    const __m128 t_vec10 = _mm_set_ps1( t_float12 ) ;
    const __m128 t_vec14 = y0x ;
    const __m128 t_vec11 = _mm_mul_ps( t_vec10 , t_vec14 ) ;
    y0xhalf = t_vec11 ;

    const __m128 t_vec19 = oneish ;
    const __m128 t_vec20 = y0 ;
    const __m128 t_vec21 = y0x ;
    const __m128 t_vec15 = _mm_mul_ps( t_vec20 ,    t_vec21 ) ;
    const __m128 t_vec16 = _mm_sub_ps( t_vec19 ,    t_vec15 ) ;
    const __m128 t_vec22 = y0xhalf ;
    const __m128 t_vec17 = _mm_mul_ps( t_vec16 ,    t_vec22 ) ;
    const __m128 t_vec23 = y0x ;
    const __m128 t_vec18 = _mm_add_ps( t_vec17 ,    t_vec23 ) ;
    return t_vec18 ;
}
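The generated code is easier to follow once the temporaries are folded away. The block below is a hand-written sketch, not MUDA compiler output. The names `sqrtmu_sse` and `sqrtmu_max_rel_error` are made up for illustration, the argument is passed by value, and `_mm_castsi128_ps` stands in for the GCC-style vector cast. It shows the idea behind `sqrtmu`: start from the fast reciprocal square root estimate, multiply by x to get an approximate square root, then apply one correction step.

```c
#include <emmintrin.h>  /* SSE2: _mm_set1_epi32, _mm_castsi128_ps, SSE float ops */

/* Approximate sqrt of four floats (illustrative rewrite of sqrtmu). */
static __m128 sqrtmu_sse(__m128 x)
{
    /* Bit pattern 0x3f800001 is the float just above 1.0, as in the MUDA source. */
    const __m128 oneish = _mm_castsi128_ps(_mm_set1_epi32(0x3f800001));
    const __m128 half   = _mm_set1_ps(0.5f);

    __m128 y0      = _mm_rsqrt_ps(x);        /* y0  ~ 1/sqrt(x), about 12 bits */
    __m128 y0x     = _mm_mul_ps(y0, x);      /* y0x ~ sqrt(x)                  */
    __m128 y0xhalf = _mm_mul_ps(half, y0x);  /* 0.5 * y0x                      */

    /* Correction: y0x + 0.5 * y0x * (1 - y0 * y0x) */
    __m128 residual = _mm_sub_ps(oneish, _mm_mul_ps(y0, y0x));
    return _mm_add_ps(_mm_mul_ps(residual, y0xhalf), y0x);
}

/* Hypothetical helper: largest relative error of result^2 against the input.
   Inputs must be strictly positive. */
static float sqrtmu_max_rel_error(const float in[4])
{
    float out[4];
    float worst = 0.0f;
    _mm_storeu_ps(out, sqrtmu_sse(_mm_loadu_ps(in)));
    for (int i = 0; i < 4; i++) {
        float err = (out[i] * out[i] - in[i]) / in[i];
        if (err < 0.0f) err = -err;
        if (err > worst) worst = err;
    }
    return worst;
}
```

A zero input yields NaN here, because the reciprocal square root of zero is infinity and infinity times zero is NaN, so a caller would need to handle zero separately.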
Why MUDA?
No unified way to describe SIMD ops

• SSE: _mm_add_ps()
• AltiVec: vec_add()
• SPE: spu_add()
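To make the problem concrete, here is a minimal sketch of what hand-written portability looks like without a language like MUDA: one wrapper per operation, selected by the preprocessor. The names `vec4f`, `vec4f_add` and `add_arrays` are hypothetical, and the AltiVec and SPE branches are untested assumptions based on the intrinsic names listed above. Loads and stores go through `memcpy` to sidestep each ISA's alignment rules.

```c
#include <stddef.h>
#include <string.h>

#if defined(__SSE__)
#include <xmmintrin.h>
typedef __m128 vec4f;
static inline vec4f vec4f_add(vec4f a, vec4f b) { return _mm_add_ps(a, b); }
#elif defined(__ALTIVEC__)
#include <altivec.h>
typedef __vector float vec4f;
static inline vec4f vec4f_add(vec4f a, vec4f b) { return vec_add(a, b); }
#elif defined(__SPU__)
#include <spu_intrinsics.h>
typedef vector float vec4f;
static inline vec4f vec4f_add(vec4f a, vec4f b) { return spu_add(a, b); }
#else
/* Scalar fallback so the sketch builds on any target. */
typedef struct { float f[4]; } vec4f;
static inline vec4f vec4f_add(vec4f a, vec4f b)
{
    vec4f r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] + b.f[i];
    return r;
}
#endif

/* dst[i] = a[i] + b[i]; four lanes at a time, then a scalar tail. */
static void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        vec4f va, vb, vr;
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        vr = vec4f_add(va, vb);
        memcpy(dst + i, &vr, sizeof vr);
    }
    for (; i < n; i++) dst[i] = a[i] + b[i];
}
```

Every additional operation needs another wrapper for every target, which is the maintenance burden the slide is pointing at.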
CPU ISA changes frequently
• SSE2 (2000), SSE3 (2004), SSE4 (2006)
• SSE5 and upcoming new CPU designs(?)
• 8-element SIMD? No SIMD in future CPUs?
• Keeping up with them is hard and unproductive. A waste of your time.
[Diagram: MUDA compiler flow]
MUDA (portable, CPU-independent description) -> MUDA compiler -> CPU- or arch-dependent code:
• SSE2 C code
• SSE4 C code
• VMX C code
• LLVM IR
Status
• SSE2 backend : 75 %
• SSE4 backend : 0 %
• VMX backend : 20 %
• LLVM IR backend : 30 %
• SIMD math function for MUDA : 5 %
• Automatic optimizer : TODO
     = I’m currently working on
Future direction
• Cache miss analysis and memory access optimization
 • Valgrind, Cache Miss Equation (CME)
• Automatic optimization
 • As FFTW, ATLAS and Spiral do
• Automatic error measurement for floating point computation
 • Interval Arithmetic, Affine Arithmetic, Gappa
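As a rough illustration of the interval arithmetic idea, the sketch below carries a lower and an upper bound through each operation, so the width of the final interval bounds the accumulated rounding error. It is not part of MUDA. All names (`interval`, `iv_add`, `iv_mul`, `next_up`, `next_down`) are hypothetical, and it widens each result by one float step in both directions instead of switching the hardware rounding mode, which is simpler but more pessimistic. Affine arithmetic and Gappa are not shown.

```c
#include <stdint.h>
#include <string.h>

typedef struct { float lo, hi; } interval;

/* Next representable float above x (NaN and +infinity are returned as is). */
static float next_up(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    if (x != x || u == 0x7f800000u) return x;
    if ((u & 0x7fffffffu) == 0) u = 1;   /* +0 or -0: smallest positive float */
    else if (u & 0x80000000u) u--;       /* negative: shrink the magnitude    */
    else u++;                            /* positive: grow the magnitude      */
    memcpy(&x, &u, sizeof x);
    return x;
}

/* Next representable float below x. */
static float next_down(float x) { return -next_up(-x); }

/* Sum of two intervals, widened outward to absorb the rounding error. */
static interval iv_add(interval a, interval b)
{
    interval r = { next_down(a.lo + b.lo), next_up(a.hi + b.hi) };
    return r;
}

/* Product of two intervals: take min and max of the four corner products. */
static interval iv_mul(interval a, interval b)
{
    float p[4] = { a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi };
    float lo = p[0], hi = p[0];
    for (int i = 1; i < 4; i++) {
        if (p[i] < lo) lo = p[i];
        if (p[i] > hi) hi = p[i];
    }
    interval r = { next_down(lo), next_up(hi) };
    return r;
}
```

Adding the point interval for `0.1f` to itself ten times, for example, gives an interval that is guaranteed to contain the exact sum, and its width shows how much rounding error could have built up.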
Performance gap
[Bar chart, vertical axis 0 to 100, higher is better]
• SIMD: Scalar : SIMD = 1 : 4
• Memory: cache miss : cache hit = 1 : 100
Optimizing memory access is much more important than SIMDization
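A common way to see the memory side of this gap is to sum the same matrix in two traversal orders. The sketch below is only an illustration with hypothetical names (`ROWS`, `COLS`, `sum_row_major`, `sum_col_major`). Both functions compute the same sum, but the first reads memory with stride 1 and the second jumps `COLS` floats between reads, which tends to miss the cache far more often on large matrices. The 1:100 ratio is the slide's figure; the gap actually measured depends on the hardware and matrix size.

```c
#include <stddef.h>

enum { ROWS = 512, COLS = 512 };

/* Cache-friendly: consecutive reads touch consecutive addresses. */
static double sum_row_major(const float *m)
{
    double s = 0.0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            s += m[r * COLS + c];
    return s;
}

/* Cache-hostile: consecutive reads are COLS floats apart. */
static double sum_col_major(const float *m)
{
    double s = 0.0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            s += m[r * COLS + c];
    return s;
}
```

Timing the two functions on a matrix larger than the last-level cache is a quick way to check how big the effect is on a given machine.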
