Learning and Development
       Presents




OPEN TALK SERIES
A series of illuminating talks and interactions that open our minds to new ideas
and concepts, that make us look for newer or better ways of doing what we did,
or that point us to exciting things we have never done before. A range of topics on
Technology, Business, Fun and Life.

Be part of the learning experience at Aditi.
Join the talks. It's free. Free as in freedom at work, not free beer.
Speak at these events. Or bring an expert/friend to talk.
Mail LEAD with topic and availability.
Parallel Programming

    Sundararajan Subramanian
        Aditi Technologies



Introduction to Parallel Computing
• The challenge
  – Provide the abstractions, programming
    paradigms, and algorithms needed to
    effectively design, implement, and maintain
    applications that exploit the parallelism
    provided by the underlying hardware in order
    to solve modern problems.
Single-core CPU chip
[Figure: a CPU chip containing a single core]
Multi-core architectures
[Figure: a multi-core CPU chip containing Core 1, Core 2, Core 3, and Core 4]
Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)


[Figure: core 1, core 2, core 3, and core 4 side by side on one chip]
The cores run in parallel
[Figure: thread 1, thread 2, thread 3, and thread 4, each running on its own core (cores 1 to 4)]
Within each core, threads are time-sliced (just like on a uniprocessor)
[Figure: several threads assigned to each of cores 1 to 4]
Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order, pipeline
  instructions, split them into
  microinstructions, do aggressive branch
  prediction, etc.
• Instruction-level parallelism enabled rapid
  increases in processor speeds over the
  last 15 years

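Instruction-level parallelism
• for (int i = 0; i < 1000; i++)
    { a[0]++; a[0]++; }   // both increments touch a[0], so the second depends on the first

• for (int i = 0; i < 1000; i++)
    { a[0]++; a[1]++; }   // the increments touch different elements, so they can overlap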
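The two loops above can be compared with a small timing harness. This is a minimal sketch, not a rigorous benchmark: the class name `IlpDemo`, the method names, and the iteration count are assumptions made for illustration, and the measured gap depends on the JIT compiler and the CPU (an optimizer may narrow or remove it).

```csharp
using System;
using System.Diagnostics;

public static class IlpDemo
{
    // Both increments hit a[0]: each one needs the result of the previous one.
    public static int[] Dependent(int iterations)
    {
        var a = new int[2];
        for (int i = 0; i < iterations; i++) { a[0]++; a[0]++; }
        return a;
    }

    // The increments hit different elements, so the hardware may overlap them.
    public static int[] Independent(int iterations)
    {
        var a = new int[2];
        for (int i = 0; i < iterations; i++) { a[0]++; a[1]++; }
        return a;
    }

    public static void Main()
    {
        const int n = 100000000; // arbitrary iteration count for the sketch

        var sw = Stopwatch.StartNew();
        Dependent(n);
        Console.WriteLine("dependent:   {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        Independent(n);
        Console.WriteLine("independent: {0} ms", sw.ElapsedMilliseconds);
    }
}
```

Both loops perform the same number of increments; only the data dependency between them differs.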
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• Server can serve each client in a separate
  thread (Web server, database server)
• A computer game can do AI, graphics, and
  physics in three separate threads
• Single-core superscalar processors cannot
  fully exploit TLP
• Multi-core architectures are the next step in
  processor evolution: explicitly exploiting TLP
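The game example above (AI, graphics, and physics in separate threads) could be sketched as follows. The subsystem names come from the slide; the class `TlpDemo`, the `Run` helper, and the idea of returning a status string are hypothetical and stand in for real work.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TlpDemo
{
    // Hypothetical subsystem body: it only reports which thread ran it.
    static string Run(string name)
    {
        return name + " on thread " + Thread.CurrentThread.ManagedThreadId;
    }

    public static string[] RunFrame()
    {
        // LongRunning hints to the scheduler that each subsystem may deserve its own thread.
        var ai = Task.Factory.StartNew<string>(() => Run("AI"), TaskCreationOptions.LongRunning);
        var graphics = Task.Factory.StartNew<string>(() => Run("Graphics"), TaskCreationOptions.LongRunning);
        var physics = Task.Factory.StartNew<string>(() => Run("Physics"), TaskCreationOptions.LongRunning);

        // Wait until all three subsystems have finished.
        Task.WaitAll(ai, graphics, physics);
        return new[] { ai.Result, graphics.Result, physics.Result };
    }

    public static void Main()
    {
        foreach (var line in RunFrame()) Console.WriteLine(line);
    }
}
```

On a multi-core machine the three tasks can run on different cores at the same time; on a single core they are time-sliced.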
A technique complementary to multi-core: Simultaneous multithreading
• Problem addressed: the processor pipeline can get stalled:
  – Waiting for the result of a long floating point (or integer) operation
  – Waiting for data to arrive from memory
  Other execution units wait unused
[Figure: block diagram of a processor core: L1 D-Cache and D-TLB, Integer and Floating Point units, Schedulers, Uop queues, Rename/Alloc, BTB, Trace Cache, uCode ROM, Decoder, BTB and I-TLB, L2 Cache and Control, Bus. Source: Intel]
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute
  SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads”
  on the same core

• Example: if one thread is waiting for a floating
  point operation to complete, another thread can
  use the integer units


Without SMT, only a single thread can run at any given time
[Figure: core block diagram with Thread 1 (floating point) occupying the core]
Without SMT, only a single thread can run at any given time
[Figure: core block diagram with Thread 2 (integer operation) occupying the core]
SMT processor: both threads can run concurrently
[Figure: core block diagram with Thread 2 (integer operation) and Thread 1 (floating point) active at the same time]
But: Can’t simultaneously use the same functional unit
[Figure: core block diagram with Thread 1 and Thread 2 both targeting the Integer unit, marked IMPOSSIBLE]
This scenario is impossible with SMT on a single core (assuming a single integer unit)
SMT not a “true” parallel processor
• Enables better threading (e.g. up to 30%)
• OS and applications perceive each
  simultaneous thread as a separate
  “virtual processor”
• The chip has only a single copy
  of each resource
• Compare to multi-core:
  each core has its own copy of resources

Multi-core: threads can run on separate cores
[Figure: two core block diagrams side by side, Thread 1 on the first core and Thread 2 on the second]
Multi-core: threads can run on separate cores
[Figure: two core block diagrams side by side, Thread 3 on the first core and Thread 4 on the second]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: our fish machines
• The number of SMT threads:
  2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
SMT Dual-core: all four threads can run concurrently
[Figure: two core block diagrams side by side, Thread 1 and Thread 3 on the first core, Thread 2 and Thread 4 on the second]
Designs with private L2 caches
[Figure, left: CORE0 and CORE1, each with its own L1 cache and L2 cache, above memory]
  Both L1 and L2 are private
  Examples: AMD Opteron, AMD Athlon, Intel Pentium D
[Figure, right: CORE0 and CORE1, each with its own L1 cache, L2 cache, and L3 cache, above memory]
  A design with L3 caches
  Example: Intel Itanium 2
Private vs shared caches?
• Advantages/disadvantages?




Private vs shared caches
• Advantages of private:
  – They are closer to core, so faster access
  – Reduces contention
• Advantages of shared:
  – Threads on different cores can share the
    same cache data
  – More cache space available if a single (or a
    few) high-performance thread runs on the
    system
Parallel Architectures
• Use multiple
  – Datapaths
  – Memory units
  – Processing units
Parallel Architectures
• SIMD
  – Single instruction stream, multiple data stream
[Figure: one Control Unit driving several Processing Units, all connected to an Interconnect]
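The SIMD idea (one instruction applied to several data elements at once) can be illustrated in managed code. This is a sketch under assumptions: it uses `System.Numerics.Vector<T>`, which is newer than the Visual Studio 2010 stack this deck covers, and the class name `SimdAdd` is made up for the example.

```csharp
using System;
using System.Numerics;

public static class SimdAdd
{
    // Adds two arrays of equal length element by element.
    public static int[] Add(int[] a, int[] b)
    {
        var result = new int[a.Length];
        int width = Vector<int>.Count; // number of ints one vector instruction handles
        int i = 0;

        // Vector part: one add operates on 'width' elements at a time.
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<int>(a, i);
            var vb = new Vector<int>(b, i);
            (va + vb).CopyTo(result, i);
        }

        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < a.Length; i++) result[i] = a[i] + b[i];
        return result;
    }

    public static void Main()
    {
        var sum = Add(new[] { 1, 2, 3, 4, 5 }, new[] { 10, 20, 30, 40, 50 });
        Console.WriteLine(string.Join(", ", sum));
    }
}
```

Whether the vector path maps to hardware SIMD instructions depends on the runtime and CPU; `Vector.IsHardwareAccelerated` reports this at run time.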
Parallel Architectures
• MIMD
  – Multiple instruction stream, multiple data stream
[Figure: several Processing/Control Units, each with its own control, all connected to an Interconnect]
Parallelism in Visual Studio 2010
• Integrated Tooling (Tools)
  – Parallel Debugger Toolwindows
  – Profiler Concurrency Analysis
• Managed Library
  – Programming Models: PLINQ, Task Parallel Library
  – Data Structures
  – Concurrency Runtime: ThreadPool, Task Scheduler, Resource Manager
• Native Library
  – Programming Models: Parallel Pattern Library, Agents Library
  – Data Structures
  – Concurrency Runtime: Task Scheduler, Resource Manager
• Operating System
  – Threads
Multi threading Today
• Divide the total number of activities across n
  processors
• In the case of 2 processors, divide it by 2.
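The manual approach described above, splitting the work evenly across n processors, might look like this with raw threads. It is a minimal sketch: the class name `ManualPartition` and the choice of summing an array are assumptions for illustration.

```csharp
using System;
using System.Threading;

public static class ManualPartition
{
    // Sums 'data' by giving each of 'workers' threads one contiguous chunk.
    public static long Sum(int[] data, int workers)
    {
        var partial = new long[workers];
        var threads = new Thread[workers];
        int chunk = (data.Length + workers - 1) / workers; // ceiling division

        for (int w = 0; w < workers; w++)
        {
            int id = w; // copy so each lambda sees its own worker index
            threads[w] = new Thread(() =>
            {
                int start = id * chunk;
                int end = Math.Min(start + chunk, data.Length);
                long sum = 0;
                for (int i = start; i < end; i++) sum += data[i];
                partial[id] = sum; // each thread writes only its own slot
            });
            threads[w].Start();
        }

        long total = 0;
        for (int w = 0; w < workers; w++)
        {
            threads[w].Join();
            total += partial[w];
        }
        return total;
    }

    public static void Main()
    {
        var data = new int[1000];
        for (int i = 0; i < data.Length; i++) data[i] = i + 1;
        Console.WriteLine(Sum(data, Environment.ProcessorCount)); // prints 500500
    }
}
```

The static split works well only when every chunk costs about the same; uneven work leaves some threads idle while others are still busy.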
User Mode Scheduler
CLR Thread Pool

[Figure: the Program Thread adds work to a single Global Queue, which feeds Worker Thread 1 … Worker Thread p]
User Mode Scheduler For Tasks
CLR Thread Pool: Work-Stealing
[Figure: the Program Thread adds tasks to the Global Queue; Worker Thread 1 … Worker Thread p each also have a Local Queue; Task 1 to Task 6 are spread across the queues and workers]
DEMO
Task-based Programming
ThreadPool Summary
ThreadPool.QueueUserWorkItem(…);

System.Threading.Tasks
Starting
Task.Factory.StartNew(…);

Parent/Child
var p = new Task(() => {
    var t = new Task(…);
});

Continue/Wait/Cancel
Task t = …
Task p = t.ContinueWith(…);
t.Wait(2000);
t.Cancel();

Tasks with results
Task<int> f =
  new Task<int>(() => C());
…
int result = f.Result;
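The starting, continuation, wait, and result calls summarized above fit together roughly as follows. This is a sketch, not the presenter's demo: `TaskBasics` and the body of `C()` are hypothetical. Note that the released .NET Framework 4 `Task` class has no `Cancel()` method; cancellation is requested through a `CancellationTokenSource` instead, so the sketch leaves cancellation out.

```csharp
using System;
using System.Threading.Tasks;

public static class TaskBasics
{
    // Hypothetical stand-in for the slide's C() computation.
    static int C() { return 6 * 7; }

    public static int Compute()
    {
        // Start a task that produces a result.
        Task<int> f = Task.Factory.StartNew<int>(() => C());

        // Schedule a continuation that runs once f has finished.
        Task<int> doubled = f.ContinueWith(prev => prev.Result * 2);

        // Wait at most 2000 ms, as on the slide; Wait returns false on timeout.
        if (!doubled.Wait(2000))
            throw new TimeoutException("the task did not finish within 2 seconds");

        return doubled.Result;
    }

    public static void Main()
    {
        Console.WriteLine(Compute()); // prints 84
    }
}
```

Reading `Result` blocks until the task completes, so the explicit `Wait` is only needed when a timeout matters.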
Coordination Data Structures (1 of 3)
Concurrent Collections
•   BlockingCollection<T>
•   ConcurrentBag<T>
•   ConcurrentDictionary<TKey,TValue>
•   ConcurrentLinkedList<T>
•   ConcurrentQueue<T>
•   ConcurrentStack<T>
•   IProducerConsumerCollection<T>
•   Partitioner, Partitioner<T>,
    OrderablePartitioner<T>
[Figure: producers (P) add to a bounded collection and block if it is full; consumers (C) take from it and block if it is empty]
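The producer/consumer behaviour noted on this slide (block if full, block if empty) can be sketched with `BlockingCollection<T>`. The class name `ProducerConsumer`, the capacity of 4, and the summing consumer are assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ProducerConsumer
{
    // Produces 1..itemCount and returns the sum the consumer observed.
    public static int Run(int itemCount)
    {
        // Bounded to 4 items: Add blocks when full, the consumer blocks when empty.
        using (var queue = new BlockingCollection<int>(4))
        {
            Task producer = Task.Factory.StartNew(() =>
            {
                for (int i = 1; i <= itemCount; i++) queue.Add(i);
                queue.CompleteAdding(); // lets the consumer's loop end
            });

            Task<int> consumer = Task.Factory.StartNew<int>(() =>
            {
                int sum = 0;
                foreach (int item in queue.GetConsumingEnumerable()) sum += item;
                return sum;
            });

            Task.WaitAll(producer, consumer);
            return consumer.Result;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Run(100)); // prints 5050
    }
}
```

The bounded capacity applies back-pressure: a fast producer cannot run arbitrarily far ahead of a slow consumer.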
Coordination Data Structures (2 of 3)
Synchronization Primitives
•   Barrier
•   CountdownEvent
•   ManualResetEventSlim
•   SemaphoreSlim
•   SpinLock
•   SpinWait
[Figure: a Loop in which participants meet at a Barrier, followed by its postPhaseAction]
[Figure: CountdownEvent]
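Two of the primitives listed here, `Barrier` (with its post-phase action) and `CountdownEvent`, can be combined in a phased loop. This is a sketch under assumptions: the class name `PhasedWork` and the counting of completed phases are made up for the example, and dedicated threads are used so that every participant is guaranteed to be running when the others wait at the barrier.

```csharp
using System;
using System.Threading;

public static class PhasedWork
{
    // Runs 'participants' threads through 'phases' barrier phases
    // and returns how many times the post-phase action ran.
    public static int Run(int participants, int phases)
    {
        int completedPhases = 0;

        // The post-phase action runs once per phase, after every participant has arrived.
        using (var barrier = new Barrier(participants, b => completedPhases++))
        using (var done = new CountdownEvent(participants))
        {
            for (int p = 0; p < participants; p++)
            {
                var t = new Thread(() =>
                {
                    for (int phase = 0; phase < phases; phase++)
                        barrier.SignalAndWait(); // wait for the others before the next phase
                    done.Signal();               // this participant has finished all phases
                });
                t.Start();
            }

            done.Wait(); // blocks until every participant has signalled
        }
        return completedPhases;
    }

    public static void Main()
    {
        Console.WriteLine(Run(4, 3)); // prints 3
    }
}
```

The barrier keeps the participants in lockstep between phases, while the countdown event lets the caller wait for all of them to finish.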
Coordination Data Structures (3 of 3)
Initialization Primitives
•   Lazy<T>, LazyVariable<T>, LazyInitializer
•   ThreadLocal<T>

Cancellation Primitives
•   CancellationToken
•   CancellationTokenSource
•   ICancelableOperation
[Figure: MyMethod( ) owns a Cancellation Source; its Cancellation Token is passed across the Thread Boundary to Foo(…, CancellationToken ct), then to Bar(…, CancellationToken ct), and finally to ManualResetEventSlim.Wait( ct )]
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Parallel Programming

  • 1. Learning and Development Presents OPEN TALK SERIES A series of illuminating talks and interactions that open our minds to new ideas and concepts; that make us look for newer or better ways of doing what we did; or point us to exciting things we have never done before. A range of topics on Technology, Business, Fun and Life. Be part of the learning experience at Aditi. Join the talks. It's free. Free as in freedom at work, not free beer. Speak at these events. Or bring an expert/friend to talk. Mail LEAD with topic and availability.
  • 2. Parallel Programming Sundararajan Subramanian Aditi Technologies
  • 3. Introduction to Parallel Computing • The challenge – Provide the abstractions, programming paradigms, and algorithms needed to effectively design, implement, and maintain applications that exploit the parallelism provided by the underlying hardware in order to solve modern problems.
  • 4. Single-core CPU chip [Figure: a CPU chip containing the single core]
  • 5. Multi-core architectures [Figure: a multi-core CPU chip containing Core 1, Core 2, Core 3, and Core 4]
  • 6. Multi-core CPU chip • The cores fit on a single processor socket • Also called CMP (Chip Multi-Processor) [Figure: cores 1 to 4 side by side]
  • 7. The cores run in parallel [Figure: thread 1 to thread 4, each running on its own core (cores 1 to 4)]
  • 8. Within each core, threads are time-sliced (just like on a uniprocessor) [Figure: several threads assigned to each of cores 1 to 4]
  • 9. Instruction-level parallelism • Parallelism at the machine-instruction level • The processor can re-order, pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc. • Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
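  • 10. Instruction-level parallelism • for (int i = 0; i < 1000; i++) { a[0]++; a[0]++; } • for (int i = 0; i < 1000; i++) { a[0]++; a[1]++; } In the first loop both increments target a[0], so the second must wait for the first. In the second loop the two increments are independent, so the processor can overlap them.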
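The two loops on slide 10 can be timed with a sketch like the one below. The class and method names and the iteration count are assumptions chosen for illustration (the slide uses 1000 iterations, which is too few to time reliably). Any difference observed depends on the processor and the JIT compiler, so no particular timing is claimed.

```csharp
using System;
using System.Diagnostics;

static class IlpDemo
{
    // Both increments hit a[0], so each one depends on the previous result.
    public static int[] Dependent(int iterations)
    {
        var a = new int[2];
        for (int i = 0; i < iterations; i++) { a[0]++; a[0]++; }
        return a;
    }

    // The increments touch different elements, so the core may overlap them.
    public static int[] Independent(int iterations)
    {
        var a = new int[2];
        for (int i = 0; i < iterations; i++) { a[0]++; a[1]++; }
        return a;
    }

    static void Main()
    {
        const int n = 200000000; // hypothetical count, large enough to measure

        var sw = Stopwatch.StartNew();
        Dependent(n);
        Console.WriteLine("dependent:   {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        Independent(n);
        Console.WriteLine("independent: {0} ms", sw.ElapsedMilliseconds);
    }
}
```

Both methods perform the same number of increments. Only the data dependency between them differs, which is the point the slide makes.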
  • 11. Thread-level parallelism (TLP) • This is parallelism on a coarser scale • A server can serve each client in a separate thread (Web server, database server) • A computer game can do AI, graphics, and physics in three separate threads • Single-core superscalar processors cannot fully exploit TLP • Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
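As a minimal sketch of the game example on slide 11, the code below runs three stand-in subsystems on three threads. The class name, the method name, and the string results are hypothetical. A real game would do per-frame work where the strings are assigned.

```csharp
using System;
using System.Threading;

static class GameLoop
{
    public static string[] RunFrame()
    {
        var results = new string[3];

        // One thread per subsystem. Each writes only its own slot,
        // so no lock is needed.
        var threads = new[]
        {
            new Thread(() => results[0] = "ai"),
            new Thread(() => results[1] = "graphics"),
            new Thread(() => results[2] = "physics"),
        };

        foreach (var t in threads) t.Start();
        foreach (var t in threads) t.Join(); // wait for all three to finish

        return results;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", RunFrame())); // ai, graphics, physics
    }
}
```

On a multi-core machine the operating system is free to place the three threads on different cores. On a single core they would be time-sliced instead.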
  • 12. A technique complementary to multi-core: Simultaneous multithreading • Problem addressed: the processor pipeline can get stalled: – Waiting for the result of a long floating point (or integer) operation – Waiting for data to arrive from memory • Other execution units wait unused [Figure: processor block diagram with L1 D-Cache, D-TLB, Integer, Floating Point, L2 Cache and Control, Schedulers, Uop queues, Rename/Alloc, BTB, Trace Cache, uCode ROM, Decoder, Bus, BTB and I-TLB. Source: Intel]
  • 13. Simultaneous multithreading (SMT) • Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core • Weaving together multiple “threads” on the same core • Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units
  • 14. Without SMT, only a single thread can run at any given time [Figure: processor block diagram; Thread 1 (floating point) occupies the core]
  • 15. Without SMT, only a single thread can run at any given time [Figure: processor block diagram; Thread 2 (integer operation) occupies the core]
  • 16. SMT processor: both threads can run concurrently [Figure: processor block diagram; Thread 1 (floating point) and Thread 2 (integer operation) run at the same time]
  • 17. But: Can’t simultaneously use the same functional unit [Figure: processor block diagram; Thread 1 and Thread 2 both target the integer unit, marked IMPOSSIBLE] This scenario is impossible with SMT on a single core (assuming a single integer unit)
  • 18. SMT is not a “true” parallel processor • Enables better threading (e.g. up to 30%) • OS and applications perceive each simultaneous thread as a separate “virtual processor” • The chip has only a single copy of each resource • Compare to multi-core: each core has its own copy of resources
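The “virtual processor” view on slide 18 is visible from .NET code: `Environment.ProcessorCount` reports logical processors, so SMT threads are counted alongside physical cores. A small sketch, with a hypothetical class name:

```csharp
using System;

static class LogicalProcessors
{
    public static int Count()
    {
        // Logical processors as seen by the OS, not physical cores.
        return Environment.ProcessorCount;
    }

    static void Main()
    {
        // On a dual-core chip with 2-way SMT enabled this would typically be 4.
        Console.WriteLine("logical processors: " + Count());
    }
}
```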
  • 19. Multi-core: threads can run on separate cores [Figure: two processor block diagrams side by side; Thread 1 runs on one core and Thread 2 on the other]
  • 20. Multi-core: threads can run on separate cores [Figure: two processor block diagrams side by side; Thread 3 runs on one core and Thread 4 on the other]
  • 21. Combining Multi-core and SMT • Cores can be SMT-enabled (or not) • The different combinations: – Single-core, non-SMT: standard uniprocessor – Single-core, with SMT – Multi-core, non-SMT – Multi-core, with SMT: our fish machines • The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads • Intel calls them “hyper-threads”
  • 22. SMT Dual-core: all four threads can run concurrently [Figure: two processor block diagrams side by side; Thread 1 and Thread 2 run on one core, Thread 3 and Thread 4 on the other]
  • 23.
  • 24. Designs with private L2 caches [Figure, left: CORE0 and CORE1, each with its own L1 cache and L2 cache, connected to memory] Both L1 and L2 are private. Examples: AMD Opteron, AMD Athlon, Intel Pentium D [Figure, right: CORE0 and CORE1, each with its own L1, L2, and L3 cache, connected to memory] A design with L3 caches. Example: Intel Itanium 2
  • 25. Private vs shared caches? • Advantages/disadvantages?
  • 26. Private vs shared caches • Advantages of private: – They are closer to the core, so access is faster – Reduces contention • Advantages of shared: – Threads on different cores can share the same cache data – More cache space is available if a single (or a few) high-performance thread runs on the system
  • 27. Parallel Architectures • Use multiple – Datapaths – Memory units – Processing units
  • 28. Parallel Architectures • SIMD – Single instruction stream, multiple data stream [Figure: a single control unit driving several processing units, which are connected by an interconnect]
  • 29. Parallel Architectures • MIMD – Multiple instruction stream, multiple data stream [Figure: several combined processing/control units connected by an interconnect]
  • 30. Parallelism in Visual Studio 2010 [Figure: layered diagram. Integrated tooling: Parallel Debugger Toolwindows; Profiler Concurrency Analysis. Managed library: programming models (PLINQ, Task Parallel Library), data structures, and a concurrency runtime (ThreadPool, Task Scheduler, Resource Manager). Native library: programming models (Parallel Pattern Library, Agents Library), data structures, and a concurrency runtime (Task Scheduler, Resource Manager). Bottom layer: operating system threads. Key: Tools, Native Library, Managed Library]
  • 31. Multithreading today • Divide the total number of activities across n processors • With 2 processors, divide the work in two
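The manual approach on slide 31 might look like the following sketch: an array sum is split into equal chunks, one thread per chunk. The class name, the method name, and the choice of a sum as the workload are assumptions made for illustration.

```csharp
using System;
using System.Threading;

static class ManualSplit
{
    public static long Sum(int[] data, int workers)
    {
        var partial = new long[workers];
        var threads = new Thread[workers];
        int chunk = (data.Length + workers - 1) / workers; // ceiling division

        for (int w = 0; w < workers; w++)
        {
            int id = w; // fresh copy so each lambda captures its own index
            int start = Math.Min(id * chunk, data.Length);
            int end = Math.Min(start + chunk, data.Length);

            threads[id] = new Thread(() =>
            {
                long s = 0;
                for (int i = start; i < end; i++) s += data[i];
                partial[id] = s; // each thread writes only its own slot
            });
            threads[id].Start();
        }

        long total = 0;
        for (int w = 0; w < workers; w++)
        {
            threads[w].Join();
            total += partial[w];
        }
        return total;
    }

    static void Main()
    {
        var data = new int[1000];
        for (int i = 0; i < data.Length; i++) data[i] = i + 1;
        Console.WriteLine(Sum(data, Environment.ProcessorCount)); // 500500
    }
}
```

A fixed split like this works well when every chunk costs about the same. When chunks are uneven, some threads finish early and their cores sit idle.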
  • 32. User Mode Scheduler [Figure: CLR Thread Pool. The program thread places work on a single global queue, which feeds Worker Thread 1 … Worker Thread p]
  • 33. User Mode Scheduler For Tasks [Figure: CLR Thread Pool with work-stealing. The program thread places tasks on the global queue; each of Worker Thread 1 … Worker Thread p also has its own local queue. Task 1 to Task 6 are spread across the queues and workers]
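With the task scheduler on slide 33, the division of work can be left to the runtime. The sketch below uses `Parallel.For` from the Task Parallel Library with per-worker running sums. The class name and the sum workload are assumptions, and how iterations are actually distributed across worker threads is up to the runtime.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class PoolSplit
{
    public static long Sum(int[] data)
    {
        long total = 0;

        Parallel.For<long>(0, data.Length,
            () => 0L,                                  // initial per-worker sum
            (i, loopState, local) => local + data[i],  // body touches no shared state
            local => Interlocked.Add(ref total, local) // merge each worker's sum once
        );

        return total;
    }

    static void Main()
    {
        var data = new int[1000];
        for (int i = 0; i < data.Length; i++) data[i] = i + 1;
        Console.WriteLine(Sum(data)); // 500500
    }
}
```

The code names no processor count and no chunk size. The loop body only describes the work for one index.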
  • 34. DEMO
  • 35. Task-based Programming Summary • ThreadPool: ThreadPool.QueueUserWorkItem(…); • System.Threading.Tasks – Starting: Task.Factory.StartNew(…); – Parent/Child: var p = new Task(() => { var t = new Task(…); }); – Continue/Wait/Cancel: Task t = … Task p = t.ContinueWith(…); t.Wait(2000); t.Cancel(); – Tasks with results: Task<int> f = new Task<int>(() => C()); … int result = f.Result;
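The fragments on slide 35 can be assembled into a small runnable sketch. The class name, the method names, and the body of `C()` are hypothetical. Cancellation is shown with `CancellationTokenSource`, which is the mechanism in the released .NET 4 library; the `t.Cancel()` call on the slide appears to come from a pre-release build.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class TaskTour
{
    static int C() { return 21; } // hypothetical body; the slide only names C()

    // Task with a result, a continuation, and a timed wait.
    public static int ResultAndContinuation()
    {
        Task<int> f = Task.Factory.StartNew(() => C());
        Task<int> doubled = f.ContinueWith(prev => prev.Result * 2);
        if (!doubled.Wait(2000)) throw new TimeoutException(); // 2000 ms, as on the slide
        return doubled.Result;
    }

    // Parent/child: waiting on the parent also waits for attached children.
    public static int ParentWaitsForChild()
    {
        int childRuns = 0;
        Task parent = Task.Factory.StartNew(() =>
        {
            Task.Factory.StartNew(() => Interlocked.Increment(ref childRuns),
                                  TaskCreationOptions.AttachedToParent);
        });
        parent.Wait();
        return childRuns;
    }

    // A task whose token is already canceled never runs and ends up Canceled.
    public static bool CancelBeforeStart()
    {
        var cts = new CancellationTokenSource();
        cts.Cancel();
        Task t = Task.Factory.StartNew(() => { }, cts.Token);
        try { t.Wait(); }
        catch (AggregateException) { /* wraps a TaskCanceledException */ }
        return t.IsCanceled;
    }

    static void Main()
    {
        Console.WriteLine(ResultAndContinuation()); // 42
        Console.WriteLine(ParentWaitsForChild());   // 1
        Console.WriteLine(CancelBeforeStart());     // True
    }
}
```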
  • 36. Coordination Data Structures (1 of 3) Concurrent Collections • BlockingCollection<T> • ConcurrentBag<T> • ConcurrentDictionary<TKey,TValue> • ConcurrentLinkedList<T> • ConcurrentQueue<T> • ConcurrentStack<T> • IProducerConsumerCollection<T> • Partitioner, Partitioner<T>, OrderablePartitioner<T> [Figure: producers (P) and consumers (C) around a bounded collection; producers block if it is full, consumers block if it is empty]
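The block-if-full, block-if-empty behaviour pictured on slide 36 can be sketched with `BlockingCollection<T>`. The class name, the capacity of 4, and the integer workload are assumptions for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class ProducerConsumer
{
    public static int Run(int items)
    {
        // Bounded to 4 entries (hypothetical capacity): Add blocks while full.
        using (var queue = new BlockingCollection<int>(4))
        {
            Task producer = Task.Factory.StartNew(() =>
            {
                for (int i = 1; i <= items; i++) queue.Add(i);
                queue.CompleteAdding(); // lets the consumer loop terminate
            });

            int sum = 0;
            // The enumerator blocks while the collection is empty and ends
            // once adding is complete and the collection has drained.
            foreach (int item in queue.GetConsumingEnumerable()) sum += item;

            producer.Wait();
            return sum;
        }
    }

    static void Main()
    {
        Console.WriteLine(Run(100)); // 5050
    }
}
```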
  • 37. Coordination Data Structures (2 of 3) Synchronization Primitives • Barrier • CountdownEvent • ManualResetEventSlim • SemaphoreSlim • SpinLock • SpinWait [Figures: a loop synchronized by a Barrier with a postPhaseAction; a CountdownEvent]
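Two of the primitives on slide 37, `Barrier` with a post-phase action and `CountdownEvent`, can be sketched together as follows. The class name and the worker and phase counts are hypothetical, and the per-phase work is left as a comment.

```csharp
using System;
using System.Threading;

static class PhaseDemo
{
    public static int Run(int workers, int phases)
    {
        int phasesCompleted = 0;
        var threads = new Thread[workers];

        // The post-phase action runs once per phase, after all participants
        // have arrived and before any of them is released.
        using (var barrier = new Barrier(workers, b => phasesCompleted++))
        using (var done = new CountdownEvent(workers))
        {
            for (int w = 0; w < workers; w++)
            {
                threads[w] = new Thread(() =>
                {
                    for (int p = 0; p < phases; p++)
                    {
                        // ... per-phase work would go here ...
                        barrier.SignalAndWait(); // wait for the other workers
                    }
                    done.Signal(); // this worker has finished all phases
                });
                threads[w].Start();
            }

            done.Wait(); // blocks until the count reaches zero
            foreach (var t in threads) t.Join(); // threads exit before disposal
        }
        return phasesCompleted;
    }

    static void Main()
    {
        Console.WriteLine(Run(4, 3)); // 3
    }
}
```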
  • 38. Coordination Data Structures (3 of 3) Initialization Primitives • Lazy<T>, LazyVariable<T>, LazyInitializer • ThreadLocal<T> Cancellation Primitives • CancellationToken • CancellationTokenSource • ICancelableOperation [Figure: MyMethod( ) owns a Cancellation Source; its Cancellation Token crosses the thread boundary into Foo(…, CancellationToken ct), Bar(…, CancellationToken ct), and ManualResetEventSlim.Wait( ct )]
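The cancellation flow on slide 38 (a source on one side of the thread boundary, a token passed to `ManualResetEventSlim.Wait` on the other) might be sketched as below, together with a minimal `Lazy<T>` example. The class and method names are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class CancelDemo
{
    // Returns true if the waiting task observed the cancellation request.
    public static bool WaitIsCanceled()
    {
        using (var cts = new CancellationTokenSource())
        using (var gate = new ManualResetEventSlim(false))
        {
            CancellationToken ct = cts.Token; // the token crosses the thread boundary

            Task<bool> waiter = Task.Factory.StartNew(() =>
            {
                try
                {
                    gate.Wait(ct); // gate is never set; only cancellation ends the wait
                    return false;
                }
                catch (OperationCanceledException)
                {
                    return true;
                }
            });

            cts.Cancel();         // the source side requests cancellation
            return waiter.Result; // blocks until the waiter has finished
        }
    }

    // Lazy<T> runs its factory on first access to Value, and only once.
    public static int LazyFactoryCalls()
    {
        int calls = 0;
        var lazy = new Lazy<int>(() => { calls++; return 42; });
        int first = lazy.Value;
        int second = lazy.Value;
        return calls;
    }

    static void Main()
    {
        Console.WriteLine(WaitIsCanceled());   // True
        Console.WriteLine(LazyFactoryCalls()); // 1
    }
}
```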